Al-kaolin fractal aggregates terminal settling velocity modelling: an interpretable machine learning approach using symbolic regression

Machine Learning (ML) modelling of the terminal settling velocity of aggregates is a cutting-edge advance in the smart optimisation of water and wastewater treatment processes. However, the complex nature of aggregates often complicates prediction, leading to reliance on a combination of multiple statistical tools, loss of information, and interpretation challenges. This study presents a new selection of aggregate morphological features, obtained from non-intrusive image analysis, for modelling the terminal settling velocity (TSV) of fractal aggregates using interpretable ML. First, four ML models (Random Forest (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LGBM), and Artificial Neural Network (ANN)) were trained and validated. RF and XGBoost were initially used to select aggregate features that can be extracted directly from floc images, and the selection was validated by SHapley Additive exPlanations (SHAP) analysis. Subsequently, an interpretable symbolic ML model (PySR) was trained to generate expressions (tagged PySR_1n, PySR_2n, PySR_1u, and PySR_2u) for modelling TSV directly from images. The aggregates are physically characterised by large size and a high 3D fractal dimension (median Df = 2.82), with most aggregates showing no clumpiness. The median TSV was 2.65 × 10⁻² m/s, with a significant relationship with size (R² = 0.59). No ML model exceeded 92% accuracy when given physical features alone, whereas including Df improved the accuracy of all models (R²: 0.99 to 1.0). The PySR models recorded training and test accuracy between 98% and 99%, except for PySR_1u (test R² = 0.94). The less complex expressions (PySR_2n and PySR_2u) generalised well (R² = 0.98), while PySR_2u recorded the lowest test error (RMSE: 2.158 × 10⁻³; MAE: 1.532 × 10⁻³). Both PySR_1n and PySR_1u recorded higher validation accuracy. This study highlights that an optimised selection of aggregate features significantly improves TSV prediction accuracy. The PySR-derived expressions are useful for real-time detection of aggregate TSV from images and support efficient monitoring of the treatment process.

Abstract (English)

Machine Learning (ML) modelling of aggregate terminal settling velocity is a cutting-edge advancement in smart water and wastewater treatment process optimisation. However, the complex nature of aggregates often complicates prediction, leading to reliance on a combination of multiple statistical tools, loss of information, and interpretation challenges. This study presents a novel aggregate morphological feature selection from non-intrusive image analysis for modelling the terminal settling velocity (TSV) of fractal aggregates using interpretable ML. First, four ML models (Random Forest (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LGBM), and Artificial Neural Network (ANN)) were trained and validated. RF and XGBoost were initially used to select aggregate features extractable directly from floc images, and the selection was validated by SHapley Additive exPlanations (SHAP) analysis. Subsequently, an interpretable symbolic ML model (PySR) was trained to generate expressions (tagged PySR_1n, PySR_2n, PySR_1u, and PySR_2u) for direct TSV modelling from images. The aggregates are physically characterised by large size and a high 3D fractal dimension (median Df = 2.82), with most aggregates showing no clumpiness. The median TSV was 2.65 × 10⁻² m/s, with a significant relationship with size (R² = 0.59). No ML model exceeded 92% accuracy when trained on physical features alone, whereas including Df improved the accuracy of all models (R²: 0.99 to 1.0). The PySR models recorded training and test accuracy between 98% and 99%, except for PySR_1u (test R² = 0.94). The less complex expressions (PySR_2n and PySR_2u) generalised well (R² = 0.98), while PySR_2u recorded the lowest test error (RMSE: 2.158 × 10⁻³; MAE: 1.532 × 10⁻³). Both PySR_1n and PySR_1u recorded higher validation accuracy. This study highlights that an optimised aggregate feature selection significantly improves TSV prediction accuracy. The PySR-derived expressions are beneficial for real-time aggregate TSV detection from images and facilitate efficient treatment process monitoring.
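
The accuracy figures quoted in the abstract (R², RMSE, MAE) follow standard definitions. As a minimal sketch of how such scores could be computed for paired observed and predicted TSV values, the snippet below uses only the Python standard library. The function name `regression_metrics` and the sample velocities are hypothetical illustrations, not data or code from the dissertation.

```python
import math


def regression_metrics(observed, predicted):
    """Return (r2, rmse, mae) for paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("observed values must not all be identical")
    r2 = 1.0 - ss_res / ss_tot                           # coefficient of determination
    rmse = math.sqrt(ss_res / n)                         # root mean squared error
    mae = sum(abs(r) for r in residuals) / n             # mean absolute error
    return r2, rmse, mae


# Hypothetical settling velocities in m/s (illustrative only, not study data).
observed = [0.010, 0.020, 0.030, 0.040]
predicted = [0.012, 0.018, 0.031, 0.039]
r2, rmse, mae = regression_metrics(observed, predicted)
print(f"R2={r2:.3f} RMSE={rmse:.3e} MAE={mae:.3e}")
```

RMSE and MAE carry the units of the predicted quantity, so for TSV they would be in m/s, while R² is dimensionless.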

Keywords

Interpretable modelling, Non-intrusive imaging analysis, Pollutant removal, Sedimentation, Smart water treatment

Language

English

Citation

BANKOLE, Afolashade Racheal. Al-kaolin fractal aggregates terminal settling velocity modelling: an interpretable machine learning approach using symbolic regression. Advisor: Rodrigo Braga Moruzzi. 2025. 89 f. Dissertation (Master’s in Civil and Environmental Engineering) – Faculty of Engineering, São Paulo State University (UNESP), Bauru, 2025.