Soybean yield prediction by machine learning and climate

Soybean cultivation plays an important role in Mato Grosso do Sul and around the world. Given the inherent complexity of the agricultural system, this study aimed to develop climate-based yield prediction models using machine learning (ML), considering the most correlated meteorological variables for each condition, to test the best model with independent data, and to define zones of higher soybean yield in Mato Grosso do Sul in order to recommend better planting sites. The study was carried out in two stages. First, meteorological and soybean yield data obtained from 47 locations in the state of Mato Grosso do Sul were used to calibrate the ML algorithms. Second, the best algorithm was used to predict soybean yields throughout Mato Grosso do Sul. Daily meteorological data of air temperature (T, °C), precipitation (P, mm), global solar irradiance (Qg, MJ m−2 day−1), wind speed (u2, m s−1), net radiation (Rn, MJ m−2 day−1), and relative humidity (RH, %) from the NASA-POWER system for 2002 to 2021 were used. Reference evapotranspiration (ETo), by the standard FAO method, and the water balance (WB), following Thornthwaite and Mather (1955), were calculated for each collection point. The ML algorithms used in this stage were multiple linear regression (MLR), multilayer perceptron (MLP), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBOOSTING), and gradient boosting (GradBOOSTING). The ML models were calibrated using 70% of the data for training and 30% for validation. Algorithms were evaluated by accuracy, precision, and tendency. All analyses were performed using Python 3.8. Climate variables showed high spatial and seasonal variability throughout Mato Grosso do Sul (MS). Pearson’s univariate correlations between soybean yield and climate variables of the phenological period showed distinct relationships and different intensities.
For instance, soil water storage (ARM) showed negative, neutral, and positive correlations in October, November, and December, respectively. The calibrated ML algorithms showed high precision and accuracy in both calibration and testing. In calibration, the best model was XGBOOSTING, with MAPE, R2, RMSE, MSE, and MAE values of 1.84%, 0.95, 2.06%, 4.24%, and 0.921%, respectively. In the test, random forest (RF), extreme gradient boosting (XGBOOSTING), and gradient boosting (GradBOOSTING) were the most precise ML algorithms, with R2 values of 0.71, 0.62, and 0.62, respectively.
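
The evaluation steps named above (Pearson correlation, a 70/30 split, and the MAPE, R2, RMSE, MSE, and MAE statistics) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's own code: the function names `pearson_r`, `split_70_30`, and `regression_metrics` are hypothetical, the split is assumed to be a random shuffle, and R2 is assumed to be the coefficient of determination (the study may instead use the squared correlation coefficient).

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_70_30(rows, seed=0):
    """Shuffle rows and return (training, validation) in a 70/30 proportion.

    Assumption: a simple random split; the abstract does not say how
    the 70% and 30% subsets were selected.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def regression_metrics(observed, predicted):
    """Return MAPE (%), R2, RMSE, MSE, and MAE for paired values.

    Assumption: R2 is computed as 1 - SS_res / SS_tot.
    Observed values must be non-zero for MAPE to be defined.
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e / o) for e, o in zip(errors, observed)) / n
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - (mse * n) / ss_tot
    return {
        "MAPE": mape,
        "R2": r2,
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "MAE": mae,
    }
```

With these definitions, RMSE is the square root of MSE, which matches the relationship between the calibration values reported above (2.06 squared is approximately 4.24).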