PEDRO HENRIQUE MORAES ANANIAS

ANOMALY DETECTION, MONITORING, AND FORECASTING WITH APPLICATION IN CASE STUDIES RELATED TO POTENTIALLY HARMFUL ALGAL BLOOMS USING MULTITEMPORAL REMOTE SENSING DATA AND MACHINE LEARNING ALGORITHMS

Dissertation presented to the Institute of Science and Technology, São Paulo State University (Unesp), São José dos Campos Campus, and the National Center for Monitoring and Early Warning of Natural Disasters (Cemaden), in partial fulfilment of the requirements for the degree of MASTER in the Graduate Program in NATURAL DISASTERS. Area: Natural disasters. Research line: Instrumentation and data analysis.

Advisor: Prof. Dr. Rogério Galante Negri

São José dos Campos
2021

Instituto de Ciência e Tecnologia [internet]. Normalização de tese e dissertação [accessed 2021]. Available at http://www.ict.unesp.br/biblioteca/normalizacao

Graphic presentation and formatting follow the standards established by the Document Standardisation Service of the Technical Section for Reference, User Services and Documentation (STRAUD).

Catalogue record prepared by the Prof. Achille Bassi Library and the Technical Section of Informatics, ICMC/USP, with adaptations by STATI, STRAUD and DTI of ICT/UNESP. Renata Aparecida Couto Martins, CRB-8/8376.

Ananias, Pedro Henrique Moraes. Anomaly detection, monitoring, and forecasting with application in case studies related to potentially harmful algal blooms using multitemporal remote sensing data and Machine Learning algorithms / Pedro Henrique Moraes Ananias. São José dos Campos: [s.n.], 2021. 61 pp., ill. Dissertation (Master's), Graduate Program in Natural Disasters, São Paulo State University (Unesp), Institute of Science and Technology, São José dos Campos, 2021. Advisor: Rogério Galante Negri.
1. Hyperspectral Remote Sensing. 2. Algal Bloom. 3. Machine Learning. 4. Time Series Analysis. I. Negri, Rogério Galante, advisor. II. Universidade Estadual Paulista (Unesp), Instituto de Ciência e Tecnologia, São José dos Campos. III. Universidade Estadual Paulista 'Júlio de Mesquita Filho' - Unesp. IV. Universidade Estadual Paulista (Unesp). V. Title.

EXAMINING BOARD
Prof. Dr. Rogério Galante Negri (Advisor), São Paulo State University (Unesp), Institute of Science and Technology, São José dos Campos Campus
Prof. Dr. Tatiana Sussel Gonçalves Mendes, São Paulo State University (Unesp), Institute of Science and Technology, São José dos Campos Campus
Prof. Dr. Thales Sehn Körting, National Institute for Space Research (INPE), Earth Observation and Geoinformatics Division

São José dos Campos, 12 August 2021.

CONTENTS
LIST OF ABBREVIATIONS AND ACRONYMS
RESUMO
ABSTRACT
1 INTRODUCTION
2 ARTICLES
2.1 Article - “Anomalous behaviour detection using one-class support vector machine and remote sensing images: a case study of algal bloom occurrence in inland waters”
2.2 Article - “Forecasting algal bloom events in inland waters using remote sensing data and machine learning methods”
3 FINAL CONSIDERATIONS
REFERENCES

LIST OF ABBREVIATIONS AND ACRONYMS
ABD  Anomalous Behaviour Detection
ABF  Anomalous Behaviour Forecasting
AM  Aprendizado de Máquina (Machine Learning, ML)
AP  Aprendizado Profundo (Deep Learning, DL)
API  Application Programming Interface
FAI  Floating Algae Index
GEE  Google Earth Engine
HAB  Harmful Algal Blooms
IA  Inteligência Artificial (Artificial Intelligence, AI)
LSTM  Long Short-Term Memory
MNDWI  Modified Normalized Difference Water Index
NDVI  Normalized Difference Vegetation Index
OC-SVM  One-class Support Vector Machines
RF  Random Forest
SR  Sensoriamento Remoto (Remote Sensing, RS)
SABI  Surface Algae Bloom Index
SVM  Support Vector Machines

ANANIAS, P. H. M. Detecção, monitoramento e previsão de anomalias com aplicação em estudos de caso relacionados à floração de algas potencialmente tóxicas utilizando dados multitemporais de sensoriamento remoto e algoritmos de Aprendizado de Máquina. Dissertation. São José dos Campos: São Paulo State University (Unesp), Institute of Science and Technology; National Center for Monitoring and Early Warning of Natural Disasters (Cemaden), 2021.

RESUMO

Globally, severe environmental and climate changes have been observed in recent years.
In this scenario, the increase in toxic algal bloom events stands out; these events degrade water quality and, above all, threaten the health of living beings. Studies point to the need to monitor and forecast this phenomenon, which can be done on the basis of chlorophyll concentration. With this motivation, two new methods that support the detection and forecasting of algal blooms were developed using Machine Learning techniques and multitemporal series of Remote Sensing data. The first aims at the automatic detection of the phenomenon and applies image classification concepts through the One-class Support Vector Machine algorithm. The second method forecasts its emergence. To demonstrate the potential and feasibility of the proposals, the algorithms were applied in case studies in areas susceptible to toxic algal blooms.

Keywords: Remote Sensing. Algal Bloom. Machine Learning. Anomaly. Classification. Multitemporal series.

ANANIAS, P. H. M. Anomaly detection, monitoring, and forecasting with application in case studies related to potentially harmful algal blooms using multitemporal remote sensing data and Machine Learning algorithms. Dissertation. São José dos Campos: São Paulo State University (Unesp), Institute of Science and Technology; National Center for Monitoring and Early Warning of Natural Disasters (Cemaden), 2021.

ABSTRACT

Globally, severe environmental and climate changes have been observed in recent years. In this scenario, toxic algal blooms have also increased, degrading water quality and threatening the health of living beings. Studies point to the need to monitor and predict this phenomenon based on chlorophyll concentration.
With this motivation, two novel methods to detect and forecast algal blooms were developed using Machine Learning techniques and multitemporal series of Remote Sensing images. The first aims at the automatic detection of the phenomenon and applies image classification concepts through the One-class Support Vector Machine algorithm. The second method predicts its emergence. Different case studies were carried out to demonstrate the potential of the proposed methods.

Keywords: Remote Sensing. Algal Bloom. Machine Learning. Anomaly. Classification. Multitemporal series.

1 INTRODUCTION

In a global context, environmental changes such as deforestation, the greenhouse effect, desertification and biodiversity loss are frequently related to the growth of the human population and the consequent mismatch in the availability of resources (GRIMMOND, 2007; CHAWLA; KARTHIKEYAN; MISHRA, 2020). In this regard, Nagendra et al. (2013) stress the importance of monitoring these changes, despite the high cost, so that technical groups and public authorities have realistic parameters for decision-making. In Brazil, movements have organised since the 1960s to pressure authorities into adopting policies and agendas for preventing environmental problems (FOWLER; AGUIAR, 1993). In a recent study, Nobre et al. (2011) state that the federal and state governments have worked to expand the knowledge the country needs to respond to the effects of climate change across the various sectors of society.

Essential to life, water is one of the most demanded resources, whether for domestic, agricultural or industrial consumption (MISHRA; COULIBALY, 2009). According to Wells et al. (2015), one of the problems arising from this growing demand, and one responsible for severe damage to aquatic environments, is the proliferation of cyanobacteria (i.e. Harmful Algal Blooms, HAB).
Caused by human actions (CARVALHO et al., 2013) or climate change (CASTRO; MOSER, 2012), this phenomenon, also known as a bloom, proliferation or outbreak, is difficult to predict (WELLS et al., 2015) and should be the focus of studies that pursue its monitoring and support early warning of its occurrence.

In this regard, chlorophyll-a (Chl-a), an important photosynthetic pigment that gives plants, algae and cyanobacteria their greenish colour (MILENKOVIĆ et al., 2012), is central to discussions of water quality. Chawla, Karthikeyan and Mishra (2020) describe several parameters that determine water quality, including suspended sediments, turbidity, total phosphorus, dissolved organic content, temperature and Secchi disk depth. In the case of Chl-a, its development is directly related to sudden changes in surface temperature, wind speed, precipitation, water column stratification and the direction of water flow (WELLS et al., 2015; SHI et al., 2019; ROUSSO et al., 2020). By monitoring it, one can determine whether a reservoir or river, for example, is fit for human consumption (MATTHEWS; BERNARD; ROBERTSON, 2012; CARVALHO et al., 2013). However, as highlighted by Chawla, Karthikeyan and Mishra (2020), monitoring stations are lacking and are often neglected by public authorities.

Besides being recorded in many countries around the world (STUMPF et al., 2012; DUAN et al., 2015; YI et al., 2018; BINDING et al., 2018), algal blooms are frequently found in Brazil (CARVALHO et al., 2013) in places responsible for urban water supply (BEYRUTH, 2000; OGASHAWARA et al., 2014) and power generation (MATSUMURA-TUNDISI; TUNDISI, 2005; WATANABE et al., 2018), for example.
Besides affecting the ecosystem and threatening public health, they pose a risk to water treatment because they alter the odour and taste of the water (CARVALHO et al., 2013).

Broad Earth observation through Remote Sensing (RS) enables the development of solutions, especially for the quantitative, qualitative and temporal assessment of potentially toxic algal proliferation. Currently, a variety of satellites (e.g., Terra, Aqua, Landsat-8, Sentinel-2) serve as the basis for these algorithms. Although tools that automatically detect blooms and enable early warnings are scarce, the literature contains studies based on empirical data and spectral indices that focus on detecting HABs and estimating Chl-a concentration in aquatic environments (GOWER et al., 2005; MATTHEWS; BERNARD; ROBERTSON, 2012).

One example is the Normalized Difference Vegetation Index (NDVI) (ROUSE et al., 1973), initially developed for vegetation mapping and later used in algae-related studies (ZHAO, 2003). Another is the Floating Algae Index (FAI) (HU, 2009), which Oyama et al. (2015) use to monitor cyanobacteria levels in lakes in Japan. Further examples are the Modified Normalized Difference Water Index (MNDWI) (HAN-QIU, 2005), initially developed to detect clean water in water bodies (XU, 2006), and the Surface Algae Bloom Index (SABI) (ALAWADI, 2010), an index used exclusively for algae detection. In addition, studies have proposed mathematical approaches such as Empirical Orthogonal Function (EOF) models to estimate Chl-a concentration in Lake Taihu, China (QI et al., 2014) and the Medium Resolution Continental Shelf (MRCS) model (ALLEN et al., 2008) to detect HAB events in European coastal zones.
Given the growing quantity and quality of data provided by advanced imaging sensors (YUAN et al., 2020; MARTÍNEZ-ÁLVAREZ; BUI, 2020), the accuracy of algal bloom monitoring, estimation and forecasting can be improved. This gain can be obtained by applying Machine Learning (ML) and Deep Learning (DL) techniques, already applied in other fields of knowledge and increasingly adopted in Remote Sensing science. The use of Support Vector Machines (SVM) (CORTES; VAPNIK, 1995), for example, is addressed in studies by Sun, Li and Wang (2009) and Zhang, Huang and Wang (2020). Results obtained with the Random Forest (RF) algorithm, initially proposed by Breiman (2001), can be seen in the publications of Song et al. (2015) and Kupssinskü et al. (2020).

Supervised classification can be implemented when training data are available. Similarly, when the objective is to automate anomaly detection, the One-class Support Vector Machine (OC-SVM) model proposed by Schölkopf et al. (2000) emerges as an alternative. Muñoz-Marí et al. (2010) highlighted the challenges of using OC-SVM to classify poorly representative data, a recurring problem in this type of application. However, these studies show that modifications to the OC-SVM modelling process can increase its classification effectiveness and its anomaly detection potential.

Along the same lines, there are tools that use DL (CHO; CHOI; PARK, 2018; VILÁN et al., 2013). One example is the study by Lee et al. (2019), which implements multilayer neural networks to detect red tides (FLAGELLATES, 1979) on the Korean peninsula.
In another study, Barzegar, Aalami and Adamowski (2020) propose a hybrid model composed of a Convolutional Neural Network (CNN) (LECUN et al., 1989) and Long Short-Term Memory (LSTM) (HOCHREITER; SCHMIDHUBER, 1997) to forecast the amount of Chl-a in a lake in Greece.

As discussed above, most models that use remote sensing images are based on physical (i.e. empirical) models (SATHYENDRANATH et al., 2001; FRANKLIN et al., 2020) or are tailored to the environment where the field samples were collected, and are therefore not easily replicated elsewhere (YUAN et al., 2020). Tools that are universal and independent of in situ data are therefore needed.

Accordingly, using Machine Learning techniques and time series analysis of remote sensing images acquired through the Google Earth Engine Application Programming Interface (GEE API), two new algorithms were developed that enable fully automated detection and forecasting of anomalies in aquatic environments. The first, named Anomalous Behaviour Detection (ABD), detects algal proliferation in inland waters using the One-class Support Vector Machine (OC-SVM) model. Its performance was evaluated against in situ data from Lakes Erie (USA) and Taihu (China). The second, named Anomalous Behaviour Forecasting (ABF) and designed to forecast the occurrence of the phenomenon, was evaluated through three case studies covering Lakes Erie, Chilika (India) and Taihu.
This document is organised as follows: Chapter 2 presents the articles “Anomalous behaviour detection using one-class support vector machine and remote sensing images: a case study of algal bloom occurrence in inland waters” and “Forecasting algal bloom events in inland waters using remote sensing data and machine learning methods”; Chapter 3 presents the final considerations of this work.

2 ARTICLES

2.1 Article - “Anomalous behaviour detection using one-class support vector machine and remote sensing images: a case study of algal bloom occurrence in inland waters”

The following article was published on 17 March 2021 in the INTERNATIONAL JOURNAL OF DIGITAL EARTH. http://dx.doi.org/10.1080/17538947.2021.1907462

Anomalous behaviour detection using one-class support vector machine and remote sensing images: a case study of algal bloom occurrence in inland waters

Pedro Henrique Moraes Ananias a,b and Rogério Galante Negri a,b

a Graduate Program in Natural Disasters, UNESP/CEMADEN, São José dos Campos, São Paulo, Brazil; b Sciences Technology Institute, São Paulo State University (UNESP), São José dos Campos, São Paulo, Brazil

ABSTRACT
Algal blooms are a frequent subject in scientific discussions and are the focus of many recent studies, mainly due to their adverse effect on society. Given the lack of ground truth data and the need to develop tools for their detection and monitoring, this research proposes a novel method to automate detection. Concepts derived from multi-temporal image series processing, spectral indices and classification with One-class Support Vector Machine (OC-SVM) are used in this proposal. Imagery from multi-spectral sensors on Landsat-8 and MODIS was acquired through the Google Earth Engine API (GEE API). In order to evaluate our method, two bloom detection case studies (Lake Erie (USA) and Lake Taihu (China)) were performed. Comparisons were made with methods based on spectral index thresholds.
Also, to demonstrate the performance of the OC-SVM classifier compared to other machine learning methods, the proposal was adapted to be used with a Random Forest (RF) classifier, and its results were added to the analysis. In situ measurements show that the proposed method delivers highly accurate results compared to spectral index thresholding approaches. A drawback of the proposal is its higher computational cost. The application of the new method to a real-world bloom case is demonstrated.

ARTICLE HISTORY
Received 11 September 2020
Accepted 17 March 2021

KEYWORDS
Remote sensing; spectral indices; unsupervised classification; anomalies; algal bloom detection

CONTACT Pedro Henrique Moraes Ananias pedro.ananias@unesp.br
© 2021 Informa UK Limited, trading as Taylor & Francis Group
INTERNATIONAL JOURNAL OF DIGITAL EARTH 2021, VOL. 14, NO. 7, 921–942 https://doi.org/10.1080/17538947.2021.1907462

1. Introduction

Natural disasters such as floods, fires, deforestation, thawing, water pollution and earthquakes are becoming more widely reported (Marzuoli and Liu 2019). They are commonly related to anthropogenic causes, climate change, population growth, and overuse of land resources (Grimmond 2007; Chawla, Karthikeyan, and Mishra 2020). As described by the United Nations, I.S.f.D.R. (2015), a disaster is characterised by its ability to disorganise a society, causing human, material, economic or environmental losses. In this sense, and as part of a management system, it is strongly desirable that the authorities know about these events in advance, so that they can build early warning systems to protect society (He et al. 2019).

Regarding aquatic environments, the appearance of phytoplankton is often associated with economic losses and the consequent impact on society (Klemas 2011). In addition to the damage to the local ecosystem, some species are responsible for poisoning humans and animals (Ghatkar, Singh, and Shanmugam 2019). On the other hand, these algae comprise an important share of the aquatic food base and are accountable for fixing approximately 50% of the CO2 present in the atmosphere (Ghatkar, Singh, and Shanmugam 2019).

In the face of environmental degradation and recent climate changes, controlling algal blooms has become a challenge for humankind (Yi et al. 2018). Observation of the Earth through satellites and remote sensing images allows quantitative, qualitative, and temporal assessments of these blooms, making room for the development of algorithms to solve this problem.

An example covered in Binding et al. (2018) highlights the concern about cyanobacterial proliferation in Lake Winnipeg (Canada) in recent decades, as this region provides essential support for multiple ecosystems and for recreational and commercial activities, and is a source of hydroelectric power generation. The authors also discuss that the impact is a function of the exposure time, location, quantity, and composition of the proliferating species.

Chlorophyll-a (Chl-a) is a pigment necessary for photosynthesis and is found in all types of organisms that perform it, mainly algae (Wetzel 2001). Chl-a has been the central focus of recent discussions on its classification in inland waters (Watanabe et al. 2018). Song et al. (2015) pointed out several challenges in this process, including a lack of ground truth data and unbalanced samples. Another issue is the difficulty of obtaining field data, owing to high costs and remote locations (Klemas 2011).

Several studies have concerned the detection of potentially harmful algae using remote sensing. However, there have only been a few efforts to develop automatic detection and alert models for its presence (Song et al. 2015). One approach to this discussion is given by Sun, Li, and Wang (2009), who created a unified model to estimate the concentration of Chl-a in Lake Taihu (China) using support vector machines (SVM). In parallel, Song et al.
(2015) discussed the use of artificial intelligence methods to predict algal bloom in Monterey Bay (California, USA). In the study, given a lack of field data, the authors propose the use of the Random Forest model (RF) (Breiman 2001) and satellite images obtained from the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Medium Resolution Imaging Spectrometer (MERIS) to detect algal distribution along the bay coast.

Recent studies in the literature use spectral indices, and thresholds applied to them, for algal bloom detection. One is the Normalized Difference Vegetation Index (NDVI) (Rouse et al. 1973), which was initially developed for vegetation mapping, but is also used in studies related to algae. Another is the Floating Algae Index (FAI) (Hu 2009), which focuses on detecting floating algae. However, the use of these indices based on thresholds is not a permanent solution, as it does not consider the historical behaviour of the algae. Existing algorithms, such as the Maximum Chlorophyll Index (MCI) and Maximum Peak Height (MPH), were also considered. As pointed out by Shi et al. (2019), and following the scope of this research, the mentioned indices were not designed to be used with Landsat-like data to solve the addressed problem, due to the limited spectral information and low signal-to-noise ratios of such data. In this sense, Shi et al. (2019) also suggested that new studies should follow the direction of elucidating algal bloom behaviour over time, including the development of tools that enable its prediction.

Moreover, the implementation of supervised classification in remote sensing, including Chl-a concentration estimation, is possible when training data is available. Similarly, when the objective is to automate anomaly detection, the One-class Support Vector Machine (OC-SVM) model proposed by Schölkopf et al. (2000) emerges as a potential alternative. Muñoz-Marí et al.
(2010) highlighted the challenge of using OC-SVM to classify poorly representative data, which is a recurring problem in this type of application. However, these studies show that modifications in the modelling process of OC-SVM may increase its classification effectiveness and its anomaly detection potential.

In light of the problems caused by algae presence, as well as the importance of its continuous monitoring, we propose a new algorithm based on fully automated unsupervised anomaly detection approaches, in order to verify the occurrence of this phenomenon in inland waters by using remote sensing images acquired through the Google Earth Engine Application Programming Interface (GEE API). The formalisation and construction of this new proposal uses concepts related to spectral indices, image processing, and classification based on OC-SVM. In order to validate it, two case studies with in situ measurements are carried out on Lake Erie (USA) and a portion of Lake Taihu (China). A comparison of the proposal, and of a variant based on a Random Forest (RF) classifier, with the other methods mentioned earlier (Zhao 2003; Jia, Zhang, and Dong 2019) is also included in this study. Finally, to demonstrate the application of the proposed method in a real-world algal bloom case, we present an annual mapping (2014 to 2018) of this event in Lake Taihu, comparing the results with those obtained using the approach presented by Jia, Zhang, and Dong (2019).

This article is organised as follows: Section 2 briefly reviews data classification and spectral indices; Section 3 presents the proposed algorithm; Section 4 describes the study areas and data as well as the experiment design and the results; Section 4.1 discusses details about the study areas, images used in the algae identification process and selected reference samples for results evaluation; Section 5 presents and discusses the results obtained; finally, Section 6 concludes this paper.

2.
Theoretical rationale

2.1. Preliminary notations

Let $\mathcal{I}$ be the matrix representation of an image obtained by remote sensing. Each position of $\mathcal{I}$ is expressed in terms of $s$, defined over a regular grid $\mathcal{S} \subset \mathbb{N}^2$. Usually, $s$ is called a pixel and is related to a given geographic location and its sensor measurements, expressed by the vector $\mathbf{x} \in \mathcal{X}$. We call the vector space $\mathcal{X}$ the attribute space. Thus, according to these elements and notations, $\mathcal{I}(s) = \mathbf{x}$ states that the behaviour of $\mathcal{I}$ at position $s$ is expressed by the components of $\mathbf{x} = (x_1, x_2, \ldots, x_\ell)$.

Assuming a geographic region defined over the support $\mathcal{S}$ and observed at distinct times $t_i$, $i = 1, 2, \ldots, n$, it is convenient to adopt the notation $\mathcal{I}^{(i)}$. Thus, $\mathcal{I}^{(i)}$ and $\mathcal{I}^{(j)}$ express the behaviour of the targets contained in the same region, but at different times.

Due to the need to delimit certain portions of a given image, the use of masks becomes convenient. For a given image $\mathcal{I}^{(i)}$, we define a mask image $\mathcal{M}^{(i)}$ on the same support $\mathcal{S}$, whose positions are associated with binary values (i.e. 0 or 1). Under these conditions, $\mathcal{M}^{(i)}(s) = 0$ is used solely to occlude the values/vectors assigned to position $s$ in $\mathcal{I}^{(i)}$. On the other hand, $\mathcal{M}^{(i)}(s) = 1$ identifies positions $s$ of particular interest. Consequently, hiding attributes of $\mathcal{I}^{(i)}$ may be achieved through $\mathcal{I}^{(i)} \otimes \mathcal{M}^{(i)}$, where $\otimes$ represents the multiplication between $\mathcal{I}^{(i)}$ and $\mathcal{M}^{(i)}$ at every $s \in \mathcal{S}$. In turn, $\overline{\mathcal{M}}^{(i)}$ denotes the complement (i.e. $0 \leftrightarrow 1$) of the binary values in $\mathcal{M}^{(i)}$.

Many applications that use remote sensing images need to identify and distinguish the several types of targets that compose the observed landscape. For this purpose, image classification techniques are usually employed. The classification process consists of applying a function $F: \mathcal{X} \to \Omega$ to the attribute vector $\mathbf{x}$ of each $s \in \mathcal{S}$ in order to associate it with a class $\omega_j \in \Omega$, $j = 1, \ldots, c$.
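The masking operation and the pixel-wise application of a classification function described above can be illustrated with a small sketch. The Python snippet below is only a minimal illustration under stated assumptions: the toy 2 × 3 image, the mask values, the function names and the rule passed to `classify` are all hypothetical and are not part of the proposed method.

```python
# Hypothetical toy image: a 2 x 3 grid S with two attributes per pixel.
image = [
    [(0.10, 0.30), (0.20, 0.40), (0.05, 0.02)],
    [(0.15, 0.35), (0.07, 0.03), (0.06, 0.01)],
]

# Binary mask on the same grid: 1 marks positions of interest, 0 occludes.
mask = [
    [1, 1, 0],
    [1, 0, 0],
]


def apply_mask(img, msk):
    """Multiply every attribute vector by the mask value at the same position."""
    return [
        [tuple(m * value for value in x) for x, m in zip(img_row, msk_row)]
        for img_row, msk_row in zip(img, msk)
    ]


def complement(msk):
    """Return the complement of a binary mask (0 becomes 1 and 1 becomes 0)."""
    return [[1 - m for m in row] for row in msk]


def classify(img, rule):
    """Apply a classification function F to the attribute vector of each pixel."""
    return [[rule(x) for x in row] for row in img]


masked = apply_mask(image, mask)
masked_complement = apply_mask(image, complement(mask))

# Hypothetical rule F: class 2 when the second attribute exceeds the first.
labels = classify(image, lambda x: 2 if x[1] > x[0] else 1)
```

Here `masked` keeps the attribute vectors only where the mask equals 1, while `masked_complement` keeps the remaining positions, mirroring the roles of the mask and its complement in the text.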
The distinct classification methods proposed in the literature establish ways of modelling $F$. Usually, supervised and unsupervised approaches are adopted in remote sensing. In supervised learning, a set of labelled attribute vectors is used to learn how $F$ should assign an unlabelled vector $\mathbf{x}$ to a class of $\Omega$. Regarding the unsupervised approach, no labelled information is available, and $F$ is modelled through the structural organisation found in the set where it is applied.

Additionally, according to Congalton and Green (2009), the accuracy of mappings obtained from remote sensing data must be assessed before these mappings are analysed. Among the different measures available in the literature, the kappa coefficient and the global accuracy account for omissions and inclusions with respect to given classes and allow different mappings to be compared with each other (Bishop et al. 1977). The F1-Score (van Rijsbergen 1979) offers an alternative way of measuring the ratio between true and false detections (Yang and Liu 1999).

2.2. One-class support vector machines

Support Vector Machine (SVM) is a widespread classification technique, especially in remote sensing applications. A solid mathematical formulation, simple algorithmic architecture and high generalisation ability are some of the features that distinguish this method (Bruzzone and Persello 2009). Furthermore, as reported in Mountrakis, Im, and Ogole (2011), SVM has achieved similar or even higher accuracy than other classification methods.

Based on the original SVM conception, several variations have been proposed, for example, Laplacian (Gu and Feng 2013), transductive (Li et al. 2018), context-sensitive (Bruzzone and Persello 2009; Negri, Dutra, and Sant’Anna 2014) and One-class (Schölkopf et al. 2001) SVMs.
The last example, One-class SVM (OC-SVM), is an unsupervised approach motivated by quantile estimation (Glazer, Lindenbaum, and Markovitch 2013) and may be adopted for change detection applications (Negri et al. 2020). Conceptually, from a given set of unlabelled observations, the OC-SVM obtains a model able to classify elements as part of that set with a false-positive/negative occurrence rate $\nu$.

Formally, we may write a function $F: \mathcal{D} \subset \mathcal{X} \to \{+1, -1\}$, where the output $+1$ implies that the input element is in $\mathcal{D}$, and $-1$ otherwise. The classifier $F$ is given by:

$$F(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i K(\mathbf{x}, \mathbf{x}_i) - b \right), \tag{1}$$

where $b = \sum_{j=1}^{m} \alpha_j K(\mathbf{x}_i, \mathbf{x}_j)$ for any $\mathbf{x}_i \in \mathcal{D}$, and $K(\cdot, \cdot)$ is a kernel function. The coefficients $\alpha_i$, $i = 1, \ldots, m$, are obtained as the solution of the following optimisation problem:

$$\min_{\alpha_1, \ldots, \alpha_m} \sum_{i,j=1}^{m} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \begin{cases} \alpha_i \in \left[ 0, \dfrac{1}{\nu m} \right] \\ \sum_{i=1}^{m} \alpha_i = 1 \end{cases} \tag{2}$$

The most significant characteristic that distinguishes this method from the classic SVM approach lies in the optimisation problem expressed by Equation (2). It is worth noting that the OC-SVM is parameterised by $\nu \in [0, 1]$ and by additional parameters related to the adopted kernel function. For example, when the RBF kernel (i.e. $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \| \mathbf{x}_i - \mathbf{x}_j \|^2)$) is adopted, $\gamma \in (0, 1)$ should also be adjusted. More details on kernel functions can be found in Shawe-Taylor and Cristianini (2004).

As mentioned, the method is formalised on unsupervised learning and anomaly detection approaches, thus avoiding issues related to the difficulty of acquiring labelled data, since it only considers ‘regular samples’ in the learning process. However, it is essential to highlight that the method is sensitive to noise in the dataset used to model its decision rule: when such noise is present, it becomes liable to treat anomalies as regular occurrences.

2.3. Random forests

Random Forest (RF) is another classifier employed in recent remote sensing studies.
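Before turning to the details of RF, the OC-SVM decision rule of Section 2.2 can be illustrated with a minimal Python sketch of Equation (1) and of the constraints in Equation (2), using the RBF kernel. The sketch does not solve the optimisation problem: the coefficients `alpha` are hand-picked so that they merely satisfy the constraints, and the sample points, the values of `nu` and `gamma`, and all function names are hypothetical choices made for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def is_feasible(alpha, nu, tol=1e-9):
    """Check the constraints of Equation (2): 0 <= alpha_i <= 1/(nu*m), sum = 1."""
    m = len(alpha)
    upper = 1.0 / (nu * m)
    in_box = all(-tol <= a <= upper + tol for a in alpha)
    return in_box and abs(sum(alpha) - 1.0) <= tol


def offset(alpha, samples, i, gamma):
    """b = sum_j alpha_j K(x_i, x_j), evaluated at a chosen training sample x_i."""
    return sum(a * rbf_kernel(samples[i], xj, gamma) for a, xj in zip(alpha, samples))


def decide(x, alpha, samples, b, gamma):
    """Equation (1): +1 when x is treated as part of the set, -1 otherwise."""
    score = sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, samples)) - b
    return 1 if score >= 0 else -1


# Hypothetical 'regular' samples and hand-picked (not optimised) coefficients.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
alpha = [0.25, 0.25, 0.25, 0.25]
nu, gamma = 0.5, 0.5

feasible = is_feasible(alpha, nu)
b = offset(alpha, samples, 0, gamma)
inside = decide((0.05, 0.05), alpha, samples, b, gamma)   # point among the samples
outside = decide((3.0, 3.0), alpha, samples, b, gamma)    # point far from the samples
```

In practice, `alpha` and `b` would come from a solver for Equation (2) rather than being fixed by hand; the sketch only shows how the decision rule is evaluated once they are available.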
Introduced by Breiman (2001), its idea lies in the use of a forest of decision trees. Moreover, it exploits the ensemble learning technique in order to combine the outputs of multiple decision trees through a majority voting process, finally producing a classification decision.

924 P. H. M. ANANIAS AND R. G. NEGRI

Assuming a training set $\mathcal{D}$, $n_{est}$ sets with the same cardinality are replicated by bootstrap sampling. Subsequently, for each replica, a random attribute subset with a maximum of $n_{att}$ attributes is considered and used to build a decision tree. Parameters regarding such trees, like the maximum depth ($p_{depth}$), the minimum number of samples to split ($p_{split}$) and the minimum number of samples per leaf ($p_{leaf}$), should be tuned before the training process. More details and discussions regarding those parameters are found in Breiman (2001).

Concerning the classification stage, as previously mentioned, the vector $x$ is assigned to the class in $\Omega$ that receives the majority of the votes among all individual trees. According to Belgiu and Drăguţ (2016), RF is a computationally efficient algorithm that does not overfit the final decision rule.

2.4. Spectral indices thresholding for algal bloom detection

Spectral indices allow the extraction and analysis of remote sensing data. Given a feature of interest, a spectral index can assist in its identification. This approach, also called ‘spectral enhancement’, is essential, given the impossibility of modifying orbiting imaging sensors and the difficulty of obtaining field data with the same spatial and temporal resolution (Verstraete and Pinty 1996).

Generally, spectral indices are derived from algebraic operations on the attributes of $\mathcal{X}$ that characterise the behaviour of $x$ associated with every pixel $s$ of $I^{(i)}$. Examples of spectral indices proposed in the literature focus on the characterisation of vegetation (Rouse et al.
1973), components in water bodies (Gao 1996), constructed areas (Huang, Lu, and Zhang 2014), soil moisture (Khanna et al. 2007), and others (Xue and Su 2017).

Among the applications that benefit from spectral indices is the detection of algal blooms (Hu 2009; Mishra and Mishra 2012; Zhang et al. 2014; Houborg et al. 2016; Watanabe et al. 2018). This problem has received attention due to its potential harm to humans and the environment. Zhao (2003) proposes to distinguish algal concentration and bloom using thresholds on NDVI values. Admitting $I^{(i)}$, an image whose attributes express the behaviour of imaged targets at wavelength ranges of the electromagnetic spectrum, NDVI is characterised by the following function:

$$f_{NDVI}(x) = \frac{x_{NIR} - x_{Red}}{x_{NIR} + x_{Red}}, \tag{3}$$

where $x_{NIR}$ and $x_{Red}$ refer to the behaviour of the target at near-infra-red and red wavelengths, respectively. Without loss of generality, $f_{NDVI}$ is applied to every $s \in S$ since $I^{(i)}(s) = x$. Also, since $x = (x_1, x_2, \ldots, x_\ell)$, we have $NIR, Red \in \{1, 2, \ldots, \ell\}$. Consequently, the approach proposed in Zhao (2003) can be expressed in terms of the following decision rule:

$$F_{NDVI}(x) = \begin{cases} \omega_1, & f_{NDVI}(x) \le -0.15 \\ \omega_2, & \text{otherwise;} \end{cases} \tag{4}$$

where $\omega_1$ and $\omega_2$ are the classes referring to non-occurrence and occurrence of algae, respectively.

Other approaches can also be used to detect algal blooms, such as the study by Jia, Zhang, and Dong (2019). In this case, the following threshold on FAI (Hu 2009) values is set:

$$F_{FAI}(x) = \begin{cases} \omega_1, & f_{FAI}(x) \le -0.004 \\ \omega_2, & \text{otherwise;} \end{cases} \tag{5}$$

such that:

$$f_{FAI}(x) = x_{NIR} - \left( x_{Red} + (x_{SWIR} - x_{Red}) \times \frac{\lambda_{NIR} - \lambda_{Red}}{\lambda_{SWIR} - \lambda_{Red}} \right), \tag{6}$$

where $x_{SWIR}$ is the behaviour of the target at the short-wave infra-red wavelength, and $\lambda_{NIR}$, $\lambda_{Red}$ and $\lambda_{SWIR}$ are the respective band wavelengths. Also, $\lambda_{SWIR}$ ranges from 1608 to 1640 nm.

3. Automatic algal bloom detection

3.1.
Conceptual formalisation

Using a multi-temporal series of remote sensing images and based on the concepts of spectral indices and image classification, this work proposes an anomaly detection algorithm, with case studies applied to algal blooms in aquatic environments. The diagram in Figure 1 presents an overview of this proposal and its steps, which are discussed below.

Figure 1. General organisation of the proposed method. The ‘Modelling data selection’ section refers to the process of creating a reference set by automatically selecting between pixels associated with the possible occurrence or non-occurrence of algal bloom based on indices median variation.

Initially, the user defines the region of interest, which contains the water body; a spanning period for characterising a time series of images; a particular instant for which anomalies are to be detected; and the sensor, limited to Landsat-8 OLI (30 m spatial resolution) or MODIS MOD09GA.006 (500 m spatial resolution). For the purposes of this research, anomalies are interpreted as algal bloom occurrences. However, sudden changes in the target’s spectral response can also be detected as non-regularities.

Once the input parameters are set, a search for remote sensing images is performed. The Google Earth Engine (GEE) (Gorelick et al. 2017) platform is used to acquire data according to the established criteria. As a result, images derived from the chosen sensor, with spectral bands ranging from blue to short-wave infra-red, are returned. To reduce the computational cost, those images are stored in a caching area, allowing them to be reused by the algorithm. Also, for each image, an auxiliary product is provided, allowing the identification of elements such as water bodies and clouds. Further details regarding this subproduct are discussed in Section 3.2 (see support data topic).
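As a point of reference for the steps that follow, the index functions and threshold rules of Section 2.4 (Equations (3) to (6)) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names are hypothetical, FAI follows the baseline-subtraction definition of Hu (2009), and the band-centre wavelengths are assumed approximate values rather than figures taken from the text.

```python
import numpy as np

# Assumed approximate band centres in nm (illustrative; not given in the text).
LAMBDA_RED, LAMBDA_NIR, LAMBDA_SWIR = 655.0, 865.0, 1610.0


def ndvi(nir, red):
    """Equation (3): normalised difference of the NIR and red responses."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red)


def fai(nir, red, swir):
    """Equation (6): NIR response minus a red-to-SWIR linear baseline."""
    nir, red, swir = (np.asarray(b, dtype=float) for b in (nir, red, swir))
    slope = (LAMBDA_NIR - LAMBDA_RED) / (LAMBDA_SWIR - LAMBDA_RED)
    return nir - (red + (swir - red) * slope)


def bloom_by_ndvi(nir, red, threshold=-0.15):
    """Equation (4): True where the rule assigns the occurrence class."""
    return ndvi(nir, red) > threshold


def bloom_by_fai(nir, red, swir, threshold=-0.004):
    """Equation (5): True where the rule assigns the occurrence class."""
    return fai(nir, red, swir) > threshold


# Two made-up pixels: a dark NIR response, then a bright NIR response.
red = np.array([0.05, 0.05])
nir = np.array([0.03, 0.10])
swir = np.array([0.01, 0.01])
print(bloom_by_ndvi(nir, red), bloom_by_fai(nir, red, swir))
```

Pixels where the NIR and red responses sum to zero would need separate handling, since Equation (3) is undefined there.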
Subsequently, the query allows the construction of a time series of images $I^{(1)}, I^{(2)}, \ldots, I^{(n)}$. The series is temporally ordered, with $I^{(n)}$ referring to the most recent instant. The image $I^{(k)}$, with $k \approx n/2$, expresses the particular instant. The time difference between $I^{(1)}$ and $I^{(k)}$, or equivalently between $I^{(k)}$ and $I^{(n)}$, corresponds to the input spanning period.

Through the mentioned auxiliary product, masks are constructed to delimit the existing water body in the area of interest and to map locations associated with the occurrence of clouds and cloud shadows. Since clouds have a dynamic behaviour over time, it is necessary to define masks $M^{(1)}_{cl}, \ldots, M^{(n)}_{cl}$ for each image/instant $I^{(1)}, \ldots, I^{(n)}$. On the other hand, the regions that comprise water are defined by a single mask $M_{wb}$.

A common issue in applications involving the observation of the Earth’s surface via optical sensors is that clouds cause a lack of information about the targets. This work considers images with good visibility, i.e. those where cloud cover is less than or equal to 50%. More information on this procedure is given in Section 3.2 (cloud threshold topic). Under these conditions, the regions affected by this atmospheric phenomenon are ignored.

For the sake of mathematical and computational simplicity, it is convenient to determine the median image $\tilde{I}$, whose attribute vector associated with each position $s$ represents the median of each attribute, also in $s$, over the considered time series.

From the initial time series and its supporting data (i.e. masks and median image), each image is limited to the water body, and the target spectral response is then corrected by removing cloud occurrences and subtracting the median trend. This process is formally expressed by:

$$I^{(i)} := \left( I^{(i)} - \tilde{I} \right) \otimes \overline{M}^{(i)}_{cl} \otimes M_{wb}, \quad i = 1, \ldots, n, \tag{7}$$

where ‘$-$’ denotes the usual matrix subtraction and $\overline{M}^{(i)}_{cl}$ is the reverse of mask $M^{(i)}_{cl}$. Note that this process redefines each $I^{(i)}$.

With the multi-temporal series of images adjusted according to Equation (7), and given the central objective of this proposal, which is the detection of algal blooms, measures that favour the identification of this phenomenon are computed. These measures are the NDVI and FAI indices discussed in Section 2.4.

With the exception of the image $I^{(k)}$ (i.e. the instant of interest), the values of the considered spectral indices are observed for all pixels of the time series. From these index values, processed separately, it is possible to extract their average trend $\mu_z$ and their variation level, given in terms of the standard deviation $\sigma_z$, with $z \in \{NDVI, FAI\}$. In turn, the index values that fall outside the range $[\mu_z - \sigma_z, \mu_z + \sigma_z]$ may represent an anomaly occurrence according to $z$.

While the OC-SVM model uses the reference set $\mathcal{D}$ represented by the non-occurrence of algal bloom (i.e. $[\mu_z - \sigma_z, \mu_z + \sigma_z]$, for $z \in \{NDVI, FAI\}$), the RF algorithm is trained with the whole dataset, which also comprises the anomaly class (i.e. $]-1, \mu_z - \sigma_z[ \, \cup \, ]\mu_z + \sigma_z, +1[$, for $z \in \{NDVI, FAI\}$). However, before modelling, the set of observations defined as regular or anomalous is used as auxiliary information in the process of selecting the method’s parameters. This makes it possible to obtain a decision rule capable of identifying the occurrence of anomalies according to the behaviour of the considered spectral indices. It is important to note that the adjustment of the parameters associated with each method (i.e.
the $\nu$ and kernel function parameters related to OC-SVM, or the depth ($p_{depth}$), number of estimators ($n_{est}$), feature subset size ($n_{att}$) and minimum samples required to split ($p_{split}$) or to be a leaf node ($p_{leaf}$), regarding the RF method) is conducted automatically. Further details on this procedure are discussed in Section 3.2 (see model parameter tuning topic).

Lastly, $I^{(k)}$ is expressed in terms of the NDVI and FAI indices and then submitted for classification. As a result (i.e. output), maps in Tagged Image File Format (TIFF) and GeoJSON (Butler et al. 2016) formats partition the water body into anomaly and regular (i.e. no anomalous observation) classes.

In line with the purposes that drove this development, the introduced algorithm is denoted by the acronym ABD-OCSVM, where ABD stands for Anomalous Behaviour Detection. Accordingly, ABD-RF denotes the variant that uses the RF model.

3.2. Implementation details

The previous formalisation was simplified for clarity. The following information completes the proposal. The code of the proposed algorithm is freely available at https://github.com/pedroananias/abd.

Programming language and libraries: The Python 3.6 (van Rossum and Drake 2011) programming language was used to implement the proposed algorithm. Additionally, functions provided in the NumPy (Van Der Walt, Colbert, and Varoquaux 2011) and Pandas (McKinney 2010) libraries were used for data manipulation. The Scikit-Learn (Pedregosa et al. 2011) library was used in the OC-SVM and RF modelling and classification processes.

Google Earth Engine API: To access Landsat-8 OLI and MODIS MOD09GA.006 satellite imagery, the Google Earth Engine Application Programming Interface (API) (GEE-API 2019), compatible with the Python language, was used. This API allows automation of the image search process for a given period and region of interest.
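With the images retrieved, the preprocessing of Equation (7) and the selection of regular observations described in Section 3.1 could be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the published implementation: the function names and array layout are hypothetical, masked pixels are represented as NaN instead of being multiplied by a mask, and the median is taken over the whole series without excluding cloudy pixels.

```python
import numpy as np


def adjust_series(series, cloud_masks, water_mask):
    """Sketch of Equation (7): subtract the median image, keep clear water.

    series: float array (n, rows, cols, bands)
    cloud_masks: bool array (n, rows, cols), True where cloud or shadow
    water_mask: bool array (rows, cols), True where water
    """
    median_image = np.median(series, axis=0)
    keep = (~cloud_masks) & water_mask[None, :, :]
    adjusted = series - median_image[None, ...]
    # Assumption: discarded pixels become NaN so later statistics skip them.
    return np.where(keep[..., None], adjusted, np.nan)


def split_regular_anomalous(index_values):
    """Flag values inside and outside [mu - sigma, mu + sigma].

    The caller is expected to pass index values for every image except
    the instant of interest, as described in Section 3.1.
    """
    mu = np.nanmean(index_values)
    sigma = np.nanstd(index_values)
    valid = ~np.isnan(index_values)
    regular = valid & (index_values >= mu - sigma) & (index_values <= mu + sigma)
    return regular, valid & ~regular


# Toy series: 3 instants, 1 x 2 pixels, 1 attribute standing in for an index.
series = np.array([[[[1.0], [4.0]]], [[[2.0], [4.0]]], [[[6.0], [4.0]]]])
clouds = np.zeros((3, 1, 2), dtype=bool)
clouds[2, 0, 0] = True  # the last instant is cloudy over the first pixel
water = np.ones((1, 2), dtype=bool)

adjusted = adjust_series(series, clouds, water)
regular, anomalous = split_regular_anomalous(adjusted[..., 0])
print(int(regular.sum()), int(anomalous.sum()))
```

In this sketch the regular flags would feed the OC-SVM reference set, while both flags together would serve as labels for the RF variant.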
Also, it is worth mentioning that the images from the Landsat (Landsat 8 Surface Reflectance Code (LaSRC) algorithm (USGS 2017)) and MODIS (USGS 2020) sensors returned by this API have already been subjected to atmospheric correction, and that all processing and modelling steps after their extraction are performed outside the GEE platform.

Model parameter tuning: As discussed in Sections 2.2 and 2.3, the OC-SVM and RF methods require the adjustment of parameters inherent in their formalisation (e.g. the $\nu$ parameter or the number of trees in the forest). Given the high degree of freedom associated with selecting appropriate parameters for the modelling step, a Randomized Grid Search procedure was employed (Rastrigin 1963; Baba 1981; Bergstra and Bengio 2012). This procedure consists of testing a finite set of parameter settings and selecting the one that ensures the highest accuracy. For OC-SVM, the search space that determines the tested settings is given by $\nu \in \{10^{-1}, 10^{-2}, \ldots, 10^{-7}\}$ and an RBF kernel with $\gamma \in \{10^{-1}, 10^{-2}, \ldots, 10^{-7}\}$. Regarding the RF parameters, with the Gini impurity measure guiding the node splitting process, the tested settings are $n_{est} \in \{1, 5, \ldots, 250\}$, $p_{depth} \in \{1, 2, \ldots, 30\}$, $p_{split} \in \{2, 4, \ldots, 20\}$, $p_{leaf} \in \{2, 4, \ldots, 20\}$ and $n_{att} \in \{\sqrt{\dim(\mathcal{X})}, 100\%, 75\%, 50\%\}$. In addition, for each possible configuration, this procedure is replicated on the decision rule modelling dataset according to a 10-fold cross-validation process.

Decision rule modelling dataset size: As described above, the samples comprising the decision rule modelling dataset are defined as a function of the trend range $[\mu - \sigma, \mu + \sigma]$. Given the extremely high number of examples (i.e. attribute vectors associated with each pixel) in the time series employed, using all the available data to model the OC-SVM method becomes computationally prohibitive, thus motivating the use of randomly selected subsets.
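A minimal sketch of how random subsampling and the randomised parameter search might fit together for OC-SVM is given below, using Scikit-Learn's `ParameterSampler` and `OneClassSVM`. It is an assumption-laden illustration: the function name `tune_ocsvm`, the synthetic data and the single held-out evaluation are hypothetical simplifications, and the 10-fold cross-validation described above is omitted for brevity.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler
from sklearn.svm import OneClassSVM

# Search space quoted in the text for OC-SVM with an RBF kernel.
SEARCH_SPACE = {
    "nu": [10.0 ** -p for p in range(1, 8)],
    "gamma": [10.0 ** -p for p in range(1, 8)],
}


def tune_ocsvm(regular, anomalous, fraction=0.01, n_settings=10, seed=0):
    """Fit on a random subset of regular samples and keep the best setting.

    Accuracy is measured on the remaining regular samples plus the
    anomalous ones, which act only as auxiliary information.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(regular))
    n_train = max(2, int(round(fraction * len(regular))))
    train, held_out = regular[order[:n_train]], regular[order[n_train:]]
    x_val = np.vstack([held_out, anomalous])
    y_val = np.concatenate([np.ones(len(held_out)), -np.ones(len(anomalous))])

    best_params, best_acc = None, -1.0
    sampler = ParameterSampler(SEARCH_SPACE, n_iter=n_settings, random_state=seed)
    for params in sampler:
        model = OneClassSVM(kernel="rbf", **params).fit(train)
        acc = float(np.mean(model.predict(x_val) == y_val))
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc


# Synthetic two-index observations (made up for illustration only).
rng = np.random.default_rng(42)
regular = rng.normal(0.0, 0.05, size=(2000, 2))
anomalous = rng.normal(0.6, 0.05, size=(200, 2))
best_params, best_acc = tune_ocsvm(regular, anomalous)
print(best_params, round(best_acc, 3))
```

With `fraction=0.01`, only 20 of the 2000 synthetic regular samples are used for fitting, which mirrors the motivation for subsampling: the cost of fitting an SVM grows quickly with the number of training vectors.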
After preliminary tests with 10%, 5%, 2.5% and 1% of the available examples, it was found that using 1% yielded results similar to those of the higher proportions at a lower computational cost. The computational cost increases exponentially with the size of the decision rule modelling set, as a time series can reach more than ten million pixels in its processing dataset. Thus, the 1% ratio is employed in determining the modelling set.

Support data: Masks that delimit water bodies and cloud occurrence within the region of interest are used as supporting data in the developed algorithm. Bitwise operations, which operate directly on binary values (Cavanagh 2013), are used for this purpose; in particular, the AND operator is commonly used to create filters and masks. In our study, the masks are derived from an operation performed with the function bitwiseAnd on the object qa, which represents the quality assessment band of the working product. More information can be found in Gorelick et al. (2017). For Landsat-8 OLI, the quality band is known as ‘pixel_qa’, in which bits 3 and 5 indicate cloud/shadow. For the MODIS MOD09GA.006 sensor, cloud occurrence (bit 10) is extracted through a bitwise operation on the ‘state_1km’ subproduct. The extraction of water bodies must be done with auxiliary products, called ‘MOD44W.006 Terra Land Water Mask Derived from MODIS and SRTM Yearly Global 250m’ and ‘NASA GLCF Landsat Global Inland Water’, through the bands ‘water_mask’ and ‘water’, respectively. This is necessary to ensure that the images comprising the time series have the same number of pixels.

Cloud threshold: The occurrence of clouds in remote sensing data brings little or no information. During the implementation of the proposed algorithm, occurrence thresholds of 25%, 50% and 75% were considered to determine the best value for the selection and use of the images (i.e. good visibility).
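The bit tests described under ‘Support data’ and the visibility criterion can be illustrated outside GEE with plain NumPy. The sketch below is hypothetical in its function names and in one assumption worth stating clearly: the cloud fraction is computed over water pixels only, which the text does not specify.

```python
import numpy as np


def bit_is_set(qa, bit):
    """True where the given bit of an integer quality band is set."""
    return (np.asarray(qa).astype(np.int64) & (1 << bit)) != 0


def landsat8_cloud_mask(pixel_qa):
    """Cloud/shadow mask from bits 3 and 5 of the 'pixel_qa' band."""
    return bit_is_set(pixel_qa, 3) | bit_is_set(pixel_qa, 5)


def modis_cloud_mask(state_1km):
    """Cloud mask from bit 10 of the 'state_1km' band."""
    return bit_is_set(state_1km, 10)


def has_good_visibility(cloud_mask, water_mask, max_fraction=0.5):
    """Accept an image when at most max_fraction of water pixels are cloudy.

    Assumption: the fraction is taken over water pixels only.
    """
    water_pixels = int(water_mask.sum())
    if water_pixels == 0:
        return False
    cloudy = int((cloud_mask & water_mask).sum())
    return cloudy / water_pixels <= max_fraction


# Made-up quality values: 8 sets bit 3, 32 sets bit 5, 1024 sets bit 10.
pixel_qa = np.array([0, 8, 32, 40, 2])
state_1km = np.array([1024, 0])
print(landsat8_cloud_mask(pixel_qa), modis_cloud_mask(state_1km))
```

The 50% default corresponds to the threshold the study adopts; the 25% and 75% alternatives can be tried by changing `max_fraction`.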
A preliminary analysis of the NDVI index, based on the data in Section 4.1, showed a mean and variance equal to −0.42 and 0.11, respectively, when considering images with up to 25% of cloud occurrence. These values were −0.48 and 0.11 for 50%, and −0.57 and 0.11 for 75%. Since the difference between these values is not statistically significant at the 5% level, the intermediate percentage of 50% was adopted. However, it is worth emphasising that the sites affected by clouds are discarded from the database used in modelling the method.

Spanning period: The spanning period is an input parameter that determines the size of the time series used in the anomaly detection process. Note that the general trend that characterises the occurrence of anomalies (i.e. algal bloom) is determined as a function of the time series characteristics. Thus, the spanning period may directly influence the anomaly detection process. For a spanning period of $d$ days, the time series comprises the period from $d$ days before to $d$ days after the particular instant (i.e. the image $I^{(k)}$). In addition, the algorithm automatically seeks to allocate this period so that any excess at one end of the series is reallocated to the other, considering the selected value and the image $I^{(k)}$. Preliminary tests using 90, 180 and 365 days as inputs indicated that 180 is the most stable value in terms of accuracy.

4. Experiments

In order to evaluate the proposed method, two case studies were conducted on the detection and mapping of algal bloom occurrence in water bodies. The first was Lake Erie, located in Ohio, USA. A portion of Lake Taihu, located in China, was the second study area.

The assessment and validation of the proposed method were conducted by choosing dates and sensors according to the availability of ground truth data provided by the National Centers for Environmental Information of the USA (Burtner et al.
2019, 2020) and the National Earth System Science Data Center, National Science and Technology Infrastructure of China (Ma 2016, 2017). Correlating their spatial and temporal occurrences, the definition of anomalous samples was based on chlorophyll concentration > 5.6 µg/L, total phosphorus > 21.7 µg/L and Secchi depth < 3.0 m for Lake Erie (Chapra and Dobson 1981; Kasich, Taylor, and Butler 2014) and chlorophyll > 20 µg/L for Lake Taihu (Xu et al. 2015; Wang et al. 2019). Regular samples comprised data not covered by these rules in either case.

The results obtained were compared to the methods proposed in Gandhi et al. (2015) and Jia, Zhang, and Dong (2019), previously discussed in Section 2.4, which detect the occurrence of algal blooms based on thresholds for the NDVI (Equation (4)) and FAI (Equation (5)) spectral indices, respectively. Overall accuracy, the kappa coefficient of agreement (Bishop et al. 1977) and the F1-Score (van Rijsbergen 1979) were computed from reference samples identified on distinct dates. Also, hypothesis tests derived from the kappa measure were performed to evaluate the significance of the results. Furthermore, the analysed methods were compared on the basis of the false/true positive/negative occurrence ratios. The spanning period of 180 days, previously discussed in Section 3.2, was adopted in the following experiments. The same inputs and analyses were applied to a modified version of the proposal using an RF classifier, comparing its performance against the OC-SVM method.

The experiments were performed on a computer with an AMD Ryzen 9 3900X 12-core processor and 32 GB of RAM running the Ubuntu Linux 20.04 operating system.

4.1. Study areas and data

As mentioned at the beginning of Section 4, two case studies were conducted. Regions referring to parts of Lake Erie (Figure 2) and Lake Taihu (Figure 3) determined the respective study areas.
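For reference, the three accuracy measures used throughout the experiments can be computed from the true/false positive/negative counts as sketched below. The function name and the example counts are hypothetical and are not results from the study; ‘positive’ stands for the anomaly class.

```python
def detection_scores(tp, fp, tn, fn):
    """Return (overall accuracy, kappa, F1) from binary confusion counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one reference sample is required")
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, kappa, f1


# Made-up counts for 100 reference pixels.
accuracy, kappa, f1 = detection_scores(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={accuracy:.2f} kappa={kappa:.2f} F1={f1:.2f}")
# prints: accuracy=0.85 kappa=0.70 F1=0.84
```

Kappa discounts the agreement expected by chance, which is why it can be negative when a method performs worse than a chance assignment with the same class proportions.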
Lake Erie is one of the five Great Lakes of North America and has the eleventh largest surface area in the world; it is located between the states of Ohio, Michigan, Pennsylvania and New York and covers an area of 25,657 km². Lake Erie has a total volume of 484 km³ and average and maximum depths of approximately 19 m and 64 m, respectively (Bolsenga and Herdendorf 1993). Historically, severe algal blooms have been recorded since the 1970s, predominantly during the summer, peaking between August and September (Stumpf et al. 2012). Figure 2 also shows the water quality monitoring stations belonging to the Great Lakes Environmental Research Laboratory (GLERL), a multidisciplinary environmental research laboratory linked to the National Oceanic and Atmospheric Administration (NOAA).

The third-largest lake in China, Lake Taihu is a source of water for 40 million people. Recently, severe algal blooms affected more than 4 million residents and industries in the region (Duan et al. 2015). According to Gao et al. (2020), it has an area of 2,338 km² and an average depth of 1.9 m. According to the same authors, the rainy period begins in April and extends until July, while the lake faces its driest period between February and March.

According to the discussion in Section 3, the algal bloom detection process uses images obtained from the Landsat-8 OLI and MODIS MOD09GA.006 sensors. This detection process, however, uses only one sensor at a time. These images have spatial resolutions of 30 m and 500 m, respectively, and spectral bands ranging from blue to short-wave infra-red wavelengths.

After analysis of the good visibility images obtained by matching in situ data with the passage of the Landsat and MODIS satellites (i.e.
less than 50% of cloud presence), and motivated by the temporal variation in algal bloom occurrence over Lake Erie, the following images were selected: June 3rd, 2019 – moderate; July 1st, 2019 – very high; August 19th, 2019 – very high; September 24th, 2019 – high (MODIS MOD09GA.006); and September 21st, 2015 – moderate (Landsat-8 OLI).

Figure 4 illustrates the images by sensor of origin, highlighting the dates and the validation samples considered for the occurrence or non-occurrence of anomalies (that is, algae proliferation). The yellow arrows point to the exact locations where the samples were selected, based on the NOAA/GLERL monitoring station locations. Table 1 summarises the number of pixels and polygons selected, as well as the total number of training instants.

For the study area on Lake Taihu, two images matching the ground truth dataset and the passage of the Landsat satellite were acquired: September 13th, 2016 – moderate; and December 2nd, 2016 – moderate (Figure 5(c,d)). The yellow arrows point to the exact locations where the samples were selected, based on the mapping of the chlorophyll concentration for September 13th and December 2nd (Figure 5(a,b), respectively) provided by the National Earth System Science Data Center, National Science and Technology Infrastructure of China (Ma 2016, 2017). The thresholds used to determine whether blooms occurred are discussed in Section 4. Reference samples related to the occurrence or non-occurrence of anomalies, as well as the total number of training instants, are given in Table 2.

Figure 2. Spatial location of the study area 1 – Lake Erie, Ohio, USA.

Figure 3. Spatial location of the study area 2 – Lake Taihu, China.

Figure 4. Images in false-color composition (NIR, red and green bands) from the Lake Erie study area for each considered instant. Regular and anomaly samples are identified by cyan and magenta polygons, respectively.

Table 1. Summary of regular and anomaly reference samples related to Lake Erie for the considered instants.

| Pixels/Polygons | MODIS MOD09GA.006, Jun. 3rd, 2019 | MODIS MOD09GA.006, Jul. 1st, 2019 | MODIS MOD09GA.006, Aug. 19th, 2019 | MODIS MOD09GA.006, Sep. 24th, 2019 | Landsat-8 OLI, Sep. 21st, 2015 |
|---|---|---|---|---|---|
| Regular | 29/4 | 20/2 | 31/2 | 27/2 | 266/1 |
| Anomaly | 27/3 | 37/6 | 35/6 | 34/6 | 281/5 |
| Training instants | 105 | 104 | 99 | 104 | 24 |

Figure 5. Images in false-color composition (NIR, red and green bands) from the Lake Taihu study area for each considered instant. Regular and anomaly samples are identified by cyan and magenta polygons, respectively. At the top, Chl-a concentration maps from the National Earth System Science Data Center, National Science and Technology Infrastructure of China are shown.

Table 2. Summary of regular and anomaly reference samples related to Lake Taihu for the considered instants.

| Pixels/Polygons | September 13th, 2016 | December 2nd, 2016 |
|---|---|---|
| Regular | 915/3 | 1452/1 |
| Anomaly | 922/2 | 1424/2 |
| Training instants | 10 | 10 |

5. Results and discussion

Taking the case study of Lake Erie as a starting point, anomaly mapping was obtained for each considered date. Figure 6 shows the accuracy values associated with each tested method, expressed in terms of overall accuracy, kappa coefficient and F1-Score, and computed from the reference samples.

Regarding the first image, from June 3rd, 2019, the proposal outperforms its competitors. Concerning the second date (July 1st, 2019), the proposal again demonstrates higher accuracy, delivering the best and second-best results with the OC-SVM and RF algorithms, respectively. Similar to the first, the image from August 19th, 2019 presents the best result for the proposal. However, NDVI shows performance similar to that of the OC-SVM variation, producing the best result among the analysed indices.
The last image, acquired on September 24th, 2019 with the MODIS MOD09GA.006 sensor, also demonstrates the superiority of the proposal’s classification, with OC-SVM and RF showing the best and second-best results, respectively. On this date, the performance of the competitors was again low, with FAI exhibiting the worst metrics. For the Landsat-8 OLI sensor, the single available image for this study area, acquired on September 21st, 2015, shows the best performance with the ABD-OCSVM variant.

Given the above results, the proposal using the OC-SVM algorithm showed the most consistent performance among the competitors. In order to accept or reject the superiority of its performance, hypothesis tests comparing kappa values were performed (Congalton and Green 2009). It was observed that the proposal is statistically superior at the 1% significance level (i.e. 99% confidence), except for the image from August 19th, where NDVI, ABD-OCSVM and ABD-RF are statistically equivalent. ABD-OCSVM and ABD-RF are likewise statistically equivalent for the image from September 24th.

Figure 6. Accuracy of the analysed methods related to the Lake Erie study area at distinct instants. Error bars denote the respective kappa standard deviation value.

Figure 7 shows the True/False-Positive/Negative ratio graph for the first study area, where the acronyms FN, FP, TN and TP stand for false-negative, false-positive, true-negative and true-positive, respectively. The term ‘positive’ refers to the occurrence of anomalies and ‘negative’ refers to regular pixels/regions.

Figure 7. True(T)/False(F)-Positive(P)/Negative(N) ratio graph for the Lake Erie detection results.

First, it should be noted that, among the competitors, the ABD-OCSVM proposal delivers the lowest FN rate for all compared dates. The opposite is observed with the NDVI and FAI indices, which present high FN rates throughout the evaluated period. Although the proposal using the RF algorithm presents TP results similar to those of the ABD-OCSVM variation for most of the evaluated dates, it shows a decrease in the detection of TN and an increase in FP on June 3rd, 2019. These results indicate that the competitors overestimate the occurrence of regular pixels in the tested samples, presenting high FN values and ignoring the occurrence of algae in all images. As shown in the previous analysis, ABD-OCSVM has the most consistent detection.

Figure 8. Mapping results obtained by the analysed methods for the Lake Erie study area related to the MODIS MOD09GA.006 (a)–(t) and Landsat-8 OLI (u)–(y) sensors. Regular, anomaly, cloud and land/absence of data regions are denoted in cyan, magenta, grey and white, respectively.

Figure 9. Images considering the performance obtained by the tested methods for the Lake Taihu study area: (a) kappa coefficient/Accuracy/F1-Score of the analysed methods at distinct instants. Error bars denote the respective kappa standard deviation value. (b) True(T)/False(F)-Positive(P)/Negative(N) ratio graph.

Figure 8(a,f,k,p,u) represent the areas of interest and the polygons of in situ samples for the evaluated dates (i.e. yellow arrows), while Figure 8(b–e,g–j,l–o,q–t,v–y) refer to the detection results delivered by each of the tested methods.

Regarding Figure 8(b,c), which show the NDVI and FAI results for the first MODIS date, a comparison with the in situ samples (Figure 8(a)) shows an overestimation of FN by both approaches, which is corroborated by Figure 7. The ABD-RF model shows the opposite behaviour, as it overestimates FP. Regarding the image from July 1st, 2019, ABD-OCSVM shows the best performance, followed by ABD-RF. For the August 19th, 2019 image, the FAI index delivered high FP compared with NDVI, ABD-OCSVM and ABD-RF, the last three being statistically equivalent.
Figure 8(p–t) (last MODIS image) show the high FN delivered by the spectral indices and corroborate the equivalence and superiority of the ABD-OCSVM and ABD-RF variations. Finally, Figure 8(u–y) represent the only image acquired for Lake Erie using the Landsat-8 OLI sensor. It demonstrates the superiority of the proposal, whose variations show lower FN than the tested indices.

Regarding the second study area, Figure 9(a) makes evident the superior performance of the proposal variants in relation to the NDVI and FAI thresholds for both dates (September 13th and December 2nd, 2016). This analysis is confirmed by Figure 9(b), where the image acquired on September 13th, 2016 shows low FN and FP for the ABD-RF model, in contrast with the high FN proportion produced by the spectral indices. Likewise, the ABD-OCSVM model showed low FN values for December 2nd, 2016. In this image, however, NDVI and FAI obtained results equivalent to those of the proposal’s RF algorithm.

Visual analysis of the second area’s results is possible through Figure 10(a–j). Figure 10(a,f) represent the areas of interest selected according to ground truth data (i.e. yellow arrows). Figure 10(b,c) confirm that the NDVI and FAI indices do not detect the anomalous regions found in this image, validating the results from the True/False-Positive/Negative graph. Figure 10(i), however, stands out: it shows the detection of a vast anomalous region by the ABD-OCSVM variant, corroborated by the samples in Figure 10(f).

Figure 10. Mapping results obtained by the analysed methods for the Lake Taihu study area. Regular, anomaly, cloud and land/absence of data regions are denoted in cyan, magenta, grey and white, respectively.

Lastly, hypothesis tests derived from the kappa coefficient confirm the superiority of the ABD-OCSVM and ABD-RF models, which are statistically superior to the spectral indices at a significance level of 1%. However, the ABD-RF variant is superior to the ABD-OCSVM model for the first date (September 13th, 2016), and the opposite is true for the second date, where ABD-OCSVM presents a better result than ABD-RF.

Although the proposal has considerably higher computational costs (as shown in Figure 11), it has the advantage of considering the temporal behaviour of anomalies and of being an unsupervised approach, where no manual selection of training data is needed. The proposal creates datasets comprising millions of pixels, mainly in the parameter optimisation and decision rule modelling. These sets are filtered and obtained automatically, reaching 20 or even 100 instants (MODIS MOD09GA.006 sensor), depending on the spanning period selected. To minimise the use of network resources and consequently reduce the computational cost, a caching system was adopted, storing the images locally according to the area of interest. During the classifier construction, several images are processed; in this step, acquisition time is reduced when the images are already on disk.

Figure 11. Run-time of analysed methods for both study areas.

Figure 12. Annual mapping of cyanobacteria bloom occurrence in Lake Taihu from 2014 to 2018 using multi-temporal series of images extracted from the GEE API and the MODIS MOD09GA.006 sensor: (a)–(e) algal bloom coverage with water-leaving reflectance correction (Shenglei et al. 2016; Wang et al. 2018) and FAI > −0.004 (threshold proposed by Jia, Zhang, and Dong 2019); (f)–(j) algal bloom coverage using the ABD-OCSVM algorithm. Brackets show the total instants mapped for each year.

5.1. Real-world application: 5-year mapping of the spatial occurrence of cyanobacterial blooms in Lake Taihu

In their study, Jia, Zhang, and Dong (2019) proposed a workflow for annually mapping the occurrence of cyanobacteria in Lake Taihu, in addition to the FAI > −0.004 threshold (Section 2.4).
With results obtained using the algorithm implemented on the GEE platform (Jia 2019), the authors discuss the spatio-temporal patterns of the phenomenon and its annual characteristics, as well as the effects of environmental factors on the proliferation of these bacteria between 2000 and 2018. They also highlight the application of the water-leaving reflectance correction, recommended for Chinese inland waters larger than 25 km² (Shenglei et al. 2016; Wang et al. 2018).

To demonstrate the application of the ABD-OCSVM model in a real case study related to algae problems, the algorithm proposed by Jia, Zhang, and Dong (2019) was replicated, resulting in an annual mapping of algal blooms in Lake Taihu between 2014 and 2018 (Figure 12(a–e)). The results of the proposal for the same area and period are shown in Figure 12(f–j). Besides the higher concentration found by the ABD-OCSVM mapping compared with the algorithm proposed by Jia, Zhang, and Dong (2019), algal blooms occur in the central region of the lake, which is not widely observed in the first model. However, both algorithms are spatially compatible, especially in the most extreme regions, where the occurrence of the phenomenon is close to 100%.

6. Conclusions

In summary, given the adverse effects of algal blooms, whether on health (i.e. poisoning, water quality) or through economic losses, it is essential to develop studies aimed at monitoring them. Given this motivation, an algorithm based on anomaly detection approaches to detect algal bloom occurrence in inland waters, supported by spectral indices and the OC-SVM model, was proposed. To validate it, two case studies were conducted with in situ measurements. Comparisons with similar proposals based on NDVI and FAI index thresholding were also included.
The proposal was also modified to be used with an RF classifier, and its results were compared against the OC-SVM approach. Finally, to demonstrate the proposed algorithm in a real-world application and compare it with other studies, a 5-year mapping of algal bloom occurrence in Lake Taihu was presented, showing its spatial behaviour over time.

The results showed that the proposed method using the OC-SVM algorithm has higher accuracy levels than the analysed competitors, with average kappa coefficient, overall accuracy and F1-Score of 68%, 87% and 86%, respectively, against 47%, 78% and 75% for ABD-RF; 28%, 67% and 57% for NDVI; and −4%, 53% and 46% for FAI. Furthermore, it has been observed that the tested methods overestimate the non-occurrence of algal blooms (i.e. regular regions).

On the other hand, it should be mentioned that the proposal is highly dependent on network connection and on images with good visibility (i.e. less than 50% cloud cover) in the training set, and requires a higher computational effort. Moreover, given the lack of ground truth data, it was not possible to validate the behaviour of the proposal in regions with snow or ice. As mentioned before, sudden changes in the target's spectral response can be detected as anomalies. Conversely, when the phenomenon's occurrence is too subtle, it may be detected as regular. Additionally, a positive or negative performance variation is expected at different instants, both for the proposal and for the evaluated indices.

Future perspectives for this study include (i) the development of alternative strategies to reduce computational cost; (ii) the evaluation of other spectral indices in the modelling process in order to extend the application of the proposed algorithm; and (iii) the implementation of moving-average-based strategies in the modelling process.
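The agreement measures quoted in these conclusions (kappa coefficient, overall accuracy and F1-Score) can all be derived from a binary confusion matrix. The following is a minimal sketch only: it assumes the "anomaly" class is treated as the positive class, and the function name and the counts in the example are hypothetical, not the article's data. Whether the article averages F1 over classes is not stated here, so the single-class F1 below is an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (overall accuracy, F1-score, Cohen's kappa) from confusion-matrix counts.

    Assumes a binary problem with "anomaly" as the positive class and
    non-degenerate counts (no zero denominators).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Agreement expected by chance, from the row and column marginals.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return accuracy, f1, kappa


# Hypothetical counts, for illustration only.
acc, f1, kappa = binary_metrics(tp=40, fp=10, fn=5, tn=45)
print(f"accuracy={acc:.3f}  F1={f1:.3f}  kappa={kappa:.3f}")
```

Because kappa subtracts the agreement expected by chance, it can be negative when a method agrees with the reference less often than chance would, which is consistent with the −4% reported for FAI.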
Acknowledgments

The authors would like to thank the National Centers for Environmental Information of the USA (https://www.ncei.noaa.gov) and the National Earth System Science Data Center, National Science and Technology Infrastructure of China (http://www.geodata.cn) for providing data support for Lake Erie and Lake Taihu, respectively. Finally, they would like to thank the anonymous reviewers for their suggestions and comments, which greatly improved this work.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Funding

The authors thank Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) (grant 2018/01033-3) for their financial support of this research.

ORCID

Pedro Henrique Moraes Ananias http://orcid.org/0000-0002-0924-1236
Rogério Galante Negri http://orcid.org/0000-0002-4808-2362

References

Baba, N. 1981. “Convergence of a Random Optimization Method for Constrained Optimization Problems.” Journal of Optimization Theory and Applications 33 (4): 451–461.
Belgiu, M., and L. Drăguţ. 2016. “Random Forest in Remote Sensing: A Review of Applications and Future Directions.” ISPRS Journal of Photogrammetry and Remote Sensing 114: 24–31.
Bergstra, J., and Y. Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” The Journal of Machine Learning Research 13 (1): 281–305.
Binding, C., T. Greenberg, G. McCullough, S. Watson, and E. Page. 2018. “An Analysis of Satellite-Derived Chlorophyll and Algal Bloom Indices on Lake Winnipeg.” Journal of Great Lakes Research 44 (3): 436–446.
Bishop, Y. M., S. E. Fienberg, P. W. Holland, R. J. Light, and F. Mosteller. 1977. “Book Review: Discrete Multivariate Analysis: Theory and Practice.” Applied Psychological Measurement 1 (2): 297–306.
Bolsenga, S. J., and C. E. Herdendorf. 1993. Lake Erie and Lake St. Clair Handbook. Detroit, MI: Wayne State University Press.
Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Bruzzone, L., and C. Persello. 2009.
“A Novel Context-Sensitive Semisupervised SVM Classifier Robust to Mislabeled Training Samples.” IEEE Transactions on Geoscience and Remote Sensing 47 (7): 2142–2154.
Burtner, A., C. Kitchens, D. Fyffe, C. Godwin, T. Johengen, D. Stuart, R. Errera, D. Palladino, D. Fanslow, and D. Gossiaux. 2020. “Physical, Chemical, and Biological Water Quality Data Collected From a Small Boat in Western Lake Erie, Great Lakes From 2019-04-30 to 2019-10-07 (NCEI Accession 0209116)” (dataset). NOAA National Centers for Environmental Information. Accessed June 13, 2020. https://accession.nodc.noaa.gov/0209116.
Burtner, A., D. Palladino, C. Kitchens, D. Fyffe, T. Johengen, D. Stuart, D. Fanslow, and D. Gossiaux. 2019. “Physical, Chemical, and Biological Water Quality Data Collected From a Small Boat in Western Lake Erie, Great Lakes From 2012-05-15 to 2018-10-09 (NCEI Accession 0187718)” (dataset). NOAA National Centers for Environmental Information. Accessed June 13, 2020. https://accession.nodc.noaa.gov/0187718.
Butler, H., M. Daly, A. Doyle, S. Gillies, S. Hagen, and T. Schaub. 2016. The GeoJSON Format. Internet Engineering Task Force (IETF).
Cavanagh, J. 2013. X86 Assembly Language and C Fundamentals. Boca Raton, FL: CRC Press.
Chapra, S. C., and H. F. Dobson. 1981. “Quantification of the Lake Trophic Typologies of Naumann (Surface Quality) and Thienemann (Oxygen) with Special Reference to the Great Lakes.” Journal of Great Lakes Research 7 (2): 182–193.
Chawla, I., L. Karthikeyan, and A. K. Mishra. 2020. “A Review of Remote Sensing Applications for Water Security: Quantity, Quality, and Extremes.” Journal of Hydrology 585: Article ID: 124826.
Congalton, R. G., and K. Green. 2009. Assessing the Accuracy of Remotely Sensed Data. Boca Raton, FL: CRC Press.
Duan, H., S. A. Loiselle, L. Zhu, L. Feng, Y. Zhang, and R. Ma. 2015. “Distribution and Incidence of Algal Blooms in Lake Taihu.” Aquatic Sciences 77 (1): 9–16.
Gandhi, G. M., S.
Parthiban, N. Thummalu, and A. Christy. 2015. “NDVI: Vegetation Change Detection Using Remote Sensing and GIS-a Case Study of Vellore District.” Procedia Computer Science 57: 1199–1210.
Gao, B.-C. 1996. “NDWI–A Normalized Difference Water Index for Remote Sensing of Vegetation Liquid Water From Space.” Remote Sensing of Environment 58 (3): 257–266.
Gao, Y., G. Zhu, H. W. Paerl, B. Qin, J. Yu, and Y. Song. 2020. “A Study of Bioavailable Phosphorus in the Inflowing Rivers of Lake Taihu, China.” Aquatic Sciences 82 (1): 1.
GEE-API. 2019. “Google Earth Engine API.” Accessed 29 October 2019. https://github.com/google/earthengine-api.
Ghatkar, J. G., R. K. Singh, and P. Shanmugam. 2019. “Classification of Algal Bloom Species From Remote Sensing Data Using An Extreme Gradient Boosted Decision Tree Model.” International Journal of Remote Sensing 40 (24): 9412–9438.
Glazer, A., M. Lindenbaum, and S. Markovitch. 2013. “q-ocsvm: A q-Quantile Estimator for High-Dimensional Distributions.” In Advances in Neural Information Processing Systems 26 (NIPS 2013). Stateline, NV.
Gorelick, N., M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore. 2017. “Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone.” Remote Sensing of Environment 202: 18–27.
Grimmond, S. U. 2007. “Urbanization and Global Environmental Change: Local Effects of Urban Warming.” Geographical Journal 173 (1): 83–88.
Gu, Y., and K. Feng. 2013. “Optimized Laplacian SVM with Distance Metric Learning for Hyperspectral Image Classification.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 6 (3): 1109–1117.
He, M., S. Pathak, U. Muaz, J. Zhou, S. Saini, S. Malinchik, and S. Sobolevsky. 2019. “Pattern and Anomaly Detection in Urban Temporal Networks.” Preprint. arXiv:1912.01960.
Houborg, R., M. F. McCabe, Y. Angel, and E. M. Middleton. 2016.
“Detection of Chlorophyll and Leaf Area Index Dynamics From Sub-Weekly Hyperspectral Imagery.” In Remote Sensing for Agriculture, Ecosystems, and Hydrology XVIII, 999812. Vol. 9998. International Society for Optics and Photonics.
Hu, C. 2009. “A Novel Ocean Color Index to Detect Floating Algae in the Global Oceans.” Remote Sensing of Environment 113 (10): 2118–2129.
Huang, X., Q. Lu, and L. Zhang. 2014. “A Multi-Index Learning Approach for Classification of High-Resolution Remotely Sensed Images Over Urban Areas.” ISPRS Journal of Photogrammetry and Remote Sensing 90: 36–48.
Jia, T. 2019. “Earth Engine Code – Long-Term Spatial and Temporal Monitoring of Cyanobacteria Blooms Using Modis on Google Earth Engine: A Case Study in Taihu Lake.” Accessed 5 February 2020. https://code.earthengine.google.com/e6c4d627ec7f4f0dbc1a4f77fdeb3bb3.
Jia, T., X. Zhang, and R. Dong. 2019. “Long-Term Spatial and Temporal Monitoring of Cyanobacteria Blooms Using Modis on Google Earth Engine: A Case Study in Taihu Lake.” Remote Sensing 11 (19): 2269.
Kasich, J. R., M. Taylor, and C. W. Butler. 2014. Public Water System Harmful Algal Bloom Response Strategy. Ohio Environmental Protection Agency. Columbus, OH.
Khanna, S., A. Palacios-Orueta, M. L. Whiting, S. L. Ustin, D. Riaño, and J. Litago. 2007. “Development of Angle Indexes for Soil Moisture Estimation, Dry Matter Detection and Land-Cover Discrimination.” Remote Sensing of Environment 109 (2): 154–165.
Klemas, V. 2011. “Remote Sensing of Algal Blooms: An Overview with Case Studies.” Journal of Coastal Research 28 (1A): 34–43.
Li, Y., Y. Wang, C. Bi, and X. Jiang. 2018. “Revisiting Transductive Support Vector Machines with Margin Distribution Embedding.” Knowledge-Based Systems 152: 200–214.
Ma, R. 2016. “Lake Taihu Chlorophyll Inversion Product Data Set (2016)” (dataset). National Earth System Science Data Center, National Science and Technology Infrastructure of China. Accessed June 13, 2020. http://www.geodata.cn/data/datadetails.html?dataguid=122425543649945.
Ma, R. 2017. “Lake Taihu Chlorophyll Inversion Product Data Set (2017)” (dataset). National Earth System Science Data Center, National Science and Technology Infrastructure of China. Accessed June 13, 2020. http://www.geodata.cn/data/datadetails.html?dataguid=107063659544667.
Marzuoli, A., and F. Liu. 2019. “Monitoring of Natural Disasters Through Anomaly Detection on Mobile Phone Data.” In 2019 IEEE International Conference on Big Data (Big Data), 4089–4098. IEEE.
McKinney, W. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, 51–56. Vol. 445. Austin, TX.
Mishra, S., and D. R. Mishra. 2012. “Normalized Difference Chlorophyll Index: A Novel Model for Remote Estimation of Chlorophyll-a Concentration in Turbid Productive Waters.” Remote Sensing of Environment 117: 394–406.
Mountrakis, G., J. Im, and C. Ogole. 2011. “Support Vector Machines in Remote Sensing: A Review.” ISPRS Journal of Photogrammetry and Remote Sensing 66 (3): 247–259.
Muñoz-Marí, J., F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls. 2010. “Semisupervised One-Class Support Vector Machines for Classification of Remote Sensing Data.” IEEE Transactions on Geoscience and Remote Sensing 48 (8): 3188–3197.
Negri, R. G., L. V. Dutra, and S. J. S. Sant’Anna. 2014. “An Innovative Support Vector Machine Based Method for Contextual Image Classification.” ISPRS Journal of Photogrammetry and Remote Sensing 87: 241–248.
Negri, R. G., A. C. Frery, W. Casaca, S. Azevedo, M. Araújo, E. Silva, and E. Alcântara. 2020. “Spectral-Spatial Aware Unsupervised Change Detection with Stochastic Distances and Support Vector Machines.” IEEE Transactions on Geoscience and Remote Sensing 59 (4): 2863–2876.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg, et al. 2011.
“Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (Oct): 2825–2830.
Rastrigin, L. 1963. “The Convergence of the Random Search Method in the Extremal Control of a Many Parameter System.” Automaton & Remote Control 24: 1337–1342.
Rouse Jr., J. W., R. H. Haas, J. Schell, and D. Deering. 1973. “Monitoring the Vernal Advancement and Retrogradation (Green Wave Effect) of Natural Vegetation.” Third ERTS Symp. 1: 309–317.
Schölkopf, B., J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. 2001. “Estimating the Support of a High-Dimensional Distribution.” Neural Computation 13 (7): 1443–1471.
Schölkopf, B., R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. 2000. “Support Vector Method for Novelty Detection.” In Advances in Neural Information Processing Systems: 582–588. Denver, CO.
Shawe-Taylor, J., and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. New York, NY, USA: Cambridge University Press.
Shenglei, W., L. Junsheng, Z. Bing, S. Qian, Z. Fangfang, and L. Zhaoyi. 2016. “A Simple Correction Method for the Modis Surface Reflectance Product Over Typical Inland Waters in China.” International Journal of Remote Sensing 37 (24): 6076–6096.
Shi, K., Y. Zhang, B. Qin, and B. Zhou. 2019. “Remote Sensing of Cyanobacterial Blooms in Inland Waters: Present Knowledge and Future Challenges.” Science Bulletin 64 (20): 1540–1556.
Song, W., J. Dolan, D. Cline, and G. Xiong. 2015. “Learning-based Algal Bloom Event Recognition for Oceanographic Decision Support System Using Remote Sensing Data.” Remote Sensing 7 (10): 13564–13585.
Stumpf, R. P., T. T. Wynne, D. B. Baker, and G. L. Fahnenstiel. 2012. “Interannual Variability of Cyanobacterial Blooms in Lake Erie.” PloS One 7 (8): e42444.
Sun, D., Y. Li, and Q. Wang. 2009. “A Unified Model for Remotely Estimating Chlorophyll-a in Lake Taihu, China, Based on SVM and in Situ Hyperspectral Data.” IEEE Transactions on Geoscience and Remote Sensing 47 (8): 2957–2965.
United Nations, I.S.f.D.R. 2015. Global Assessment Report on Disaster Risk Reduction 2015: Making Development Sustainable: The Future of Disaster Risk Management. UN.
USGS. 2017. “Product Guide: Landsat 8 Surface Reflectance Code (LASRC) Product” (techreport). United States Geological Survey. Accessed 1 August 2019. https://www.usgs.gov/media/files/land-surface-reflectance-code-lasrc-product-guide.
USGS. 2020. “MODIS/Terra Surface Reflectance Daily L2G Global 1 km and 500 m.” Accessed 5 March 2020. https://lpdaac.usgs.gov/products/mod09gav006/.
Van Der Walt, S., S. C. Colbert, and G. Varoquaux. 2011. “The Numpy Array: a Structure for Efficient Numerical Computation.” Computing in Science & Engineering 13 (2): 22.
van Rijsbergen, C. J. 1979. Information Retrieval. 2nd ed. London: Butterworths. http://www.dcs.gla.ac.uk/Keith/Preface.html.
van Rossum, G., and F. L. Drake. 2011. The Python Language Reference Manual. Godalming, UK: Network Theory Ltd.
Verstraete, M. M., and B. Pinty. 1996. “Designing Optimal Spectral Indexes for Remote Sensing Applications.” IEEE Transactions on Geoscience and Remote Sensing 34 (5): 1254–1265.
Wang, S., J. Li, B. Zhang, E. Spyrakos, A. N. Tyler, Q. Shen, F. Zhang, T. Kuster, M. K. Lehmann, and Y. Wu, et al. 2018. “Trophic State Assessment of Global Inland Waters Using a Modis-Derived Forel-Ule Index.” Remote Sensing of Environment 217: 444–460.
Wang, M., M. Strokal, P. Burek, C. Kroeze, L. Ma, and A. B. Janssen. 2019. “Excess Nutrient Loads to Lake Taihu: Opportunities for Nutrient Reduction.” Science of the Total Environment 664: 865–873.
Watanabe, F., E. Alcantara, T. Rodrigues, L. Rotta, N. Bernardo, and N. Imai. 2018. “Remote Sensing of the Chlorophyll-a Based on OLI/Landsat-8 and MSI/Sentinel-2A (Barra Bonita Reservoir, Brazil).” Anais da Academia Brasileira de Ciências 90 (2): 1987–2000.
Wetzel, R. G. 2001. Limnology: Lake and River Ecosystems. 3rd ed. San Diego, CA: Academic Press.
Xu, H., H. Paerl, B. Qin, G. Zhu, N. Hall, and Y. Wu. 2015. “Determining Critical Nutrient Thresholds Needed to Control Harmful Cyanobacterial Blooms in Eutrophic Lake Taihu, China.” Environmental Science & Technology 49 (2): 1051–1059.
Xue, J., and B. Su. 2017. “Significant Remote Sensing Vegetation Indices: A Review of Developments and Applications.” Journal of Sensors 2017: Article ID: 1353691.
Yang, Y., and X. Liu. 1999. “A Re-Examination of Text Categorization Methods.” In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 42–49. Berkeley, CA.
Yi, H.-S., S. Park, K.-G. An, and K.-C. Kwak. 2018. “Algal Bloom Prediction Using Extreme Learning Machine Models at Artificial Weirs in the Nakdong River, Korea.” International Journal of Environmental Research and Public Health 15 (10): 2078.
Zhang, Y., R. Ma, H. Duan, S. A. Loiselle, J. Xu, and M. Ma. 2014. “A Novel Algorithm to Estimate Algal Bloom Coverage to Subpixel Resolution in Lake Taihu.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (7): 3060–3068.
Zhao, D. 2003. “Application of NDVI to Detecting Algal Bloom in the Bohai Sea of China from AVHRR.” In Ocean Remote Sensing and Applications, 241–246. Vol. 4892. International Society for Optics and Photonics.

2.2 Article - “Forecasting algal bloom events in inland waters using remote sensing data and machine learning methods”

The following article was submitted for consideration by the editorial board of the journal ENVIRONMENTAL MODELLING & SOFTWARE on July 23, 2021.
Forecasting algal bloom events in inland waters using remote sensing data and machine learning methods

ARTICLE INFO

Keywords: Remote sensing, Algal bloom, Forecasting, Google Earth Engine, Classification

ABSTRACT

The monitoring of water quality and algal blooms has received considerable attention from the scientific community, as these blooms can pose risks to the health of living beings. The use of Remote Sensing can reduce the cost of in loco analysis, as well as being a source of data. Additionally, Machine Learning techniques and concepts favor the development of solutions for environmental analysis and monitoring. In this context, supported by classification models and multitemporal data series, an automated method able to predict algal blooms is proposed. The proposal uses Modis images and meteorological/climatic products acquired from the Google Earth Engine platform. Three case studies involving the prediction of the phenomenon in lakes Erie (USA), Chilika (India) and Taihu (China) demonstrate a median global accuracy of 95%. The computational cost is the main drawback of the proposal.

1. Introduction

Water quality is of vital importance for the Earth, mainly due to the recent increase in population and climate change [11]. Demand for water for both domestic and agricultural use is therefore growing, increasing pressure on this resource and, consequently, on the environment [40]. Another consequence is the proliferation of cyanobacteria, responsible for severe damage to ecological structures and aquatic ecosystems [68]. This phenomenon, also known as Harmful Algal Blooms (HABs), implies severe risks to the behavior and health of living beings [53].

Several parameters are described by [11] as indicators of water quality, including suspended sediment, turbidity, total phosphorus, dissolved organic content, temperature, and Secchi disk depth. However, chlorophyll-a (Chl-a) stands out as the most used parameter [23].
Its extent is directly related to sudden changes in components such as surface temperature, wind speed, precipitation, water column stratification, and water flow direction [68, 53, 50].

[40] point to the importance of sustainable water resources management. Terrestrial monitoring stations are described as precarious, with outdated or missing data. As highlighted by [11], they are not uniformly distributed and operate intermittently, primarily due to lack of investment and government interest. To get around this limitation, [11] address the use of remote sensing images in water quality management. An example can be seen in [45], where the authors present an algorithm based on the Empirical Orthogonal Functions model to estimate the concentration of Chl-a in Lake Taihu, China. [2] use images from the Moderate Resolution Imaging Spectroradiometer (Modis) sensor to assess the ability of the Medium Resolution Continental Shelf model to predict high algal bloom events in a eutrophic coastal sea.

With the increase in the quantity and quality of data, technologies, and information from imaging sensors and Artificial Intelligence [72, 36], several studies have focused on increasing the accuracy of Chl-a concentration estimation and algal bloom prediction models. An example is the use of Support Vector Machine (SVM) and Landsat-8/OLI images to estimate Chl-a concentration in lakes [73]. Also, [3] propose a study to detect algal bloom occurrences in inland waters with the One-class Support Vector Machine. Using Sentinel-2/MSI images and the Random Forest (RF) model, the study of [54] addresses the prediction of Chl-a in two small water bodies. Additionally, approaches using deep learning models [12, 31], which often accurately approximate nonlinear environmental parameters [72], have become popular.
In their study, [5] use a hybrid model of Convolutional Neural Network and Long Short-term Memory (LSTM) capable of predicting the concentration of Chl-a in a lake in Greece. Applying LSTM networks, [12] demonstrate a 1- and 4-day Chl-a concentration prediction model in the Geum River (South Korea). In a study published by [63], a multilayer perceptron is applied to generate a cyanobacterial prediction model.

However, it should be stressed that physical models, such as the one presented by [19], are strongly dependent on environmental variables [72]. Similarly, [72] also highlights that most of the models based on Remote Sensing data and Machine Learning concepts are trained with ground-truth/reference samples collected from the study area. In both cases, obtaining the needed information may be critical. Thus, efforts are needed to develop tools that do not depend on reference data.

In this context, using multitemporal series of remote sensing images obtained by the Modis sensor, meteorological/climatic attributes and Machine Learning methods, this work proposes an algorithm able to forecast the occurrence of anomalies in aquatic environments. Three case studies related to the prediction of algal blooms in Lakes Erie (USA), Chilika (India) and Taihu (China) are carried out to demonstrate the potential of the proposed method.
In summary, the main contributions of this research are: (i) an unsupervised method designed to predict anomalous behavior; (ii) a modular method that is therefore flexible regarding the use of other classification models, in addition to those presented in the following formalization; (iii) a conceptual formalization that, after convenient changes and adaptations, can be applied to other environmental issues, in addition to the detection of anomalies in inland waters.

This article is organized as follows: Section 2 briefly reviews data classification, machine learning models and spectral indices; Section 3 formalizes the proposed algorithm; Section 4 describes the experiment, the study areas and the reference data, and presents and discusses the obtained results; finally, Section 5 presents conclusions about the proposal and the studies developed.

2. Theoretical Rationale

2.1. Preliminary notations

Let $\mathcal{I}$ be an image remotely obtained by a sensor whose pixels $s \in \mathcal{S} \subset \mathbb{N}^2$ are associated with an attribute vector $\mathbf{x}_s = [x_{s:1}, \ldots, x_{s:n}] \in \mathcal{X} \subset \mathbb{R}^n$. Assuming that a given set of images composes a multitemporal series with $t$ distinct instants taken over the same spatial domain $\mathcal{S}$, the notation $\mathcal{I}^{(\ell)}$, with $\ell = 1, \ldots, t$, is adopted as a spatio-temporal representation of these data.

The image classification process comprises associating a class $\omega_k \in \Omega = \{\omega_1, \ldots, \omega_c\}$ to each $s \in \mathcal{S}$ by applying a function $G : \mathcal{X} \to \mathcal{Y}$ over $\mathbf{x}_s$, where $\mathcal{Y} = \{1, \ldots, c\}$. The different classification methods proposed in the literature differ in how $G$ is modeled. When this function is modeled in a supervised way, information from a training set $\mathcal{D} = \{(\mathbf{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y} : i = 1, \ldots, m\}$ is used, where $(\mathbf{x}_i, y_i)$ indicates that $\mathbf{x}_i$ is assigned to the class $\omega_a$ if the scalar $y_i \in \mathbb{N}$ is equal to $a$.
Conveniently, we denote by $\mathcal{C}^{(\ell)}$ the result of applying $G$ to each position $s$ of $\mathcal{I}^{(\ell)}$.

2.2. Machine learning models

Given the importance of image classification in several applications involving Remote Sensing data, the development of more efficient and accurate methods has become a constant challenge [30, 66]. Among various examples, consolidated methods such as Support Vector Machine (SVM) [14], Random Forest (RF) [8] and the recent models based on neural networks have proven potential in the context of Remote Sensing applications.

Introduced by Vladimir Vapnik [62], the SVM method comprises a supervised learning algorithm that aims to separate classes through a surface $g(\mathbf{x}) = K(\mathbf{x}, \mathbf{w}) + b$ whose margin is maximum [14, 41]; $\mathbf{w}$ and $b$ are parameters that determine the separation surface, and $K(\cdot, \cdot)$ is a kernel function [52] conveniently adopted according to the complexity of the classification problem. As described by [9], starting from a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i) \in \mathcal{X} \times \{-1, +1\} : i = 1, \ldots, m\}$, where $y_i = \pm 1$ indicates membership in one of two classes, training the SVM comprises computing the parameters $\mathbf{w}$ and $b$ by solving the following optimization problem:

$$\max_{\alpha} \left( \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \right) \quad \text{s.t.} \quad \begin{cases} \sum_{i=1}^{m} y_i \alpha_i = 0 \\ 0 \le \alpha_i \le C, \quad i = 1, \ldots, m \end{cases} \tag{1}$$

where $\alpha_i \in \mathbb{R}$ are Lagrange multipliers and $C \in \mathbb{R}_0^+$ is a penalty factor applied to misclassifications. Regarding the kernel, the radial basis function (RBF) $K(\mathbf{x}_i, \mathbf{w}) = e^{-\gamma \|\mathbf{x}_i - \mathbf{w}\|^2}$, with $\gamma \in \mathbb{R}_0^+$, is highlighted as a convenient option. The Lagrange multipliers obtained by solving Equation 1 then define the classification rule $G(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right)$. More details are found in [15] and [57].
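As an illustration of the SVM classification rule, the sketch below evaluates $G(\mathbf{x})$ with an RBF kernel in plain Python. It is not the authors' implementation: the support vectors, multipliers, bias and gamma are hypothetical values chosen by hand rather than obtained by solving Equation 1, and the function names are introduced here only for the example.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_decide(x, support_vectors, labels, alphas, b, gamma):
    """Evaluate G(x) = sgn(sum_i y_i * alpha_i * K(x, x_i) + b).

    Returns +1 or -1; a score of exactly zero is mapped to +1 by convention.
    """
    score = b
    for x_i, y_i, alpha_i in zip(support_vectors, labels, alphas):
        score += y_i * alpha_i * rbf_kernel(x, x_i, gamma)
    return 1 if score >= 0 else -1


# Hypothetical values for illustration only (not a trained model).
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
labels = [-1, +1]
alphas = [1.0, 1.0]  # satisfies the constraint sum_i y_i * alpha_i = 0
b = 0.0
gamma = 1.0

near_positive = svm_decide((0.9, 1.1), support_vectors, labels, alphas, b, gamma)
near_negative = svm_decide((0.1, -0.1), support_vectors, labels, alphas, b, gamma)
print(near_positive, near_negative)
```

With an RBF kernel each support vector contributes according to its distance from the query point, so a point close to the positive support vector receives a positive score and vice versa. In practice the multipliers and bias would come from a solver, for example the SVM implementation in scikit-learn [cited in the first article's references], rather than being set manually.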
Proposed by [8], the RF method comprises a classification rule $G : \mathcal{X} \to \mathcal{Y}$ resulting from an ensemble of decision trees [16]. Formally, from a training set $\mathcal{D}$, $n_{est}$ replicas with the same cardinality are drawn by bootstrap sampling. For each replica, a subset with up to $n_{att}$ attributes is randomly selected and then used to train a single decision tree. Parameters such as the maximum depth ($p_{depth}$) and the minimum number of examples for splitting and per leaf ($p_{split}$ and $p_{leaf}$, respectively) need to be tuned before the training process. A detailed discussion of these parameters is found in [8]. After training each tree $G_k : \mathcal{X} \to \mathcal{Y}$, with $k = 1, \ldots, n_{est}$, a given unlabeled attribute vector $\mathbf{x}$ is assigned to the class $\omega_a \in \Omega$ on which the agreement among the $n_{est}$ decision trees is maximum, that is:

$$G(\mathbf{x}) = \arg\max_{a \in \{1, \ldots, c\}} \left\{ \sum_{k=1}^{n_{est}} \delta_a\left( G_k(\mathbf{x}) \right) \right\} \tag{2}$$

where $\delta_a\left( G_k(\mathbf{x}) \right) = 1$ when $G_k(\mathbf{x}) = a$; otherwise, $\delta_a\left( G_k(\mathbf{x}) \right) = 0$.

Alongside the previous methods, models based on neural network concepts emerge as an effective alternative for the classification of Remote Sensing images [72]. Such models are characterized by their high generalization ability when dealing with uncorrelated data. Among the many neural network proposals, the Long Short-term Memory (LSTM) is a model that incorporates and exploits temporal effects in the learning process. The diagram depicted in Figure 1 helps to understand this method. Initially, we assume $\mathbf{x}_i$, with $i = 1, \ldots, m$, as a set of patterns (i.e., attribute vectors) ordered by the index $i$, which are sequentially submitted to concatenation, element-wise multiplication and vector sum ($\odot$, $\otimes$ and $\oplus$) and then applied to sigmoid ($\sigma$) and hyperbolic tangent ($\varphi$) functions.
It is also worth highlighting some elements of the training process: $\mathbf{c}_{i-1}$ and $\mathbf{c}_i$ are vectors with the “previous” and “current” states of the network; $\mathbf{f}_i$ is a vector comprising “forgetting factors” for each component of the input vector; $\mathbf{i}_i$ is a “signal modulation” of the input information; and $\mathbf{h}_{i-1}$ and $\mathbf{h}_i$ are the network input and output vectors in the “previous” and “current” states. After successively inserting the patterns $\mathbf{x}_i$ and considering the respective predictions $\mathbf{h}_i$, the training process stops when the network converges. It is worth noting that the classification is performed by a softmax function [18] adjusted using the outputs $\mathbf{h}_i$, whose class indicators $y_i$ are known from $(\mathbf{x}_i, y_i) \in \mathcal{D}$. In this method, the combination of the described network and the softmax function comprises the classification model $G$. A more extensive and comprehensive discussion of the LSTM method and its parameters is found in [26].

2.3. Algal bloom detection through spectral index thresholding

Spectral indices can be understood as an attribute extraction process for Remote Sensing images whose purpose is to aid in the identification of specific targets. Over the years, several studies have proposed and used spectral indices, for example, for the characterization of water bodies [37, 20, 69] and vegetation health [49, 27, 33, 67]. In the context of Chl-a estimation, the detection of algal blooms has been performed through the adoption of rigid thresholds. In [74], the Normalized Difference Vegetation Index (NDVI) [49] is used to identify algae when the index reaches values greater than −0.15. Similarly, in [69], the Modified Normalized Difference Water Index (MNDWI) [25] is used to characterize algal blooms when the index is less than zero. The Surface Algae Bloom Index (SABI) [1] and Floating Algae Index (FAI) [42] are proposals exclusively for detecting algae, whose occurrence is verified when these indices surpass the values −0.1 and −0.004, respectively.
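The thresholding rules just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the reflectance values are hypothetical, the index expressions are the usual NDVI and FAI definitions, and the band-centre wavelengths (645, 859 and 1240 nm) are assumed here for Modis-like bands rather than taken from this article. Only NDVI and FAI are shown; the other indices follow the same pattern.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def fai(nir, red, swir, lam_red=645.0, lam_nir=859.0, lam_swir=1240.0):
    """Floating Algae Index: NIR reflectance minus a red-to-SWIR linear baseline.

    The default wavelengths (in nm) are assumptions for Modis-like bands.
    """
    baseline = red + (swir - red) * (lam_nir - lam_red) / (lam_swir - lam_red)
    return nir - baseline


def is_algae_ndvi(nir, red, threshold=-0.15):
    """Flag a pixel as algae when NDVI exceeds the rigid threshold."""
    return ndvi(nir, red) > threshold


def is_algae_fai(nir, red, swir, threshold=-0.004):
    """Flag a pixel as algae when FAI exceeds the rigid threshold."""
    return fai(nir, red, swir) > threshold


# Hypothetical surface reflectances for two pixels (illustration only).
bloom_pixel = {"red": 0.03, "nir": 0.08, "swir": 0.01}
clear_pixel = {"red": 0.04, "nir": 0.01, "swir": 0.005}

for name, px in (("bloom", bloom_pixel), ("clear", clear_pixel)):
    print(
        name,
        is_algae_ndvi(px["nir"], px["red"]),
        is_algae_fai(px["nir"], px["red"], px["swir"]),
    )
```

Keeping the threshold as a keyword argument makes it easy to test alternative cut-offs, which matters because such rigid thresholds are tied to specific sensors and study areas.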
Table 1 summarizes the above-mentioned spectral indices and the respective thresholds that characterize the occurrence of algae. In the expressions, $x_{Red}$, $x_{Green}$, $x_{Blue}$, $x_{NIR}$ and $x_{SWIR}$ represent the spectral behavior measured in the red, green, blue, near-infrared and short-wave infrared spectral bands, respectively; $\lambda_{Red}$, $\lambda_{NIR}$ and $\lambda_{SWIR}$ are the midpoints of the red, near-infrared and short-wave infrared wavelength bands.

Figure 1: An overview of the LSTM architecture.

Table 1: Summary of spectral indices and thresholds used for algal bloom detection.

| Spectral index | Expression | Algae threshold | Reference |
|---|---|---|---|
| NDVI | $\dfrac{x_{NIR} - x_{Red}}{x_{NIR} + x_{Red}}$ | $> -0.15$ | [74] |
| MNDWI | $\dfrac{x_{Green} - x_{SWIR}}{x_{Green} + x_{SWIR}}$ | $< 0$ | [70] |
| SABI | $\dfrac{x_{NIR} - x_{Red}}{x_{Blue} + x_{Green}}$ | $> -0.1$ | [1] |
| FAI | $x_{NIR} - \left[ x_{Red} + (x_{SWIR} - x_{Red}) \times \dfrac{\lambda_{NIR} - \lambda_{Red}}{\lambda_{SWIR} - \lambda_{Red}} \right]$ | $> -0.004$ | [42] |

3. Anomalous behaviour forecasting

This section introduces a novel proposal for predicting algal blooms in inland waters. In the following discussion, the emergence of this phenomenon is called an “anomaly”; otherwise, we adopt the term “regular”. Section 3.1 presents a conceptual formalization of the proposal. Section 3.2 gives relevant details regarding its implementation.

3.1. Conceptual formalization

Based on Machine Learning concepts and multitemporal Remote Sensing data, a method for predicting anomalous events in aquatic environments is proposed. For this reason, the method is called Anomalous Behavior Forecasting (ABF). Figure 2 illustrates the main components of this proposal, which are structured in four main blocks. The first block is intended to build a multitemporal database for predicting algal blooms.
Five basic parameters are needed at this stage: a region of interest where the forecasting is executed; a period of analysis for information extraction and modelling; the number of past time instants used to compose the information (i.e., the attribute vector) to build the forecast model; the nu