PEDRO HENRIQUE MORAES ANANIAS

ANOMALY DETECTION, MONITORING AND FORECASTING
WITH APPLICATION IN CASE STUDIES RELATED TO
POTENTIALLY TOXIC ALGAL BLOOMS USING
MULTITEMPORAL REMOTE SENSING DATA AND
MACHINE LEARNING ALGORITHMS

2021



Dissertation presented to the Instituto de Ciência
e Tecnologia, Universidade Estadual Paulista
(Unesp), São José dos Campos Campus, and the
Centro Nacional de Monitoramento e Alertas
de Desastres Naturais (Cemaden), in partial
fulfilment of the requirements for the degree of
MASTER in the Graduate Program in
NATURAL DISASTERS.
Area: Natural disasters. Research line:
Instrumentation and data analysis.

Advisor: Prof. Dr. Rogério Galante Negri

São José dos Campos
2021


Instituto de Ciência e Tecnologia [internet]. Normalização de tese e dissertação
[accessed 2021]. Available at http://www.ict.unesp.br/biblioteca/normalizacao

Graphic presentation and formatting follow the standards established by the Serviço de
Normalização de Documentos da Seção Técnica de Referência e Atendimento ao Usuário e
Documentação (STRAUD).

Cataloguing record prepared by the Biblioteca Prof. Achille Bassi and the Seção Técnica de Informática, ICMC/USP,
with adaptations by STATI, STRAUD and DTI of ICT/UNESP.

Renata Aparecida Couto Martins CRB-8/8376

Ananias, Pedro Henrique Moraes
   Anomaly detection, monitoring and forecasting with application in case studies
related to potentially toxic algal blooms using multitemporal remote sensing data
and Machine Learning algorithms / Pedro Henrique Moraes Ananias. - São José dos
Campos : [s.n.], 2021.
   61 f. : ill.

   Dissertation (Master's) - Graduate Program in Natural Disasters - Universidade
Estadual Paulista (Unesp), Instituto de Ciência e Tecnologia, São José dos
Campos, 2021.
   Advisor: Rogério Galante Negri.

   1. Hyperspectral Remote Sensing. 2. Algal Blooms. 3. Machine Learning.
4. Time Series Analysis. I. Negri, Rogério Galante, advisor. II. Universidade
Estadual Paulista (Unesp), Instituto de Ciência e Tecnologia, São José dos
Campos. III. Universidade Estadual Paulista 'Júlio de Mesquita Filho' - Unesp.
IV. Universidade Estadual Paulista (Unesp). V. Title.


EXAMINATION BOARD

Prof. Dr. Rogerio Galante Negri (Advisor)
Universidade Estadual Paulista (Unesp)
Instituto de Ciência e Tecnologia
São José dos Campos Campus

Profa. Dra. Tatiana Sussel Gonçalves Mendes
Universidade Estadual Paulista (Unesp)
Instituto de Ciência e Tecnologia
São José dos Campos Campus

Prof. Dr. Thales Sehn Körting
Instituto Nacional de Pesquisas Espaciais (INPE)
Divisão de Observação da Terra e GeoInformática

São José dos Campos, 12 August 2021.


CONTENTS

LIST OF ABBREVIATIONS AND ACRONYMS
RESUMO
ABSTRACT
1 INTRODUCTION
2 ARTICLES
2.1 Article - “Anomalous behaviour detection using one-class support vector machine
and remote sensing images: a case study of algal bloom occurrence in inland waters”
2.2 Article - “Forecasting algal bloom events in inland waters using remote sensing
data and machine learning methods”
3 FINAL CONSIDERATIONS
REFERENCES


LIST OF ABBREVIATIONS AND ACRONYMS

ABD Anomalous Behaviour Detection
ABF Anomalous Behaviour Forecasting
AM Aprendizado de Máquina (Machine Learning)
AP Aprendizado Profundo (Deep Learning)
API Application Programming Interface
FAI Floating Algae Index
GEE Google Earth Engine
HAB Harmful Algal Blooms
IA Inteligência Artificial (Artificial Intelligence)
LSTM Long Short-Term Memory
MNDWI Modified Normalized Difference Water Index
NDVI Normalized Difference Vegetation Index
OC-SVM One-class Support Vector Machines
RF Random Forest
SR Sensoriamento Remoto (Remote Sensing)
SABI Surface Algae Bloom Index
SVM Support Vector Machines


ANANIAS, P. H. M. Anomaly detection, monitoring and forecasting with application in case
studies related to potentially toxic algal blooms using multitemporal remote sensing data
and Machine Learning algorithms. Dissertation. São José dos Campos: Universidade Estadual
Paulista (Unesp), Instituto de Ciência e Tecnologia; Centro Nacional de Monitoramento e
Alertas de Desastres Naturais (Cemaden), 2021.

RESUMO

Globally, severe environmental and climate changes have been observed in recent years.
In this scenario, the increase in toxic algal bloom events stands out, as these events degrade
water quality and, above all, threaten the health of living beings. Studies point to the need
to monitor and forecast this phenomenon, which can be done on the basis of chlorophyll
concentration. Motivated by this, and using Machine Learning techniques and multitemporal
series of Remote Sensing data, two new methods were developed to support the detection and
forecasting of algal blooms. The first aims at the automatic detection of the phenomenon and
applies image classification concepts through the One-class Support Vector Machine algorithm.
The second method forecasts its emergence. To demonstrate the potential and feasibility of
the proposals, the algorithms were applied in case studies in areas susceptible to toxic
algal blooms.

Keywords: Remote Sensing. Algal Bloom. Machine Learning. Anomaly. Classification.
Multitemporal series.


ANANIAS, P. H. M. Anomaly detection, monitoring, and forecasting with application in case
studies related to potentially harmful algal blooms using multitemporal remote sensing
data and Machine Learning algorithms. Dissertation. São José dos Campos: São Paulo
State University (Unesp), Institute of Science and Technology; National Center for Monitoring
and Early Warning of Natural Disasters (Cemaden), 2021.

ABSTRACT

Globally, severe environmental and climate changes have been observed in recent years. In such
a scenario, there has also been an increase in toxic algal blooms, which degrade water quality
and threaten the health of living beings. Studies point to the need to monitor and predict this
phenomenon based on chlorophyll concentration information. Motivated by this, and using
Machine Learning techniques and multitemporal series of Remote Sensing images, two novel
methods were developed to detect and forecast algal blooms. The first aims at the automatic
detection of the phenomenon and applies image classification concepts through the One-class
Support Vector Machine algorithm. The second method is responsible for predicting its
emergence. Different case studies were carried out in order to demonstrate the potential
of the proposed methods.

Keywords: Remote Sensing. Algal Bloom. Machine Learning. Anomaly. Classification.
Multitemporal series.



1 INTRODUCTION

In a global context, environmental changes such as deforestation, the greenhouse effect,
desertification and biodiversity loss are often related to the growth of the human population
and the resulting mismatch in the availability of resources (GRIMMOND, 2007; CHAWLA;
KARTHIKEYAN; MISHRA, 2020). In this regard, Nagendra et al. (2013) stress the importance
of monitoring these changes, despite the high cost of doing so, so that technical groups and
public authorities have realistic parameters for decision-making.

In Brazil, movements have organised since the 1960s to pressure authorities into adopting
policies and agendas for preventing environmental problems (FOWLER; AGUIAR, 1993). In a
more recent study, Nobre et al. (2011) state that the federal and state governments have been
working to expand the knowledge the country needs to respond to the effects of climate change
across the various sectors of society.

Essential for life, water is one of the most heavily demanded resources, whether for
domestic, agricultural or industrial consumption (MISHRA; COULIBALY, 2009). According to
Wells et al. (2015), one of the problems arising from this growing demand, and one responsible
for severe damage to aquatic environments, is the proliferation of cyanobacteria (i.e. Harmful
Algal Blooms, HAB). Caused by human actions (CARVALHO et al., 2013) or climate change
(CASTRO; MOSER, 2012), this phenomenon, also known as a bloom or proliferation, is difficult
to predict (WELLS et al., 2015) and should be the focus of studies that seek to monitor it
and support early warning of its occurrence.

In this regard, chlorophyll-a (Chl-a), an important photosynthetic pigment responsible
for the greenish colour of plants, algae and cyanobacteria (MILENKOVIĆ et al., 2012), is
central to discussions of water quality. Chawla, Karthikeyan and Mishra (2020) describe
several parameters that determine water quality, including suspended sediments, turbidity,
total phosphorus, dissolved organic content, temperature and Secchi disk. In the case of
Chl-a, its development is directly related to sudden changes in surface temperature, wind
speed, precipitation, water column stratification and water flow direction (WELLS et al.,
2015; SHI et al., 2019; ROUSSO et al., 2020). By monitoring it, one can determine whether a
reservoir or river, for example, is fit for human consumption (MATTHEWS; BERNARD;
ROBERTSON, 2012; CARVALHO et al., 2013). However, as highlighted by Chawla, Karthikeyan
and Mishra (2020), monitoring stations are scarce and are often neglected by public
authorities.

In addition to being recorded in many countries around the world (STUMPF et al., 2012;
DUAN et al., 2015; YI et al., 2018; BINDING et al., 2018), algal blooms are frequently found
in Brazil (CARVALHO et al., 2013), in sites responsible for urban water supply (BEYRUTH,
2000; OGASHAWARA et al., 2014) and power generation (MATSUMURA-TUNDISI; TUNDISI,
2005; WATANABE et al., 2018), for example. Besides affecting the ecosystem and threatening
public health, they pose a risk to water treatment, since they alter the odour and taste of
water (CARVALHO et al., 2013).

Broad observation of the Earth through remote sensing enables the development of
solutions, especially for the quantitative, qualitative and temporal assessment of the
proliferation of potentially toxic algae. Currently, a variety of satellites (e.g., Terra,
Aqua, Landsat-8, Sentinel-2) serve as the basis for these algorithms.

Although tools that detect blooms automatically and allow early warnings to be issued
are scarce, the literature contains studies based on empirical data and spectral indices
that focus on detecting HABs and estimating the Chl-a concentration in aquatic environments
(GOWER et al., 2005; MATTHEWS; BERNARD; ROBERTSON, 2012). One example is the Normalized
Difference Vegetation Index (NDVI) (JR et al., 1973), initially developed for vegetation
mapping and later applied to algae-related problems (ZHAO, 2003). Another is the Floating
Algae Index (FAI) (HU, 2009), which Oyama et al. (2015) use to monitor cyanobacteria levels
in Japanese lakes. Further examples are the Modified Normalized Difference Water Index
(MNDWI) (HAN-QIU, 2005), initially developed to detect the presence of clean water in water
bodies (XU, 2006), and the Surface Algae Bloom Index (SABI) (ALAWADI, 2010), an index used
exclusively for algae detection. Additionally, some studies rely on mathematical models,
such as the Empirical Orthogonal Function (EOF) to estimate the Chl-a concentration in Lake
Taihu, China (QI et al., 2014), and the Medium Resolution Continental Shelf (MRCS) model
(ALLEN et al., 2008) to detect HAB events in European coastal zones.

Given the increase in the quantity and quality of data supplied by advanced imaging
sensors (YUAN et al., 2020; MARTÍNEZ-ÁLVAREZ; BUI, 2020), it becomes possible to improve
the accuracy of monitoring, estimation and forecasting of algal blooms. This gain can be
obtained with machine learning and deep learning techniques, already applied in other fields
of knowledge and increasingly adopted in remote sensing science. The use of Support Vector
Machines (SVM) (CORTES; VAPNIK, 1995), for example, is addressed in studies by Sun, Li and
Wang (2009) and Zhang, Huang and Wang (2020). Results obtained with the Random Forests (RF)
algorithm, originally proposed by Breiman (2001), can be found in Song et al. (2015) and
Kupssinskü et al. (2020). Supervised classification is feasible when training data are
available. Similarly, when the goal is to automate anomaly detection, the One-class Support
Vector Machine (OC-SVM) model proposed by Schölkopf et al. (2000) emerges as an alternative.
Munoz-Mari et al. (2010) highlighted the challenges of using OC-SVM to classify poorly
representative data, a recurring problem in this type of application. Nevertheless, these
studies show that modifications to the OC-SVM modelling process can increase its
classification effectiveness and anomaly detection potential.



Along the same lines, there are tools that make use of deep learning (CHO; CHOI;
PARK, 2018; VILÁN et al., 2013). One example is the study by Lee et al. (2019), which
implements multilayer neural networks to detect red tides (FLAGELLATES, 1979) on the Korean
peninsula. In another study, Barzegar, Aalami and Adamowski (2020) propose a hybrid model
composed of a Convolutional Neural Network (CNN) (LECUN et al., 1989) and Long Short-Term
Memory (LSTM) (HOCHREITER; SCHMIDHUBER, 1997) to forecast the amount of Chl-a in a lake
located in Greece.

As discussed above, most models that use remote sensing images are based on physical
(i.e. empirical) models (SATHYENDRANATH et al., 2001; FRANKLIN et al., 2020) or are tailored
to the environment where the field samples were collected, and are not easily replicated
elsewhere (YUAN et al., 2020). It is therefore necessary to develop universal tools that are
independent of in situ data.

Therefore, using machine learning techniques and the analysis of time series of remote
sensing images acquired from the Google Earth Engine Application Programming Interface (GEE
API) platform, two new algorithms were developed that enable fully automated detection and
forecasting of anomalies in aquatic environments. The first, named Anomalous Behaviour
Detection (ABD), detects the occurrence of algal proliferation in inland waters using the
OC-SVM model. Its performance was evaluated against in situ data from Lakes Erie (USA) and
Taihu (China). The second, named Anomalous Behaviour Forecasting (ABF) and designed to
forecast the occurrence of the phenomenon, was evaluated through three case studies covering
Lakes Erie, Chilika (India) and Taihu.

This document is organised as follows: Chapter 2 presents the articles “Anomalous
behaviour detection using one-class support vector machine and remote sensing images: a case
study of algal bloom occurrence in inland waters” and “Forecasting algal bloom events in
inland waters using remote sensing data and machine learning methods”; Chapter 3 presents
the final considerations of this work.



2 ARTICLES

2.1 Article - “Anomalous behaviour detection using one-class support vector machine and
remote sensing images: a case study of algal bloom occurrence in inland waters”

The following article was published on 17 March 2021 in the INTERNATIONAL
JOURNAL OF DIGITAL EARTH <http://dx.doi.org/10.1080/17538947.2021.1907462>.


Anomalous behaviour detection using one-class support vector
machine and remote sensing images: a case study of algal bloom
occurrence in inland waters
Pedro Henrique Moraes Ananias (a,b) and Rogério Galante Negri (a,b)

(a) Graduate Program in Natural Disasters, UNESP/CEMADEN, São José dos Campos, São Paulo, Brazil;
(b) Sciences Technology Institute, São Paulo State University (UNESP), São José dos Campos, São Paulo, Brazil

ABSTRACT
Algal blooms are a frequent subject in scientific discussions and are the
focus of many recent studies, mainly due to their adverse effect on
society. Given the lack of ground truth data and the need to develop
tools for their detection and monitoring, this research proposes a novel
method to automate detection. Concepts derived from multi-temporal
image series processing, spectral indices and classification with One-
class Support Vector Machine (OC-SVM) are used in this proposal.
Imagery from multi-spectral sensors on Landsat-8 and MODIS was
acquired through the Google Earth Engine API (GEE API). In order to
evaluate our method, two bloom detection case studies (Lake Erie (USA)
and Lake Taihu (China)) were performed. Comparisons were made with
methods based on spectral index thresholds. Also, to demonstrate the
performance of the OC-SVM classifier compared to other machine
learning methods, the proposal was adapted to be used with a Random
Forest (RF) classifier, having its results added to the analysis. In situ
measurements show that the proposed method delivers highly accurate
results compared to spectral index thresholding approaches. However, a
drawback of the proposal is its higher computational cost. The
application of the new method to a real-world bloom case is
demonstrated.

ARTICLE HISTORY
Received 11 September 2020
Accepted 17 March 2021

KEYWORDS
Remote sensing; spectral
indices; unsupervised
classification; anomalies;
algal bloom detection

1. Introduction

Natural disasters such as floods, fires, deforestation, thawing, water pollution and earthquakes
are becoming more widely reported (Marzuoli and Liu 2019). They are commonly related to
anthropogenic causes, climate change, population growth, and overuse of land resources (Grimmond
2007; Chawla, Karthikeyan, and Mishra 2020). As described by the United Nations, I.S.f.D.R.
(2015), a disaster is characterised by its ability to disorganise a society, causing human, material,
economic or environmental losses.

In this sense, and as part of a management system, it is strongly desirable that the authorities
know about these events in advance, so that they can build early warning systems to protect society
(He et al. 2019).

Regarding aquatic environments, the appearance of phytoplankton is often associated with
economic losses and the consequent impact on society (Klemas 2011). In addition to the damage to
the local ecosystem, some species are responsible for poisoning humans and animals (Ghatkar, Singh,
and Shanmugam 2019). On the other hand, these algae comprise an important share of the aquatic
food base and are accountable for fixing approximately 50% of the CO2 present in the atmosphere
(Ghatkar, Singh, and Shanmugam 2019).

© 2021 Informa UK Limited, trading as Taylor & Francis Group

CONTACT Pedro Henrique Moraes Ananias pedro.ananias@unesp.br

INTERNATIONAL JOURNAL OF DIGITAL EARTH
2021, VOL. 14, NO. 7, 921–942
https://doi.org/10.1080/17538947.2021.1907462

In the face of environmental degradation and recent climate changes, controlling algal blooms
has become a challenge for humankind (Yi et al. 2018). Observation of the Earth through satellites
and remote sensing images allows quantitative, qualitative, and temporal assessments of these
blooms, making room for the development of algorithms to solve this problem.

An example covered in Binding et al. (2018) highlights the concern about cyanobacterial
proliferation in Lake Winnipeg (Canada) in recent decades, as this region provides essential support
for multiple ecosystems and for recreational and commercial activities, and is a source of
hydroelectric power generation. The authors also note that the impact is a function of the exposure
time, location, quantity, and composition of the proliferating species.

Chlorophyll-a (Chl-a) is a pigment necessary for photosynthesis and is found in all types of
organisms that perform it, mainly algae (Wetzel 2001). Chl-a has been the central focus of recent
discussions on its classification in inland waters (Watanabe et al. 2018). Song et al. (2015) pointed
out several challenges in this process, including a lack of ground truth data and unbalanced samples.
Another issue is the complications generated by obtaining field data, such as high costs and remote
locations (Klemas 2011).

Several studies have addressed the detection of potentially harmful algae using remote sensing.
However, there have only been a few efforts to develop automatic detection and alert models for
their presence (Song et al. 2015). One approach is given by Sun, Li, and Wang (2009),
who created a unified model to estimate the concentration of Chl-a in Lake Taihu (China) using
support vector machines (SVM).

In parallel, Song et al. (2015) discussed the use of artificial intelligence methods to predict algal
blooms in Monterey Bay (California, USA). In the study, given a lack of field data, the authors
propose the use of the Random Forest model (RF) (Breiman 2001) and satellite images obtained from
the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Medium Resolution Imaging
Spectrometer (MERIS) to detect algal distribution along the bay coast.

Spectral indices and their thresholds have also been used for algal bloom detection in recent
studies. One is the Normalized Difference Vegetation Index (NDVI) (Rouse et al.
1973), which was initially developed for vegetation mapping, but is also used in studies related
to algae. Another is the Floating Algae Index (FAI) (Hu 2009), which focuses on detecting floating
algae.

However, the use of these indices based on thresholds is not a permanent solution, as it does not
consider the historical behaviour of the algae. Existing algorithms, such as the Maximum
Chlorophyll Index (MCI) and Maximum Peak Height (MPH), were also considered. As pointed out by Shi
et al. (2019), and following the scope of this research, these indices were not designed to be
used with Landsat-like data to solve the addressed problem, due to the limited spectral
information and low signal-to-noise ratios of such data. In this sense, Shi et al. (2019) also
suggested that new studies should aim to elucidate algal bloom behaviour over time, including the
development of tools that enable its prediction.

Moreover, the implementation of supervised classification in remote sensing, including Chl-a
concentration estimation, is possible when training data is available. Similarly, when the objective
is to automate anomaly detection, the One-class Support Vector Machine (OC-SVM) model
proposed by Schölkopf et al. (2000) emerges as a potential alternative. Muñoz-Marí et al. (2010)
highlighted the challenge of using OC-SVM to classify poorly representative data, which is a recurring
problem in this type of application. However, these studies show that modifications in the
modelling process of OC-SVM may increase its classification effectiveness and potential anomaly
detection.

In light of the problems caused by algae presence, as well as the importance of its continuous
monitoring, we propose a new algorithm based on fully automated unsupervised anomaly detection
approaches, in order to verify the occurrence of such phenomenon in inland waters by using remote
sensing images acquired through the Google Earth Engine Application Programming Interface
(GEE API). The formalisation and construction of this new proposal use concepts related to
spectral indices, image processing, and classification based on OC-SVM. In order to validate it, two case
studies with in situ measurements are carried out on Lake Erie (USA) and a portion of Lake Taihu
(China). A comparison of the proposal, and of a variant based on a Random Forest (RF) classifier,
with the other methods mentioned earlier (Zhao 2003; Jia, Zhang, and Dong 2019) is also included
in this study. Finally, to demonstrate the application of the proposed method in a real-world algal
bloom case, we present an annual mapping (i.e. 2014 to 2018) of this event in Lake Taihu,
comparing the results with those obtained using the method presented by Jia, Zhang, and Dong (2019).

This article is organised as follows: Section 2 briefly reviews data classification and spectral
indices; Section 3 presents the proposed algorithm; Section 4 describes the study areas and data as well
as the experiment design and the results; Section 4.1 discusses details about the study areas, images
used in the algae identification process and selected reference samples for results evaluation; Section
5 presents and discusses the results obtained; finally, Section 6 concludes this paper.

2. Theoretical rationale

2.1. Preliminary notations

Let $\mathcal{I}$ be the matrix representation of an image obtained by remote sensing. Each position
of $\mathcal{I}$ is expressed in terms of $s$, defined over a regular grid $\mathcal{S} \subset \mathbb{N}^2$.
Usually, $s$ is called a pixel and is related to a given geographic location and its sensor
measurements, expressed by the vector $\mathbf{x} \in \mathcal{X}$. We call the vector space
$\mathcal{X}$ the attribute space. Thus, according to these elements and notations,
$\mathcal{I}(s) = \mathbf{x}$ states that the behaviour of $\mathcal{I}$ at position $s$ is expressed
by the components of $\mathbf{x} = (x_1, x_2, \ldots, x_\ell)$.

Assuming a geographic region defined over the support $\mathcal{S}$ and observed at distinct times
$t_i$, $i = 1, 2, \ldots, n$, it is convenient to adopt the notation $\mathcal{I}^{(i)}$. Thus,
$\mathcal{I}^{(i)}$ and $\mathcal{I}^{(j)}$ express the behaviour of the targets contained in the
same region, but at different times.

Due to the need to delimit certain portions of a given image, the use of masks becomes
convenient. For a given image $\mathcal{I}^{(i)}$, the mask $\mathcal{M}^{(i)}$ is an image defined on
the same support $\mathcal{S}$ whose positions are associated with binary values (i.e. 0 or 1). Under
these conditions, $\mathcal{M}^{(i)}(s) = 0$ is used to occlude the values/vectors assigned to
position $s$ in $\mathcal{I}^{(i)}$. On the other hand, $\mathcal{M}^{(i)}(s) = 1$ identifies
positions $s$ of particular interest. Consequently, hiding attributes of $\mathcal{I}^{(i)}$ may be
achieved through $\mathcal{I}^{(i)} \otimes \mathcal{M}^{(i)}$, where $\otimes$ represents the
multiplication between $\mathcal{I}^{(i)}$ and $\mathcal{M}^{(i)}$ at every $s \in \mathcal{S}$. In
turn, $\overline{\mathcal{M}}^{(i)}$ denotes the complement (i.e. $0 \leftrightarrow 1$) of the binary
values in $\mathcal{M}^{(i)}$.
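
The masking operation described above can be illustrated with a minimal sketch. It assumes NumPy and a tiny made-up image; the array names (`image`, `mask`, `masked`, `hidden_part`) and all values are hypothetical and are not taken from the article.

```python
import numpy as np

# Hypothetical 2x3 image with 2 attributes (bands) per pixel: shape (rows, cols, bands).
image = np.arange(12, dtype=float).reshape(2, 3, 2)

# Binary mask on the same support: 1 marks pixels of interest, 0 occludes them.
mask = np.array([[1, 0, 1],
                 [0, 1, 1]])

# Pixel-wise multiplication of image and mask: every attribute vector is
# multiplied by the mask value found at the same position.
masked = image * mask[..., np.newaxis]

# The complement of the mask swaps 0 and 1, selecting the occluded pixels instead.
mask_complement = 1 - mask
hidden_part = image * mask_complement[..., np.newaxis]

print(masked[0, 1])   # occluded pixel: both attributes are zero
print(masked[0, 2])   # pixel of interest: attributes unchanged
```

Because the mask and its complement partition the support, adding `masked` and `hidden_part` recovers the original image.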

Among different applications that use remote sensing images, it is necessary to identify and
distinguish several types of targets that compose the observed landscape. For this purpose, image
classification techniques are usually employed.

The classification process consists of applying $F: \mathcal{X} \to \Omega$ to the attribute vectors
$\mathbf{x}$ of each $s \in \mathcal{S}$ in order to associate a class $\omega_j \in \Omega$,
$j = 1, \ldots, c$. The distinct classification methods proposed in the literature establish ways of
modelling $F$.

Usually, supervised and unsupervised approaches are adopted in remote sensing. In supervised
learning, a set of labelled attribute vectors is used to learn how $F$ should assign an unlabelled
vector $\mathbf{x}$ to a class of $\Omega$. Regarding the unsupervised approach, no labelled
information is available, and $F$ is modelled through the structural organisation found in the set
where it is applied.

Additionally, according to Congalton and Green (2009), the accuracy of mappings obtained from
remote sensing data must be assessed before they are analysed. Among the different measures
available in the literature, the kappa coefficient and the global accuracy account for omissions and
inclusions with respect to certain classes and allow different mappings to be compared with each
other (Bishop et al. 1977). The F1-Score (van Rijsbergen 1979), the harmonic mean of precision and
recall, provides an alternative way of measuring the balance between true and false detections (Yang
and Liu 1999).
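
As a minimal sketch of how these three measures are computed for a binary (bloom / non-bloom) mapping, the following uses the standard textbook definitions of global accuracy, Cohen's kappa and F1-Score. The function names and the reference and predicted labels are hypothetical, and degenerate cases (for example, no positive samples) are not handled.

```python
def confusion_counts(reference, predicted, positive=1):
    """Count true/false positives and negatives for a binary mapping."""
    tp = fp = fn = tn = 0
    for ref, pred in zip(reference, predicted):
        if pred == positive and ref == positive:
            tp += 1
        elif pred == positive:
            fp += 1          # inclusion (commission) error
        elif ref == positive:
            fn += 1          # omission error
        else:
            tn += 1
    return tp, fp, fn, tn


def global_accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected)


def f1_score(tp, fp, fn, tn):
    # Harmonic mean of precision and recall, written in terms of counts.
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical reference samples and classifier output (1 = bloom, 0 = non-bloom).
reference = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(reference, predicted)

print(counts)                   # (3, 1, 1, 3)
print(global_accuracy(*counts)) # 0.75
print(kappa(*counts))           # 0.5
print(f1_score(*counts))        # 0.75
```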

2.2. One-class support vector machines

Support Vector Machine (SVM) is a widespread classification technique, especially in remote
sensing applications. A solid mathematical formulation, simple algorithmic architecture and high
generalisation ability are some of the features that distinguish this method (Bruzzone and Persello 2009).
Furthermore, as reported in Mountrakis, Im, and Ogole (2011), SVM has achieved similar or
even higher accuracy results compared to other classification methods.

Based on the original SVM conception, several variations have been proposed, for example,
Laplacian (Gu and Feng 2013), transductive (Li et al. 2018), context-sensitive (Bruzzone and Persello
2009; Negri, Dutra, and Sant’Anna 2014) and One-class (Schölkopf et al. 2001) SVMs. The last
example, One-class SVM (OC-SVM), is an unsupervised approach motivated by quantile
estimation (Glazer, Lindenbaum, and Markovitch 2013) and may be adopted for change detection
applications (Negri et al. 2020). Conceptually, from a given set of unlabelled observations, the
OC-SVM obtains a model able to classify elements as part of such set with a false-positive/negative
occurrence rate ν.

Formally, we may write a function $F: \mathcal{D} \subset \mathcal{X} \to \{+1, -1\}$, where the
output $+1$ implies that the input element is in $\mathcal{D}$, and $-1$ otherwise. The classifier
$F$ is given by:

$$F(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i K(\mathbf{x}, \mathbf{x}_i) - b \right), \tag{1}$$

where $b = \sum_{j=1}^{m} \alpha_j K(\mathbf{x}_i, \mathbf{x}_j)$ for any $\mathbf{x}_i \in \mathcal{D}$,
and $K(\cdot, \cdot)$ is a kernel function. The coefficients $\alpha_i$, $i = 1, \ldots, m$, are
obtained as the solution of the following optimisation problem:

$$\min_{\alpha_1, \ldots, \alpha_m} \sum_{i,j=1}^{m} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{s.t.} \quad
\begin{cases}
\alpha_i \in \left[ 0, \dfrac{1}{\nu m} \right] \\
\sum_{i=1}^{m} \alpha_i = 1
\end{cases} \tag{2}$$

The most significant characteristic that distinguishes this method from the classic SVM approach
lies in the optimisation problem expressed by Equation (2). It is worth noting that the OC-SVM is
parameterised by $\nu \in [0, 1]$ and by additional parameters related to the adopted kernel function.
For example, when the RBF kernel (i.e. $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$)
is adopted, $\gamma \in (0, 1)$ should also be adjusted. More details on kernel functions can be found
in Shawe-Taylor and Cristianini (2004).

As mentioned, the method is formalised on unsupervised learning and anomaly detection
approaches, thus avoiding issues related to the difficulty of acquiring labelled data, since it only
considers ‘regular samples’ in the learning process. On the other hand, it is sensitive to noise in
the dataset used to model its decision rule: when anomalous observations contaminate this dataset,
the method becomes liable to treat anomalies as regular occurrences.
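
As an illustration of the decision function in Equation (1), the following minimal sketch fits a One-class SVM with an RBF kernel on synthetic 'regular' observations and classifies two query points. The synthetic data and the chosen values of `nu` and `gamma` are assumptions made only for this example; they are not the settings used in the experiments.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical 'regular' observations: two attributes per sample
# (they could stand for two spectral index values of a pixel).
regular = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# nu plays the role of the parameter ν and gamma that of the RBF parameter γ.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1)
model.fit(regular)

# predict returns +1 for elements consistent with the training set, -1 otherwise.
queries = np.array([[0.0, 0.0], [6.0, 6.0]])
labels = model.predict(queries)
```

Here the first query lies at the centre of the training cloud and the second far outside it, so they should receive the labels +1 and −1, respectively.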

2.3. Random forests

Random Forest (RF) is another classifier employed in recent remote sensing studies. Introduced
by Breiman (2001), it relies on a forest of decision trees. Moreover, it exploits the

924 P. H. M. ANANIAS AND R. G. NEGRI


ensemble learning technique in order to combine the output of multiple decision trees through a
majority voting process, finally producing a classification decision.

Assuming a training set $D$, $n_{est}$ sets with the same cardinality are replicated by bootstrap
sampling. Subsequently, for each replica, a random attribute subset with a maximum of $n_{att}$
attributes is selected and used to build a decision tree. Parameters regarding such trees, like the
maximum depth ($p_{depth}$), minimum samples to split ($p_{split}$), and minimum samples per leaf
($p_{leaf}$), should be tuned before the training process. More details and discussions regarding those
parameters are found in Breiman (2001).

Concerning the classification stage, as previously mentioned, the vector $x$ is assigned to the class
in Ω that receives the highest agreement among all individual trees. According to Belgiu and
Drăguţ (2016), RF is a computationally efficient algorithm that does not overfit the final decision rule.
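
The parameters described above map directly onto a typical RF implementation. The sketch below uses the Scikit-Learn `RandomForestClassifier` on synthetic two-class data; the data and the parameter values are assumptions chosen for illustration. Note that Scikit-Learn combines the trees by averaging their class probabilities, which is a soft variant of the majority vote described in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training set D with two well-separated classes.
x_regular = rng.normal(0.0, 0.5, size=(200, 2))
x_anomaly = rng.normal(4.0, 0.5, size=(200, 2))
X = np.vstack([x_regular, x_anomaly])
y = np.array([0] * 200 + [1] * 200)

forest = RandomForestClassifier(
    n_estimators=50,        # n_est: number of bootstrap replicas / trees
    max_features="sqrt",    # n_att: attributes considered at each split
    max_depth=5,            # p_depth
    min_samples_split=2,    # p_split
    min_samples_leaf=2,     # p_leaf
    bootstrap=True,
    random_state=0,
)
forest.fit(X, y)

pred = forest.predict(np.array([[0.0, 0.0], [4.0, 4.0]]))
```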

2.4. Spectral indices thresholding for algal bloom detection

Spectral indices allow the extraction and analysis of remote sensing data. Given a feature of
interest, a spectral index can assist in its identification. This approach, also called ‘spectral
enhancement’, is essential, given the impossibility of modifying orbiting imaging sensors and the
difficulty of obtaining field data with the same spatial and temporal resolution (Verstraete and Pinty 1996).

Generally, spectral indices are derived from algebraic operations on the attributes of $X$ that
characterise the behaviour of $x$ assigned to every pixel $s$ of $I^{(i)}$.

Examples of spectral indices proposed in the literature are focussed on the characterisation of
vegetation (Rouse et al. 1973), components in water bodies (Gao 1996), constructed areas
(Huang, Lu, and Zhang 2014), soil moisture (Khanna et al. 2007), and others (Xue and Su 2017).
Among the different applications that benefit from the use of spectral indices is the detection of
algal blooms (Hu 2009; Mishra and Mishra 2012; Zhang et al. 2014; Houborg et al. 2016; Watanabe
et al. 2018). This problem has received attention due to its potential harm to humans and the
environment.

Zhao (2003) proposes to distinguish algal concentration and bloom using thresholds in NDVI
values. Admitting $I^{(i)}$, an image whose attributes express the behaviour of imaged targets at
wavelength ranges in the electromagnetic spectrum, NDVI is characterised by the following function:

$$f_{NDVI}(x) = \frac{x_{NIR} - x_{Red}}{x_{NIR} + x_{Red}}, \tag{3}$$

where $x_{NIR}$ and $x_{Red}$ refer to the behaviour of the target at near-infra-red and red wavelengths,
respectively. Without loss of generality, $f_{NDVI}$ is applied to every $s \in S$ since $I^{(i)}(s) = x$.
Also, since $x = (x_1, x_2, \ldots, x_\ell)$, we have $NIR, Red \in \{1, 2, \ldots, \ell\}$. Consequently, the
approach proposed in Zhao (2003) can be expressed in terms of the following decision rule:

$$F_{NDVI}(x) = \begin{cases} \omega_1, & f_{NDVI}(x) \leq -0.15 \\ \omega_2, & \text{otherwise;} \end{cases} \tag{4}$$

where $\omega_1$ and $\omega_2$ are the classes referring to non-occurrence and occurrence of algae, respectively.
Additionally, other approaches can be used to detect algal bloom, such as the study by Jia,
Zhang, and Dong (2019). In this case, the following threshold for FAI (Hu 2009) values is set:

$$F_{FAI}(x) = \begin{cases} \omega_1, & f_{FAI}(x) \leq -0.004 \\ \omega_2, & \text{otherwise,} \end{cases} \tag{5}$$

such that:

$$f_{FAI}(x) = x_{NIR} - \left( x_{Red} + (x_{SWIR} - x_{Red}) \times \frac{\lambda_{NIR} - \lambda_{Red}}{\lambda_{SWIR} - \lambda_{Red}} \right), \tag{6}$$

INTERNATIONAL JOURNAL OF DIGITAL EARTH 925


where $x_{SWIR}$ is the behaviour of the target at short-wave infra-red wavelength. Also, $\lambda_{SWIR}$
ranges from 1608 to 1640 nm.
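
The two threshold rules can be sketched in a few lines of NumPy. The FAI below follows the baseline-subtraction form of Hu (2009), i.e. the NIR value minus a linear baseline interpolated between the red and SWIR bands. The reflectance values and the central wavelengths (655, 865 and 1610 nm) are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import numpy as np

def ndvi(nir, red):
    """Normalised difference between NIR and red values."""
    return (nir - red) / (nir + red)

def fai(nir, red, swir, lam_nir, lam_red, lam_swir):
    """NIR value minus the red-SWIR linear baseline evaluated at the NIR wavelength."""
    baseline = red + (swir - red) * (lam_nir - lam_red) / (lam_swir - lam_red)
    return nir - baseline

def threshold_rule(index, threshold):
    """Return 1 (class omega_1, no algae) where index <= threshold, else 2 (omega_2)."""
    return np.where(index <= threshold, 1, 2)

# Two hypothetical pixels: clear water and a bloom-like surface.
red = np.array([0.05, 0.04])
nir = np.array([0.02, 0.30])
swir = np.array([0.01, 0.05])

ndvi_classes = threshold_rule(ndvi(nir, red), -0.15)
fai_classes = threshold_rule(fai(nir, red, swir, 865.0, 655.0, 1610.0), -0.004)
```

For these two pixels both rules agree: the first pixel falls in class 1 and the second in class 2.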

3. Automatic algal bloom detection

3.1. Conceptual formalisation

Using a multi-temporal series of remote sensing images and based on the concepts of spectral indices
and image classification, this work proposes an algorithm based on anomaly detection with case
studies applied to algal blooms in aquatic environments. The diagram in Figure 1 presents an
overview of this proposal and its steps, which are discussed below in the order they appear.

Initially, the user defines the region of interest, which contains the water body; a spanning period
for characterising a time series of images; a particular instant for which anomalies are to be detected;
and the sensor, limited to Landsat-8 OLI (30 m spatial resolution) or MODIS MOD09GA.006 (500 m
spatial resolution). According to the purposes of this research, anomalies are interpreted as an
algal bloom occurrence. However, sudden changes in the target’s spectral response can also be
detected as non-regularities.

Figure 1. General organisation of the proposed method. The ‘Modelling data selection’ section refers to the process of creating a
reference set by automatically selecting between pixels associated with the possible occurrence or non-occurrence of algal
bloom based on indices median variation.

Once input parameters are set, a search for remote sensing images is performed. The Google
Earth Engine (GEE) (Gorelick et al. 2017) platform is used to acquire data according to established
criteria. As a result, images derived from the chosen sensor, with spectral bands ranging from blue
to shortwave infra-red, are returned. To reduce the computational cost, those images are stored in a
caching area, allowing them to be reused by the algorithm. Also, for each image, an auxiliary
product is provided, allowing identification of elements such as water bodies and the presence of
clouds. For convenience, further details regarding this subproduct are discussed later in Section
3.2 (see support data topic).

Subsequently, the query performed allows the construction of a time series of images
$I^{(1)}, I^{(2)}, \ldots, I^{(n)}$. Such series have temporal ordering, where $I^{(n)}$ refers to the most
recent instant. Also, we have $I^{(k)}$, where $k \approx n/2$, expressing the particular instant. The
difference between $I^{(1)}$ and $I^{(k)}$, or equivalently between $I^{(k)}$ and $I^{(n)}$, expresses the
input spanning period.

Through the mentioned auxiliary product, masks are constructed to allow delimitation of the
existing water body in the area of interest as well as mapping locations associated with the
occurrence of clouds and cloud shadows. Since cloud occurrence has a dynamic behaviour over time,
it is necessary to define masks $M_{cl}^{(1)}, \ldots, M_{cl}^{(n)}$ for each image/instant
$I^{(1)}, \ldots, I^{(n)}$. On the other hand, the regions that comprise water are defined by a single
mask $M_{wb}$.

It is a common issue that, in applications involving the observation of the Earth’s surface via
optical sensors, the occurrence of clouds causes a loss of information about the targets. This
work considers images with good visibility, i.e. those where cloud cover is less than or equal to
50%. More information on this procedure is given in Section 3.2 (cloud threshold topic). Under
these conditions, the regions affected by this atmospheric phenomenon are ignored. For the sake
of mathematical and computational simplicity, it is convenient to determine the median image
$\tilde{I}$, whose attribute vector associated with each position $s$ represents the median of each
attribute, also in $s$, for the considered time series.

From the initial time series and its supporting data (i.e. masks and median image), each image is
limited to the water body, followed by correcting the target spectral response with the removal of
cloud occurrences and median trend subtraction. Such a process is formally expressed by:

$$I^{(i)} := \left( I^{(i)} - \tilde{I} \right) \otimes \overline{M}_{cl}^{(i)} \otimes M_{wb}, \quad i = 1, \ldots, n, \tag{7}$$

where ‘−’ denotes the usual matrix subtraction and $\overline{M}_{cl}^{(i)}$ is the reverse (i.e. the
complement) of mask $M_{cl}^{(i)}$. Note that the expressed process provides a redefinition of each $I^{(i)}$.
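
A minimal NumPy sketch of this adjustment is given below. The array shapes, the random data and the choice of excluding cloudy observations from the median (via `np.nanmedian`) are assumptions made for illustration; the masking is interpreted as an element-wise product that sets cloud and non-water positions to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rows, cols, bands = 5, 4, 4, 3

# Hypothetical time series of n images, per-instant cloud masks and one water mask.
series = rng.random((n, rows, cols, bands))
cloud = rng.random((n, rows, cols)) < 0.2      # True where cloud/shadow occurs
water = np.zeros((rows, cols), dtype=bool)
water[1:3, 1:3] = True                          # True inside the water body

# Median image: per-pixel, per-band median over time, ignoring cloudy observations.
clear = np.where(cloud[..., None], np.nan, series)
median_image = np.nanmedian(clear, axis=0)

# Keep only cloud-free water pixels and subtract the median trend elsewhere set to 0.
keep = ~cloud & water[None, :, :]
adjusted = np.where(keep[..., None], series - median_image[None], 0.0)
```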

With the multi-temporal series of images adjusted according to Equation (7), and given the central
objective of this proposal, which is the detection of algal blooms, measures that favour the
identification of this phenomenon are computed. Such measures are the NDVI and FAI indices
discussed in Section 2.4.

With the exception of the image $I^{(k)}$ (i.e. the instant of interest), values of the considered
spectral indices are computed for all pixels of the time series. From these index values, processed
separately, it is possible to extract their average trend $\mu_z$ and its variation level, given in
terms of the standard deviation $\sigma_z$, with $z \in \{NDVI, FAI\}$. In turn, index values outside
the range $[\mu_z - \sigma_z, \mu_z + \sigma_z]$ may represent an anomaly occurrence according to $z$.

While the OC-SVM model uses the reference set $D$ represented by non-occurrence of algal
bloom (i.e. $[\mu_z - \sigma_z, \mu_z + \sigma_z]$, for $z \in \{NDVI, FAI\}$), the RF algorithm is trained
with the complete dataset, which also comprises the anomaly class (i.e.
$]-\infty, \mu_z - \sigma_z] \cup [\mu_z + \sigma_z, +\infty[$, for $z \in \{NDVI, FAI\}$). However, before
modelling, the set of observations defined as regular or anomalous is used as auxiliary information
in the process of selecting the method’s parameters. This makes it possible to obtain a decision
rule capable of identifying the occurrence of anomalies according to the behaviour of the considered
spectral indices. It is important to note that the adjustment of parameters associated with each
method (i.e. ν and the kernel function parameters for OC-SVM; or the depth ($p_{depth}$), number of
estimators ($n_{est}$), feature subset size and minimum samples required to split ($p_{split}$) or to be
a leaf node ($p_{leaf}$) for RF) is conducted automatically. Further details on this procedure are
discussed in Section 3.2 (see model parameter tuning topic).
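
The selection of regular and anomalous observations from the trend range can be sketched as follows. The function name and the sample values are hypothetical, each index is processed separately as in the text, and values lying exactly on the range limits are counted as regular here, which is an assumption of this sketch.

```python
import numpy as np

def split_by_trend(index_values):
    """Flag values inside [mean - std, mean + std] as regular and the rest as anomalous."""
    mu = index_values.mean()
    sigma = index_values.std()
    regular = (index_values >= mu - sigma) & (index_values <= mu + sigma)
    return regular, ~regular

# Hypothetical index values of five pixels; the last one departs from the trend.
values = np.array([0.0, 0.1, -0.1, 0.05, 2.0])
regular, anomalous = split_by_trend(values)

# Regular values would feed the OC-SVM; both groups would be used to train the RF.
reference_set = values[regular]
```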

Lastly, $I^{(k)}$ is expressed in terms of the NDVI and FAI indices and then submitted for
classification. As a result (i.e. output), maps in Tagged Image File Format (TIFF) and GeoJSON (Butler
et al. 2016) formats partition the water body into anomaly and regular (i.e. no anomalous
observation) classes.

In line with the purposes that drove this development, the introduced algorithm is denoted by
the acronym ABD-OCSVM, where ABD stands for Anomalous Behaviour Detection. Analogously,
ABD-RF denotes the variant that uses the RF model.

3.2. Implementation details

The previous formalisation was simplified for clarity. The following information completes
the proposal. The code of the proposed algorithm is freely available at
https://github.com/pedroananias/abd.

Programming language and libraries: The Python 3.6 (van Rossum and Drake 2011) programming
language was used to implement the proposed algorithm. Additionally, functions provided in
the Numpy (Van Der Walt, Colbert, and Varoquaux 2011) and Pandas (McKinney 2010)
libraries were used for data manipulation. The Scikit-Learn (Pedregosa et al. 2011) library
was used in OC-SVM and RF modelling and classification processes.

Google Earth Engine API: To access Landsat-8 OLI and MODIS MOD09GA.006 satellite imagery, the
Google Earth Engine Application Programming Interface (API) (GEE-API 2019), compatible
with the Python language, was used. This API allows automation of the image search process
for a given period and region of interest. Also, it is worth mentioning that images from the Landsat
(Landsat 8 Surface Reflectance Code (LaSRC) algorithm (USGS 2017)) and MODIS (USGS
2020) sensors returned by this API have previously been subjected to atmospheric correction, and
all processing and modelling steps after their extraction are performed outside the GEE platform.

Model parameter tuning: As discussed in Sections 2.2 and 2.3, the OC-SVM and RF methods require the
adjustment of parameters inherent in their formalisation (e.g. the ν parameter or the number of trees
in the forest). Given the high degree of freedom associated with the process of selecting
appropriate parameters for the method modelling step, a Randomized Grid Search procedure was
employed (Rastrigin 1963; Baba 1981; Bergstra and Bengio 2012). This procedure consists of
testing a finite set of parameter settings and selecting the one that ensures the highest accuracy. For
OC-SVM, the search space that determines the tested settings is given by
$\nu \in \{10^{-1}, 10^{-2}, \ldots, 10^{-7}\}$ and an RBF kernel with
$\gamma \in \{10^{-1}, 10^{-2}, \ldots, 10^{-7}\}$. Regarding the RF parameters, with the Gini impurity
measure guiding the node splitting process, the tested settings are $n_{est} \in \{1, 5, \ldots, 250\}$,
$p_{depth} \in \{1, 2, \ldots, 30\}$, $p_{split} \in \{2, 4, \ldots, 20\}$, $p_{leaf} \in \{2, 4, \ldots, 20\}$
and $n_{att} \in \{\sqrt{\dim(X)}, 100\%, 75\%, 50\%\}$. In addition, each tested configuration is
evaluated on the decision rule modelling dataset according to a 10-fold cross-validation process.
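
For the RF case, a randomised search over the space above could look like the following sketch with the Scikit-Learn `RandomizedSearchCV` class. The synthetic data, the number of sampled settings (`n_iter=5`) and the reading of $\{1, 5, \ldots, 250\}$ as 1 followed by multiples of 5 are assumptions of this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Hypothetical modelling dataset with regular (0) and anomalous (1) observations.
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(3.0, 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

param_space = {
    "n_estimators": [1] + list(range(5, 251, 5)),   # n_est (assumed step of 5)
    "max_depth": list(range(1, 31)),                # p_depth
    "min_samples_split": list(range(2, 21, 2)),     # p_split
    "min_samples_leaf": list(range(2, 21, 2)),      # p_leaf
    "max_features": ["sqrt", 1.0, 0.75, 0.5],       # n_att
}

search = RandomizedSearchCV(
    RandomForestClassifier(criterion="gini", random_state=0),
    param_distributions=param_space,
    n_iter=5,             # number of randomly drawn settings (assumed)
    cv=10,                # 10-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```

After fitting, `search.best_estimator_` holds the forest refitted with the selected setting on the whole modelling dataset.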

Decision rule modelling dataset size: As described above, the samples comprising the decision rule
modelling dataset are defined as a function of the trend range $[\mu - \sigma, \mu + \sigma]$. Given the
extremely high number of examples (i.e. attribute vectors associated with each pixel) in the time
series employed, the use of all available data for modelling the OC-SVM method becomes
computationally prohibitive, thus motivating the use of randomly selected subsets. After preliminary
tests with 10%, 5%, 2.5% and 1% of the available examples, it was found that using 1% yielded
results similar to those of the higher proportions at a lower computational cost. There is an
exponential increase in the computational cost with an increasing decision rule modelling set size, as
a time series can reach more than ten million pixels in its processing dataset. Thus, the 1% ratio
is employed in determining the modelling set.
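
The random selection of a 1% subset can be sketched as below. The function name, the fixed seed and the sampling without replacement are assumptions of this example.

```python
import numpy as np

def subsample(samples, fraction=0.01, seed=0):
    """Draw a random subset of rows without replacement."""
    rng = np.random.default_rng(seed)
    size = max(1, int(round(fraction * len(samples))))
    chosen = rng.choice(len(samples), size=size, replace=False)
    return samples[chosen]

# Hypothetical modelling dataset with 10,000 attribute vectors of two attributes each.
samples = np.arange(20000, dtype=float).reshape(10000, 2)
modelling_set = subsample(samples, fraction=0.01)
```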

Support data: Masks that delimit water bodies and cloud occurrence within the region of interest are
used as supporting data in the developed algorithm. Bitwise operations, which operate directly
on binary values (Cavanagh 2013), and in particular the AND operator, are commonly
used to create filters and masks. In our study, the masks are derived from the operation
performed on the object qa, which represents the quality assessment band of the working product,
using the function bitwiseAnd. More information can be found in Gorelick et al. (2017). For
Landsat-8 OLI, the quality band is known as ‘pixel_qa’, in which bits 3 and 5 indicate
cloud/shadow. For the MODIS MOD09GA.006 sensor, cloud occurrence (bit 10) is extracted through
a bitwise operation on the ‘state_1km’ subproduct. The extraction of water bodies must be
done with auxiliary products, called ‘MOD44W.006 Terra Land Water Mask Derived from
MODIS and SRTM Yearly Global 250m’ and ‘NASA GLCF Landsat Global Inland Water’,
through the bands ‘water_mask’ and ‘water’, respectively. This is necessary to ensure
that images comprising the time series have the same number of pixels.
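
The bitwise test on a quality band can be reproduced outside GEE with NumPy, as in the sketch below. It uses `np.bitwise_and` in place of the GEE `bitwiseAnd` function, and the small `qa` array and the helper name are hypothetical.

```python
import numpy as np

def bit_is_set(qa, bit):
    """Return True where the given bit of the quality band is 1."""
    return np.bitwise_and(qa, 1 << bit) != 0

# Hypothetical 2 x 2 'pixel_qa' values: no flag, bit 3, bit 5, bits 3 and 5.
qa = np.array([[0, 1 << 3], [1 << 5, (1 << 3) | (1 << 5)]], dtype=np.uint16)

# Landsat-8 OLI: a pixel is masked when bit 3 or bit 5 is set.
cloud_or_shadow = bit_is_set(qa, 3) | bit_is_set(qa, 5)

# MODIS 'state_1km': the same helper applied to bit 10.
modis_cloud = bit_is_set(np.array([0, 1 << 10], dtype=np.uint16), 10)
```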

Cloud threshold: Pixels covered by clouds in remote sensing data carry little or no information
about the targets. The implementation of the proposed algorithm considered cloud occurrence
thresholds of 25%, 50% and 75% to determine the best value for the selection and use
of the images (i.e. good visibility). Preliminary analysis for the NDVI index, based on the
data in Section 4.1, showed a mean and variance equal to −0.42 and 0.11, respectively, when
considering images with up to 25% of cloud occurrence. These values were −0.48 and 0.11 for 50%,
and −0.57 and 0.11 for 75% occurrence. Since the difference between these values is not
statistically significant at 5%, it was decided to use the intermediate percentage of 50%. However,
it is worth emphasising that the sites affected by the occurrence of this phenomenon are discarded
from the composition of the database used in modelling the method.

Spanning period: The spanning period is an input parameter responsible for determining the size of
the time series used in the anomaly detection process. Note that the general trend that
characterises the occurrence of anomalies (i.e. algal bloom) is determined as a function of time series
characteristics. Thus, the spanning period may directly influence the anomaly detection
process. For a spanning period of d days, the time series covers the interval from d days before to
d days after the particular instant (i.e. the image $I^{(k)}$). In addition, the algorithm
automatically seeks to allocate this period so that any excess at one end of the series is
reallocated to the other, considering the selected value and the image $I^{(k)}$. Preliminary tests
using 90, 180 and 365 days as inputs indicated that 180 days is the most stable value in terms of
accuracy.
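
One possible reading of this allocation is sketched below: the window spans d days on each side of the instant of interest and, when one end exceeds the available imagery, the excess days are moved to the other end. The function name, the availability limits and this interpretation of the reallocation are assumptions of the example, not a description of the released code.

```python
from datetime import date, timedelta

def series_window(instant, span_days, first_available, last_available):
    """Return (start, end) covering span_days on each side of instant when possible."""
    start = instant - timedelta(days=span_days)
    end = instant + timedelta(days=span_days)
    if end > last_available:
        # Move the days that fall after the last available date to the start.
        start -= end - last_available
        end = last_available
    if start < first_available:
        # Move the days that fall before the first available date to the end.
        end = min(end + (first_available - start), last_available)
        start = first_available
    return start, end

# Hypothetical case: the instant is close to the end of the available imagery.
start, end = series_window(
    instant=date(2019, 9, 24),
    span_days=180,
    first_available=date(2013, 1, 1),
    last_available=date(2019, 12, 31),
)
```

In this case the window keeps its total length of 360 days but ends at the last available date.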

4. Experiments

In order to evaluate the proposed method, two case studies were conducted on the detection and
mapping of algal bloom occurrence in water bodies. The first was Lake Erie, located in Ohio,
USA. A portion of Lake Taihu, located in China, was the second study area.

The assessment and validation of the proposed method were conceived by choosing dates and
sensors according to the availability of ground truth data provided by the National Centers for
Environmental Information of the USA (Burtner et al. 2019, 2020) and the National Earth System
Science Data Center, National Science and Technology Infrastructure of China (Ma 2016, 2017).
Correlating their spatial and temporal occurrences, the definition of anomalous samples was based
on chlorophyll concentration > 5.6 μg/L, total phosphorus > 21.7 μg/L and Secchi depth
< 3.0 m for Lake Erie (Chapra and Dobson 1981; Kasich, Taylor, and Butler 2014) and chlorophyll
> 20 μg/L for Lake Taihu (Xu et al. 2015; Wang et al. 2019). Regular samples comprised data not
covered by the rules in either case.

The results obtained were compared to the methods proposed in Gandhi et al. (2015) and Jia,
Zhang, and Dong (2019), previously discussed in Section 2.4, which detect the occurrence of
algal blooms based on thresholds for NDVI (Equation (4)) and FAI (Equation (5)) spectral indices,
respectively. Overall accuracy, the kappa coefficient of agreement (Bishop et al. 1977) and F1-Score
(van Rijsbergen 1979) were computed from reference samples identified on distinct dates. Also,
hypothesis tests derived from the kappa measure were performed to evaluate the significance of
the results. Furthermore, the analysed methods were compared on the basis of
false/true positives/negatives occurrence ratios.
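
The three accuracy measures and the true/false positive/negative counts can be obtained with Scikit-Learn, as in the sketch below. The label vectors are hypothetical (1 for anomaly, 0 for regular) and serve only to show the computation.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
)

# Hypothetical reference labels and predictions for eight pixels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)      # overall accuracy
kappa = cohen_kappa_score(y_true, y_pred)      # kappa coefficient of agreement
f1 = f1_score(y_true, y_pred)                  # F1-Score of the anomaly class

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```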

Additionally, the spanning period of 180 days, previously discussed in Section 3.2, was considered
in the following experiments. The same inputs and analysis were tested through a modified version of
the proposal using an RF classifier, comparing its performance against the OC-SVM method.

The experiments were performed on a computer with an AMD Ryzen 9 3900X 12-core pro-
cessor, and 32 GB of RAM running the Ubuntu Linux version 20.04 operating system.

4.1. Study areas and data

As mentioned at the beginning of Section 4, two case studies were conducted. Regions referring to
parts of Lake Erie (Figure 2) and Lake Taihu (Figure 3) determined the respective study areas.

Lake Erie is one of the five Great Lakes in North America and has the eleventh largest surface
area in the world; it is located between Ohio, Michigan, Pennsylvania, and New York states and
covers an area of 25,657 km². Lake Erie has a total volume of 484 km³ and average and maximum
depths of approximately 19 m and 64 m, respectively (Bolsenga and Herdendorf 1993).
Historically, severe algal blooms have been recorded since the 1970s, predominantly during the
summer, peaking between August and September (Stumpf et al. 2012). Figure 2 also shows the
water quality monitoring stations belonging to the Great Lakes Environmental Research
Laboratory (GLERL), a multidisciplinary environmental research laboratory
linked to the National Oceanic and Atmospheric Administration (NOAA).

The third-largest lake in China, Lake Taihu is a source of water for 40 million people. Recently,
intense algal blooms affected more than 4 million residents and industries in the region (Duan et al.
2015). According to Gao et al. (2020), it has an area of 2,338 km² and an average depth of 1.9 m.
According to the authors, the rainy period begins in April and extends until July, whereas the
lake faces its driest period between February and March.

According to the discussions held in Section 3, the algal bloom detection process uses images
obtained from the Landsat-8 OLI and MODIS MOD09GA.006 sensors. This detection process,
however, only uses one sensor at a time. Such images have 30 m and 500 m spatial resolutions,
respectively, and spectral bands ranging from blue to short-wave infra-red wavelengths.

After analysis of the good visibility images (i.e. less than 50% of cloud presence) obtained by
matching in situ data with the passage of the Landsat and MODIS satellites, and motivated by the
temporal variation in algal bloom occurrence over Lake Erie, the following images were selected: June
3rd, 2019 – moderate; July 1st, 2019 – very high; August 19th, 2019 – very high; September 24th,
2019 – high (MODIS MOD09GA.006); and September 21st, 2015 – moderate (Landsat-8 OLI).
Figure 4 illustrates the images by sensor of origin, highlighting the dates and validation samples
considered for the occurrence or not of anomalies (that is, algae proliferation). The yellow arrows
point to the exact location where the samples were selected, based on the NOAA/GLERL
monitoring station locations. Table 1 summarises the number of pixels and polygons selected, as well
as the total of training instants.

For the study area on Lake Taihu, two images matching the ground truth dataset and the passage
of the Landsat satellite were acquired: September 13th, 2016 – moderate; and December 2nd, 2016 –
moderate (Figure 5(c,d)). The yellow arrows point to the exact location where the samples were
selected, based on the mapping of the chlorophyll concentration for September 13th and
December 2nd (Figure 5(a,b), respectively) provided by the National Earth System Science Data
Center, National Science and Technology Infrastructure of China (Ma 2016, 2017). The thresholds
used to determine whether blooms occurred are discussed in Section 4. Reference samples related to
the occurrence or not of anomalies, as well as the total of training instants, are expressed in Table 2.

Figure 2. Spatial location of the study area 1 – Lake Erie, Ohio, USA.

Figure 3. Spatial location of the study area 2 – Lake Taihu, China.



Figure 4. Images in false-color composition (NIR, red and green bands) from the Lake Erie study area for each considered instant.
Regular and anomaly samples are identified by cyan and magenta polygons, respectively.

Table 1. Summary of regular and anomaly reference samples (pixels/polygons) related to Lake Erie for the considered instants.

                     MODIS MOD09GA.006 (2019)                       Landsat-8 OLI (2015)
Pixels/Polygons      Jun. 3rd   Jul. 1st   Aug. 19th   Sep. 24th    Sep. 21st
Regular              29/4       20/2       31/2        27/2         266/1
Anomaly              27/3       37/6       35/6        34/6         281/5
Training instants    105        104        99          104          24



5. Results and discussion

Taking the case study of Lake Erie as a starting point, anomaly mapping was obtained for each con-
sidered date. Figure 6 shows the accuracy values associated with each tested method, expressed in
terms of overall accuracy, kappa coefficient and F1-Score, and computed from the reference
samples.

Regarding the first image from June 3rd, 2019, the proposal outperforms its competitors.
Concerning the second date (July 1st, 2019), the proposal again demonstrates higher accuracy,
delivering the first and second-best results with the application of the OC-SVM and RF algorithms,
respectively. Similar to the first, the image from August 19th, 2019 presents the best result of the
proposal. However, NDVI shows similar performance compared to the OC-SVM variation,
producing the best result among the analysed indices. The last image, acquired on September 24th,
2019 with the MODIS MOD09GA.006 sensor, also demonstrates the superiority of the proposal’s
classification, with OC-SVM and RF showing the first and second-best results, respectively. On this
date, the performance of competitors was still low, with FAI exhibiting the worst metrics. For the
Landsat-8 OLI sensor, the single available image for this study area, acquired on September 21st,
2015, the best performance is obtained with the ABD-OCSVM variant.

Figure 5. Images in false-color composition (NIR, red and green bands) from the Lake Taihu study area for each considered
instant. Regular and anomaly samples are identified by cyan and magenta polygons, respectively. At the top, Chl-a concentration
maps from the National Earth System Science Data Center, National Science and Technology Infrastructure of China are shown.

Table 2. Summary of regular and anomaly reference samples (pixels/polygons) related to Lake Taihu for the considered instants.

Pixels/Polygons      September 13th, 2016   December 2nd, 2016
Regular              915/3                  1452/1
Anomaly              922/2                  1424/2
Training instants    10                     10

Given the above results, the proposal using the OC-SVM algorithm showed the most regularity
among its competitors. In order to accept or reject the superiority of its performance, hypothesis
tests comparing kappa values were performed (Congalton and Green 2009). It was observed that
the proposal is statistically superior at 1% significance (i.e. 99% confidence), except for the image
from August 19th, where NDVI, ABD-OCSVM and ABD-RF are statistically equivalent. This
behaviour is also true for the image from September 24th with ABD-OCSVM and ABD-RF. Figure 7
shows the True/False-Positive/Negative ratio graph for the first study area, where the acronyms FN,
FP, TN and TP relate to false-negative, false-positive, true-negative and true-positive, respectively.
The term ‘positive’ refers to the occurrence of anomalies and ‘negative’ refers to regular
pixels/regions. First, it should be noted that among the competitors, the ABD-OCSVM proposal delivers
the lowest FN rate for all compared dates. The opposite is observed with the NDVI and FAI indices,
which present high rates of FN throughout the evaluated period. Although the proposal using the
RF algorithm presents TP results similar to those of the ABD-OCSVM variation for most of
the evaluated dates, there is a decrease in the detection of TN and an increase in FP on June 3rd,
2019 for the model in question. These results indicate that competitors overestimate the occurrence
of regular pixels in the tested samples, presenting high values of FN and ignoring the occurrence of
algae in all images. As shown in the previous analysis, ABD-OCSVM has the most regular detection.

Figure 6. Accuracy of the analysed methods related to the Lake Erie study area at distinct instants. Error bars denote the
respective kappa standard deviation value.

Figure 7. True(T)/False(F)-Positive(P)/Negative(N) ratio graph for the Lake Erie detection results.

Figure 8. Mapping results obtained by the analysed methods for the Lake Erie study area related to the MODIS MOD09GA.006
(a)–(t) and Landsat-8 OLI (u)–(y) sensors. Regular, anomaly, cloud and land/absence data regions are denoted in cyan, magenta,
grey and white, respectively.

Figure 9. Images considering the performance obtained by the tested methods for the Lake Taihu study area: (a) kappa
coefficient/Accuracy/F1-Score of the analysed methods at distinct instants. Error bars denote the respective kappa standard
deviation value. (b) True(T)/False(F)-Positive(P)/Negative(N) ratio graph.

Figure 8(a,f,k,p,u) represents the areas of interest and polygons of in situ samples for the evaluated
dates (i.e. yellow arrows). In turn, Figure 8(b–e,g–j,l–o,q–t,v–y) refers to the detection results
delivered by each of the tested methods.

Regarding Figure 8(b,c), the NDVI and FAI results for the first MODIS date, respectively, a
comparison with in situ samples (Figure 8(a)) shows an overestimation of FN by both approaches, which
is corroborated by Figure 7. The opposite behaviour is observed for the ABD-RF model, as it
overestimates FP. Regarding the image from July 1st, 2019, ABD-OCSVM shows the best
performance, followed by ABD-RF. For the August 19th, 2019 image, the FAI index delivered
high FP compared with NDVI, ABD-OCSVM and ABD-RF, the last three being statistically
equivalent. Figure 8(p–t) (last MODIS image) shows high FN delivered by spectral indices and
corroborates the detection equivalence and superiority of the ABD-OCSVM and ABD-RF variations.

Finally, Figure 8(u–y) represents the only acquired image for Lake Erie using the Landsat-8 OLI
sensor. It demonstrates the superiority of the proposal, with its variations showing lower FN
compared to the tested indices.

Regarding the second study area, Figure 9(a) makes evident the superior performance of the proposal
variants in relation to the NDVI and FAI thresholds for both dates (September 13th and
December 2nd, 2016). This analysis is confirmed by Figure 9(b), where the image acquired on
September 13th, 2016 shows low FN and FP for the ABD-RF model, in contrast with the
high FN proportion detected by spectral indices. In the same direction, the ABD-OCSVM model
showed low FN values for December 2nd, 2016. In this image, however, NDVI and FAI obtained
results equivalent to the proposal’s RF algorithm.

Visual analysis of the second area’s results is possible through Figure 10(a–j). In this sense,
Figure 10(a,f) represents the areas of interest selected according to ground truth data (i.e. yellow
arrows). In turn, Figure 10(b,c) confirms the non-detection of anomalous
regions found in this image by NDVI and FAI indices, validating the results from the
True/False-Positive/Negative graph. However, Figure 10(i) stands out, as it is possible to observe the
detection of a vast anomalous region by the ABD-OCSVM variant, corroborated by samples
from Figure 10(f).

Lastly, hypothesis tests derived from the kappa coefficient confirm the superiority of the
ABD-OCSVM and ABD-RF models, being statistically superior to the spectral indices with a significance
of 1%. However, the ABD-RF variant is superior to the ABD-OCSVM model for the first date
(September 13th, 2016) and the opposite is true for the second date, where ABD-OCSVM presents a
better result than ABD-RF.

Figure 10. Mapping results obtained by the analysed methods for the Lake Taihu study area. Regular, anomaly, cloud and
land/absence data regions are denoted in cyan, magenta, grey and white, respectively.

Figure 11. Run-time of analysed methods for both study areas.

Figure 12. Annual mapping of cyanobacteria bloom occurrence in Lake Taihu from 2014 to 2018 using multi-temporal series
of images extracted from the GEE API and the MODIS MOD09GA.006 sensor: (a)–(e) algal bloom coverage with water-leaving
reflectance correction (Shenglei et al. 2016; Wang et al. 2018) and FAI > −0.004 (threshold proposed by Jia, Zhang, and
Dong 2019); (f)–(j) algal bloom coverage using the ABD-OCSVM algorithm. Brackets show the total instants mapped in each year.

Although the proposal has considerably higher computational costs (as shown in Figure 11), it
has the advantage of considering the temporal behaviour of anomalies and of being an unsupervised
approach, where no manual selection of training data is needed. The proposal creates datasets of
millions of pixels, mainly in the parameter optimisation and decision rule modelling steps. These
sets are filtered and obtained automatically, reaching 20 or even 100 instants (MODIS
MOD09GA.006 sensor), depending on the selected time span. To minimise the use of network
resources and consequently reduce the computational cost, a caching system was adopted, which
stores the images locally according to the area of interest. Several images are processed during
the classifier construction, so the acquisition time in this step is reduced whenever the images
are already on disk.
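The caching idea described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the hashing scheme, the file layout and the example coordinates are hypothetical and are not taken from the authors' implementation, and a stand-in function replaces the actual network download.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_path(cache_dir, geometry, date, product):
    """Build a deterministic file path from the area of interest, date and product."""
    key = json.dumps({"geometry": geometry, "date": date, "product": product}, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.bin"

def fetch_cached(cache_dir, geometry, date, product, download):
    """Return image bytes from disk when present; otherwise call `download` and store the result."""
    path = cache_path(cache_dir, geometry, date, product)
    if path.exists():
        return path.read_bytes()
    data = download(geometry, date, product)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data

# Stand-in downloader (hypothetical) that records how often the "network" is hit.
calls = []

def fake_download(geometry, date, product):
    calls.append(date)
    return b"pixels"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical bounding polygon; not the area of interest used in the study.
    aoi = [[120.0, 31.0], [120.6, 31.0], [120.6, 31.5], [120.0, 31.5]]
    first = fetch_cached(tmp, aoi, "2016-09-13", "MOD09GA", fake_download)
    second = fetch_cached(tmp, aoi, "2016-09-13", "MOD09GA", fake_download)
```

In this sketch the second request for the same area and date is served from disk, so the downloader runs only once.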

5.1. Real-world application: 5-year mapping of the spatial occurrence of cyanobacterial
blooms in Lake Taihu

In their study, Jia, Zhang, and Dong (2019) proposed a workflow for annually mapping the occurrence
of cyanobacteria in Lake Taihu based on the FAI > −0.004 threshold (Section 2.4). With results
obtained using the algorithm implemented via the GEE platform (Jia 2019), the authors discuss the
spatio-temporal patterns of the phenomenon and its annual characteristics, as well as the effects
of environmental factors on the proliferation of these bacteria between the years 2000 and 2018.
They also highlight the application of the water-leaving reflectance correction, recommended for
Chinese inland waters greater than 25 km² (Shenglei et al. 2016; Wang et al. 2018).

To demonstrate the application of the ABD-OCSVM model in a real case study related to algae
problems, the algorithm proposed by Jia, Zhang, and Dong (2019) was replicated, resulting in an
annual mapping of algal blooms in Lake Taihu between 2014 and 2018 (Figure 12(a–e)). The results
of the proposal for the same area and period are shown in Figure 12(f–j).

The ABD-OCSVM mapping shows a higher concentration of blooms than the algorithm proposed by Jia,
Zhang, and Dong (2019), and it also detects algal blooms in the central region of the lake, which
the latter rarely maps. However, the two algorithms are spatially compatible, especially in the
most extreme regions, where the occurrence of the phenomenon is close to 100%.
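One plausible way to obtain a per-pixel occurrence percentage such as the one discussed above is to average the binary detections over the valid instants of a year. The sketch below assumes this definition, which the text does not state explicitly; the function name and the toy masks are hypothetical.

```python
import numpy as np

def occurrence_frequency(masks):
    """Fraction of valid instants flagged as bloom for each pixel.

    masks: array of shape (instants, rows, cols) with 1 = bloom, 0 = regular,
    NaN = cloud or missing data (assumed encoding).
    """
    masks = np.asarray(masks, dtype=float)
    valid = ~np.isnan(masks)
    n_valid = valid.sum(axis=0)
    n_bloom = np.where(valid, masks, 0.0).sum(axis=0)
    # Pixels that were never observed stay NaN instead of dividing by zero.
    out = np.full(n_valid.shape, np.nan)
    return np.divide(n_bloom, n_valid, out=out, where=n_valid > 0)

# Two toy instants over a 2 x 2 scene.
masks = [
    [[1, 0], [np.nan, np.nan]],
    [[1, 1], [0, np.nan]],
]
freq = occurrence_frequency(masks)
```

Ignoring cloudy instants in the denominator keeps frequently clouded pixels from being biased toward low occurrence.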

6. Conclusions

In summary, given the adverse effects of algal blooms, whether related to health (i.e. poisoning,
water quality) or to economic losses, it is essential to develop studies aimed at monitoring them.
Given this motivation, an algorithm based on anomaly detection approaches to detect algal bloom
occurrence in inland water, supported by spectral indices and the OC-SVM model, was proposed. To
validate it, two case studies were conducted with in situ measurements. Comparisons with similar
proposals based on thresholding of the NDVI and FAI indices were also included. The proposal was
also modified to be used with an RF classifier, and its results were compared against the OC-SVM
approach. Finally, to demonstrate the proposed algorithm in a real-world application and compare
it with other studies, a 5-year mapping of algal bloom occurrence in Lake Taihu was presented,
showing its spatial behaviour over time.

The results showed that the proposed method using the OC-SVM algorithm has higher accuracy
levels than the analysed competitors, showing averages for kappa coefficient, overall accuracy and
F1-Score of 68%, 87% and 86%, respectively; against 47%, 78% and 75% for ABD-RF; 28%, 67% and
57% for NDVI; and −4%, 53% and 46% for FAI.
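For reference, the three reported measures can be computed from a binary confusion matrix as sketched below. The labels are toy values chosen for illustration, not data from the study, and the helper name is hypothetical.

```python
def binary_scores(y_true, y_pred):
    """Overall accuracy, Cohen's kappa and F1-Score for labels 1 (anomaly) and 0 (regular)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, kappa, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, kappa, f1 = binary_scores(y_true, y_pred)
```

Because kappa discounts chance agreement, it can be negative when a method agrees with the reference less often than chance would, as with the −4% reported for FAI.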

Furthermore, it has been observed that the tested methods overestimate the non-occurrence of
algal blooms (i.e. regular regions). On the other hand, it should be mentioned that the proposal is
highly dependent on network connection and on images with good visibility (i.e. less than 50% of
clouds) in the training set, and requires a higher computational effort. Still, given the lack of
ground truth data, it was not possible to validate the behaviour of the proposal in regions with
the presence of snow or ice. As mentioned before, sudden changes in the target’s spectral response
can be detected as anomalies. The opposite is true when the phenomenon’s occurrence is too subtle,
being detected as regular. Additionally, a positive or negative performance variation is expected
at different instants, both for the proposal and for the evaluated indices.

Future perspectives for this study include (i) the development of alternative strategies to reduce
computational cost; (ii) the evaluation of the behaviour of other spectral indices in the modelling
process in order to extend the application of the proposed algorithm; and (iii) the implementation
of moving-average-based strategies in the modelling process.

Acknowledgments

The authors would like to thank the National Centers for Environmental Information of USA
(https://www.ncei.noaa.gov) and the National Earth System Science Data Center, National Science and
Technology Infrastructure of China (http://www.geodata.cn) for providing data support for Lake Erie
and Lake Taihu, respectively. Finally, they would like to thank the anonymous reviewers for their
suggestions and comments that highly improved this work.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Funding

The authors thank Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) (grant
2018/01033-3) for their financial support of this research.

ORCID

Pedro Henrique Moraes Ananias http://orcid.org/0000-0002-0924-1236
Rogério Galante Negri http://orcid.org/0000-0002-4808-2362

References

Baba, N. 1981. “Convergence of a Random Optimization Method for Constrained Optimization Problems.” Journal
of Optimization Theory and Applications 33 (4): 451–461.

Belgiu, M., and L. Drăguţ. 2016. “Random Forest in Remote Sensing: A Review of Applications and Future
Directions.” ISPRS Journal of Photogrammetry and Remote Sensing 114: 24–31.

Bergstra, J., and Y. Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” The Journal of Machine
Learning Research 13 (1): 281–305.

Binding, C., T. Greenberg, G. McCullough, S. Watson, and E. Page. 2018. “An Analysis of Satellite-Derived
Chlorophyll and Algal Bloom Indices on Lake Winnipeg.” Journal of Great Lakes Research 44 (3): 436–446.

Bishop, Y. M., S. E. Fienberg, P. W. Holland, R. J. Light, and F. Mosteller. 1977. “Book Review: Discrete Multivariate
Analysis: Theory and Practice.” Applied Psychological Measurement 1 (2): 297–306.

Bolsenga, S. J., and C. E. Herdendorf. 1993. Lake Erie and Lake St. Clair Handbook. Detroit, MI: Wayne State
University Press.

Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.

Bruzzone, L., and C. Persello. 2009. “A Novel Context-Sensitive Semisupervised SVM Classifier Robust to Mislabeled
Training Samples.” IEEE Transactions on Geoscience and Remote Sensing 47 (7): 2142–2154.

Burtner, A., C. Kitchens, D. Fyffe, C. Godwin, T. Johengen, D. Stuart, R. Errera, D. Palladino, D. Fanslow, and D.
Gossiaux. 2020. “Physical, Chemical, and Biological Water Quality Data Collected From a Small Boat in
Western Lake Erie, Great Lakes From 2019-04-30 to 2019-10-07 (NCEI Accession 0209116)” (dataset). NOAA
National Centers for Environmental Information. Accessed June 13, 2020.
https://accession.nodc.noaa.gov/0209116.

Burtner, A., D. Palladino, C. Kitchens, D. Fyffe, T. Johengen, D. Stuart, D. Fanslow, and D. Gossiaux. 2019. “Physical,
Chemical, and Biological Water Quality Data Collected From a Small Boat in Western Lake Erie, Great Lakes
From 2012-05-15 to 2018-10-09 (NCEI Accession 0187718)” (dataset). NOAA National Centers for
Environmental Information. Accessed June 13, 2020. https://accession.nodc.noaa.gov/0187718.

Butler, H., M. Daly, A. Doyle, S. Gillies, S. Hagen, and T. Schaub. 2016. The GeoJSON Format. Internet Engineering
Task Force (IETF).

Cavanagh, J. 2013. X86 Assembly Language and C Fundamentals. Boca Raton, FL: CRC Press.

Chapra, S. C., and H. F. Dobson. 1981. “Quantification of the Lake Trophic Typologies of Naumann (Surface Quality)
and Thienemann (Oxygen) with Special Reference to the Great Lakes.” Journal of Great Lakes Research 7 (2): 182–193.

Chawla, I., L. Karthikeyan, and A. K. Mishra. 2020. “A Review of Remote Sensing Applications for Water Security:
Quantity, Quality, and Extremes.” Journal of Hydrology 585: Article ID: 124826.

Congalton, R. G., and K. Green. 2009. Assessing the Accuracy of Remotely Sensed Data. Boca Raton, FL: CRC Press.

Duan, H., S. A. Loiselle, L. Zhu, L. Feng, Y. Zhang, and R. Ma. 2015. “Distribution and Incidence of Algal Blooms in
Lake Taihu.” Aquatic Sciences 77 (1): 9–16.

Gandhi, G. M., S. Parthiban, N. Thummalu, and A. Christy. 2015. “NDVI: Vegetation Change Detection Using
Remote Sensing and GIS-a Case Study of Vellore District.” Procedia Computer Science 57: 1199–1210.

Gao, B.-C. 1996. “NDWI–A Normalized Difference Water Index for Remote Sensing of Vegetation Liquid Water
From Space.” Remote Sensing of Environment 58 (3): 257–266.

Gao, Y., G. Zhu, H. W. Paerl, B. Qin, J. Yu, and Y. Song. 2020. “A Study of Bioavailable Phosphorus in the Inflowing
Rivers of Lake Taihu, China.” Aquatic Sciences 82 (1): 1.

GEE-API. 2019. “Google Earth Engine API.” Accessed 29 October 2019. https://github.com/google/earthengine-api.

Ghatkar, J. G., R. K. Singh, and P. Shanmugam. 2019. “Classification of Algal Bloom Species From Remote Sensing
Data Using An Extreme Gradient Boosted Decision Tree Model.” International Journal of Remote Sensing 40 (24):
9412–9438.

Glazer, A., M. Lindenbaum, and S. Markovitch. 2013. “q-ocsvm: A q-Quantile Estimator for High-Dimensional
Distributions.” In Advances in Neural Information Processing Systems 26 (NIPS 2013). Stateline, NV.

Gorelick, N., M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore. 2017. “Google Earth Engine:
Planetary-Scale Geospatial Analysis for Everyone.” Remote Sensing of Environment 202: 18–27.

Grimmond, S. U. 2007. “Urbanization and Global Environmental Change: Local Effects of Urban Warming.”
Geographical Journal 173 (1): 83–88.

Gu, Y., and K. Feng. 2013. “Optimized Laplacian SVM with Distance Metric Learning for Hyperspectral
Image Classification.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 6 (3):
1109–1117.

He, M., S. Pathak, U. Muaz, J. Zhou, S. Saini, S. Malinchik, and S. Sobolevsky. 2019. “Pattern and Anomaly Detection
in Urban Temporal Networks.” Preprint. arXiv:1912.01960.

Houborg, R., M. F. McCabe, Y. Angel, and E. M. Middleton. 2016. “Detection of Chlorophyll and Leaf Area Index
Dynamics From Sub-Weekly Hyperspectral Imagery.” In Remote Sensing for Agriculture, Ecosystems, and
Hydrology XVIII, 999812. Vol. 9998. International Society for Optics and Photonics.

Hu, C. 2009. “A Novel Ocean Color Index to Detect Floating Algae in the Global Oceans.” Remote Sensing of
Environment 113 (10): 2118–2129.

Huang, X., Q. Lu, and L. Zhang. 2014. “A Multi-Index Learning Approach for Classification of High-Resolution
Remotely Sensed Images Over Urban Areas.” ISPRS Journal of Photogrammetry and Remote Sensing 90: 36–48.

Jia, T. 2019. “Earth Engine Code – Long-Term Spatial and Temporal Monitoring of Cyanobacteria Blooms Using
Modis on Google Earth Engine: A Case Study in Taihu Lake.” Accessed 5 February 2020.
https://code.earthengine.google.com/e6c4d627ec7f4f0dbc1a4f77fdeb3bb3.

Jia, T., X. Zhang, and R. Dong. 2019. “Long-Term Spatial and Temporal Monitoring of Cyanobacteria Blooms Using
Modis on Google Earth Engine: A Case Study in Taihu Lake.” Remote Sensing 11 (19): 2269.

Kasich, J. R., M. Taylor, and C. W. Butler. 2014. Public Water System Harmful Algal Bloom Response Strategy. Ohio
Environmental Protection Agency. Columbus, OH.

Khanna, S., A. Palacios-Orueta, M. L. Whiting, S. L. Ustin, D. Riaño, and J. Litago. 2007. “Development of Angle
Indexes for Soil Moisture Estimation, Dry Matter Detection and Land-Cover Discrimination.” Remote Sensing
of Environment 109 (2): 154–165.

Klemas, V. 2011. “Remote Sensing of Algal Blooms: An Overview with Case Studies.” Journal of Coastal Research 28
(1A): 34–43.

Li, Y., Y. Wang, C. Bi, and X. Jiang. 2018. “Revisiting Transductive Support Vector Machines with Margin
Distribution Embedding.” Knowledge-Based Systems 152: 200–214.

Ma, R. 2016. “Lake Taihu Chlorophyll Inversion Product Data Set (2016)” (dataset). National Earth System Science
Data Center, National Science and Technology Infrastructure of China. Accessed June 13, 2020.
http://www.geodata.cn/data/datadetails.html?dataguid=122425543649945.

Ma, R. 2017. “Lake Taihu Chlorophyll Inversion Product Data Set (2017)” (dataset). National Earth System Science
Data Center, National Science and Technology Infrastructure of China. Accessed June 13, 2020.
http://www.geodata.cn/data/datadetails.html?dataguid=107063659544667.



Marzuoli, A., and F. Liu. 2019. “Monitoring of Natural Disasters Through Anomaly Detection on Mobile Phone
Data.” In 2019 IEEE International Conference on Big Data (Big Data), 4089–4098. IEEE.

McKinney, W. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in
Science Conference, 51–56. Vol. 445. Austin, TX.

Mishra, S., and D. R. Mishra. 2012. “Normalized Difference Chlorophyll Index: A Novel Model for Remote
Estimation of Chlorophyll-a Concentration in Turbid Productive Waters.” Remote Sensing of Environment 117:
394–406.

Mountrakis, G., J. Im, and C. Ogole. 2011. “Support Vector Machines in Remote Sensing: A Review.” ISPRS Journal of
Photogrammetry and Remote Sensing 66 (3): 247–259.

Muñoz-Marí, J., F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls. 2010. “Semisupervised One-Class
Support Vector Machines for Classification of Remote Sensing Data.” IEEE Transactions on Geoscience and
Remote Sensing 48 (8): 3188–3197.

Negri, R. G., L. V. Dutra, and S. J. S. Sant’Anna. 2014. “An Innovative Support Vector Machine Based Method for
Contextual Image Classification.” ISPRS Journal of Photogrammetry and Remote Sensing 87: 241–248.

Negri, R. G., A. C. Frery, W. Casaca, S. Azevedo, M. Araújo, E. Silva, and E. Alcântara. 2020. “Spectral-Spatial Aware
Unsupervised Change Detection with Stochastic Distances and Support Vector Machines.” IEEE Transactions on
Geoscience and Remote Sensing 59 (4): 2863–2876.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
and V. Dubourg, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12
(Oct): 2825–2830.

Rastrigin, L. 1963. “The Convergence of the Random Search Method in the Extremal Control of a Many Parameter
System.” Automation and Remote Control 24: 1337–1342.

Rouse Jr., J. W., R. H. Haas, J. Schell, and D. Deering. 1973. “Monitoring the Vernal Advancement and
Retrogradation (Green Wave Effect) of Natural Vegetation.” Third ERTS Symp. 1: 309–317.

Schölkopf, B., J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. 2001. “Estimating the Support of a
High-Dimensional Distribution.” Neural Computation 13 (7): 1443–1471.

Schölkopf, B., R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. 2000. “Support Vector Method for
Novelty Detection.” In Advances in Neural Information Processing Systems: 582–588. Denver, CO.

Shawe-Taylor, J., and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. New York, NY, USA: Cambridge
University Press.

Shenglei, W., L. Junsheng, Z. Bing, S. Qian, Z. Fangfang, and L. Zhaoyi. 2016. “A Simple Correction Method for the
Modis Surface Reflectance Product Over Typical Inland Waters in China.” International Journal of Remote Sensing
37 (24): 6076–6096.

Shi, K., Y. Zhang, B. Qin, and B. Zhou. 2019. “Remote Sensing of Cyanobacterial Blooms in Inland Waters: Present
Knowledge and Future Challenges.” Science Bulletin 64 (20): 1540–1556.

Song, W., J. Dolan, D. Cline, and G. Xiong. 2015. “Learning-based Algal Bloom Event Recognition for Oceanographic
Decision Support System Using Remote Sensing Data.” Remote Sensing 7 (10): 13564–13585.

Stumpf, R. P., T. T. Wynne, D. B. Baker, and G. L. Fahnenstiel. 2012. “Interannual Variability of Cyanobacterial
Blooms in Lake Erie.” PloS One 7 (8): e42444.

Sun, D., Y. Li, and Q. Wang. 2009. “A Unified Model for Remotely Estimating Chlorophyll-a in Lake Taihu, China,
Based on SVM and in Situ Hyperspectral Data.” IEEE Transactions on Geoscience and Remote Sensing 47 (8):
2957–2965.

United Nations, I.S.f.D.R. 2015. Global Assessment Report on Disaster Risk Reduction 2015: Making Development
Sustainable: The Future of Disaster Risk Management. UN.

USGS. 2017. “Product Guide: Landsat 8 Surface Reflectance Code (LASRC) Product” (techreport). United States
Geological Survey. Accessed 1 August, 2019.
https://www.usgs.gov/media/files/land-surface-reflectance-code-lasrc-product-guide.

USGS. 2020. “MODIS/Terra Surface Reflectance Daily L2G Global 1 km and 500 m.” Accessed 5 March 2020.
https://lpdaac.usgs.gov/products/mod09gav006/.

Van Der Walt, S., S. C. Colbert, and G. Varoquaux. 2011. “The Numpy Array: a Structure for Efficient Numerical
Computation.” Computing in Science & Engineering 13 (2): 22.

van Rijsbergen, C. J. 1979. Information Retrieval. 2nd ed. London: Butterworths.
http://www.dcs.gla.ac.uk/Keith/Preface.html.

van Rossum, G., and F. L. Drake. 2011. The Python Language Reference Manual. Godalming, UK: Network Theory
Ltd.

Verstraete, M. M., and B. Pinty. 1996. “Designing Optimal Spectral Indexes for Remote Sensing Applications.” IEEE
Transactions on Geoscience and Remote Sensing 34 (5): 1254–1265.

Wang, S., J. Li, B. Zhang, E. Spyrakos, A. N. Tyler, Q. Shen, F. Zhang, T. Kuster, M. K. Lehmann, and Y. Wu, et al.
2018. “Trophic State Assessment of Global Inland Waters Using a Modis-Derived Forel-Ule Index.” Remote
Sensing of Environment 217: 444–460.



Wang, M., M. Strokal, P. Burek, C. Kroeze, L. Ma, and A. B. Janssen. 2019. “Excess Nutrient Loads to Lake Taihu:
Opportunities for Nutrient Reduction.” Science of the Total Environment 664: 865–873.

Watanabe, F., E. Alcantara, T. Rodrigues, L. Rotta, N. Bernardo, and N. Imai. 2018. “Remote Sensing of the
Chlorophyll-a Based on OLI/Landsat-8 and MSI/Sentinel-2A (Barra Bonita Reservoir, Brazil).” Anais da
Academia Brasileira de Ciências 90 (2): 1987–2000.

Wetzel, R. G. 2001. Limnology: Lake and River Ecosystems. 3rd ed. San Diego, CA: Academic Press.

Xu, H., H. Paerl, B. Qin, G. Zhu, N. Hall, and Y. Wu. 2015. “Determining Critical Nutrient Thresholds Needed to
Control Harmful Cyanobacterial Blooms in Eutrophic Lake Taihu, China.” Environmental Science &
Technology 49 (2): 1051–1059.

Xue, J., and B. Su. 2017. “Significant Remote Sensing Vegetation Indices: A Review of Developments and
Applications.” Journal of Sensors 2017: Article ID: 1353691.

Yang, Y., and X. Liu. 1999. “A Re-Examination of Text Categorization Methods.” In Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, 42–49. Berkeley, CA.

Yi, H.-S., S. Park, K.-G. An, and K.-C. Kwak. 2018. “Algal Bloom Prediction Using Extreme Learning Machine
Models at Artificial Weirs in the Nakdong River, Korea.” International Journal of Environmental Research and
Public Health 15 (10): 2078.

Zhang, Y., R. Ma, H. Duan, S. A. Loiselle, J. Xu, and M. Ma. 2014. “A Novel Algorithm to Estimate Algal Bloom
Coverage to Subpixel Resolution in Lake Taihu.” IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing 7 (7): 3060–3068.

Zhao, D. 2003. “Application of NDVI to Detecting Algal Bloom in the Bohai Sea of China from AVHRR.” In Ocean
Remote Sensing and Applications, 241–246. Vol. 4892. International Society for Optics and Photonics.


2.2 Article - “Forecasting algal bloom events in inland waters using remote sensing data
and machine learning methods”

The following article was submitted for consideration by the editorial board of the journal
ENVIRONMENTAL MODELLING & SOFTWARE on July 23, 2021.


Forecasting algal bloom events in inland waters using remote
sensing data and machine learning methods

A R T I C L E I N F O

Keywords:
Remote sensing
Algal bloom
Forecasting
Google Earth Engine
Classification

A B S T R A C T

The monitoring of water quality and algal blooms has received considerable attention from the
scientific community, as these can pose risks to the health of living beings. The use of Remote
Sensing can reduce the cost of in situ analysis while also serving as a source of data.
Additionally, Machine Learning techniques and concepts favor the development of
solutions for environmental analysis and monitoring. In this context, supported by classification
models and multitemporal data series, an automated method able to predict algal blooms is
proposed. The proposal uses Modis images and meteorological/climatic products acquired from
the Google Earth Engine platform. Three case studies involving the prediction of the phenomenon in
lakes Erie (USA), Chilika (India) and Taihu (China) demonstrate a median global accuracy of
95%. The computational cost is the main drawback of the proposal.

1. Introduction
Water quality is of vital importance for the Earth, mainly due to the recent increase in population and climate
change [11]. The resulting growth in demand for both domestic and agricultural use increases the pressure on this
resource and, consequently, on the environment [40]. Another consequence is the proliferation of cyanobacteria,
responsible for severe damage to ecological structures and aquatic ecosystems [68]. This phenomenon, also known as
Harmful Algal Blooms (HABs), implies severe risks to the behavior and health of living beings [53].

Several parameters are described by [11] as indicators of water quality, including suspended sediment, turbidity,
total phosphorus, dissolved organic content, temperature, and Secchi disk depth. However, chlorophyll-a (Chl-a) stands
out as the most used parameter [23]. Its extent is directly related to sudden changes in components such as surface
temperature, wind speed, precipitation, water column stratification, and water flow direction [68, 53, 50].

[40] point to the need for sustainable water resources management. Terrestrial monitoring stations are described as
precarious, presenting outdated or missing data. As highlighted by [11], they are not uniformly distributed and
operate intermittently, primarily due to lack of investment and government interest.

As a way of getting around this limitation, [11] address the use of remote sensing images in water quality
management. An example can be seen in [45], where the authors present an algorithm based on the Empirical
Orthogonal Functions model to estimate the concentration of Chl-a in Lake Taihu, China. [2] make use of images from
the Moderate Resolution Imaging Spectroradiometer (Modis) sensor to assess the ability of the Medium Resolution
Continental Shelf model to predict high algal bloom events in a eutrophic coastal sea.

With the increase in the quantity and quality of data, technologies, and information from imaging sensors and
Artificial Intelligence [72, 36], several studies have focused on increasing the accuracy of models for Chl-a
concentration estimation and algal bloom prediction. One example is the use of Support Vector Machine (SVM) and
Landsat-8/OLI images to estimate Chl-a concentration in lakes [73]. Also, [3] propose a study to detect algal bloom
occurrences in inland waters with the One-class Support Vector Machine. Using Sentinel-2/MSI images and the Random
Forest (RF) model, the study of [54] addresses the prediction of Chl-a in two small water bodies. Additionally,
approaches using deep learning models [12, 31], which often accurately approximate nonlinear environmental
parameters [72], have become popular. In their study, [5] use a hybrid model of Convolutional Neural Network and Long
Short-term Memory (LSTM) capable of predicting the concentration of Chl-a in a lake in Greece. Applying LSTM
networks, [12] demonstrate a 1- and 4-day Chl-a concentration prediction model in the Geum River (South Korea). In a
study published by [63], a multilayer perceptron is applied to generate a cyanobacterial prediction model.

However, it should be stressed that physical models, such as the one presented by [19], are strongly dependent on
environmental variables [72]. Similarly, [72] also highlight that most of the models based on Remote Sensing data
and Machine Learning concepts are trained with ground-truth/reference samples collected from the study area. In
both cases, obtaining the needed information may be critical. Thus, efforts are needed in the development of tools
that do not depend on reference data.

In this context, by using multitemporal series of remote sensing images obtained by the Modis sensor,
meteorological/climatic attributes and Machine Learning methods, this work proposes an algorithm able to forecast
the occurrence of anomalies in aquatic environments. Three case studies related to the prediction of algal blooms in
Lakes Erie (USA), Chilika (India) and Taihu (China) are carried out to demonstrate the potential of the proposed method.

In summary, the main contributions of this research are:

(i) An unsupervised method designed to predict anomalous behavior;

(ii) A modular method that is therefore flexible regarding the use of other classification models, in
addition to those presented in the following formalization;

(iii) A conceptual formalization that, after convenient changes and adaptations, can be applied to other environmental
issues, in addition to the detection of anomalies in inland waters.

This article is organized as follows: Section 2 briefly reviews data classification, machine learning models and
spectral indices; Section 3 formalizes the proposed algorithm; Section 4 describes the experiment, the study areas
and the reference data, and presents and discusses the obtained results; finally, Section 5 presents conclusions
about the proposal and the studies developed.

2. Theoretical Rationale
2.1. Preliminary notations

Let $\mathcal{I}$ be an image remotely obtained by a sensor whose pixels $s \in \mathcal{S} \subset \mathbb{N}^2$ are
associated with an attribute vector $\mathbf{x}_s = [x_{s:1}, \ldots, x_{s:n}] \in \mathcal{X} \subset \mathbb{R}^n$.
Assuming that a given set of images composes a multitemporal series with $t$ distinct instants taken from the same
spatial domain $\mathcal{S}$, the notation $\mathcal{I}^{(\ell)}$, with $\ell = 1, \ldots, t$, is adopted as a
spatio-temporal representation of these data.

The image classification process comprises associating a class $\omega_k \in \Omega = \{\omega_1, \ldots, \omega_c\}$
to each $s \in \mathcal{S}$ by applying a function $G : \mathcal{X} \to \mathcal{Y}$ over $\mathbf{x}_s$, where
$\mathcal{Y} = \{1, \ldots, c\}$. The different classification methods proposed in the literature differ in how $G$ is
modeled. When this function is modeled in a supervised way, information from a training set
$\mathcal{D} = \{(\mathbf{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y} : i = 1, \ldots, m\}$ is used, where
$(\mathbf{x}_i, y_i)$ indicates that $\mathbf{x}_i$ is assigned to the class $\omega_a$ if the scalar
$y_i \in \mathbb{N}$ is equal to $a$. Conveniently, we denote by $\mathcal{C}^{(\ell)}$ the result that arises when
applying $G$ on each position $s$ of $\mathcal{I}^{(\ell)}$.

2.2. Machine learning models
Given the importance of image classification in several applications involving Remote Sensing data, the development
of more efficient and accurate methods has become a constant challenge [30, 66]. Among various examples,
consolidated methods such as Support Vector Machine (SVM) [14], Random Forest (RF) [8] and the recent models
based on neural networks have proven potential in the context of Remote Sensing applications.

Introduced by Vladimir Vapnik [62], the SVM method comprises a supervised learning algorithm that aims to
separate classes through a surface $g(\mathbf{x}) = K(\mathbf{x}, \mathbf{w}) + b$ whose margin is maximum [14, 41];
$\mathbf{w}$ and $b$ are parameters that determine the separation surface, and $K(\cdot, \cdot)$ is a kernel
function [52] conveniently adopted according to the complexity of the classification problem.

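As described by [9], starting from a dataset
$\mathcal{D} = \{(\mathbf{x}_i, y_i) \in \mathcal{X} \times \{-1, +1\} : i = 1, \ldots, m\}$, where $y_i = \pm 1$
indicates membership in one of two classes, the training of the SVM method comprises the calculation of the
parameters $\mathbf{w}$ and $b$ after solving the following optimization problem:

$$
\max_{\alpha} \left( \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \right)
\quad \text{s.t.} \quad
\begin{cases}
\sum_{i=1}^{m} y_i \alpha_i = 0 \\
0 \le \alpha_i \le C \\
i = 1, \ldots, m
\end{cases}
\tag{1}
$$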


where $\alpha_i \in \mathbb{R}$ are Lagrange multipliers and $C \in \mathbb{R}^{+}_{0}$ is a penalty factor applied to
misclassifications. Regarding the kernel, the radial basis function (RBF)
$K(\mathbf{x}_i, \mathbf{w}) = e^{-\gamma \|\mathbf{x}_i - \mathbf{w}\|^2}$, with $\gamma \in \mathbb{R}^{+}_{0}$, is
highlighted as a convenient option.

The Lagrange multipliers obtained when solving Equation 1 then allow the definition of the classification rule
$G(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b \right)$. More
details are found in [15] and [57].
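As a minimal illustration of the decision rule above, the sketch below evaluates $G(\mathbf{x})$ with an RBF kernel using NumPy. The support vectors, multipliers, bias and $\gamma$ are hand-picked hypothetical values that were not produced by solving Equation 1, and the convention sgn(0) = +1 is an assumption of this sketch.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows of a and b."""
    a = np.atleast_2d(a)
    b = np.atleast_2d(b)
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def svm_decide(x, train_x, train_y, alpha, b, gamma):
    """G(x) = sgn(sum_i y_i * alpha_i * K(x, x_i) + b), with sgn(0) taken as +1."""
    k = rbf_kernel(x, train_x, gamma)      # shape (n_queries, m)
    score = k @ (train_y * alpha) + b
    return np.where(score >= 0, 1, -1)

# Hypothetical training patterns and multipliers (they satisfy sum_i y_i * alpha_i = 0).
train_x = np.array([[0.0, 0.0], [2.0, 2.0]])
train_y = np.array([-1.0, 1.0])
alpha = np.array([1.0, 1.0])

queries = np.array([[0.1, 0.0], [1.9, 2.1]])
labels = svm_decide(queries, train_x, train_y, alpha, b=0.0, gamma=0.5)
```

Each query is pulled toward the class of the nearest training pattern because the RBF kernel decays quickly with distance.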

Proposed by [8], the RF method comprises a classification rule $G : \mathcal{X} \to \mathcal{Y}$ resulting from an
ensemble of decision trees [16]. Formally, from a training set $\mathcal{D}$, $n_{est}$ replicas with the same
cardinality are drawn by bootstrap sampling. For each replica, a subset with up to $n_{att}$ attributes is randomly
selected and then used to train a single decision tree. Parameters such as the maximum depth ($p_{depth}$) and the
minimum number of examples for splitting and per leaf ($p_{split}$ and $p_{leaf}$, respectively) need to be tuned
before the training process. A detailed discussion about these parameters is found in [8]. After training each tree
$G_k : \mathcal{X} \to \mathcal{Y}$, with $k = 1, \ldots, n_{est}$, a given unlabeled attribute vector $\mathbf{x}$ is
classified according to the class $\omega_a \in \Omega$ whose agreement among the $n_{est}$ decision trees is maximum,
that is:

$$
G(\mathbf{x}) = \underset{a \in \{1, \ldots, c\}}{\arg\max} \left\{ \sum_{k=1}^{n_{est}} \delta_a\left(G_k(\mathbf{x})\right) \right\}
\tag{2}
$$

where $\delta_a\left(G_k(\mathbf{x})\right) = 1$ when $G_k(\mathbf{x}) = a$; otherwise,
$\delta_a\left(G_k(\mathbf{x})\right) = 0$.
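The voting rule of Equation 2 can be sketched as below, assuming the per-tree predictions are already available as an array. The function name and the toy predictions are hypothetical, and ties are resolved in favor of the smallest class index, a choice the text does not specify.

```python
import numpy as np

def majority_vote(tree_predictions, n_classes):
    """Pick, for each sample, the class with the most votes among the trees.

    tree_predictions: array of shape (n_est, n_samples) with labels in 1..c.
    """
    preds = np.asarray(tree_predictions)
    # votes[a - 1, j] counts the trees that assign class a to sample j.
    votes = np.stack([(preds == a).sum(axis=0) for a in range(1, n_classes + 1)])
    return votes.argmax(axis=0) + 1

# Three hypothetical trees classifying three samples into classes {1, 2}.
tree_predictions = [
    [1, 2, 2],
    [1, 2, 1],
    [2, 2, 1],
]
voted = majority_vote(tree_predictions, n_classes=2)
```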

Concurrently with the previous methods, models based on neural network concepts emerge as an effective
alternative for the classification of Remote Sensing images [72]. Such models are characterized by their high
generalization ability when dealing with uncorrelated data.

Among dozen neural network proposals, the Long Short-term Memory (LSTM) comprises a model that inserts
and exploits the temporal effects over the learning process. The diagram depicted in Figure 1 yields to understand this
method. Initially, we assume 𝐱𝑖, with 𝑖 = 1,… , 𝑚, as a set of patterns (i.e., attribute vectors) ordered by its index 𝑖
and which are sequentially submitted to concatenation, element-wise multiplication and vector sum (⊙, ⊗ and ⊕) and
then applied to sigmoid (𝜎) and hyperbolic tangent (𝜑) functions.

Also, worth highlighting some elements in the training process, for example, 𝐜𝑖−1 and 𝐜𝑖 as a vector with the
“previous state” and “current” of the network; 𝐟𝑖 a vector comprising “forgetting factors” for each component of the
input vector; 𝐢𝑖 as a “signal modulation” of input information; and 𝐡𝑖−1 and 𝐡𝑖 as the network input and output vectors
in the “previous” and “current” states.

After successively inserting the patterns 𝐱𝑖 and considering the respective predictions 𝐡𝑖, the training process stops when
the network converges. It is worth noting that the classification is performed by a softmax function [18]
adjusted using the outputs 𝐡𝑖, whose class indicators 𝑦𝑖 are known from the labeled pairs (𝐱𝑖, 𝑦𝑖). According to this method, the
combination of the described network and the softmax function comprises the classification model 𝐺. A more
extensive and comprehensive discussion of the LSTM method and its parameters is found in [26].
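
The final classification step can be sketched as a softmax applied to a linear readout of the LSTM output 𝐡𝑖. The readout weights `W` and `b` below are hypothetical placeholders; in practice they would be fitted from the labeled pairs.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h, W, b):
    """Return (class index, class probabilities) for an LSTM output h."""
    scores = [sum(w * a for w, a in zip(row, h)) + bias for row, bias in zip(W, b)]
    probs = softmax(scores)
    return probs.index(max(probs)), probs


# Hypothetical readout for two classes and a two-component output vector.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
predicted_class, probs = classify([0.8, -0.3], W, b)
```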

2.3. Algal bloom detection through spectral index thresholding
Spectral indices can be understood as an attribute extraction process for Remote Sensing images whose purpose is
to aid in the identification of specific targets. Over the years, several studies have proposed and used spectral indices,
for example, for the characterization of water bodies [37, 20, 69] and vegetation health [49, 27, 33, 67]. In the context of
Chl-a estimation, the detection of algal blooms has been performed through the adoption of rigid thresholds. In [74], the
Normalized Difference Vegetation Index (NDVI) [49] is used to identify algae when the index exceeds
−0.15. Similarly, in [69], the Modified Normalized Difference Water Index (MNDWI) [25] is used to characterize
algal blooms when the index is less than zero. The Surface Algae Bloom Index (SABI) [1] and the Floating Algae Index
(FAI) [42] were proposed exclusively for detecting algae, whose occurrence is indicated when these indices exceed
−0.1 and −0.004, respectively.

Table 1 summarizes the above-mentioned spectral indices and the respective thresholds that characterize
the occurrence of algae. Regarding the expressions, 𝑥𝑅𝑒𝑑, 𝑥𝐺𝑟𝑒𝑒𝑛, 𝑥𝐵𝑙𝑢𝑒, 𝑥𝑁𝐼𝑅 and 𝑥𝑆𝑊𝐼𝑅 represent the spectral
behavior measured in the red, green, blue, near-infrared and short-wave infrared spectral bands, respectively; 𝜆𝑅𝑒𝑑,
𝜆𝑁𝐼𝑅 and 𝜆𝑆𝑊𝐼𝑅 are the midpoints of the red, near-infrared and short-wave infrared wavelength bands.


Figure 1: An overview of the LSTM architecture.

Table 1
Summary of spectral indices and thresholds used for algal bloom detection.

Spectral index    Expression                                                         Algae threshold    Reference
NDVI              (𝑥𝑁𝐼𝑅 − 𝑥𝑅𝑒𝑑) / (𝑥𝑁𝐼𝑅 + 𝑥𝑅𝑒𝑑)                                      > −0.15            [74]
MNDWI             (𝑥𝐺𝑟𝑒𝑒𝑛 − 𝑥𝑆𝑊𝐼𝑅) / (𝑥𝐺𝑟𝑒𝑒𝑛 + 𝑥𝑆𝑊𝐼𝑅)                                < 0                [70]
SABI              (𝑥𝑁𝐼𝑅 − 𝑥𝑅𝑒𝑑) / (𝑥𝐵𝑙𝑢𝑒 + 𝑥𝐺𝑟𝑒𝑒𝑛)                                   > −0.1             [1]
FAI               𝑥𝑁𝐼𝑅 − [𝑥𝑅𝑒𝑑 + (𝑥𝑆𝑊𝐼𝑅 − 𝑥𝑅𝑒𝑑) × (𝜆𝑁𝐼𝑅 − 𝜆𝑅𝑒𝑑) / (𝜆𝑆𝑊𝐼𝑅 − 𝜆𝑅𝑒𝑑)]      > −0.004           [42]
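
As an illustration of how the thresholds in Table 1 could be applied to a single pixel, consider the sketch below. The reflectance values and the band midpoints in `lam` are hypothetical and sensor-dependent; they are not taken from the text, and the FAI expression follows the commonly used baseline form.

```python
def ndvi(x):
    return (x["nir"] - x["red"]) / (x["nir"] + x["red"])


def mndwi(x):
    return (x["green"] - x["swir"]) / (x["green"] + x["swir"])


def sabi(x):
    return (x["nir"] - x["red"]) / (x["blue"] + x["green"])


def fai(x, lam):
    # Baseline interpolated between the red and SWIR bands at the NIR wavelength.
    slope = (lam["nir"] - lam["red"]) / (lam["swir"] - lam["red"])
    baseline = x["red"] + (x["swir"] - x["red"]) * slope
    return x["nir"] - baseline


def detect_algae(x, lam):
    """Apply each threshold of Table 1 independently to one pixel."""
    return {
        "NDVI": ndvi(x) > -0.15,
        "MNDWI": mndwi(x) < 0,
        "SABI": sabi(x) > -0.1,
        "FAI": fai(x, lam) > -0.004,
    }


# Hypothetical band midpoints in nanometers (assumed; they depend on the sensor).
lam = {"red": 655.0, "nir": 865.0, "swir": 1610.0}

# Hypothetical surface reflectances for two pixels.
bloom_like = {"blue": 0.03, "green": 0.06, "red": 0.04, "nir": 0.10, "swir": 0.02}
clear_water = {"blue": 0.05, "green": 0.04, "red": 0.02, "nir": 0.01, "swir": 0.005}

flags_bloom = detect_algae(bloom_like, lam)
flags_clear = detect_algae(clear_water, lam)
```

Each index is evaluated on its own, so the four flags need not agree for a given pixel.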

3. Anomalous behaviour forecasting
This section introduces a novel proposal for predicting algal blooms in inland waters. In the following discussions,
the emergence of this phenomenon will be called “anomaly”; otherwise, we adopt the term “regular”. Section 3.1
presents a conceptual formalization of the proposal. Section 3.2 gives relevant details regarding the proposal’s
implementation.

3.1. Conceptual formalization
Based on Machine Learning concepts and multitemporal data obtained by Remote Sensing, a method for predicting
anomalous events in aquatic environments is proposed. For this reason, this method will be called Anomalous Behavior
Forecasting (ABF). Figure 2 illustrates the main components of this proposal, which are structured in four main blocks.

The first block is intended to build a multitemporal database for predicting algal blooms. Five basic parameters
are needed at this stage: a region of interest where the forecasting is executed; a period of analysis for information
extraction and modelling; the number of past time instants used to compose the information (i.e., the attribute vector)
to build the forecast model; the nu