Data analysis in python: Anonymized features and imbalanced data target

Woiski, Emanuel Rocha [UNESP]

doi:10.1007/978-3-319-55852-3_10

dc.contributor.author	Woiski, Emanuel Rocha [UNESP]
dc.contributor.institution	Universidade Estadual Paulista (Unesp)
dc.date.accessioned	2018-12-11T17:34:50Z
dc.date.available	2018-12-11T17:34:50Z
dc.date.issued	2017-04-25
dc.description.abstract	The remaining useful life (RUL) of a piece of equipment or a system is a prognostic value that depends on data gathered from multiple, diverse sources. Moreover, when prognosis is treated as a binary classification problem, as it is in the present study, the probability that a system fails is usually much smaller than the probability that it is operating normally. This imbalanced outcome (far more 'normal' than 'failure' states) at any time results from the combined values of a large set of features, some related to one another, some redundant, and most quite noisy. Anticipating the development and requirements of a robust framework, it is argued that these difficulties can be handled with Python libraries. In the present chapter, DOROTHEA, a dataset from the UCI Machine Learning Repository with one hundred thousand sparse, anonymized (i.e., with unrecognizable labels) binary features and imbalanced binary classes, is analyzed. To that end, pandas is used in an IPython (Jupyter) notebook to import the dataset, some exploratory analysis and feature engineering are performed, and several estimators (classifiers) from the scikit-learn library are applied. It is demonstrated that global accuracy is not a suitable metric in this case, since the minority class is easily overlooked by the algorithms. Therefore, receiver operating characteristic (ROC) and precision-recall curves, their respective areas under the curve (AUCs), and some simple statistics were evaluated for each estimator or ensemble and compared across three hybrid feature selection strategies, that is, mixes of filter, embedded, and wrapper methods.	en
dc.description.affiliation	Department of Mechanical Engineering, São Paulo State University (UNESP)
dc.description.affiliationUnesp	Department of Mechanical Engineering, São Paulo State University (UNESP)
dc.format.extent	169-188
dc.identifier	http://dx.doi.org/10.1007/978-3-319-55852-3_10
dc.identifier.citation	Probabilistic Prognostics and Health Management of Energy Systems, p. 169-188.
dc.identifier.doi	10.1007/978-3-319-55852-3_10
dc.identifier.lattes	0956241471937475
dc.identifier.orcid	0000-0002-6030-639X
dc.identifier.scopus	2-s2.0-85033671443
dc.identifier.uri	http://hdl.handle.net/11449/179351
dc.language.iso	eng
dc.relation.ispartof	Probabilistic Prognostics and Health Management of Energy Systems
dc.rights.accessRights	Restricted access
dc.source	Scopus
dc.subject	Data analysis
dc.subject	Imbalanced classes
dc.subject	Machine learning
dc.subject	Precision-recall
dc.subject	Python
dc.subject	ROC
dc.subject	Scikit-learn
dc.title	Data analysis in python: Anonymized features and imbalanced data target	en
dc.type	Book chapter
dspace.entity.type	Publication
unesp.author.lattes	0956241471937475[1]
unesp.author.orcid	0000-0002-6030-639X[1]
unesp.department	Engenharia Mecânica - FEIS	pt

Collections

Ilha Solteira - FEIS - Faculdade de Engenharia
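
The abstract's point that global accuracy fails on imbalanced classes can be illustrated with a minimal sketch. It is not the chapter's code and does not use DOROTHEA: the class counts (100 'failure' versus 900 'normal' samples) and all function names are assumptions chosen for illustration, and only the Python standard library is used. In practice, scikit-learn's `roc_auc_score` and `average_precision_score` would normally replace the hand-written helpers.

```python
# Illustrative sketch only: hypothetical class counts, not the DOROTHEA data.

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives (label 1) that the classifier recovers."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores a
    random negative, counting ties as one half (Mann-Whitney form)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Assumed imbalance: 100 'failure' (1) and 900 'normal' (0) samples.
y_true = [1] * 100 + [0] * 900

# A trivial classifier that always predicts the majority class.
y_pred_majority = [0] * len(y_true)
scores_majority = [0.0] * len(y_true)

acc = accuracy(y_true, y_pred_majority)
rec = recall(y_true, y_pred_majority)
auc = roc_auc(y_true, scores_majority)

print(f"accuracy={acc:.2f} recall={rec:.2f} roc_auc={auc:.2f}")
```

Under these assumed counts the majority-class predictor looks accurate while finding no failures at all, and its ROC AUC is no better than chance, which is why curve-based metrics are the more informative choice here.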
