Data analysis in python: Anonymized features and imbalanced data target

The remaining useful life (RUL) of a piece of equipment or a system is a prognostic value that depends on data gathered from multiple, diverse sources. When failure prediction is framed as a binary classification problem, as it is in this study, the probability that a system fails is usually far smaller than the probability that it is operating normally. The resulting imbalanced outcome (far more 'normal' than 'failure' states) at any given time arises from the combined values of a large set of features, some correlated with one another, some redundant, and most quite noisy. Anticipating the development and requirements of a robust framework, this chapter argues that these difficulties can be handled with Python libraries. It analyzes DOROTHEA, a dataset from the UCI repository with one hundred thousand sparse, anonymized (i.e., without recognizable labels) binary features and imbalanced binary classes. The analysis is carried out in an IPython (Jupyter) notebook: pandas imports the dataset, exploratory analysis and feature engineering follow, and several estimators (classifiers) from the scikit-learn library are applied. Global accuracy is shown to be inadequate for this case, since the algorithms easily overlook the minority class. Therefore, receiver operating characteristic (ROC) curves, precision-recall curves, and their respective areas under the curve (AUCs) were evaluated for each estimator or ensemble and compared, together with some simple statistics, across three hybrid feature selection strategies, that is, strategies that mix filter, embedded, and wrapper methods.
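The claim that global accuracy overlooks the minority class can be illustrated with a minimal sketch. It makes several assumptions that are not part of the chapter: the labels are synthetic (50 'failure' against 950 'normal' samples, not DOROTHEA), the scores come from a hypothetical noisy scorer, and the metrics are written in plain Python rather than taken from scikit-learn, so that the arithmetic is visible. All function and variable names are illustrative.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision_recall(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    # Convention assumed here: an undefined ratio (zero denominator) counts as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """ROC AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in sample count, fine for a small demo.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision measured at the rank of each positive sample.

    A simple summary of the precision-recall curve; tied scores are not
    grouped, so this is only a sketch of the usual definition.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic, assumed labels: 50 'failure' (1) and 950 'normal' (0) samples.
y_true = [1] * 50 + [0] * 950

# Baseline that always predicts the majority class.
baseline_pred = [0] * len(y_true)
baseline_acc = accuracy(y_true, baseline_pred)
baseline_precision, baseline_recall = precision_recall(y_true, baseline_pred)
baseline_auc = roc_auc(y_true, [0.0] * len(y_true))

# Hypothetical noisy scorer: positives score higher on average than negatives.
rng = random.Random(0)
noisy_scores = [rng.gauss(1.0 if t == 1 else 0.0, 1.0) for t in y_true]
noisy_auc = roc_auc(y_true, noisy_scores)
noisy_ap = average_precision(y_true, noisy_scores)

print("baseline accuracy:", baseline_acc, "recall:", baseline_recall)
print("baseline ROC AUC:", baseline_auc)
print("noisy scorer ROC AUC:", noisy_auc, "average precision:", noisy_ap)
```

The always-'normal' baseline reaches 95% accuracy while detecting no failures at all, and its ROC AUC sits at the chance level of 0.5. In a notebook workflow like the one described, the equivalent scikit-learn metric functions would normally replace these hand-written versions.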

Keywords

Data analysis, Imbalanced classes, Machine learning, Precision-recall, Python, ROC, Scikit-learn

Language

English

Citation

Probabilistic Prognostics and Health Management of Energy Systems, p. 169-188.

URI

http://hdl.handle.net/11449/179351

Collections

Ilha Solteira - FEIS - Faculdade de Engenharia
