ETDs @PUC-Rio
Title: A STUDY ON ELLIPSOIDAL CLUSTERING
Author: RAPHAEL ARAUJO SAMPAIO
Contributor(s): MARCUS VINICIUS SOLEDADE POGGI DE ARAGAO - Advisor
THIBAUT VICTOR GASTON VIDAL - Co-advisor
Cataloged: 16/JAN/2019 Language(s): ENGLISH - UNITED STATES
Type: TEXT Subtype: THESIS
Notes: All data contained in the documents are the sole responsibility of the authors. The data used in the descriptions of the documents are in conformity with the systems of the administration of PUC-Rio.
Reference(s): [pt] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=36126&idi=1
[en] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=36126&idi=2
DOI: https://doi.org/10.17771/PUCRio.acad.36126
Abstract:
Unsupervised cluster analysis, the process of grouping sets of points according to one or more similarity criteria, plays an essential role in various fields. The two most popular algorithms for this process are k-means and Gaussian Mixture Models (GMM). The former assigns each point to a single cluster and uses the Euclidean distance as its similarity measure. The latter determines a matrix of probabilities that each point belongs to each cluster, with the Mahalanobis distance as the underlying similarity measure. Apart from the difference in the assignment method (the so-called hard assignment for the former and soft assignment for the latter), the algorithms also differ in the cluster structure, or shape, that they assume: k-means considers spherical structures in the data, while the GMM considers ellipsoidal ones through the estimation of covariance matrices. In this work, a mathematical optimization problem that combines the hard assignment with the ellipsoidal cluster structure is detailed and formulated. Since the estimation of the covariance plays a major role in the behavior of ellipsoidal cluster structures, regularization techniques are explored. In this context, two meta-heuristic methods, a Random Swap perturbation and a hybrid genetic algorithm, are adapted, and their impact on the performance of the methods is studied. The central objective is three-fold: to gain an understanding of the conditions in which ellipsoidal clustering structures are more beneficial than spherical ones; to determine the impact of covariance estimation with regularization methods; and to analyze the effect of global optimization meta-heuristics on unsupervised cluster analysis.
Finally, to provide grounds for comparing the present findings with future related work, a database was generated together with an extensive benchmark. The benchmark analyzes how variations in size, shape, number of clusters, and separability affect the results of different clustering algorithms. Furthermore, packages written in the Julia language have been made available with the algorithms studied throughout this work.
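The contrast the abstract draws between spherical and ellipsoidal hard assignment can be illustrated with a minimal two-dimensional sketch. It is not the thesis's formulation or its Julia packages: the function names, the toy data, the shrinkage target (a scaled identity), and the shrinkage weight of 0.1 are all assumptions made for illustration. Likelihood-based hard-assignment objectives usually also carry a log-determinant term, which is omitted here for brevity.

```python
# Illustrative sketch only: names, data and shrinkage weight are assumptions.

def mean_2d(points):
    """Centroid of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def shrunk_covariance_2d(points, mu, shrinkage=0.1):
    """Maximum-likelihood covariance, shrunk toward a scaled identity.

    Returns (a, b, d) for the symmetric matrix [[a, b], [b, d]].
    """
    n = len(points)
    sxx = sum((x - mu[0]) ** 2 for x, _ in points) / n
    syy = sum((y - mu[1]) ** 2 for _, y in points) / n
    sxy = sum((x - mu[0]) * (y - mu[1]) for x, y in points) / n
    target = (sxx + syy) / 2.0  # average variance, placed on the diagonal
    a = (1 - shrinkage) * sxx + shrinkage * target
    d = (1 - shrinkage) * syy + shrinkage * target
    b = (1 - shrinkage) * sxy
    return (a, b, d)


def euclidean_sq(p, mu, cov=None):
    """Squared Euclidean distance (spherical view; cov is ignored)."""
    return (p[0] - mu[0]) ** 2 + (p[1] - mu[1]) ** 2


def mahalanobis_sq(p, mu, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    a, b, d = cov
    det = a * d - b * b
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det


def fit_cluster(points, shrinkage=0.1):
    """Return (mean, regularized covariance) for one cluster."""
    mu = mean_2d(points)
    return mu, shrunk_covariance_2d(points, mu, shrinkage)


def hard_assign(p, clusters, distance):
    """Index of the single cluster closest to p under the given distance."""
    return min(range(len(clusters)), key=lambda k: distance(p, *clusters[k]))


# Toy data: one cluster stretched along the x-axis, one compact cluster.
elongated = [(-4.0, 0.2), (-2.0, -0.2), (0.0, 0.0), (2.0, -0.2), (4.0, 0.2)]
compact = [(3.5, 3.0), (4.5, 3.0), (4.0, 2.5), (4.0, 3.5)]
clusters = [fit_cluster(elongated), fit_cluster(compact)]

query = (4.0, 0.5)  # lies along the long axis of the elongated cluster
print(hard_assign(query, clusters, euclidean_sq))    # 1 (compact cluster)
print(hard_assign(query, clusters, mahalanobis_sq))  # 0 (elongated cluster)
```

In this toy case the query point is closer to the compact cluster's centroid in Euclidean terms, yet it sits well inside the spread of the elongated cluster, so the two distances produce different hard assignments. The shrinkage step also keeps the covariance invertible when a cluster has few or nearly collinear points.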
File: COMPLETE PDF