Title: HEURISTICS FOR DATA POINT SELECTION FOR LABELING IN SEMI-SUPERVISED AND ACTIVE LEARNING CONTEXTS
Author: SONIA FIOL GONZALEZ
Contributor(s): HELIO CORTES VIEIRA LOPES - Advisor; CASSIO FREITAS PEREIRA DE ALMEIDA - Co-advisor
Cataloged: 16/SEP/2021 | Language(s): ENGLISH - UNITED STATES
Type: TEXT | Subtype: THESIS
Notes: All data contained in the documents are the sole responsibility of the authors. The data used in the descriptions of the documents are in conformity with the systems of the administration of PUC-Rio.
Reference(s):
[pt] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=54776&idi=1
[en] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=54776&idi=2
DOI: https://doi.org/10.17771/PUCRio.acad.54776
Abstract:
Supervised learning is currently the branch of Machine Learning central to most business disruption. The approach relies on having enough labeled data to learn functions with the required approximation quality. However, labeled data may be expensive to obtain or to construct through a labeling process. Semi-supervised learning (SSL) strives to label data accurately from a small amount of labeled data, with the help of unsupervised learning techniques. One labeling technique is label propagation. In this work, we specifically use Consensus rate-based label propagation (CRLP). A consensus function is central to the propagation. One possible consensus function is a co-association matrix, which estimates the probability that data points i and j belong to the same group. We observe that the co-association matrix has valuable information embedded in it. When no data is labeled, it is common to choose the data points to label manually uniformly at random, and the propagation proceeds from those points. This work addresses the problem of selecting a fixed-size set of data points to label manually in order to improve the accuracy of the label propagation algorithm. Three selection techniques based on stochastic sampling principles are proposed: Stratified Sampling (SP), Probability (P), and Stratified Sampling - Probability (SSP). All three rely on the information embedded in the co-association matrix. Experiments carried out on 15 benchmark sets showed exciting results: the techniques not only provide a more balanced selection than uniform random selection, but also improve the accuracy of a label propagation method. These strategies were also tested inside an active learning process in a different context, where they also achieved good results.
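For illustration only: one common way to build a co-association matrix is to run several clusterings of the same data and record, for each pair of points, the fraction of clusterings that place them in the same cluster. The sketch below assumes that construction. The function name `co_association` and the toy partitions are hypothetical and are not taken from the thesis, which may build its matrix differently.

```python
import numpy as np


def co_association(partitions):
    """Estimate how often each pair of points shares a cluster.

    partitions: array-like of shape (n_partitions, n_points) holding
    integer cluster labels, one row per clustering run (hypothetical input).
    Returns an (n_points, n_points) matrix with entries in [0, 1].
    """
    partitions = np.asarray(partitions)
    # same[p, i, j] is True when clustering p puts points i and j together.
    same = partitions[:, :, None] == partitions[:, None, :]
    # Averaging over clusterings gives the co-occurrence frequency per pair.
    return same.mean(axis=0)


# Three toy clusterings of four points (labels are arbitrary per row).
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
C = co_association(partitions)
print(np.round(C, 2))
```

In this toy case, points 0 and 1 are grouped together in every clustering, so their entry is 1, while points 0 and 3 are never grouped together, so their entry is 0. Entries between those extremes mark pairs on which the clusterings disagree.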