ETDs @PUC-Rio
Title: QUOTATION EXTRACTION FOR PORTUGUESE
Author: WILLIAM PAULO DUCCA FERNANDES
Contributor(s): RUY LUIZ MILIDIU - Advisor
Cataloged: 24/JAN/2017 Language(s): ENGLISH - UNITED STATES
Type: TEXT Subtype: THESIS
Notes: All data contained in the documents are the sole responsibility of the authors. The data used in the descriptions of the documents are in conformity with the systems of the administration of PUC-Rio.
Reference(s): [pt] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=28807&idi=1
[en] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=28807&idi=2
DOI: https://doi.org/10.17771/PUCRio.acad.28807
Abstract:
Quotation Extraction consists of identifying quotations in a text and associating them with their authors. In this work, we present a Quotation Extraction system for Portuguese. Quotation Extraction has previously been approached with different techniques and for several languages. Our proposal differs from previous work in that we use Machine Learning to build specialized rules automatically instead of relying on human-derived rules. Machine Learning models usually show stronger generalization power than human-derived models. In addition, our model can be easily adapted to other languages, requiring only a list of verbs of speech for the target language. Previously proposed systems would probably need their rule sets adapted to classify quotations correctly, which would be time-consuming. We tackle the Quotation Extraction task using one model for the Entropy Guided Transformation Learning algorithm and another for the Structured Perceptron algorithm. To train and evaluate the system, we have built the GloboQuotes corpus, with news extracted from the globo.com portal. We add part-of-speech tags to the corpus using a state-of-the-art tagger. The Structured Perceptron based on weighted interval scheduling obtains an F_{β=1} score of 76.80%.
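The abstract names weighted interval scheduling as the basis of the best-performing Structured Perceptron model but does not describe how it is applied. As an illustration only, the following is a minimal sketch of the classic weighted interval scheduling dynamic program: given candidate spans with weights, it selects the non-overlapping subset with the highest total weight. The function name, the (start, end, weight) tuple layout, the half-open span convention, and the example scores are all assumptions made for this sketch and are not taken from the thesis.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical representation: (start, end, weight), with `end` exclusive.
Interval = Tuple[int, int, float]


def weighted_interval_scheduling(
    intervals: List[Interval],
) -> Tuple[float, List[Interval]]:
    """Return the best total weight and one optimal set of non-overlapping intervals."""
    # Sort by end position so every compatible predecessor forms a prefix.
    ordered = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in ordered]
    n = len(ordered)

    # prev[i] = how many of the first i intervals end at or before ordered[i] starts.
    prev = [bisect_right(ends, ordered[i][0], 0, i) for i in range(n)]

    # best[k] = maximum total weight using only the first k intervals.
    best = [0.0] * (n + 1)
    for k in range(1, n + 1):
        weight = ordered[k - 1][2]
        best[k] = max(best[k - 1], weight + best[prev[k - 1]])

    # Backtrack to recover one optimal selection.
    chosen: List[Interval] = []
    k = n
    while k > 0:
        weight = ordered[k - 1][2]
        if weight + best[prev[k - 1]] > best[k - 1]:
            chosen.append(ordered[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    chosen.reverse()
    return best[n], chosen


# Made-up candidate spans over token positions, with made-up scores.
candidates = [(0, 5, 2.0), (3, 9, 4.0), (5, 12, 4.5), (10, 15, 1.0)]
total, selected = weighted_interval_scheduling(candidates)
print(total, selected)
```

In a structured prediction setting, the weights would presumably come from the learned model's scores for candidate spans, and a selection step of this kind would serve as the inference procedure; that reading is an assumption, since the catalog record gives no details.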
Description: File:
COVER, ACKNOWLEDGEMENTS, RESUMO, ABSTRACT, SUMMARY AND LISTS PDF    
CHAPTER 1 PDF    
CHAPTER 2 PDF    
CHAPTER 3 PDF    
CHAPTER 4 PDF    
CHAPTER 5 PDF    
CHAPTER 6 PDF    
REFERENCES, GLOSSARY AND APPENDICES PDF