Title: MORPHOSYNTACTIC TAGGER FOR PORTUGUESE-TWITTER
Author: PEDRO LARRONDA ASTI
Contributor(s): RUY LUIZ MILIDIU - Advisor
Cataloged: 13/OCT/2011
Language(s): PORTUGUESE - BRAZIL
Type: TEXT
Subtype: THESIS
Notes:
All data contained in the documents are the sole responsibility of the authors. The data used in the descriptions of the documents are in conformity with the systems of the administration of PUC-Rio.
Reference(s):
[pt] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=18481&idi=1
[en] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=18481&idi=2
DOI: https://doi.org/10.17771/PUCRio.acad.18481
Abstract:
In this paper we present a language processor that solves the task of Morphosyntactic
Tagging of messages posted in Portuguese on Twitter. By analyzing
the messages written by Brazilians on Twitter, it is easy to notice that new
characters are introduced into the alphabet and that new words are added
to the language. Furthermore, we note that these messages are syntactically
malformed. This precludes the use of existing Portuguese processors on these
messages. Nevertheless, this problem can be solved by treating these messages
as written in a new language, Portuguese-Twitter. Both the alphabet
and the vocabulary of this language contain features of Portuguese. However, the
grammar is different. In order to build processors for this new language,
we have used a supervised learning technique known as Entropy Guided
Transformation Learning (ETL). Additionally, to train ETL processors,
we have built an annotated corpus of messages in Portuguese-Twitter. We are
not aware of any other taggers for the Portuguese-Twitter Morphosyntactic
Tagging task; thus we have compared our tagger's accuracy to that of state-of-the-art
Morphosyntactic Annotation for Portuguese, which is around 96%
depending on the tag set chosen. To assess the quality of the processor, we have
used accuracy, which measures the fraction of tokens that were tagged correctly. Our
experimental results show an accuracy of 90.24% for the proposed Morphosyntactic
Tagger. This corresponds to significant learning, since the initial
baseline system has an accuracy of only 76.58%. This finding is consistent with
the observed learning for the corresponding regular Portuguese taggers.
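
As an illustration of the accuracy metric described above, the following minimal Python sketch computes token-level accuracy and the relative error reduction implied by the two reported figures (76.58% baseline, 90.24% proposed tagger). The function names, the tag labels, and the sample sequences are hypothetical and are not taken from the thesis; the error-reduction figure is derived here from the reported accuracies and is not a result stated by the author.

```python
def token_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def relative_error_reduction(baseline_accuracy, system_accuracy):
    """Share of the baseline's errors that the system eliminates."""
    baseline_error = 1.0 - baseline_accuracy
    system_error = 1.0 - system_accuracy
    return (baseline_error - system_error) / baseline_error


# Hypothetical tag sequences for a five-token message (illustrative labels only).
gold = ["PRON", "V", "ADV", "N", "HASHTAG"]
pred = ["PRON", "V", "ADJ", "N", "HASHTAG"]
accuracy = token_accuracy(gold, pred)  # 4 of 5 tokens match

# Accuracies reported in the abstract, expressed as fractions.
reduction = relative_error_reduction(0.7658, 0.9024)

print(f"toy accuracy: {accuracy:.2f}")
print(f"relative error reduction: {reduction:.1%}")
```

Under these figures, the proposed tagger removes a little over half of the baseline's tagging errors, which is one way to read the "significant learning" claim.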