Logo PUC-Rio Logo Maxwell
ETDs @PUC-Rio
Estatística
Título: THE IMPACT OF STRUCTURAL ATTRIBUTES TO IDENTIFY TABLES AND LISTS IN HTML DOCUMENTS
Autor: IAM VITA JABOUR
Colaborador(es): EDUARDO SANY LABER - Orientador
RAUL PIERRE RENTERIA - Coorientador
Catalogação: 11/ABR/2011 Língua(s): PORTUGUESE - BRAZIL
Tipo: TEXT Subtipo: THESIS
Notas: [pt] Todos os dados constantes dos documentos são de inteira responsabilidade de seus autores. Os dados utilizados nas descrições dos documentos estão em conformidade com os sistemas da administração da PUC-Rio.
[en] All data contained in the documents are the sole responsibility of the authors. The data used in the descriptions of the documents are in conformity with the systems of the administration of PUC-Rio.
Referência(s): [pt] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=17247&idi=1
[en] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=17247&idi=2
DOI: https://doi.org/10.17771/PUCRio.acad.17247
Resumo:
The segmentation of HTML documents has been essential to information extraction tasks, as showed by several works in this area. This paper studies the link between an HTML document and its visual representation to show how it helps segments identification using a structural approach. For this, we investigate how tree edit distance algorithms can find structural similarities in a DOM tree, using two tasks to execute our experiments. The first one is the identification of genuine tables where we obtained a 90.40% F1 score using the corpus provided by (Wang e Hu, 2002). We show through an experimental study that this result is competitive with the best results in the area. The second task studied is the identification of product listings in e-commerce sites. Here we get a 94.95% F1 score using a corpus with 1114 HTML documents from 8 distinct sites. We conclude that algorithms to calculate trees similarity provide competitive results for both tasks, making them also good candidates to identify other types of segments.
Descrição: Arquivo:   
COVER, ACKNOWLEDGEMENTS, RESUMO, ABSTRACT, SUMMARY AND LISTS PDF    
CHAPTER 1 PDF    
CHAPTER 2 PDF    
CHAPTER 3 PDF    
CHAPTER 4 PDF    
CHAPTER 5 PDF    
CHAPTER 6 PDF    
REFERENCES PDF