Title: EFFICIENT WEB PAGE REFRESH POLICIES
Author: CRISTON PEREIRA DE SOUZA
Contributor(s): EDUARDO SANY LABER - Advisor
Cataloged: 15/JUL/2010
Language(s): PORTUGUESE - BRAZIL
Type: TEXT
Subtype: THESIS
Notes: All data contained in the documents are the sole responsibility of the authors. The data used in the descriptions of the documents are in conformity with the systems of the administration of PUC-Rio.
Reference(s):
[pt] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=15893&idi=1
[en] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=15893&idi=2
DOI: https://doi.org/10.17771/PUCRio.acad.15893
Abstract:
A search engine must continuously revisit web pages to keep its local
repository up to date. It therefore needs a page revisiting schedule that
keeps the repository up to date using the available resources.
To avoid overloading web servers, the revisiting policy must respect
a minimum amount of time between consecutive requests to the same
server. This rule is called the politeness constraint. Due to the large number
of web pages, we consider a revisiting policy efficient when the
mean time to schedule a revisit is sublinear in the number of pages in the
repository. However, when the politeness constraint is considered, there
are no existing efficient policies with theoretical quality guarantees. We
investigate three efficient policies that respect the politeness constraint,
called MERGE, RANDOM and DELAYED. We provide approximation
factors for the repository’s up-to-date level for the MERGE and RANDOM
policies. Based on these approximation factors, we derive a lower bound of
0.77 for the approximation factor of the RANDOM policy, and we
conjecture that 0.927 is a lower bound for the approximation
factor of the MERGE policy. We evaluate these policies through
simulation experiments that try to keep a repository of 14.5 million
web pages up to date. Additional experiments on a repository of
Wikipedia articles show that the MERGE policy provides better
results than a natural greedy strategy. The main conclusion of this research
is that simple and efficient policies can be applied to this
problem even when the politeness constraint must be respected, with
only a small loss in the repository's up-to-date level.
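
The politeness constraint and the sublinear scheduling requirement described in the abstract can be illustrated with a minimal sketch. It is not the MERGE, RANDOM or DELAYED policy from the thesis, whose details are not given here. The class name `PoliteScheduler`, the parameter `min_gap`, the example host names, and the round-robin page order within each server are all assumptions made for illustration. Each scheduling step costs one heap pop and one heap push, so its running time is logarithmic in the number of servers and thus sublinear in the number of pages.

```python
import heapq
from collections import deque


class PoliteScheduler:
    """Illustrative scheduler (hypothetical, not a policy from the thesis).

    Pages of each server are revisited round-robin, and two requests to
    the same server are always at least `min_gap` time units apart.
    """

    def __init__(self, pages_by_server, min_gap):
        self.min_gap = min_gap
        # One cyclic queue of pages per server; servers without pages are skipped.
        self.queues = {s: deque(p) for s, p in pages_by_server.items() if p}
        # Min-heap of (earliest allowed request time, server); all start at 0.
        self.ready = [(0.0, s) for s in sorted(self.queues)]
        heapq.heapify(self.ready)

    def next_visit(self, now):
        """Return (time, server, page) for the next revisit at or after `now`."""
        allowed, server = heapq.heappop(self.ready)
        t = max(now, allowed)  # wait if the server is not yet allowed
        queue = self.queues[server]
        page = queue.popleft()
        queue.append(page)  # the page goes to the back of its server's cycle
        heapq.heappush(self.ready, (t + self.min_gap, server))
        return t, server, page


if __name__ == "__main__":
    scheduler = PoliteScheduler(
        {"a.example": ["a/1", "a/2"], "b.example": ["b/1"]}, min_gap=10.0
    )
    now = 0.0
    for _ in range(4):
        t, server, page = scheduler.next_visit(now)
        print(t, server, page)
        now = t + 1.0  # assume each fetch takes one time unit
```

The sketch always serves the server that has been allowed to receive a request for the longest time, so it says nothing about the repository's up-to-date level; it only shows that a politeness-respecting schedule can be produced without scanning all pages at each step.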