ETDs

Title:

USING RUNTIME INFORMATION AND MAINTENANCE KNOWLEDGE TO ASSIST FAILURE DIAGNOSIS, DETECTION AND RECOVERY

Author:

THIAGO PINHEIRO DE ARAUJO

Contributor(s):

ARNDT VON STAA - Advisor

Cataloged:

16/JAN/2017

Language(s):

ENGLISH - UNITED STATES

Type:

TEXT

Subtype:

THESIS

Notes:

[en] All data contained in the documents are the sole responsibility of the authors. The data used in the descriptions of the documents are in conformity with the systems of the administration of PUC-Rio.

Reference(s):

[pt] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=28702&idi=1
[en] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=28702&idi=2

DOI:

https://doi.org/10.17771/PUCRio.acad.28702

Abstract:

Even software systems developed under strict quality control can be expected to fail during their lifetime. When a failure is observed in a production environment, the maintainer is responsible for diagnosing its cause and eventually removing it. For a critical service, however, this may take too long; hence, where possible, the failure signature should be identified so that a recovery mechanism can be generated to automatically detect and handle future occurrences until a proper correction is made. In this thesis, recovery consists of restoring a correct context that allows dependable execution, even if the causing fault is still unknown. To be effective, diagnosis and the implementation of recovery require detailed information about the failed execution. Failures that occur during the test phase happen in a controlled environment, allow specific code instrumentation to be added, and can usually be replicated, which makes the unexpected behavior easier to study. For failures that occur in the production environment, however, the available information is limited to what was recorded at the first occurrence. Because run-time failures are by nature unexpected, run-time data must be gathered systematically to support detection, diagnosis aimed at recovery, and eventually diagnosis aimed at removing the causing fault. There is thus a trade-off between the level of detail recorded through instrumentation and system performance: standard logging techniques usually have a low impact on performance but carry insufficient information about the execution, while tracing techniques can record precise and detailed information but are impractical in a production environment. This thesis proposes a novel hybrid approach for recording and extracting a system's runtime information.
The solution is based on event logs in which each event is enriched with contextual properties describing the state of the execution at the moment the event is recorded. Using these enriched log events, a diagnosis technique and a tool have been developed that allow events to be filtered according to the maintainer's perspective of interest. Furthermore, an approach based on these enriched events has been developed to detect and diagnose failures with a view to recovery. The proposed solutions were evaluated through measurements and studies conducted on deployed systems, based on failures that actually occurred while the software was being used in production.
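
To make the idea of enriched log events more concrete, the following is a minimal Python sketch, not the tool or API developed in the thesis. All names (`Event`, `EnrichedLog`, `set_context`, `on_signature`, `matches`) and the example properties are hypothetical, and a failure signature is assumed here to be a simple set of property values that an event must carry.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Event:
    """A log event plus a snapshot of the context at recording time."""
    message: str
    properties: Dict[str, Any] = field(default_factory=dict)


def matches(event: Event, wanted: Dict[str, Any]) -> bool:
    """True if the event carries every wanted property with the same value."""
    return all(event.properties.get(k) == v for k, v in wanted.items())


class EnrichedLog:
    """Hypothetical event log that enriches events with contextual properties."""

    def __init__(self) -> None:
        self.context: Dict[str, Any] = {}  # current execution state
        self.events: List[Event] = []
        # Pairs of (assumed failure signature, recovery callback).
        self.handlers: List[Tuple[Dict[str, Any], Callable[[Event], None]]] = []

    def set_context(self, **props: Any) -> None:
        self.context.update(props)

    def on_signature(self, signature: Dict[str, Any],
                     recover: Callable[[Event], None]) -> None:
        self.handlers.append((signature, recover))

    def record(self, message: str, **props: Any) -> Event:
        # Copy the context so later changes do not alter past events.
        event = Event(message, {**self.context, **props})
        self.events.append(event)
        for signature, recover in self.handlers:
            if matches(event, signature):
                recover(event)
        return event

    def filter(self, **wanted: Any) -> List[Event]:
        """Select events from one perspective of interest."""
        return [e for e in self.events if matches(e, wanted)]


log = EnrichedLog()
recovered: List[int] = []
log.on_signature({"component": "billing", "level": "error"},
                 lambda e: recovered.append(e.properties["request_id"]))

log.set_context(user="alice", request_id=1)
log.record("order received", component="api")
log.record("payment declined", component="billing", level="error")
log.set_context(user="bob", request_id=2)
log.record("order received", component="api")

print([e.message for e in log.filter(user="alice")])
# prints ['order received', 'payment declined']
print(recovered)
# prints [1]
```

Because every event carries the context that was current when it was recorded, the same log can be sliced by user, request, or component without adding further instrumentation, and a known signature can trigger a recovery action as soon as a matching event is recorded.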

Description:			File:
COMPLETE			PDF