ETDs

Title:

QUERYING DATABASES WITH NATURAL LANGUAGE: THE USE OF LARGE LANGUAGE MODELS FOR TEXT-TO-SQL TASKS

Author:

EDUARDO ROGER SILVA NASCIMENTO

Contributor(s):

MARCO ANTONIO CASANOVA - Advisor

Cataloged:

23/MAY/2024

Language(s):

ENGLISH - UNITED STATES

Type:

TEXT

Subtype:

THESIS

Concurso de Teses e Dissertações em Banco de Dados 2024 - SBC

Notes:

All data contained in the documents are the sole responsibility of their authors. The data used in the descriptions of the documents conform to the systems of the administration of PUC-Rio.

Reference(s):

[pt] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=66799&idi=1
[en] https://www.maxwell.vrac.puc-rio.br/projetosEspeciais/ETDs/consultas/conteudo.php?strSecao=resultado&nrSeq=66799&idi=2

DOI:

https://doi.org/10.17771/PUCRio.acad.66799

Abstract:

The Text-to-SQL task involves generating an SQL query from a given relational database and a Natural Language (NL) question. Although the leaderboards of well-known benchmarks indicate that Large Language Models (LLMs) excel at this task, those benchmarks evaluate the models on databases with simpler schemas. This dissertation first investigates the performance of LLM-based Text-to-SQL models on a complex, openly available database (Mondial) with a large schema, using a set of 100 NL questions. With the tools running on GPT-3.5 and GPT-4, the results of this first experiment show that the performance of LLM-based tools is significantly lower than that reported in the benchmarks, and that these tools struggle with schema linking and joins, which suggests that the relational schema may not be well suited to LLMs. The dissertation then proposes using LLM-friendly views and data descriptions to improve accuracy in the Text-to-SQL task. A second experiment takes the strategy with the better performance and cost-benefit in the first experiment and applies it to another set of 100 questions over a real-world database; the results show that the proposed approach is sufficient to considerably improve the accuracy of the prompt strategy. The work concludes with a discussion of the results obtained and suggests further approaches to simplify the Text-to-SQL task.

Description:			File:
COMPLETE			PDF