DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing

Bibliographic Details
Published in: arXiv.org (Dec 8, 2024), p. n/a
Main Author: Shankar, Shreya
Other Authors: Chambers, Tristan; Shah, Tarak; Parameswaran, Aditya G; Wu, Eugene
Published: Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3142726310
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3142726310 
045 0 |b d20241208 
100 1 |a Shankar, Shreya 
245 1 |a DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing 
260 |b Cornell University Library, arXiv.org  |c Dec 8, 2024 
513 |a Working Paper 
520 3 |a Analyzing unstructured data has been a persistent challenge in data processing. Large Language Models (LLMs) have shown promise in this regard, leading to recent proposals for declarative frameworks for LLM-powered processing of unstructured data. However, these frameworks focus on reducing cost when executing user-specified operations using LLMs, rather than improving accuracy, executing most operations as-is (in a single LLM call). This is problematic for complex tasks and data, where LLM outputs for user-defined operations are often inaccurate, even with optimized prompts. For example, an LLM may struggle to identify *all* instances of specific clauses, like force majeure or indemnification, in lengthy legal documents, requiring decomposition of the data, the task, or both. We present DocETL, a system that optimizes complex document processing pipelines, while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based approach to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives), as well as an optimization and evaluation framework. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism that synthesizes and orchestrates task-specific validation prompts, and (iii) an optimization algorithm that efficiently finds promising plans, considering the latencies of agent-based plan generation and evaluation. Our evaluation on four different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are 25 to 80% more accurate than well-engineered baselines, addressing a critical gap in unstructured data analysis. DocETL is open-source at docetl.org, and as of November 2024, has amassed over 1.3k GitHub Stars, with users spanning a variety of domains. 
653 |a Data analysis 
653 |a Algorithms 
653 |a Data processing 
653 |a Prompt engineering 
653 |a Unstructured data 
653 |a Large language models 
653 |a Task complexity 
653 |a Documents 
653 |a Query languages 
653 |a Optimization 
700 1 |a Chambers, Tristan 
700 1 |a Shah, Tarak 
700 1 |a Parameswaran, Aditya G 
700 1 |a Wu, Eugene 
773 0 |t arXiv.org  |g (Dec 8, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3142726310/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2410.12189