DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
| Published in: | arXiv.org (Dec 8, 2024), p. n/a |
|---|---|
| Main author: | Shankar, Shreya |
| Other authors: | Chambers, Tristan; Shah, Tarak; Parameswaran, Aditya G; Wu, Eugene |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Data analysis; Algorithms; Data processing; Prompt engineering; Unstructured data; Large language models; Task complexity; Documents; Query languages; Optimization |
| Online access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| Tag | Ind1 | Ind2 | Data |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3142726310 |
| 003 | | | UK-CbPIL |
| 022 | | | |a 2331-8422 |
| 035 | | | |a 3142726310 |
| 045 | 0 | | |b d20241208 |
| 100 | 1 | | |a Shankar, Shreya |
| 245 | 1 | | |a DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing |
| 260 | | | |b Cornell University Library, arXiv.org |c Dec 8, 2024 |
| 513 | | | |a Working Paper |
| 520 | 3 | | |a Analyzing unstructured data has been a persistent challenge in data processing. Large Language Models (LLMs) have shown promise in this regard, leading to recent proposals for declarative frameworks for LLM-powered processing of unstructured data. However, these frameworks focus on reducing cost when executing user-specified operations using LLMs, rather than improving accuracy, executing most operations as-is (in a single LLM call). This is problematic for complex tasks and data, where LLM outputs for user-defined operations are often inaccurate, even with optimized prompts. For example, an LLM may struggle to identify *all* instances of specific clauses, like force majeure or indemnification, in lengthy legal documents, requiring decomposition of the data, the task, or both. We present DocETL, a system that optimizes complex document processing pipelines, while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based approach to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives), as well as an optimization and evaluation framework. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism that synthesizes and orchestrates task-specific validation prompts, and (iii) an optimization algorithm that efficiently finds promising plans, considering the latencies of agent-based plan generation and evaluation. Our evaluation on four different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are 25 to 80% more accurate than well-engineered baselines, addressing a critical gap in unstructured data analysis. DocETL is open-source at docetl.org, and as of November 2024, has amassed over 1.3k GitHub Stars, with users spanning a variety of domains. |
| 653 | | | |a Data analysis |
| 653 | | | |a Algorithms |
| 653 | | | |a Data processing |
| 653 | | | |a Prompt engineering |
| 653 | | | |a Unstructured data |
| 653 | | | |a Large language models |
| 653 | | | |a Task complexity |
| 653 | | | |a Documents |
| 653 | | | |a Query languages |
| 653 | | | |a Optimization |
| 700 | 1 | | |a Chambers, Tristan |
| 700 | 1 | | |a Shah, Tarak |
| 700 | 1 | | |a Parameswaran, Aditya G |
| 700 | 1 | | |a Wu, Eugene |
| 773 | 0 | | |t arXiv.org |g (Dec 8, 2024), p. n/a |
| 786 | 0 | | |d ProQuest |t Engineering Database |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3142726310/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch |
| 856 | 4 | 0 | |3 Full text outside of ProQuest |u http://arxiv.org/abs/2410.12189 |
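
The abstract above (field 520) describes a declarative pipeline interface together with agent-driven rewrite directives that decompose operations over long documents. The snippet below is a minimal, hypothetical sketch of that general idea only: it is not DocETL's actual API, and the names `Pipeline`, `MapOp`, and `split_then_map`, as well as the stubbed LLM call, are invented for illustration. The real system and its interface are documented at docetl.org.

```python
# Hypothetical sketch (not DocETL's actual API): a declarative document pipeline
# and one illustrative "rewrite directive" that decomposes a long-document map
# operation into chunk-level sub-maps followed by a simple merge.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MapOp:
    """Apply a prompt to each document; the LLM call is stubbed here."""
    name: str
    prompt: str
    llm: Callable[[str, str], str]  # (prompt, document_text) -> output


    def run(self, docs: List[str]) -> List[str]:
        return [self.llm(self.prompt, d) for d in docs]


@dataclass
class Pipeline:
    ops: List[MapOp] = field(default_factory=list)

    def run(self, docs: List[str]) -> List[str]:
        for op in self.ops:
            docs = op.run(docs)
        return docs


def split_then_map(op: MapOp, chunk_chars: int = 2000) -> MapOp:
    """Illustrative rewrite: run the op per chunk, then concatenate the
    per-chunk outputs, so no single LLM call sees an overly long document."""
    def chunked_llm(prompt: str, text: str) -> str:
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        return "\n".join(op.llm(prompt, c) for c in chunks)
    return MapOp(name=op.name + "_chunked", prompt=op.prompt, llm=chunked_llm)


if __name__ == "__main__":
    # Stub "LLM" that only reports how much text it was asked to analyze.
    fake_llm = lambda prompt, text: f"[{len(text)} chars analyzed for: {prompt}]"
    extract = MapOp(name="find_clauses",
                    prompt="List every force majeure or indemnification clause.",
                    llm=fake_llm)
    original = Pipeline(ops=[extract])
    rewritten = Pipeline(ops=[split_then_map(extract, chunk_chars=1000)])
    long_doc = "lorem ipsum " * 500
    print(original.run([long_doc])[0])   # one call over the whole document
    print(rewritten.run([long_doc])[0])  # several calls over smaller chunks
```

In the system described by the abstract, an agent additionally synthesizes task-specific validation prompts to evaluate and compare candidate plans such as these two; the sketch only conveys why decomposing a long-document operation can help accuracy.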