GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching
| Published in: | arXiv.org (Dec 9, 2024), p. n/a |
|---|---|
| Main author: | Regmi, Sajal |
| Other authors: | Pun, Chetan Phakami |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Semantics; Large language models; Caching; Queries; Storage; Application programming interface; Operating costs; Artificial intelligence; Natural language processing; Response time (computers); Customer services; Speech recognition |
| Online access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| Tag | Ind1 | Ind2 | Subfields |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3126805767 |
| 003 | | | UK-CbPIL |
| 022 | | | \|a 2331-8422 |
| 035 | | | \|a 3126805767 |
| 045 | 0 | | \|b d20241209 |
| 100 | 1 | | \|a Regmi, Sajal |
| 245 | 1 | | \|a GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching |
| 260 | | | \|b Cornell University Library, arXiv.org \|c Dec 9, 2024 |
| 513 | | | \|a Working Paper |
| 520 | 3 | | \|a Large Language Models (LLMs), such as GPT, have revolutionized artificial intelligence by enabling nuanced understanding and generation of human-like text across a wide range of applications. However, the high computational and financial costs associated with frequent API calls to these models present a substantial bottleneck, especially for applications like customer service chatbots that handle repetitive queries. In this paper, we introduce GPT Semantic Cache, a method that leverages semantic caching of query embeddings in in-memory storage (Redis). By storing embeddings of user queries, our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the LLM. This technique achieves a notable reduction in operational costs while significantly enhancing response times, making it a robust solution for optimizing LLM-powered applications. Our experiments demonstrate that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates ranging from 61.6% to 68.8%. Additionally, the system achieves high accuracy, with positive hit rates exceeding 97%, confirming the reliability of cached responses. This technique not only reduces operational costs, but also improves response times, enhancing the efficiency of LLM-powered applications. |
| 653 | | | \|a Semantics |
| 653 | | | \|a Large language models |
| 653 | | | \|a Caching |
| 653 | | | \|a Queries |
| 653 | | | \|a Storage |
| 653 | | | \|a Application programming interface |
| 653 | | | \|a Operating costs |
| 653 | | | \|a Artificial intelligence |
| 653 | | | \|a Natural language processing |
| 653 | | | \|a Response time (computers) |
| 653 | | | \|a Customer services |
| 653 | | | \|a Speech recognition |
| 700 | 1 | | \|a Pun, Chetan Phakami |
| 773 | 0 | | \|t arXiv.org \|g (Dec 9, 2024), p. n/a |
| 786 | 0 | | \|d ProQuest \|t Engineering Database |
| 856 | 4 | 1 | \|3 Citation/Abstract \|u https://www.proquest.com/docview/3126805767/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch |
| 856 | 4 | 0 | \|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2411.05276 |
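
The abstract in field 520 above describes the core loop of the method: embed each incoming query, look for a semantically similar query already stored in Redis, and on a hit return the cached response instead of calling the LLM again. Below is a minimal sketch of that loop, not the authors' implementation: the local Redis connection, the sentence-transformers encoder, the `semcache:*` key layout, the 0.85 similarity threshold, and the placeholder `call_llm` are all illustrative assumptions.

```python
"""Minimal sketch of a semantic query cache in the spirit of GPT Semantic Cache.

Assumptions (not from the paper): a local Redis server on the default port,
sentence-transformers for embeddings, a flat "semcache:*" key layout, and an
illustrative similarity threshold. The paper uses Redis as in-memory storage
for query embeddings; everything else here is a stand-in.
"""
import uuid

import numpy as np
import redis
from sentence_transformers import SentenceTransformer

r = redis.Redis(host="localhost", port=6379)       # assumed local Redis
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
SIM_THRESHOLD = 0.85                               # illustrative threshold


def call_llm(query: str) -> str:
    """Placeholder for the real LLM API call (e.g. a chat-completion request)."""
    return f"LLM answer for: {query}"


def embed(text: str) -> np.ndarray:
    # Unit-normalized embedding so a dot product equals cosine similarity.
    return encoder.encode(text, normalize_embeddings=True).astype(np.float32)


def nearest_cached(query_vec: np.ndarray):
    """Brute-force cosine scan over all cached query embeddings."""
    best_key, best_sim = None, -1.0
    for key in r.scan_iter(match="semcache:*"):
        vec = np.frombuffer(r.hget(key, "vec"), dtype=np.float32)
        sim = float(np.dot(query_vec, vec))
        if sim > best_sim:
            best_key, best_sim = key, sim
    return best_key, best_sim


def answer(query: str) -> str:
    q = embed(query)
    key, sim = nearest_cached(q)
    if key is not None and sim >= SIM_THRESHOLD:
        # Cache hit: return the stored response, no LLM API call.
        return r.hget(key, "resp").decode()
    # Cache miss: call the LLM, then cache the embedding/response pair.
    resp = call_llm(query)
    r.hset(f"semcache:{uuid.uuid4()}", mapping={"vec": q.tobytes(), "resp": resp})
    return resp


if __name__ == "__main__":
    print(answer("How do I reset my password?"))
    print(answer("How can I reset my password?"))  # semantically close; likely a hit
```

The linear scan keeps the sketch short and dependency-light; a real deployment would more likely use an approximate nearest-neighbor index (for example Redis Stack's vector similarity search) so lookups stay fast as the cache grows.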