GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching

Published in: arXiv.org (Dec 9, 2024), p. n/a
Main author: Regmi, Sajal
Other authors: Pun, Chetan Phakami
Published: Cornell University Library, arXiv.org
Topics: Semantics; Large language models; Caching; Queries; Storage; Application programming interface; Operating costs; Artificial intelligence; Natural language processing; Response time (computers); Customer services; Speech recognition
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3126805767
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3126805767 
045 0 |b d20241209 
100 1 |a Regmi, Sajal 
245 1 |a GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching 
260 |b Cornell University Library, arXiv.org  |c Dec 9, 2024 
513 |a Working Paper 
520 3 |a Large Language Models (LLMs), such as GPT, have revolutionized artificial intelligence by enabling nuanced understanding and generation of human-like text across a wide range of applications. However, the high computational and financial costs associated with frequent API calls to these models present a substantial bottleneck, especially for applications like customer service chatbots that handle repetitive queries. In this paper, we introduce GPT Semantic Cache, a method that leverages semantic caching of query embeddings in in-memory storage (Redis). By storing embeddings of user queries, our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the LLM. This technique achieves a notable reduction in operational costs while significantly enhancing response times, making it a robust solution for optimizing LLM-powered applications. Our experiments demonstrate that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates ranging from 61.6% to 68.8%. Additionally, the system achieves high accuracy, with positive hit rates exceeding 97%, confirming the reliability of cached responses. This technique not only reduces operational costs, but also improves response times, enhancing the efficiency of LLM-powered applications. 
653 |a Semantics 
653 |a Large language models 
653 |a Caching 
653 |a Queries 
653 |a Storage 
653 |a Application programming interface 
653 |a Operating costs 
653 |a Artificial intelligence 
653 |a Natural language processing 
653 |a Response time (computers) 
653 |a Customer services 
653 |a Speech recognition 
700 1 |a Pun, Chetan Phakami 
773 0 |t arXiv.org  |g (Dec 9, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3126805767/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2411.05276
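The abstract (MARC field 520) describes the mechanism only at a high level: embed each incoming query, look in Redis for a previously cached query whose embedding is semantically similar, and on a hit return the stored response instead of making a new LLM API call. The paper itself is not reproduced in this record, so the Python sketch below is only an illustration of that idea under stated assumptions: the embed and call_llm callables, the 0.9 cosine-similarity threshold, and the Redis key layout are hypothetical and not taken from the paper.

# Illustrative sketch of semantic caching of query embeddings in Redis.
# Assumptions (not from the paper): embeddings are produced by an external
# embed(text) callable, cache entries live in Redis hashes under
# "qcache:entry:*", and cosine similarity >= 0.9 counts as a cache hit.
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)
SIM_THRESHOLD = 0.9  # hypothetical cut-off for "semantically similar"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_emb: np.ndarray):
    """Linear scan over cached embeddings; return the stored response on a hit."""
    for key in r.scan_iter(match="qcache:entry:*"):
        entry = r.hgetall(key)
        cached_emb = np.frombuffer(entry[b"embedding"], dtype=np.float32)
        if cosine(query_emb, cached_emb) >= SIM_THRESHOLD:
            return entry[b"response"].decode()
    return None

def store(query_emb: np.ndarray, response: str) -> None:
    """Cache the query embedding together with the generated response."""
    key = f"qcache:entry:{r.incr('qcache:next_id')}"
    r.hset(key, mapping={"embedding": query_emb.astype(np.float32).tobytes(),
                         "response": response})

def answer(query: str, embed, call_llm) -> str:
    """Embed the query, serve from the cache when possible, otherwise call the LLM."""
    emb = np.asarray(embed(query), dtype=np.float32)
    cached = lookup(emb)
    if cached is not None:
        return cached            # cache hit: no API call to the LLM
    response = call_llm(query)   # cache miss: pay for one LLM call
    store(emb, response)
    return response

A production system would replace the linear scan with an approximate nearest-neighbour index (for example Redis vector search), but the scan keeps the sketch short and self-contained.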