GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching

Published in: arXiv.org (Dec 9, 2024), p. n/a
Main author: Regmi, Sajal
Other authors: Pun, Chetan Phakami
Published: Cornell University Library, arXiv.org
Topics: Semantics; Large language models; Caching; Queries; Storage; Application programming interface; Operating costs; Artificial intelligence; Natural language processing; Response time (computers); Customer services; Speech recognition
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3126805767
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3126805767 
045 0 |b d20241209 
100 1 |a Regmi, Sajal 
245 1 |a GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching 
260 |b Cornell University Library, arXiv.org  |c Dec 9, 2024 
513 |a Working Paper 
520 3 |a Large Language Models (LLMs), such as GPT, have revolutionized artificial intelligence by enabling nuanced understanding and generation of human-like text across a wide range of applications. However, the high computational and financial costs associated with frequent API calls to these models present a substantial bottleneck, especially for applications like customer service chatbots that handle repetitive queries. In this paper, we introduce GPT Semantic Cache, a method that leverages semantic caching of query embeddings in in-memory storage (Redis). By storing embeddings of user queries, our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the LLM. This technique achieves a notable reduction in operational costs while significantly enhancing response times, making it a robust solution for optimizing LLM-powered applications. Our experiments demonstrate that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates ranging from 61.6% to 68.8%. Additionally, the system achieves high accuracy, with positive hit rates exceeding 97%, confirming the reliability of cached responses. This technique not only reduces operational costs, but also improves response times, enhancing the efficiency of LLM-powered applications. 
653 |a Semantics 
653 |a Large language models 
653 |a Caching 
653 |a Queries 
653 |a Storage 
653 |a Application programming interface 
653 |a Operating costs 
653 |a Artificial intelligence 
653 |a Natural language processing 
653 |a Response time (computers) 
653 |a Customer services 
653 |a Speech recognition 
700 1 |a Pun, Chetan Phakami 
773 0 |t arXiv.org  |g (Dec 9, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3126805767/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2411.05276
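The abstract (MARC field 520) describes the mechanism only at a high level: embed each incoming query, look in Redis for a previously cached query whose embedding is semantically similar, and on a hit return the stored response instead of making a new LLM API call. The paper itself is not reproduced in this record, so the Python sketch below is only an illustration of that idea under stated assumptions: the embed and call_llm callables, the 0.9 cosine-similarity threshold, and the Redis key layout are hypothetical and not taken from the paper.

# Illustrative sketch of semantic caching of query embeddings in Redis.
# Assumptions (not from the paper): embeddings are produced by an external
# embed(text) callable, cache entries live in Redis hashes under
# "qcache:entry:*", and cosine similarity >= 0.9 counts as a cache hit.
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)
SIM_THRESHOLD = 0.9  # hypothetical cut-off for "semantically similar"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_emb: np.ndarray):
    """Linear scan over cached embeddings; return the stored response on a hit."""
    for key in r.scan_iter(match="qcache:entry:*"):
        entry = r.hgetall(key)
        cached_emb = np.frombuffer(entry[b"embedding"], dtype=np.float32)
        if cosine(query_emb, cached_emb) >= SIM_THRESHOLD:
            return entry[b"response"].decode()
    return None

def store(query_emb: np.ndarray, response: str) -> None:
    """Cache the query embedding together with the generated response."""
    key = f"qcache:entry:{r.incr('qcache:next_id')}"
    r.hset(key, mapping={"embedding": query_emb.astype(np.float32).tobytes(),
                         "response": response})

def answer(query: str, embed, call_llm) -> str:
    """Embed the query, serve from the cache when possible, otherwise call the LLM."""
    emb = np.asarray(embed(query), dtype=np.float32)
    cached = lookup(emb)
    if cached is not None:
        return cached            # cache hit: no API call to the LLM
    response = call_llm(query)   # cache miss: pay for one LLM call
    store(emb, response)
    return response

A production system would replace the linear scan with an approximate nearest-neighbour index (for example Redis vector search), but the scan keeps the sketch short and self-contained.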