How Large Language Models Enhance Topic Modeling on User-Generated Content

Published: Journal of Physics: Conference Series vol. 3114, no. 1 (Sep 2025), p. 012011
Author: Bui, Minh Phuoc
Other Authors: Nguyen, Mien Thi Ngoc
Publisher: IOP Publishing
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3252527904
003 UK-CbPIL
022 |a 1742-6588 
022 |a 1742-6596 
024 7 |a 10.1088/1742-6596/3114/1/012011  |2 doi 
035 |a 3252527904 
045 2 |b d20250901  |b d20250930 
100 1 |a Bui, Minh Phuoc 
245 1 |a How Large Language Models Enhance Topic Modeling on User-Generated Content 
260 |b IOP Publishing  |c Sep 2025 
513 |a Journal Article 
520 3 |a Understanding user-generated content (UGC) is crucial for obtaining actionable insights in domains such as e-commerce and hospitality. However, the noisy and redundant nature of such content presents challenges for topic modeling methods like Latent Semantic Analysis (LSA). In this paper, we investigate whether preprocessing user reviews with large language models (LLMs) can improve topic modeling performance. Specifically, we compare two input variants: (1) raw reviews and (2) ChatGPT-generated summaries, produced via API as concise keyphrases. We apply LSA with varimax rotation to each variant and evaluate the resulting topic models using multiple criteria, including topic coherence (C_v), average pairwise Jaccard overlap, and cluster compactness via silhouette scores. Unlike prior work that employs LLMs primarily for post hoc topic labeling or interpretation, our method integrates an LLM directly into the preprocessing pipeline to reshape noisy input into structured, standardized summaries. While ChatGPT-based preprocessing yields lower C_v coherence scores, likely due to reduced lexical redundancy, it significantly improves topic separation, cluster quality, and topical specificity, leading to more interpretable and well-structured topic models overall. 
653 |a User generated content 
653 |a Preprocessing 
653 |a Large language models 
653 |a Clusters 
653 |a Summaries 
653 |a Multiple criterion 
653 |a Chatbots 
653 |a Data mining 
653 |a Coherence 
653 |a Redundancy 
700 1 |a Nguyen, Mien Thi Ngoc 
773 0 |t Journal of Physics: Conference Series  |g vol. 3114, no. 1 (Sep 2025), p. 012011 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3252527904/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3252527904/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch