Text Clustering with LLM Embeddings

Abstract

Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. This study argues that recent advancements in large language models (LLMs) have the potential to enhance this task. The research investigates how different textual embeddings, particularly those utilised in LLMs, and various clustering algorithms influence the clustering of text datasets. A series of experiments were conducted to evaluate the impact of embeddings on clustering results, the role of dimensionality reduction through summarisation, and the adjustment of model size. The findings indicate that LLM embeddings are superior at capturing subtleties in structured language. OpenAI's GPT-3.5 Turbo model yields better results in three out of five clustering metrics across most tested datasets. Most LLM embeddings show improvements in cluster purity and provide a more informative silhouette score, reflecting a refined structural understanding of text data compared to traditional methods. Among the more lightweight models, BERT demonstrates leading performance. Additionally, it was observed that increasing model dimensionality and employing summarisation techniques do not consistently enhance clustering efficiency, suggesting that these strategies require careful consideration for practical application. These results highlight a complex balance between the need for refined text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by integrating embeddings from LLMs, offering improved methodologies and suggesting new avenues for future research in various types of textual analysis.
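Cluster purity, one of the metrics named above, can be illustrated with a minimal sketch. It assumes the commonly used definition (each cluster is credited with its most frequent ground-truth label, and the credited counts are summed and divided by the number of documents), which may differ in detail from the authors' implementation. The function name `cluster_purity` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def cluster_purity(true_labels, cluster_ids):
    """Return purity in (0, 1]: fraction of items matching their cluster's majority label.

    Assumption: standard purity definition; not confirmed to be the paper's exact variant.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    if not true_labels:
        raise ValueError("at least one item is required")

    # Count how often each ground-truth label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        label_counts[cluster][label] += 1

    # Credit every cluster with the size of its majority label.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(true_labels)


# Toy data (hypothetical): six documents, two clusters.
labels = ["sport", "sport", "sport", "politics", "politics", "tech"]
clusters = [0, 0, 1, 1, 1, 1]
purity = cluster_purity(labels, clusters)
print(f"purity = {purity:.3f}")
```

In this toy case cluster 0 contains only "sport" documents, while cluster 1 mixes three labels, so only four of the six documents agree with their cluster's majority label. Purity rewards homogeneous clusters but increases trivially as the number of clusters grows, which is one reason it is usually reported alongside other metrics such as the silhouette score.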
