Product Clustering and Topic Assignment

This project clusters products by their titles and assigns a topic to each cluster. The pipeline covers title preprocessing, sentence embedding, dimensionality reduction, density-based clustering, and LLM-based topic assignment.

Overview

The goal is to group products into meaningful clusters based on their titles and to assign a topic to each cluster. An initial version used BERT embeddings, PCA and t-SNE for dimensionality reduction, and DBSCAN for clustering, but it produced noisy results. The current version uses SBERT for embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering, which gives better-defined clusters. Topics are assigned to each cluster using Llama-3-8b with prompt engineering.

Installation

To get started, clone this repository and install the necessary dependencies:

git clone https://github.com/your-username/product-clustering.git
cd product-clustering
pip install -r requirements.txt

Usage

Place your CSV file containing product titles in the data directory.

Run the preprocessing and embedding script:

python preprocess_and_embed.py --input data/products.csv --output data/embeddings.pkl

Perform dimensionality reduction and clustering:

python cluster.py --input data/embeddings.pkl --output data/clusters.pkl

Assign topics to the clusters:

python topic_assignment.py --input data/clusters.pkl --output data/topics.csv

Project Details

Process overview

The first version of the pipeline used BERT embeddings, PCA and t-SNE for dimensionality reduction, and DBSCAN for clustering. This approach resulted in noisy and poorly defined clusters.

To improve the results, the project switched to SBERT embeddings, which are a better fit for the Retrieval-Augmented Generation (RAG) systems often employed in economic chatbots. UMAP replaced t-SNE for dimensionality reduction because it preserves both global and local data structure and gave better results on this data. HDBSCAN was chosen for clustering because it is robust to noise and can detect clusters of varying density.

Llama-3-8b is used for topic assignment. Prompt engineering ensures that clusters with similar concepts receive the same topic, which makes the final clustering and topic assignment more coherent.

Data Preprocessing

Product titles are tokenized and cleaned.
The cleaned titles are embedded with SBERT, chosen for its alignment with the RAG systems often used in economic chatbots.
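
The two steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `clean_title` and `embed_titles`, the exact cleaning rules, and the model name `all-MiniLM-L6-v2` are assumptions, since the README does not say which SBERT checkpoint or cleaning rules are used.

```python
import re
import unicodedata


def clean_title(title: str) -> str:
    """Normalize one product title (hypothetical cleaning rules)."""
    # Fold Unicode variants (full-width characters, ligatures) and lowercase.
    text = unicodedata.normalize("NFKC", title).lower()
    # Drop punctuation except apostrophes and hyphens, which carry meaning
    # in titles such as "men's t-shirt".
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Collapse the whitespace runs left behind by the previous step.
    return re.sub(r"\s+", " ", text).strip()


def embed_titles(titles, model_name="all-MiniLM-L6-v2"):
    """Embed cleaned titles with SBERT; model_name is an assumed default."""
    # Imported lazily so clean_title can be used without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    cleaned = [clean_title(t) for t in titles]
    # Unit-length vectors make cosine similarity a plain dot product.
    return model.encode(cleaned, normalize_embeddings=True)
```

Calling `embed_titles(["Men's Cotton T-Shirt (Blue)"])` would return one embedding row per title, provided the `sentence-transformers` package is installed.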

Dimensionality Reduction

UMAP reduces the dimensionality of the embeddings while preserving both global and local structure; in this project it gave better results than t-SNE.
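
As a rough sketch of this step, the snippet below wraps `umap.UMAP` from the `umap-learn` package. The helper names and all parameter values (15 neighbors, 5 output dimensions, cosine metric, `min_dist=0.0`, fixed seed) are illustrative assumptions, not settings taken from this project.

```python
def umap_settings(n_samples, n_neighbors=15, n_components=5):
    """Return UMAP keyword arguments for n_samples rows (assumed defaults)."""
    if n_samples < 2:
        raise ValueError("need at least 2 samples to build a neighbor graph")
    return {
        # A point cannot have more neighbors than there are other points,
        # so clamp the value for small datasets.
        "n_neighbors": min(n_neighbors, n_samples - 1),
        "n_components": n_components,
        # Cosine distance is a common choice for sentence embeddings.
        "metric": "cosine",
        # Tightly packed output tends to suit density-based clustering.
        "min_dist": 0.0,
        # Fixed seed for reproducible layouts.
        "random_state": 42,
    }


def reduce_embeddings(embeddings, **overrides):
    """Project embeddings to a low-dimensional space with UMAP."""
    import umap  # provided by the umap-learn package

    settings = {**umap_settings(len(embeddings)), **overrides}
    return umap.UMAP(**settings).fit_transform(embeddings)
```

Keeping more than two output dimensions (here 5) is a common choice when the reduced vectors feed a clustering step rather than a plot; a separate 2D projection can be made for visualization.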

Clustering

HDBSCAN is chosen for its ability to handle noise and find clusters of varying density.
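
A minimal sketch of the clustering step, assuming the `hdbscan` package and an illustrative `min_cluster_size` of 10 (the project's actual value is not stated). HDBSCAN labels noise points `-1`, so the hypothetical `summarize_labels` helper reports the noise fraction alongside the cluster sizes, which gives a quick way to compare runs.

```python
from collections import Counter


def cluster_points(points, min_cluster_size=10):
    """Cluster reduced embeddings with HDBSCAN; returns one label per point."""
    import hdbscan

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,  # assumed value, tune per dataset
        metric="euclidean",
    )
    return clusterer.fit_predict(points)


def summarize_labels(labels):
    """Count clusters and measure how many points were labeled noise (-1)."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(-1, 0)
    total = noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / total if total else 0.0,
        "sizes": dict(counts),
    }
```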

Topic Assignment

Topics are assigned using Llama-3-8b with prompt engineering to ensure clusters with similar concepts receive the same topic.
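
One way to get similar clusters to share a topic is to show the model the topics already in use and ask it to reuse one when it fits. The sketch below illustrates that idea only; the prompt wording and the function names are assumptions, not the project's actual prompt. The `generate` argument stands in for any callable that sends a prompt to Llama-3-8b and returns its text reply.

```python
def build_topic_prompt(titles, known_topics, max_titles=20):
    """Build a labeling prompt that lists the topics assigned so far."""
    sample = "\n".join(f"- {t}" for t in titles[:max_titles])
    known = ", ".join(known_topics) if known_topics else "(none yet)"
    return (
        "You label clusters of product titles with a short topic.\n"
        f"Topics already in use: {known}\n"
        "If one of these topics fits, reuse it exactly; "
        "otherwise propose a new one.\n"
        f"Product titles:\n{sample}\n"
        "Answer with the topic only."
    )


def assign_topics(clusters, generate):
    """Map cluster id -> topic.

    clusters: dict of cluster id -> list of product titles.
    generate: hypothetical callable (prompt -> reply text) wrapping the LLM.
    """
    topics, known = {}, []
    for cluster_id, titles in clusters.items():
        if cluster_id == -1:
            continue  # skip the HDBSCAN noise bucket
        topic = generate(build_topic_prompt(titles, known)).strip()
        topics[cluster_id] = topic
        if topic not in known:
            known.append(topic)
    return topics
```

Because each prompt sees the topics assigned so far, the result depends on the order in which clusters are processed; processing the largest clusters first is one reasonable choice.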

Results

Compared with the initial BERT, PCA/t-SNE, and DBSCAN pipeline, the combination of SBERT embeddings, UMAP, and HDBSCAN produces less noisy, better-defined clusters and more coherent topics.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or additions.

License

This project is licensed under the MIT License.

QuangNguyen2910/ProductTitleClustering
