A data science project that analyzes 2,000+ arXiv papers to uncover trending topics in quantum machine learning using NLP and LDA topic modeling. Built with Python, NLTK, Gensim, and Matplotlib, this project visualizes topic evolution from 2015 to 2025.
This project explores the evolution of research topics in quantum machine learning by:
- Filtering 50,000+ arXiv papers down to the 2,000+ with quantum-related content.
- Cleaning abstracts using NLP techniques (tokenization, lemmatization, stopword removal).
- Applying Latent Dirichlet Allocation (LDA) to identify 5 key topics.
- Visualizing topic trends over time with interactive plots.
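The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the contents of `data_processing.py`: the function names (`clean_abstract`, `make_nltk_cleaner`, `simple_tokenize`) and the rule of dropping tokens shorter than three characters are assumptions made for the example. The NLTK wiring needs the `punkt`, `wordnet`, and `stopwords` data packages downloaded via `nltk.download(...)`.

```python
import re
from typing import Callable, Iterable, List


def clean_abstract(
    text: str,
    tokenize: Callable[[str], Iterable[str]],
    lemmatize: Callable[[str], str],
    stop_words: Iterable[str],
) -> List[str]:
    """Lowercase, tokenize, drop stopwords and non-alphabetic tokens, then lemmatize."""
    stops = set(stop_words)
    tokens = tokenize(text.lower())
    return [
        lemmatize(tok)
        for tok in tokens
        # Assumption: tokens of one or two characters carry little topical signal.
        if tok.isalpha() and tok not in stops and len(tok) > 2
    ]


def make_nltk_cleaner() -> Callable[[str], List[str]]:
    """Wire clean_abstract to NLTK's tokenizer, lemmatizer, and English stopwords."""
    # Imported lazily so clean_abstract stays usable without NLTK installed.
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    stops = stopwords.words("english")
    return lambda text: clean_abstract(text, word_tokenize, lemmatizer.lemmatize, stops)


def simple_tokenize(text: str) -> List[str]:
    """Regex fallback tokenizer for quick experiments without NLTK data."""
    return re.findall(r"[a-z]+", text.lower())
```

With NLTK set up, `make_nltk_cleaner()` returns a function that maps one abstract to a list of cleaned tokens. The regex tokenizer is only a stand-in for trying the pipeline without downloading NLTK data.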
🔍 Key Insight: Identified growing trends in quantum states, algorithms, and machine learning applications, with a notable surge in research activity from 2015 to 2025.
The LDA model identified five topics in quantum machine learning research:
- Topic 1: Quantum states and entanglement.
- Topic 2: Quantum algorithms and optimization.
- Topic 3: Machine learning applications in quantum systems.
- Topic 4: Quantum error correction.
- Topic 5: Quantum cryptography and security.
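For readers who want to see how five topics like these might be produced, here is a hedged sketch of an LDA fit with Gensim. It is not the project's `topic_modeling.py`: the helper names `fit_lda` and `dominant_topic`, the `passes=10` setting, and the random seed are assumptions chosen for illustration. Only `num_topics=5` comes from this README.

```python
from typing import List, Sequence, Tuple


def fit_lda(token_lists: Sequence[List[str]], num_topics: int = 5, seed: int = 42):
    """Fit a Gensim LDA model on tokenized abstracts.

    Returns (model, dictionary, corpus). passes and seed are illustrative values.
    """
    # Imported lazily so dominant_topic below works without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(token_lists)  # token -> integer id
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]  # bag-of-words vectors
    model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=10,
        random_state=seed,
    )
    return model, dictionary, corpus


def dominant_topic(doc_topics: Sequence[Tuple[int, float]]) -> int:
    """Return the topic id with the highest probability for one document."""
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return topic_id
```

A paper's topic can then be assigned with `dominant_topic(model.get_document_topics(bow))`. Gensim numbers topics from 0, so a mapping to the "Topic 1" through "Topic 5" labels would need an offset of one, assuming the labels follow the model's order.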
The trend visualization is saved as topic_trends.png.
Follow these steps to reproduce the analysis:
- Install Dependencies:
pip install pandas nltk gensim matplotlib seaborn
The dataset used is arxiv-metadata-oai-snapshot.json, sourced from Kaggle.
- Get the dataset from Kaggle.
- Place it in the project folder.
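Once the file is in place, the filtering step might look something like the sketch below. The Kaggle snapshot stores one JSON record per line, with `title` and `abstract` fields. The keyword list and the function names are assumptions for this example, since the README does not state the exact filter used in `data_processing.py`.

```python
import json
from typing import Dict, Iterable, Iterator

# Assumption: a simple keyword match; the real filter may use categories or more terms.
KEYWORDS = ("quantum",)


def iter_records(path: str) -> Iterator[Dict]:
    """Stream records from the snapshot without loading the whole file into memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def is_quantum_related(record: Dict, keywords: Iterable[str] = KEYWORDS) -> bool:
    """True if any keyword appears in the record's title or abstract (case-insensitive)."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    return any(keyword in text for keyword in keywords)
```

For example, `[r for r in iter_records("arxiv-metadata-oai-snapshot.json") if is_quantum_related(r)]` would collect the matching records. Streaming line by line matters here because the full snapshot is large.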
To process the dataset and perform topic modeling, run the following commands:
python data_processing.py
python topic_modeling.py
python visualization.py

| File | Description |
|---|---|
| quantum_papers.csv | Filtered quantum-related papers. |
| quantum_papers_cleaned.csv | Cleaned abstracts with preprocessed text. |
| quantum_papers_with_topics.csv | Papers with assigned topics from LDA modeling. |
| topic_trends.png | Graph showing research trends over time. |
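As a rough idea of how a trend graph like topic_trends.png could be derived from per-paper topic assignments, the sketch below counts papers per topic and year, then draws one line per topic. It uses `collections.Counter` rather than a Pandas `groupby` to keep the example small. The function names and the `(year, topic)` row layout are assumptions, not the actual structure of `visualization.py` or the CSV files.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_topics_by_year(rows: Iterable[Tuple[int, int]]) -> Dict[int, Dict[int, int]]:
    """Turn (year, topic) pairs into {topic: {year: paper_count}}."""
    trends: Dict[int, Dict[int, int]] = {}
    for (year, topic), count in Counter(rows).items():
        trends.setdefault(topic, {})[year] = count
    return trends


def plot_trends(trends: Dict[int, Dict[int, int]], out_path: str = "topic_trends.png") -> None:
    """Plot one line per topic and save the figure to out_path."""
    # Imported lazily so the counting helper works without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to file without needing a display
    import matplotlib.pyplot as plt

    years = sorted({year for per_year in trends.values() for year in per_year})
    fig, ax = plt.subplots(figsize=(10, 6))
    for topic in sorted(trends):
        # Years with no papers for a topic are plotted as zero.
        counts = [trends[topic].get(year, 0) for year in years]
        ax.plot(years, counts, marker="o", label=f"Topic {topic}")
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of papers")
    ax.legend()
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

Raw counts tend to rise for every topic as arXiv grows, so dividing each year's counts by that year's total may give a clearer picture of relative topic share.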
- Python: Core programming language.
- Pandas: Data manipulation and analysis.
- NLTK: Text preprocessing (tokenization, lemmatization).
- Gensim: LDA topic modeling.
- Matplotlib/Seaborn: Data visualization.
✅ Processed and analyzed 2,000+ quantum-related papers from a dataset of 50,000+.
✅ Identified key research trends, providing insights into the future of quantum machine learning.
✅ Demonstrated proficiency in NLP, topic modeling, and data visualization, all valuable skills for data science roles.
This project is licensed under the MIT License. See the LICENSE file for details.
- Built as a portfolio project for April 2025.
- Thanks to arXiv for providing the dataset.
- Inspired by the growing intersection of quantum computing and machine learning.
📧 Email: raiyan7080@gmail.com
