This project performs a complete data-processing and exploratory analysis workflow on a dataset of social media posts related to the #MeToo movement. It includes raw-text cleaning, structured dataset generation, descriptive analysis, and insight extraction. All components are implemented in Python using Jupyter notebooks.
The repository contains three primary artifacts:
- Text Cleaning.ipynb – end-to-end text preprocessing pipeline
- cleaned_metoo_tweets.csv – the processed dataset ready for downstream analysis
- Data analysis and results.ipynb – exploratory data analysis (EDA) and final insights
The dataset consists of tweets tagged with #MeToo. Each record includes:
- id – unique tweet identifier
- created – timestamp
- cleaned_text – normalized and cleaned version of the tweet content (generated by this project)
The raw text underwent multiple preprocessing stages before being saved as cleaned_metoo_tweets.csv.
Implemented in Text Cleaning.ipynb, the pipeline standardizes and filters textual data to prepare it for analysis.
Key steps include:
- Imported raw tweet dataset using pandas.
- Removed records with missing or invalid text.
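A minimal sketch of this loading step, for illustration only. The inline CSV stands in for the raw export, and the raw column name `text` is an assumption; the notebook may use a different file and column name.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw tweet export. The real file name and the
# raw column name ("text") are assumptions, not taken from the notebook.
raw_csv = io.StringIO(
    "id,created,text\n"
    "1,2018-01-01 10:00:00,Me too. It happened to me #MeToo\n"
    "2,2018-01-01 11:30:00,\n"
    "3,2018-01-02 09:15:00,   \n"
    "4,2018-01-02 12:00:00,Thank you for speaking up @someone\n"
)

df = pd.read_csv(raw_csv)

# Drop rows whose text is missing (NaN) ...
df = df.dropna(subset=["text"])
# ... and rows whose text contains only whitespace.
df = df[df["text"].str.strip() != ""].reset_index(drop=True)

print(df[["id", "text"]])
```

Here "invalid" is interpreted as empty or whitespace-only text; other validity rules (for example, minimum length) would be added as further filters.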
The following operations were applied:
- Lowercasing all text
- Removing URLs, hashtags, mentions, punctuation, and digits
- Removing extraneous whitespace
- Stripping non-alphabetic characters
- Dropping rows where the cleaned text became empty
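The operations above could be sketched with the standard `re` module as follows. The function name `clean_tweet` and the exact patterns are illustrative assumptions, not the notebook's actual code; in particular, this sketch removes hashtags and mentions entirely rather than keeping the word after `#` or `@`.

```python
import re

# Illustrative patterns; the notebook's actual expressions may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[#@]\w+")          # hashtags and mentions, removed whole
NON_ALPHA_RE = re.compile(r"[^a-z\s]")   # punctuation, digits, other symbols
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, tags, and non-alphabetic characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    # Collapse runs of whitespace and trim the ends.
    return SPACE_RE.sub(" ", text).strip()


tweets = [
    "Check THIS out!! https://t.co/abc123 #MeToo @user 2018",
    "#MeToo https://t.co/x",
]
cleaned = [clean_tweet(t) for t in tweets]
# Drop entries whose cleaned text became empty.
cleaned = [c for c in cleaned if c]
print(cleaned)
```

Replacing removed characters with a space rather than nothing keeps neighbouring words from being glued together, at the cost of splitting contractions such as "don't" into two tokens.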
The cleaned dataset was saved as:
cleaned_metoo_tweets.csv
This file serves as the input for subsequent analysis.
The analysis is conducted in Data analysis and results.ipynb.
- Verified timestamp format and detected invalid date entries.
- Ensured no missing cleaned text fields.
- Confirmed dataset size after cleaning.
- Examined trends using the created timestamp.
- Identified activity spikes, potential campaign peaks, and unusual data gaps.
- Token analysis of the cleaned text
- Preliminary keyword trends
- Distribution of tweet lengths and common vocabulary patterns
(Note: the depth of this analysis depends on the dataset and on the code available in the notebook.)
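The checks listed above might look roughly like the following sketch. It uses a small invented DataFrame with the same three columns as cleaned_metoo_tweets.csv; the rows, the variable names, and the choice of daily resampling are assumptions for illustration, not the notebook's actual code.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for cleaned_metoo_tweets.csv: same columns, invented rows.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "created": [
        "2018-01-01 10:00:00",
        "2018-01-02 09:00:00",
        "not a date",
        "2018-01-04 08:00:00",
        "2018-01-04 13:00:00",
        "2018-01-04 21:00:00",
    ],
    "cleaned_text": [
        "me too",
        "thank you for speaking up",
        "so brave",
        "me too",
        "me too me too",
        "enough is enough",
    ],
})

# Validation: unparseable timestamps become NaT instead of raising.
df["created"] = pd.to_datetime(df["created"], errors="coerce")
invalid_dates = int(df["created"].isna().sum())
missing_text = int(df["cleaned_text"].isna().sum())
print(f"rows={len(df)} invalid_dates={invalid_dates} missing_text={missing_text}")

# Temporal trends: tweets per calendar day; days without tweets count as 0.
valid = df.dropna(subset=["created"])
daily = valid.set_index("created")["id"].resample("D").count()
gap_days = daily[daily == 0]   # days with no tweets at all
peak_day = daily.idxmax()      # busiest day in the sample

# Text analysis: token frequencies and tweet length measured in tokens.
token_counts = Counter(
    tok for text in df["cleaned_text"] for tok in text.split()
)
df["n_tokens"] = df["cleaned_text"].str.split().str.len()
print(token_counts.most_common(3))
```

On the real dataset, `daily` can be plotted directly (for example with `daily.plot()`) to inspect spikes and gaps visually.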
- Python 3
- Jupyter Notebook
- pandas – data manipulation
- re – text normalization using regular expressions
- matplotlib / seaborn (if present in the notebook) – visualization
- NumPy – numeric utilities
.
├── Text Cleaning.ipynb
├── Data analysis and results.ipynb
└── cleaned_metoo_tweets.csv
- Text Cleaning.ipynb → Creates cleaned_metoo_tweets.csv
- cleaned_metoo_tweets.csv → Final cleaned dataset
- Data analysis and results.ipynb → Performs EDA on the cleaned dataset
Install required libraries:
pip install pandas numpy matplotlib
- Open Text Cleaning.ipynb to regenerate the cleaned dataset (optional).
- Open Data analysis and results.ipynb to run the full analysis pipeline.
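Because cleaned_metoo_tweets.csv is included in the repository, the analysis notebook can be run without first re-running the cleaning notebook. To reproduce everything from scratch, run Text Cleaning.ipynb first, then Data analysis and results.ipynb.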
This project demonstrates a clean and modular approach to text-based social data analysis:
- Robust preprocessing ensures high-quality normalized textual data
- Clean dataset enables reliable downstream analytics
- EDA reveals temporal patterns and content-level trends in the #MeToo conversation
The workflow is reproducible, extensible, and designed to support further sentiment analysis, topic modeling, or deep learning applications.