GitHub - Luna-CaiC/MeToo-Movement-Modeling-NLP: End-to-end unsupervised text analytics pipeline processing 390K+ #MeToo tweets with Python, including regex-based text normalization, vectorization, topic modeling (LDA/NMF), sentiment scoring (TextBlob), and temporal pattern mining. Provides interpretable topic clusters and visualization of discourse trajectories during the movement.

Project Overview

This project performs a complete data-processing and exploratory analysis workflow on a dataset of social media posts related to the #MeToo movement. It includes raw-text cleaning, structured dataset generation, descriptive analysis, and insight extraction. All components are implemented in Python using Jupyter notebooks.

The repository contains three primary artifacts:

Text Cleaning.ipynb – end-to-end text preprocessing pipeline
cleaned_metoo_tweets.csv – the processed dataset ready for downstream analysis
Data analysis and results.ipynb – exploratory data analysis (EDA) and final insights

1. Dataset Description

The dataset consists of tweets tagged with #MeToo. Each record includes:

id – unique tweet identifier
created – timestamp
cleaned_text – normalized and cleaned version of the tweet content (generated by this project)

The raw text underwent multiple preprocessing stages before being saved as cleaned_metoo_tweets.csv.

2. Text Cleaning Pipeline

Implemented in Text Cleaning.ipynb, the pipeline standardizes and filters textual data to prepare it for analysis.

Key steps include:

2.1 Load & Inspect Data

Imported raw tweet dataset using pandas.
Removed records with missing or invalid text.
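The filtering step above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column name `text`, the helper `drop_invalid_text`, and the sample rows are all hypothetical, and "invalid" is assumed to mean a value that is not a string or is blank.

```python
import pandas as pd

# Hypothetical stand-in for the raw tweets; the real column names may differ.
raw = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "created": [
            "2017-10-16 08:00:00",
            "2017-10-16 09:30:00",
            "2017-10-17 11:15:00",
            "2017-10-17 12:00:00",
        ],
        "text": ["Me too. #MeToo", None, "   ", 12345],
    }
)


def drop_invalid_text(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Keep only rows whose text is a non-blank string (assumed definition of valid)."""
    is_string = df[text_col].map(lambda value: isinstance(value, str))
    non_blank = df[text_col].astype(str).str.strip().ne("")
    return df[is_string & non_blank].reset_index(drop=True)


valid = drop_invalid_text(raw)
print(raw.shape, "->", valid.shape)
```

Only the first sample row survives here, because the others are missing, blank, or not text.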

2.2 Text Normalization

The following operations were applied:

Lowercasing all text
Removing URLs, hashtags, mentions, punctuation, and digits
Removing extraneous whitespace
Stripping non-alphabetic characters
Dropping rows where the cleaned text became empty
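One way to express these normalization steps with the `re` module is sketched below. The patterns and the function name `clean_tweet` are assumptions made for illustration; in particular, this sketch removes each hashtag and mention as a whole token, which may differ from what the notebook does.

```python
import re

# Illustrative patterns (assumptions, not taken from the notebook).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[@#]\w+")        # whole hashtags and mentions
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # punctuation, digits, other symbols
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase, strip URLs/hashtags/mentions/non-letters, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


samples = [
    "RT @user: I believe her!! #MeToo https://t.co/abc123 2017",
    "#MeToo !!!",
]
cleaned = [clean_tweet(sample) for sample in samples]

# Drop entries whose cleaned text became empty.
kept = [text for text in cleaned if text]
print(kept)
```

URLs are removed before the non-alphabetic filter so that fragments of a link do not survive as stray letters.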

2.3 Output Generation

The cleaned dataset was saved as:

cleaned_metoo_tweets.csv

This file serves as the input for subsequent analysis.
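A minimal sketch of the save-and-reload round trip is shown below. It writes to an in-memory buffer instead of cleaned_metoo_tweets.csv so that it runs without touching the repository files; the sample rows are hypothetical, and passing `index=False` is an assumption about how the file was written.

```python
import io

import pandas as pd

# Hypothetical cleaned records using the three documented columns.
cleaned = pd.DataFrame(
    {
        "id": [1, 2],
        "created": ["2017-10-16 08:00:00", "2017-10-17 11:15:00"],
        "cleaned_text": ["me too", "i believe her"],
    }
)

# index=False keeps the pandas row index out of the file, so the
# CSV contains exactly the id, created, and cleaned_text columns.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Reading the same content back mirrors what the analysis notebook would load.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns), len(restored))
```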

3. Exploratory Data Analysis (EDA)

Conducted in Data analysis and results.ipynb.

3.1 Data Integrity Checks

Verified timestamp format and detected invalid date entries.
Ensured no missing cleaned text fields.
Confirmed dataset size after cleaning.
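These checks might look like the sketch below, which assumes the documented column names `created` and `cleaned_text` and uses made-up rows. Parsing with `errors="coerce"` turns unparseable timestamps into `NaT`, which makes them easy to count.

```python
import pandas as pd

# Hypothetical sample with one deliberately broken timestamp.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "created": ["2017-10-16 08:00:00", "not a date", "2017-10-17 11:15:00"],
        "cleaned_text": ["me too", "i believe her", "time is up"],
    }
)

# Invalid date strings become NaT instead of raising an error.
parsed = pd.to_datetime(df["created"], errors="coerce")
invalid_dates = int(parsed.isna().sum())

# Count missing and blank cleaned text separately.
missing_text = int(df["cleaned_text"].isna().sum())
blank_text = int(df["cleaned_text"].fillna("").str.strip().eq("").sum())

print(f"rows={len(df)} invalid_dates={invalid_dates} "
      f"missing_text={missing_text} blank_text={blank_text}")
```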

3.2 Temporal Analysis

Examined trends using the created timestamp.
Identified activity spikes, potential campaign peaks, and unusual data gaps.
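A sketch of how daily activity, spikes, and gaps could be derived from the `created` timestamps follows. The sample timestamps are invented, and the spike rule (mean plus one standard deviation) is a hypothetical threshold chosen for illustration, not the notebook's criterion.

```python
import pandas as pd

# Hypothetical timestamps; the real analysis would use the created column.
created = pd.to_datetime(
    pd.Series(
        [
            "2017-10-15 10:00",
            "2017-10-16 09:00",
            "2017-10-16 12:00",
            "2017-10-16 18:00",
            "2017-10-16 21:00",
            "2017-10-18 08:00",
        ]
    )
)

# Tweets per calendar day, in chronological order.
daily = created.dt.floor("D").value_counts().sort_index()

# Reindex over the full date range so days without tweets show up as 0.
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(full_range, fill_value=0)

# Days with no activity are candidate data gaps.
gaps = daily[daily == 0]

# Assumed spike rule: more than one standard deviation above the mean.
threshold = daily.mean() + daily.std()
spikes = daily[daily > threshold]

print(daily)
print("gaps:", list(gaps.index.date))
print("spikes:", list(spikes.index.date))
```

Reindexing matters because `value_counts` only reports days that contain at least one tweet, so gaps would otherwise be invisible.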

3.3 Textual Insights

Token analysis of the cleaned text
Preliminary keyword trends
Distribution of tweet lengths and common vocabulary patterns

(Note: The depth of this analysis depends on the dataset and on the code available in the notebook.)
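Token counts and tweet-length distributions of this kind can be computed as in the sketch below. It assumes whitespace tokenization of `cleaned_text`, which is reasonable after the normalization described earlier but is not confirmed by the notebook; the sample texts are invented.

```python
from collections import Counter

import pandas as pd

# Hypothetical cleaned tweets.
texts = pd.Series(["me too", "i believe her", "me too me too"])

# Whitespace tokenization (an assumption about the notebook's approach).
tokens_per_tweet = texts.str.split()

# Tweet length measured in tokens.
length_in_tokens = tokens_per_tweet.str.len()

# Vocabulary frequencies across the whole sample.
vocab = Counter(token for tokens in tokens_per_tweet for token in tokens)

print(length_in_tokens.describe())
print(vocab.most_common(3))
```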

4. Technologies Used

Python 3
Jupyter Notebook
pandas – data manipulation
re – text normalization using regular expressions
matplotlib / seaborn (if present in the notebook) – visualization
NumPy – numeric utilities

5. File Structure

.
├── Text Cleaning.ipynb
├── Data analysis and results.ipynb
└── cleaned_metoo_tweets.csv

File Roles

Text Cleaning.ipynb → Creates cleaned_metoo_tweets.csv
cleaned_metoo_tweets.csv → Final cleaned dataset
Data analysis and results.ipynb → Performs EDA on the cleaned dataset

6. How to Run the Project

Prerequisites

Install required libraries:

pip install pandas numpy matplotlib

Execution Steps

Open Text Cleaning.ipynb to regenerate the cleaned dataset (optional).
Open Data analysis and results.ipynb to run the full analysis pipeline.

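Because cleaned_metoo_tweets.csv is included in the repository, each notebook can be run on its own. To reproduce the full workflow from scratch, run Text Cleaning.ipynb first and then Data analysis and results.ipynb.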

7. Project Summary

This project demonstrates a clean and modular approach to text-based social data analysis:

Robust preprocessing ensures high-quality normalized textual data
Clean dataset enables reliable downstream analytics
EDA reveals temporal patterns and content-level trends in the #MeToo conversation

The workflow is reproducible, extensible, and designed to support further sentiment analysis, topic modeling, or deep learning applications.
