End-to-end unsupervised text analytics pipeline processing 390K+ #MeToo tweets with Python, including regex-based text normalization, vectorization, topic modeling (LDA/NMF), sentiment scoring (TextBlob), and temporal pattern mining. Provides interpretable topic clusters and visualization of discourse trajectories during the movement.

Luna-CaiC/MeToo-Movement-Modeling-NLP

Project Overview

This project performs a complete data-processing and exploratory analysis workflow on a dataset of social media posts related to the #MeToo movement. It includes raw-text cleaning, structured dataset generation, descriptive analysis, and insight extraction. All components are implemented in Python using Jupyter notebooks.

The repository contains three primary artifacts:

  1. Text Cleaning.ipynb – end-to-end text preprocessing pipeline
  2. cleaned_metoo_tweets.csv – the processed dataset ready for downstream analysis
  3. Data analysis and results.ipynb – exploratory data analysis (EDA) and final insights

1. Dataset Description

The dataset consists of tweets tagged with #MeToo. Each record includes:

  • id – unique tweet identifier
  • created – timestamp
  • cleaned_text – normalized and cleaned version of the tweet content (generated by this project)

The raw text underwent multiple preprocessing stages before being saved as cleaned_metoo_tweets.csv.


2. Text Cleaning Pipeline

Implemented in Text Cleaning.ipynb, the pipeline standardizes and filters textual data to prepare it for analysis.

Key steps include:

2.1 Load & Inspect Data

  • Imported raw tweet dataset using pandas.
  • Removed records with missing or invalid text.

2.2 Text Normalization

The following operations were applied:

  • Lowercasing all text
  • Removing URLs, hashtags, mentions, punctuation, digits
  • Removing extraneous whitespace
  • Stripping non-alphabetic characters
  • Dropping rows where the cleaned text became empty
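
The normalization steps listed above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_text`, the specific regular expressions, and the choice to drop whole hashtag and mention tokens (rather than only the `#` and `@` symbols) are all assumptions.

```python
import re

# Hypothetical patterns; the notebook may use different expressions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[@#]\w+")          # assumed: drop whole mentions/hashtags
NON_ALPHA_RE = re.compile(r"[^a-z\s]")   # punctuation, digits, other symbols
WS_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase, strip URLs/mentions/hashtags/non-letters, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


raw = [
    "Thank you @someone for speaking up! #MeToo https://t.co/abc123",
    "#MeToo 2017!!!",
]
# Drop entries whose cleaned text is empty, mirroring the last step above.
cleaned = [c for c in (clean_text(t) for t in raw) if c]
```

Removing URLs before stripping non-alphabetic characters matters here: if punctuation were removed first, URL fragments such as `https` and `tco` would survive as spurious tokens.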

2.3 Output Generation

The cleaned dataset was saved as:

cleaned_metoo_tweets.csv

This file serves as the input for subsequent analysis.


3. Exploratory Data Analysis (EDA)

Conducted in Data analysis and results.ipynb.

3.1 Data Integrity Checks

  • Verified timestamp format and detected invalid date entries.
  • Ensured no missing cleaned text fields.
  • Confirmed dataset size after cleaning.
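
As a hedged sketch of what these checks might look like in pandas, the snippet below uses a small in-memory DataFrame with the three columns described in the Dataset Description. The sample rows and variable names are invented for illustration; the notebook may perform the checks differently.

```python
import pandas as pd

# Toy stand-in for cleaned_metoo_tweets.csv (values are made up).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "created": ["2017-10-16 08:00:00", "not a date", "2017-10-17 09:30:00"],
    "cleaned_text": ["me too", "thank you", None],
})

# errors="coerce" turns unparseable timestamps into NaT so they can be counted.
parsed = pd.to_datetime(df["created"], errors="coerce")
n_bad_dates = int(parsed.isna().sum())

# Count missing cleaned-text fields and report the dataset size.
n_missing_text = int(df["cleaned_text"].isna().sum())
n_rows = len(df)
```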

3.2 Temporal Analysis

  • Examined trends using the created timestamp.
  • Identified activity spikes, potential campaign peaks, and unusual data gaps.
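
One way to surface spikes and gaps, assuming the `created` column has already been parsed to datetimes, is to resample tweet counts per day. The timestamps below are invented, and daily granularity is an assumption; the notebook may aggregate at a different frequency.

```python
import pandas as pd

# Invented timestamps standing in for the parsed `created` column.
timestamps = pd.DatetimeIndex([
    "2017-10-15 23:10:00",
    "2017-10-16 08:00:00",
    "2017-10-16 12:45:00",
    "2017-10-16 19:20:00",
    "2017-10-18 07:05:00",
])

# One row per tweet, then count tweets per calendar day.
daily = pd.Series(1, index=timestamps).resample("D").sum()

peak_day = daily.idxmax()             # day with the most tweets
gap_days = daily[daily == 0].index    # days with no tweets at all
```

Resampling fills in days that have no tweets with a count of zero, which is what makes gaps visible; a plain group-by on the date would silently omit them.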

3.3 Textual Insights

  • Token analysis of the cleaned text
  • Preliminary keyword trends
  • Distribution of tweet lengths and common vocabulary patterns
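
A minimal sketch of token counting and tweet-length measurement on already-cleaned text is shown below. It assumes whitespace tokenization, which is reasonable only because the cleaning step leaves lowercase alphabetic words separated by single spaces; the sample texts are invented.

```python
from collections import Counter

# Invented examples of values from the cleaned_text column.
cleaned_texts = [
    "me too",
    "thank you for speaking up",
    "me too thank you",
]

# Whitespace tokenization (assumed; the notebook may tokenize differently).
tokens_per_tweet = [text.split() for text in cleaned_texts]

# Vocabulary frequencies across all tweets.
vocab = Counter(tok for toks in tokens_per_tweet for tok in toks)

# Tweet length measured in tokens.
lengths = [len(toks) for toks in tokens_per_tweet]
```

`vocab.most_common(n)` then gives the `n` most frequent tokens, and `lengths` can be passed to a histogram to inspect the length distribution.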

(Note: The depth of each analysis depends on the dataset and on the code available in the notebook.)


4. Technologies Used

  • Python 3
  • Jupyter Notebook
  • pandas – data manipulation
  • re – text normalization using regular expressions
  • matplotlib / seaborn (if present in the notebook) – visualization
  • NumPy – numeric utilities

5. File Structure

.
├── Text Cleaning.ipynb
├── Data analysis and results.ipynb
└── cleaned_metoo_tweets.csv

File Roles

  • Text Cleaning.ipynb → Creates cleaned_metoo_tweets.csv
  • cleaned_metoo_tweets.csv → Final cleaned dataset
  • Data analysis and results.ipynb → Performs EDA on the cleaned dataset

6. How to Run the Project

Prerequisites

Install required libraries:

pip install pandas numpy matplotlib

Execution Steps

  1. Open Text Cleaning.ipynb to regenerate the cleaned dataset (optional).
  2. Open Data analysis and results.ipynb to run the full analysis pipeline.

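Because cleaned_metoo_tweets.csv is included in the repository, the analysis notebook can be run on its own. To regenerate the dataset first, run the two notebooks in the order listed above.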


7. Project Summary

This project demonstrates a clean and modular approach to text-based social data analysis:

  • Robust preprocessing ensures high-quality normalized textual data
  • Clean dataset enables reliable downstream analytics
  • EDA reveals temporal patterns and content-level trends in the #MeToo conversation

The workflow is reproducible, extensible, and designed to support further sentiment analysis, topic modeling, or deep learning applications.
