This project classifies high-energy physics particles using the HEPMASS dataset. Leveraging PySpark and MLlib, it delivers a scalable pipeline for:
- **Distributed Processing:** Efficient handling of large datasets via HDFS/Spark RDDs
- **ML Pipeline:** Feature scaling, model training (Logistic Regression, Decision Trees), and hyperparameter tuning
- **Actionable Outputs:** Performance metrics (F1-score, accuracy), confusion matrices, and feature importance visualizations
Optimized for reproducibility, the system automates Hadoop/Spark cluster management and minimizes computational overhead.
## Features

- **Distributed Data Processing:** Uses PySpark for scalable handling of large datasets.
- **End-to-End Pipeline** (sketched in PySpark below):
  - Data loading and parsing
  - Exploratory analysis with visualizations
  - Feature scaling and preprocessing
  - Model training (Logistic Regression, Decision Tree)
  - Hyperparameter tuning
  - Model evaluation and visualization (confusion matrix, feature importance)
- **Automated Service Management:** Scripts to start/stop HDFS and Spark services.
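For orientation, here is a minimal PySpark sketch of such a pipeline. It is an illustration, not the project's actual code: the HDFS paths and the column names (`label`, `f0`..`f26`) are assumptions based on the dataset description below; the real logic lives in the scripts driven by `run_pipeline.sh`.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.appName("hepmass").getOrCreate()

# Assumed HDFS paths and column names (see the Dataset section).
train = spark.read.csv("hdfs:///hepmass/all_train.csv.gz", header=True, inferSchema=True)
test = spark.read.csv("hdfs:///hepmass/all_test.csv.gz", header=True, inferSchema=True)

# Assemble the 27 kinematic features into a vector, then standardize them.
feature_cols = [f"f{i}" for i in range(27)]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, scaler, lr])

# Small grid search over the regularization strength, scored by F1.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

model = cv.fit(train)
print("F1 on test:", evaluator.evaluate(model.transform(test)))
```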
## Tech Stack

- **PySpark:** Distributed data processing and ML.
- **Hadoop HDFS:** Storage for datasets and processed files.
- **Python Libraries:** NumPy, Pandas, Matplotlib, Seaborn.
- **Spark MLlib:** Model training and evaluation.
- **Linux/Unix:** Recommended OS for deployment.
## Dataset

The HEPMASS dataset contains simulated particle collision data. Each entry has 27 features (f0 to f26) representing kinematic properties, a mass value, and a binary label (0 or 1) indicating the particle type.
Sample Data:
| label | f0 | f1 | f2 | ... | f26 | mass |
|---|---|---|---|---|---|---|
| 0 | 0.09439 | 0.01276 | 0.91193 | ... | -1.29023 | 499.999969 |
| 1 | 0.32720 | -0.23955 | -1.5920 | ... | -0.45855 | 750 |
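As a quick sanity check on the schema and class balance, a PySpark snippet like the following can be used (file location and column names are assumptions taken from the table above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hepmass-eda").getOrCreate()

# Assumes the extracted files sit in data/ with columns label, f0..f26, mass.
df = spark.read.csv("data/all_train.csv.gz", header=True, inferSchema=True)
df.printSchema()
df.groupBy("label").count().show()   # class balance
df.select("mass").describe().show()  # summary statistics for the mass column
```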
## Prerequisites

- **Java 8+:** Required for Spark.
- **Hadoop & Spark:** Installed and configured in standalone or cluster mode.
- **Python 3.8+:** With `pip` for dependency management.
## Installation

1. **Clone the repository:**

   ```bash
   git clone https://github.com/chouaib-629/hepmassClassification.git
   ```

2. **Navigate to the project directory:**

   ```bash
   cd hepmassClassification
   ```

3. **Set up the Python environment:**

   ```bash
   ./setup_env.sh
   ```

   Optional flags:

   - `--help`: Display usage instructions and available command-line options.
   - `--version`: Display the script's version information.
4. **Download and prepare the dataset:**

   - Download the dataset:

     ```bash
     wget https://archive.ics.uci.edu/static/public/347/hepmass.zip
     ```

   - Extract the files:

     ```bash
     mkdir data
     unzip hepmass.zip -d data/
     ```

   - Organize the data folder:

     ```bash
     mv data/hepmass/* data/
     rmdir data/hepmass
     ```

   - Upload to HDFS:

     ```bash
     hdfs dfs -mkdir /hepmass
     hdfs dfs -put data/all_train.csv.gz /hepmass/
     hdfs dfs -put data/all_test.csv.gz /hepmass/
     ```
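To verify the upload, `hdfs dfs -ls /hepmass` lists the files; alternatively, a short PySpark read-back (assuming the default namenode configuration) confirms Spark can see them:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hepmass-check").getOrCreate()

# Spark decompresses .csv.gz files transparently when reading.
spark.read.csv("hdfs:///hepmass/all_train.csv.gz", header=True).show(5)
```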
## Usage

1. **Activate the virtual environment:**

   ```bash
   source penv/bin/activate
   ```

2. **Execute the end-to-end workflow:**

   ```bash
   ./run_pipeline.sh
   ```

   Optional flags:

   - `--help`: Display usage instructions and available command-line options.
   - `--version`: Display the script's version information.
   - `--enable-logs`: Enable detailed Spark logs.
   - `--no-services`: Skip starting/stopping HDFS/Spark (manual service management).
   - `--disable-safe-mode`: Force-disable HDFS safe mode.

   Example with flags:

   ```bash
   ./run_pipeline.sh --enable-logs --disable-safe-mode
   ```

3. **Deactivate the environment (when finished):**

   ```bash
   deactivate
   ```
### Optional: Save the Models Locally

Copy the models from HDFS to local storage for further use:

```bash
mkdir models
start-dfs.sh
hdfs dfs -get /hepmass/models/* models/
stop-dfs.sh
```
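If the saved artifacts are standard MLlib models, they can be reloaded with the matching `load` call. A minimal sketch, assuming a logistic regression model was saved under `models/logistic_regression` (the actual subdirectory names may differ):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegressionModel

spark = SparkSession.builder.appName("hepmass-load").getOrCreate()

# Assumed path and model type; use the matching class for other models,
# e.g. DecisionTreeClassificationModel for the decision tree.
model = LogisticRegressionModel.load("models/logistic_regression")
print(model.coefficients)  # learned weights for the 27 scaled features
```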
## Outputs

- **Preprocessed Data:** Stored in HDFS (`/hepmass/scaled_train`, `/hepmass/scaled_test`).
- **Models:** Saved to HDFS (`/hepmass/models/`).
- **Visualizations:** Generated in the `plots/` directory:
  - Class distribution
  - Feature importance
  - Confusion matrix
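For reference, a confusion-matrix heatmap like the one written to `plots/` can be produced with Seaborn; the counts below are placeholders for illustration, not the project's actual results:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a headless node
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder counts; in the pipeline these would come from something like
# predictions.groupBy("label", "prediction").count() on the test set.
cm = np.array([[9500, 500],
               [700, 9300]])

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["pred 0", "pred 1"],
            yticklabels=["true 0", "true 1"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.tight_layout()
plt.savefig("plots/confusion_matrix.png")
```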
## Contributing

Contributions are welcome! To contribute:

1. Fork the repository.

2. Create a new branch:

   ```bash
   git checkout -b feature/feature-name
   ```

3. Commit your changes:

   ```bash
   git commit -m "Add feature description"
   ```

4. Push to the branch:

   ```bash
   git push origin feature/feature-name
   ```

5. Open a pull request.
## Contact

For questions or support, please contact me.