This project classifies high-energy physics particles using the HEPMASS dataset. Leveraging PySpark and MLlib, it delivers a scalable pipeline for:
- **Distributed Processing:** Efficient handling of large datasets via HDFS/Spark RDDs
- **ML Pipeline:** Feature scaling, model training (Logistic Regression, Decision Trees), and hyperparameter tuning
- **Actionable Outputs:** Performance metrics (F1-score, accuracy), confusion matrices, and feature importance visualizations
Optimized for reproducibility, the system automates Hadoop/Spark cluster management and minimizes computational overhead.
## Features

- **Distributed Data Processing:** Uses PySpark for scalable handling of large datasets.
- **End-to-End Pipeline** (sketched in PySpark below):
  - Data loading and parsing
  - Exploratory analysis with visualizations
  - Feature scaling and preprocessing
  - Model training (Logistic Regression, Decision Tree)
  - Hyperparameter tuning
  - Model evaluation and visualization (confusion matrix, feature importance)
- **Automated Service Management:** Scripts to start/stop HDFS and Spark services.
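For orientation, here is a minimal PySpark sketch of such a pipeline. It is an illustration, not the project's actual code: the HDFS paths and the column names (`label`, `f0`..`f26`) are assumptions based on the dataset description below; the real logic lives in the scripts driven by `run_pipeline.sh`.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.appName("hepmass").getOrCreate()

# Assumed HDFS paths and column names (see the Dataset section).
train = spark.read.csv("hdfs:///hepmass/all_train.csv.gz", header=True, inferSchema=True)
test = spark.read.csv("hdfs:///hepmass/all_test.csv.gz", header=True, inferSchema=True)

# Assemble the 27 kinematic features into a vector, then standardize them.
feature_cols = [f"f{i}" for i in range(27)]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, scaler, lr])

# Small grid search over the regularization strength, scored by F1.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

model = cv.fit(train)
print("F1 on test:", evaluator.evaluate(model.transform(test)))
```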
## Tech Stack

- **PySpark:** Distributed data processing and ML.
- **Hadoop HDFS:** Storage for datasets and processed files.
- **Python Libraries:** NumPy, Pandas, Matplotlib, Seaborn.
- **Spark MLlib:** Model training and evaluation.
- **Linux/Unix:** Recommended OS for deployment.
## Dataset

The HEPMASS dataset contains simulated particle collision data. Each entry has 27 features (f0 to f26) representing kinematic properties, a mass value, and a binary label (0 or 1) indicating the particle type.
Sample Data:
| label | f0 | f1 | f2 | ... | f26 | mass |
|---|---|---|---|---|---|---|
| 0 | 0.09439 | 0.01276 | 0.91193 | ... | -1.29023 | 499.999969 |
| 1 | 0.32720 | -0.23955 | -1.5920 | ... | -0.45855 | 750 |
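As a quick sanity check on the schema and class balance, a PySpark snippet like the following can be used (file location and column names are assumptions taken from the table above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hepmass-eda").getOrCreate()

# Assumes the extracted files sit in data/ with columns label, f0..f26, mass.
df = spark.read.csv("data/all_train.csv.gz", header=True, inferSchema=True)
df.printSchema()
df.groupBy("label").count().show()   # class balance
df.select("mass").describe().show()  # summary statistics for the mass column
```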
## Prerequisites

- **Java 8+:** Required for Spark.
- **Hadoop & Spark:** Installed and configured in standalone or cluster mode.
- **Python 3.8+:** With `pip` for dependency management.
## Installation

1. **Clone the repository:**

   ```bash
   git clone https://github.com/chouaib-629/hepmassClassification.git
   ```

2. **Navigate to the project directory:**

   ```bash
   cd hepmassClassification
   ```

3. **Set up the Python environment:**

   ```bash
   ./setup_env.sh
   ```

   Optional flags:

   - `--help`: Display usage instructions and available command-line options.
   - `--version`: Display the script's version information.
4. **Download and prepare the dataset:**

   - Download the dataset:

     ```bash
     wget https://archive.ics.uci.edu/static/public/347/hepmass.zip
     ```

   - Extract the files:

     ```bash
     mkdir data
     unzip hepmass.zip -d data/
     ```

   - Organize the data folder:

     ```bash
     mv data/hepmass/* data/
     rmdir data/hepmass
     ```

   - Upload to HDFS:

     ```bash
     hdfs dfs -mkdir /hepmass
     hdfs dfs -put data/all_train.csv.gz /hepmass/
     hdfs dfs -put data/all_test.csv.gz /hepmass/
     ```
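To verify the upload, `hdfs dfs -ls /hepmass` lists the files; alternatively, a short PySpark read-back (assuming the default namenode configuration) confirms Spark can see them:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hepmass-check").getOrCreate()

# Spark decompresses .csv.gz files transparently when reading.
spark.read.csv("hdfs:///hepmass/all_train.csv.gz", header=True).show(5)
```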
## Usage

1. **Activate the virtual environment:**

   ```bash
   source penv/bin/activate
   ```

2. **Execute the end-to-end workflow:**

   ```bash
   ./run_pipeline.sh
   ```

   Optional flags:

   - `--help`: Display usage instructions and available command-line options.
   - `--version`: Display the script's version information.
   - `--enable-logs`: Enable detailed Spark logs.
   - `--no-services`: Skip starting/stopping HDFS/Spark (manual service management).
   - `--disable-safe-mode`: Force-disable HDFS safe mode.

   Example with flags:

   ```bash
   ./run_pipeline.sh --enable-logs --disable-safe-mode
   ```

3. **Deactivate the environment (when finished):**

   ```bash
   deactivate
   ```
### Optional: Save the Models Locally

Copy the models from HDFS to local storage for further use:

```bash
mkdir models
start-dfs.sh
hdfs dfs -get /hepmass/models/* models/
stop-dfs.sh
```
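If the saved artifacts are standard MLlib models, they can be reloaded with the matching `load` call. A minimal sketch, assuming a logistic regression model was saved under `models/logistic_regression` (the actual subdirectory names may differ):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegressionModel

spark = SparkSession.builder.appName("hepmass-load").getOrCreate()

# Assumed path and model type; use the matching class for other models,
# e.g. DecisionTreeClassificationModel for the decision tree.
model = LogisticRegressionModel.load("models/logistic_regression")
print(model.coefficients)  # learned weights for the 27 scaled features
```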
## Outputs

- **Preprocessed Data:** Stored in HDFS (`/hepmass/scaled_train`, `/hepmass/scaled_test`).
- **Models:** Saved to HDFS (`/hepmass/models/`).
- **Visualizations:** Generated in the `plots/` directory:
  - Class distribution
  - Feature importance
  - Confusion matrix
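For reference, a confusion-matrix heatmap like the one written to `plots/` can be produced with Seaborn; the counts below are placeholders for illustration, not the project's actual results:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a headless node
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder counts; in the pipeline these would come from something like
# predictions.groupBy("label", "prediction").count() on the test set.
cm = np.array([[9500, 500],
               [700, 9300]])

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["pred 0", "pred 1"],
            yticklabels=["true 0", "true 1"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.tight_layout()
plt.savefig("plots/confusion_matrix.png")
```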
## Contributing

Contributions are welcome! To contribute:

1. Fork the repository.

2. Create a new branch:

   ```bash
   git checkout -b feature/feature-name
   ```

3. Commit your changes:

   ```bash
   git commit -m "Add feature description"
   ```

4. Push to the branch:

   ```bash
   git push origin feature/feature-name
   ```

5. Open a pull request.
## Contact

For questions or support, please contact me.