diff --git a/docs/NLP/projects/Email_Spam_Detection/README.md b/docs/NLP/projects/Email_Spam_Detection/README.md
new file mode 100644
index 00000000..5be4710e
--- /dev/null
+++ b/docs/NLP/projects/Email_Spam_Detection/README.md
@@ -0,0 +1,213 @@
+
+# Email Spam Detection
+
+### AIM
+To develop a machine learning-based system that classifies email content as spam or ham (not spam).
+
+### DATASET LINK
+[https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification)
+
+
+### NOTEBOOK LINK
+[https://www.kaggle.com/code/inshak9/email-spam-detection](https://www.kaggle.com/code/inshak9/email-spam-detection)
+
+
+### LIBRARIES NEEDED
+
+??? quote "LIBRARIES USED"
+
+    - pandas
+    - numpy
+    - scikit-learn
+    - matplotlib
+    - seaborn
+
+
+---
+
+### DESCRIPTION
+!!! info "What is the requirement of the project?"
+    - A robust system to detect spam emails is essential to combat increasing spam content.
+    - It improves user experience by automatically filtering unwanted messages.
+
+??? info "Why is it necessary?"
+    - Spam emails consume resources, time, and may pose security risks like phishing.
+    - Helps organizations and individuals streamline their email communication.
+
+??? info "How is it beneficial and used?"
+    - Provides a quick and automated solution for spam classification.
+    - Used in email services, IT systems, and anti-spam software to filter messages.
+
+??? info "How did you start approaching this project? (Initial thoughts and planning)"
+    - Analyzed the dataset and prepared features.
+    - Implemented various machine learning models for comparison.
+
+??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."
+    - Documentation from [scikit-learn](https://scikit-learn.org)
+    - Blog: Introduction to Spam Classification with ML
+
+---
+
+### EXPLANATION
+
+#### DETAILS OF THE DIFFERENT FEATURES
+The dataset contains features like word frequency, capital letter counts, and others that help in distinguishing spam emails from ham.
+
+| Feature              | Description                                       |
+|----------------------|---------------------------------------------------|
+| `word_freq_x`        | Frequency of specific words in the email body     |
+| `capital_run_length` | Length of runs of consecutive capital letters     |
+| `char_freq`          | Frequency of special characters like `;` and `$`  |
+| `is_spam`            | Target variable (1 = Spam, 0 = Ham)               |
+
+---
+
+#### WHAT I HAVE DONE
+
+=== "Step 1"
+
+    Initial data exploration and understanding:
+    - Loaded the dataset using pandas.
+    - Explored dataset features and target variable distribution.
+
+=== "Step 2"
+
+    Data cleaning and preprocessing:
+    - Checked for missing values.
+    - Standardized features using scaling techniques.
+
+=== "Step 3"
+
+    Feature engineering and selection:
+    - Extracted relevant features for spam classification.
+    - Used a correlation matrix to select significant features.
+
+=== "Step 4"
+
+    Model training and evaluation:
+    - Trained models: KNN, Naive Bayes, SVM, and Random Forest.
+    - Evaluated models using accuracy, precision, and recall.
+
+=== "Step 5"
+
+    Model optimization and fine-tuning:
+    - Tuned hyperparameters using GridSearchCV.
+
+=== "Step 6"
+
+    Validation and testing:
+    - Tested models on unseen data to check performance.
+
+---
+
+#### PROJECT TRADE-OFFS AND SOLUTIONS
+
+=== "Trade Off 1"
+    - **Accuracy vs. Training Time**:
+        - Models like Random Forest took longer to train but achieved higher accuracy compared to Naive Bayes.
+
+=== "Trade Off 2"
+    - **Complexity vs. Interpretability**:
+        - Simpler models like Naive Bayes were more interpretable but slightly less accurate.
+
+---
+
+### SCREENSHOTS
+
+
+!!! success "Project structure or tree diagram"
+
+    ``` mermaid
+      graph LR
+        A[Start] --> B[Load Dataset];
+        B --> C[Preprocessing];
+        C --> D[Train Models];
+        D --> E{Compare Performance};
+        E -->|Best Model| F[Deploy];
+        E -->|Retry| C;
+    ```
+
+??? tip "Visualizations and EDA of different features"
+
+    === "Feature Correlation Heatmap"
+        ![Correlation](images/correlation_heatmap.png)
+
+??? example "Model performance graphs"
+
+    === "Model Comparison"
+        ![Comparison](images/model_comparison.png)
+
+---
+
+### MODELS USED AND THEIR EVALUATION METRICS
+
+| Model                | Accuracy | Precision | Recall |
+|----------------------|----------|-----------|--------|
+| KNN                  | 90%      | 89%       | 88%    |
+| Naive Bayes          | 92%      | 91%       | 90%    |
+| SVM                  | 94%      | 93%       | 91%    |
+| Random Forest        | 95%      | 94%       | 93%    |
+| AdaBoost             | 97%      | 97%       | 100%   |
+
+---
+
+#### MODELS COMPARISON GRAPHS
+
+!!! tip "Models Comparison Graphs"
+
+    === "Accuracy Comparison"
+        ![Accuracy Graph](images/accuracy_graph.png)
+
+---
+
+### CONCLUSION
+
+#### WHAT YOU HAVE LEARNED
+
+!!! tip "Insights gained from the data"
+    - Feature importance significantly impacts spam detection.
+    - Simple models like Naive Bayes can achieve competitive performance.
+
+??? tip "Improvements in understanding machine learning concepts"
+    - Gained hands-on experience with classification models and model evaluation techniques.
+
+??? tip "Challenges faced and how they were overcome"
+    - Balancing accuracy against training time was challenging; this was addressed through model tuning.
+
+---
+
+#### USE CASES OF THIS MODEL
+
+=== "Application 1"
+
+    **Email Service Providers**
+    - Automated filtering of spam emails for improved user experience.
+
+=== "Application 2"
+
+    **Enterprise Email Security**
+    - Used in enterprise software to detect phishing and spam emails.
+
+---
+
+### FEATURES PLANNED BUT NOT IMPLEMENTED
+
+=== "Feature 1"
+
+    - Integration of deep learning models (LSTM) for improved accuracy.
+
+---
+
+### **DEVELOPER**
+***Insha Khan***
+
+[LinkedIn](https://www.linkedin.com/in/insha-khan-4087532a4/){ .md-button }
+[GitHub](https://www.github.com/ikcod){ .md-button }
+
+##### Happy Coding 🤓
+#### Show some  ❤️  by  🌟  this repository!
+
+
+
+
+
diff --git a/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - AdaBoost.png b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - AdaBoost.png
new file mode 100644
index 00000000..cea45f81
Binary files /dev/null and b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - AdaBoost.png differ
diff --git a/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Decision Tree.png b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Decision Tree.png
new file mode 100644
index 00000000..678468e0
Binary files /dev/null and b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Decision Tree.png differ
diff --git a/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Naive Bayes.png b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Naive Bayes.png
new file mode 100644
index 00000000..d2372578
Binary files /dev/null and b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Naive Bayes.png differ
diff --git a/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Random Forest.png b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Random Forest.png
new file mode 100644
index 00000000..ffbbb73f
Binary files /dev/null and b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - Random Forest.png differ
diff --git a/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - SVM.png b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - SVM.png
new file mode 100644
index 00000000..32313728
Binary files /dev/null and b/docs/NLP/projects/Email_Spam_Detection/images/Confusion Matrix - SVM.png differ
diff --git a/docs/NLP/projects/Email_Spam_Detection/images/Model accracy comparison.png b/docs/NLP/projects/Email_Spam_Detection/images/Model accracy comparison.png
new file mode 100644
index 00000000..b605b0d2
Binary files /dev/null and b/docs/NLP/projects/Email_Spam_Detection/images/Model accracy comparison.png differ
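
As a companion to the README's "WHAT I HAVE DONE" steps (scaling, training KNN, Naive Bayes, SVM, and Random Forest, tuning with GridSearchCV, and scoring with accuracy, precision, and recall), a minimal sketch of that workflow might look like the following. It is not the notebook's code: the data is a synthetic stand-in generated with `make_classification`, and the parameter grid, split ratio, and variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic numeric features stand in for the real dataset,
# with y playing the role of the README's `is_spam` target (1 = spam, 0 = ham).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)

# Hold out unseen data for the final check (the 75/25 split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside each pipeline so it is fit on training data only.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": make_pipeline(StandardScaler(), GaussianNB()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train every model and record the three metrics named in the README.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions),
        "recall": recall_score(y_test, predictions),
    }

# Hyperparameter tuning with GridSearchCV; this small SVM grid is hypothetical.
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
best_params = search.best_params_
tuned_accuracy = accuracy_score(y_test, search.predict(X_test))

for name, metrics in results.items():
    print(
        f"{name}: accuracy={metrics['accuracy']:.2f} "
        f"precision={metrics['precision']:.2f} recall={metrics['recall']:.2f}"
    )
print(f"Tuned SVM: params={best_params} accuracy={tuned_accuracy:.2f}")
```

To try this on real data, the main change should be replacing the synthetic `X, y` with the project's feature matrix and target column. The scores printed for synthetic data are unrelated to the figures reported in the README's metrics table.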