ππ‘π’π¬π‘π’π§π πππππ¬ππ ππ¨π« ππππ‘π’π§π ππππ«π§π’π§π
About Dataset Context Anti-phishing refers to efforts to block phishing attacks. Phishing is a kind of cybercrime where attackers pose as known or trusted entities and contact individuals through email, text or telephone and ask them to share sensitive information. Typically, in a phishing email attack, and the message will suggest that there is a problem with an invoice, that there has been suspicious activity on an account, or that the user must login to verify an account or password. Users may also be prompted to enter credit card information or bank account details as well as other sensitive data. Once this information is collected, attackers may use it to access accounts, steal data and identities, and download malware onto the userβs computer.
Content This dataset contains 48 features extracted from 5000 phishing webpages and 5000 legitimate webpages, which were downloaded from January to May 2015 and from May to June 2017. An improved feature extraction technique is employed by leveraging the browser automation framework (i.e., Selenium WebDriver), which is more precise and robust compared to the parsing approach based on regular expressions.
Anti-phishing researchers and experts may find this dataset useful for phishing features analysis, conducting rapid proof of concept experiments or benchmarking phishing classification models.
π§ Goal: To develop a Machine Learning model that accurately identifies phishing websites by analyzing URL-based features β strengthening cybersecurity and user trust through data-driven insights.
Inquiry Pipeline:
1οΈβ£ Imports & Config: Set up Python environment with essential ML and visualization libraries to ensure a reproducible workflow.
2οΈβ£ Load Data: Loaded the Phishing vs Legitimate Websites dataset for analysis and model building.
3οΈβ£ Quick Data Audit: Checked missing values, data types, and overall dataset health before processing.
4οΈβ£ Target Column Detection: Identified the dependent variable (phishing / legitimate) to guide model training.
5οΈβ£Features & Basic Cleaning: Cleaned, encoded, and standardized the feature set to ensure quality model inputs.
6οΈβ£ EDA: Class Balance: Visualized data imbalance to understand distribution of phishing vs legitimate samples.
7οΈβ£ EDA: Correlations (Numeric): Explored numerical feature relationships using heatmaps to detect strong correlations.
8οΈβ£ EDA: Top Feature Distributions: Plotted key feature trends distinguishing phishing from legitimate URLs.
9οΈβ£ Feature Selection (Optional): Retained the most informative predictors to reduce noise and improve accuracy.
πTrain/Validation Split: Split data into training and testing sets to enable unbiased performance evaluation.
1οΈβ£1οΈβ£ Model Pipelines: Built modular pipelines for multiple models β Logistic Regression, Decision Tree, and Random Forest.
1οΈβ£2οΈβ£ Train & Evaluate (Holdout): Evaluated models on holdout data.RandomForestClassifier delivered the highest F1-score and ROC-AUC β proving most reliable for phishing detection.
1οΈβ£3οΈβ£ Cross-Validation Leaderboard: Compared models via cross-validation to confirm RandomForestβs consistent performance.
1οΈβ£4οΈβ£ Feature Importance (Tree-Based): Visualized top predictive factors β URL length, HTTPS presence, and domain features were key indicators.
1οΈβ£5οΈβ£ Save Best Model (Optional): Saved the final trained model for deployment and real-time phishing detection use.
Random Forest outperformed all models in both F1-score and ROC-AUC, proving ideal for security-based ML detection.