MSc Project - Advanced Unsupervised Learning

**Authors:** Mirko Morello, Andrea Borghesi
**University:** University of Milano-Bicocca
**Date:** January 2025
- Overview
- Key Features
- Dataset
- Methodologies
- Results
- Installation
- Usage
- Project Structure
- Technical Details
- References
This project explores and compares multiple unsupervised anomaly detection algorithms on a high-dimensional medical dataset containing mixed-type data (numerical and categorical features). The core challenge lies in effectively handling this heterogeneous data structure, which required developing custom distance metrics and adapted implementations of classical algorithms.
Anomaly detection is crucial in domains like:
- Medical diagnosis and health monitoring
- Fraud detection and cybersecurity
- Manufacturing quality control
- Network intrusion detection
This project demonstrates a comprehensive approach to unsupervised anomaly detection, implementing methods from three major categories:
- Proximity-based (k-NN, LOF, COF, DBSCAN)
- Prototype-based (K-Means++)
- Reconstruction-based (PCA, Autoencoders)
Why it matters: Standard distance metrics (Euclidean, Manhattan) are inappropriate for mixed-type data. Our custom implementation:
- **Handles mixed data types** - Combines Euclidean distance for numerical features and Hamming distance for categorical features
- **Optimized performance** - ~10 seconds on a high-end processor (7,200 × 7,200 matrix)
- **Flexible design** - Supports multiple distance metrics without forced normalization
- **Faster than existing libraries** - Outperforms open-source alternatives
# Gower distance combines multiple metrics
proximity = (numerical_distance + categorical_distance) / total_features
Figure 1: Proximity matrix computed with custom Gower's distance showing data structure
Implemented and evaluated 8 different methods:
- k-Nearest Neighbor (k-NN) - Distance-based
- Local Outlier Factor (LOF) - Density-based
- Connectivity Outlier Factor (COF) - Graph-based
- DBSCAN - Clustering-based
- K-Means++ with custom Gower distance
- Principal Component Analysis (PCA) - Reconstruction-based
- Autoencoder - Deep learning approach
- Ensemble methods - Combining multiple detectors
For thresholding anomaly scores, a two-stage IQR (Interquartile Range) method is used:
- Reduces bias from anomalies in the data
- Yields more accurate boundary detection
- Size: 7,200 observations
- Original features: 23 (reduced to 21 after cleaning)
- Feature types: Mixed (numerical + categorical)
- Numerical features: 5 continuous variables (standardized with Z-score)
- Categorical features: 16 binary/boolean features (one-hot encoded)
- Domain: Medical/patient data
- Labels: Not provided (unsupervised learning task)
- Feature removal: Dropped 2 empty columns
- Type conversion:
- String → float64 for numerical features
- String → boolean for categorical features
- Standardization: Z-score normalization (mean=0, std=1) for numerical features
- Validation: No missing values detected
Due to high dimensionality (21 features), visualization techniques were employed:
- PCA (Principal Component Analysis) - Linear dimensionality reduction
- t-SNE (t-Distributed Stochastic Neighbor Embedding) - Non-linear manifold learning
- Feature combination plots - All pairs of numerical features
These visualizations served as validation tools throughout the anomaly detection process.
Problem: Traditional distance metrics fail with mixed-type data.
Solution: Implemented Gower's distance, which properly handles heterogeneous features:
import numpy as np
from scipy.spatial.distance import cdist

def proximity_matrix(data_x, data_y=None,
                     metrics={'numeric': 'euclidean', 'categorical': 'hamming'},
                     cat_features=[]):
    """
    Computes a proximity matrix for mixed-type data using Gower's distance.
    Vectorized sketch of the approach; see main.py for the full implementation.

    Parameters
    ----------
    data_x : pandas DataFrame
        Input dataset.
    data_y : pandas DataFrame, optional
        Second dataset (e.g. centroids); defaults to data_x.
    metrics : dict
        Distance metrics for each feature type.
    cat_features : list
        Names of the categorical features.

    Returns
    -------
    prox_matrix : numpy array
        Proximity/distance matrix.
    """
    if data_y is None:
        data_y = data_x
    num_features = [c for c in data_x.columns if c not in cat_features]

    # Separate numerical and categorical features
    X_num, Y_num = data_x[num_features].to_numpy(float), data_y[num_features].to_numpy(float)
    X_cat, Y_cat = data_x[cat_features].to_numpy(float), data_y[cat_features].to_numpy(float)

    # Apply the appropriate metric to each block of features
    num_dist = cdist(X_num, Y_num, metric=metrics['numeric'])
    # SciPy's hamming returns the mismatch fraction, so rescale to a count
    cat_dist = cdist(X_cat, Y_cat, metric=metrics['categorical']) * len(cat_features)

    # Combined distance (Gower): average contribution per feature
    total_features = len(num_features) + len(cat_features)
    prox_matrix = (num_dist + cat_dist) / total_features
    return prox_matrix

Key Insights:
- Hamming distance and Minkowski distances behave identically on binary data
- Boolean features can be treated as floats if rounded during centroid updates
- This observation enables performance optimizations
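The first insight can be checked directly in a few lines; the vectors below are made-up binary examples, not rows from the dataset:

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, hamming

# Illustrative check: on binary data the Hamming mismatch count coincides with
# the Manhattan (Minkowski p=1) distance and with the squared Euclidean distance.
a = np.array([1, 0, 1, 1, 0], dtype=float)
b = np.array([0, 0, 1, 0, 1], dtype=float)

hamming_count = hamming(a, b) * len(a)   # SciPy returns the mismatch *fraction*
manhattan = cityblock(a, b)
squared_euclidean = euclidean(a, b) ** 2

print(hamming_count, manhattan, squared_euclidean)  # all print 3.0
```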
Principle: Anomalies have larger distances to their k-nearest neighbors.
Implementation:
- Computed distance to 5th nearest neighbor for each point
- Applied two-stage IQR thresholding
Results:
- Anomalies detected: 693 (9.62%)
- Performance: Good separation between normal and anomalous points
Figure 2: Histogram of distances to 5th nearest neighbor
Figure 3: k-NN anomaly detection results (PCA & t-SNE projection)
Figure 4: k-NN results across all numerical feature combinations
Principle: Compares local density of a point to the densities of its neighbors.
Challenge: Highly sensitive to curse of dimensionality.
Results:
- Anomalies detected: 1,089 (15.11%)
- Issue: LOF scores ranged from 0.8 to 1.5 × 10⁸
- Problem: High-dimensional sparsity caused inflated scores
- Verdict: Many false positives; not suitable for this high-dimensional dataset
Figure 5: LOF struggles with high-dimensional data
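For reference, a minimal sketch of how LOF scores can be obtained from the precomputed Gower matrix with scikit-learn; `prox_mat` and `two_stage_iqr_bound` are the objects defined in the Usage section, and the project's own implementation may differ:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# LOF on the precomputed Gower distance matrix (sketch, not the exact code in main.py)
lof = LocalOutlierFactor(n_neighbors=5, metric='precomputed')
lof.fit(prox_mat)

# negative_outlier_factor_ is more negative for outliers, so flip the sign
lof_score = -lof.negative_outlier_factor_

threshold = two_stage_iqr_bound(lof_score)
lof_anomalies = np.where(lof_score > threshold)[0]
```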
Principle: Graph-based variant of LOF; considers connectivity through shortest paths.
Advantage: Less sensitive to dimensionality than LOF.
Results:
- Anomalies detected: 1,034 (14.36%)
- Performance: Better than LOF but still shows false positives
Figure 6: COF score distribution
Figure 7: COF anomaly detection results
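The connectivity notion behind COF is expensive to compute; the sketch below shows only the shortest-path building block on a k-NN graph derived from the Gower matrix (using SciPy, as listed in the tooling), not the full COF scoring:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

k = 5
n = prox_mat.shape[0]

# k nearest neighbours per point (column 0 is the point itself, distance 0)
knn_idx = np.argsort(prox_mat, axis=1)[:, 1:k + 1]

rows = np.repeat(np.arange(n), k)
cols = knn_idx.ravel()
graph = csr_matrix((prox_mat[rows, cols], (rows, cols)), shape=(n, n))

# Geodesic distances through the k-NN graph; this step is what drives the
# ~9 minutes reported for COF in the complexity table below.
geodesic = shortest_path(graph, directed=False)
```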
Principle: Density-based clustering; noise points = anomalies.
Parameters:
- eps = 0.075
- min_samples = 5
Results:
- Anomalies detected: 421 (5.85%)
- Performance: Very promising; correctly identifies outliers across multiple dimensions
Figure 8: DBSCAN effectively identifies anomalies
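A minimal sketch of this step with scikit-learn, assuming the precomputed Gower matrix `prox_mat` from the Usage section; points labelled -1 (noise) are treated as anomalies:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# DBSCAN on the precomputed Gower distance matrix with the parameters above
db = DBSCAN(eps=0.075, min_samples=5, metric='precomputed')
db_labels = db.fit_predict(prox_mat)

dbscan_anomalies = np.where(db_labels == -1)[0]
print(f"DBSCAN flagged {len(dbscan_anomalies)} points as noise/anomalies")
```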
Standard K-Means problem: Cannot handle mixed-type data properly.
Our Solution:
- Custom centroid structure: Boolean features remain boolean, not converted to float
- Custom distance: Gower's distance for assignments
- Custom updates:
- Float features: Mean
- Boolean features: Mode (majority vote)
Mathematical Insight:
For boolean features, the mode can be computed efficiently:
mode = True if average > 0.5 else False
This means we can treat booleans as floats and round after averaging, enabling significant optimization.
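A tiny illustration of that equivalence on a made-up boolean column:

```python
import numpy as np

# For a boolean feature, rounding the cluster mean reproduces the mode (majority vote)
cluster_column = np.array([True, True, False, True, False, True])

mode_via_mean = cluster_column.mean() > 0.5                        # True: 4 of 6 are True
mode_via_count = bool(np.bincount(cluster_column.astype(int)).argmax())

assert mode_via_mean == mode_via_count
```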
Optimization Result:
- Naive implementation: Very slow
- Optimized implementation: ~100Γ faster
- Still slower than scikit-learn's Cython implementation, but necessary for proper mixed-type handling
Finding Optimal Clusters:
Used the elbow method with second derivative analysis:
- Elbow point = minimum of second derivative
- Ran 500 iterations to handle fluctuations
- Optimal number of clusters: 5-6
Figure 9: Elbow method analysis showing optimal clusters
Figure 10: Distribution of optimal k from 500 runs
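As an alternative to the manual second-derivative analysis, the `kneed` package (already in requirements.txt) can locate the elbow; the inertia values below are illustrative only:

```python
import numpy as np
from kneed import KneeLocator

# Made-up inertia curve; in the project these come from repeated runs of the
# custom K-Means for each candidate k.
ks = np.arange(1, 11)
inertias = np.array([980., 610., 420., 300., 228., 205., 190., 180., 173., 168.])

knee = KneeLocator(ks, inertias, curve='convex', direction='decreasing')
print(f"Elbow at k = {knee.elbow}")
```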
Results:
- Anomalies detected: 304 (4.22%)
- Performance: Excellent; identifies outer layer of data distribution
- Quality: One of the best-performing methods
Figure 11: Distance to centroid distribution
Figure 12: K-Means anomaly detection results
Figure 13: K-Means consistently identifies boundary anomalies
Principle: Anomalies have high reconstruction error when projected to lower dimensions.
Implementation:
- Tested multiple component numbers
- Selected 6 components for optimal representation
- Computed reconstruction error using Gower's distance
Results:
- Anomalies detected: 79 (1.1%)
- Issue: Reconstruction error distribution doesn't follow expected Gaussian pattern
- Verdict: Successfully reconstructed some apparent anomalies; not ideal for this task
Figure 14: Reconstruction error vs. number of components
Figure 15: PCA reconstruction error histogram
Figure 16: PCA anomaly detection results
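A hedged sketch of this detector; `df` and `two_stage_iqr_bound` come from the Usage section, and plain Euclidean error is used here where the report computes the reconstruction error with Gower's distance:

```python
import numpy as np
from sklearn.decomposition import PCA

X = df.to_numpy(dtype=float)

# Project to 6 components and reconstruct
pca = PCA(n_components=6)
X_reconstructed = pca.inverse_transform(pca.fit_transform(X))

# Per-sample reconstruction error (Euclidean here; Gower's distance in the report)
pca_score = np.linalg.norm(X - X_reconstructed, axis=1)

threshold = two_stage_iqr_bound(pca_score)
pca_anomalies = np.where(pca_score > threshold)[0]
```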
Architecture:
Encoder: Input(21) → Linear(6) → ReLU
Decoder: Linear(6) → Linear(21) → Sigmoid
Training:
- 100 epochs
- Batch size: 32
- Optimizer: Adam (lr=0.001)
- Loss: MSE (Mean Squared Error)
- Training time: ~1.5 minutes
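A minimal training-loop sketch matching these hyperparameters; the `Autoencoder` class and `df` are the ones from the Usage section, and the exact loop in main.py may differ:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

tensor_data = torch.tensor(df.to_numpy(dtype='float32'))
loader = DataLoader(tensor_data, batch_size=32, shuffle=True)

model = Autoencoder(input_dim=21, encoding_dim=6)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    for batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch), batch)   # reconstruct the input itself
        loss.backward()
        optimizer.step()
```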
Results:
- Anomalies detected: 284 (3.94%)
- Performance: Excellent; similar quality to k-NN
- Advantage: Captures complex non-linear patterns
Figure 17: Autoencoder reconstruction error histogram
Figure 18: Autoencoder effectively identifies anomalies
Figure 19: Autoencoder results across feature combinations
Goal: Combine multiple detectors for robust anomaly detection.
Methods evaluated: K-Means, Autoencoder, k-NN (the three best-performing individual methods)
Rule: Flag anomaly only if ALL methods agree.
Results:
- Anomalies detected: 163 (2.26%)
- Characteristics: Very conservative; minimizes false positives but may miss true anomalies
Figure 20: AND ensemble - high precision, lower recall
Rule: Flag anomaly if ANY method agrees.
Results:
- Anomalies detected: 792 (11.00%)
- Characteristics: Maximizes anomaly coverage but includes more false positives
Figure 21: OR ensemble - high recall, lower precision
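Both voting rules reduce to simple set operations on the index arrays returned by the individual detectors (the `*_anomalies` names below are placeholders for those arrays):

```python
import numpy as np

kmeans_set = set(kmeans_anomalies)
autoencoder_set = set(autoencoder_anomalies)
knn_set = set(knn_anomalies)

# AND: flag only points every method agrees on; OR: flag points any method flags
and_anomalies = np.array(sorted(kmeans_set & autoencoder_set & knn_set))
or_anomalies = np.array(sorted(kmeans_set | autoencoder_set | knn_set))
```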
Rule: Combine normalized anomaly scores with equal weights (1/3 each).
# Normalize all scores to [0, 1]
normalized_scores = [score / max(score) for score in [kmeans, autoencoder, knn]]
# Weighted combination
final_score = 0.333 * kmeans + 0.333 * autoencoder + 0.333 * knn
# Apply two-stage IQR threshold
anomalies = final_score > threshold

Results:
- Anomalies detected: 402 (5.58%)
- Performance: Best overall method
- Quality:
- Excludes clear anomalies
- Forms compact main cluster
- Minimal false positives
- Balanced precision and recall
Figure 22: Weighted sum score distribution shows clear separation
Figure 23: Weighted sum ensemble provides excellent anomaly detection
Figure 24: Weighted sum consistently identifies true anomalies across all feature combinations
| Method | Anomalies Detected | Percentage | Performance |
|---|---|---|---|
| Weighted Sum Ensemble | 402 | 5.58% | ★★★★★ Excellent |
| DBSCAN | 421 | 5.85% | ★★★★★ Excellent |
| K-Means++ | 304 | 4.22% | ★★★★★ Excellent |
| Autoencoder | 284 | 3.94% | ★★★★★ Excellent |
| k-NN | 693 | 9.62% | ★★★★ Good |
| AND Ensemble | 163 | 2.26% | ★★★ Conservative |
| PCA | 79 | 1.10% | ★★ Limited |
| OR Ensemble | 792 | 11.00% | ★★ Too Permissive |
| COF | 1,034 | 14.36% | ★ Many False Positives |
| LOF | 1,089 | 15.11% | ★ Poor (High-D issues) |
Adjusted Rand Index (ARI) between methods:
| | NN | LOF | DBSCAN | COF | K-Means | Autoencoder | PCA | Weighted Sum |
|---|---|---|---|---|---|---|---|---|
| NN | 1.000 | 0.345 | 0.704 | 0.462 | 0.423 | 0.386 | 0.156 | 0.581 |
| LOF | 0.345 | 1.000 | 0.250 | 0.437 | 0.097 | 0.087 | 0.061 | 0.155 |
| DBSCAN | 0.704 | 0.250 | 1.000 | 0.348 | 0.540 | 0.446 | 0.236 | 0.664 |
| COF | 0.462 | 0.437 | 0.348 | 1.000 | 0.213 | 0.186 | 0.090 | 0.298 |
| K-Means | 0.423 | 0.097 | 0.540 | 0.213 | 1.000 | 0.593 | 0.222 | 0.750 |
| Autoencoder | 0.386 | 0.087 | 0.446 | 0.186 | 0.593 | 1.000 | 0.149 | 0.689 |
| PCA | 0.156 | 0.061 | 0.236 | 0.090 | 0.222 | 0.149 | 1.000 | 0.219 |
| Weighted Sum | 0.581 | 0.155 | 0.664 | 0.298 | 0.750 | 0.689 | 0.219 | 1.000 |
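Each entry of the table can be reproduced by turning two methods' anomaly index arrays into 0/1 labelings and comparing them with scikit-learn (a sketch, with placeholder index arrays):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def anomaly_labels(anomaly_idx, n):
    """0/1 labeling: 1 for flagged anomalies, 0 for normal points."""
    labels = np.zeros(n, dtype=int)
    labels[anomaly_idx] = 1
    return labels

n = len(df)
ari = adjusted_rand_score(anomaly_labels(knn_anomalies, n),
                          anomaly_labels(dbscan_anomalies, n))
```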
Key Findings:
- Best-performing methods (K-Means, Autoencoder, k-NN, DBSCAN) show strong agreement
- The weighted sum ensemble has the highest agreement with the individual best methods
- LOF shows poor agreement with the other methods (confirming its unsuitability for high-dimensional data)
- PCA shows low agreement with all methods
- Optimal anomaly rate: 5-9% of the dataset (~360-650 observations)
- Best single method: K-Means++ with custom Gower distance
- Best overall approach: Weighted Sum Ensemble
- Critical insight: Proper handling of mixed-type data is essential
- Recommendation: Use ensemble methods for robust detection
- Python 3.8+
- pip or conda
# Clone the repository
git clone https://github.com/yourusername/MSc_Unsupervised_Learning.git
cd MSc_Unsupervised_Learning
# Create virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required packages
pip install -r requirements.txt

Contents of requirements.txt:

pandas>=1.3.0
numpy>=1.21.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=1.0.0
torch>=1.9.0
scipy>=1.7.0
tqdm>=4.62.0
kneed>=0.7.0
pythresh>=0.2.0
torchsummary>=1.5.0

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Load and preprocess data
df = pd.read_csv('datasets/data.csv', sep=';', index_col='Row')
# Data cleaning
df = df.replace(',', '.', regex=True)
df.drop(['Unnamed: 22', 'Unnamed: 23'], axis=1, inplace=True)
# Identify feature types
int_cols = df.select_dtypes(include=['int64']).columns
for col in int_cols:
if df[col].nunique() == 2:
df[col] = df[col].astype(bool)
bool_cols = df.select_dtypes(include=['bool']).columns
float_cols = df.select_dtypes(exclude=['bool']).columns
df[float_cols] = df[float_cols].astype(float)
# Standardize numerical features
scaler = StandardScaler()
df[float_cols] = scaler.fit_transform(df[float_cols])

def proximity_matrix(data_x, data_y=None,
metrics={'numeric': 'euclidean', 'categorical': 'hamming'},
cat_features=[]):
"""
Compute proximity matrix using Gower's distance.
See main.py for full implementation.
"""
# Implementation in main.py (lines 316-370)
pass
# Compute proximity matrix
prox_mat = proximity_matrix(data_x=df, cat_features=bool_cols)

from sklearn.neighbors import NearestNeighbors
# k-NN anomaly detection
k = 5
knn = NearestNeighbors(n_neighbors=k-1, metric='precomputed')
knn.fit(prox_mat)
dist, idx = knn.kneighbors()
knn_score = dist[:, -1]
# Apply two-stage IQR threshold
threshold = two_stage_iqr_bound(knn_score)
anomalies = np.where(knn_score > threshold)[0]

def kmeans_gower_optimized(data, n_clusters, max_iter=300, random_state=None):
"""
K-Means with proper mixed-type handling.
See main.py for full implementation (lines 832-879).
"""
pass
# Run K-Means
labels, centroids, inertia = kmeans_gower_optimized(df, n_clusters=5, max_iter=300)
# Compute distances to centroids (convert the centroid array back to a DataFrame)
centroids_df = pd.DataFrame(centroids, columns=df.columns)
prox_centers = proximity_matrix(data_x=df, data_y=centroids_df, cat_features=bool_cols)
km_scores = np.min(prox_centers, axis=1)
# Threshold
threshold = two_stage_iqr_bound(km_scores)
anomalies = np.where(km_scores > threshold)[0]

import torch
import torch.nn as nn
class Autoencoder(nn.Module):
def __init__(self, input_dim, encoding_dim):
super(Autoencoder, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(input_dim, encoding_dim),
nn.ReLU()
)
self.decoder = nn.Sequential(
nn.Linear(encoding_dim, input_dim),
nn.Sigmoid()
)
def forward(self, x):
encoded = self.encoder(x)
decoded = self.decoder(encoded)
return decoded
# Train autoencoder (see main.py lines 1315-1399)
autoencoder = Autoencoder(input_dim=21, encoding_dim=6)
# ... training code ...
# Compute reconstruction error
with torch.no_grad():
    reconstructed = autoencoder(tensor_data)

# Back to a DataFrame so the Gower-based reconstruction error can be computed
reconstructed_df = pd.DataFrame(reconstructed.numpy(), columns=df.columns)
ed_score = compute_reconstruction_error(df, reconstructed_df, bool_cols)
# Threshold
threshold = two_stage_iqr_bound(ed_score)
anomalies = np.where(ed_score > threshold)[0]

# Normalize scores
methods_scores = [km_scores, ed_score, knn_score]
normalized_scores = [score/np.max(score) for score in methods_scores]
# Equal weighting
weights = [1/3, 1/3, 1/3]
# Weighted sum
ws_scores = sum(score * weight for score, weight in zip(normalized_scores, weights))
# Threshold
threshold = two_stage_iqr_bound(ws_scores)
anomalies = np.where(ws_scores > threshold)[0]
print(f"Detected {len(anomalies)} anomalies ({len(anomalies)/len(df)*100:.2f}%)")from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
def visualize_results(data, labels, title=''):
# PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data)
# t-SNE
tsne = TSNE(n_components=2, perplexity=20, n_iter=300)
tsne_result = tsne.fit_transform(data)
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
sns.scatterplot(x=pca_result[:,0], y=pca_result[:,1],
hue=labels, palette=['gray', 'red'], ax=ax1)
ax1.set_title(f'PCA - {title}')
sns.scatterplot(x=tsne_result[:,0], y=tsne_result[:,1],
hue=labels, palette=['gray', 'red'], ax=ax2)
ax2.set_title(f't-SNE - {title}')
plt.show()
# Create labels: 1 = normal, -1 = anomaly
labels = np.ones(len(df))
labels[anomalies] = -1
visualize_results(df, labels, title='Anomaly Detection Results')

MSc_Unsupervised_Learning/
│
├── Final_project/
│   ├── main.py                                      # Main implementation (1,552 lines)
│   ├── main.tex                                     # LaTeX report source
│   ├── Unsupervised_Learning__Final_project.pdf     # Full project report
│   ├── 2023-2024 - Unsupervised Learning - Exam.pdf # Assignment
│   │
│   └── images/                                      # All visualization outputs
│       ├── proxmat.png                              # Proximity matrix
│       ├── NN_*.png                                 # k-NN results
│       ├── LOF_*.png                                # LOF results
│       ├── DBSCAN_*.png                             # DBSCAN results
│       ├── COF_*.png                                # COF results
│       ├── KM_*.png                                 # K-Means results
│       ├── PCA_*.png                                # PCA results
│       ├── ED_*.png                                 # Autoencoder results
│       ├── WS_*.png                                 # Weighted sum results
│       ├── AND_*.png                                # AND ensemble
│       └── OR_*.png                                 # OR ensemble
│
├── datasets/
│   ├── data.csv                                     # Original dataset
│   ├── data_with_anomaly.csv                        # With anomaly scores
│   └── fortune500(2017-2021).csv                    # Additional dataset
│
├── Lessons_notes/                                   # Course materials
│
├── README.md                                        # This file
├── requirements.txt                                 # Python dependencies
└── .gitignore
| Operation | Time Complexity | Space Complexity | Actual Time (7,200 samples) |
|---|---|---|---|
| Gower's Distance Matrix | O(n² × m) | O(n²) | ~10 seconds |
| k-NN Distance Computation | O(n² × k) | O(n × k) | ~5 seconds |
| K-Means (custom) | O(i × n × k × m) | O(n × m) | ~30 seconds (300 iter) |
| COF Shortest Paths | O(n³) | O(n²) | ~9 minutes |
| Autoencoder Training | O(e × b × n × m) | O(m) | ~90 seconds (100 epochs) |
Where:
- n = number of samples (7,200)
- m = number of features (21)
- k = number of neighbors/clusters (5-6)
- i = iterations
- e = epochs
- b = batch operations
Two-Stage IQR Method:
def two_stage_iqr_bound(scores):
"""
Two-stage IQR thresholding reduces bias from anomalies.
Stage 1: Remove obvious outliers
Stage 2: Recalculate bounds on cleaner data
"""
# Stage 1
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
upper_bound_1 = q3 + 1.5 * iqr
# Keep non-outliers
clean_scores = scores[scores <= upper_bound_1]
# Stage 2
q1, q3 = np.percentile(clean_scores, [25, 75])
iqr = q3 - q1
upper_bound_2 = q3 + 1.5 * iqr
    return upper_bound_2

- Vectorized Operations: NumPy broadcasting for distance computations (see the sketch after this list)
- Multi-threading: Used for elbow method iterations
- Memory Management: Garbage collection for large matrix operations
- Boolean Optimization: Treating booleans as floats with rounding
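A small illustration of the broadcasting idea (not the report's exact code): pairwise squared Euclidean distances computed without a Python loop over rows.

```python
import numpy as np

X = np.random.rand(1000, 5)

# ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>, all computed in bulk
sq_norms = (X ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
sq_dists = np.maximum(sq_dists, 0.0)   # clip tiny negatives from round-off
```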
- Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1027-1035).
- Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 93-104).
- Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96 Proceedings (pp. 226-231).
- Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C, 28(1), 100-108.
- Shyu, M. L., Chen, S. C., Sarinnapakorn, K., & Chang, L. (2003). A novel anomaly detection scheme based on principal component classifier. IEEE Foundations and New Directions of Data Mining Workshop (pp. 172-179).
- Yang, J., Rahardja, S., & Fränti, P. (2019). Outlier detection: How to threshold outlier scores? International Conference on Artificial Intelligence, Information Processing and Cloud Computing.
- NumPy & Pandas - Data manipulation
- Scikit-learn - Machine learning algorithms
- PyTorch - Deep learning (Autoencoder)
- Matplotlib & Seaborn - Visualization
- SciPy - Scientific computing (shortest paths)
Disclaimer from Original Report:
This report does not contain any form of plagiarism, including content generated or suggested by AI tools such as ChatGPT or similar services. All sources used have been properly cited and referenced.
Mirko Morello - m.morello11@campus.unimib.it - University of Milano-Bicocca
Andrea Borghesi - a.borghesi1@campus.unimib.it - University of Milano-Bicocca
This project is submitted as academic work for the MSc in Data Science at the University of Milano-Bicocca.
For academic use and reference, please cite appropriately.
- Course: Unsupervised Learning
- Institution: University of Milano-Bicocca
- Academic Year: 2023-2024
- Project Type: MSc Final Project
Special thanks to the course instructors for guidance on unsupervised learning techniques and anomaly detection methodologies.
Star this repository if you found it helpful!
Made by MSc Data Science students