This project replicates and extends a deep learning model built on a contrastive learning framework for the task of image similarity. The underlying model learns meaningful visual representations through self-supervised learning, enabling accurate similarity measurements between image pairs. The implementation is based on PyTorch.
Fully implemented SimCLR framework from scratch (vectorized), including the NT-Xent loss.
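For reference, a generic, vectorized NT-Xent loss in PyTorch might look like the sketch below. This follows the standard formulation from [1]; the function name and batch layout are illustrative and not necessarily identical to the code in this repository.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over M positive pairs.
    z1, z2: (M, D) projections of two augmented views of the same M images."""
    m = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2M, D), unit-norm rows
    sim = z @ z.t() / temperature                         # (2M, 2M) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # mask self-similarity
    # The positive partner of sample i is i + M (and vice versa).
    targets = torch.cat([torch.arange(m, 2 * m), torch.arange(m)])
    return F.cross_entropy(sim, targets.to(sim.device))
```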
Comparison of two CNN backbones:
- Simple: 2 convolutional layers + 1 fully connected layer (LeNet-style), conv layers use kernel size 3, stride 2, padding 1.
- ResNet: Residual blocks with skip connections, batch normalization, and ReLU activations.
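For illustration, the Simple backbone described above might look roughly like the following sketch; the channel widths and the flattened feature size (for 32×32 inputs) are assumptions, not the repository's exact values.

```python
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """LeNet-style encoder: 2 conv layers (kernel 3, stride 2, padding 1) + 1 fully connected layer."""
    def __init__(self, in_channels=3, repr_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),           # 16x16 -> 8x8
            nn.ReLU(),
        )
        self.fc = nn.Linear(64 * 8 * 8, repr_dim)

    def forward(self, x):
        h = self.features(x)
        return self.fc(h.flatten(start_dim=1))
```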
The output vector space is 128-dimensional (the projection head output, set by N_repr in the config).
| Model | Train Loss | Test Loss | Duration |
|---|---|---|---|
| Simple | 0.0350 | 0.0352 | 2h 50m |
| ResNet | 0.0166 | 0.0184 | 4h 18m |
Figure 1. Loss plots of the ResNet model on the train and validation sets during training.
Figure 2. The basis image sampled from the test set, used as the reference for similarity comparison.
Figure 3. Ranking of all test images by similarity to the basis image, showing the top-5 and bottom-5 most similar images.
- GPU: NVIDIA RTX 4060 Mobile
- CPU: Intel i5-12500H
- RAM: 40 GB DDR4
Before running the program:

1. Prepare CIFAR Data

   Download and extract the CIFAR-10 dataset into the following directory: `data/raw/cifar/`. The directory should contain files such as `data_batch_1`, `data_batch_2`, ..., `test_batch`, `batches.meta`, etc.

2. Adjust the Data Loader

   Configure the `load_data` method of the `Cifar` object in `src/data/dataset.py` to correctly parse your data layout and preprocessing preferences. The implementation of data loading is left intentionally flexible for users to define according to their needs (see the sketch below).
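For reference, the raw CIFAR-10 batch files are Python pickles, so a minimal loader could look like the sketch below. The exact signature and return format expected by the `Cifar` object are up to you; this example is an assumption, not the repository's implementation.

```python
import os
import pickle
import numpy as np

def load_data(data_dir="data/raw/cifar"):
    """Read the raw CIFAR-10 batch files into (N, 3, 32, 32) uint8 arrays.
    Splitting, normalization, and augmentation are left to the caller."""
    def read_batch(path):
        with open(path, "rb") as f:
            batch = pickle.load(f, encoding="bytes")
        images = batch[b"data"].reshape(-1, 3, 32, 32)
        labels = np.array(batch[b"labels"])
        return images, labels

    train_parts = [read_batch(os.path.join(data_dir, f"data_batch_{i}")) for i in range(1, 6)]
    x_train = np.concatenate([p[0] for p in train_parts])
    y_train = np.concatenate([p[1] for p in train_parts])
    x_test, y_test = read_batch(os.path.join(data_dir, "test_batch"))
    return (x_train, y_train), (x_test, y_test)
```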
All interactions with this project are done through the run.py script, which serves as the main entry point.
Configuration files are written in JSON and define all the necessary parameters for model setup and training.
Default config files are located in the template/ directory. Below is the structure and meaning of each field:
{
"model_architecture_cfg": {
"type": "ResNet", // Backbone encoder type (e.g., "ResNet")
"instance_prsd_shape": [3, 32, 32], // Shape of one input instance (C, H, W)
"N_repr": 128, // Dimensionality of the projection head output
"detailed": null // Reserved for detailed architecture overrides (optional)
},
"training_cfg": {
"rng_seed": 0, // Random seed for reproducibility
"max_epochs": 100, // Maximum number of training epochs
"data": "cifar", // Dataset identifier
"M_minibatch": 16, // Mini-batch size
"train_fraction": 0.6, // Fraction of data used for training
"subset_size": null, // Optional subset size (useful for debugging)
"temperature": 0.5, // Temperature for contrastive loss
"optimization": {
"type": "adam", // Optimizer type (e.g., "adam", "sgd")
"lr": 1e-3 // Learning rate
}
}
}
- You can add or override parameters as needed.
- `subset_size` can be used for partial-data training.
- `temperature` is critical for contrastive-loss performance tuning.
- The `"detailed"` field is reserved for custom architecture options (e.g., ResNet variants or layer-specific configs).
Refer to the default files under template/ (e.g., config-resnet.json, config-simple.json) as starting points.
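For example, a quick smoke-test configuration might shrink `max_epochs` and set `subset_size`; all values below are illustrative:

```json
{
  "model_architecture_cfg": {
    "type": "ResNet",
    "instance_prsd_shape": [3, 32, 32],
    "N_repr": 128,
    "detailed": null
  },
  "training_cfg": {
    "rng_seed": 0,
    "max_epochs": 2,
    "data": "cifar",
    "M_minibatch": 16,
    "train_fraction": 0.6,
    "subset_size": 512,
    "temperature": 0.5,
    "optimization": { "type": "adam", "lr": 1e-3 }
  }
}
```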
python run.py --device <cpu|gpu> --mode <train|pred> [additional args]
To train a model, you must provide either:
- a new configuration file path (`--config`), or
- a previous checkpoint directory path (`--checkpoint`) to resume from.
# Start training from a config file
python run.py --device gpu --mode train --config config-resnet.json
# Resume training from an existing checkpoint
python run.py --device gpu --mode train --checkpoint checkpoint/Dt20250719124327UTC0
During training, progress and evaluation results for each epoch are printed to standard output in a structured format. Below is an example and explanation of the key components:
[EPOCH 5/99 @ Dt20250719125139UTC0]
Iterative optimization state:
100% |==============================| S2249/2249 [46s<000ms; 020ms/it; L0.0677]
Epoch optimization completed; proceeding to model evaluation stage.
Train set evaluation state:
100% |==============================| S2249/2249 [36s<000ms; 016ms/it; L0.0598]
Validation set evaluation state:
100% |==============================| S1124/1124 [18s<000ms; 016ms/it; L0.1637]
[EPOCH CONCLUSIVE REPORT]
Epoch time: 01m:41s
Performance measurement time: 54s
+------+-------------+---------------+--------------+
| | val_v_opt | val_v_worst | val_v_init |
|------+-------------+---------------+--------------|
| loss | 2.88 % | -96.62 % | -96.62 % |
+------+-------------+---------------+--------------+
+------+-----------------+-----------------+-----------------+-----------+-----------+
| | est_train_min | est_train_max | est_train_avg | train | val |
|------+-----------------+-----------------+-----------------+-----------+-----------|
| loss | 0.0111888 | 0.448736 | 0.0763622 | 0.0823384 | 0.0834423 |
+------+-----------------+-----------------+-----------------+-----------+-----------+
In the epoch summary table, the training loss columns are as follows:
| Column | Description |
|---|---|
| `est_train_min` | The minimum loss observed during minibatch updates within the epoch. |
| `est_train_max` | The maximum loss observed during minibatch updates within the epoch. |
| `est_train_avg` | The average loss computed over all minibatch updates (estimated average) during the epoch. |
| `train` | The true average loss computed over the entire training dataset after the epoch ends. |
| `val` | The true average loss computed over the entire validation dataset after the epoch ends. |
- The `est_*` values reflect the loss as observed iteratively within minibatch steps during training.
- The `train` and `val` values are computed precisely over the full respective datasets after completing all minibatch iterations.
This distinction helps diagnose training dynamics by comparing instantaneous minibatch losses with full-epoch averages.
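A toy illustration of this distinction, using a plain regression model rather than the contrastive setup (all names and values are illustrative):

```python
import torch
import torch.nn as nn

# Minibatch losses are recorded while the weights change (est_*); the "train"
# value is recomputed in a fresh pass over the data with the final weights.
torch.manual_seed(0)
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(10)]

minibatch_losses = []
for x, y in data:
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    minibatch_losses.append(loss.item())          # source of the est_* statistics

est_train_avg = sum(minibatch_losses) / len(minibatch_losses)
with torch.no_grad():                             # full pass with frozen final weights
    train_loss = sum(nn.functional.mse_loss(model(x), y).item() for x, y in data) / len(data)
print(f"est_train_avg={est_train_avg:.4f}  train={train_loss:.4f}")
```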
- Performance measurement time:
  The duration (in seconds) spent measuring model performance on the training and validation datasets after the epoch completes.
- Validation metric comparisons (shown as percentages in the summary table):

| Metric | Description |
|---|---|
| `val_v_opt` | Percentage difference between the current validation metric (e.g., loss) and the best (optimal) validation metric recorded so far. |
| `val_v_worst` | Percentage difference between the current validation metric and the worst validation metric recorded so far. |
| `val_v_init` | Percentage difference between the current validation metric and the validation metric measured before training started (initial weights). |
These metrics help track how the model’s validation performance evolves relative to its best, worst, and initial states throughout training.
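A plausible way these percentages are derived is a relative difference against the respective reference value; the exact formula and sign convention are assumptions, not confirmed from the code:

```python
def pct_diff(current, reference):
    """Percentage difference of the current validation metric relative to a reference value."""
    return 100.0 * (current - reference) / reference

# Illustrative values only (roughly consistent with the example epoch above):
val_v_opt   = pct_diff(current=0.0834, reference=0.0811)  # vs best validation loss so far
val_v_worst = pct_diff(current=0.0834, reference=2.47)    # vs worst validation loss so far
val_v_init  = pct_diff(current=0.0834, reference=2.47)    # vs loss at initial (untrained) weights
```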
To run inference or estimate representations, provide:
- the config file used during training (`--config`), and
- a model weights file (`--tparams`)
python run.py --device gpu --mode pred --config config-resnet.json --tparams model_final.pt
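The resulting representations can then be used for similarity ranking, as in Figures 2 and 3. A minimal cosine-similarity sketch, assuming an (N, D) tensor of representations, is shown below; the function name and return format are illustrative and not part of run.py.

```python
import torch
import torch.nn.functional as F

def rank_by_similarity(embeddings, basis_index, k=5):
    """Rank all images by cosine similarity of their representations to a basis image.
    embeddings: (N, D) tensor of representations, e.g. produced by the pred mode."""
    z = F.normalize(embeddings, dim=1)
    sims = z @ z[basis_index]                      # cosine similarity to the basis image
    order = torch.argsort(sims, descending=True)   # most similar first; index 0 is the basis itself
    return order[1:k + 1], order[-k:]              # top-k and bottom-k indices

# Illustrative usage with random data:
fake_embeddings = torch.randn(1000, 128)
top5, bottom5 = rank_by_similarity(fake_embeddings, basis_index=0)
```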
You can pass debug parameters for development purposes:
- `limit2SmallSubsetofData`: Use a small subset of the dataset
- `clearExports`: Remove exported checkpoints after execution
python run.py --device cpu --mode train --config config-test.json --debug limit2SmallSubsetofData clearExports
The current methodology relies heavily on the work of [1].
[1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. "A Simple Framework for Contrastive Learning of Visual Representations". Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR 119, 2020. Available at: https://arxiv.org/abs/2002.05709


