An Architectural Approach to Continual Learning Without Catastrophic Forgetting
Русский | Paper (LaTeX source)
Catastrophic forgetting occurs when neural networks trained on sequential tasks lose previously acquired knowledge. FCD (Frozen Core Decomposition) solves this through an architectural approach: task-specific weights are generated via a Tucker-style factorization with mode-3 contraction (often called Tucker-2 parameterization), and the shared core is frozen after the first task.
Note on terminology: We do not decompose each weight matrix $W_t$ separately. Instead, we parameterize a family of task-indexed weights via a shared 3-way core $S$ and factor matrices $U, V$, generating $W_t$ by contracting $S$ with the task vector $v_t$ along mode 3.
Key formula:

$$W_t = U\,(S \times_3 v_t)\,V, \qquad (S \times_3 v_t)_{ab} = \sum_{j=1}^{k} S_{abj}\, v_{t,j}$$

Where:
- $U \in \mathbb{R}^{d_{in} \times r}$, $V \in \mathbb{R}^{r \times d_{out}}$ — factor matrices
- $S \in \mathbb{R}^{r \times r \times k}$ — core tensor (frozen after task 1)
- $v_t \in \mathbb{R}^k$ — task-specific coefficient vector
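As a concrete illustration of the contraction, here is a minimal PyTorch sketch (not the library's `FCDModel` code; the dimensions mirror the defaults listed later in this README):

```python
import torch

# Dimensions as in the default configuration (input 784, hidden 256, r=32, k=16)
d_in, d_out, r, k = 784, 256, 32, 16

U = torch.randn(d_in, r)    # factor matrix, shared across tasks
S = torch.randn(r, r, k)    # core tensor, frozen after task 1
V = torch.randn(r, d_out)   # factor matrix, shared across tasks
v_t = torch.randn(k)        # task-specific coefficient vector

# Mode-3 contraction S x_3 v_t collapses the task axis into an (r, r) matrix
S_t = torch.einsum('abk,k->ab', S, v_t)

# Task-specific weight W_t = U (S x_3 v_t) V with shape (d_in, d_out)
W_t = U @ S_t @ V
assert W_t.shape == (d_in, d_out)
```

Only the k numbers in `v_t` are stored per task; `U`, `S`, and `V` are shared.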
- 🧊 Core Freezing — primary mechanism preventing forgetting
- 📐 Tucker-style mode-3 contraction — memory-efficient weight generation
- 🎯 Separation Loss — orthogonalization of task vectors (see the sketch after this list)
- 📊 Near-zero forgetting — <1% on all benchmarks
- 💾 Memory efficient — O(k) extra parameters per task (O(Tk) total) vs O(T×N) for storing separate networks
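A minimal sketch of one way to implement the separation loss listed above, assuming a pairwise squared-cosine penalty between task vectors (the repo's `fcd.losses.separation_loss` may differ in detail):

```python
import torch
import torch.nn.functional as F

def separation_penalty(task_vectors: torch.Tensor) -> torch.Tensor:
    """Mean squared cosine similarity over all distinct pairs of task vectors.

    task_vectors: (T, k) tensor; the penalty is 0 when all rows are orthogonal.
    """
    v = F.normalize(task_vectors, dim=1)                     # unit-norm rows
    gram = v @ v.t()                                         # (T, T) pairwise cosines
    off_diag = gram - torch.eye(v.size(0), device=v.device)
    denom = max(v.size(0) * (v.size(0) - 1), 1)              # number of ordered pairs
    return (off_diag ** 2).sum() / denom

# Total loss during task t: cross-entropy + lambda_sep * separation_penalty(v[:t+1])
```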
Split MNIST:
| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 96.1 ± 0.4% | 0.2 ± 0.2% |
| HAT | 82.9 ± 4.1% | 19.3 ± 5.1% |
| PackNet | 56.6 ± 2.6% | 40.6 ± 2.3% |
| DER++ | 56.6 ± 2.2% | 52.5 ± 2.9% |
| EWC | 57.0 ± 3.4% | 52.2 ± 4.3% |
| Fine-tuning | 55.8 ± 1.9% | 54.0 ± 2.4% |
Permuted MNIST:
| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 82.2 ± 0.4% | 0.2 ± 0.1% |
| EWC | 65.7 ± 2.7% | 2.6 ± 1.4% |
| HAT | 49.2 ± 7.1% | 35.5 ± 7.1% |
| Fine-tuning | 30.3 ± 2.9% | 68.6 ± 3.2% |
| PackNet | 25.4 ± 1.6% | 42.0 ± 2.3% |
| DER++ | 18.0 ± 6.6% | 32.8 ± 26.9% |
Split CIFAR-100 (MLP):
| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 50.5 ± 0.3% | 0.1 ± 0.1% |
| HAT | 27.1 ± 1.7% | 25.5 ± 1.7% |
| EWC | 15.9 ± 0.2% | 19.5 ± 0.8% |
| PackNet | 14.4 ± 0.2% | 40.1 ± 0.2% |
| Fine-tuning | 14.2 ± 0.1% | 43.9 ± 0.3% |
| DER++ | 13.7 ± 0.1% | 39.9 ± 0.5% |
Note: All methods use the same MLP architecture for fair comparison. HAT and PackNet were designed for CNNs.
FCD can be used as a lightweight adapter on top of frozen pretrained backbones. Results on Split CIFAR-100 (5 tasks, 20 classes each, 10 epochs):
| Method | Accuracy | Forgetting |
|---|---|---|
| ResNet-18 + FCD Adapter | 59.8% | 0.3% |
| ResNet-18 + Fine-tuning | 16.2% | 61.3% |
Key insight: FCD achieves a 200× reduction in forgetting (0.3% vs 61.3%) while maintaining 3.7× higher accuracy. The frozen backbone provides rich features, and the FCD adapter learns task-specific projections without catastrophic interference.
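Conceptually, the adapter setup looks like the sketch below. This is a simplified illustration rather than the repo's `FCDResNet18`; the 512-d ResNet-18 feature dimension is standard, but the single-layer factorized head and the parameter names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torchvision

class FrozenBackboneFCDHead(nn.Module):
    """Frozen ResNet-18 feature extractor + task-conditioned factorized head (sketch)."""

    def __init__(self, num_classes=20, feat_dim=512, r=64, k=32, num_tasks=5):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()            # expose the 512-d pooled features
        for p in backbone.parameters():
            p.requires_grad = False            # the backbone never changes
        self.backbone = backbone

        # Shared factorization (frozen after the first task in FCD)
        self.U = nn.Parameter(0.02 * torch.randn(feat_dim, r))
        self.S = nn.Parameter(0.02 * torch.randn(r, r, k))
        self.V = nn.Parameter(0.02 * torch.randn(r, num_classes))
        # One small coefficient vector per task
        self.task_vectors = nn.Parameter(0.02 * torch.randn(num_tasks, k))

    def forward(self, x, task_id: int):
        feats = self.backbone(x)                                       # (B, 512)
        S_t = torch.einsum('abk,k->ab', self.S, self.task_vectors[task_id])
        W_t = self.U @ S_t @ self.V                                    # (512, num_classes)
        return feats @ W_t                                             # task-specific logits
```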
To address the MLP limitation on CIFAR-100, we implement FCDCNN—a custom 4-block CNN (32→64→128→256 channels) trained from scratch with FCD adapter. Results on Split CIFAR-100 (10 tasks, 10 classes each, 20 epochs):
| Method | Accuracy | Forgetting |
|---|---|---|
| FCD + CNN | 57.6% | 1.3% |
| Fine-tuning CNN | 17.1% | 72.7% |
Key insight: FCD reduces forgetting by 56× (1.3% vs 72.7%) while achieving 3.4× higher accuracy. Unlike the pretrained ResNet-18, here the CNN is trained from scratch, showing FCD's core freezing works effectively without transfer learning.
FCD can be applied to Large Language Models as an alternative to LoRA for continual fine-tuning without forgetting.
GPT-2 (124M params), 3 sequential tasks, 5 epochs each:
| Configuration | Forgetting | Trainable Parameters |
|---|---|---|
| Soft FCD (separation loss) | 5.4% | 848K (0.7% of model) |
| Hard FCD (freeze core) | 0% | 576 (<0.01% of model) |
| Standard LoRA | ~30% | 768K (0.6% of model) |
Note: % of model = fraction of trainable parameters out of total (124M). FCD and LoRA use comparable parameter counts (~0.6-0.7%), but FCD preserves previous tasks.
Key advantages over LoRA:
- Multi-task in one adapter: switch tasks via `model.set_task(id)`
- No forgetting: previous tasks are preserved when learning new ones
- Same memory: similar parameter count to LoRA
Scales to: DeepSeek, LLaMA 2/3, Mistral, Qwen.
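For a LLaMA-style checkpoint, the same adapter config should only need different target-module names (`q_proj`/`v_proj` are the standard projection names in HuggingFace LLaMA implementations); whether `apply_fcd_to_model` handles such models without further changes is an assumption, sketched here:

```python
from transformers import AutoModelForCausalLM
from fcd.llm_adapter import FCDLoRAConfig, apply_fcd_to_model

# Hypothetical sketch: adapt a LLaMA-style model instead of GPT-2
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated checkpoint

config = FCDLoRAConfig(
    r=8,                                  # adapter rank
    task_rank=16,                         # task vector dimension
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA blocks
)
model = apply_fcd_to_model(base_model, config)
```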
Ablation study:
| Configuration | Accuracy | Forgetting |
|---|---|---|
| Full FCD | 96.1 ± 0.4% | 0.2 ± 0.2% |
| Without separation loss | 95.8 ± 0.5% | 0.2 ± 0.2% |
| Without core freezing | 93.2 ± 4.8% | 6.7 ± 6.1% |
| Minimal (no sep, no freeze) | 92.2 ± 5.2% | 8.1 ± 6.6% |
Key finding: core freezing is essential; without it, forgetting rises from 0.2% to 6.7%. The separation loss adds only a marginal accuracy improvement (96.1% vs 95.8%).
Scalability: what happens when the number of tasks T exceeds the task-vector dimension k (here k = 16)?
| Tasks T | Accuracy | Forgetting |
|---|---|---|
| 5 | 98.4% | 0.0% |
| 10 | 97.6% | 0.5% |
| 16 | 97.6% | 0.9% |
| 20 | 96.6% | 1.3% |
The method degrades gracefully — even at T=20 with k=16, accuracy stays above 96% with only 1.3% forgetting.
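The gradual degradation has a simple linear-algebra explanation: at most k task vectors in $\mathbb{R}^k$ can be exactly orthogonal, and for T > k the Welch bound gives the smallest possible worst-case overlap, which stays modest at these sizes. A small back-of-the-envelope calculation:

```python
import math

k = 16
for T in (5, 10, 16, 20):
    if T <= k:
        bound = 0.0  # up to k task vectors can be exactly orthogonal in R^k
    else:
        # Welch bound: T unit vectors in R^k must have max pairwise |cos| >= this
        bound = math.sqrt((T - k) / (k * (T - 1)))
    print(f"T={T:2d}: unavoidable max |cos(v_i, v_j)| >= {bound:.3f}")
# T=20, k=16 gives ~0.115: a small overlap, consistent with the mild accuracy drop above
```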
Memory footprint (total parameter counts):
| Tasks T | Baseline (params) | FCD (params) | Savings |
|---|---|---|---|
| 5 | 1,006,850 | 34,136 | 96.6% |
| 10 | 2,013,700 | 34,216 | 98.3% |
| 20 | 4,027,400 | 34,376 | 99.1% |
Note: in this repository's implementation, the per-task overhead includes the task vector $v_t$; from the table, the total grows by 16 = k parameters per task (e.g. 34,216 − 34,136 = 80 over 5 additional tasks), while the factors and core are counted once.
```bash
git clone https://github.com/infosave2007/fcd.git
cd fcd

# Install the library (recommended)
python -m pip install .

# Optional: benchmark/text extras
python -m pip install ".[benchmarks]"

# For development/editable installs (requires a recent pip)
python -m pip install --upgrade pip
python -m pip install -e .
```

Requirements: Python 3.9+, PyTorch 2.0+
Quick start (Split MNIST):

```python
from fcd import FCDModel, train_continual
from benchmarks import get_split_mnist

# Load benchmark
tasks = get_split_mnist()

# Create model
model = FCDModel(
    input_dim=784,
    hidden_dim=256,
    num_classes=2,
    task_rank=16,   # k: task vector dimension
    core_rank=32    # r: factorization rank
)

# Train on all tasks sequentially
results = train_continual(model, tasks, epochs=100)
print(f"Average Accuracy: {results['avg_accuracy']*100:.1f}%")
print(f"Average Forgetting: {results['avg_forgetting']*100:.1f}%")
```

Vision backbone (frozen ResNet-18 + FCD adapter) on Split CIFAR-100:

```python
from fcd import FCDResNet18, train_fcd_backbone
from fcd.backbone import get_cifar100_for_cnn  # or use a custom loader

# Load CIFAR-100 split into 10 tasks
tasks = get_cifar100_for_cnn(n_tasks=10)

# Create ResNet-18 with FCD adapter (backbone frozen, only the adapter trains)
model = FCDResNet18(
    num_classes=10,       # classes per task
    task_rank=32,
    core_rank=64,
    pretrained=True,
    freeze_backbone=True
)

# Train on all tasks
results = train_fcd_backbone(model, tasks, epochs=20, device='mps')
# Typical result: ~60% accuracy, <1% forgetting
```

FCD-LoRA adapter for continual fine-tuning of HuggingFace LLMs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from fcd.llm_adapter import FCDLoRAConfig, apply_fcd_to_model

# Load any HuggingFace model
base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # or deepseek, llama
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Apply FCD adapters (similar to LoRA, but with multi-task support)
config = FCDLoRAConfig(
    r=8,                                  # Adapter rank
    task_rank=16,                         # Task vector dimension
    target_modules=["c_attn", "c_proj"],  # Layers to adapt
)
model = apply_fcd_to_model(base_model, config)

# `train` and `generate` below stand for user-defined training / generation loops

# Train task 0
model.set_task("sentiment")
train(model, sentiment_data)
model.freeze_task("sentiment", freeze_core=False)  # soft FCD

# Train task 1 (no forgetting of task 0!)
model.set_task("qa")
train(model, qa_data)
model.freeze_task("qa")

# Inference: switch between tasks
model.set_task("sentiment")
generate(model, "This movie was")     # Uses sentiment knowledge
model.set_task("qa")
generate(model, "Question: What is")  # Uses QA knowledge
```

Training procedure:

```text
1. Initialize U, V, S randomly
2. For each task t = 0, 1, ..., T-1:
   a. Initialize v_t orthogonally (Gram-Schmidt)
   b. Train with L = L_CE + λ_sep * L_sep
   c. If t == 0: freeze U, V, S (core freezing)
   d. Freeze v_t
```
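A schematic PyTorch sketch of this schedule, reusing the `separation_penalty` helper sketched earlier. The attribute names (`U`, `V`, `S`, `task_vectors` as an `nn.ParameterList`, `set_task`) are illustrative assumptions; the repo's `train_continual` implements the real logic.

```python
import torch
import torch.nn.functional as F

def train_continually(model, tasks, lambda_sep=0.5, epochs=100, lr=0.01):
    """Sketch of the FCD schedule: train task by task, freeze the shared core
    (U, V, S) after task 0, and freeze each task vector when its task is done.
    `model.task_vectors` is assumed to be an nn.ParameterList, one vector per task."""
    for t, task_loader in enumerate(tasks):
        model.set_task(t)  # assumed: selects which v_t the model contracts with
        optimizer = torch.optim.SGD(
            [p for p in model.parameters() if p.requires_grad], lr=lr
        )
        for _ in range(epochs):
            for x, y in task_loader:
                logits = model(x)
                v_so_far = torch.stack([model.task_vectors[i] for i in range(t + 1)])
                loss = F.cross_entropy(logits, y) \
                    + lambda_sep * separation_penalty(v_so_far)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        if t == 0:
            # Core freezing: the shared factorization never changes again
            for p in (model.U, model.V, model.S):
                p.requires_grad_(False)
        # Freeze the finished task's vector as well
        model.task_vectors[t].requires_grad_(False)
```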
| vs Method | FCD Advantage |
|---|---|
| HAT | Simpler architecture (no attention masks), higher accuracy |
| PackNet | No pruning degradation, maintains capacity |
| DER++ | No sample storage required, O(Tk) vs O(buffer) |
| EWC | Hard isolation vs soft constraints |
The main result tables above are the source of truth for accuracy/forgetting. This table summarizes practical properties of the implementations in this repo.
| Method | Task ID at inference | Replay buffer |
|---|---|---|
| Fine-tuning | No | No |
| EWC | No | No |
| DER++ | No | Yes |
| HAT | Yes | No |
| PackNet | No* | No |
| FCD | Yes | No |
*Original PackNet typically requires a task-specific mask at inference; the current implementation keeps only a single global mask.
```
fcd/
├── python/
│   ├── fcd/                              # Core library
│   │   ├── model.py                      # FCDModel, FactorizedCoreTensor
│   │   ├── backbone.py                   # FCDResNet18, FCDViT adapters
│   │   ├── llm_adapter.py                # FCD-LoRA for LLMs (GPT-2, LLaMA, etc.)
│   │   ├── losses.py                     # separation_loss, total_loss
│   │   ├── training.py                   # train_task, train_continual
│   │   └── baselines.py                  # EWC, HAT, PackNet, DER++
│   ├── benchmarks/                       # Data loaders
│   │   └── loaders.py                    # Split MNIST, Permuted MNIST, CIFAR-100
│   ├── examples/                         # Example scripts
│   ├── full_comparison.py                # Run all methods
│   ├── vision_backbone_benchmark.py      # ResNet/ViT + FCD
│   ├── llm_fcd_example.py                # GPT-2/LLaMA + FCD-LoRA
│   └── ablation_study.py                 # Ablation experiments
├── FCD.tex                               # Paper source
├── LICENSE
└── README.md
```
```bash
# Full comparison (all methods, all benchmarks)
cd python
python full_comparison.py

# Fast sanity check (quick run, skips CIFAR-100)
python full_comparison.py --smoke

# Force a device
python full_comparison.py --device cpu
python full_comparison.py --device cuda   # NVIDIA GPU (requires CUDA-enabled PyTorch)
python full_comparison.py --device mps    # macOS GPU via Metal (MPS)

# Control what to run
python full_comparison.py --skip-cifar
python full_comparison.py --runs 5

# Ablation study
python ablation_study.py

# Vision backbones (ResNet-18, ViT) on Split CIFAR-100
python vision_backbone_benchmark.py --device mps
python vision_backbone_benchmark.py --skip-vit   # ResNet only (faster)

# Individual benchmarks
cd examples
python split_mnist_benchmark.py
python permuted_mnist_benchmark.py
```

Notes:
- If you don't pass `--device`, the scripts auto-select `mps` → `cuda` → `cpu`.
- `--smoke` is intended for quick "does it run?" checks; for paper-style numbers, run with the default epochs and multiple `--runs`.
| Parameter | Default | Description |
|---|---|---|
| r (core_rank) | 32 | Factorization rank |
| k (task_rank) | 16 | Task vector dimension |
| λ_sep | 0.5 | Separation loss weight |
| lr | 0.01 | Learning rate |
| epochs | 100 | Epochs per task |
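For reference, a sketch of how these defaults map onto the API from the quick start; whether `train_continual` accepts `lr` and `lambda_sep` keyword arguments under exactly these names is an assumption (see `fcd/training.py`):

```python
from fcd import FCDModel, train_continual
from benchmarks import get_split_mnist

tasks = get_split_mnist()

model = FCDModel(
    input_dim=784,
    hidden_dim=256,
    num_classes=2,
    core_rank=32,   # r: factorization rank
    task_rank=16,   # k: task vector dimension
)

# lr / lambda_sep keyword names are assumptions mirroring the table of defaults above
results = train_continual(model, tasks, epochs=100, lr=0.01, lambda_sep=0.5)
```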
Limitations:
- Task identity must be known at inference (task-aware setting)
- The task-vector dimension k bounds how many task vectors can be perfectly orthogonal; beyond k tasks the method degrades gracefully (see the scalability results above)
Resolved: CNN architecture for CIFAR-100 is now implemented (FCDCNN), achieving 57.6% accuracy with 1.3% forgetting.
```bibtex
@article{kirichenko2025fcd,
  author = {Kirichenko, Oleg},
  title  = {Frozen Core Decomposition: An Architectural Approach to Continual Learning Without Catastrophic Forgetting},
  year   = {2025},
  url    = {https://github.com/infosave2007/fcd}
}
```

If you find this project useful, please consider starring the repository.
MIT License
Oleg Kirichenko — urevich55@gmail.com