FCD: Frozen Core Decomposition

An Architectural Approach to Continual Learning Without Catastrophic Forgetting

Python 3.9+ PyTorch 2.0+ License: MIT

Russian version | Paper (LaTeX source)

Abstract

Catastrophic forgetting occurs when neural networks trained on sequential tasks lose previously acquired knowledge. FCD (Frozen Core Decomposition) solves this through an architectural approach: task-specific weights are generated via a Tucker-style factorization with mode-3 contraction (often called Tucker-2 parameterization), and the shared core is frozen after the first task.

Note on terminology: We do not decompose each weight matrix $W_t$ separately. Instead, we parameterize a family of task-indexed weights via a shared 3-way core $S$ and factors $U, V$, generating $W_t$ by contracting $S$ with task vector $v_t$ along mode 3.

Key formula: $$W_t = U \cdot (S \times_3 v_t) \cdot V$$

Where:

  • $U \in \mathbb{R}^{d_{in} \times r}$, $V \in \mathbb{R}^{r \times d_{out}}$ — factor matrices
  • $S \in \mathbb{R}^{r \times r \times k}$ — core tensor (frozen after task 1)
  • $v_t \in \mathbb{R}^k$ — task-specific coefficient vector
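
As a minimal, self-contained illustration of the formula above (plain PyTorch, not the library's internal implementation; the concrete dimension values are arbitrary), the task weights $W_t$ can be generated with a single einsum for the mode-3 contraction:

import torch

# Illustrative shapes following the definitions above
d_in, d_out, r, k = 784, 256, 32, 16

U = torch.randn(d_in, r)      # factor matrix U: (d_in, r)
V = torch.randn(r, d_out)     # factor matrix V: (r, d_out)
S = torch.randn(r, r, k)      # shared core tensor S: (r, r, k), frozen after task 1
v_t = torch.randn(k)          # task-specific coefficient vector: (k,)

# The mode-3 contraction S ×_3 v_t collapses the task axis into an (r, r) matrix,
# which is then sandwiched between U and V to form the task weights W_t.
S_t = torch.einsum('abk,k->ab', S, v_t)   # (r, r)
W_t = U @ S_t @ V                         # (d_in, d_out)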

Key Features

  • 🧊 Core Freezing — primary mechanism preventing forgetting
  • 📐 Tucker-style mode-3 contraction — memory-efficient weight generation
  • 🎯 Separation Loss — orthogonalization of task vectors
  • 📊 Near-zero forgetting — <1% on all benchmarks
  • 💾 Memory efficient — O(k) per-task overhead (O(T·k) total) vs O(T·N) for storing T separate networks

Results

Full Comparison with State-of-the-Art (3 runs, mean ± std)

Split MNIST (5 binary tasks)

| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 96.1 ± 0.4% | 0.2 ± 0.2% |
| HAT | 82.9 ± 4.1% | 19.3 ± 5.1% |
| PackNet | 56.6 ± 2.6% | 40.6 ± 2.3% |
| DER++ | 56.6 ± 2.2% | 52.5 ± 2.9% |
| EWC | 57.0 ± 3.4% | 52.2 ± 4.3% |
| Fine-tuning | 55.8 ± 1.9% | 54.0 ± 2.4% |

Permuted MNIST (10 tasks, 10 classes each)

| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 82.2 ± 0.4% | 0.2 ± 0.1% |
| EWC | 65.7 ± 2.7% | 2.6 ± 1.4% |
| HAT | 49.2 ± 7.1% | 35.5 ± 7.1% |
| Fine-tuning | 30.3 ± 2.9% | 68.6 ± 3.2% |
| PackNet | 25.4 ± 1.6% | 42.0 ± 2.3% |
| DER++ | 18.0 ± 6.6% | 32.8 ± 26.9% |

Split CIFAR-100 (10 tasks, 10 classes each)

| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 50.5 ± 0.3% | 0.1 ± 0.1% |
| HAT | 27.1 ± 1.7% | 25.5 ± 1.7% |
| EWC | 15.9 ± 0.2% | 19.5 ± 0.8% |
| PackNet | 14.4 ± 0.2% | 40.1 ± 0.2% |
| Fine-tuning | 14.2 ± 0.1% | 43.9 ± 0.3% |
| DER++ | 13.7 ± 0.1% | 39.9 ± 0.5% |

Note: All methods use the same MLP architecture for a fair comparison; HAT and PackNet were originally designed for CNNs, so this setting may understate their strength.

Vision Backbone Results (ResNet-18 + FCD Adapter)

FCD can be used as a lightweight adapter on top of frozen pretrained backbones. Results on Split CIFAR-100 (5 tasks, 20 classes each, 10 epochs):

| Method | Accuracy | Forgetting |
|---|---|---|
| ResNet-18 + FCD Adapter | 59.8% | 0.3% |
| ResNet-18 + Fine-tuning | 16.2% | 61.3% |

Key insight: FCD achieves 200× reduction in forgetting (0.3% vs 61.3%) while maintaining 3.7× higher accuracy. The frozen backbone provides rich features, and FCD adapter learns task-specific projections without catastrophic interference.

CNN Results (FCDCNN — trained from scratch)

To address the MLP limitation on CIFAR-100, we implement FCDCNN—a custom 4-block CNN (32→64→128→256 channels) trained from scratch with an FCD adapter. Results on Split CIFAR-100 (10 tasks, 10 classes each, 20 epochs):

| Method | Accuracy | Forgetting |
|---|---|---|
| FCD + CNN | 57.6% | 1.3% |
| Fine-tuning CNN | 17.1% | 72.7% |

Key insight: FCD reduces forgetting by 56× (1.3% vs 72.7%) while achieving 3.4× higher accuracy. Unlike the pretrained ResNet-18, here the CNN is trained from scratch, showing FCD's core freezing works effectively without transfer learning.

LLM Results (GPT-2 + FCD-LoRA)

FCD can be applied to Large Language Models as an alternative to LoRA for continual fine-tuning without forgetting.

GPT-2 (124M params), 3 sequential tasks, 5 epochs each:

| Configuration | Forgetting | Trainable Parameters |
|---|---|---|
| Soft FCD (separation loss) | 5.4% | 848K (0.7% of model) |
| Hard FCD (freeze core) | 0% | 576 (<0.01% of model) |
| Standard LoRA | ~30% | 768K (0.6% of model) |

Note: % of model = fraction of trainable parameters out of total (124M). FCD and LoRA use comparable parameter counts (~0.6-0.7%), but FCD preserves previous tasks.
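
The "% of model" column can be reproduced with quick arithmetic; a minimal sketch, assuming the rounded parameter counts from the table above and the 124M total:

# Sanity check of the "% of model" column (GPT-2 base has ~124M parameters)
total = 124_000_000
for name, trainable in [("Soft FCD", 848_000), ("Hard FCD", 576), ("Standard LoRA", 768_000)]:
    print(f"{name}: {trainable / total:.4%} of model")
# Prints approximately 0.68%, 0.0005%, and 0.62%, matching the table above.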

Key advantages over LoRA:

  • Multi-task in one adapter: Switch tasks via model.set_task(id)
  • No forgetting: Previous tasks preserved when learning new ones
  • Same memory: Similar parameter count to LoRA

Scales to: DeepSeek, LLaMA 2/3, Mistral, Qwen.

Ablation Study

| Configuration | Accuracy | Forgetting |
|---|---|---|
| Full FCD | 96.1 ± 0.4% | 0.2 ± 0.2% |
| Without separation loss | 95.8 ± 0.5% | 0.2 ± 0.2% |
| Without core freezing | 93.2 ± 4.8% | 6.7 ± 6.1% |
| Minimal (no sep, no freeze) | 92.2 ± 5.2% | 8.1 ± 6.6% |

Key finding: Core freezing is essential (+6.5% forgetting without it); the separation loss provides only a marginal improvement (+0.3% accuracy).

Scalability: T > k

What happens when the number of tasks T exceeds the task-vector dimension k?

| Tasks T | Accuracy | Forgetting |
|---|---|---|
| 5 | 98.4% | 0.0% |
| 10 | 97.6% | 0.5% |
| 16 | 97.6% | 0.9% |
| 20 | 96.6% | 1.3% |

The method degrades gracefully: even at T=20 with k=16, accuracy stays above 96% with only 1.3% forgetting.
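
Why forgetting creeps up past T = k: at most k vectors in $\mathbb{R}^k$ can be mutually orthogonal, so once T > k the task vectors must share directions. A quick check of this fact (plain PyTorch, not library code; T and k taken from the last row above):

import torch
import torch.nn.functional as F

k, T = 16, 20
V = F.normalize(torch.randn(k, T), dim=0)   # T unit-norm task vectors in R^k

gram = V.T @ V                              # (T, T) pairwise cosine similarities
print("rank of Gram matrix:", torch.linalg.matrix_rank(gram).item())  # at most k = 16
# With T > k the Gram matrix is rank-deficient, so the vectors cannot all be
# mutually orthogonal; some overlap between tasks is unavoidable, which is why
# forgetting grows slowly (but gracefully) in the table above.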

Memory Efficiency

| Tasks T | Baseline (params) | FCD (params) | Savings |
|---|---|---|---|
| 5 | 1,006,850 | 34,136 | 96.6% |
| 10 | 2,013,700 | 34,216 | 98.3% |
| 20 | 4,027,400 | 34,376 | 99.1% |

Note: in this repository’s implementation, per-task overhead includes both the task vector ($k$ params) and a small task-specific classifier head (multi-head setting). The coefficient vector is only $k=16$ parameters.
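
To reproduce such counts in your own runs, trainable parameters can be counted directly. A minimal generic sketch (the count_params helper below is illustrative, not part of the library; it works with any nn.Module, including the FCDModel from the Quick Start below):

import torch.nn as nn

def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Usage, e.g. after core freezing:
#   total, trainable = count_params(model)
#   print(f"{total:,} total / {trainable:,} trainable")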

Installation

git clone https://github.com/infosave2007/fcd.git
cd fcd

# Install the library (recommended)
python -m pip install .

# Optional: benchmark/text extras
python -m pip install ".[benchmarks]"

# For development/editable installs (requires a recent pip)
python -m pip install --upgrade pip
python -m pip install -e .

Requirements: Python 3.9+, PyTorch 2.0+

Quick Start

from fcd import FCDModel, train_continual
from benchmarks import get_split_mnist

# Load benchmark
tasks = get_split_mnist()

# Create model
model = FCDModel(
    input_dim=784,
    hidden_dim=256,
    num_classes=2,
    task_rank=16,    # k: task vector dimension
    core_rank=32     # r: factorization rank
)

# Train on all tasks sequentially
results = train_continual(model, tasks, epochs=100)

print(f"Average Accuracy: {results['avg_accuracy']*100:.1f}%")
print(f"Average Forgetting: {results['avg_forgetting']*100:.1f}%")

Vision Backbone Example (ResNet-18 / ViT)

from fcd import FCDResNet18, train_fcd_backbone
from fcd.backbone import get_cifar100_for_cnn  # or use custom loader

# Load CIFAR-100 split into 10 tasks
tasks = get_cifar100_for_cnn(n_tasks=10)

# Create ResNet-18 with FCD adapter (backbone frozen, only adapter trains)
model = FCDResNet18(
    num_classes=10,      # classes per task
    task_rank=32,
    core_rank=64,
    pretrained=True,
    freeze_backbone=True
)

# Train on all tasks
results = train_fcd_backbone(model, tasks, epochs=20, device='mps')
# Typical result: ~60% accuracy, <1% forgetting

LLM Example (GPT-2 / DeepSeek / LLaMA)

from transformers import AutoModelForCausalLM, AutoTokenizer
from fcd.llm_adapter import FCDLoRAConfig, apply_fcd_to_model

# Load any HuggingFace model
base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # or deepseek, llama
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Apply FCD adapters (similar to LoRA, but with multi-task support)
config = FCDLoRAConfig(
    r=8,                  # Adapter rank
    task_rank=16,         # Task vector dimension
    target_modules=["c_attn", "c_proj"],  # Layers to adapt
)
model = apply_fcd_to_model(base_model, config)
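
# NOTE: train(), generate(), sentiment_data and qa_data below are user-supplied
# placeholders (your own fine-tuning loop and generation helper); they are not
# part of the fcd library.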

# Train task 0
model.set_task("sentiment")
train(model, sentiment_data)
model.freeze_task("sentiment", freeze_core=False)  # soft FCD

# Train task 1 (no forgetting of task 0!)
model.set_task("qa")
train(model, qa_data)
model.freeze_task("qa")

# Inference: switch between tasks
model.set_task("sentiment")
generate(model, "This movie was")  # Uses sentiment knowledge

model.set_task("qa")
generate(model, "Question: What is")  # Uses QA knowledge

Method Overview

Training Protocol

1. Initialize U, V, S randomly
2. For each task t = 0, 1, ..., T-1:
   a. Initialize v_t orthogonally (Gram-Schmidt)
   b. Train with L = L_CE + λ_sep * L_sep
   c. If t == 0: Freeze U, V, S (core freezing)
   d. Freeze v_t
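
The Quick Start's train_continual wraps this protocol. As a rough, self-contained sketch of what the loop does (a toy single-layer model, not the library's FCDModel; ToyFCDLinear, separation_loss and train_tasks are illustrative names, and the separation loss shown is a plausible cosine-similarity penalty that may differ from fcd/losses.py):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFCDLinear(nn.Module):
    """Toy FCD-parameterized linear layer: W_t = U · (S ×_3 v_t) · V."""
    def __init__(self, d_in, d_out, r=8, k=4, n_tasks=3):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_in, r) * 0.1)
        self.V = nn.Parameter(torch.randn(r, d_out) * 0.1)
        self.S = nn.Parameter(torch.randn(r, r, k) * 0.1)
        v = torch.empty(n_tasks, k)
        nn.init.orthogonal_(v)                 # step 2a: (near-)orthogonal task vectors
        self.v = nn.Parameter(v)

    def forward(self, x, t):
        W_t = self.U @ torch.einsum('abk,k->ab', self.S, self.v[t]) @ self.V
        return x @ W_t

def separation_loss(v, t):
    """Penalize overlap of v_t with earlier (detached) task vectors."""
    if t == 0:
        return v.new_zeros(())
    cos = F.cosine_similarity(v[:t].detach(), v[t].unsqueeze(0), dim=1)
    return (cos ** 2).mean()

def train_tasks(model, tasks, lam_sep=0.5, epochs=50, lr=1e-2):
    for t, (x, y) in enumerate(tasks):         # step 2: tasks arrive sequentially
        params = [p for p in model.parameters() if p.requires_grad]
        opt = torch.optim.SGD(params, lr=lr)
        for _ in range(epochs):                # step 2b: L = L_CE + λ_sep · L_sep
            loss = F.cross_entropy(model(x, t), y) + lam_sep * separation_loss(model.v, t)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if t == 0:                             # step 2c: freeze the shared core after task 0
            for p in (model.U, model.V, model.S):
                p.requires_grad_(False)
        # step 2d: v_t stays fixed afterwards; later tasks receive no gradient for
        # earlier rows of model.v (the forward pass uses only v[t] and the
        # separation loss detaches previous vectors).

# Example usage with random data (2 tasks, 10-dim inputs, 2 classes):
#   tasks = [(torch.randn(64, 10), torch.randint(0, 2, (64,))) for _ in range(2)]
#   train_tasks(ToyFCDLinear(10, 2), tasks)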

Why FCD Works

| vs. Method | FCD Advantage |
|---|---|
| HAT | Simpler architecture (no attention masks), higher accuracy |
| PackNet | No pruning degradation, maintains capacity |
| DER++ | No sample storage required, O(T·k) vs O(buffer) |
| EWC | Hard isolation vs soft constraints |

Method Comparison (Practical Setup)

The main result tables above are the source of truth for accuracy/forgetting. This table summarizes practical properties of the implementations in this repo.

| Method | Task ID at inference | Replay buffer | Parameter growth with tasks |
|---|---|---|---|
| Fine-tuning | No | No | $O(N)$ |
| EWC | No | No | $O(N)$ params + $O(N)$ Fisher/opt params |
| DER++ | No | Yes | $O(N)$ params + $O(B)$ buffer |
| HAT | Yes | No | $O(N)$ params + $O(T\cdot H)$ task embeddings |
| PackNet | No* | No | $O(N)$ params (this simplified impl does not store per-task masks) |
| FCD | Yes | No | $O(N_{core}) + O(T\cdot(k + H\cdot C))$ (task vector + task head) |

*Original PackNet typically requires a task-specific mask at inference; the current implementation keeps only a single global mask. Notation: $N$ is the number of base network parameters, $B$ the replay buffer size, $T$ the number of tasks, $k$ the task-vector dimension, $H$ the hidden width, and $C$ the number of classes per task.

Project Structure

fcd/
├── python/
│   ├── fcd/                 # Core library
│   │   ├── model.py         # FCDModel, FactorizedCoreTensor
│   │   ├── backbone.py      # FCDResNet18, FCDViT adapters
│   │   ├── llm_adapter.py   # FCD-LoRA for LLMs (GPT-2, LLaMA, etc.)
│   │   ├── losses.py        # separation_loss, total_loss
│   │   ├── training.py      # train_task, train_continual
│   │   └── baselines.py     # EWC, HAT, PackNet, DER++
│   ├── benchmarks/          # Data loaders
│   │   └── loaders.py       # Split MNIST, Permuted MNIST, CIFAR-100
│   ├── examples/            # Example scripts
│   ├── full_comparison.py   # Run all methods
│   ├── vision_backbone_benchmark.py  # ResNet/ViT + FCD
│   ├── llm_fcd_example.py   # GPT-2/LLaMA + FCD-LoRA
│   └── ablation_study.py    # Ablation experiments
├── FCD.tex                  # Paper source
├── LICENSE
└── README.md

Running Experiments

# Full comparison (all methods, all benchmarks)
cd python
python full_comparison.py

# Fast sanity check (quick run, skips CIFAR-100)
python full_comparison.py --smoke

# Force a device
python full_comparison.py --device cpu
python full_comparison.py --device cuda   # NVIDIA GPU (requires CUDA-enabled PyTorch)
python full_comparison.py --device mps    # macOS GPU via Metal (MPS)

# Control what to run
python full_comparison.py --skip-cifar
python full_comparison.py --runs 5

# Ablation study
python ablation_study.py

# Vision backbones (ResNet-18, ViT) on Split CIFAR-100
python vision_backbone_benchmark.py --device mps
python vision_backbone_benchmark.py --skip-vit  # ResNet only (faster)

# Individual benchmarks
cd examples
python split_mnist_benchmark.py
python permuted_mnist_benchmark.py

Notes:

  • If you don't pass --device, the scripts auto-select a device (mps, cuda, or cpu).
  • --smoke is intended for quick “does it run?” checks; for paper-style numbers, run with default epochs and multiple --runs.

Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| r (core_rank) | 32 | Factorization rank |
| k (task_rank) | 16 | Task vector dimension |
| λ_sep | 0.5 | Separation loss weight |
| lr | 0.01 | Learning rate |
| epochs | 100 | Epochs per task |

Limitations

  • Task identity must be known at inference (task-aware setting)
  • The task-vector dimension k bounds the number of mutually orthogonal task vectors; beyond k tasks, performance degrades gracefully (see Scalability above)

Resolved: CNN architecture for CIFAR-100 is now implemented (FCDCNN), achieving 57.6% accuracy with 1.3% forgetting.

Citation

@article{kirichenko2025fcd,
  author = {Kirichenko, Oleg},
  title = {Frozen Core Decomposition: An Architectural Approach to Continual Learning Without Catastrophic Forgetting},
  year = {2025},
  url = {https://github.com/infosave2007/fcd}
}

Support

If you find this project useful:

Support via Tribute 🙏

License

MIT License

Author

Oleg Kirichenko (urevich55@gmail.com)
