Generate realistic synthetic OHLCV (Open, High, Low, Close, Volume) data for backtesting trading strategies without overfitting to historical data.
Traditional synthetic data generators (like the original QuantGAN) generate only log returns, a single column. But for support/resistance and breakout strategies, you need:
- Full OHLC structure to detect S/R levels (highs and lows matter!)
- Candle patterns (wicks, body ratios) for rejection signals
- Volume for breakout confirmation
- Volatility clustering (periods of calm followed by explosions)
This modified QuantGAN generates 6 correlated features that preserve these relationships:
| Feature | What It Captures |
|---|---|
| `log_return` | Price movement direction and magnitude |
| `body_ratio` | Bullish/bearish sentiment (-1 to 1) |
| `upper_wick` | Rejection from highs (resistance) |
| `lower_wick` | Rejection from lows (support) |
| `range_pct` | Volatility / candle size |
| `volume_norm` | Volume patterns (spikes on breakouts) |
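As a rough illustration, the features can be derived from raw candles along these lines. This is a sketch only: `preprocess_ohlcv.py` is authoritative, and the exact formulas (especially the `volume_norm` window) are assumptions here.

```python
# Illustrative feature extraction; exact definitions are assumptions,
# not the repo's code - see preprocess_ohlcv.py for the real ones.
import numpy as np
import pandas as pd

def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    rng = (df["high"] - df["low"]).replace(0, np.nan)   # guard flat candles
    body_top = df[["open", "close"]].max(axis=1)
    body_bot = df[["open", "close"]].min(axis=1)
    return pd.DataFrame({
        "log_return": np.log(df["close"]).diff(),          # direction and magnitude
        "body_ratio": (df["close"] - df["open"]) / rng,    # -1 (bearish) .. 1 (bullish)
        "upper_wick": (df["high"] - body_top) / rng,       # rejection from highs
        "lower_wick": (body_bot - df["low"]) / rng,        # rejection from lows
        "range_pct": (df["high"] - df["low"]) / df["open"],              # candle size
        "volume_norm": df["volume"] / df["volume"].rolling(288).mean(),  # vs ~1 day of 5m bars
    }).dropna()
```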
```bash
cd QuantGAN
pip install -r requirements.txt
```

For GPU training (recommended):

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118  # CUDA 11.8
# or
pip install torch --index-url https://download.pytorch.org/whl/cu121  # CUDA 12.1
```
```bash
# Preprocess BTC data
python preprocess_ohlcv.py ../testdata/BTCUSDT_5m_365d.csv \
    --output_dir preprocessed \
    --seq_length 64

# Preprocess ETH data
python preprocess_ohlcv.py ../testdata/ETHUSDT_5m_365d.csv \
    --output_dir preprocessed \
    --seq_length 64
```

This creates:
- `preprocessed/BTCUSDT_5m_365d_sequences.npy` - Training sequences
- `preprocessed/BTCUSDT_5m_365d_features.csv` - Feature data
- `preprocessed/BTCUSDT_5m_365d_params.json` - Transform parameters
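A quick sanity check on the preprocessed sequences before training (assuming the array layout is `[n_sequences, seq_length, n_features]`):

```python
import numpy as np

seqs = np.load("preprocessed/BTCUSDT_5m_365d_sequences.npy")
print(seqs.shape)                        # expected (N, 64, 6) with --seq_length 64
print(seqs.dtype, np.isnan(seqs).any()) # dtype and NaN check
```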
```bash
# Train on BTC (GPU recommended)
python train_ohlcv.py --name BTCUSDT_5m_365d \
    --epochs 200 \
    --batch_size 64 \
    --cuda

# Train on ETH
python train_ohlcv.py --name ETHUSDT_5m_365d \
    --epochs 200 \
    --batch_size 64 \
    --cuda
```

Training creates checkpoints in `checkpoints/<name>_<timestamp>/`:
- `netG_best.pth` - Best generator (lowest loss)
- `netG_final.pth` - Final generator
- `train_info.json` - Model configuration
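To inspect a run before generating, something like the following works (assuming `netG_best.pth` is a plain PyTorch state dict; the directory name contains your run's timestamp):

```python
import json
import torch

run_dir = "checkpoints/BTCUSDT_5m_365d_tcn_<timestamp>"  # substitute your run

with open(f"{run_dir}/train_info.json") as f:
    print(json.load(f))    # hyperparameters the run was trained with

state = torch.load(f"{run_dir}/netG_best.pth", map_location="cpu")
print(list(state)[:5])     # first few parameter names, if it's a state dict
```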
Training time estimates:
- CPU: ~2-4 hours for 200 epochs
- GPU (RTX 3080+): ~15-30 minutes
```bash
# Generate 10 synthetic BTC datasets (each ~6400 candles ≈ 22 days of 5m data)
python generate_ohlcv.py \
    --checkpoint_dir checkpoints/BTCUSDT_5m_365d_tcn_<timestamp> \
    --n_sequences 100 \
    --n_datasets 10 \
    --initial_price 50000 \
    --output_dir ../testdata/synthetic \
    --output_prefix BTCUSDT_synthetic

# Generate synthetic ETH
python generate_ohlcv.py \
    --checkpoint_dir checkpoints/ETHUSDT_5m_365d_tcn_<timestamp> \
    --n_sequences 100 \
    --n_datasets 10 \
    --initial_price 3500 \
    --output_dir ../testdata/synthetic \
    --output_prefix ETHUSDT_synthetic
```

Each output file is a standard OHLCV CSV:

```
time,open,high,low,close,volume
1702000000,50000.00,50125.50,49875.25,50100.00,123.45678
1702000300,50100.00,50200.00,50050.00,50175.50,145.67890
...
```
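A minimal sanity check on a generated file (the file name below is a placeholder for whatever your `--output_prefix` produced):

```python
import pandas as pd

df = pd.read_csv("../testdata/synthetic/BTCUSDT_synthetic_0.csv")  # placeholder name

# OHLC invariants: high bounds the candle body from above, low from below
assert (df["high"] >= df[["open", "close"]].max(axis=1)).all()
assert (df["low"] <= df[["open", "close"]].min(axis=1)).all()

# 5-minute bars: unix timestamps should advance in 300-second steps
assert (df["time"].diff().dropna() == 300).all()
```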
The GAN uses a Temporal Convolutional Network (TCN) architecture with:
- Dilated convolutions for capturing long-range patterns (S/R levels that persist)
- Self-attention for learning global dependencies (market regime awareness)
- Spectral normalization for stable WGAN-GP training
- OHLCV constraint loss to ensure valid candle relationships
Generator:

```
Noise (64-dim) → Project → [ResBlock + Attention] × 4 → OHLCV Features (6-dim)
```

Discriminator:

```
OHLCV Features (6-dim) → [ResBlock + Attention] × 4 → Pool → Real/Fake Score
```
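For intuition, here is a minimal sketch of what one spectral-normalized, dilated residual block can look like; the class and argument names are illustrative, not the repo's actual modules:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class TCNResBlock(nn.Module):
    """Illustrative residual block: two dilated causal convolutions,
    each wrapped in spectral normalization for WGAN-GP stability."""
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = (kernel_size - 1) * dilation   # left-pad so the output stays causal
        self.pad = nn.ConstantPad1d((pad, 0), 0.0)
        self.conv1 = spectral_norm(nn.Conv1d(channels, channels, kernel_size, dilation=dilation))
        self.conv2 = spectral_norm(nn.Conv1d(channels, channels, kernel_size, dilation=dilation))
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len)
        h = self.act(self.conv1(self.pad(x)))
        h = self.act(self.conv2(self.pad(h)))
        return x + h                         # residual connection
```

Stacking such blocks with dilations 1, 2, 4, ... grows the receptive field exponentially, which is why adding layers (`--n_layers`) extends the temporal horizon the model can capture.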
The main training knobs:

```bash
# --model_type: 'tcn' (default) or 'lstm' for slower but potentially better long-term patterns
# --hidden_dim: larger = more capacity, slower training
# --n_layers: more layers = longer temporal receptive field
# --noise_dim: more noise = more diversity in generations
# --lr_g: lower the learning rate if training is unstable
# --n_critic: D updates per G update (WGAN-GP)
python train_ohlcv.py --name BTCUSDT_5m_365d \
    --model_type tcn \
    --hidden_dim 256 \
    --n_layers 6 \
    --noise_dim 128 \
    --lr_g 0.00005 \
    --n_critic 5 \
    --epochs 500
```
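For context on `--n_critic` and the stability flags, this is roughly what the WGAN-GP gradient penalty computes (a generic sketch; `netD` and `lambda_gp` are placeholder names, not the repo's):

```python
import torch

def gradient_penalty(netD, real: torch.Tensor, fake: torch.Tensor,
                     lambda_gp: float = 10.0) -> torch.Tensor:
    """Penalize the critic when its gradient norm on real/fake
    interpolates deviates from 1 (the WGAN-GP regularizer)."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, device=real.device)   # per-sample mixing weight
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = netD(interp)
    grads = torch.autograd.grad(
        outputs=score, inputs=interp,
        grad_outputs=torch.ones_like(score),
        create_graph=True,                               # penalty is backpropagated
    )[0]
    grad_norm = grads.reshape(batch, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```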
To resume an interrupted run, point at the saved checkpoints:

```bash
python train_ohlcv.py --name BTCUSDT_5m_365d \
    --resume_G checkpoints/.../netG_epoch_100.pth \
    --resume_D checkpoints/.../netD_epoch_100.pth \
    --epochs 200
```

To make generation reproducible, fix the seed:

```bash
python generate_ohlcv.py \
    --checkpoint_dir checkpoints/... \
    --n_datasets 5 \
    --seed 42   # same seed = same synthetic data
```
Tips for better results:

- Train on more data: the more real data, the better the GAN learns its patterns
- Longer sequences: try `--seq_length 128` for longer temporal dependencies
- Multiple assets: train separate models for BTC and ETH, or combine them
- Validate visually: plot synthetic vs. real data to check for obvious issues
- Statistical tests: compare return distributions, autocorrelations, etc. (see the sketch below)
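As a starting point for those statistical tests, a two-sample KS test on returns plus a lag-1 autocorrelation of absolute returns catches the most common failures (file names below are placeholders):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def log_returns(df: pd.DataFrame) -> np.ndarray:
    return np.diff(np.log(df["close"].to_numpy()))

def lag1_autocorr(x: np.ndarray) -> float:
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

real = log_returns(pd.read_csv("../testdata/BTCUSDT_5m_365d.csv"))
synth = log_returns(pd.read_csv("../testdata/synthetic/BTCUSDT_synthetic_0.csv"))  # placeholder

stat, pval = ks_2samp(real, synth)          # compare the return distributions
print(f"KS statistic={stat:.4f} p={pval:.4f}")

# Volatility clustering should survive generation: absolute returns of
# both real and synthetic data should be positively autocorrelated.
print("real  |r| lag-1 autocorr:", lag1_autocorr(np.abs(real)))
print("synth |r| lag-1 autocorr:", lag1_autocorr(np.abs(synth)))
```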
**Training is unstable (loss oscillating wildly)**

- Reduce learning rates: `--lr_g 0.00005 --lr_d 0.00005`
- Increase critic steps: `--n_critic 10`

**Generated data looks too smooth/boring**

- Train longer
- Increase the noise dimension: `--noise_dim 128`
- Check whether your preprocessing lost volatility
**Generated data has invalid OHLC (high < low)**

- This should be handled by `features_to_ohlcv()`, but if it persists, increase `--lambda_constraint` (a clamp like the sketch below can patch bad candles downstream)
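Any `features_to_ohlcv()`-style reconstruction must satisfy `high >= max(open, close)` and `low <= min(open, close)`; an illustrative clamp (not the repo's implementation) that enforces this:

```python
import numpy as np

def enforce_ohlc_invariants(o: np.ndarray, h: np.ndarray,
                            l: np.ndarray, c: np.ndarray):
    """Clamp candles so high/low always bound the candle body."""
    h = np.maximum.reduce([h, o, c])
    l = np.minimum.reduce([l, o, c])
    return o, h, l, c
```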
**Out of memory on GPU**

- Reduce the batch size: `--batch_size 32` or `--batch_size 16`
- Reduce the hidden dimension: `--hidden_dim 64`
MIT License - see LICENSE for details. Use freely, but keep the attribution!