Ali Momennasab, Matthew Kwong, Brianna Ha, Emily Chiu, Bill Kim, Matthew Plascencia
Department of Computer Science, California State Polytechnic University, Pomona
This project explores the use of generative AI — specifically Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Wasserstein GANs (WGANs) — to perform genre transfer in symbolic music.
We found that VAEs performed the best, creating stable and realistic songs compared to GANs and WGANs.
VAE:
- Learns a probabilistic latent space with an encoder + decoder.
- Optimizes reconstruction loss + KL divergence.
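The VAE objective above (reconstruction loss plus KL divergence) can be sketched numerically. This is an illustrative NumPy version, not the project's actual code; the function name and array shapes are assumptions:

```python
import numpy as np

def vae_loss(x, recon_logits, mu, log_var):
    """ELBO-style VAE loss for token sequences (illustrative sketch).

    x           : (seq_len,) array of target token ids
    recon_logits: (seq_len, vocab) decoder logits
    mu, log_var : parameters of the diagonal Gaussian posterior q(z|x)
    """
    # Reconstruction term: mean categorical cross-entropy per token.
    shifted = recon_logits - recon_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    recon = -log_probs[np.arange(len(x)), x].mean()
    # KL(q(z|x) || N(0, I)): closed form for a diagonal Gaussian posterior.
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon + kl
```

In practice the KL term is often weighted (beta-VAE style) to balance reconstruction quality against latent-space regularity.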
GAN:
- Generator produces sequences from random noise.
- Discriminator evaluates real vs. fake sequences.
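The adversarial objective pairs the two networks: the discriminator is trained to score real sequences high and generated ones low, while the generator is trained to fool it. A minimal NumPy sketch of the two losses (function names are illustrative, not the project's code):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator loss (binary cross-entropy).
    d_real / d_fake are the discriminator's sigmoid outputs in (0, 1)."""
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())

def generator_loss(d_fake):
    """Non-saturating generator loss: maximize log D(G(z))."""
    return -np.log(d_fake).mean()
```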
WGAN:
- Replaces the discriminator with a critic trained with the Wasserstein distance as its loss.
- Enforces Lipschitz continuity via weight clipping or a gradient penalty, resulting in more stable training than a standard GAN.
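The key WGAN change is that the critic outputs unbounded scores rather than probabilities, and its loss estimates the (negative) Wasserstein distance between real and generated distributions. A minimal sketch, with weight clipping as the Lipschitz mechanism (gradient penalty needs autodiff and is omitted here; names are illustrative):

```python
import numpy as np

def critic_loss(c_real, c_fake):
    """WGAN critic loss: estimate of the negative Wasserstein distance.
    c_real / c_fake are unbounded critic scores (no sigmoid)."""
    return c_fake.mean() - c_real.mean()

def clip_weights(weights, c=0.01):
    """Weight clipping to (crudely) enforce the Lipschitz constraint,
    applied to every parameter array after each critic update."""
    return [np.clip(w, -c, c) for w in weights]
```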
Token Encoding:
- Converted MIDI sequences into tokens with pretty_midi.
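The encoding step turns each note's onset and offset into NOTE ON / NOTE OFF events separated by quantized TIME SHIFT tokens. A self-contained sketch of this idea, operating on (pitch, start, end) tuples such as those carried by pretty_midi Note objects (the exact token scheme and names here are assumptions, not the project's code):

```python
def encode_notes(notes, step=0.05):
    """Convert (pitch, start_sec, end_sec) note tuples into
    NOTE_ON / NOTE_OFF / TIME_SHIFT event tokens (illustrative sketch)."""
    events = []  # (time, order, token); order 0 puts offs before ons at ties
    for pitch, start, end in notes:
        events.append((start, 1, f"NOTE_ON_{pitch}"))
        events.append((end, 0, f"NOTE_OFF_{pitch}"))
    events.sort()
    tokens, t = [], 0.0
    for time, _, tok in events:
        shift = round((time - t) / step)  # quantize the gap to 0.05 s steps
        if shift > 0:
            tokens.append(f"TIME_SHIFT_{shift}")
            t += shift * step
        tokens.append(tok)
    return tokens
```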
Token Decoding:
- Greedy Sampling: always select the most probable token.
- Temperature Sampling: adds randomness relative to greedy sampling by scaling the softmax distribution with a temperature parameter.
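The two decoding strategies can be expressed in one function: as temperature approaches 0 the softmax collapses onto the argmax (greedy), while higher temperatures flatten the distribution and add randomness. An illustrative NumPy sketch (function name is an assumption):

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample the next token id from decoder logits.
    temperature == 0 means greedy; larger values add randomness."""
    if temperature == 0.0:
        return int(np.argmax(logits))  # greedy sampling
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))
```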
- From the MIDI-VAE dataset:
- 559 Jazz samples
- 1,069 Pop samples
- 592 Classical samples
- Preprocessing:
- Tokenized with pretty_midi into NOTE ON, NOTE OFF, and TIME SHIFT tokens.
- Fixed temporal resolution: 0.05 s
- Max sequence length: 500 tokens
- Train/Validation split: 80/20
- Testing: selected a random song from the dataset and performed genre transfer on it
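The truncation and split steps above amount to a few lines. A minimal sketch under the stated settings (500-token cap, 80/20 split); the function and its signature are illustrative, not the project's code:

```python
import random

def preprocess(token_seqs, max_len=500, val_frac=0.2, seed=0):
    """Truncate each token sequence to max_len, shuffle, and split
    into train/validation sets (illustrative sketch)."""
    seqs = [s[:max_len] for s in token_seqs]
    random.Random(seed).shuffle(seqs)  # fixed seed for a reproducible split
    n_val = int(len(seqs) * val_frac)
    return seqs[n_val:], seqs[:n_val]  # train, validation
```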
| Hyperparameter | GAN | WGAN | VAE |
|---|---|---|---|
| Latent Dim | - | - | 64 |
| Hidden Dim | 256 | 256 | 256 |
| Embedding Dim | 128 | 128 | 128 |
| Batch Size | 16 | 16 | 16 |
| Learning Rate | 0.0001 | 0.0001 | 0.001 |
| Epochs | 1000 | 1000 | 1000 |
| Optimizer | Adam | Adam | Adam |
GAN:
- The discriminator struggled to learn; we attribute this to token imbalance.
- Output: the generator produced mostly silence and random notes.
WGAN:
- Improved training stability over the GAN, but still did not converge.
- Output: silence and random notes, similar to the vanilla GAN.
VAE:
- Consistent convergence with stable training.
- Output: coherent, genre-conditioned MIDI sequences.
- The VAE significantly outperformed both the GAN and WGAN in creating realistic, coherent music sequences.
- Brunner, G., Konrad, A., Wang, Y., & Wattenhofer, R. MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer.
- MIDI-VAE GitHub Repository
- GeeksforGeeks – Variational AutoEncoders
- GeeksforGeeks – GAN
- GeeksforGeeks – WGANs