
Music Genre Transformation with Generative AI

Ali Momennasab, Matthew Kwong, Brianna Ha, Emily Chiu, Bill Kim, Matthew Plascencia

Department of Computer Science, California State Polytechnic University, Pomona


Abstract

This project explores the use of generative AI — specifically Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Wasserstein GANs (WGANs) — to perform genre transfer in symbolic music.

We found that the VAE performed best, producing more stable and realistic music than the GAN and WGAN.


Methods

Variational Autoencoder (VAE)

  • Learns a probabilistic latent space with encoder + decoder.
  • Optimizes reconstruction loss + KL divergence.
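
A minimal sketch of that objective, assuming a PyTorch token-sequence model; the function and tensor names below are illustrative rather than the project's actual code.

```python
# VAE objective sketch: token reconstruction loss + KL divergence to a unit Gaussian prior.
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps so gradients can flow through the encoder.
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(logits, targets, mu, logvar, beta=1.0):
    # logits: (batch, seq_len, vocab_size) decoder outputs
    # targets: (batch, seq_len) integer token ids
    recon = F.cross_entropy(logits.transpose(1, 2), targets)
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```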

Generative Adversarial Network (GAN)

  • Generator produces sequences from random noise.
  • Discriminator evaluates real vs. fake sequences.
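
A sketch of one adversarial training step under those roles. `generator`, `discriminator`, the optimizers, and the noise dimension are assumed placeholders, and the generator is assumed to emit a continuous representation (e.g., token logits); this is not the project's actual code.

```python
# One GAN training step (sketch): update the discriminator, then the generator.
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, opt_g, opt_d, real_seqs, noise_dim=128):
    batch = real_seqs.size(0)
    # Discriminator: real sequences -> 1, generated sequences -> 0.
    z = torch.randn(batch, noise_dim)
    fake_seqs = generator(z).detach()
    d_loss = (
        F.binary_cross_entropy_with_logits(discriminator(real_seqs), torch.ones(batch, 1))
        + F.binary_cross_entropy_with_logits(discriminator(fake_seqs), torch.zeros(batch, 1))
    )
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: try to make the discriminator output 1 on generated sequences.
    z = torch.randn(batch, noise_dim)
    g_loss = F.binary_cross_entropy_with_logits(discriminator(generator(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```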

Wasserstein GAN (WGAN)

  • Replaces the discriminator with a critic whose loss estimates the Wasserstein distance.
  • Enforces Lipschitz continuity via weight clipping or a gradient penalty, resulting in more stable training than a standard GAN.
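
A sketch of the critic loss with a gradient penalty (the WGAN-GP variant). For discrete token sequences this would typically operate on embeddings or soft token distributions; the names below are illustrative, not the project's actual code.

```python
# WGAN-GP critic loss (sketch): Wasserstein estimate + gradient penalty.
import torch

def critic_loss(critic, real, fake, gp_weight=10.0):
    # Wasserstein estimate: the critic should score real samples high and fakes low.
    w_loss = critic(fake).mean() - critic(real).mean()
    # Gradient penalty enforces the 1-Lipschitz constraint on interpolated points.
    eps_shape = (real.size(0),) + (1,) * (real.dim() - 1)
    eps = torch.rand(eps_shape, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return w_loss + gp_weight * penalty
```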

Music Processing

  • Token Encoding:

    • Converted MIDI sequences into tokens with pretty_midi.
  • Token Decoding:

    • Greedy Sampling: always select the most probable token.
    • Temperature Sampling: adds more randomness than greedy sampling by adjusting the softmax distribution with a temperature parameter (see the sketch below).
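
A sketch of the two decoding strategies for a single step; `logits` is assumed to be the model's unnormalized scores over the token vocabulary.

```python
# Greedy vs. temperature sampling for one decoding step (sketch).
import torch

def greedy_sample(logits):
    # Always pick the single most probable token.
    return int(torch.argmax(logits))

def temperature_sample(logits, temperature=1.0):
    # Higher temperature flattens the softmax distribution (more randomness);
    # lower temperature sharpens it toward the greedy choice.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```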

Dataset

  • From the MIDI-VAE dataset:
    • 559 Jazz samples
    • 1,069 Pop samples
    • 592 Classical samples
  • Preprocessing:
    • Tokenized with pretty_midi into NOTE_ON, NOTE_OFF, and TIME_SHIFT tokens (see the sketch after this list).
    • Fixed temporal resolution: 0.05 s
    • Max sequence length: 500 tokens
    • Train/Validation split: 80/20
    • Testing: a random song was selected from the dataset for genre transfer.
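
A sketch of this tokenization step using pretty_midi's public API (PrettyMIDI, instrument notes). The exact token vocabulary and helper names are illustrative and may differ from the project's implementation.

```python
# Sketch: MIDI -> NOTE_ON / NOTE_OFF / TIME_SHIFT event tokens at 0.05 s resolution.
import pretty_midi

TIME_STEP = 0.05  # seconds represented by one TIME_SHIFT token

def tokenize(midi_path, max_len=500):
    pm = pretty_midi.PrettyMIDI(midi_path)
    # Collect (time, event) pairs for every note start and end.
    events = []
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in inst.notes:
            events.append((note.start, f"NOTE_ON_{note.pitch}"))
            events.append((note.end, f"NOTE_OFF_{note.pitch}"))
    events.sort(key=lambda e: e[0])

    tokens, current_time = [], 0.0
    for t, name in events:
        # Emit one TIME_SHIFT token per 0.05 s elapsed since the last event.
        steps = int(round((t - current_time) / TIME_STEP))
        tokens.extend(["TIME_SHIFT"] * steps)
        tokens.append(name)
        current_time += steps * TIME_STEP
        if len(tokens) >= max_len:
            break
    return tokens[:max_len]
```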

Training Configurations

Hyperparameter    GAN      WGAN     VAE
Latent Dim        -        -        64
Hidden Dim        256      256      256
Embedding Dim     128      128      128
Batch Size        16       16       16
Learning Rate     0.0001   0.0001   0.001
Epochs            1000     1000     1000
Optimizer         Adam     Adam     Adam
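
For reference, the same hyperparameters collected as plain Python dicts; the structure is illustrative, and the project may organize its configuration differently.

```python
# Training configurations from the table above.
CONFIGS = {
    "gan":  {"hidden_dim": 256, "embedding_dim": 128, "batch_size": 16,
             "learning_rate": 1e-4, "epochs": 1000, "optimizer": "Adam"},
    "wgan": {"hidden_dim": 256, "embedding_dim": 128, "batch_size": 16,
             "learning_rate": 1e-4, "epochs": 1000, "optimizer": "Adam"},
    "vae":  {"latent_dim": 64, "hidden_dim": 256, "embedding_dim": 128,
             "batch_size": 16, "learning_rate": 1e-3, "epochs": 1000,
             "optimizer": "Adam"},
}
```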

Results

Vanilla GAN

  • The discriminator struggled to learn, which we attribute to token imbalance in the training data.
  • Output: the generator produced mostly silence and random notes.

WGAN

  • Improved training stability over the vanilla GAN, but still did not converge.
  • Output: silence and random notes, similar to the vanilla GAN.

MIDI-VAE

  • Consistent convergence with stable training.
  • Output: coherent, genre-conditioned MIDI sequences.

Conclusion

  • The VAE significantly outperformed both the GAN and WGAN in creating realistic, coherent music sequences.

References

  1. Brunner, G., Konrad, A., Wang, Y., & Wattenhofer, R. MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer.
  2. MIDI-VAE GitHub Repository
  3. GeeksforGeeks – Variational AutoEncoders
  4. GeeksforGeeks – GAN
  5. GeeksforGeeks – WGANs
