
Music Genre Transformation with Generative AI

Ali Momennasab, Matthew Kwong, Brianna Ha, Emily Chiu, Bill Kim, Matthew Plascencia

Department of Computer Science, California State Polytechnic University, Pomona


Abstract

This project explores the use of generative AI — specifically Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Wasserstein GANs (WGANs) — to perform genre transfer in symbolic music.

We found that the VAE performed best, producing more stable and realistic music than the GAN and WGAN.


Methods

Variational Autoencoder (VAE)

  • Learns a probabilistic latent space with encoder + decoder.
  • Optimizes reconstruction loss + KL divergence.
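
A minimal sketch of that objective, assuming a PyTorch token-sequence model; the function and tensor names below are illustrative rather than the project's actual code.

```python
# VAE objective sketch: token reconstruction loss + KL divergence to a unit Gaussian prior.
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps so gradients can flow through the encoder.
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(logits, targets, mu, logvar, beta=1.0):
    # logits: (batch, seq_len, vocab_size) decoder outputs
    # targets: (batch, seq_len) integer token ids
    recon = F.cross_entropy(logits.transpose(1, 2), targets)
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```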

Generative Adversarial Network (GAN)

  • Generator produces sequences from random noise.
  • Discriminator evaluates real vs. fake sequences.
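
A sketch of one adversarial training step under those roles. `generator`, `discriminator`, the optimizers, and the noise dimension are assumed placeholders, and the generator is assumed to emit a continuous representation (e.g., token logits); this is not the project's actual code.

```python
# One GAN training step (sketch): update the discriminator, then the generator.
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, opt_g, opt_d, real_seqs, noise_dim=128):
    batch = real_seqs.size(0)
    # Discriminator: real sequences -> 1, generated sequences -> 0.
    z = torch.randn(batch, noise_dim)
    fake_seqs = generator(z).detach()
    d_loss = (
        F.binary_cross_entropy_with_logits(discriminator(real_seqs), torch.ones(batch, 1))
        + F.binary_cross_entropy_with_logits(discriminator(fake_seqs), torch.zeros(batch, 1))
    )
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: try to make the discriminator output 1 on generated sequences.
    z = torch.randn(batch, noise_dim)
    g_loss = F.binary_cross_entropy_with_logits(discriminator(generator(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```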

Wasserstein GAN (WGAN)

  • Replaces the discriminator with a critic whose loss estimates the Wasserstein distance.
  • Enforces Lipschitz continuity via weight clipping or a gradient penalty, resulting in more stable training than a standard GAN.
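
A sketch of the critic loss with a gradient penalty (the WGAN-GP variant). For discrete token sequences this would typically operate on embeddings or soft token distributions; the names below are illustrative, not the project's actual code.

```python
# WGAN-GP critic loss (sketch): Wasserstein estimate + gradient penalty.
import torch

def critic_loss(critic, real, fake, gp_weight=10.0):
    # Wasserstein estimate: the critic should score real samples high and fakes low.
    w_loss = critic(fake).mean() - critic(real).mean()
    # Gradient penalty enforces the 1-Lipschitz constraint on interpolated points.
    eps_shape = (real.size(0),) + (1,) * (real.dim() - 1)
    eps = torch.rand(eps_shape, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return w_loss + gp_weight * penalty
```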

Music Processing

  • Token Encoding:

    • Converted MIDI sequences into tokens with pretty_midi.
  • Token Decoding:

    • Greedy Sampling: always select the most probable token.
    • Temperature Sampling: adds more randomness than greedy sampling by adjusting the softmax distribution with a temperature parameter (see the sketch below).
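
A sketch of the two decoding strategies for a single step; `logits` is assumed to be the model's unnormalized scores over the token vocabulary.

```python
# Greedy vs. temperature sampling for one decoding step (sketch).
import torch

def greedy_sample(logits):
    # Always pick the single most probable token.
    return int(torch.argmax(logits))

def temperature_sample(logits, temperature=1.0):
    # Higher temperature flattens the softmax distribution (more randomness);
    # lower temperature sharpens it toward the greedy choice.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```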

Dataset

  • From the MIDI-VAE dataset:
    • 559 Jazz samples
    • 1,069 Pop samples
    • 592 Classical samples
  • Preprocessing:
    • Tokenized with pretty_midi into NOTE_ON, NOTE_OFF, and TIME_SHIFT tokens (see the sketch after this list).
    • Fixed temporal resolution: 0.05 s
    • Max sequence length: 500 tokens
    • Train/Validation split: 80/20
    • Testing: a random song was selected from the dataset for genre transfer.
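
A sketch of this tokenization step using pretty_midi's public API (PrettyMIDI, instrument notes). The exact token vocabulary and helper names are illustrative and may differ from the project's implementation.

```python
# Sketch: MIDI -> NOTE_ON / NOTE_OFF / TIME_SHIFT event tokens at 0.05 s resolution.
import pretty_midi

TIME_STEP = 0.05  # seconds represented by one TIME_SHIFT token

def tokenize(midi_path, max_len=500):
    pm = pretty_midi.PrettyMIDI(midi_path)
    # Collect (time, event) pairs for every note start and end.
    events = []
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in inst.notes:
            events.append((note.start, f"NOTE_ON_{note.pitch}"))
            events.append((note.end, f"NOTE_OFF_{note.pitch}"))
    events.sort(key=lambda e: e[0])

    tokens, current_time = [], 0.0
    for t, name in events:
        # Emit one TIME_SHIFT token per 0.05 s elapsed since the last event.
        steps = int(round((t - current_time) / TIME_STEP))
        tokens.extend(["TIME_SHIFT"] * steps)
        tokens.append(name)
        current_time += steps * TIME_STEP
        if len(tokens) >= max_len:
            break
    return tokens[:max_len]
```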

Training Configurations

Hyperparameter    GAN      WGAN     VAE
Latent Dim        -        -        64
Hidden Dim        256      256      256
Embedding Dim     128      128      128
Batch Size        16       16       16
Learning Rate     0.0001   0.0001   0.001
Epochs            1000     1000     1000
Optimizer         Adam     Adam     Adam
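
For reference, the same hyperparameters collected as plain Python dicts; the structure is illustrative, and the project may organize its configuration differently.

```python
# Training configurations from the table above.
CONFIGS = {
    "gan":  {"hidden_dim": 256, "embedding_dim": 128, "batch_size": 16,
             "learning_rate": 1e-4, "epochs": 1000, "optimizer": "Adam"},
    "wgan": {"hidden_dim": 256, "embedding_dim": 128, "batch_size": 16,
             "learning_rate": 1e-4, "epochs": 1000, "optimizer": "Adam"},
    "vae":  {"latent_dim": 64, "hidden_dim": 256, "embedding_dim": 128,
             "batch_size": 16, "learning_rate": 1e-3, "epochs": 1000,
             "optimizer": "Adam"},
}
```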

Results

Vanilla GAN

  • The discriminator struggled to learn, which we attribute to token imbalance in the training data.
  • Output: the generator produced mostly silence and random notes.

WGAN

  • Improved training stability over the vanilla GAN, but still did not converge.
  • Output: silence and random notes, similar to the vanilla GAN.

MIDI-VAE

  • Consistent convergence with stable training.
  • Output: coherent, genre-conditioned MIDI sequences.

Conclusion

  • The VAE significantly outperformed both the GAN and WGAN in creating realistic, coherent music sequences.

References

  1. Brunner, G., Konrad, A., Wang, Y., & Wattenhofer, R. MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer.
  2. MIDI-VAE GitHub Repository
  3. GeeksforGeeks – Variational AutoEncoders
  4. GeeksforGeeks – GAN
  5. GeeksforGeeks – WGANs
