singer-detector

One-shot learning to identify the voice of a singer using transfer learning based on the VGGish architecture

Problem Statement

As a singer, I tend to record a lot of songs (bhajans specifically, songs like these) on my phone, sung by other singers and myself alike. The result is a jumble of recordings with generic names like "My Recording 67.wav". My loved ones would often ask me to send songs that I had sung, and I found it very difficult to find anything in this mess. I took the opportunity to solve this problem with machine learning.

Methodology

TL;DR

I used voice recordings that I had gathered on my phone over the last year (a total of ~350 bhajans, of which ~80 were sung by me, across ~20 different singers). I developed a simple HTML/JS tool to help annotate the songs, shared subsets of the data with friends and family members, and within a few weeks I had a usable dataset. I then converted the dataset into 4-second spectrograms that could be fed into a deep neural net based on the VGGish model (a sketch of this preprocessing follows).
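As a rough illustration, the chunking step might look like the sketch below. This is a minimal example using librosa; the sample rate, mel-band count, and file handling are assumptions loosely following VGGish's log-mel input format, not the repository's actual preprocessing code.

```python
import numpy as np
import librosa

def audio_to_log_mel_chunks(path, sr=16000, chunk_seconds=4, n_mels=64):
    """Slice a recording into fixed-length log-mel spectrograms.

    Hypothetical parameters: VGGish works on 64 mel bands over 16 kHz
    audio; the 4-second chunk length follows the write-up above.
    """
    y, _ = librosa.load(path, sr=sr, mono=True)
    samples_per_chunk = sr * chunk_seconds
    chunks = []
    for start in range(0, len(y) - samples_per_chunk + 1, samples_per_chunk):
        segment = y[start:start + samples_per_chunk]
        mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=n_mels)
        chunks.append(librosa.power_to_db(mel, ref=np.max))
    if not chunks:
        raise ValueError("recording is shorter than one chunk")
    return np.stack(chunks)  # shape: (num_chunks, n_mels, frames)
```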

I used two different methods to identify my voice in a given snippet of audio:

  • Generalizable model
    Used a Siamese network to train a model that generates a "fingerprint" of a given singer. A new audio sample is compared to the fingerprint using a distance metric and is classified as my voice if the distance is within a defined threshold (see the sketch after this list)
  • Non-generalizable models
    • Binary classifier - Trained a model that predicts whether a given spectrogram is my voice or not
    • Multi-class classifier - Trained a model that predicts which of the singers present in the dataset a given spectrogram belongs to
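To make the Siamese comparison step concrete, here is a minimal PyTorch sketch. The tiny embedding network stands in for the shared VGGish-based branch, and the threshold value and tensor shapes are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Toy stand-in for the VGGish-based branch shared by both Siamese inputs."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool each feature map to a single value
        )
        self.fc = nn.Linear(16, embedding_dim)

    def forward(self, spectrogram):
        x = self.conv(spectrogram).flatten(1)
        return self.fc(x)

def is_my_voice(model, sample, fingerprint, threshold=0.5):
    """Classify a spectrogram by its embedding's distance to a stored fingerprint."""
    with torch.no_grad():
        distance = torch.dist(model(sample), fingerprint)  # Euclidean distance
    return distance.item() < threshold
```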

Results

  • The non-generalizable models performed much better at identifying my voice, with >99% accuracy and recall for both the binary and the multi-class classifier
  • The Siamese network performed very well at distinguishing between two artists (>90% accuracy on validation data). This, however, did not directly translate into stellar performance on the one-shot learning task. Using an average of the "fingerprints" generated for my spectrograms as my voice's fingerprint, I was able to identify my songs with ~70% accuracy (see the sketch below)
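The fingerprint averaging mentioned above amounts to mean-pooling embeddings across a singer's spectrograms and then thresholding the distance of new samples. A minimal numpy sketch, where the embedding dimension, threshold, and stand-in data are all hypothetical:

```python
import numpy as np

def build_fingerprint(embeddings):
    """Average per-spectrogram embeddings into one vector for the singer."""
    return embeddings.mean(axis=0)

def one_shot_match(embedding, fingerprint, threshold=0.5):
    """Flag a new sample as 'my voice' if it lies within the distance threshold."""
    return np.linalg.norm(embedding - fingerprint) < threshold

# Usage: embed ~80 of my spectrograms, average them, then compare new clips.
my_fingerprint = build_fingerprint(np.random.rand(80, 128))  # stand-in embeddings
print(one_shot_match(np.random.rand(128), my_fingerprint))
```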

For details, check out the blog.

Future Work

  • Add more variety to the dataset and retrain (more female singers, more songs without percussion or supporting instruments)
  • Evaluate different methods of generating a fingerprint for a singer
  • Deploy model as a consumable API
