- Run `python BotScript.py` for the Tweetbot (this script runs only with `GPT2-rap-recommended`).
- Run `python src/test_generation.py` with the proper parameters to test language generation.
- To quit the program, use `CTRL+C`.
- Install requirements: `pip install -r requirements.txt`. Alternatively, run `./install.sh`.
- Apply for a Twitter Developer Account with elevated access.
- Create an `.env` file including the variables `CONSUMER_API_KEY`, `CONSUMER_API_KEY_SECRET`, `ACCESS_TOKEN`, `ACCESS_TOKEN_SECRET` and provide the necessary credentials for each variable.
- Download fasttext's language identification model and place it in the same folder as this file.
- Create a folder called `.model` in the same folder as this file and place the proper fine-tuned GPT-2 model (see the Models section) inside it (`.model/GPT2-rap-recommended/config.json`, `pytorch...`). The model is available here. A loading sketch for these setup steps follows this list.
- Hardware that can deal with GPT-2.
- We gathered raps from genius.com, ohhla.com and battlerap.com. For genius.com, we used the official API (GeniusLyrics and GetRankings repos), while battlerap.com and ohhla.com were scraped using a specifically tailored scrapy scraper. In total we gathered ~70k raps, which we used for finetuning. GPT-2 was finetuned on one large concatenated text, while T5 was finetuned on prompts. The prompts had the form `KEYWORDS: <keywords> RAP-LYRICS: <rap text>`, which proved to be insufficient for our task. Eventually we chose to use the fine-tuned GPT-2 model. Experimental and succeeding scripts can be found in `./preprocessing/finetuning`. Additionally, a RoBERTa model was finetuned on data from the English Wikipedia, tweets regarding hate speech, the CNN/DailyMail dataset and 4k rap lyrics (the data can be found under Data) to classify the quality of the generated raps.
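A hedged sketch of the GPT-2 fine-tuning approach described above (one large concatenated text), using the Hugging Face `transformers` Trainer; the corpus path `all_raps.txt` and the hyperparameters are illustrative assumptions, not the exact settings used in the notebooks under `preprocessing/finetuning`.

```python
# Sketch: fine-tune GPT-2 on the ~70k raps concatenated into one large text file.
# The corpus path and hyperparameters are assumptions, not the project's exact settings.
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One large text file containing all raps (hypothetical path).
train_dataset = TextDataset(tokenizer=tokenizer, file_path="all_raps.txt", block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

args = TrainingArguments(
    output_dir=".model/GPT2-rap-recommended",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_total_limit=1,
)
Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset).train()
```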
- `finetuning`
  - `FineTuneRapMachineExp.ipynb` - Experimental script
  - `FineTuneRapMachineGPT2.ipynb` - GPT-2 finetuning script
  - `T5.ipynb` - Finetuning script for T5 on a key2text approach
  - `keytotext.ipynb` - Using the keytotext library for finetuning
  - `FineTuneRapMachineExp2.ipynb` - Another experimental script, in which GPT-J and GPT-NEO were used, yet didn't succeed
- `data_analysis`
  - `CreateAdvData.ipynb` - Script to create a balanced dataset to train the ranker model
  - `LyricsAnalyye.ipynb` - Script to analyze the scraped data
- `lyrics_spider` - Includes the scrapy program used to obtain lyrics
- `cleaning_and_keywords`
  - `data_cleaner` - Script for removing noise from the 70k scraped rap corpus
  - `kw_extraction` - Script that builds a TF-IDF model, either from scratch or from an existing model, to generate keywords for the rap corpus (see the TF-IDF sketch after this listing)
  - `tf_idf` - TF-IDF model script
- `ranker`
  - `roberta_ranker.ipynb` - RoBERTa finetuning script (see the ranker sketch after this listing)
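A minimal sketch of what the `kw_extraction` step might look like using scikit-learn's `TfidfVectorizer` (the actual scripts build and reuse their own TF-IDF model; the function name, parameters, and sample lyrics here are illustrative assumptions).

```python
# Sketch: extract the top TF-IDF keywords per rap so they can be paired with lyrics
# (e.g. for the KEYWORDS: ... RAP-LYRICS: ... prompts). scikit-learn is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(corpus, n_keywords=5):
    vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
    tfidf = vectorizer.fit_transform(corpus)          # shape: (n_docs, n_terms)
    terms = vectorizer.get_feature_names_out()
    keywords = []
    for row in tfidf:                                 # one sparse row per rap
        dense = row.toarray().ravel()
        top = dense.argsort()[::-1][:n_keywords]
        keywords.append([terms[i] for i in top if dense[i] > 0])
    return keywords

raps = ["Summer nights, city lights, we ride till the morning ...",
        "Cold streets, old beats, still the rhymes stay sharp ..."]
print(top_keywords(raps))
```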
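And a hedged sketch of how the fine-tuned RoBERTa ranker could be used to score generated raps, assuming it is a standard `transformers` sequence-classification checkpoint; the local path and the "good rap" label index are assumptions.

```python
# Sketch: score candidate raps with the fine-tuned RoBERTa ranker and keep the best one.
# Path and label index are assumptions; adapt to how the ranker checkpoint was saved.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(".model/roberta-ranker")
ranker = AutoModelForSequenceClassification.from_pretrained(".model/roberta-ranker")

def rap_score(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = ranker(**inputs).logits
    # assumption: index 1 is the "good rap" class
    return torch.softmax(logits, dim=-1)[0, 1].item()

candidates = ["generated rap one ...", "generated rap two ..."]
best = max(candidates, key=rap_score)
print(best)
```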
- ohhla.com - Scraped
- BattleRap.com - Scraped
- Genius.com - Accessed through the API; the GeniusLyrics and GetRankings repos were used.
- To obtain lyrics from genius.com, two programs were implemented, based on different, yet outdated, repositories.
- Both programs are part of this project.
- GPT2-rap-recommended Download (Necessary to use BotScript.py)
- GPT2-small-key2text Download (Approach did not work out, trained on 4k corpus)
- Roberta Ranker Download (ranker trained on 8k examples: the 4k rap corpus and a 4k non-rap corpus)
- T5-large-key2text Download (Approach did not work out, trained on 70k corpus)
- T5-small-key2text Download (Approach did not work out, trained on 4k corpus)
- tf-idf pickle Download (Approach did not work out, trained on 70k corpus)
- Our data can be downloaded here