Create a text classification model that predicts the native language of an author who writes in English. Using Google's BERT, vector representations of the authors' texts are obtained and fed to neural networks and other prediction models.
This task was created for the CS585 - Natural Language Processing Fall 2019 Course at Illinois Institute of Technology, Chicago.
- Clone the BERT repository and add it as a git submodule, referred to as `BERT_BASE_DIR`.
- Download the BERT-Base Uncased model; its files are referred to as `BERT_DATA_DIR`. Repo Link.
- Download the dataset used for training, validation, and testing from the University Repo; this is the `data` directory.
- Run `Format Data For Input.sh`, which programmatically reformats the data files into `bert_input_data`, then run `run_bert_fv.sh`, which writes the feature-vector representation of each example into the `bert_output_data` directory.
- Apply the prediction models in the `Prediction Models.ipynb` file.
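Once the pipeline has run, the `.jsonlines` files can be read back into arrays for the prediction step. A minimal sketch, assuming `run_bert_fv.sh` wraps BERT's `extract_features.py` (whose output is one JSON object per line, with per-token `layers` of `values`); the function name here is illustrative, not from the repo:

```python
import json

import numpy as np


def load_cls_vectors(path):
    """Read a BERT extract_features.py jsonlines file and return one
    vector per example: the top requested layer's values for the first
    token, which is [CLS]."""
    vectors = []
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            # features[0] is the [CLS] token; layers[0] is the first
            # layer requested on the command line (typically -1).
            cls_token = example["features"][0]
            vectors.append(cls_token["layers"][0]["values"])
    return np.array(vectors, dtype=np.float32)


# Usage, with the paths from the repo layout below:
# X_train = load_cls_vectors("bert_output_data/train.jsonlines")
```

Using the `[CLS]` vector is one common choice for a whole-text representation; averaging the token vectors is an equally reasonable alternative.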
BERT_BASE_DIR (files from the Google BERT submodule)
BERT_DATA_DIR (files from the BERT-Base Uncased model)
data (the dataset from the University Repository)
|--lang_id_train.csv
|--lang_id_eval.csv
|--lang_id_test.csv
bert_input_data (Formatted files for vector representation)
|--train.txt
|--eval.txt
|--test.txt
bert_output_data (Obtained feature vector representation)
|--train.jsonlines
|--eval.jsonlines
|--test.jsonlines
Format Data For Input.sh
run_bert_fv.sh
Prediction Models.ipynb
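The prediction step in `Prediction Models.ipynb` can be sketched as a classifier fit on the BERT feature vectors. A hedged example with scikit-learn's logistic regression (the notebook may use different models; the random arrays below stand in for vectors from `bert_output_data/*.jsonlines` and labels from `data/lang_id_*.csv`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in data: in the repo, X_* would be BERT feature vectors
# (768-dimensional for BERT-Base) and y_* the native-language labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 768)), rng.integers(0, 3, 100)
X_eval, y_eval = rng.normal(size=(20, 768)), rng.integers(0, 3, 20)

# Fit a multinomial logistic regression on the feature vectors.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate on the held-out split.
print("eval accuracy:", accuracy_score(y_eval, clf.predict(X_eval)))
```

The same `fit`/`predict` pattern applies to the other prediction models (e.g. an MLP or SVM) by swapping the estimator.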