Create a text classification model that predicts the native language of an author who writes in English. Using Google's BERT, vector representations of the authors' texts are obtained and fed to neural networks and other prediction models.
This task was created for the CS585 - Natural Language Processing Fall 2019 Course at Illinois Institute of Technology, Chicago.
- Clone the BERT repository and add it as a git submodule, referred to as `BERT_BASE_DIR`.
- Download the BERT-Base Uncased model; its files are referred to as `BERT_DATA_DIR`. Repo Link.
- Download the dataset used for training, validation, and testing from the University Repo; this is the `data` directory.
- Run `Format Data For Input.sh`, which programmatically reformats the data files into `bert_input_data`, then run `run_bert_fv.sh`, which writes the feature-vector representation of each example into the `bert_output_data` directory.
- Apply the prediction models in the `Prediction Models.ipynb` file.
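Once the pipeline has run, the `.jsonlines` files can be read back into arrays for the prediction step. A minimal sketch, assuming `run_bert_fv.sh` wraps BERT's `extract_features.py` (whose output is one JSON object per line, with per-token `layers` of `values`); the function name here is illustrative, not from the repo:

```python
import json

import numpy as np


def load_cls_vectors(path):
    """Read a BERT extract_features.py jsonlines file and return one
    vector per example: the top requested layer's values for the first
    token, which is [CLS]."""
    vectors = []
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            # features[0] is the [CLS] token; layers[0] is the first
            # layer requested on the command line (typically -1).
            cls_token = example["features"][0]
            vectors.append(cls_token["layers"][0]["values"])
    return np.array(vectors, dtype=np.float32)


# Usage, with the paths from the repo layout below:
# X_train = load_cls_vectors("bert_output_data/train.jsonlines")
```

Using the `[CLS]` vector is one common choice for a whole-text representation; averaging the token vectors is an equally reasonable alternative.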
BERT_BASE_DIR (files from the Google BERT submodule)
BERT_DATA_DIR (files from the BERT-Base Uncased model)
data (the dataset from the University Repository)
|--lang_id_train.csv
|--lang_id_eval.csv
|--lang_id_test.csv
bert_input_data (Formatted files for vector representation)
|--train.txt
|--eval.txt
|--test.txt
bert_output_data (Obtained feature vector representation)
|--train.jsonlines
|--eval.jsonlines
|--test.jsonlines
Format Data For Input.sh
run_bert_fv.sh
Prediction Models.ipynb
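The prediction step in `Prediction Models.ipynb` can be sketched as a classifier fit on the BERT feature vectors. A hedged example with scikit-learn's logistic regression (the notebook may use different models; the random arrays below stand in for vectors from `bert_output_data/*.jsonlines` and labels from `data/lang_id_*.csv`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in data: in the repo, X_* would be BERT feature vectors
# (768-dimensional for BERT-Base) and y_* the native-language labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 768)), rng.integers(0, 3, 100)
X_eval, y_eval = rng.normal(size=(20, 768)), rng.integers(0, 3, 20)

# Fit a multinomial logistic regression on the feature vectors.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate on the held-out split.
print("eval accuracy:", accuracy_score(y_eval, clf.predict(X_eval)))
```

The same `fit`/`predict` pattern applies to the other prediction models (e.g. an MLP or SVM) by swapping the estimator.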