We are data mining a corpus of ancient texts to train machine learning classifiers that distinguish between different genres.
Replication code for Gianitsos et al., "Stylometric Classification of Ancient Greek Literary Texts by Genre," LaTeCH-CLfL 2019
Link to paper: https://www.aclweb.org/anthology/W19-2507/
Open the Terminal app
-
Check that you have Python 3.6 installed:
which python3.6
If it is installed, this command should output a path. For example:
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6
If nothing is output, download Python 3.6 here: https://www.python.org/downloads/release/python-368/
-
Ensure that you have the Xcode command-line tools installed on your Mac by running the following. If the tools are already installed, it will not do anything harmful. This step ensures you have git and svn installed, which are necessary to run the code in this project.
xcode-select --install
-
Install pipenv (footnote 1). If already installed, this command will not do anything harmful.
pip install pipenv
-
Clone this repository. Click the green 'clone' button on the right side of the GitHub webpage for this repo to copy the link, then run:
git clone <link you just copied>
-
Navigate inside the project folder:
cd <the project folder you just cloned>
-
Now that you are in the project directory, run the following command. This will generate a virtual environment called .venv in the current directory (footnote 2) that will contain the Python dependencies for this project.
PIPENV_VENV_IN_PROJECT=true pipenv install
-
Activate the virtual environment with the following command. After activation, running Python commands will ignore the system-level Python version and packages, and only use the packages from the virtual environment.
pipenv shell
Running exit leaves the virtual environment, i.e. it restores the system-level Python configuration to your shell. You can also simply close the terminal. Whenever you want to resume working on the project, run pipenv shell from the project directory to activate the virtual environment again.
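If you want to confirm that the setup steps above worked, a quick check along the following lines may help. This is only a sketch and not part of the repository: the function names are made up for illustration, and it assumes you run it from the project directory after pipenv shell.

```python
import sys
from pathlib import Path


def in_virtualenv():
    """Return True when this interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    legacy = hasattr(sys, "real_prefix")
    base = getattr(sys, "base_prefix", sys.prefix)
    return legacy or sys.prefix != base


def is_python_36(version_info=None):
    """Return True when the given (or current) interpreter is Python 3.6.x."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == (3, 6)


def has_project_venv(project_dir="."):
    """Return True when a .venv directory exists inside project_dir."""
    return (Path(project_dir) / ".venv").is_dir()


if __name__ == "__main__":
    print("Python executable:", sys.executable)
    print("Running Python 3.6:", is_python_36())
    print("Inside a virtual environment:", in_virtualenv())
    print(".venv present in current directory:", has_project_venv())
```

Inside an activated environment created with the steps above, all three checks would be expected to report True. If the first or second reports False, the shell is probably still using the system-level Python.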
Here are examples of commands you can run:
Run the demo (this extracts features from a small sample of files and analyzes the results in one step):
python demo.py
Extract features from all files:
python run_feature_extraction.py all_data.pickle
Extract features from only drama and epic files:
python run_feature_extraction.py drama_epic_data.pickle drama epic
Run all model analyzer functions on the data from all files to classify prose from verse:
python run_ml_analyzers.py all_data.pickle labels/prosody_labels.csv all
Run all model analyzer functions on the data from only drama and epic files to classify drama from epic:
python run_ml_analyzers.py drama_epic_data.pickle labels/genre_labels.csv all
1) The pipenv tool works by making a project-specific directory, called a virtual environment, that holds the dependencies for that project. After a virtual environment is activated, newly installed dependencies automatically go into the virtual environment instead of being placed among your system-level Python packages. This prevents different projects on the same machine from having dependencies that conflict with one another.
2) Setting the PIPENV_VENV_IN_PROJECT variable to true tells pipenv to create the virtual environment within the project directory, so that all the files corresponding to a project are in the same place. This is not the default behavior (e.g. on Mac, environments are normally placed in ~/.local/share/virtualenvs/).
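The example commands above pair a feature extraction run with an analysis run on the same pickle file. If you want to chain the two, something like the following sketch could work. It is not part of the repository: build_pipeline_commands and run_pipeline are hypothetical helpers, the script names and arguments are copied from the examples above, and it assumes you run it from the project directory inside the activated virtual environment.

```python
import subprocess
import sys


def build_pipeline_commands(pickle_name, labels_csv, genres=()):
    """Build the two example commands as argument lists (hypothetical helper)."""
    # sys.executable keeps both steps on the interpreter that runs this script,
    # which is the virtual environment's Python after `pipenv shell`.
    extract = [sys.executable, "run_feature_extraction.py", pickle_name, *genres]
    # The trailing "all" mirrors the examples above, which run all analyzers.
    analyze = [sys.executable, "run_ml_analyzers.py", pickle_name, labels_csv, "all"]
    return [extract, analyze]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline if a step exits with an error.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    commands = build_pipeline_commands(
        "drama_epic_data.pickle",
        "labels/genre_labels.csv",
        genres=("drama", "epic"),
    )
    # Dry run by default: this only prints the commands that would be executed.
    run_pipeline(commands, dry_run=True)
```

Passing dry_run=False would actually run both steps in order. Building the commands as lists rather than one shell string avoids quoting problems if a file name contains spaces.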