TLDR: The code to process Wikipedia articles from the Mosaico subset through the UCREL English-only C version of the CLAWS and USAS NLP pipeline, which produces the following syntactic and semantic annotations:
- Sentence boundaries
- Tokens
- Lemmas
- POS tags
- USAS tags
The pre-processed data that this code base has created can be found in the following HuggingFace Dataset repository: ucrelnlp/English-USAS-Mosaico.
Note: due to the dependency on the C versions of CLAWS and USAS, the first time you run the make test command, or the first time you want to tag something with CLAWS or USAS, you will need a VPN connection to a Lancaster University server or to be on the Lancaster University campus, as it requires access to the two GitLab repositories on delta. If you are interested in tagging using this pipeline please get in touch with us (contact information at the bottom of this README):
Mosaico is a dataset of processed Wikipedia pages that have been filtered to contain only pages tagged as good or featured. The Mosaico dataset contains various NLP annotations, but in this project we are currently only interested in the Wikipedia article pages so that we can add our own NLP annotations, specifically:
- Sentence boundaries
- Tokens
- Lemmas
- POS tags
- USAS tags
To start with we will only tag the English Wikipedia pages that are tagged as good or featured. The tagging will be performed by:
- CLAWS - generates sentence boundaries, tokens, and POS tags.
- USAS - generates lemmas and USAS tags based on the CLAWS tokens and POS tags.
We use pixi and uv for this project: uv is in essence a Python virtual environment manager, and pixi is an extension of that which we use to install Perl. Once you have installed uv and pixi, run the following to set up the project:
(pixi will likely install perl at $HOME/.pixi/bin/perl)
uv venv
uv lock
pixi global install --channel conda-forge perl

Note: we use a Perl script to replace some UTF-8 characters with ASCII equivalent characters; this is required because CLAWS requires text to be encoded in ASCII rather than UTF-8.
To activate a Python REPL with all of the project requirements:

uv run python

Data exporting requires Docker.
To run the tests:
source env.sh
make test

The make command ensures that all of the required Docker images are created and that the binaries and files pointed to by the environment variables set within ./env.sh are downloaded.
Git repositories downloaded are:
Docker images created/required:
- claws:4.0 - created by building within the ./claws repository.
- usas:7.0 - created by building within the ./usas repository.
Environment variables set within env.sh:
All of these environment variables are explained in the section CLAWS and USAS binaries setup:
- CLAWS_RUN_SCRIPT
- rdirectory
- USAS_RUN_SCRIPT
- USAS_EXE
- USAS_RESOURCES
To run the linting (ruff), formatting (ruff), and type checker (pyrefly) checks:
make check

This section describes how to export the English Wikipedia data from Mosaico into JSONL (JSON Lines) format so that the data can be further processed by other tools (USAS and CLAWS):
As mentioned, the Wikipedia data appears to have come from this Cirrus Wikipedia dump, and a schema for this dump can be found here. We did investigate the structured and unstructured Wikipedia dumps on HuggingFace, but they do not appear to contain the tags for Good and Featured articles that Mosaico state are good signals of quality; training on this subset of data has been shown to be more efficient than training on the whole Wikipedia dataset while achieving comparable results for WSD on all but the rare word sense test set (42D).
The data export we are going to generate will be in JSONL format whereby each JSON line contains one unique Wikipedia page entry and will have the following JSON structure:
uv run python -c "from mosaico_usas_processing.data_export import WikiDataExport; import json; print(json.dumps(WikiDataExport.model_json_schema(),indent=2))"{
"properties": {
"document_id": {
"description": "ID that uniquely represents that document within the Mosaico Mongo DB",
"examples": [
"3021080"
],
"title": "Document ID",
"type": "string"
},
"wikidata_id": {
"description": "Wikidata ID, every Wikipedia page should one as it is an unique ID that allows you to access it's global unique URL https://www.wikidata.org/entity/ID",
"examples": [
"Q921355"
],
"title": "Wikidata ID",
"type": "string"
},
"title": {
"description": "Wikipedia page title",
"examples": [
"Erik Adolf von Willebrand"
],
"title": "Title",
"type": "string"
},
"text": {
"description": "The UTF-8 encoded Wikipedia article text",
"title": "Text",
"type": "string"
},
"ascii_text": {
"description": "ASCII encoded version of the Wikipedia article text",
"title": "ASCII Text",
"type": "string"
},
"language": {
"const": "en",
"description": "Language of the Wikipedia article",
"examples": [
"en"
],
"title": "Language",
"type": "string"
},
"quality": {
"description": "Quality of the Wikipedia page as determined by the Wikipedia community",
"enum": [
"good",
"featured"
],
"examples": [
"good",
"featured"
],
"title": "Quality",
"type": "string"
},
"ores_articletopics": {
"additionalProperties": {
"type": "number"
},
"description": "High level article topics that are easily searchable and have been predicted by a machine learning model. This will be represented as a dictionary of topic and score whereby the score is between 0-1 where 1 indicates the model is more confident of it's prediction.",
"examples": [
{
"Geography.Regions.Europe.Northern Europe": 0.584
}
],
"title": "ORES article topics",
"type": "object"
},
"categories": {
"description": "A noisy list of article topics that are found on the Wikipedia page at the end. To note that the hierarchy of this category system can be found through the SQL database dumps according to this source. The reason these are noisy is that they sometimes contain meta data topics like `CS1 Swedish-language sources (sv)` or `Good articles`.",
"examples": [
[
"CS1 Swedish-language sources (sv)",
"AC with 0 elements",
"1870 births",
"1949 deaths",
"Academics of the University of Helsinki",
"Finnish hematologists",
"Finnish people of German descent",
"People from Vaasa"
]
],
"items": {
"type": "string"
},
"title": "Categories",
"type": "array"
},
"popularity_score": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"description": "As defined in the cirrus schema, 'A floating point number representing the percentage of page views to this wiki that requests this page. This is only available for content pages.' If the popularity score cannot be validated or found it will have a value of `None`.",
"examples": [
8.327128616319467e-08,
null
],
"title": "Popularity Score"
},
"timestamp": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Timestamp of the most recently index/edited version of the page. If the timestamp cannot be found it will have a value of `None`.",
"examples": [
"2021-12-26T18:49:13Z",
null
],
"title": "timestamp"
}
},
"required": [
"document_id",
"wikidata_id",
"title",
"text",
"ascii_text",
"language",
"quality",
"ores_articletopics",
"categories",
"popularity_score",
"timestamp"
],
"title": "WikiDataExport",
"type": "object"
}

More information on the ores_articletopics:
High-level article topics that are easily searchable and have been predicted by a machine learning model. They are represented as a dictionary of topic and score, e.g. {'Geography.Regions.Europe.Northern Europe': 0.584, 'Culture.Biography.Biography*': 0.983, 'Geography.Regions.Europe.Europe*': 0.857, 'STEM.STEM*': 0.983}, which shows that the model predicts STEM.STEM* with 98.3% confidence.
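For example, a minimal sketch of pulling out the most confident topics from this dictionary (the 0.9 threshold is an arbitrary choice for illustration):

```python
ores_articletopics = {
    "Geography.Regions.Europe.Northern Europe": 0.584,
    "Culture.Biography.Biography*": 0.983,
    "Geography.Regions.Europe.Europe*": 0.857,
    "STEM.STEM*": 0.983,
}

# Keep only the topics the model is confident about, highest score first.
confident_topics = sorted(
    (topic for topic, score in ores_articletopics.items() if score >= 0.9),
    key=lambda topic: ores_articletopics[topic],
    reverse=True,
)
print(confident_topics)  # ['Culture.Biography.Biography*', 'STEM.STEM*']
```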
The ascii_text is generated by passing the text (the original UTF-8 encoded text) through the ./non_python_scripts/UTF82CLAWS.pl Perl script, which replaces some UTF-8 characters with ASCII equivalent characters; we then re-encode the string to ASCII ignoring all errors, thereby removing any remaining characters that cannot be mapped to ASCII.
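A minimal sketch of this two-step conversion (assuming Perl is on your PATH and that UTF82CLAWS.pl reads UTF-8 text on stdin and writes to stdout; the project's own export code may call it differently):

```python
import subprocess


def to_claws_ascii(text: str, perl_script: str = "./non_python_scripts/UTF82CLAWS.pl") -> str:
    """Approximate the ascii_text generation: Perl substitutions, then an ASCII re-encode."""
    # Assumption: the Perl script reads UTF-8 text on stdin and writes the
    # substituted text to stdout.
    result = subprocess.run(
        ["perl", perl_script],
        input=text.encode("utf-8"),
        capture_output=True,
        check=True,
    )
    substituted = result.stdout.decode("utf-8")
    # Drop any remaining characters that cannot be mapped to ASCII.
    return substituted.encode("ascii", errors="ignore").decode("ascii")


print(to_claws_ascii("Erik Adolf von Willebrand – café naïve"))
```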
NOTE: We use the ./data folder to download and save data to.
NOTE: We also use the ./log folder to store error logs.
To export the data we assume you have started the Mongo database and imported the data into Mongo (see the section Set up MongoDB from the Mosaico GitHub repository; note we do not need the interlanguage-links data collection).
Note: when setting up MongoDB, as the script only exports the English Wikipedia articles, we filtered the MongoDB pages data so that it only includes the English Wikipedia articles, like so (this should make the process more memory efficient and quicker):
cat ./data/pages.collection.json | grep "\"language\":\"en\"" > ./data/en_pages.collection.json

Useful code from Mosaico's GitHub repository on importing the data into the Mongo DB and running the Mongo DB:
docker run \
-e MONGO_INITDB_ROOT_USERNAME=admin \
-e MONGO_INITDB_ROOT_PASSWORD=password \
-p 27017:27017 \
--name local-mosaico-db \
--detach \
mongo:6.0.11
# import pages
docker exec -i local-mosaico-db \
mongoimport \
--authenticationDatabase admin -u admin -p password \
--db mosaico --collection pages < ./data/en_pages.collection.json
# import annotations
docker exec -i local-mosaico-db \
mongoimport \
--authenticationDatabase admin -u admin -p password \
--db mosaico --collection annotations < ./data/annotations.collection.json

The following code is computationally very efficient, using only 1 CPU; it takes about 1 minute to process 835 documents.
uv run src/mosaico_usas_processing/data_export.py ./data/wikipedia_export.jsonl --logging-level error 2>log/data_export_error.txt

After the script has run it will output some statistics:
Outputting/exporting data to: wikipedia_export.jsonl
careful, beanie issue still open
Data statistics:
Number of documents saved to wikipedia_export.jsonl: 495
Number of documents we did not save: 340
Total number of processed documents: 835
Total number of ascii tokens in the saved dataset: 1,995,931
Total time taken to process (seconds): 56.22

For the full set of English-only Wikipedia pages:
Outputting/exporting data to: wikipedia_export.jsonl
careful, beanie issue still open
Data statistics:
Number of documents saved to wikipedia_export.jsonl: 10,856
Number of documents we did not save: 6,336
Total number of processed documents: 17,192
Total number of ascii tokens in the saved dataset: 74,595,642
Total time taken to process (seconds): 2,069.30

Note: most of the pages that were not exported were skipped because we could not extract the quality metadata for them.
After the export you can remove the Mongo Database container:
docker stop local-mosaico-db
docker rm local-mosaico-db

Help information from the script:
uv run src/mosaico_usas_processing/data_export.py --help
usage: data_export.py [-h] [--perl-script-path PERL_SCRIPT_PATH] [-l {debug,info,error}] output_file_path
Data exported from the Mongo Database of Wikipedia pages to the output file in JSONL format, whereby each JSON line contains information on each unique Wikipedia page within the Mongo
Database.
positional arguments:
output_file_path The JSONL file to export the Mongo Database data too.
options:
-h, --help show this help message and exit
--perl-script-path PERL_SCRIPT_PATH
File path to the UTF82CLAWS.pl Perl script. Default: /home/andrew/Downloads/mosaico-usas-
processing/src/mosaico_usas_processing/data_export.py/../../../non_python_scripts/UTF82CLAWS.pl
-l {debug,info,error}, --logging-level {debug,info,error}
The logging level, debug most verbose, info, or error which is the least verbose. Default is info.

The data tagging can be done using either:
- docker - easier for development and more cross-platform friendly.
- pre-compiled binaries - allows us to run it on HEX (HEX is a Slurm cluster).
The reason for the two options is that CLAWS and USAS can be run via either one. It is easier to test with the Docker containers, but the binaries allow us to run the tagging on HEX.
No matter which approach is chosen, docker or binaries, they both produce a tagged dataset with the same format. For the format see the section Tagged dataset format.
After the data export of the Wikipedia data we can now tag the data using CLAWS and USAS, whereby CLAWS and USAS will be accessed via a docker container.
uv run src/mosaico_usas_processing/claws_usas_tagging.py ./data/wikipedia_export.jsonl \
ascii_text \
./data/docker_tagged_wikipedia_export.jsonl \
2>./log/docker_tagging_error.txt

Which will output the following:
Number of documents saved to ./data/tagged_small_wikipedia_export.jsonl: 494
Number of documents we did not save: 1
Total number of processed documents: 495
Total number tokens in the saved dataset: 2,434,898
Total time taken to process (seconds): 1,812.58
Peak amount of memory used by: 544.076 MB

The document we did not save was dropped because of a mismatch between the number of CLAWS tokens and the number of USAS tokens; this should not occur, which is why we do not save such documents.
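For illustration, a minimal sketch of the kind of consistency check this implies (not the project's actual implementation; the two token lists here are hypothetical):

```python
# Hypothetical outputs from the two taggers for the same document.
claws_tokens = ["Erik", "Adolf", "von", "Willebrand", "was", "a", "physician", "."]
usas_tokens = ["Erik", "Adolf", "von", "Willebrand", "was", "a", "physician"]  # one token short

# A document is only saved if both taggers produced the same number of tokens,
# otherwise the per-token annotations cannot be aligned.
if len(claws_tokens) != len(usas_tokens):
    print(f"Skipping document: {len(claws_tokens)} CLAWS tokens vs {len(usas_tokens)} USAS tokens")
else:
    print("Token counts match; document can be saved")
```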
Please see the CLAWS and USAS binaries setup section below before running the following:
After the data export of the Wikipedia data we can now tag the data using CLAWS and USAS, whereby CLAWS and USAS will be accessed via the binary files directly.
# Requires the following environment variables to have been set via export:
# rdirectory, USAS_EXE, and USAS_RESOURCES
uv run src/mosaico_usas_processing/claws_usas_tagging.py ./data/wikipedia_export.jsonl \
ascii_text \
./data/binary_tagged_wikipedia_export.jsonl \
--claws-run-script /home/andrew/Downloads/temp_tagger/new_claws/new_claws/bin/run_claws.sh \
--usas-run-script /home/andrew/Downloads/temp_tagger/usas/bin/run_semtag.sh \
2>./log/binary_tagging_error.txt

Which will output the following:
Number of documents saved to ./data/binary_tagged_wikipedia_export.jsonl: 494
Number of documents we did not save: 1
Total number of processed documents: 495
Total number tokens in the saved dataset: 2,434,898
Total time taken to process (seconds): 866.96
Peak amount of memory used by: 524.116 MB

Note: the make test command in essence does the following for you:
For CLAWS, git clone the following repository (requires VPN access or to be on Lancaster University campus). Once downloaded:
- Compile CLAWS:
cd src
gcc -g -DPROBCALCULATION -o claws4 claws4.c regexp.c
cd ..

- Create a directory that contains the CLAWS binary and all the lexical resources it requires:
mkdir claws_bundle
cp src/claws4 claws_bundle/.
cp resources/* claws_bundle/.

- When running the Python script for tagging with binaries, ensure that the environment variable `rdirectory` is set to the absolute path of the `claws_bundle` directory, e.g. `export rdirectory="$(pwd)/claws_bundle"`.
- The `claws_run_script` argument of `src.mosaico_usas_processing.claws:tag_with_binary` should be set to `"$(pwd)/bin/run_claws.sh"`. When running the tests this is what the environment variable `CLAWS_RUN_SCRIPT` should be set to.
For USAS, git clone the following repository (requires VPN access or to be on Lancaster University campus). Once downloaded:
- Set the following environment variables:
  - `USAS_EXE` - an absolute path to the relevant pre-compiled semantic tagger, e.g. `semtag_debian64`.
  - `USAS_RESOURCES` - an absolute path to the relevant lexical resources that the semantic tagger requires, e.g. `resources`.
- The `usas_run_script` argument of `src.mosaico_usas_processing.usas:tag_with_binary` should be set to `"$(pwd)/bin/run_semtag.sh"`. When running the tests this is what the environment variable `USAS_RUN_SCRIPT` should be set to.
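As a quick sanity check before running the binary-based tagging, here is a minimal sketch (not part of the repository) that verifies the environment variables described above point at files and directories that exist:

```python
import os
from pathlib import Path

# Environment variables the binary-based tagging and the tests rely on
# (see the CLAWS and USAS binaries setup description above).
required = {
    "CLAWS_RUN_SCRIPT": "file",
    "rdirectory": "directory",
    "USAS_RUN_SCRIPT": "file",
    "USAS_EXE": "file",
    "USAS_RESOURCES": "directory",
}

for name, kind in required.items():
    value = os.environ.get(name)
    if value is None:
        print(f"{name} is not set")
    elif kind == "file" and not Path(value).is_file():
        print(f"{name}={value} does not point to a file")
    elif kind == "directory" and not Path(value).is_dir():
        print(f"{name}={value} does not point to a directory")
    else:
        print(f"{name}={value} looks OK")
```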
uv run src/mosaico_usas_processing/claws_usas_tagging.py --help
usage: claws_usas_tagging.py [-h] [-t TOKENIZER_KEY] [-l LEMMAS_KEY] [-p POS_KEY] [-u USAS_KEY] [-r USAS_RAW_KEY] [-s SENTENCE_BOUNDARIES_KEY] [-c CLAWS_DOCKER_CONTAINER_NAME]
[-d USAS_DOCKER_CONTAINER_NAME] [--claws-run-script CLAWS_RUN_SCRIPT] [--usas-run-script USAS_RUN_SCRIPT] [-g {debug,info,error}]
jsonl_file_path text_key output_file_path
Given a JSONL file whereby each line is a JSON entry that contains a `text_key` with a value that is text that is to be tagged, this script will tag that text for all JSON entry lines.
The output will be the same JSONL input data but with the addition of the following keys for each JSON entry: `tokens`, `lemmas`, `pos`, `usas`, and sentence_boundaries of which these
keys names are configurable. The tokens, Part Of Speech tags, and sentence boundaries will come from the CLAWS tagger and the lemmas and USAS tags will come from the USAS tagger.
positional arguments:
jsonl_file_path File path to a JSONL file whereby each line contains a key whereby the value should be tagged.
text_key JSON key name whereby the value contains the text to be tagged.
output_file_path File path to store the output too in JSONL format.
options:
-h, --help show this help message and exit
-t TOKENIZER_KEY, --tokenizer-key TOKENIZER_KEY
The key to store the tokens too. Default: tokens
-l LEMMAS_KEY, --lemmas-key LEMMAS_KEY
The key to store the lemmas too. Default: lemmas
-p POS_KEY, --pos-key POS_KEY
The key to store the POS tags too. Default: pos
-u USAS_KEY, --usas-key USAS_KEY
The key to store the USAS tags too. Default: usas
-r USAS_RAW_KEY, --usas-raw-key USAS_RAW_KEY
The key to store the USAS raw tags too. Default: usas_raw
-s SENTENCE_BOUNDARIES_KEY, --sentence-boundaries-key SENTENCE_BOUNDARIES_KEY
The key to store the sentence boundaries too. Default: sentence_boundaries
-c CLAWS_DOCKER_CONTAINER_NAME, --claws-docker-container-name CLAWS_DOCKER_CONTAINER_NAME
Name of the running docker container that the CLAWS tokeniser and POS tagger is running on. Default: claws:4.0
-d USAS_DOCKER_CONTAINER_NAME, --usas-docker-container-name USAS_DOCKER_CONTAINER_NAME
Name of the running docker container that the USAS semantic tagger is running on. Default: usas:7.0
--claws-run-script CLAWS_RUN_SCRIPT
Absolute path to the CLAWS run script, a script that is a wrapper around the CLAWS binary.
--usas-run-script USAS_RUN_SCRIPT
Absolute path to the USAS run script, a script that is a wrapper around the USAS binary.
-g {debug,info,error}, --logging-level {debug,info,error}
The logging level, debug most verbose, info, or error which is the least verbose. Default is info.

uv run src/mosaico_usas_processing/tagged_data_export.py

Which outputs:
{
"$defs": {
"USASTag": {
"description": "Represents all of the properties associated with a USAS tag.",
"properties": {
"tag": {
"description": "USAS Tag",
"examples": [
"A1.1.1"
],
"title": "USAS Tag",
"type": "string"
},
"number_positive_markers": {
"default": 0,
"description": "Number of positive markers.",
"examples": [
0,
1,
2,
3
],
"title": "Positive Markers",
"type": "integer"
},
"number_negative_markers": {
"default": 0,
"description": "Number of negative markers.",
"examples": [
0,
1,
2,
3
],
"title": "Negative Markers",
"type": "integer"
},
"rarity_marker_1": {
"default": false,
"description": "Rarity marker 1 indicated by %",
"title": "Rare Marker 1",
"type": "boolean"
},
"rarity_marker_2": {
"default": false,
"description": "Rarity marker 2 indicated by @",
"title": "Rare Marker 2",
"type": "boolean"
},
"female": {
"default": false,
"description": "Female",
"title": "Female",
"type": "boolean"
},
"male": {
"default": false,
"description": "Male",
"title": "Male",
"type": "boolean"
},
"antecedents": {
"default": false,
"description": "Potential antecedents of conceptual anaphors (neutral for number)",
"title": "Antecedents",
"type": "boolean"
},
"neuter": {
"default": false,
"description": "Neuter",
"title": "Neuter",
"type": "boolean"
},
"idiom": {
"default": false,
"description": "Is it an idiom",
"title": "Idiom",
"type": "boolean"
}
},
"required": [
"tag"
],
"title": "USASTag",
"type": "object"
},
"USASTagGroup": {
"description": "Represents a grouping of one or more USAS tags that are associated to a\ntoken.",
"properties": {
"tags": {
"description": "A grouping of one or more USAS tags whereby if more than one exists then the word is an equal member of all semantic tags/categories",
"examples": [
[
{
"antecedents": false,
"female": false,
"idiom": false,
"male": false,
"neuter": false,
"number_negative_markers": 0,
"number_positive_markers": 0,
"rarity_marker_1": false,
"rarity_marker_2": false,
"tag": "A1.1.1"
}
],
[
{
"antecedents": false,
"female": false,
"idiom": false,
"male": false,
"neuter": false,
"number_negative_markers": 1,
"number_positive_markers": 0,
"rarity_marker_1": false,
"rarity_marker_2": false,
"tag": "E2"
},
{
"antecedents": false,
"female": false,
"idiom": false,
"male": false,
"neuter": false,
"number_negative_markers": 0,
"number_positive_markers": 1,
"rarity_marker_1": false,
"rarity_marker_2": false,
"tag": "S7.1"
}
]
],
"items": {
"$ref": "#/$defs/USASTag"
},
"title": "USAS Tags",
"type": "array"
}
},
"required": [
"tags"
],
"title": "USASTagGroup",
"type": "object"
}
},
"properties": {
"document_id": {
"description": "ID that uniquely represents that document within the Mosaico Mongo DB",
"examples": [
"3021080"
],
"title": "Document ID",
"type": "string"
},
"wikidata_id": {
"description": "Wikidata ID, every Wikipedia page should one as it is an unique ID that allows you to access it's global unique URL https://www.wikidata.org/entity/ID",
"examples": [
"Q921355"
],
"title": "Wikidata ID",
"type": "string"
},
"title": {
"description": "Wikipedia page title",
"examples": [
"Erik Adolf von Willebrand"
],
"title": "Title",
"type": "string"
},
"text": {
"description": "The UTF-8 encoded Wikipedia article text",
"title": "Text",
"type": "string"
},
"ascii_text": {
"description": "ASCII encoded version of the Wikipedia article text",
"title": "ASCII Text",
"type": "string"
},
"language": {
"const": "en",
"description": "Language of the Wikipedia article",
"examples": [
"en"
],
"title": "Language",
"type": "string"
},
"quality": {
"description": "Quality of the Wikipedia page as determined by the Wikipedia community",
"enum": [
"good",
"featured"
],
"examples": [
"good",
"featured"
],
"title": "Quality",
"type": "string"
},
"ores_articletopics": {
"additionalProperties": {
"type": "number"
},
"description": "High level article topics that are easily searchable and have been predicted by a machine learning model. This will be represented as a dictionary of topic and score whereby the score is between 0-1 where 1 indicates the model is more confident of it's prediction.",
"examples": [
{
"Geography.Regions.Europe.Northern Europe": 0.584
}
],
"title": "ORES article topics",
"type": "object"
},
"categories": {
"description": "A noisy list of article topics that are found on the Wikipedia page at the end. To note that the hierarchy of this category system can be found through the SQL database dumps according to this source. The reason these are noisy is that they sometimes contain meta data topics like `CS1 Swedish-language sources (sv)` or `Good articles`.",
"examples": [
[
"CS1 Swedish-language sources (sv)",
"AC with 0 elements",
"1870 births",
"1949 deaths",
"Academics of the University of Helsinki",
"Finnish hematologists",
"Finnish people of German descent",
"People from Vaasa"
]
],
"items": {
"type": "string"
},
"title": "Categories",
"type": "array"
},
"popularity_score": {
"anyOf": [
{
"type": "number"
},
{
"type": "null"
}
],
"description": "As defined in the cirrus schema, 'A floating point number representing the percentage of page views to this wiki that requests this page. This is only available for content pages.' If the popularity score cannot be validated or found it will have a value of `None`.",
"examples": [
8.327128616319467e-08,
null
],
"title": "Popularity Score"
},
"timestamp": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"description": "Timestamp of the most recently index/edited version of the page. If the timestamp cannot be found it will have a value of `None`.",
"examples": [
"2021-12-26T18:49:13Z",
null
],
"title": "timestamp"
},
"tokens": {
"items": {
"type": "string"
},
"title": "Tokens",
"type": "array"
},
"lemmas": {
"items": {
"type": "string"
},
"title": "Lemmas",
"type": "array"
},
"pos": {
"items": {
"items": {
"maxItems": 2,
"minItems": 2,
"prefixItems": [
{
"type": "string"
},
{
"type": "integer"
}
],
"type": "array"
},
"type": "array"
},
"title": "POS",
"type": "array"
},
"usas": {
"items": {
"items": {
"$ref": "#/$defs/USASTagGroup"
},
"type": "array"
},
"title": "USAS",
"type": "array"
},
"usas_raw": {
"items": {
"type": "string"
},
"title": "USAS Raw",
"type": "array"
},
"sentence_boundaries": {
"items": {
"maxItems": 2,
"minItems": 2,
"prefixItems": [
{
"type": "integer"
},
{
"type": "integer"
}
],
"type": "array"
},
"title": "Sentence Boundaries",
"type": "array"
}
},
"required": [
"document_id",
"wikidata_id",
"title",
"text",
"ascii_text",
"language",
"quality",
"ores_articletopics",
"categories",
"popularity_score",
"timestamp",
"tokens",
"lemmas",
"pos",
"usas",
"usas_raw",
"sentence_boundaries"
],
"title": "TaggedWikiDataExport",
"type": "object"
}
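To make the tagged format concrete, here is a minimal sketch (not part of the repository) that reads the first tagged JSONL line from the Docker-based tagging example above and inspects a few of the annotation fields:

```python
import json

# Path produced by the Docker-based tagging example above.
tagged_path = "./data/docker_tagged_wikipedia_export.jsonl"

with open(tagged_path, encoding="utf-8") as tagged_file:
    first_page = json.loads(next(tagged_file))

print(first_page["title"])
print("Annotation keys:", sorted(first_page.keys()))
print("Number of tokens:", len(first_page["tokens"]))
print("First 10 tokens:", first_page["tokens"][:10])
# Each entry in sentence_boundaries is a pair of integers (see the schema above).
print("First sentence boundary pair:", first_page["sentence_boundaries"][0])
```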
Hex, which is the Lancaster University NLP group's compute cluster (which uses Slurm), can be used to tag data using this code base; below is an example of how to do so.

We assume that you already have a dataset to tag that is in the Wikipedia format this script expects (./src/mosaico_usas_processing/claws_usas_tagging.py); an example dataset that can be generated is detailed in the section Wikipedia data export from Mosaico. Transfer the dataset to the Hex cluster using scp:
scp dataset.jsonl login.ucrel-hex.scc.lancs.ac.uk:/PATH/TO/STORE/IT/ON/HEX

Git clone this repository:
git clone git@github.com:UCREL/mosaico-usas-processing.git

As we are working on Hex we need to create a Python virtual environment, not with uv, but with the standard library venv. This can be done like so:
python3 -m venv venv
source venv/bin/activate

For this to work on HEX, as HEX currently only has one version of Python, the pyproject.toml file needs to be changed so that the Python version requirement is:
requires-python = ">3.9"The reason why this proect had to be set to Python=3.10.* is that for the data extraction of the Wikipedia data the wikiextractor python package only works with Python version 3.10. As we are only processing the data we do not need to be fixed to this version of Python therefore making this chnage is ok.
Then, to install all of the required Python packages:
pip install -r requirements.txt

Note: if you have issues with mosaico @ git+ssh://git@github.com/SapienzaNLP/mosaico@19c57473b9d77601fdf2b01cd2f8f766939ff52a when installing the Python packages, change this to mosaico @ git+https://github.com/SapienzaNLP/mosaico@19c57473b9d77601fdf2b01cd2f8f766939ff52a and it should then work.
As we are going to perform the tagging on HEX using the CLAWS and USAS binaries, we need to get both the CLAWS and USAS GitHub repositories and build CLAWS; we can do this using the following make command:
make build-claws

The CLAWS and USAS repositories should now be in the folders claws and usas respectively.
As we are going to process all of the English Wikipedia articles we extracted (10,856 articles), and processing a small subset of those (495 articles, 4.55%) took about 14 minutes, it makes sense to batch the articles into 10 chunks of ~1,086 articles each, which should take around 30 minutes per chunk to process. Assuming that the JSONL file containing the Wikipedia articles is at the path ./wikipedia_export.jsonl, we can chunk the articles like so:
mkdir wikipedia_data_chunks
bash data_chunking.sh 10 ./wikipedia_export.jsonl ./wikipedia_data_chunks

Then we should see the following in the ./wikipedia_data_chunks folder:
wc -l ./wikipedia_data_chunks/*
1086 wikipedia_data_chunks/wikipedia_export.jsonl.0
1086 wikipedia_data_chunks/wikipedia_export.jsonl.1
1086 wikipedia_data_chunks/wikipedia_export.jsonl.2
1086 wikipedia_data_chunks/wikipedia_export.jsonl.3
1086 wikipedia_data_chunks/wikipedia_export.jsonl.4
1086 wikipedia_data_chunks/wikipedia_export.jsonl.5
1086 wikipedia_data_chunks/wikipedia_export.jsonl.6
1086 wikipedia_data_chunks/wikipedia_export.jsonl.7
1086 wikipedia_data_chunks/wikipedia_export.jsonl.8
1082 wikipedia_data_chunks/wikipedia_export.jsonl.9
10856 total

Each chunk contains 1,086 lines/data points, apart from the last chunk which has a few less.
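For reference, the chunking amounts to roughly the following (a minimal Python sketch of the same behaviour, not the repository's data_chunking.sh itself):

```python
import math
import sys
from pathlib import Path


def chunk_jsonl(num_chunks: int, input_path: str, output_dir: str) -> None:
    """Split a JSONL file into num_chunks roughly equal files named <input name>.<i>."""
    with open(input_path, encoding="utf-8") as input_file:
        lines = input_file.readlines()
    chunk_size = math.ceil(len(lines) / num_chunks)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for chunk_index in range(num_chunks):
        chunk = lines[chunk_index * chunk_size:(chunk_index + 1) * chunk_size]
        chunk_file = out_dir / f"{Path(input_path).name}.{chunk_index}"
        chunk_file.write_text("".join(chunk), encoding="utf-8")


if __name__ == "__main__":
    # e.g. python chunk_sketch.py 10 ./wikipedia_export.jsonl ./wikipedia_data_chunks
    chunk_jsonl(int(sys.argv[1]), sys.argv[2], sys.argv[3])
```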
We can now create the following SBATCH SLURM script, runTagging.sh, which will:
- Use the 6 hour maximum processing time nodes, of which we have requested 10 through `--array` (this does not mean we will get all 10 at once).
- All stdout will be logged to `tagging_log/out/wikipedia_en_tagging_%A_%a.log` where `A` is the Job ID and `a` is the array ID within that job ID.
- All stderr will be logged to `tagging_log/error/wikipedia_en_tagging_%A_%a.log` where `A` is the Job ID and `a` is the array ID within that job ID.
- We ensure that all environment variables required by the CLAWS and USAS binaries are initialised using `source ./env.sh` for each processing node.
- Each node, based on its array ID (0-9), will process a chunk of the Wikipedia data and save the tagged chunk to the folder `./tagged_wikipedia_chunks/`.
- The `CLAWS_RUN_SCRIPT` and `USAS_RUN_SCRIPT` environment variables are set through `source ./env.sh`.
#!/bin/bash
#SBATCH --partition=cpu-6h
#SBATCH --output=tagging_log/out/wikipedia_en_tagging_%A_%a.log
#SBATCH --error=tagging_log/error/wikipedia_en_tagging_%A_%a.log
#SBATCH --array=0-9
source ./env.sh
echo "${SLURM_ARRAY_TASK_ID}: Starting tagging"
python src/mosaico_usas_processing/claws_usas_tagging.py \
./wikipedia_data_chunks/wikipedia_export.jsonl.${SLURM_ARRAY_TASK_ID} \
ascii_text \
./tagged_wikipedia_chunks/wikipedia_export.jsonl.${SLURM_ARRAY_TASK_ID} \
--claws-run-script ${CLAWS_RUN_SCRIPT} \
--usas-run-script ${USAS_RUN_SCRIPT}
echo "${SLURM_ARRAY_TASK_ID}: Finished tagging"Before running the SLURM scripts ensure we have tagged_wikipedia_chunks as a directory and we are in the python virtual environemnt (source venv/bin/activate):
mkdir tagged_wikipedia_chunks

To run the SLURM script:
sbatch runTagging.sh

After tagging we can check whether any articles were not processed by looking at our stdout logs tagging_log/out/wikipedia_en_tagging_* (see the sketch after the listing below for a quick way to count them). The data we have tagged should be in the directory ./tagged_wikipedia_chunks as 10 separate files:
wc -l ./tagged_wikipedia_chunks/*
1083 tagged_wikipedia_chunks/wikipedia_export.jsonl.0
1080 tagged_wikipedia_chunks/wikipedia_export.jsonl.1
1079 tagged_wikipedia_chunks/wikipedia_export.jsonl.2
1081 tagged_wikipedia_chunks/wikipedia_export.jsonl.3
1083 tagged_wikipedia_chunks/wikipedia_export.jsonl.4
1075 tagged_wikipedia_chunks/wikipedia_export.jsonl.5
1077 tagged_wikipedia_chunks/wikipedia_export.jsonl.6
1075 tagged_wikipedia_chunks/wikipedia_export.jsonl.7
1078 tagged_wikipedia_chunks/wikipedia_export.jsonl.8
1068 tagged_wikipedia_chunks/wikipedia_export.jsonl.9
10779 total

These data files can now be found in gzip compressed format in the HuggingFace Dataset repository ucrelnlp/English-USAS-Mosaico. Note the naming convention is slightly different, e.g. wikipedia_shard_0.jsonl.gz = wikipedia_export.jsonl.0.
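To quantify how many articles were dropped during tagging (the check mentioned above), here is a minimal sketch (not part of the repository) that compares the line counts of the input chunks against the tagged chunks:

```python
from pathlib import Path

input_dir = Path("./wikipedia_data_chunks")
tagged_dir = Path("./tagged_wikipedia_chunks")

total_in = total_out = 0
for input_chunk in sorted(input_dir.glob("wikipedia_export.jsonl.*")):
    tagged_chunk = tagged_dir / input_chunk.name
    n_in = sum(1 for _ in input_chunk.open(encoding="utf-8"))
    n_out = sum(1 for _ in tagged_chunk.open(encoding="utf-8")) if tagged_chunk.exists() else 0
    total_in += n_in
    total_out += n_out
    print(f"{input_chunk.name}: {n_in - n_out} article(s) not tagged")
print(f"Total articles not tagged: {total_in - total_out}")
```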
- Paul Rayson (p.rayson@lancaster.ac.uk)
- Andrew Moore (a.p.moore@lancaster.ac.uk)
- UCREL Research Centre (ucrel@lancaster.ac.uk) at Lancaster University.