Mosaico USAS and CLAWS processing

TL;DR: The code to process Wikipedia articles from the Mosaico subset through UCREL's English-only C version of the CLAWS and USAS NLP pipeline, which produces the following syntactic and semantic annotations:

  • Sentence boundaries
  • Tokens
  • Lemmas
  • POS tags
  • USAS tags

The pre-processed data that this code base has created can be found at the following HuggingFace Dataset repository: ucrelnlp/English-USAS-Mosaico.

Note: due to the dependency on the C version of CLAWS and USAS, the first time you run the make test command, or the first time you want to tag something with CLAWS or USAS, you will need a VPN connection to the Lancaster University network or to be on the Lancaster University campus, as it requires a connection to the following two GitLab repositories on delta (if you are interested in tagging using this pipeline please get in touch with us; contact information is at the bottom of this README):

  1. CLAWS
  2. USAS

Mosaico is a dataset of processed Wikipedia pages that have been filtered to only contain pages tagged as good or featured. The Mosaico dataset contains various NLP annotations, but in this project we are currently only interested in the Wikipedia article pages so that we can add our own NLP annotations, specifically:

  • Sentence boundaries
  • Tokens
  • Lemmas
  • POS tags
  • USAS tags

To start with we will only tag the English Wikipedia pages that are tagged as good or featured. The tagging will be performed by:

  • CLAWS - generates sentence boundaries, tokens, and POS tags.
  • USAS - generates lemmas and USAS tags based on the CLAWS tokens and POS tags.

Setup

We have used pixi and uv for this project: uv is in essence a Python virtual environment manager, and pixi is an extension of that which we use to install Perl. Once you have installed uv and pixi, run the following to set up the project:

(pixi will likely install perl at $HOME/.pixi/bin/perl)

uv venv
uv lock
pixi global install --channel conda-forge perl

Note: we use a Perl script to replace some UTF-8 characters with ASCII equivalent characters. This is required because CLAWS requires text to be encoded in ASCII rather than UTF-8.

To activate a Python REPL with all of the project requirements:

uv run python

The data export requires Docker.

Tests

To run the tests:

source env.sh
make test

The make command ensures that all of the required Docker images are created and that the binaries and files referenced by the environment variables set within ./env.sh are downloaded.

Git repositories downloaded are:

  1. CLAWS - downloaded to ./claws
  2. USAS - downloaded to ./usas

Docker images created/required:

  • claws:4.0 - created by building within the ./claws repository.
  • usas:7.0 - created by building within the ./usas repository.

Environment variables set within env.sh: All of these environment variables are explained in the section CLAWS and USAS binaries setup:

  • CLAWS_RUN_SCRIPT
  • rdirectory
  • USAS_RUN_SCRIPT
  • USAS_EXE
  • USAS_RESOURCES

Linting

To run the linting (ruff), formatting (ruff), and type checker (pyrefly) checks:

make check

Wikipedia data export from Mosaico

This section describes how to export the English Wikipedia data from Mosaico into JSONL (JSON lines) format so that the data can be further processed by other tools (USAS and CLAWS):

Data Export

As mentioned, the Wikipedia data appears to have come from this Cirrus Wikipedia dump, for which a schema can be found here. We did investigate the structured and unstructured Wikipedia dumps on HuggingFace, but they do not appear to contain the Good and Featured article tags that Mosaico state are good signals for quality. Training on this subset of data has been shown to be more efficient than training on the whole Wikipedia dataset, while achieving comparable results for WSD on all but the rare word sense test set (42D).

Data export structure

The data export will be in JSONL format, whereby each JSON line contains one unique Wikipedia page entry with the following JSON structure:

uv run python -c "from mosaico_usas_processing.data_export import WikiDataExport; import json; print(json.dumps(WikiDataExport.model_json_schema(),indent=2))"
{
  "properties": {
    "document_id": {
      "description": "ID that uniquely represents that document within the Mosaico Mongo DB",
      "examples": [
        "3021080"
      ],
      "title": "Document ID",
      "type": "string"
    },
    "wikidata_id": {
      "description": "Wikidata ID, every Wikipedia page should one as it is an unique ID that allows you to access it's global unique URL https://www.wikidata.org/entity/ID",
      "examples": [
        "Q921355"
      ],
      "title": "Wikidata ID",
      "type": "string"
    },
    "title": {
      "description": "Wikipedia page title",
      "examples": [
        "Erik Adolf von Willebrand"
      ],
      "title": "Title",
      "type": "string"
    },
    "text": {
      "description": "The UTF-8 encoded Wikipedia article text",
      "title": "Text",
      "type": "string"
    },
    "ascii_text": {
      "description": "ASCII encoded version of the Wikipedia article text",
      "title": "ASCII Text",
      "type": "string"
    },
    "language": {
      "const": "en",
      "description": "Language of the Wikipedia article",
      "examples": [
        "en"
      ],
      "title": "Language",
      "type": "string"
    },
    "quality": {
      "description": "Quality of the Wikipedia page as determined by the Wikipedia community",
      "enum": [
        "good",
        "featured"
      ],
      "examples": [
        "good",
        "featured"
      ],
      "title": "Quality",
      "type": "string"
    },
    "ores_articletopics": {
      "additionalProperties": {
        "type": "number"
      },
      "description": "High level article topics that are easily searchable and have been predicted by a machine learning model. This will be represented as a dictionary of topic and score whereby the score is between 0-1 where 1 indicates the model is more confident of it's prediction.",
      "examples": [
        {
          "Geography.Regions.Europe.Northern Europe": 0.584
        }
      ],
      "title": "ORES article topics",
      "type": "object"
    },
    "categories": {
      "description": "A noisy list of article topics that are found on the Wikipedia page at the end. To note that the hierarchy of this category system can be found through the SQL database dumps according to this source. The reason these are noisy is that they sometimes contain meta data topics like `CS1 Swedish-language sources (sv)` or `Good articles`.",
      "examples": [
        [
          "CS1 Swedish-language sources (sv)",
          "AC with 0 elements",
          "1870 births",
          "1949 deaths",
          "Academics of the University of Helsinki",
          "Finnish hematologists",
          "Finnish people of German descent",
          "People from Vaasa"
        ]
      ],
      "items": {
        "type": "string"
      },
      "title": "Categories",
      "type": "array"
    },
    "popularity_score": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "description": "As defined in the cirrus schema, 'A floating point number representing the percentage of page views to this wiki that requests this page. This is only available for content pages.' If the popularity score cannot be validated or found it will have a value of `None`.",
      "examples": [
        8.327128616319467e-08,
        null
      ],
      "title": "Popularity Score"
    },
    "timestamp": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "description": "Timestamp of the most recently index/edited version of the page. If the timestamp cannot be found it will have a value of `None`.",
      "examples": [
        "2021-12-26T18:49:13Z",
        null
      ],
      "title": "timestamp"
    }
  },
  "required": [
    "document_id",
    "wikidata_id",
    "title",
    "text",
    "ascii_text",
    "language",
    "quality",
    "ores_articletopics",
    "categories",
    "popularity_score",
    "timestamp"
  ],
  "title": "WikiDataExport",
  "type": "object"
}
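
Since WikiDataExport is a Pydantic model (the schema above comes from model_json_schema()), the exported JSONL can be read back into validated Python objects. A minimal sketch, assuming the export has already been written to ./data/wikipedia_export.jsonl and that the model field names match the JSON keys shown above:

from mosaico_usas_processing.data_export import WikiDataExport

# Each line of the export is one Wikipedia page; validate it against the model.
with open("./data/wikipedia_export.jsonl", encoding="utf-8") as export_file:
    for line in export_file:
        page = WikiDataExport.model_validate_json(line)
        print(page.title, page.quality)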

More information on ores_articletopics: high level article topics that are easily searchable and have been predicted by a machine learning model. This is represented as a dictionary of topic and score, e.g. {'Geography.Regions.Europe.Northern Europe': 0.584, 'Culture.Biography.Biography*': 0.983, 'Geography.Regions.Europe.Europe*': 0.857, 'STEM.STEM*': 0.983}, which shows that the model predicts STEM.STEM* with 98.3% confidence.

The ascii_text is generated by passing the text (original UTF-8 encoded text) through the ./non_python_scripts/UTF82CLAWS.pl Perl script, which replaces some UTF-8 characters with ASCII equivalent characters; we then re-encode the string to ASCII ignoring all errors, thereby removing any remaining characters that cannot be mapped to ASCII.
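
A minimal sketch of that two-step conversion, assuming the Perl script reads UTF-8 text on stdin and writes the converted text to stdout (the data export script handles this for you):

import subprocess

def to_claws_ascii(text: str, perl_script: str = "./non_python_scripts/UTF82CLAWS.pl") -> str:
    # Step 1: replace known UTF-8 characters with ASCII equivalents via the Perl script
    # (assumed here to be a stdin/stdout filter).
    result = subprocess.run(["perl", perl_script], input=text,
                            capture_output=True, text=True, check=True)
    # Step 2: drop any remaining characters that still cannot be mapped to ASCII.
    return result.stdout.encode("ascii", errors="ignore").decode("ascii")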

Run the data export

NOTE We use the ./data folder to download and save data to.

NOTE We also use the ./log folder to store error logs.

To export the data we assume you have started the Mongo database and imported the data into Mongo (see the Set up MongoDB section from the Mosaico GitHub repository; note we do not need the interlanguage-links data collection).

Note: when setting up MongoDB, as the script only exports the English Wikipedia articles, we filtered the MongoDB pages data so that it only includes the English Wikipedia articles, like so (this should make the process quicker and more memory efficient):

cat ./data/pages.collection.json | grep "\"language\":\"en\"" > ./data/en_pages.collection.json

Useful code from Mosaico's GitHub repository on importing the data into the Mongo DB and running the Mongo DB:

docker run \
  -e MONGO_INITDB_ROOT_USERNAME=admin \
  -e MONGO_INITDB_ROOT_PASSWORD=password \
  -p 27017:27017 \
  --name local-mosaico-db \
  --detach \
  mongo:6.0.11

# import pages
docker exec -i local-mosaico-db \
    mongoimport \
    --authenticationDatabase admin -u admin -p password \
    --db mosaico --collection pages < ./data/en_pages.collection.json


# import annotations
docker exec -i local-mosaico-db \
    mongoimport \
    --authenticationDatabase admin -u admin -p password \
    --db mosaico --collection annotations < ./data/annotations.collection.json

The following code is computationally very efficient, using only 1 CPU; it takes about 1 minute to process 835 documents.

uv run src/mosaico_usas_processing/data_export.py ./data/wikipedia_export.jsonl --logging-level error 2>log/data_export_error.txt

After the script has run it will output some statistics:

Outputting/exporting data to: wikipedia_export.jsonl
careful, beanie issue still open
Data statistics:
Number of documents saved to wikipedia_export.jsonl: 495
Number of documents we did not save: 340
Total number of processed documents: 835
Total number of ascii tokens in the saved dataset: 1,995,931
Total time taken to process (seconds): 56.22

For the full English only Wikipedia pages:

Outputting/exporting data to: wikipedia_export.jsonl
careful, beanie issue still open
Data statistics:
Number of documents saved to wikipedia_export.jsonl: 10,856
Number of documents we did not save: 6,336
Total number of processed documents: 17,192
Total number of ascii tokens in the saved dataset: 74,595,642
Total time taken to process (seconds): 2,069.30

Note: most of the pages we did not export were skipped because we could not extract the quality metadata.

After the export you can remove the Mongo Database container:

docker stop local-mosaico-db
docker rm local-mosaico-db

Help information from the script:

uv run src/mosaico_usas_processing/data_export.py --help
usage: data_export.py [-h] [--perl-script-path PERL_SCRIPT_PATH] [-l {debug,info,error}] output_file_path

Data exported from the Mongo Database of Wikipedia pages to the output file in JSONL format, whereby each JSON line contains information on each unique Wikipedia page within the Mongo
Database.

positional arguments:
  output_file_path      The JSONL file to export the Mongo Database data too.

options:
  -h, --help            show this help message and exit
  --perl-script-path PERL_SCRIPT_PATH
                        File path to the UTF82CLAWS.pl Perl script. Default: /home/andrew/Downloads/mosaico-usas-
                        processing/src/mosaico_usas_processing/data_export.py/../../../non_python_scripts/UTF82CLAWS.pl
  -l {debug,info,error}, --logging-level {debug,info,error}
                        The logging level, debug most verbose, info, or error which is the least verbose. Default is info.

Data tagging

The data tagging can be done either using:

  • docker - easier for development and more cross-platform friendly.
  • pre-compiled binaries - allows us to run it on HEX (HEX is a Slurm cluster).

The reason for the two options is that CLAWS and USAS can be run either way. It is easier to test with the Docker containers, but the binaries allow us to run it on HEX.

No matter which approach is chosen, docker or binaries, they both produce a tagged dataset with the same format. For the format see the section Tagged dataset format.

Tagging with docker (small dataset test)

After the data export of the Wikipedia data we can now tag the data using CLAWS and USAS, with both accessed via Docker containers.

uv run src/mosaico_usas_processing/claws_usas_tagging.py ./data/wikipedia_export.jsonl \
  ascii_text \
  ./data/docker_tagged_wikipedia_export.jsonl \
  2>./log/docker_tagging_error.txt

Which will output the following:

Number of documents saved to ./data/tagged_small_wikipedia_export.jsonl: 494
Number of documents we did not save: 1
Total number of processed documents: 495
Total number tokens in the saved dataset: 2,434,898
Total time taken to process (seconds): 1,812.58
Peak amount of memory used by: 544.076 MB

The one document we did not save had a mismatch between the number of CLAWS and USAS tokens; this should not occur, hence we did not save that document.
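
A minimal sketch of the alignment check this implies, assuming the outer tokens, lemmas, pos, and usas arrays in the tagged output are all token aligned (see the section Tagged dataset format):

import json

# Every saved document should have one lemma, POS entry, and USAS entry per CLAWS token;
# documents where this alignment failed were not saved.
with open("./data/docker_tagged_wikipedia_export.jsonl", encoding="utf-8") as tagged_file:
    for line in tagged_file:
        document = json.loads(line)
        number_of_tokens = len(document["tokens"])
        assert number_of_tokens == len(document["lemmas"]) == len(document["pos"]) == len(document["usas"]), (
            f"Token/tag mismatch in document {document['document_id']}"
        )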

Tagging with binaries (small dataset test)

Please see the CLAWS and USAS binaries setup section below before running the following:

After the data export of the Wikipedia data we can now tag the data using CLAWS and USAS, with both accessed directly via the binary files.

# Requires the following environment variables to have been set via export:
# rdirectory, USAS_EXE, and USAS_RESOURCES
uv run src/mosaico_usas_processing/claws_usas_tagging.py ./data/wikipedia_export.jsonl \
  ascii_text \
  ./data/binary_tagged_wikipedia_export.jsonl \
  --claws-run-script /home/andrew/Downloads/temp_tagger/new_claws/new_claws/bin/run_claws.sh \
  --usas-run-script /home/andrew/Downloads/temp_tagger/usas/bin/run_semtag.sh \
  2>./log/binary_tagging_error.txt

Which will output the following:

Number of documents saved to ./data/binary_tagged_wikipedia_export.jsonl: 494
Number of documents we did not save: 1
Total number of processed documents: 495
Total number tokens in the saved dataset: 2,434,898
Total time taken to process (seconds): 866.96
Peak amount of memory used by: 524.116 MB

CLAWS and USAS binaries setup

Note: the make test command in essence does the following for you:

For CLAWS, git clone the following repository (requires VPN access or to be on Lancaster University campus). Once downloaded:

  1. Compile CLAWS:
cd src
gcc -g -DPROBCALCULATION -o claws4 claws4.c regexp.c
cd ..
  2. Create a directory that contains the CLAWS binary and all the lexical resources it requires:
mkdir claws_bundle
cp src/claws4 claws_bundle/.
cp resources/* claws_bundle/.
  3. When running the Python script for tagging with binaries ensure that the environment variable rdirectory is set to the absolute path of the claws_bundle directory, e.g. export rdirectory="$(pwd)/claws_bundle"
  4. The claws_run_script argument of src.mosaico_usas_processing.claws:tag_with_binary should be set to "$(pwd)/bin/run_claws.sh". When running the tests this is what the environment variable CLAWS_RUN_SCRIPT should be set to.

For USAS, git clone the following repository (requires VPN access or to be on Lancaster University campus). Once downloaded:

  • Set the following environment variables:
    • USAS_EXE - to an absolute path to the relevant pre-compiled semantic tagger, e.g. semtag_debian64
    • USAS_RESOURCES - to an absolute path to the relevant lexical resources that the semantic tagger requires, e.g. resources
  • The usas_run_script argument of src.mosaico_usas_processing.usas:tag_with_binary should be set to "$(pwd)/bin/run_semtag.sh". When running the tests this is what the environment variable USAS_RUN_SCRIPT should be set to.

Help information from the data tagging script

uv run src/mosaico_usas_processing/claws_usas_tagging.py --help
usage: claws_usas_tagging.py [-h] [-t TOKENIZER_KEY] [-l LEMMAS_KEY] [-p POS_KEY] [-u USAS_KEY] [-r USAS_RAW_KEY] [-s SENTENCE_BOUNDARIES_KEY] [-c CLAWS_DOCKER_CONTAINER_NAME]
                             [-d USAS_DOCKER_CONTAINER_NAME] [--claws-run-script CLAWS_RUN_SCRIPT] [--usas-run-script USAS_RUN_SCRIPT] [-g {debug,info,error}]
                             jsonl_file_path text_key output_file_path

Given a JSONL file whereby each line is a JSON entry that contains a `text_key` with a value that is text that is to be tagged, this script will tag that text for all JSON entry lines.
The output will be the same JSONL input data but with the addition of the following keys for each JSON entry: `tokens`, `lemmas`, `pos`, `usas`, and `sentence_boundaries`, of which these
key names are configurable. The tokens, Part Of Speech tags, and sentence boundaries will come from the CLAWS tagger and the lemmas and USAS tags will come from the USAS tagger.

positional arguments:
  jsonl_file_path       File path to a JSONL file whereby each line contains a key whereby the value should be tagged.
  text_key              JSON key name whereby the value contains the text to be tagged.
  output_file_path      File path to store the output too in JSONL format.

options:
  -h, --help            show this help message and exit
  -t TOKENIZER_KEY, --tokenizer-key TOKENIZER_KEY
                        The key to store the tokens too. Default: tokens
  -l LEMMAS_KEY, --lemmas-key LEMMAS_KEY
                        The key to store the lemmas too. Default: lemmas
  -p POS_KEY, --pos-key POS_KEY
                        The key to store the POS tags too. Default: pos
  -u USAS_KEY, --usas-key USAS_KEY
                        The key to store the USAS tags too. Default: usas
  -r USAS_RAW_KEY, --usas-raw-key USAS_RAW_KEY
                        The key to store the USAS raw tags too. Default: usas_raw
  -s SENTENCE_BOUNDARIES_KEY, --sentence-boundaries-key SENTENCE_BOUNDARIES_KEY
                        The key to store the sentence boundaries too. Default: sentence_boundaries
  -c CLAWS_DOCKER_CONTAINER_NAME, --claws-docker-container-name CLAWS_DOCKER_CONTAINER_NAME
                        Name of the running docker container that the CLAWS tokeniser and POS tagger is running on. Default: claws:4.0
  -d USAS_DOCKER_CONTAINER_NAME, --usas-docker-container-name USAS_DOCKER_CONTAINER_NAME
                        Name of the running docker container that the USAS semantic tagger is running on. Default: usas:7.0
  --claws-run-script CLAWS_RUN_SCRIPT
                        Absolute path to the CLAWS run script, a script that is a wrapper around the CLAWS binary.
  --usas-run-script USAS_RUN_SCRIPT
                        Absolute path to the USAS run script, a script that is a wrapper around the USAS binary.
  -g {debug,info,error}, --logging-level {debug,info,error}
                        The logging level, debug most verbose, info, or error which is the least verbose. Default is info.

Tagged dataset format

uv run src/mosaico_usas_processing/tagged_data_export.py

Which outputs:

{
  "$defs": {
    "USASTag": {
      "description": "Represents all of the properties associated with a USAS tag.",
      "properties": {
        "tag": {
          "description": "USAS Tag",
          "examples": [
            "A1.1.1"
          ],
          "title": "USAS Tag",
          "type": "string"
        },
        "number_positive_markers": {
          "default": 0,
          "description": "Number of positive markers.",
          "examples": [
            0,
            1,
            2,
            3
          ],
          "title": "Positive Markers",
          "type": "integer"
        },
        "number_negative_markers": {
          "default": 0,
          "description": "Number of negative markers.",
          "examples": [
            0,
            1,
            2,
            3
          ],
          "title": "Negative Markers",
          "type": "integer"
        },
        "rarity_marker_1": {
          "default": false,
          "description": "Rarity marker 1 indicated by %",
          "title": "Rare Marker 1",
          "type": "boolean"
        },
        "rarity_marker_2": {
          "default": false,
          "description": "Rarity marker 2 indicated by @",
          "title": "Rare Marker 2",
          "type": "boolean"
        },
        "female": {
          "default": false,
          "description": "Female",
          "title": "Female",
          "type": "boolean"
        },
        "male": {
          "default": false,
          "description": "Male",
          "title": "Male",
          "type": "boolean"
        },
        "antecedents": {
          "default": false,
          "description": "Potential antecedents of conceptual anaphors (neutral for number)",
          "title": "Antecedents",
          "type": "boolean"
        },
        "neuter": {
          "default": false,
          "description": "Neuter",
          "title": "Neuter",
          "type": "boolean"
        },
        "idiom": {
          "default": false,
          "description": "Is it an idiom",
          "title": "Idiom",
          "type": "boolean"
        }
      },
      "required": [
        "tag"
      ],
      "title": "USASTag",
      "type": "object"
    },
    "USASTagGroup": {
      "description": "Represents a grouping of one or more USAS tags that are associated to a\ntoken.",
      "properties": {
        "tags": {
          "description": "A grouping of one or more USAS tags whereby if more than one exists then the word is an equal member of all semantic tags/categories",
          "examples": [
            [
              {
                "antecedents": false,
                "female": false,
                "idiom": false,
                "male": false,
                "neuter": false,
                "number_negative_markers": 0,
                "number_positive_markers": 0,
                "rarity_marker_1": false,
                "rarity_marker_2": false,
                "tag": "A1.1.1"
              }
            ],
            [
              {
                "antecedents": false,
                "female": false,
                "idiom": false,
                "male": false,
                "neuter": false,
                "number_negative_markers": 1,
                "number_positive_markers": 0,
                "rarity_marker_1": false,
                "rarity_marker_2": false,
                "tag": "E2"
              },
              {
                "antecedents": false,
                "female": false,
                "idiom": false,
                "male": false,
                "neuter": false,
                "number_negative_markers": 0,
                "number_positive_markers": 1,
                "rarity_marker_1": false,
                "rarity_marker_2": false,
                "tag": "S7.1"
              }
            ]
          ],
          "items": {
            "$ref": "#/$defs/USASTag"
          },
          "title": "USAS Tags",
          "type": "array"
        }
      },
      "required": [
        "tags"
      ],
      "title": "USASTagGroup",
      "type": "object"
    }
  },
  "properties": {
    "document_id": {
      "description": "ID that uniquely represents that document within the Mosaico Mongo DB",
      "examples": [
        "3021080"
      ],
      "title": "Document ID",
      "type": "string"
    },
    "wikidata_id": {
      "description": "Wikidata ID, every Wikipedia page should one as it is an unique ID that allows you to access it's global unique URL https://www.wikidata.org/entity/ID",
      "examples": [
        "Q921355"
      ],
      "title": "Wikidata ID",
      "type": "string"
    },
    "title": {
      "description": "Wikipedia page title",
      "examples": [
        "Erik Adolf von Willebrand"
      ],
      "title": "Title",
      "type": "string"
    },
    "text": {
      "description": "The UTF-8 encoded Wikipedia article text",
      "title": "Text",
      "type": "string"
    },
    "ascii_text": {
      "description": "ASCII encoded version of the Wikipedia article text",
      "title": "ASCII Text",
      "type": "string"
    },
    "language": {
      "const": "en",
      "description": "Language of the Wikipedia article",
      "examples": [
        "en"
      ],
      "title": "Language",
      "type": "string"
    },
    "quality": {
      "description": "Quality of the Wikipedia page as determined by the Wikipedia community",
      "enum": [
        "good",
        "featured"
      ],
      "examples": [
        "good",
        "featured"
      ],
      "title": "Quality",
      "type": "string"
    },
    "ores_articletopics": {
      "additionalProperties": {
        "type": "number"
      },
      "description": "High level article topics that are easily searchable and have been predicted by a machine learning model. This will be represented as a dictionary of topic and score whereby the score is between 0-1 where 1 indicates the model is more confident of it's prediction.",
      "examples": [
        {
          "Geography.Regions.Europe.Northern Europe": 0.584
        }
      ],
      "title": "ORES article topics",
      "type": "object"
    },
    "categories": {
      "description": "A noisy list of article topics that are found on the Wikipedia page at the end. To note that the hierarchy of this category system can be found through the SQL database dumps according to this source. The reason these are noisy is that they sometimes contain meta data topics like `CS1 Swedish-language sources (sv)` or `Good articles`.",
      "examples": [
        [
          "CS1 Swedish-language sources (sv)",
          "AC with 0 elements",
          "1870 births",
          "1949 deaths",
          "Academics of the University of Helsinki",
          "Finnish hematologists",
          "Finnish people of German descent",
          "People from Vaasa"
        ]
      ],
      "items": {
        "type": "string"
      },
      "title": "Categories",
      "type": "array"
    },
    "popularity_score": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "description": "As defined in the cirrus schema, 'A floating point number representing the percentage of page views to this wiki that requests this page. This is only available for content pages.' If the popularity score cannot be validated or found it will have a value of `None`.",
      "examples": [
        8.327128616319467e-08,
        null
      ],
      "title": "Popularity Score"
    },
    "timestamp": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "description": "Timestamp of the most recently index/edited version of the page. If the timestamp cannot be found it will have a value of `None`.",
      "examples": [
        "2021-12-26T18:49:13Z",
        null
      ],
      "title": "timestamp"
    },
    "tokens": {
      "items": {
        "type": "string"
      },
      "title": "Tokens",
      "type": "array"
    },
    "lemmas": {
      "items": {
        "type": "string"
      },
      "title": "Lemmas",
      "type": "array"
    },
    "pos": {
      "items": {
        "items": {
          "maxItems": 2,
          "minItems": 2,
          "prefixItems": [
            {
              "type": "string"
            },
            {
              "type": "integer"
            }
          ],
          "type": "array"
        },
        "type": "array"
      },
      "title": "POS",
      "type": "array"
    },
    "usas": {
      "items": {
        "items": {
          "$ref": "#/$defs/USASTagGroup"
        },
        "type": "array"
      },
      "title": "USAS",
      "type": "array"
    },
    "usas_raw": {
      "items": {
        "type": "string"
      },
      "title": "USAS Raw",
      "type": "array"
    },
    "sentence_boundaries": {
      "items": {
        "maxItems": 2,
        "minItems": 2,
        "prefixItems": [
          {
            "type": "integer"
          },
          {
            "type": "integer"
          }
        ],
        "type": "array"
      },
      "title": "Sentence Boundaries",
      "type": "array"
    }
  },
  "required": [
    "document_id",
    "wikidata_id",
    "title",
    "text",
    "ascii_text",
    "language",
    "quality",
    "ores_articletopics",
    "categories",
    "popularity_score",
    "timestamp",
    "tokens",
    "lemmas",
    "pos",
    "usas",
    "usas_raw",
    "sentence_boundaries"
  ],
  "title": "TaggedWikiDataExport",
  "type": "object"
}
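
As an example of consuming this format, here is a minimal sketch that reads the first tagged document and reconstructs its sentences, assuming each sentence_boundaries pair is a (start, end) token index with the end index exclusive:

import json

with open("./data/docker_tagged_wikipedia_export.jsonl", encoding="utf-8") as tagged_file:
    first_document = json.loads(next(tagged_file))

tokens = first_document["tokens"]
for start, end in first_document["sentence_boundaries"]:
    # Assumption: boundaries index into the token list as [start, end).
    print(" ".join(tokens[start:end]))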

Tagging on HEX

HEX, which is the Lancaster University NLP group's compute cluster (which uses Slurm), can be used to tag data using this code base; below is an example of how to do so.

We assume that you already have a dataset to tag that is in the Wikipedia format this script expects (./src/mosaico_usas_processing/claws_usas_tagging.py); an example dataset that can be generated is detailed in the section Wikipedia data export from Mosaico. Transfer the dataset to the HEX cluster using scp:

scp dataset.jsonl login.ucrel-hex.scc.lancs.ac.uk:/PATH/TO/STORE/IT/ON/HEX

Git clone this repository:

git clone git@github.com:UCREL/mosaico-usas-processing.git

As we are working on HEX we need to create a Python virtual environment, not using uv but using the standard library venv. This can be done like so:

python3 -m venv venv
source venv/bin/activate

For it to work on HEX, which currently only has one version of Python, the pyproject.toml file needs to be changed so that the Python version is:

requires-python = ">3.9"

The reason this project had to be set to Python 3.10.* is that the wikiextractor Python package, used for the data extraction of the Wikipedia data, only works with Python version 3.10. As we are only processing the data here we do not need to be fixed to this version of Python, therefore making this change is fine.

Then to install all of the required python packages:

pip install -r requirements.txt

Note if you have issues with mosaico @ git+ssh://git@github.com/SapienzaNLP/mosaico@19c57473b9d77601fdf2b01cd2f8f766939ff52a when installing the Python packages, change this to mosaico @ git+https://github.com/SapienzaNLP/mosaico@19c57473b9d77601fdf2b01cd2f8f766939ff52a and it should then work.

As we are going to perform the tagging on HEX using the CLAWS and USAS binaries, we need to get both the CLAWS and USAS GitHub repositories and build CLAWS. We can do this using the following make command:

make build-claws

The CLAWS and USAS repositories should now be in the folders claws and usas respectively.

As we are going to process all of the English Wikipedia articles we extracted, which was 10,856 articles, and processing a small subset of those (495 articles, 4.55%) took about 14 minutes, it is sensible to batch the articles into 10 chunks of ~1,086 articles each, which should take around 30 minutes per chunk to process. Assuming that the JSONL file containing the Wikipedia articles is at the path ./wikipedia_export.jsonl, we can chunk the articles like so:

mkdir wikipedia_data_chunks
bash data_chunking.sh 10 ./wikipedia_export.jsonl ./wikipedia_data_chunks

Then we should see the following in the ./wikipedia_data_chunks folder:

wc -l ./wikipedia_data_chunks/*
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.0
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.1
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.2
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.3
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.4
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.5
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.6
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.7
     1086 wikipedia_data_chunks/wikipedia_export.jsonl.8
     1082 wikipedia_data_chunks/wikipedia_export.jsonl.9
    10856 total

Each chunk contains 1,086 lines/data points apart from the last chunk, which has a few fewer.
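
If you would rather do the chunking in Python, here is a roughly equivalent sketch (data_chunking.sh remains the canonical way, and we do not claim this matches it byte for byte):

import math
from pathlib import Path

def chunk_jsonl(input_path: str, output_directory: str, number_of_chunks: int = 10) -> None:
    # Split the JSONL file into number_of_chunks files of at most
    # ceil(total_lines / number_of_chunks) lines each, named <file>.0, <file>.1, ...
    lines = Path(input_path).read_text(encoding="utf-8").splitlines(keepends=True)
    chunk_size = math.ceil(len(lines) / number_of_chunks)
    for chunk_index in range(number_of_chunks):
        chunk_lines = lines[chunk_index * chunk_size:(chunk_index + 1) * chunk_size]
        output_file = Path(output_directory) / f"{Path(input_path).name}.{chunk_index}"
        output_file.write_text("".join(chunk_lines), encoding="utf-8")

chunk_jsonl("./wikipedia_export.jsonl", "./wikipedia_data_chunks")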

We can now create the following SBATCH SLURM script, runTagging.sh, which will:

  • Use the 6 hour maximum processing time nodes, of which we have requested 10 through --array (this does not mean we will get all 10 at once).
  • All stdout will be logged to: tagging_log/out/wikipedia_en_tagging_%A_%a.log where A is the job ID and a is the array ID within that job.
  • All stderr will be logged to: tagging_log/error/wikipedia_en_tagging_%A_%a.log where A is the job ID and a is the array ID within that job.
  • We ensure that all environment variables required by the CLAWS and USAS binaries are initialised using source ./env.sh on each processing node.
  • Each node, based on its array ID (0-9), will process one chunk of the Wikipedia data and save the tagged chunk to the folder ./tagged_wikipedia_chunks/.
  • The CLAWS_RUN_SCRIPT and USAS_RUN_SCRIPT environment variables are set through source ./env.sh.

#!/bin/bash
#SBATCH --partition=cpu-6h
#SBATCH --output=tagging_log/out/wikipedia_en_tagging_%A_%a.log
#SBATCH --error=tagging_log/error/wikipedia_en_tagging_%A_%a.log
#SBATCH --array=0-9


source ./env.sh
echo "${SLURM_ARRAY_TASK_ID}: Starting tagging"
python src/mosaico_usas_processing/claws_usas_tagging.py \
        ./wikipedia_data_chunks/wikipedia_export.jsonl.${SLURM_ARRAY_TASK_ID} \
        ascii_text \
        ./tagged_wikipedia_chunks/wikipedia_export.jsonl.${SLURM_ARRAY_TASK_ID} \
        --claws-run-script ${CLAWS_RUN_SCRIPT} \
        --usas-run-script ${USAS_RUN_SCRIPT}
echo "${SLURM_ARRAY_TASK_ID}: Finished tagging"

Before running the SLURM script ensure that the tagged_wikipedia_chunks directory exists and that you are in the Python virtual environment (source venv/bin/activate):

mkdir tagged_wikipedia_chunks

To run the SLURM script:

sbatch runTagging.sh

After tagging we can check whether any articles were not processed by looking at our stdout logs tagging_log/out/wikipedia_en_tagging_* (see also the sketch after the listing below). The data we have tagged should be in the directory ./tagged_wikipedia_chunks as 10 separate files:

wc -l ./tagged_wikipedia_chunks/*
     1083 tagged_wikipedia_chunks/wikipedia_export.jsonl.0
     1080 tagged_wikipedia_chunks/wikipedia_export.jsonl.1
     1079 tagged_wikipedia_chunks/wikipedia_export.jsonl.2
     1081 tagged_wikipedia_chunks/wikipedia_export.jsonl.3
     1083 tagged_wikipedia_chunks/wikipedia_export.jsonl.4
     1075 tagged_wikipedia_chunks/wikipedia_export.jsonl.5
     1077 tagged_wikipedia_chunks/wikipedia_export.jsonl.6
     1075 tagged_wikipedia_chunks/wikipedia_export.jsonl.7
     1078 tagged_wikipedia_chunks/wikipedia_export.jsonl.8
     1068 tagged_wikipedia_chunks/wikipedia_export.jsonl.9
    10779 total
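
Besides the logs, a minimal sketch of finding exactly which articles were not tagged, by comparing wikidata_id values between the input chunks and the tagged chunks:

import json
from pathlib import Path

def wikidata_ids(jsonl_paths) -> set[str]:
    # Collect the wikidata_id of every JSON line across the given JSONL files.
    ids: set[str] = set()
    for path in jsonl_paths:
        with open(path, encoding="utf-8") as jsonl_file:
            for line in jsonl_file:
                ids.add(json.loads(line)["wikidata_id"])
    return ids

input_ids = wikidata_ids(Path("./wikipedia_data_chunks").glob("wikipedia_export.jsonl.*"))
tagged_ids = wikidata_ids(Path("./tagged_wikipedia_chunks").glob("wikipedia_export.jsonl.*"))
print(f"{len(input_ids - tagged_ids)} articles were not tagged")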

These data files can now be found in gzip compressed format at the following HuggingFace Dataset repository: ucrelnlp/English-USAS-Mosaico. Note the naming convention is slightly different, e.g. wikipedia_shard_0.jsonl.gz = wikipedia_export.jsonl.0.
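
A minimal sketch of how the tagged chunks could be renamed and gzip compressed to match that naming convention (the exact commands used to publish the HuggingFace dataset are not part of this repository):

import gzip
import shutil
from pathlib import Path

for chunk_path in Path("./tagged_wikipedia_chunks").glob("wikipedia_export.jsonl.*"):
    shard_index = chunk_path.suffix.lstrip(".")  # "wikipedia_export.jsonl.0" -> "0"
    compressed_path = chunk_path.with_name(f"wikipedia_shard_{shard_index}.jsonl.gz")
    with open(chunk_path, "rb") as uncompressed, gzip.open(compressed_path, "wb") as compressed:
        shutil.copyfileobj(uncompressed, compressed)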

Contact Information
