REBL is an extension of the Radboud Entity Linker (REL) for Batch Entity Linking. REBL makes it easier to isolate the GPU heavy operations from the CPU heavy operations, by separating the mention detection stage from the candidate selection and entity disambiguation stages. In order to use REBL, REL needs to be installed as well.
git clone git@github.com:informagi/REBL.git
cd REBL
pip install .
If you have to run on CPU, then you can issue:
pip install --user -f https://download.pytorch.org/whl/torch_stable.html torch==1.10.1+cpu
Note that torch requires a pretty old python (3.8), or this command will fail.
After installing REBL, it is very easy to run Mention Detection:
python -m rebl.md.mention_detection.py \
--in_file /path/to/input.gz \
--out_file /path/to/md.parquet
A file containing the fields tagged will also be created, its name will be created following the --out_file name.
The following command line options are available:
-i IN_FILE, --in_file IN_FILEName of the file to tag-o OUT_FILE, --out_file OUT_FILEOutput file name-t TAGGER, --tagger TAGGERTagger that is used,default: ner-fast-f [FIELDS [FIELDS ...]], --fields [FIELDS [FIELDS ...]]Fields to tagdefault: title headings body-id IDENTIFIER, --identifier IDENTIFIERfield key to identify documentdefault: docid-pb PREDICT_BATCH_SIZE, --predict_batch_size PREDICT_BATCH_SIZEHow many Sentences to tag at oncedefault: 100-wb WRITE_BATCH_SIZE, --write_batch_size WRITE_BATCH_SIZENumber of lines written to outfile per batchdefault: 10000-ft {jsonl,tsv}, --file_type {jsonl,tsv}Define the input file formatdefault: jsonl-c CONT, --cont CONTAppend output file if it already existsdefault: True
Running these steps is also easy: (Note that /path/to/md.parquet is the output file from the previous step, and /path/to/input.gz is the same file as the input for Mention Detection)
python3 -m rebl.ed.entity_disambiguation \
--md_file /path/to/md.parquet \
--fields title headings body \
--source_file /path/to/input.gz \
--out_file ed.parquet \
--base_url /path/to/REL-data \
--wiki_version wiki_2019 \
--identifier docid
The following command line options are available (-f and -ff are mutually exclusive):
-md MD_FILE, --md_file MD_FILEName of the md parquet file to tag-ff FIELDS_FILE, --fields_file FIELDS_FILEName of the fields file to tag-f [FIELDS [FIELDS ...]], --fields [FIELDS [FIELDS ...]]Give the fields directly--s SOURCE_FILE, --source_file SOURCE_FILEName of the source file that contains raw text-o OUT_FILE, --out_file OUT_FILEName of the file to write to-b BASE_URL, --base_url BASE_URLLocation of base_url for REL ED model-w WIKI_VERSION, --wiki_version WIKI_VERSIONWikipedia version to use-id IDENTIFIER, --identifier IDENTIFIERfield key to identify documentwb WRITE_BATCH_SIZE, --write_batch_size WRITE_BATCH_SIZENWrite batch sizedefault: 10000
REBL is published at DESIRES: https://desires.dei.unipd.it/2022/papers/paper-08.pdf
bib:
@inproceedings{rebl,
author = {Chris Kamphuis and Faegheh Hasibi and Jimmy Lin and Arjen P. de Vries},
title = {{REBL: Entity Linking at Scale}},
booktitle = {Proceedings of the 3rd International Conference on Design of Experimental Search \& Information REtrieval Systems},
year = {2022},
url = {https://desires.dei.unipd.it/2022/papers/paper-08.pdf},
address = {Aachen},
publisher = {CEUR-WS.org},
series = {DESIRES '22}
}