Open
Description
The FlairTagger (and possibly CRFTagger) ignores empty documents, so the number of output documents does not match the number of input documents.
We should either support empty documents, or raise a warning stating that empty strings should not be passed.
Reproducible example
from pprint import pprint

from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

documents = [
    Document(name="doc_01", text=""),
    Document(name="doc_02", text="Stukje tekst met de naam Jan Jansen."),
    Document(name="doc_03", text=""),
]

tokenizer = TokenizerFactory().tokenizer(corpus="ons", disable=("tagger", "ner"))
tagger = FlairTagger(
    model="model_bilstmcrf_ons_fast-v0.2.0", tokenizer=tokenizer, verbose=False
)

annotated_docs = tagger.annotate(documents)

print(f"len(documents) = {len(documents)}")
print(f"len(annotated_docs) = {len(annotated_docs)}")
pprint(annotated_docs)
Actual:
len(documents) = 3
len(annotated_docs) = 1
[Document(name=doc_02). Chars: 36, Annotations: 1]
Expected:
len(documents) = 3
len(annotated_docs) = 3
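Possible workaround
Until the tagger handles this itself, a caller-side wrapper along the following lines might keep inputs and outputs aligned. This is only a sketch, not the library's fix: `annotate_keep_empty` is a hypothetical helper, it assumes each document exposes its content as a `.text` attribute, and it assumes whitespace-only documents should be treated like empty ones. Empty documents are returned unchanged at their original positions, and a warning reports how many were skipped.

```python
import warnings
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")


def annotate_keep_empty(
    documents: Sequence[D],
    annotate: Callable[[List[D]], List[D]],
    get_text: Callable[[D], str] = lambda doc: doc.text,  # assumes a `.text` attribute
) -> List[D]:
    """Hypothetical helper: annotate non-empty documents, return one output per input."""
    # Positions of documents that contain something other than whitespace.
    # Treating whitespace-only text as empty is an assumption of this sketch.
    non_empty_idx = [i for i, doc in enumerate(documents) if get_text(doc).strip()]

    skipped = len(documents) - len(non_empty_idx)
    if skipped:
        warnings.warn(f"{skipped} empty document(s) were passed through without annotation")

    # Only the non-empty documents are handed to the tagger.
    annotated = annotate([documents[i] for i in non_empty_idx])
    if len(annotated) != len(non_empty_idx):
        raise ValueError("tagger returned an unexpected number of documents")

    # Start from the inputs, then overwrite the non-empty slots with the tagged results.
    result = list(documents)
    for i, doc in zip(non_empty_idx, annotated):
        result[i] = doc
    return result
```

With the reproducible example above, this would be called as `annotated_docs = annotate_keep_empty(documents, tagger.annotate)`, which should give `len(annotated_docs) == len(documents)`.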