Handle empty documents in FlairTagger #49

@jantrienes

Description

The FlairTagger (and possibly the CRFTagger) silently drops empty documents, so the list returned by `annotate` is shorter than the list of input documents.

We should either allow empty documents, or raise a warning stating that empty strings should not be passed.

Reproducible example

from pprint import pprint

from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

documents = [
    Document(name="doc_01", text=""),
    Document(name="doc_02", text="Stukje tekst met de naam Jan Jansen."),
    Document(name="doc_03", text=""),
]


tokenizer = TokenizerFactory().tokenizer(corpus="ons", disable=("tagger", "ner"))
tagger = FlairTagger(
    model="model_bilstmcrf_ons_fast-v0.2.0", tokenizer=tokenizer, verbose=False
)

annotated_docs = tagger.annotate(documents)
print(f"len(documents) = {len(documents)}")
print(f"len(annotated_docs) = {len(annotated_docs)}")

pprint(annotated_docs)

Actual:

len(documents) = 3
len(annotated_docs) = 1
[Document(name=doc_02). Chars: 36, Annotations: 1]

Expected:

len(documents) = 3
len(annotated_docs) = 3
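
Until the tagger handles this itself, one possible caller-side workaround is to filter out empty documents before tagging and put them back afterwards, so that output and input line up again. The sketch below is not part of deidentify: `annotate_keep_empty` and the `Doc` stand-in are hypothetical names. It assumes that `annotate` returns the non-empty documents in input order, and it treats whitespace-only text as empty.

```python
import warnings
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for deidentify.base.Document (name + text)."""
    name: str
    text: str
    annotations: list = field(default_factory=list)


def annotate_keep_empty(annotate: Callable[[List], List], documents: List) -> List:
    """Run `annotate` on non-empty documents only; keep output aligned with input."""
    # Positions of documents that actually contain text.
    # Assumption: whitespace-only text counts as empty.
    keep = [i for i, doc in enumerate(documents) if doc.text.strip()]

    skipped = len(documents) - len(keep)
    if skipped:
        warnings.warn(f"{skipped} empty document(s) passed through untagged")

    annotated = annotate([documents[i] for i in keep]) if keep else []
    if len(annotated) != len(keep):
        raise ValueError("annotate() changed the number of non-empty documents")

    # Start from the inputs, then overwrite the tagged positions.
    result = list(documents)
    for i, doc in zip(keep, annotated):
        result[i] = doc
    return result
```

With the objects from the reproducible example, the call would be `annotate_keep_empty(tagger.annotate, documents)`. Empty documents come back unchanged, and a warning reports how many were skipped.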
