Lemmatization of nonstandard words #1179

@harisont

The guidelines for typos and other errors in underlying text build upon the correct assumption that most token-level "errors" in standard texts are performance errors (hence the feature Typo=Yes). This is not the case in learner data, where a significant fraction of them have to do with incorrect inflection, derivation and/or orthography, as well as nonstandard lexical choices.

In UD_Swedish-SweLL, inspired by UD_Italian-Valico, we decided to lemmatize according to literal criteria, i.e. based on the characteristics of the observed word form (see also #1178).

Examples from UD_Swedish-SweLL (feel free to add):

  1. [inflection] förslagor is lemmatized as förslaga: it is clear that the learner means förslag ('proposal(s)'), but the plural is built as if the noun belonged to a declension where the base form would be förslaga
  2. [derivation] upplevelsa - an interesting mix of upplevelse, 'experience' (NOUN) and uppleva, 'experience' (VERB) - is lemmatized as upplevelsa
  3. [orthography] villja is lemmatized as villja (rather than vilja, 'want'). This is unlikely to be a typo: the source text is handwritten, the most common form of the verb is the present vill (with two Ls), and (at least to Italian ears; I'm an L2 speaker myself!) villja arguably sounds like it has a long (double) L. We therefore think it is a genuinely interesting error worth preserving
  4. [lexical choice] brottsligheten is lemmatized as brottslighet ('criminality'), even if the normalized version of the sentence changes that to brott ('crime').

Examples 1, 2 and 4 may be more or less controversial, but they do not pose any annotation issues.
Example 3, on the other hand, is an unfortunate case where we end up with a validation error:

[Line 2552 Sent org-176-test]: [L5 Morpho aux-lemma] 'villja' is not an auxiliary in language [sv]

If I recall correctly, @ElisaDiNuovo solved a similar problem in UD_Italian-Valico by using the "corrected lemma" only in the very few cases where the "descriptive" lemma would be incompatible with the UPOS and/or DEPREL at hand. This is a reasonable quick fix (which we also had to implement for UD 2.17, see UniversalDependencies/UD_Swedish-SweLL#1), but I think it creates unnecessary local inconsistencies.

A simple solution could be to introduce a CorrectLemma field in MISC, e.g.

3	villja	villja	AUX	_	Typo=Yes|VerbForm=Inf|Voice=Act	5	aux	_	CorrectionLabels=O|CorrectLemma=vilja

When CorrectLemma is available, the validator could check it instead of the LEMMA column. When it is unavailable, the usual rules would hold.
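
To make the proposal concrete, here is a minimal Python sketch of the lookup I have in mind. It is not the UD validator's actual code: the function names and the one-item auxiliary list are made up for illustration, and the only assumptions taken from the example above are the 10-column CoNLL-U layout and a `CorrectLemma=...` entry in MISC.

```python
# Illustrative only: the real list of approved auxiliaries for [sv] is maintained
# by the UD infrastructure. "vilja" is included here just to mirror the example.
ILLUSTRATIVE_AUX_LEMMAS = {"vilja"}


def parse_misc(misc):
    """Turn a MISC string such as 'A=1|B=2' into a dict ('_' means empty)."""
    if misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def lemma_to_validate(conllu_line):
    """Return MISC CorrectLemma if present, otherwise the LEMMA column."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    lemma, misc = cols[2], parse_misc(cols[9])
    return misc.get("CorrectLemma", lemma)


def aux_lemma_ok(conllu_line, allowed=ILLUSTRATIVE_AUX_LEMMAS):
    """Hypothetical aux-lemma check that prefers CorrectLemma over LEMMA."""
    cols = conllu_line.rstrip("\n").split("\t")
    if cols[3] != "AUX":
        return True  # the check only concerns AUX tokens
    return lemma_to_validate(conllu_line) in allowed


token = ("3\tvillja\tvillja\tAUX\t_\tTypo=Yes|VerbForm=Inf|Voice=Act"
         "\t5\taux\t_\tCorrectionLabels=O|CorrectLemma=vilja")
print(lemma_to_validate(token), aux_lemma_ok(token))
```

With this fallback the LEMMA column can stay descriptive (villja) everywhere, while lemma-based checks run against the corrected form only when the annotator has supplied one.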
