Nonstandard word segmentation in learner treebanks #1180

@harisont

The guidelines for typos and other errors in underlying text cover over- and under-segmented words:

  • for over-segmented words, such as "spel ling", they recommend attaching the parts with the goeswith relation
  • for under-segmented words, such as "thespelling", they recommend re-segmenting into e.g. "the" and "spelling" and marking the first token with SpaceAfter=No + CorrectSpaceAfter=Yes.
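
To make the two general-guideline conventions concrete, here is a minimal Python sketch that builds two tiny CoNLL-U fragments and flags both error types. The sentences ("check spel ling", "check thespelling"), the helper names and the choice to leave LEMMA, XPOS, FEATS and DEPS as "_" are my own assumptions for illustration; they are not taken from any of the treebanks mentioned here, and the sketch is not a validator.

```python
def conllu_row(idx, form, upos, head, deprel, misc="_"):
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    # Fields that do not matter for this illustration are left as "_".
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", misc])

# Hypothetical over-segmented input "check spel ling":
# the second part attaches to the first with goeswith.
OVER = "\n".join([
    conllu_row(1, "check", "VERB", 0, "root"),
    conllu_row(2, "spel", "NOUN", 1, "obj"),
    conllu_row(3, "ling", "X", 2, "goeswith"),
])

# Hypothetical under-segmented input "check thespelling":
# re-segmented, with the missing space recorded in MISC.
UNDER = "\n".join([
    conllu_row(1, "check", "VERB", 0, "root"),
    conllu_row(2, "the", "DET", 3, "det", "SpaceAfter=No|CorrectSpaceAfter=Yes"),
    conllu_row(3, "spelling", "NOUN", 1, "obj"),
])

def parse_conllu(block):
    """Parse token lines of a CoNLL-U block into dicts (comments and blanks skipped)."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        misc = {}
        if cols[9] != "_":
            misc = dict(kv.split("=", 1) for kv in cols[9].split("|"))
        rows.append({"id": cols[0], "form": cols[1], "upos": cols[3],
                     "head": cols[6], "deprel": cols[7], "misc": misc})
    return rows

def segmentation_errors(rows):
    """Return (kind, form) pairs for tokens marked as over- or under-segmented."""
    found = []
    for row in rows:
        if row["deprel"] == "goeswith":
            found.append(("over", row["form"]))
        if row["misc"].get("CorrectSpaceAfter") == "Yes":
            found.append(("under", row["form"]))
    return found
```

Note that a query like this only finds errors annotated according to the general guidelines: a treebank that uses compound for over-segmentation, or that leaves fused tokens unsegmented, would need a different search.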

This makes perfect sense for "standard" corpora, but I would argue that, when it comes to learner treebanks, the annotation could be more expressive:

  • at least in some Germanic languages, some cases of over-segmentation are not random typos but compounding errors. In UD_Swedish-SweLL we have several clear examples of this, and we chose to annotate them with compound (even though the language-specific guidelines say otherwise). This is in partial analogy with UD_English-ESL.
  • as for under-segmentation, only UD_Italian-VALICO currently follows the general guidelines. UD_Korean-KSL and UD_Swedish-SweLL leave the segmentation unaltered and determine the DEPREL and POS of the resulting fused token based on, respectively, its final morpheme (@ksung please correct me if I'm wrong!) and what its head would be if the segmentation were correct. We who are behind UD_Swedish-SweLL argue that, in this way, the annotation is more descriptive (or literal, as some say) and the errors are more visible in the dependency trees.

None of these decisions causes validation issues at the time of writing, but it would be good if annotators of learner materials could agree on one of them or decide on an alternative uniform solution.
