Nonstandard word segmentation in learner treebanks #1180

The [guidelines for typos and other errors in underlying text](https://universaldependencies.org/u/overview/typos.html) cover over- and under-segmented words:

- over-segmented words, such as _spel ling_, keep their segmentation and the parts are connected with `goeswith`
- under-segmented words, such as _thespelling_, are re-segmented into e.g. _the_ and _spelling_ and marked with `SpaceAfter=No` + `CorrectSpaceAfter=Yes`.
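
To make the two recommendations concrete, here is a minimal Python sketch using the two toy examples above. The rows are a simplified subset of the CoNLL-U columns, and everything apart from `goeswith`, `SpaceAfter=No` and `CorrectSpaceAfter=Yes` (the surrounding tags, heads and the helper names `parse_misc` and `segmentation_errors`) is my own assumption for illustration, not taken from any treebank.

```python
# Toy rows: (ID, FORM, UPOS, HEAD, DEPREL, MISC), a simplified subset of CoNLL-U.
# UPOS values and heads are illustrative assumptions, not real treebank data.

# Learner wrote "spel ling": segmentation is kept, parts are linked with goeswith.
OVER_SEGMENTED = [
    (1, "spel", "NOUN", 0, "root", "_"),
    (2, "ling", "X", 1, "goeswith", "_"),
]

# Learner wrote "thespelling": the token is split and the missing space is
# recorded in MISC on the first part.
UNDER_SEGMENTED = [
    (1, "the", "DET", 2, "det", "SpaceAfter=No|CorrectSpaceAfter=Yes"),
    (2, "spelling", "NOUN", 0, "root", "_"),
]


def parse_misc(misc):
    """Turn a MISC string such as 'A=1|B=2' into a dict ('_' means empty)."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))


def segmentation_errors(rows):
    """Return (kind, first_id, second_id) for each marked segmentation error."""
    found = []
    for tok_id, _form, _upos, head, deprel, misc in rows:
        feats = parse_misc(misc)
        if deprel == "goeswith":
            # The goeswith dependent attaches to the part it belongs with.
            found.append(("over-segmented", head, tok_id))
        if feats.get("SpaceAfter") == "No" and feats.get("CorrectSpaceAfter") == "Yes":
            # The space that should follow this token is missing in the source.
            found.append(("under-segmented", tok_id, tok_id + 1))
    return found


print(segmentation_errors(OVER_SEGMENTED))
print(segmentation_errors(UNDER_SEGMENTED))
```

With this encoding, both error types can be recovered from the annotation alone, which is the property any alternative scheme for learner treebanks would presumably also want to keep.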

This makes perfect sense for "standard" corpora, but I would argue that, when it comes to learner treebanks, the annotation could be more expressive:

- at least in some Germanic languages, some cases of over-segmentation are not random typos but compounding errors. In UD_Swedish-SweLL we have [several clear examples of this](https://universal.grew.fr/?custom=6920ada2c8509) and we chose to annotate them with `compound` (even though the [language-specific guidelines](https://universaldependencies.org/sv/dep/compound.html) say otherwise). This is partially analogous to what UD_English-ESL does.
- as for under-segmentation, only UD_Italian-VALICO currently follows the general guidelines. UD_Korean-KSL and UD_Swedish-SweLL do not alter the segmentation. Instead, they determine the DEPREL and POS of the resulting fused token based on, respectively, the final morpheme (@ksung please correct me if I'm wrong!) and what its head would be if the segmentation were correct. We who are behind UD_Swedish-SweLL argue that, in this way, the annotation is more descriptive (or _literal_, as some say) and the errors are more visible in the dependency trees.
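
As a rough illustration of the head-based option for fused tokens, the sketch below takes the analysis the parts would get under correct segmentation and gives the fused token the POS and DEPREL of the part that would head the others. This is only my reading of the rule as described in the bullet above, written against the toy _thespelling_ example; the function name `fused_token_analysis` and the dict layout are hypothetical, and the final-morpheme option is not covered.

```python
def fused_token_analysis(parts):
    """Derive one analysis for an unsplit (fused) token.

    `parts` describes the token as if it had been segmented correctly:
    a list of dicts with keys id, form, upos, head, deprel.
    The fused token inherits UPOS, HEAD and DEPREL from the single part
    whose head lies outside the fused span (the would-be head of the span).
    """
    ids = {p["id"] for p in parts}
    external = [p for p in parts if p["head"] not in ids]
    if len(external) != 1:
        # Assumption: the parts form one subtree; otherwise the rule is undefined here.
        raise ValueError("expected exactly one part attached outside the span")
    top = external[0]
    return {
        "form": "".join(p["form"] for p in parts),
        "upos": top["upos"],
        "head": top["head"],
        "deprel": top["deprel"],
    }


# Toy example (not real treebank data): "the spelling" written as "thespelling".
correct_segmentation = [
    {"id": 1, "form": "the", "upos": "DET", "head": 2, "deprel": "det"},
    {"id": 2, "form": "spelling", "upos": "NOUN", "head": 0, "deprel": "root"},
]

print(fused_token_analysis(correct_segmentation))
```

Here the fused token _thespelling_ comes out as a `NOUN` with the relation its would-be head _spelling_ would have had, so the tree keeps the learner's original tokenization.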

None of these decisions causes validation issues at the time of writing, but it would be good if annotators of learner materials could agree on one of these approaches or decide on an alternative uniform solution.
