Open
Labels
a:standard needed, b:UPOS, b:dependencies, b:lemmatization
Description
The guidelines for typos and other errors in underlying text cover over- and under-segmented words:
- for over-segmented words, such as "spel ling", they recommend using goeswith
- under-segmented words, such as "thespelling", are re-segmented into e.g. "the" and "spelling" and marked with SpaceAfter=No + CorrectSpaceAfter=Yes.
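To make the two general-guideline cases concrete, here is a minimal Python sketch. The sentences are invented for illustration, and the LEMMA/UPOS values given to the fragments ("spel", "ling") are my assumption; only the goeswith relation and the SpaceAfter=No + CorrectSpaceAfter=Yes attributes are taken from the guidelines mentioned above. The helper names (`parse_rows`, `misc_attrs`, `segmentation_marks`) are hypothetical and not part of any UD tool.

```python
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
# Real CoNLL-U is tab-separated; whitespace is used here only for readability.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats",
          "head", "deprel", "deps", "misc"]

# Over-segmented "spel ling": the second fragment attaches with goeswith.
OVER_SEGMENTED = """
1 Check check    VERB _ _ 0 root     _ _
2 the   the      DET  _ _ 3 det      _ _
3 spel  spelling NOUN _ _ 1 obj      _ _
4 ling  _        X    _ _ 3 goeswith _ _
"""

# Under-segmented "thespelling": re-segmented, with the missing space recorded in MISC.
UNDER_SEGMENTED = """
1 Check    check    VERB _ _ 0 root _ _
2 the      the      DET  _ _ 3 det  _ SpaceAfter=No|CorrectSpaceAfter=Yes
3 spelling spelling NOUN _ _ 1 obj  _ _
"""


def parse_rows(block):
    """Turn a block of token lines into a list of dicts keyed by FIELDS."""
    rows = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        rows.append(dict(zip(FIELDS, line.split())))
    return rows


def misc_attrs(row):
    """Parse the MISC column ("A=1|B=2") into a dict; "_" means empty."""
    if row["misc"] == "_":
        return {}
    return {key: value
            for key, _, value in (item.partition("=")
                                  for item in row["misc"].split("|"))}


def segmentation_marks(block):
    """List (error type, form) pairs for tokens flagged as segmentation errors."""
    marks = []
    for row in parse_rows(block):
        if row["deprel"].split(":")[0] == "goeswith":
            marks.append(("over-segmented", row["form"]))
        if misc_attrs(row).get("CorrectSpaceAfter") == "Yes":
            marks.append(("under-segmented", row["form"]))
    return marks


print(segmentation_marks(OVER_SEGMENTED))   # [('over-segmented', 'ling')]
print(segmentation_marks(UNDER_SEGMENTED))  # [('under-segmented', 'the')]
```

Under this scheme both error types can be found mechanically, but the under-segmentation error is visible only in MISC, not in the tree itself.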
This makes perfect sense for "standard" corpora, but I would argue that, when it comes to learner treebanks, the annotation could be more expressive:
- at least in some Germanic languages, some cases of over-segmentation are not random typos but compounding errors. In UD_Swedish-SweLL we have several clear examples of this, and we chose to annotate them with compound (even if the language-specific guidelines say otherwise). This is in partial analogy with UD_English-ESL
- as for under-segmentation, only UD_Italian-VALICO currently follows the general guidelines. UD_Korean-KSL and UD_Swedish-SweLL do not alter the segmentation and determine the DEPREL and POS for the resulting fused token based on, respectively, the final morpheme (@ksung please correct me if I'm wrong!) and what its head would be if the segmentation were correct. We who are behind UD_Swedish-SweLL argue that, in this way, the annotation is more descriptive (or literal, as some say) and the errors more visible in the dependency trees.
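To illustrate how the two under-segmentation strategies relate, the sketch below takes a sentence annotated per the general guidelines and derives a fused-token analysis in which the fused token inherits UPOS, HEAD and DEPREL from whichever half governs the other. This is only my reading of "what its head would be if the segmentation were correct"; it is not how any of the treebanks above were actually produced. The function name `fuse_under_segmented`, the dict-based row format and the English example are all hypothetical.

```python
def fuse_under_segmented(rows):
    """Merge each token marked CorrectSpaceAfter=Yes with the token after it.

    rows: list of dicts with int "id" and "head", str "form", "upos",
    "deprel", and a dict "misc". Returns new rows with renumbered ids/heads.
    """
    fused = []
    old_to_new = {0: 0}  # the root's HEAD value 0 never changes
    i = 0
    while i < len(rows):
        row = rows[i]
        new_id = len(fused) + 1
        old_to_new[row["id"]] = new_id
        if row["misc"].get("CorrectSpaceAfter") == "Yes" and i + 1 < len(rows):
            nxt = rows[i + 1]
            old_to_new[nxt["id"]] = new_id
            # Assumption: the half that governs the other supplies the analysis;
            # if neither governs the other, fall back to the right-hand half.
            governor = row if nxt["head"] == row["id"] else nxt
            fused.append({"form": row["form"] + nxt["form"],
                          "upos": governor["upos"],
                          "head": governor["head"],
                          "deprel": governor["deprel"],
                          "misc": dict(nxt["misc"])})
            i += 2
        else:
            fused.append({"form": row["form"], "upos": row["upos"],
                          "head": row["head"], "deprel": row["deprel"],
                          "misc": dict(row["misc"])})
            i += 1
    # Renumber ids and remap heads onto the new numbering.
    for new_id, row in enumerate(fused, start=1):
        row["id"] = new_id
        row["head"] = old_to_new[row["head"]]
    return fused


# Invented example: "Check thespelling now", re-segmented per the general guidelines.
GUIDELINE_STYLE = [
    {"id": 1, "form": "Check", "upos": "VERB", "head": 0, "deprel": "root", "misc": {}},
    {"id": 2, "form": "the", "upos": "DET", "head": 3, "deprel": "det",
     "misc": {"SpaceAfter": "No", "CorrectSpaceAfter": "Yes"}},
    {"id": 3, "form": "spelling", "upos": "NOUN", "head": 1, "deprel": "obj", "misc": {}},
    {"id": 4, "form": "now", "upos": "ADV", "head": 1, "deprel": "advmod", "misc": {}},
]

for row in fuse_under_segmented(GUIDELINE_STYLE):
    print(row["id"], row["form"], row["upos"], row["head"], row["deprel"])
```

In this example the fused token "thespelling" ends up as a NOUN attached as obj to "Check", so the error stays visible as a single surface token in the tree rather than only in MISC. The right-hand fallback happens to coincide here with a final-morpheme rule, but that should not be taken as a description of UD_Korean-KSL.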
None of these decisions cause validation issues at the time of writing, but it would be good if annotators of learner materials could agree on this or decide on an alternative uniform solution.