Description
The current definition is too limiting, and the Other variant doesn't give enough structure.
I think we should instead use BCP 47 tags with Region, Script, and Variant subtags, plus the transformation extension `t` where relevant. When in doubt, consult Unicode and IANA for clarification — specifically, the IANA language subtag registry and the Unicode CLDR data.
The following table gives the changes for the current schema:

| Old | New |
|---|---|
| IPA | `und-fonipa`[^1] |
| Pinyin | `zh-Latn-pinyin` |
| Hiragana | `ja-Hira` |
| Katakana | `ja-Kana` |
| Romaji | `ja-Latn` |
| Yale | `ko-Latn`[^2] |
| Jyutping | `yue-jyutping` |
| Bopomofo | `zh-Bopo` or `yue-Bopo` |
| Hepburn | `ja-Latn-hepburn` or `ja-Latn-alalc97`[^3] |
In case the language is unknown (unlikely), or the notation is purely typographic, use the `und` language tag.[^1]
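For backwards compatibility, the table above could be implemented as a plain alias map. A minimal sketch (the names `LEGACY_TO_BCP47` and `normalize_kind` are illustrative, not odict API; Bopomofo and Hepburn default to their first-listed tag here):

```python
# Hypothetical alias table from legacy PronunciationKind values to BCP 47 tags.
LEGACY_TO_BCP47 = {
    "ipa": "und-fonipa",
    "pinyin": "zh-Latn-pinyin",
    "hiragana": "ja-Hira",
    "katakana": "ja-Kana",
    "romaji": "ja-Latn",
    "yale": "ko-Latn",
    "jyutping": "yue-jyutping",
    "bopomofo": "zh-Bopo",
    "hepburn": "ja-Latn-alalc97",
}

def normalize_kind(kind: str) -> str:
    """Map a legacy kind to its BCP 47 tag; pass anything else through."""
    return LEGACY_TO_BCP47.get(kind.lower(), kind)

print(normalize_kind("IPA"))           # legacy kind -> und-fonipa
print(normalize_kind("en-GB-fonipa"))  # already a tag -> passthrough
```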
Example
For example, take this test from the codebase:
odict/python/tests/test_pronunciation.py
Lines 118 to 135 in 094dee1
```python
def test_multiple_pronunciations():
    xml = """
    <dictionary>
      <entry term="hello">
        <ety>
          <pronunciation kind="ipa" value="həˈləʊ">
            <url src="./hello-british.mp3" />
          </pronunciation>
          <pronunciation kind="ipa" value="hɛˈloʊ">
            <url src="./hello-american.mp3" />
          </pronunciation>
          <sense pos="adj">
            <definition value="A greeting" />
          </sense>
        </ety>
      </entry>
    </dictionary>
    """
```
Currently, both entries are encoded with the `ipa` kind, which is ambiguous. With BCP 47 tags, this would become:
```diff
 <dictionary>
   <entry term="hello">
     <ety>
-      <pronunciation kind="ipa" value="həˈləʊ">
+      <pronunciation kind="en-GB-fonipa" value="həˈləʊ">
         <url src="./hello-british.mp3" />
       </pronunciation>
-      <pronunciation kind="ipa" value="hɛˈloʊ">
+      <pronunciation kind="en-fonipa" value="hɛˈloʊ">
         <url src="./hello-american.mp3" />
       </pronunciation>
       <sense pos="adj">
         <definition value="A greeting" />
       </sense>
     </ety>
   </entry>
 </dictionary>
```

This models the difference between the two pronunciations.
Furthermore, this example
odict/examples/pronunciation_example.xml
Lines 78 to 80 in 094dee1
```xml
<pronunciation kind="wadegiles" value="Pei-ching">
  <url src="./audio/beijing_wadegiles.mp3" type="audio/mpeg" description="Wade-Giles romanization" />
</pronunciation>
```
stops being custom: it's simply encoded as `zh-Latn-wadegile`!
This automatically makes the schema more extensible, supporting more languages and systems without having to touch the codebase.
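Since tags are now open-ended, a client may want a cheap sanity check that a kind at least *looks like* a BCP 47 tag. A minimal shape-only sketch (it checks syntax, not the IANA registry; `TAG_RE` and `looks_like_bcp47` are illustrative names):

```python
import re

# Simplified well-formedness check for BCP 47 language tags:
# language[-script][-region](-variant)*(-extension)*[-x-private].
# Shape only -- it does not validate subtags against the IANA registry.
TAG_RE = re.compile(
    r"^[a-z]{2,8}"                                    # primary language, incl. 'und'
    r"(?:-[a-z]{4})?"                                 # script, e.g. Latn, Hira
    r"(?:-(?:[a-z]{2}|\d{3}))?"                       # region, e.g. GB, 419
    r"(?:-(?:[a-z0-9]{5,8}|\d[a-z0-9]{3}))*"          # variants, e.g. fonipa, alalc97
    r"(?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*"           # extensions, e.g. t-...
    r"(?:-x(?:-[a-z0-9]{1,8})+)?$",                   # private use, e.g. x-yale
    re.IGNORECASE,
)

def looks_like_bcp47(tag: str) -> bool:
    return TAG_RE.match(tag) is not None

print(looks_like_bcp47("zh-Latn-wadegile"))   # well-formed
print(looks_like_bcp47("not a tag!"))          # rejected
```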
Rationale
Beyond supporting more languages out of the gate, adopting this system would let odict clients perform Unicode transformations robustly. Take, for example:
odict/examples/pronunciation_example.xml
Lines 44 to 46 in 094dee1
```xml
<example value="你好,认识你很高兴。">
  <pronunciation kind="pinyin" value="Nǐ hǎo, rènshi nǐ hěn gāoxìng." />
</example>
```
Suppose that I am a Chinese dictionary user who prefers to use numbers for tones instead of accents.
If the pronunciation were annotated with `zh-Latn-pinyin`, my client could use the ICU `Latin-NumericPinyin` transform to automatically convert the pronunciation to numeric form, i.e. `Ni3 hao3, ren4shi ni3 hen3 gao1xing4.`
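To make the idea concrete, here is a heuristic sketch of such a transform in plain Python (real clients should use ICU; this simplified version detects syllable ends from coda consonants and can misplace digits around ambiguous n/g boundaries):

```python
import unicodedata

# Combining marks carry the tone in NFD-decomposed Pinyin.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}
VOWELS = set("aeiou")

def pinyin_to_numeric(text: str) -> str:
    """Convert tone-marked Pinyin to numeric tones, digit at syllable end."""
    out = []
    tone = ""  # tone digit waiting to be emitted at the end of the syllable
    for ch in unicodedata.normalize("NFD", text):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
            continue
        low = ch.lower()
        # A syllable ends once we leave vowels and coda letters (n, g, r);
        # \u0308 is the combining diaeresis of ü and stays with its vowel.
        if tone and low not in VOWELS and low not in "ngr\u0308":
            out.append(tone)
            tone = ""
        out.append(ch)
    if tone:
        out.append(tone)
    return unicodedata.normalize("NFC", "".join(out))

print(pinyin_to_numeric("Nǐ hǎo, rènshi nǐ hěn gāoxìng."))
# → Ni3 hao3, ren4shi ni3 hen3 gao1xing4.
```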
Furthermore, this system will help facilitate typesetting ruby text by allowing mixed Hiragana/Katakana pronunciations via `ja-Hrkt`. See japanese-furigana-normalize for more info.
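For instance, a client normalizing a mixed `ja-Hrkt` reading might fold Katakana into Hiragana using the fixed code-point offset between the two blocks — a minimal sketch (the function name is illustrative):

```python
def kata_to_hira(text: str) -> str:
    """Fold Katakana into Hiragana for normalizing a ja-Hrkt reading."""
    # Katakana ァ..ヶ (U+30A1-U+30F6) sits exactly 0x60 above
    # Hiragana ぁ..ゖ (U+3041-U+3096); other characters pass through.
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c
        for c in text
    )

print(kata_to_hira("カタカナ"))  # → かたかな
```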
Change
For backwards compatibility, we should alias the previous definitions to the new tags. Furthermore, I think we should warn when we encounter an Other that is not a valid BCP 47 tag (and optionally when we encounter `und` for the language[^1]). If we're willing to make a breaking change, then perhaps the following schema is best:
```rust
pub enum PronunciationKind {
    Bcp47(icu_locale_core::LanguageIdentifier), // or use String if you don't want the dependency
    #[strum(to_string = "{0}")]
    #[serde(untagged)]
    Other(String),
}
```

Update (Oct 1)
I think it's better to just get rid of `PronunciationKind` and use `icu_locale_core::LanguageIdentifier` (or `String`) directly. Any data that doesn't fit exactly into a BCP 47 tag can be stuffed into a private use subtag, just like the handling of Yale.[^2]
Footnotes

[^1]: It's best to specify the language, as in the example above, since IPA notation changes from language to language.

[^2]: Korean is in a more peculiar place: it seems that `ko-Latn-alalc97` (Modified McCune–Reischauer) is the only variant specified in Unicode CLDR. According to the CLDR, the default transform for `ko-Latn` is the Revised Romanization of Korean (RR) (also indicated by the transform flags `-t-m0-mcst` or `-t-m0-bgn`), which suggests that should be the default interpretation for `ko-Latn`. Unfortunately, this means we have to model Yale ambiguously alongside RR under `ko-Latn`, or use a private use subtag (e.g. `ko-Latn-x-yale`). Funnily enough, `ko-Latn-KP` would most likely model the Romanization of Korea system.

[^3]: You should almost always use `ja-Latn-alalc97` (Library of Congress) as that is what's commonly used. If you're sure you're using Traditional Hepburn, use `ja-Latn-hepburn`. See the Wikipedia page for the differences.