Standardize PronunciationKind to BCP 47 tags #1303

@Waelwindows

Description

The current definition is too limiting and the Other variant doesn't give enough structure.

I think we should instead use BCP 47 tags with Region, Script, and Variant subtags, plus the transformation extension (-t-) when relevant. When in doubt, consult Unicode and IANA for clarification, specifically the IANA Language Subtag Registry and the Unicode CLDR data.

The following table gives the changes for the current schema

| Old | New |
|---|---|
| IPA | und-fonipa [1] |
| Pinyin | zh-Latn-pinyin |
| Hiragana | ja-Hira |
| Katakana | ja-Kana |
| Romaji | ja-Latn |
| Yale | ko-Latn* [2] |
| Jyutping | yue-jyutping |
| Bopomofo | zh-Bopo or yue-Bopo |
| Hepburn | ja-Latn-hepburn or ja-Latn-alalc97 [3] |

If the language is unknown (unlikely), or the notation is purely typographic, use the und language tag. [1]

Example

Consider this test from the codebase:

def test_multiple_pronunciations():
    xml = """
    <dictionary>
      <entry term="hello">
        <ety>
          <pronunciation kind="ipa" value="həˈləʊ">
            <url src="./hello-british.mp3" />
          </pronunciation>
          <pronunciation kind="ipa" value="hɛˈloʊ">
            <url src="./hello-american.mp3" />
          </pronunciation>
          <sense pos="adj">
            <definition value="A greeting" />
          </sense>
        </ety>
      </entry>
    </dictionary>
    """

Currently, both entries are encoded using the ipa kind, which is ambiguous. With BCP 47 tags, this would become:

  <dictionary>
    <entry term="hello">
      <ety>
-       <pronunciation kind="ipa" value="həˈləʊ">
+       <pronunciation kind="en-GB-fonipa" value="həˈləʊ">
          <url src="./hello-british.mp3" />
        </pronunciation>
-       <pronunciation kind="ipa" value="hɛˈloʊ">
+       <pronunciation kind="en-fonipa" value="hɛˈloʊ">
          <url src="./hello-american.mp3" />
        </pronunciation>
        <sense pos="adj">
          <definition value="A greeting" />
        </sense>
      </ety>
    </entry>
  </dictionary>

This models the difference between the two pronunciations.
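To illustrate why the distinction matters, here is a minimal sketch of how a client might choose between the two tagged pronunciations based on a user's preferred locale. The function name `pick_pronunciation` and the matching rule (longest shared run of leading subtags, ties going to the less specific tag) are assumptions made for illustration; this is not an existing odict API and not a full RFC 4647 matcher.

```python
def pick_pronunciation(kinds, preferred):
    """Pick the kind that best matches `preferred` (hypothetical helper).

    Score = number of leading subtags shared with `preferred`;
    ties are broken in favor of the tag with fewer subtags.
    """
    want = preferred.lower().split("-")

    def score(kind):
        have = kind.lower().split("-")
        shared = 0
        for a, b in zip(want, have):
            if a != b:
                break
            shared += 1
        return (shared, -len(have))

    return max(kinds, key=score)


kinds = ["en-GB-fonipa", "en-fonipa"]
british = pick_pronunciation(kinds, "en-GB")
american = pick_pronunciation(kinds, "en-US")
```

With the old `ipa` kind, both entries would score identically and the client would have no basis for choosing.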

Furthermore, this example

<pronunciation kind="wadegiles" value="Pei-ching">
  <url src="./audio/beijing_wadegiles.mp3" type="audio/mpeg" description="Wade-Giles romanization" />
</pronunciation>

stops being a custom kind: it is simply encoded as zh-Latn-wadegile.
This makes the schema more extensible, supporting more languages and systems without having to touch the codebase.

Rationale

Besides supporting more languages out of the gate, adopting this system would allow odict clients to perform Unicode transforms robustly. Take, for example:

<example value="你好,认识你很高兴。">
<pronunciation kind="pinyin" value="Nǐ hǎo, rènshi nǐ hěn gāoxìng." />
</example>

Suppose that I am a Chinese dictionary user who prefers tone numbers over tone accents.
If the pronunciation were annotated with zh-Latn-pinyin, my client could use the ICU Latin-NumericPinyin transform to convert the pronunciation to numeric form automatically, e.g. Ni3 hao3, ren4shi ni3 hen3 gao1xing4.
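As a rough illustration of what such a transform does, here is a standard-library sketch that moves tone marks to trailing tone numbers. It is a stand-in for the ICU Latin-NumericPinyin transform, not a binding to it: the function name `pinyin_to_numeric` and the syllable-boundary heuristic are assumptions of this sketch, and ICU's output may differ (for instance in how it treats neutral tones).

```python
import unicodedata

# Combining tone marks (as they appear after NFD decomposition) -> tone numbers.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def _is_vowel(s, i):
    return i < len(s) and s[i].lower() in "aeiou"


def pinyin_to_numeric(text):
    """Convert accented pinyin to numbered pinyin (heuristic sketch)."""
    s = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        i += 1
        if ch not in TONE_MARKS:
            out.append(ch)
            continue
        # Copy the rest of the vowel cluster (and any diaeresis on it).
        while _is_vowel(s, i) or (i < len(s) and s[i] == "\u0308"):
            out.append(s[i])
            i += 1
        # Heuristic: a following n/ng/r closes this syllable
        # unless a vowel comes right after it.
        if s.startswith("ng", i) and not _is_vowel(s, i + 2):
            out.append("ng")
            i += 2
        elif i < len(s) and s[i] in "nr" and not _is_vowel(s, i + 1):
            out.append(s[i])
            i += 1
        out.append(TONE_MARKS[ch])
    return unicodedata.normalize("NFC", "".join(out))


numeric = pinyin_to_numeric("Nǐ hǎo, rènshi nǐ hěn gāoxìng.")
```

The point is that a client can only apply a transform like this safely when the kind tells it the value is zh-Latn-pinyin rather than some other Latin-script romanization.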

Furthermore, this system would help with typesetting ruby text by allowing mixed Hiragana and Katakana pronunciations via ja-Hrkt. See japanese-furigana-normalize for more info.

Change

For backwards compatibility, we should alias the previous definitions to the new ones. Furthermore, I think we should warn when we encounter an Other that is not a valid BCP 47 tag (and optionally when we encounter und as the language [1]). If we're willing to make a breaking change, then perhaps the following schema is best:

  pub enum PronunciationKind {
      Bcp47(icu_locale_core::LanguageIdentifier), // or use String if you don't want the dependency
      #[strum(to_string = "{0}")]
      #[serde(untagged)]
      Other(String),
  }
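The backwards-compatible path (alias the old names, warn on anything that is not a tag) could look roughly like the sketch below. It is written in Python for brevity rather than in the codebase's Rust. The names `LEGACY_ALIASES`, `WELL_FORMED`, and `normalize_kind` are hypothetical, and the pattern only checks a simplified form of BCP 47 well-formedness; checking validity would require the IANA registry. Where the table above offers two tags, the one chosen here is an assumption.

```python
import re

# Legacy kind -> BCP 47 tag, following the table in this issue.
LEGACY_ALIASES = {
    "ipa": "und-fonipa",
    "pinyin": "zh-Latn-pinyin",
    "hiragana": "ja-Hira",
    "katakana": "ja-Kana",
    "romaji": "ja-Latn",
    "yale": "ko-Latn-x-yale",      # assumption: private-use form (footnote 2)
    "jyutping": "yue-jyutping",
    "bopomofo": "zh-Bopo",         # assumption: yue-Bopo is equally possible
    "hepburn": "ja-Latn-alalc97",  # assumption: see footnote 3
}

# Simplified shape: language[-script][-region][-variant]*[-extension]*[-x-private].
# Extlang subtags and grandfathered tags are deliberately not handled.
WELL_FORMED = re.compile(
    r"^[a-z]{2,3}"
    r"(?:-[a-z]{4})?"
    r"(?:-(?:[a-z]{2}|[0-9]{3}))?"
    r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*"
    r"(?:-[0-9a-wyz](?:-[a-z0-9]{2,8})+)*"
    r"(?:-x(?:-[a-z0-9]{1,8})+)?$",
    re.IGNORECASE,
)


def normalize_kind(kind):
    """Return (tag, warnings) for a raw `kind` attribute value."""
    warnings = []
    tag = LEGACY_ALIASES.get(kind.lower(), kind)
    if not WELL_FORMED.match(tag):
        warnings.append(f"{kind!r} is not a well-formed BCP 47 tag")
    elif tag.lower().split("-")[0] == "und":
        warnings.append(f"{tag!r} does not specify a language")
    return tag, warnings
```

Aliases are resolved before the shape check because a legacy name such as ipa happens to look like a three-letter language subtag and would otherwise pass unnoticed.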

Update (Oct 1)

I think it's better to get rid of PronunciationKind altogether and use icu_locale_core::LanguageIdentifier (or String) directly. Any data that doesn't fit exactly into a BCP 47 tag could go into a private-use subtag, as with Yale. [2]

Footnotes

  1. It's best to specify the language, as in the example above, since IPA notation changes from language to language.

  2. Korean is in a more peculiar position: ko-Latn-alalc97 (Modified McCune–Reischauer) seems to be the only variant specified in Unicode CLDR. According to the CLDR, the default transform for ko-Latn is the Revised Romanization of Korean (RR) (also indicated by the transform flag -t-m0-mcst or -t-m0-bgn), which suggests that RR should be the default interpretation of ko-Latn. Unfortunately, this means we have to either model Yale ambiguously alongside RR under ko-Latn or use a private-use extension (e.g. ko-Latn-x-yale). Funnily enough, ko-Latn-KP would most likely model the Romanization of Korean system.

  3. You should almost always use ja-Latn-alalc97 (Library of Congress), as that is what's commonly used. If you're sure you're using Traditional Hepburn, use ja-Latn-hepburn. See the Wikipedia page for the differences.
