Standardize `PronunciationKind` to BCP 47 tags

The current definition is too limiting and the `Other` variant doesn't give enough structure. 

I think we should instead use [BCP 47 tags](https://en.wikipedia.org/wiki/IETF_language_tag) with Region, Script, and Variant subtags, plus the transformation extension `t`, specified when relevant. When in doubt, consult Unicode and IANA, specifically the [IANA language subtag registry](https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry) and the [Unicode CLDR data](https://github.com/unicode-org/cldr/tree/main).

The following table shows how the variants in the current schema would map to tags:

| Old | New |
|----|------|
| IPA      | `und-fonipa`[^1] |
| Pinyin   | `zh-Latn-pinyin` |
| Hiragana | `ja-Hira` |
| Katakana | `ja-Kana` |
| Romaji   | `ja-Latn` |
| Yale     | `ko-Latn`*[^2] |
| Jyutping | `yue-jyutping` |
| Bopomofo | `zh-Bopo` or `yue-Bopo` |
| Hepburn  | `ja-Latn-hepburn` or `ja-Latn-alalc97`[^3] |

If the language is unknown (unlikely), or the notation is purely typographic, use the `und` language tag.[^1]
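
The table above could be expressed as a simple lookup. The following is only a sketch: the function name `legacy_kind_to_tag` is hypothetical, the lowercase legacy names are assumed from the `kind="ipa"` and `kind="pinyin"` attributes in the examples, and where the table offers a choice (Yale, Bopomofo, Hepburn) the pick is an assumption noted in the comments.

```rust
/// Maps a legacy kind name to the BCP 47 tag proposed in the table.
/// Returns `None` for unknown names so the caller can decide what to do
/// (for example, treat the input as a tag already, or as a custom value).
pub fn legacy_kind_to_tag(kind: &str) -> Option<&'static str> {
    match kind.to_ascii_lowercase().as_str() {
        "ipa" => Some("und-fonipa"),
        "pinyin" => Some("zh-Latn-pinyin"),
        "hiragana" => Some("ja-Hira"),
        "katakana" => Some("ja-Kana"),
        "romaji" => Some("ja-Latn"),
        // Assumption: use the private-use form from footnote 2 rather than
        // the ambiguous plain `ko-Latn`.
        "yale" => Some("ko-Latn-x-yale"),
        "jyutping" => Some("yue-jyutping"),
        // Assumption: Mandarin; Cantonese data would need `yue-Bopo`.
        "bopomofo" => Some("zh-Bopo"),
        // Assumption: follow footnote 3 and default to the ALA-LC variant.
        "hepburn" => Some("ja-Latn-alalc97"),
        _ => None,
    }
}
```

A caller would try this lookup first and fall back to parsing the value as a tag when it returns `None`.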

# Example
Consider this test from the codebase:

https://github.com/TheOpenDictionary/odict/blob/094dee15cdf2e3673652a6c6b4df355191d520bd/python/tests/test_pronunciation.py#L118-L135

Currently, both pronunciations are encoded with the `ipa` kind, which is ambiguous. With BCP 47 tags, this would become:
```diff
<dictionary>
  <entry term="hello">
    <ety>
-     <pronunciation kind="ipa" value="həˈləʊ">
+     <pronunciation kind="en-GB-fonipa" value="həˈləʊ">
        <url src="./hello-british.mp3" />
      </pronunciation>
-     <pronunciation kind="ipa" value="hɛˈloʊ">
+     <pronunciation kind="en-fonipa" value="hɛˈloʊ">
        <url src="./hello-american.mp3" />
      </pronunciation>
      <sense pos="adj">
        <definition value="A greeting" />
      </sense>
    </ety>
  </entry>
</dictionary>
```
This models the difference between the two pronunciations.

Furthermore, this example
https://github.com/TheOpenDictionary/odict/blob/094dee15cdf2e3673652a6c6b4df355191d520bd/examples/pronunciation_example.xml#L78-L80

no longer needs a custom kind; it is simply encoded as `zh-Latn-wadegile`!
This automatically makes the schema more extensible, supporting more languages and systems without having to touch the codebase.

# Rationale

Besides supporting more languages out of the gate, adopting this system would allow odict clients to perform Unicode transforms robustly. Take, for example:
https://github.com/TheOpenDictionary/odict/blob/094dee15cdf2e3673652a6c6b4df355191d520bd/examples/pronunciation_example.xml#L44-L46

Suppose that I am a Chinese dictionary user who prefers tone numbers to tone marks.
If the pronunciation is annotated with `zh-Latn-pinyin`, my client can use the ICU `Latin-NumericPinyin` transform to automatically convert the pronunciation to numeric form, e.g. `Ni3 hao3, ren4shi ni3 hen3 gao1xing4`.
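
To make the idea concrete, here is a rough sketch of what such a transform does for a single syllable. It is not the ICU transform and does not reproduce its rules: the names `split_tone` and `syllable_to_numeric` are hypothetical, the input is assumed to be one pre-segmented syllable using precomposed tone-marked vowels, and neutral-tone syllables are left unchanged.

```rust
/// Splits a tone-marked vowel into its base vowel and tone number (1-4).
/// Returns `None` for any character that carries no tone mark.
fn split_tone(c: char) -> Option<(char, u8)> {
    // Each row lists the four marked forms of a vowel in tone order.
    const TABLE: [(char, &str); 6] = [
        ('a', "āáǎà"),
        ('e', "ēéěè"),
        ('i', "īíǐì"),
        ('o', "ōóǒò"),
        ('u', "ūúǔù"),
        ('ü', "ǖǘǚǜ"),
    ];
    for &(base, marked) in TABLE.iter() {
        if let Some(idx) = marked.chars().position(|m| m == c) {
            return Some((base, idx as u8 + 1));
        }
    }
    None
}

/// Converts one tone-marked pinyin syllable to numeric form by stripping
/// the tone mark and appending the tone number at the end.
pub fn syllable_to_numeric(syllable: &str) -> String {
    let mut out = String::new();
    let mut tone = None;
    for c in syllable.chars() {
        match split_tone(c) {
            Some((base, t)) => {
                out.push(base);
                tone = Some(t);
            }
            None => out.push(c),
        }
    }
    if let Some(t) = tone {
        out.push_str(&t.to_string());
    }
    out
}
```

A client can only apply a conversion like this safely when it knows the value is pinyin, which is exactly what the `zh-Latn-pinyin` tag tells it.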

Furthermore, this system would help with typesetting ruby text by allowing mixed Hiragana and Katakana pronunciations via `ja-Hrkt`. See [`japanese-furigana-normalize`](https://github.com/MarvNC/japanese-furigana-normalize) for more info.

# Change

For backwards compatibility, we should alias the previous variant names to the new tags. Furthermore, I think we should warn when we encounter an `Other` value that is not a valid BCP 47 tag (and optionally when we encounter `und` as the language[^1]). If we're willing to make a breaking change, then perhaps the following schema is best:
```rust
  pub enum PronunciationKind {
      Bcp47(icu_locale_core::LanguageIdentifier), // or use String if you don't want the dependency
      #[strum(to_string = "{0}")]
      #[serde(untagged)]
      Other(String),
  }
```
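
For the warning on `Other` values that are not BCP 47 tags, a dependency-free check might look roughly like the sketch below. The name `looks_like_bcp47` is hypothetical, and the check covers well-formedness only: it does not consult the IANA registry, and it ignores extlang subtags and grandfathered tags. A legacy value such as `ipa` is structurally a well-formed language subtag and would pass, so real validation would still need registry data or a library such as `icu_locale_core`.

```rust
fn is_alpha(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphabetic())
}

fn is_alnum(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric())
}

fn is_digits(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}

/// Variant subtags are 5-8 alphanumerics, or 4 starting with a digit.
fn is_variant(s: &str) -> bool {
    let starts_with_digit = s.starts_with(|c: char| c.is_ascii_digit());
    is_alnum(s) && ((5..=8).contains(&s.len()) || (s.len() == 4 && starts_with_digit))
}

/// Rough structural check:
/// language [-script] [-region] *(-variant) *(-extension) [-x-privateuse]
pub fn looks_like_bcp47(tag: &str) -> bool {
    let mut parts = tag.split('-').peekable();

    // Primary language: 2-3 letters, or 5-8 letters for registered subtags.
    match parts.next() {
        Some(lang) if is_alpha(lang) && matches!(lang.len(), 2..=3 | 5..=8) => {}
        _ => return false,
    }
    // Optional script: exactly 4 letters.
    if parts.peek().map_or(false, |s| s.len() == 4 && is_alpha(s)) {
        parts.next();
    }
    // Optional region: 2 letters or 3 digits.
    if parts.peek().map_or(false, |s| {
        (s.len() == 2 && is_alpha(s)) || (s.len() == 3 && is_digits(s))
    }) {
        parts.next();
    }
    // Zero or more variants.
    while parts.peek().map_or(false, |s| is_variant(s)) {
        parts.next();
    }
    // Extensions (singleton + 2-8 char subtags) and private use (`x` + 1-8).
    while let Some(singleton) = parts.next() {
        if singleton.len() != 1 || !is_alnum(singleton) {
            return false;
        }
        let min_len = if singleton.eq_ignore_ascii_case("x") { 1 } else { 2 };
        let mut count = 0;
        while let Some(&s) = parts.peek() {
            if s.len() < min_len {
                break; // a new singleton (or an empty subtag) starts here
            }
            if s.len() > 8 || !is_alnum(s) {
                return false;
            }
            parts.next();
            count += 1;
        }
        if count == 0 {
            return false; // a singleton must be followed by at least one subtag
        }
    }
    true
}
```

With this sketch, `zh-Latn-wadegile` and `ko-Latn-x-yale` pass, while the legacy value `wadegiles` is rejected because it is too long to be a language subtag.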

# Update (Oct 1)
I think it's better to get rid of `PronunciationKind` entirely and use `icu_locale_core::LanguageIdentifier` (or `String`) instead. Any data that doesn't fit exactly into a BCP 47 tag could go into a private-use subtag (`-x-...`), as with Yale.[^2]

[^1]: It's best to specify the language, as in the example above, since IPA notation changes from language to language.
[^2]: Korean is in a more peculiar position: `ko-Latn-alalc97` (Modified McCune–Reischauer) seems to be the only variant specified in Unicode CLDR. According to the [CLDR](https://cldr.unicode.org/index/cldr-spec/transliteration-guidelines#korean), the default transform for `ko-Latn` is the [Revised Romanization of Korean (RR)](https://en.wikipedia.org/wiki/Revised_Romanization_of_Korean) (also indicated by the transform flags `-t-m0-mcst` or `-t-m0-bgn`), which suggests that RR should be the default interpretation of `ko-Latn`. Unfortunately, this means we have to either model Yale ambiguously with RR under `ko-Latn` or use a private-use extension (e.g. `ko-Latn-x-yale`). Funnily enough, `ko-Latn-KP` would most likely model the North Korean [Romanization of Korean](https://en.wikipedia.org/wiki/Romanization_of_Korean_(North_Korea)) system.
[^3]: You should almost always use `ja-Latn-alalc97` (Library of Congress), as that is what's commonly used. If you're sure you're using Traditional Hepburn, use `ja-Latn-hepburn`. See the [Wikipedia page](https://en.wikipedia.org/wiki/Hepburn_romanization#Variants) for the differences.
