
Zürcher Verständlichkeitsindex (ZIX)

Get a pragmatic indication of how understandable a German text is.


Usage

Install uv for environment management.

1. Install the ZIX as a package

  • Install directly from GitHub: pip install git+https://github.com/machinelearningZH/zix_understandability-index
  • Or clone the repo and install locally: pip install .
  • The required spaCy language model (de_core_news_sm) will be installed automatically.
  • Use the package like this:
from zix.understandability import get_zix, get_cefr

text = """
Die Schweiz, amtlich Schweizerische Eidgenossenschaft, ist ein föderalistischer, demokratischer Staat in Mitteleuropa. Er grenzt im Norden an Deutschland, im Osten an Österreich und Liechtenstein, im Süden an Italien und im Westen an Frankreich.
""".strip()
zix_score = get_zix(text)
cefr = get_cefr(zix_score)
print(f"The text has an understandability score of: {zix_score:.1f}")
print(f"The text has a CEFR level of roughly: {cefr}")

>>> The text has an understandability score of: -2.0
>>> The text has a CEFR level of roughly: C1

2. Explore the methodology in the notebooks

  • Clone this repo and change into the project directory.
  • Set up the environment with notebook dependencies: uv sync --extra notebooks
  • Run the notebooks in an IDE like Visual Studio Code. Alternatively, you can use Jupyter Notebook or Jupyter Lab.
  • If you want to recreate the synthetic data that we generated with LLMs, you also need to create an .env file containing your OpenRouter API key. The .env file should look like this:
    OPENROUTER_API_KEY=sk-...

What does the score mean?

  • Negative scores indicate difficult texts in the range of B2 to C2. These texts will likely be very hard for many people to understand (this is classic «Behördendeutsch» or legal-text territory).
  • Positive scores indicate a language level of B1 or easier.
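As an illustration, the sign-based interpretation above can be wrapped in a small helper. Note that the cut-offs below are hypothetical and chosen for illustration only; they are not necessarily the boundaries that the packaged `get_cefr` uses. Only the sign rule and the example score of -2.0 mapping to roughly C1 are taken from this README.

```python
def rough_cefr_band(zix_score: float) -> str:
    """Map a ZIX score (range -10..10) to a coarse CEFR band.

    Hypothetical cut-offs for illustration only; the packaged
    get_cefr() may use different boundaries.
    """
    if zix_score >= 0:
        return "B1 or easier"   # positive scores: B1, A2, A1
    if zix_score >= -2:
        return "B2-C1"          # moderately difficult
    return "C1-C2"              # very difficult (legal/official German)

print(rough_cefr_band(4.2))   # an easy text
print(rough_cefr_band(-2.0))  # the example score from the usage section
```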

Here we plot the scores for our own data set.

Now that we have the ZIX metric we can assess other corpora and text types too.

Important

This understandability index is meant as a pragmatic measure. It is neither exact nor, with regard to CEFR levels, an official measure. That said, the index serves us well in practice in our context and for our text data. We treat it as an indication of whether our editing is going in the right direction.

Please note that this index only works for German texts!

How does the score work?

  • The score takes into account sentence length, the readability metric RIX, the occurrence of common words, and the overlap with the standard CEFR vocabularies A1, A2, and B1.
  • At the moment the score does not take into account other language properties that are essential for, e.g., Einfache Sprache (B1 or easier, similar to «Plain English») or Leichte Sprache (A2/A1, similar to «Easy English»), such as the use of passive voice, subjunctives, and negations.
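For instance, the RIX readability metric mentioned above is the number of long words (seven or more characters) divided by the number of sentences. A minimal sketch without spaCy, using naive punctuation-based sentence splitting (the package itself relies on spaCy and textdescriptives, which are far more robust):

```python
import re

def rix(text: str) -> float:
    """Naive RIX: long words (>= 7 characters) per sentence.

    Simplified sketch; the ZIX package uses spaCy/textdescriptives
    for proper tokenization and sentence splitting.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    long_words = [w for w in words if len(w) >= 7]
    return len(long_words) / max(len(sentences), 1)

print(rix("Die Katze schläft. Die Verwaltungsvereinbarung wurde unterzeichnet."))
```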

For more details on how we derived the index, please have a look at the notebooks, particularly 04_create_zix.ipynb.

Note

The index is slightly adjusted to Swiss German. Specifically, we use ss instead of ß in our vocabulary lists. In practice this should not make a big difference. For High German text that actually contains ß, the index will likely slightly underestimate the understandability, by around 0.1 points.
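If you score High German text, one possible workaround (our suggestion, not an official package feature) is to normalize ß to ss before scoring, so the text matches the Swiss vocabulary lists:

```python
def normalize_swiss(text: str) -> str:
    """Replace ß with ss so High German text matches the Swiss German
    vocabulary lists used by the index (see the note above)."""
    return text.replace("ß", "ss").replace("ẞ", "SS")

print(normalize_swiss("Die Straße ist groß."))  # -> "Die Strasse ist gross."
```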

Background

Since no open understandability index seems to be available, we created our own. Many readability metrics exist. However, readability and understandability are related but not identical; a text can be readable yet hard to understand due to difficult vocabulary, passive voice, subjunctives etc.

Our index goes beyond readability metrics by incorporating semantic features, emphasizing common vocabulary. It also measures the overlap between the text's vocabulary and official standard CEFR vocabularies for German.
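A minimal sketch of such a vocabulary-overlap feature (hypothetical implementation; the package lemmatizes with spaCy and uses the official CEFR reference lists):

```python
def vocab_overlap(lemmas: list[str], cefr_vocab: set[str]) -> float:
    """Share of the text's lemmas covered by a CEFR vocabulary list.

    Sketch only: assumes tokens are already lemmatized and lowercased,
    as the packaged reference lists are.
    """
    if not lemmas:
        return 0.0
    covered = sum(1 for lemma in lemmas if lemma in cefr_vocab)
    return covered / len(lemmas)

# Toy A1 vocabulary (illustrative, not the official list)
a1 = {"der", "hund", "schlafen", "haus"}
print(vocab_overlap(["der", "hund", "schlafen", "gemütlich"], a1))  # -> 0.75
```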

We recommend that you validate the index systematically with your text data to assess if it works well for your domain too.

Our steps to create the index

1. Data Collection (01_create_cefr_data.ipynb, 02_scrape_administrative_texts.ipynb)

2. Dataset Creation (03_create_dataset.ipynb)

  • Combine synthetic CEFR samples, administrative news, and legal texts.
  • Prepare standard CEFR vocabulary reference lists (lemmatized).
  • Create a unified dataset for model training.

3. Index Development (04_create_zix.ipynb)

  • Extract linguistic features and readability metrics with spaCy and textdescriptives.
  • Calculate CEFR vocabulary overlap (A1, A2, B1) and common word scores.
  • Explore feature distributions across text types.
  • Use a Gaussian Mixture Model to identify and filter outliers.
  • Select 6 expressive features (2 syntactic, 4 semantic): sentence length, RIX readability, CEFR vocabulary ratios, and common word score.
  • Map text types to difficulty levels (A1=1, A2=2, B1=3, B2=4, C1/Admin=5, C2=6, Legal=8).
  • Train a Ridge Regressor with cross-validation on the difficulty levels.
  • Scale predicted scores to a -10 to 10 range, centered around 0.
  • Negative scores indicate difficult texts (B2 to C2); positive scores indicate simpler texts (B1 to A1).
  • Serialize the trained model and scaler for the package.
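The training and rescaling steps above could look roughly like the following sketch with synthetic data. The feature count, the difficulty labels (A1=1 ... C2=6, Legal=8), and the -10..10 rescaling with harder texts negative follow this README; everything else (alpha value, random data, the exact rescaling formula) is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for the 6 features: sentence length, RIX,
# A1/A2/B1 vocabulary ratios, and the common-word score.
X = rng.normal(size=(200, 6))
# Difficulty labels as in the README: A1=1 ... C2=6, Legal=8.
y = rng.choice([1, 2, 3, 4, 5, 6, 8], size=200).astype(float)

scaler = StandardScaler().fit(X)
model = Ridge(alpha=1.0).fit(scaler.transform(X), y)

raw = model.predict(scaler.transform(X))
# Rescale predictions linearly to -10..10: low predicted difficulty
# (easy text) maps to +10, high predicted difficulty to -10.
lo, hi = raw.min(), raw.max()
zix = 10 - 20 * (raw - lo) / (hi - lo)

print(zix.min(), zix.max())
```

In practice the model and scaler would then be serialized (e.g. with joblib) and shipped inside the package, as the last step above describes.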

4. Package Creation

  • Refactor the index into a reusable module.
  • Include the trained model, scaler, and reference vocabularies.
  • Make it installable via pip.

We developed this index for our text simplification app that helps us rewrite complex administrative texts. The app displays the understandability of both the source text and simplified text. The index also allows us to measure the quality of various prompting techniques and methods quantitatively.

To the best of our knowledge, there are unfortunately no open-source CEFR-labeled NLP datasets with a truly permissive license. Most available general datasets (Wikipedia, Books, news sources, etc.) have licensing that is too restrictive for our use case or are paid. Thus, we use text data from the cantonal administration and additionally create synthetic data.

Project Team

Chantal Amrhein, Patrick Arnecke
Statistisches Amt Zürich: Team Data

Feedback and Contributing

We welcome feedback and contributions! Email us or open an issue or pull request.

We use ruff for linting and formatting.

To run tests:

  • Install dev dependencies: uv sync --extra dev
  • Run tests: pytest _tests/

License

This project is licensed under the MIT License. See the LICENSE file for details.

Please be aware that the text data from the cantonal administration (court decisions, news bulletins, RRBs) is copyrighted and therefore is not included in the MIT licensing. This does not affect your usage of the index. You just shouldn't use the cantonal text data for anything else.

Disclaimer

This software (the Software) incorporates commercial and open-source models (the Models) from providers like OpenRouter, spaCy, etc. The Software has been developed according to and with the intent to be used under Swiss law. Please be aware that the EU Artificial Intelligence Act (EU AI Act) may, under certain circumstances, be applicable to your use of the Software. You are solely responsible for ensuring that your use of the Software as well as of the underlying Models complies with all applicable local, national and international laws and regulations. By using this Software, you acknowledge and agree (a) that it is your responsibility to assess which laws and regulations, in particular regarding the use of AI technologies, are applicable to your intended use and to comply therewith, and (b) that you will hold us harmless from any action, claims, liability or loss in respect of your use of the Software.