EASC: The Essex Arabic Summaries Corpus

Mo El-Haj, Udo Kruschwitz, Chris Fox
University of Essex, UK

This repository hosts EASC — the Essex Arabic Summaries Corpus — a collection of 153 Arabic source documents and 765 human-generated extractive summaries, created using Amazon Mechanical Turk.

EASC is one of the earliest publicly available datasets for Arabic single-document summarisation and remains widely used in research on Arabic NLP, extractive summarisation, sentence ranking, and evaluation.


📘 Background

EASC was introduced in:

El-Haj, M., Kruschwitz, U., & Fox, C. (2010).
Using Mechanical Turk to Create a Corpus of Arabic Summaries.
Workshop on LRs & HLT for Semitic Languages at LREC 2010.

The corpus was motivated by the lack of gold-standard resources for evaluating Arabic text summarisation, particularly extractive systems. Mechanical Turk was used to collect five independent extractive summaries per article, offering natural diversity and enabling aggregation into different gold-standard levels.

The work was later expanded in:

  • El-Haj (2012). Multi-document Arabic Text Summarisation. PhD Thesis, University of Essex.
  • El-Haj, Kruschwitz & Fox (2011). Exploring clustering for multi-document Arabic summarisation. AIRS 2011.

🗂 Corpus Contents

EASC contains:

Component   Count   Description
---------   -----   -----------
Articles    153     Arabic Wikipedia, AlRai (Jordan), and AlWatan (Saudi Arabia)
Summaries   765     Five extractive summaries per article
Topics      10      Art, Environment, Politics, Sport, Health, Finance, Science & Technology, Tourism, Religion, Education

Each summary was produced by a different Mechanical Turk worker, who selected the sentences they judged most important, up to 50% of the sentences in the article.


📁 Directory Structure


Articles/
  Article001/
  Article002/
  ...
MTurk/
  Article001/
  Article002/
  ...
  

Where:

  • Articles/ArticleXX/*.txt → full document

  • MTurk/ArticleXX/Dxxxx.M.250.A.#.* → five extractive summaries (A–E)


📦 Modern Dataset Format (this repository)

To make EASC easier to use with modern NLP tools, this repository includes a unified CSV/JSONL version:

CSV Schema

Field          Description
-----          -----------
article_id     Unique article identifier (1–153)
topic_name     Topic label extracted from the filename
article_text   Full article text
summary_A      Human summary A
summary_B      Human summary B
summary_C      Human summary C
summary_D      Human summary D
summary_E      Human summary E

JSON Schema

One JSON object per line, one per article (shown indented here for readability):

{ "article_id": 1, "topic_name": "Art and Music", "article_text": "...", "summaries": ["...", "...", "...", "...", "..."] }


🛠️ Regenerating the CSV / JSONL

The following Python script reconstructs the unified dataset from the raw Articles/ and MTurk/ folders. Files are read as UTF-8, with any undecodable bytes replaced:

import os
import re
import json
import pandas as pd

ARTICLES_DIR = "Articles"
MTURK_DIR = "MTurk"

records_csv = []
records_jsonl = []

for folder in sorted(os.listdir(ARTICLES_DIR)):
    folder_path = os.path.join(ARTICLES_DIR, folder)
    if not os.path.isdir(folder_path):
        continue

    m = re.match(r"Article(\d+)", folder)
    if not m:
        continue

    article_id = int(m.group(1))
    # Each article folder holds one .txt file; sort for a deterministic pick.
    article_files = sorted(f for f in os.listdir(folder_path) if f.endswith(".txt"))
    if not article_files:
        continue

    article_file = article_files[0]
    article_file_path = os.path.join(folder_path, article_file)

    # Filenames look like "Topic (N).txt"; the topic label precedes the number.
    base = os.path.splitext(article_file)[0]
    match = re.match(r"(.+?)\s*\(\d+\)", base)
    topic_name = match.group(1).strip() if match else "Unknown"

    with open(article_file_path, "r", encoding="utf-8", errors="replace") as f:
        article_text = f.read().strip()

    # Collect the worker summaries; sorted filename order maps them to A-E.
    summaries_dir = os.path.join(MTURK_DIR, folder)
    summaries = []

    if os.path.isdir(summaries_dir):
        for sfile in sorted(os.listdir(summaries_dir)):
            s_path = os.path.join(summaries_dir, sfile)
            with open(s_path, "r", encoding="utf-8", errors="replace") as f:
                summaries.append(f.read().strip())

    # Pad or truncate so every record carries exactly five summaries.
    while len(summaries) < 5:
        summaries.append("")
    summaries = summaries[:5]

    records_csv.append({
        "article_id": article_id,
        "topic_name": topic_name,
        "article_text": article_text,
        "summary_A": summaries[0],
        "summary_B": summaries[1],
        "summary_C": summaries[2],
        "summary_D": summaries[3],
        "summary_E": summaries[4]
    })

    records_jsonl.append({
        "article_id": article_id,
        "topic_name": topic_name,
        "article_text": article_text,
        "summaries": summaries
    })

df = pd.DataFrame(records_csv)
df.to_csv("EASC.csv", index=False, encoding="utf-8")

with open("EASC.jsonl", "w", encoding="utf-8") as f:
    for row in records_jsonl:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

print("Done! Created EASC.csv and EASC.jsonl")



📥 Train / Validation / Test Splits

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("EASC.csv")

# Hold out 20%, then split it evenly into validation and test (an 80/10/10 split).
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

train_df.to_csv("EASC_train.csv", index=False)
val_df.to_csv("EASC_val.csv", index=False)
test_df.to_csv("EASC_test.csv", index=False)
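
With only 153 articles spread over 10 topics, a purely random split can under-represent some topics in the small validation and test sets. A stratified variant is one option; this sketch uses train_test_split's standard stratify parameter and assumes every topic has enough articles to land in each partition:

# Stratify on the topic label so each split preserves the topic distribution.
train_df, temp_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["topic_name"]
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, random_state=42, stratify=temp_df["topic_name"]
)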


🎯 Intended Use

EASC supports research in:

  • Extractive summarisation

  • Sentence ranking and scoring

  • Gold-summary aggregation (Level2, Level3)

  • ROUGE and Dice evaluation

  • Learning sentence importance

  • Human–machine evaluation comparisons

  • Crowdsourcing quality analysis

To our knowledge, EASC is among the few Arabic summarisation datasets with:

  • consistent multiple references per document

  • real extractive human judgements

  • cross-worker variability suitable for probabilistic modelling


📊 Recommended Gold Standards

Based on the original paper:

  • Level 3: sentences selected by ≥3 workers

  • Level 2: sentences selected by ≥2 workers

  • All: all sentences selected by any worker (not recommended as a gold standard; used for analysis only)

These levels can be regenerated programmatically from the unified CSV.
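
Here is a minimal sketch of that aggregation; it assumes sentences can be recovered with a simple delimiter-based splitter and matched verbatim across the five summaries (sentence_split below is a hypothetical helper, not the original tooling):

import re
from collections import Counter

import pandas as pd

def sentence_split(text):
    # Hypothetical splitter: break on full stops, Arabic question marks,
    # exclamation marks, and newlines; adjust to your sentence boundaries.
    if not isinstance(text, str):
        return []  # e.g. NaN for a padded-empty summary column
    return [s.strip() for s in re.split(r"[.!\u061F\u06D4\n]+", text) if s.strip()]

df = pd.read_csv("EASC.csv")

def gold_standard(row, min_workers=2):
    # Count, per sentence, how many of the five workers selected it.
    counts = Counter()
    for col in ["summary_A", "summary_B", "summary_C", "summary_D", "summary_E"]:
        counts.update(set(sentence_split(row[col])))
    # Level 2 keeps sentences chosen by >=2 workers, Level 3 by >=3.
    return [s for s, n in counts.items() if n >= min_workers]

df["gold_level2"] = df.apply(gold_standard, axis=1, min_workers=2)
df["gold_level3"] = df.apply(gold_standard, axis=1, min_workers=3)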


🧪 Evaluations (from the 2010 paper)

Systems evaluated against EASC include:

  • Sakhr Arabic Summariser

  • AQBTSS

  • Gen-Summ

  • LSA-Summ

  • Baseline-1 (first sentence)

Metrics used:

  • Dice coefficient (recommended for extractive summarisation; see the sketch after this list)

  • ROUGE-2 / ROUGE-L / ROUGE-W / ROUGE-S

  • AutoSummENG
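
Since extractive summaries are sets of selected sentences, Dice reduces to set overlap. A minimal sketch, assuming both summaries are represented as sets of selected sentence indices (not the paper's exact implementation):

def dice(selected_a, selected_b):
    # Dice coefficient over two sets of selected sentences:
    # 2 * |A intersect B| / (|A| + |B|); 1.0 for two empty summaries by convention.
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Example: system picked sentences {0, 1, 3}; gold standard is {1, 3, 4}.
print(dice({0, 1, 3}, {1, 3, 4}))  # 0.666...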

All details are documented in the LREC 2010 paper.


📑 Citation

If you use EASC, please cite:

El-Haj, M., Kruschwitz, U., & Fox, C. (2010). Using Mechanical Turk to Create a Corpus of Arabic Summaries. In the Workshop on LRs & HLT for Semitic Languages at LREC 2010.

Additional references:

El-Haj, M. (2012). Multi-document Arabic Text Summarisation. PhD Thesis, University of Essex.

El-Haj, M., Kruschwitz, U., & Fox, C. (2011). Exploring Clustering for Multi-Document Arabic Summarisation. AIRS 2011.

📜 Licence

The original EASC release permits research use.

This cleaned and reformatted version follows the same academic-research usage terms.

✔ Notes

  • Some Mechanical Turk summaries may include noisy selections or inconsistent behaviour; these are preserved to avoid subjective filtering.
  • File encodings reflect the original dataset; all modern versions are normalised to UTF-8.
  • The unified CSV/JSONL is provided for convenience and reproducibility.

🧭 Maintainer

Dr Mo El-Haj

Associate Professor in Natural Language Processing

VinUniversity, Vietnam / Lancaster University, UK
