Mo El-Haj, Udo Kruschwitz, Chris Fox
University of Essex, UK
This repository hosts EASC — the Essex Arabic Summaries Corpus — a collection of 153 Arabic source documents and 765 human-generated extractive summaries, created using Amazon Mechanical Turk.
EASC is one of the earliest publicly available datasets for Arabic single-document summarisation and remains widely used in research on Arabic NLP, extractive summarisation, sentence ranking, and evaluation.
EASC was introduced in:
El-Haj, M., Kruschwitz, U., & Fox, C. (2010).
Using Mechanical Turk to Create a Corpus of Arabic Summaries.
Workshop on LRs & HLT for Semitic Languages @ LREC 2010.
The corpus was motivated by the lack of gold-standard resources for evaluating Arabic text summarisation, particularly extractive systems. Mechanical Turk was used to collect five independent extractive summaries per article, offering natural diversity and enabling aggregation into different gold-standard levels.
The work was later expanded in:
- El-Haj (2012). Multi-document Arabic Text Summarisation. PhD Thesis, University of Essex.
- El-Haj, Kruschwitz & Fox (2011). Exploring clustering for multi-document Arabic summarisation. AIRS 2011.
EASC contains:
| Component | Count | Description |
|---|---|---|
| Articles | 153 | Arabic Wikipedia + AlRai (Jordan) + AlWatan (Saudi Arabia) |
| Summaries | 765 | Five extractive summaries per article |
| Topics | 10 | Art, Environment, Politics, Sport, Health, Finance, Science & Technology, Tourism, Religion, Education |
Each summary was produced by a different Mechanical Turk worker, who selected up to 50% of the sentences they considered most important.
Articles/
    Article001/
    Article002/
    ...
MTurk/
    Article001/
    Article002/
    ...
Where:

- Articles/ArticleXX/*.txt → full document
- MTurk/ArticleXX/Dxxxx.M.250.A.#.* → five extractive summaries (A–E)
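For example, the raw files for a single article can be located like this (a minimal sketch using only the standard library; Article001 is just an illustration):

```python
import glob

# The single source document for one article.
article_paths = glob.glob("Articles/Article001/*.txt")

# The five extractive summaries (workers A–E) for the same article.
summary_paths = sorted(glob.glob("MTurk/Article001/*"))

print(article_paths)
print(summary_paths)
```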
To make EASC easier to use with modern NLP tools, this repository includes a unified CSV/JSONL version:
| Field | Description |
|---|---|
| article_id | Unique article identifier (1–153) |
| topic_name | Topic label extracted from the filename |
| article_text | Full article text |
| summary_A | Human summary A |
| summary_B | Human summary B |
| summary_C | Human summary C |
| summary_D | Human summary D |
| summary_E | Human summary E |
One JSON object per article, one object per line:
{ "article_id": 1, "topic_name": "Art and Music", "article_text": "...", "summaries": ["...", "...", "...", "...", "..."] }
The following Python script reconstructs the unified dataset from the raw Articles/ and MTurk/ folders (files are read as UTF-8; any undecodable bytes are replaced):
import os
import re
import json

import pandas as pd

ARTICLES_DIR = "Articles"
MTURK_DIR = "MTurk"

records_csv = []
records_jsonl = []

for folder in sorted(os.listdir(ARTICLES_DIR)):
    folder_path = os.path.join(ARTICLES_DIR, folder)
    if not os.path.isdir(folder_path):
        continue

    # Folder names follow the pattern ArticleNNN; extract the numeric ID.
    m = re.match(r"Article(\d+)", folder)
    if not m:
        continue
    article_id = int(m.group(1))

    # Each article folder contains a single .txt source document.
    article_files = sorted(f for f in os.listdir(folder_path) if f.endswith(".txt"))
    if not article_files:
        continue
    article_file = article_files[0]
    article_file_path = os.path.join(folder_path, article_file)

    # The topic label is the filename minus the trailing "(N)" counter.
    base = os.path.splitext(article_file)[0]
    match = re.match(r"(.+?)\s*\(\d+\)", base)
    topic_name = match.group(1).strip() if match else "Unknown"

    with open(article_file_path, "r", encoding="utf-8", errors="replace") as f:
        article_text = f.read().strip()

    # Read the five MTurk summaries for this article in a stable order.
    summaries_dir = os.path.join(MTURK_DIR, folder)
    summary_files = sorted(f for f in os.listdir(summaries_dir) if not f.startswith("."))
    summaries = []
    for sfile in summary_files:
        s_path = os.path.join(summaries_dir, sfile)
        with open(s_path, "r", encoding="utf-8", errors="replace") as f:
            summaries.append(f.read().strip())

    # Pad or truncate to exactly five summaries (A–E).
    while len(summaries) < 5:
        summaries.append("")
    summaries = summaries[:5]

    records_csv.append({
        "article_id": article_id,
        "topic_name": topic_name,
        "article_text": article_text,
        "summary_A": summaries[0],
        "summary_B": summaries[1],
        "summary_C": summaries[2],
        "summary_D": summaries[3],
        "summary_E": summaries[4],
    })

    records_jsonl.append({
        "article_id": article_id,
        "topic_name": topic_name,
        "article_text": article_text,
        "summaries": summaries,
    })

# Write the CSV (one row per article) and the JSONL (one object per line).
df = pd.DataFrame(records_csv)
df.to_csv("EASC.csv", index=False, encoding="utf-8")

with open("EASC.jsonl", "w", encoding="utf-8") as f:
    for row in records_jsonl:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

print("Done! Created EASC.csv and EASC.jsonl")
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("EASC.csv")

# 80% train, then split the remaining 20% evenly into validation and test.
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

train_df.to_csv("EASC_train.csv", index=False)
val_df.to_csv("EASC_val.csv", index=False)
test_df.to_csv("EASC_test.csv", index=False)
EASC supports research in:

- Extractive summarisation
- Sentence ranking and scoring
- Gold-summary aggregation (Level 2, Level 3)
- ROUGE and Dice evaluation
- Learning sentence importance
- Human–machine evaluation comparisons
- Crowdsourcing quality analysis
EASC is one of the few Arabic summarisation datasets with:

- consistent multiple references per document
- real extractive human judgements
- cross-worker variability suitable for probabilistic modelling
Based on the original paper:

- Level 3: sentences selected by ≥3 workers
- Level 2: sentences selected by ≥2 workers
- All: all sentences selected by any worker (not recommended as a gold standard; used for analysis only)
These levels can be regenerated programmatically from the unified CSV.
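A minimal sketch of this aggregation, under simplifying assumptions (sentences are split on common sentence-final punctuation, and a sentence counts as selected if it appears verbatim in a worker's summary):

```python
import re

import pandas as pd

def split_sentences(text: str) -> list[str]:
    # Naive splitter on '.', '!', '?' and the Arabic question mark '؟'.
    parts = re.split(r"[.!?؟]+", text)
    return [p.strip() for p in parts if p.strip()]

def gold_sentences(article_text: str, summaries: list[str], min_votes: int) -> list[str]:
    # Keep article sentences chosen by at least `min_votes` of the five workers.
    selected = []
    for sent in split_sentences(article_text):
        votes = sum(1 for summ in summaries if sent in summ)
        if votes >= min_votes:
            selected.append(sent)
    return selected

df = pd.read_csv("EASC.csv").fillna("")
row = df.iloc[0]
summaries = [row[f"summary_{c}"] for c in "ABCDE"]

level3 = gold_sentences(row["article_text"], summaries, min_votes=3)  # Level 3: ≥3 workers
level2 = gold_sentences(row["article_text"], summaries, min_votes=2)  # Level 2: ≥2 workers
print(len(level2), len(level3))
```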
Systems evaluated against EASC include:

- Sakhr Arabic Summariser
- AQBTSS
- Gen-Summ
- LSA-Summ
- Baseline-1 (first sentence)
Metrics used:

- Dice coefficient (recommended for extractive summarisation; see the sketch below)
- ROUGE-2 / ROUGE-L / ROUGE-W / ROUGE-S
- AutoSummENG
All details are documented in the LREC 2010 paper.
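As an illustration of the recommended metric, here is a minimal sketch of the Dice coefficient computed over sets of selected sentences (how sentences are extracted and matched is up to the experiment; the toy inputs below are placeholders):

```python
def dice(system_sentences: set[str], reference_sentences: set[str]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) over the sets of selected sentences."""
    if not system_sentences and not reference_sentences:
        return 1.0
    overlap = len(system_sentences & reference_sentences)
    return 2 * overlap / (len(system_sentences) + len(reference_sentences))

# Toy usage with placeholder sentence identifiers.
system = {"s1", "s3"}
reference = {"s1", "s2", "s3"}
print(dice(system, reference))  # 2*2 / (2+3) = 0.8
```

Scores against each of the five references can then be averaged per article.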
If you use EASC, please cite:
El-Haj, M., Kruschwitz, U., & Fox, C. (2010). Using Mechanical Turk to Create a Corpus of Arabic Summaries. In the Workshop on LRs & HLT for Semitic Languages at LREC 2010.

Additional references:

- El-Haj, M. (2012). Multi-document Arabic Text Summarisation. PhD thesis, University of Essex.
- El-Haj, M., Kruschwitz, U., & Fox, C. (2011). Exploring Clustering for Multi-Document Arabic Summarisation. AIRS 2011.
The original EASC release permits research use.
This cleaned and reformatted version follows the same academic-research usage terms.
- Some Mechanical Turk summaries may include noisy selections or inconsistent behaviour; these are preserved to avoid subjective filtering.
- File encodings reflect the original dataset; all modern versions are normalised to UTF-8.
- The unified CSV/JSONL is provided for convenience and reproducibility.
Dr Mo El-Haj
Associate Professor in Natural Language Processing
VinUniversity, Vietnam / Lancaster University, UK