EASC: The Essex Arabic Summaries Corpus

Mo El-Haj, Udo Kruschwitz, Chris Fox
University of Essex, UK

This repository hosts EASC — the Essex Arabic Summaries Corpus — a collection of 153 Arabic source documents and 765 human-generated extractive summaries, created using Amazon Mechanical Turk.

EASC is one of the earliest publicly available datasets for Arabic single-document summarisation and remains widely used in research on Arabic NLP, extractive summarisation, sentence ranking, and evaluation.


📘 Background

EASC was introduced in:

El-Haj, M., Kruschwitz, U., & Fox, C. (2010).
Using Mechanical Turk to Create a Corpus of Arabic Summaries.
Workshop on LRs & HLT for Semitic Languages at LREC 2010.

The corpus was motivated by the lack of gold-standard resources for evaluating Arabic text summarisation, particularly extractive systems. Mechanical Turk was used to collect five independent extractive summaries per article, offering natural diversity and enabling aggregation into different gold-standard levels.

The work was later expanded in:

  • El-Haj (2012). Multi-document Arabic Text Summarisation. PhD Thesis, University of Essex.
  • El-Haj, Kruschwitz & Fox (2011). Exploring clustering for multi-document Arabic summarisation. AIRS 2011.

🗂 Corpus Contents

EASC contains:

Component   Count   Description
---------   -----   -----------
Articles    153     Arabic Wikipedia, AlRai (Jordan), and AlWatan (Saudi Arabia)
Summaries   765     Five extractive summaries per article
Topics      10      Art, Environment, Politics, Sport, Health, Finance, Science & Technology, Tourism, Religion, Education

Each summary was produced by a different Mechanical Turk worker, who selected the sentences they judged most important, up to 50% of the sentences in the article.


📁 Directory Structure


Articles/
  Article001/
  Article002/
  ...
MTurk/
  Article001/
  Article002/
  ...
  

Where:

  • Articles/ArticleXX/*.txt → full document

  • MTurk/ArticleXX/Dxxxx.M.250.A.#.* → five extractive summaries (A–E)


📦 Modern Dataset Format (this repository)

To make EASC easier to use with modern NLP tools, this repository includes a unified CSV/JSONL version:

CSV Schema

Field          Description
-----          -----------
article_id     Unique article identifier (1–153)
topic_name     Topic label extracted from the filename
article_text   Full article text
summary_A      Human summary A
summary_B      Human summary B
summary_C      Human summary C
summary_D      Human summary D
summary_E      Human summary E

JSON Schema

One JSON object per line, one per article (shown indented here for readability):

{ "article_id": 1, "topic_name": "Art and Music", "article_text": "...", "summaries": ["...", "...", "...", "...", "..."] }


🛠️ Regenerating the CSV / JSONL

The following Python script reconstructs the unified dataset from the raw Articles/ and MTurk/ folders. Files are read as UTF-8, with any undecodable bytes replaced:

import os
import re
import json
import pandas as pd

ARTICLES_DIR = "Articles"
MTURK_DIR = "MTurk"

records_csv = []
records_jsonl = []

for folder in sorted(os.listdir(ARTICLES_DIR)):
    folder_path = os.path.join(ARTICLES_DIR, folder)
    if not os.path.isdir(folder_path):
        continue

    m = re.match(r"Article(\d+)", folder)
    if not m:
        continue

    article_id = int(m.group(1))
    # Each article folder holds one .txt file; sort for a deterministic pick.
    article_files = sorted(f for f in os.listdir(folder_path) if f.endswith(".txt"))
    if not article_files:
        continue

    article_file = article_files[0]
    article_file_path = os.path.join(folder_path, article_file)

    # Filenames look like "Topic (N).txt"; the topic label precedes the number.
    base = os.path.splitext(article_file)[0]
    match = re.match(r"(.+?)\s*\(\d+\)", base)
    topic_name = match.group(1).strip() if match else "Unknown"

    with open(article_file_path, "r", encoding="utf-8", errors="replace") as f:
        article_text = f.read().strip()

    # Collect the worker summaries; sorted filename order maps them to A-E.
    summaries_dir = os.path.join(MTURK_DIR, folder)
    summaries = []

    if os.path.isdir(summaries_dir):
        for sfile in sorted(os.listdir(summaries_dir)):
            s_path = os.path.join(summaries_dir, sfile)
            with open(s_path, "r", encoding="utf-8", errors="replace") as f:
                summaries.append(f.read().strip())

    # Pad or truncate so every record carries exactly five summaries.
    while len(summaries) < 5:
        summaries.append("")
    summaries = summaries[:5]

    records_csv.append({
        "article_id": article_id,
        "topic_name": topic_name,
        "article_text": article_text,
        "summary_A": summaries[0],
        "summary_B": summaries[1],
        "summary_C": summaries[2],
        "summary_D": summaries[3],
        "summary_E": summaries[4]
    })

    records_jsonl.append({
        "article_id": article_id,
        "topic_name": topic_name,
        "article_text": article_text,
        "summaries": summaries
    })

df = pd.DataFrame(records_csv)
df.to_csv("EASC.csv", index=False, encoding="utf-8")

with open("EASC.jsonl", "w", encoding="utf-8") as f:
    for row in records_jsonl:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

print("Done! Created EASC.csv and EASC.jsonl")



📥 Train / Validation / Test Splits

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("EASC.csv")

# Hold out 20%, then split it evenly into validation and test (an 80/10/10 split).
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

train_df.to_csv("EASC_train.csv", index=False)
val_df.to_csv("EASC_val.csv", index=False)
test_df.to_csv("EASC_test.csv", index=False)
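
With only 153 articles spread over 10 topics, a purely random split can under-represent some topics in the small validation and test sets. A stratified variant is one option; this sketch uses train_test_split's standard stratify parameter and assumes every topic has enough articles to land in each partition:

# Stratify on the topic label so each split preserves the topic distribution.
train_df, temp_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["topic_name"]
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, random_state=42, stratify=temp_df["topic_name"]
)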


🎯 Intended Use

EASC supports research in:

  • Extractive summarisation

  • Sentence ranking and scoring

  • Gold-summary aggregation (Level2, Level3)

  • ROUGE and Dice evaluation

  • Learning sentence importance

  • Human–machine evaluation comparisons

  • Crowdsourcing quality analysis

To our knowledge, EASC is among the few Arabic summarisation datasets with:

  • consistent multiple references per document

  • real extractive human judgements

  • cross-worker variability suitable for probabilistic modelling


📊 Recommended Gold Standards

Based on the original paper:

  • Level 3: sentences selected by ≥3 workers

  • Level 2: sentences selected by ≥2 workers

  • All: all sentences selected by any worker (not recommended as a gold standard; used for analysis only)

These levels can be regenerated programmatically from the unified CSV.
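
Here is a minimal sketch of that aggregation; it assumes sentences can be recovered with a simple delimiter-based splitter and matched verbatim across the five summaries (sentence_split below is a hypothetical helper, not the original tooling):

import re
from collections import Counter

import pandas as pd

def sentence_split(text):
    # Hypothetical splitter: break on full stops, Arabic question marks,
    # exclamation marks, and newlines; adjust to your sentence boundaries.
    if not isinstance(text, str):
        return []  # e.g. NaN for a padded-empty summary column
    return [s.strip() for s in re.split(r"[.!\u061F\u06D4\n]+", text) if s.strip()]

df = pd.read_csv("EASC.csv")

def gold_standard(row, min_workers=2):
    # Count, per sentence, how many of the five workers selected it.
    counts = Counter()
    for col in ["summary_A", "summary_B", "summary_C", "summary_D", "summary_E"]:
        counts.update(set(sentence_split(row[col])))
    # Level 2 keeps sentences chosen by >=2 workers, Level 3 by >=3.
    return [s for s, n in counts.items() if n >= min_workers]

df["gold_level2"] = df.apply(gold_standard, axis=1, min_workers=2)
df["gold_level3"] = df.apply(gold_standard, axis=1, min_workers=3)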


🧪 Evaluations (from the 2010 paper)

Systems evaluated against EASC include:

  • Sakhr Arabic Summariser

  • AQBTSS

  • Gen-Summ

  • LSA-Summ

  • Baseline-1 (first sentence)

Metrics used:

  • Dice coefficient (recommended for extractive summarisation; see the sketch after this list)

  • ROUGE-2 / ROUGE-L / ROUGE-W / ROUGE-S

  • AutoSummENG
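
Since extractive summaries are sets of selected sentences, Dice reduces to set overlap. A minimal sketch, assuming both summaries are represented as sets of selected sentence indices (not the paper's exact implementation):

def dice(selected_a, selected_b):
    # Dice coefficient over two sets of selected sentences:
    # 2 * |A intersect B| / (|A| + |B|); 1.0 for two empty summaries by convention.
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Example: system picked sentences {0, 1, 3}; gold standard is {1, 3, 4}.
print(dice({0, 1, 3}, {1, 3, 4}))  # 0.666...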

All details are documented in the LREC 2010 paper.


📑 Citation

If you use EASC, please cite:

El-Haj, M., Kruschwitz, U., & Fox, C. (2010). Using Mechanical Turk to Create a Corpus of Arabic Summaries. In the Workshop on LRs & HLT for Semitic Languages at LREC 2010.

Additional references:

El-Haj, M. (2012). Multi-document Arabic Text Summarisation. PhD Thesis, University of Essex.

El-Haj, M., Kruschwitz, U., & Fox, C. (2011). Exploring Clustering for Multi-Document Arabic Summarisation. AIRS 2011.

📜 Licence

The original EASC release permits research use.

This cleaned and reformatted version follows the same academic-research usage terms.

✔ Notes

  • Some Mechanical Turk summaries may include noisy selections or inconsistent behaviour; these are preserved to avoid subjective filtering.
  • File encodings reflect the original dataset; all modern versions are normalised to UTF-8.
  • The unified CSV/JSONL is provided for convenience and reproducibility.

🧭 Maintainer

Dr Mo El-Haj

Associate Professor in Natural Language Processing

VinUniversity, Vietnam / Lancaster University, UK
