Kalimat - a multipurpose Arabic Corpus

This repository provides a cleaned and consolidated version of Kalimat, a multipurpose Arabic corpus, containing 18,256 Arabic news articles collected from a range of domains. The original material consisted of thousands of individual .txt files organised across multiple category folders. These have been reconstructed, normalised, and compiled into machine-learning-friendly formats.

📚 Corpus Overview

The cleaned corpus contains 18,256 articles covering a wide selection of news categories:

Politics
Economy
Culture
Religion
Sport
Social / Society-related topics
Other sub-domains depending on the original folder structure

Each article was originally stored as one word per line. In this cleaned edition, all documents have been reconstructed into natural text format with proper spacing, UTF-8 encoding, and consistent metadata extraction.
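The reconstruction step can be illustrated with a minimal sketch. It assumes the raw files hold exactly one token per line, possibly with blank lines in between; the function name `rebuild_text` and the sample string are illustrative and not taken from the original cleaning script.

```python
def rebuild_text(raw: str) -> str:
    """Join one-word-per-line content into a single space-separated string."""
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines, blank lines, and stray spaces all collapse away.
    words = raw.split()
    return " ".join(words)


# Illustrative input: four words, one per line, with one blank line.
sample = "ذهب\nالولد\n\nإلى\nالمدرسة\n"
print(rebuild_text(sample))
```

For a real file, the same function could be applied to the result of `open(path, encoding="utf-8").read()`.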

Although the original filenames varied widely, each document is now associated with the following fields:

id – numeric identifier extracted from the filename (or -1 where none existed)
filename – original filename exactly as it appeared
category – derived from directory structure or filename
year_month – extracted from filename where possible, otherwise "unknown"
text – reconstructed, cleaned article text
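As a rough illustration of how the metadata fields with their fallback values might be derived, here is a hedged sketch. The README does not document the original filename patterns, so the regular expressions, the rule of taking the last digit run as the id, and the sample paths are all hypothetical. The sketch also derives the category from the parent directory only, whereas the released data may also use the filename.

```python
import re
from pathlib import PurePosixPath


def extract_metadata(path: str) -> dict:
    """Build the non-text metadata fields for one document path."""
    p = PurePosixPath(path)
    stem = p.stem

    # Assumption: the id is the last run of digits in the filename stem.
    digit_runs = re.findall(r"\d+", stem)

    # Assumption: a date, if present, looks like YYYY-MM or YYYY_MM.
    ym = re.search(r"(\d{4})[-_](\d{2})", stem)

    return {
        "id": int(digit_runs[-1]) if digit_runs else -1,
        "filename": p.name,
        "category": p.parent.name or "unknown",
        "year_month": f"{ym.group(1)}-{ym.group(2)}" if ym else "unknown",
    }


# Hypothetical paths, for illustration only.
print(extract_metadata("politics/politics_2011-05_123.txt"))
print(extract_metadata("culture/culture.txt"))
```

The second call shows the fallbacks described above: an id of -1 and a year_month of "unknown" when the filename carries neither.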

📦 Provided Formats

The cleaned dataset is released in the following forms:

1. CSV File

kalimat.csv

A single UTF-8 CSV containing all metadata and article texts.
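A small, dependency-free sketch of reading the CSV and counting articles per category follows. It assumes the header row uses the field names listed above (id, filename, category, year_month, text); the in-memory sample stands in for the real file so the snippet runs on its own.

```python
import csv
import io
from collections import Counter


def category_counts(handle) -> Counter:
    """Count rows per category from an open CSV text handle."""
    reader = csv.DictReader(handle)
    return Counter(row["category"] for row in reader)


# Stand-in for the real file; the column names are an assumption.
sample = io.StringIO(
    "id,filename,category,year_month,text\n"
    "1,a.txt,sport,unknown,نص أول\n"
    "-1,b.txt,culture,unknown,نص ثان\n"
    "2,c.txt,sport,unknown,نص ثالث\n"
)
counts = category_counts(sample)
print(counts.most_common())
```

With the real file, the handle would be `open("kalimat.csv", encoding="utf-8", newline="")`.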

2. JSONL File

kalimat.jsonl

One JSON object per line, suitable for training modern NLP models (e.g. with Hugging Face Transformers).
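Because each line is an independent JSON object, the file can be streamed without loading it all into memory. The sketch below assumes each object carries the same keys as the metadata fields listed earlier; the helper name `iter_jsonl` and the sample records are illustrative.

```python
import io
import json


def iter_jsonl(handle):
    """Yield one parsed record per non-empty line of a JSONL text handle."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Stand-in for the real file; the keys are an assumption.
sample = io.StringIO(
    '{"id": 1, "filename": "a.txt", "category": "sport", '
    '"year_month": "unknown", "text": "نص أول"}\n'
    '{"id": -1, "filename": "b.txt", "category": "culture", '
    '"year_month": "unknown", "text": "نص ثان"}\n'
)
records = list(iter_jsonl(sample))
print(len(records), records[0]["category"])
```

For the real file, pass `open("kalimat.jsonl", encoding="utf-8")` as the handle.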

3. TXT Version (Zipped)

kalimat_txt.zip

All reconstructed .txt documents are included in a single compressed archive to avoid storing thousands of individual files in the repository. Each .txt file uses the original filename for easy reference.
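The archive can be read directly without extracting it to disk. This is a minimal sketch that assumes every member ending in .txt is a UTF-8 document; the internal folder layout of the archive is not described here, so the code makes no assumption about it. A tiny in-memory archive stands in for the real one.

```python
import io
import zipfile


def iter_txt_documents(zip_source):
    """Yield (member name, decoded text) for each .txt member of a zip."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(".txt"):
                yield name, archive.read(name).decode("utf-8")


# Build a small in-memory archive so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "نص تجريبي")  # str is stored as UTF-8
buffer.seek(0)

docs = dict(iter_txt_documents(buffer))
print(list(docs))
```

With the real archive, `iter_txt_documents("kalimat_txt.zip")` would iterate over the released documents.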

🔀 Train / Validation / Test Splits

The dataset has been randomly split (using a fixed seed for reproducibility) into:

Training set – 80%
Validation set – 10%
Test set – 10%

These splits are provided as:

kalimat_train.csv
kalimat_val.csv
kalimat_test.csv

All CSVs preserve the same column structure as the main file.

Code used for splitting (for reference)

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("kalimat.csv", encoding="utf-8")

# Hold out 20% of the rows; the remaining 80% is the training set
train_df, temp_df = train_test_split(
    df, test_size=0.20, random_state=42, shuffle=True
)

# Halve the held-out 20% into validation (10%) and test (10%)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=42, shuffle=True
)

train_df.to_csv("kalimat_train.csv", index=False, encoding="utf-8")
val_df.to_csv("kalimat_val.csv", index=False, encoding="utf-8")
test_df.to_csv("kalimat_test.csv", index=False, encoding="utf-8")
```
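A quick sanity check on the resulting splits might look like the sketch below. It takes any unique key per document and reports the split fractions and the number of keys shared between splits. The helper `split_report` is illustrative, and which column is unique in the released files is an assumption to verify: the id field is a poor key wherever it falls back to -1.

```python
def split_report(train_keys, val_keys, test_keys) -> dict:
    """Report split fractions and the number of keys shared across splits."""
    train, val, test = set(train_keys), set(val_keys), set(test_keys)
    total = len(train) + len(val) + len(test)
    overlap = (train & val) | (train & test) | (val & test)
    return {
        "train_frac": len(train) / total,
        "val_frac": len(val) / total,
        "test_frac": len(test) / total,
        "overlap": len(overlap),
    }


# Synthetic keys standing in for real document identifiers.
report = split_report(range(0, 80), range(80, 90), range(90, 100))
print(report)
```

On the real data, the keys could come from a column of each split file, and an overlap of zero would confirm that the three sets are disjoint.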

📁 Repository Structure

kalimat.csv
kalimat.jsonl
kalimat_train.csv
kalimat_val.csv
kalimat_test.csv
kalimat_txt.zip
README.md

📄 Citation

If you use this dataset in your work, please cite the original Kalimat paper:

El-Haj, M., & Koulali, R. (2013). Kalimat: a multipurpose Arabic corpus. In Second Workshop on Arabic Corpus Linguistics (WACL-2), pp. 22–25.
PDF available here
