This repository provides a cleaned and consolidated version of KALIMAT, a multipurpose Arabic corpus containing 18,256 Arabic news articles collected from a diverse range of domains. The original material consisted of thousands of individual .txt files organised across multiple category folders; these have been reconstructed, normalised, and compiled into modern machine-learning-friendly formats.
The articles cover a wide selection of news categories:
- Politics
- Economy
- Culture
- Religion
- Sport
- Social / Society-related topics
- Other sub-domains depending on the original folder structure
Each article was originally stored as one word per line. In this cleaned edition, all documents have been reconstructed into natural text format with proper spacing, UTF-8 encoding, and consistent metadata extraction.
Although the original filenames varied widely, each document is now associated with the following fields:
- `id` – numeric identifier extracted from the filename (or `-1` where none existed)
- `filename` – original filename exactly as it appeared
- `category` – derived from the directory structure or filename
- `year_month` – extracted from the filename where possible, otherwise `"unknown"`
- `text` – reconstructed, cleaned article text
The cleaned dataset is released in the following forms:
- `kalimat.csv` – a single UTF-8 CSV containing all metadata and article texts.
- `kalimat.jsonl` – one JSON object per line, suitable for training modern NLP models (e.g. HuggingFace Transformers).
- `kalimat_txt.zip` – all reconstructed .txt documents in a single compressed archive, avoiding thousands of individual files in the repository. Each .txt file keeps its original filename for easy reference.
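Because the JSONL file stores one JSON object per line, it can be read with the standard library alone. A minimal sketch, demonstrated on an inline two-line sample rather than the real file (the field values here are hypothetical):

```python
import io
import json

# Inline stand-in for kalimat.jsonl: one JSON object per line.
sample = io.StringIO(
    '{"id": 1, "filename": "a.txt", "category": "sport", '
    '"year_month": "2008_08", "text": "..."}\n'
    '{"id": 2, "filename": "b.txt", "category": "culture", '
    '"year_month": "unknown", "text": "..."}\n'
)

# Parse each line independently; that is the entire JSONL format.
records = [json.loads(line) for line in sample]

print(len(records), records[0]["category"])
```

With pandas, `pd.read_json("kalimat.jsonl", lines=True)` achieves the same in one call.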
The dataset has been randomly split (using a fixed seed for reproducibility) into:
- Training set – 80%
- Validation set – 10%
- Test set – 10%
These splits are provided as:
- `kalimat_train.csv`
- `kalimat_val.csv`
- `kalimat_test.csv`
All CSVs preserve the same column structure as the main file.
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the full corpus
df = pd.read_csv("kalimat.csv", encoding="utf-8")

# 80% train, 20% temporary pool
train_df, temp_df = train_test_split(
    df, test_size=0.20, random_state=42, shuffle=True
)

# Split the pool evenly into 10% validation and 10% test
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, random_state=42, shuffle=True
)

train_df.to_csv("kalimat_train.csv", index=False, encoding="utf-8")
val_df.to_csv("kalimat_val.csv", index=False, encoding="utf-8")
test_df.to_csv("kalimat_test.csv", index=False, encoding="utf-8")
```
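The 80/10/10 proportions of this recipe can be sanity-checked on a small synthetic frame before running it on the full corpus (the frame below is made-up data, not corpus content):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic 100-row frame standing in for the corpus.
df = pd.DataFrame({"text": [f"doc {i}" for i in range(100)]})

# Same two-step recipe as above: 80/20, then 50/50 on the remainder.
train_df, temp_df = train_test_split(df, test_size=0.20, random_state=42, shuffle=True)
val_df, test_df = train_test_split(temp_df, test_size=0.50, random_state=42, shuffle=True)

print(len(train_df), len(val_df), len(test_df))  # 80 10 10
```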
```
kalimat.csv
kalimat.jsonl
kalimat_train.csv
kalimat_val.csv
kalimat_test.csv
kalimat_txt.zip
README.md
```
If you use this dataset in your work, please cite the original Kalimat paper:
El-Haj, M., & Koulali, R. (2013). Kalimat: a multipurpose Arabic corpus. In Second Workshop on Arabic Corpus Linguistics (WACL-2), pp. 22–25.
PDF available here