The refugee claim process is similar across most jurisdictions: the claimant files an application and then attends an interview and/or a hearing, during which they are supported by counsel (legal aid, an NGO, or a private lawyer). A decision is then made by a civil servant or a judge on the grounds of international conventions and their local interpretation. In most jurisdictions the claimant has a right to appeal a first-instance rejection, as well as a right to judicial review. However, to analyse the decision-making process with consistent data, we need to work within a single legal framework.
We chose to focus on the Canadian jurisdiction because the text of some decisions is published online in a well-documented database. Extracting structured features, we train a neural network and run inference on held-out, unseen data.
CanLII provides a dataset of 58,386 refugee cases associated with the Immigration and Refugee Board of Canada. The data is provided online in HTML and can be downloaded as PDF.
Having downloaded the dataset as PDF files, we extract raw strings and process them with a set of rule-based extraction methods (stemming, regular expressions) using the Python modules nltk, re, and spacy to generate a structured dataset.
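A minimal sketch of this step is shown below. The extract_structured_record helper, the field names, and the regular expressions are illustrative assumptions rather than the exact rules used to build AsyLex, and PDF-to-text conversion with pdfminer.six is only an example choice.

import re

import spacy
from nltk.stem import PorterStemmer
from pdfminer.high_level import extract_text  # PDF -> raw string (example choice)

def extract_structured_record(pdf_path):
    """Turn one decision PDF into a partial structured record (illustrative)."""
    raw = extract_text(pdf_path)

    # Example rule: capture the rest of the line after "Date of decision";
    # the exact pattern is an assumption about the first-page layout.
    date_match = re.search(r"Date of decision[:\s]*(.+)", raw)

    # Stem all alphabetic tokens for downstream rule matching.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(tok) for tok in re.findall(r"[A-Za-z]+", raw)]

    # Named entities (persons, dates, locations) via spaCy;
    # assumes the en_core_web_sm model is installed.
    nlp = spacy.load("en_core_web_sm")
    entities = [(ent.text, ent.label_) for ent in nlp(raw).ents]

    return {
        "date_of_decision": date_match.group(1).strip() if date_match else None,
        "stems": stems,
        "entities": entities,
    }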
The structure of the PDFs varies across decisions. Within an individual PDF, we encounter different pages and text blocks such as paragraphs, titles, and headings. For instance, the first page of every PDF contains semi-structured data such as the claimant's name, date of hearing, place of hearing, and date of decision. The last page of the majority of PDFs also contains semi-structured data in the form of what we call keywords, which summarise, for instance, the reason for a decision, the overall outcome, and the gender of the claimant. These keywords, usually on the last page of the PDF, are separated by slashes (/) padded with a varying amount of whitespace.
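As an illustration, such a keyword line can be split with a single regular expression that absorbs the variable whitespace around the slashes; the sample keywords below are hypothetical placeholders for the decision reason, outcome, and gender values.

import re

def parse_keyword_line(line):
    """Split a slash-separated keyword line, tolerating variable whitespace."""
    return [kw.strip() for kw in re.split(r"\s*/\s*", line.strip()) if kw.strip()]

# Hypothetical example of a last-page keyword line:
print(parse_keyword_line("NEGATIVE /  CREDIBILITY   / FEMALE"))
# -> ['NEGATIVE', 'CREDIBILITY', 'FEMALE']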
The dataset collected on 2022-10-24 is available as CSV.
To use the crawler, you need an AWS console account and must provide an aws_access_key_id and aws_access_key_secret in a file called aws_secret.json with the following content:
{
  "aws": {
    "id": "ABCDEFGHIJKLMNOPQRSTUVXYZ",
    "secret": "abcdefghijklmnopqrstuvxyz"
  }
}
To run the crawler from scratch and download HTML files, call
run_crawler()
Several options are available, such as start_year, end_year, and n_threads for parallelisation across CPU cores, as well as helper functions to restart if a previous run failed due to an exception. For more info, see crawler.py.
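A possible invocation, assuming run_crawler is imported from crawler.py and using illustrative parameter values, looks like this:

from crawler import run_crawler

# Crawl decisions for an illustrative year range, using 8 worker threads.
run_crawler(start_year=1996, end_year=2022, n_threads=8)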
If your project uses the AsyLex corpus or the asylex-crawler, please consider citing us:
BibTeX:
@inproceedings{asylex,
  title={AsyLex: A Dataset for Legal Language Processing of Refugee Claims},
  author={Barale, Claire and Klaisoongnoen, Mark and Minervini, Pasquale and Rovatsos, Michael and Bhuta, Nehal},
  booktitle={Proceedings of the Natural Legal Language Processing Workshop 2023},
  year={2023},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2023.nllp-1.24/},
  doi={10.18653/v1/2023.nllp-1.24}
}