CrediBench: a large-scale webgraph dynamic dataset collection pipeline

ekmpa/CrediGraph

CrediGraph

Warning

This repository is no longer maintained and is kept for archival / reference only.

Please refer to the CrediNet organization, where this work is now maintained.

Getting Started

Prerequisites

The project uses uv to manage and lock its dependencies, giving a consistent and reproducible environment. If uv is not installed on your system, see the uv installation instructions.

Note: if you have pip, you can install uv with:

pip install uv

Installation

# Clone the repo

# Enter the repo directory
cd CrediGraph

# Install core dependencies into an isolated environment
uv sync

# The isolated env is .venv
source .venv/bin/activate

Usage

Building Temporal Graphs

The graph construction can be parallelized. We use a configuration of 16 subfolders, distributed across 16 CPUs.

First, cd into construction, then run:

bash pipeline.sh <start_month> <end_month> <number of subfolders>
# e.g.,
bash pipeline.sh 'January 2020' 'February 2020' 16

To submit the pipeline as a Slurm cluster job, use run.sh directly:

sbatch run.sh <start_month> [<end_month>]

For optimal settings, please see construction/README.md.

This will construct the graph in $SCRATCH/crawl-data/CC-MAIN-YYYY-WW/output.
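
As a small illustration of the output layout described above, the following Python sketch resolves the expected output directory for one Common Crawl slice. The helper name `graph_output_dir` and the slice-name check are assumptions made for illustration and are not part of the repository. Only the $SCRATCH/crawl-data/CC-MAIN-YYYY-WW/output pattern comes from this README.

```python
import os
import re
from pathlib import Path

# Slice names follow the CC-MAIN-YYYY-WW pattern used in this README.
_SLICE_RE = re.compile(r"CC-MAIN-\d{4}-\d{2}")


def graph_output_dir(slice_name, scratch=None):
    """Return the expected graph output directory for one slice (hypothetical helper)."""
    if not _SLICE_RE.fullmatch(slice_name):
        raise ValueError(f"unexpected slice name: {slice_name!r}")
    # Fall back to the SCRATCH environment variable when no root is passed.
    root = Path(scratch if scratch is not None else os.environ["SCRATCH"])
    return root / "crawl-data" / slice_name / "output"
```

Calling `graph_output_dir("CC-MAIN-2024-51").exists()` after a run is one way to check that a slice produced output, assuming SCRATCH is set in the environment.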

Processing

To process a graph, cd into construction and run:

bash process.sh "$START_MONTH" "$END_MONTH"

Running Domain Content Extraction

The end-to-end.sh script runs content extraction for one batch of WET files, from start_idx to end_idx:

cd bash_scripts
bash end-to-end.sh CC-Crawls/Dec2024.txt <start_idx> <end_idx> [wet] <seed_list> <spark_table_name>

where

  • Dec2024.txt is a .txt file listing the slice names, one CC-MAIN-YYYY-WW per line.
  • start_idx: the first index (inclusive) to process, out of the 90K WET files
  • end_idx: the last index (inclusive) to process, out of the 90K WET files
  • seed_list: the list of seed domains whose content should be extracted, e.g., the DQR domain list at 'data/dqr/domain_pc1.csv'
  • spark_table_name: the Spark table name, also used as the output folder naming pattern, e.g., content_table

This generates one set of Parquet files per batch under $SCRATCH/spark-warehouse/<spark_table_name>batch_ccmain202451<start_idx>_<end_idx>
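
Because start_idx and end_idx are both inclusive, splitting the full range of WET files into batches is easy to get off by one. The sketch below shows one way to compute the ranges. The function name `inclusive_batches`, the batch size, and the assumption that indices start at 0 are illustrative and not taken from the repository.

```python
def inclusive_batches(total_files, batch_size):
    """Yield (start_idx, end_idx) pairs, both inclusive, covering 0..total_files-1.

    Assumes zero-based indices; adjust if the scripts count from 1.
    """
    if total_files < 0 or batch_size <= 0:
        raise ValueError("total_files must be >= 0 and batch_size must be > 0")
    for start in range(0, total_files, batch_size):
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size, total_files) - 1
        yield start, end
```

Each yielded pair can then be passed as the <start_idx> and <end_idx> arguments of one end-to-end.sh invocation.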

Merging the Extracted Content

To collect the text content and merge it per domain, loop over the generated Parquet files and read each one with pandas.read_parquet(file_path, engine='pyarrow').
Each Parquet file contains the following columns:

  • Domain_Name: the domain name
  • WARC_Target_URI: the web page URI
  • WARC_Identified_Content_Language: list of CC-identified content languages
  • WARC_Date: the date the content was scraped
  • Content_Type: the content type, e.g., text, csv, json
  • Content_Length: the content length in bytes
  • wet_record_txt: the UTF-8 text content
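
The merge step described above can be sketched as follows. This is a minimal illustration and not the repository's own merging code. The function names, the newline separator, and the recursive search for *.parquet files are assumptions. The column names and the pandas.read_parquet call come from this README.

```python
from collections import defaultdict
from pathlib import Path


def merge_text_by_domain(records):
    """Concatenate wet_record_txt values per Domain_Name, in input order."""
    pages = defaultdict(list)
    for record in records:
        text = record["wet_record_txt"]
        # Skip missing or empty page text (None, NaN, or "").
        if isinstance(text, str) and text:
            pages[record["Domain_Name"]].append(text)
    return {domain: "\n".join(texts) for domain, texts in pages.items()}


def iter_parquet_records(batch_dir):
    """Yield one dict per row from every Parquet file under batch_dir."""
    # Imported lazily so merge_text_by_domain works without pandas installed.
    import pandas as pd

    for file_path in sorted(Path(batch_dir).rglob("*.parquet")):
        df = pd.read_parquet(
            file_path, engine="pyarrow", columns=["Domain_Name", "wet_record_txt"]
        )
        yield from df.to_dict("records")
```

With pandas and pyarrow installed, `merge_text_by_domain(iter_parquet_records(batch_dir))` returns a dict mapping each domain to its combined text. Reading only the two needed columns keeps memory use lower on large batches.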

Running MLP Experiments

The required embedding files and dataset must be placed under the data directory. The expected file paths are as follows:

  • Domain text embeddings: data/dqr/labeled_11k_scraped_text_emb.pkl
  • Domain name embeddings: data/dqr/labeled_11k_domainname_emb.pkl
  • DQR dataset (ratings): data/dqr/domain_ratings.csv

Then run:

uv run python tgrag/experiments/mlp_experiments/main.py --target pc1 --embed_type text

Running GNN Baseline Experiment

Given the size of our datasets, the GNN experiments rely on mini-batching. For this we use PyG's neighbor_loader, which requires additional libraries with undocumented build-time dependencies. Users must therefore install these libraries into their own venv, separately from uv sync.

pyg-lib:

uv pip install pyg-lib -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html

PyTorch Sparse:

uv pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.7.0+${CUDA}.html

For more information on installing these additional libraries, see the pyg-lib and PyTorch Sparse documentation.
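
The ${TORCH} and ${CUDA} placeholders in the wheel URLs above must match the installed PyTorch build. The sketch below derives the URL from a version pair. The function name `pyg_wheel_index` is hypothetical, and the code assumes the usual tag convention of "cpu" or "cu" followed by the CUDA version without the dot. Wheels are published only for specific combinations, so the resulting URL is not guaranteed to exist.

```python
def pyg_wheel_index(torch_version, cuda_version=None):
    """Build the data.pyg.org wheel index URL for a torch/CUDA pair."""
    # torch.__version__ may carry a local suffix such as "2.7.0+cu126"; drop it.
    torch_base = torch_version.split("+", 1)[0]
    # torch.version.cuda looks like "12.6", or is None on builds without CUDA.
    cuda_tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{torch_base}+{cuda_tag}.html"
```

With PyTorch installed, `pyg_wheel_index(torch.__version__, torch.version.cuda)` gives the value to pass to the -f option of uv pip install.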

To run our static baseline experiment:

uv run tgrag/experiments/main.py

Alternatively, you can design your own configuration, updating the model parameters:

uv run tgrag/experiments/main.py --config configs/your_config.yaml

To learn more about contributing to CrediGraph, see our contribution guide.


Citation

@article{kondrupsabry2025credibench,
  title={{CrediBench: Building Web-Scale Network Datasets for Information Integrity}},
  author={Kondrup, Emma and Sabry, Sebastian and Abdallah, Hussein and Yang, Zachary and Zhou, James and Pelrine, Kellin and Godbout, Jean-Fran{\c{c}}ois and Bronstein, Michael and Rabbany, Reihaneh and Huang, Shenyang},
  journal={arXiv preprint arXiv:2509.23340},
  year={2025},
  note={New Perspectives in Graph Machine Learning Workshop @ NeurIPS 2025},
  url={https://arxiv.org/abs/2509.23340}
}
