A comprehensive pipeline for fact verification and annotation of LLM-generated responses. This project processes long-form answers from language models, extracts atomic claims, identifies problematic claims (dependent, hallucinated, or self-duplicated), and refines them for fact verification tasks.
This pipeline systematically processes LLM-generated answers to questions, breaking them down into atomic facts and identifying various types of issues:
- Dependent claims: Claims that require additional context to be fully understood
- Hallucinated claims: Claims not supported by the original answer
- Missing relations: Temporal or contingency relationships between events that are missing from extracted facts
```
emnlp2025_verifact/
├── src/                                       # Main pipeline scripts
│   ├── llm_generation_api_calls.py            # Step 1: Generate LLM responses
│   ├── gpt4o_break_down.py                    # Step 2: Break down responses into claims
│   ├── vllm_llama3_modify2self_contained.py   # Step 3: Make claims self-contained
│   ├── llama3.3_refine_single_units_final.py  # Step 6: Refine problematic claims
│   └── utils.py                               # Utility functions and message classes
├── data/                                      # Data processing and storage
│   ├── convert_data_for_annotation.py         # Step 4: Convert to annotation format
│   ├── {model_name}/                          # Model-specific data directories
│   └── gemini/                                # Example model data
├── model_annotation/                          # LLM-judge annotation scripts
│   ├── vllm_dependent_with_subcategories.py   # Step 5a: Annotate dependent claims
│   └── vllm_relation_with_subcategories.py    # Step 5b: Annotate missing relations
└── scripts/                                   # Additional utility scripts
```
Script: src/llm_generation_api_calls.py
Generates responses from LLMs (GPT-4o, Llama, Qwen, etc.) for questions in a dataset. Supports both OpenAI API and local vLLM servers.
Key Features:
- Asynchronous API calls with high concurrency (up to 1096 concurrent requests)
- Automatic retry mechanism for failed requests
- Support for HuggingFace datasets (e.g., FactRBench)
- Configurable model selection (OpenAI API vs local vLLM)
Usage:
```bash
python src/llm_generation_api_calls.py
```
Configuration:
- Set `openai_or_local_api` to `"openai"` or `"local"`
- Configure the `OPENAI_API_KEY` and `OPENAI_API_BASE` environment variables
- Adjust `data_path` for your dataset
- Output is saved to `data/{model_name}_answers.json`
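A minimal sketch of the request pattern this step relies on; the client settings, model id, and retry policy here are illustrative assumptions, not the script's exact code:

```python
import asyncio
from openai import AsyncOpenAI

# Assumed local vLLM endpoint; swap in your own base URL and key.
client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(200)  # cap concurrency for your hardware

async def generate(question: str, retries: int = 3) -> str:
    async with semaphore:
        for attempt in range(retries):
            try:
                resp = await client.chat.completions.create(
                    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed model id
                    messages=[{"role": "user", "content": question}],
                )
                return resp.choices[0].message.content
            except Exception:
                if attempt == retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # back off before retrying

async def main(questions: list[str]) -> list[str]:
    return await asyncio.gather(*(generate(q) for q in questions))
```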
Script: src/gpt4o_break_down.py
Decomposes model responses into atomic facts using LLM-based sentence-by-sentence breakdown.
Key Features:
- Sentence tokenization using NLTK
- Asynchronous processing with concurrency control
- Uses Llama-3.3-70B-Instruct for claim extraction
- Maintains conversation context across sentences
Usage:
```bash
python src/gpt4o_break_down.py <input_file.json>
```
Input: JSONL file with `question` and `answer` fields
Output: JSONL file with an added `break_down` field containing the atomic facts for each sentence
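For reference, a sketch of the sentence-level loop, assuming NLTK's punkt tokenizer; the extraction prompt is paraphrased:

```python
import nltk

nltk.download('punkt')

answer = "The best way to travel from Oakland to Boston depends on budget. Flying is fastest."
for sentence in nltk.sent_tokenize(answer):
    prompt = f"Break the following sentence into atomic facts:\n{sentence}"
    # ...send `prompt` to Llama-3.3-70B-Instruct, keeping prior sentences
    # in the conversation so cross-sentence references can be resolved...
```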
Script: src/vllm_llama3_modify2self_contained.py
Modifies extracted claims to be self-contained by replacing vague references (pronouns, ambiguous entities) with explicit entities from the original response.
Key Features:
- Resolves pronouns and vague references
- Replaces non-full names with full names
- Maintains factual accuracy while improving clarity
- Uses Llama-3.3-70B-Instruct for revision
Usage:
```bash
python src/vllm_llama3_modify2self_contained.py <input_file.json>
```
Input: JSONL file with a `break_down` field
Output: JSONL file with a `modified_sentences` field containing self-contained claims
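As an invented illustration of the revision (reusing the travel example from the formats below):

```python
# Before: the claim leans on an unresolved pronoun.
claim_before = "It takes about four days."
# After: vague references are replaced with explicit entities.
claim_after = "Traveling by train from Oakland to Boston takes about four days."
```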
Script: data/convert_data_for_annotation.py
Converts processed data into a format suitable for LLM-judge annotation, identifying missing spans and formatting claims.
Key Features:
- Identifies text spans missing from extracted claims
- Highlights missing information in responses
- Creates structured JSON format for annotation
- Maps claims to original sentences
Usage:
```bash
python data/convert_data_for_annotation.py <input_file.json> <model_name>
```
Input: JSONL file with a `modified_sentences` field
Output: Individual JSON files in `data/{model_name}/result_{idx}.json` containing:
- `prompt`: Original question
- `orig_response`: Original answer
- `response`: Answer with highlighted missing spans
- `revised_fact_jsonified_all`: List of self-contained claims with metadata
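One plausible way to flag uncovered spans is a word-overlap heuristic like the sketch below; the repository's actual matching logic may differ:

```python
def highlight_missing(sentence: str, claims: list[str]) -> str:
    """Wrap words not covered by any extracted claim in <strong> tags."""
    covered = {w.strip(".,").lower() for c in claims for w in c.split()}
    marked = [
        w if w.strip(".,").lower() in covered else f"<strong>{w}</strong>"
        for w in sentence.split()
    ]
    return " ".join(marked)

print(highlight_missing(
    "The best way to travel is by train.",
    ["The best way to travel is by plane."],
))  # -> "The best way to travel is by <strong>train.</strong>"
```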
Script: model_annotation/vllm_dependent_with_subcategories.py
Annotates claims to identify dependent claims (those requiring additional context) and categorizes them into subcategories:
- Ambiguous Concepts/Pronouns: Vague terms or pronouns lacking clear referents
- Missing Comparison: Implied comparisons without explicit targets
- Lack of Condition/Sources: Missing temporal conditions, hypothetical scenarios, or sources/references
Key Features:
- Uses Llama-3.3-70B-Instruct as annotator
- High concurrency (200 concurrent requests)
- Structured annotation format with explanations
- Categorizes dependency types
Usage:
```bash
python model_annotation/vllm_dependent_with_subcategories.py
```
Configuration:
- Modify the `model_need_to_be_annotated` list to specify target models
- Adjust `max_concurrency` for your hardware

Output: Annotation files in `annotations_{annotator_model}_{target_model}/annotations_{instance_id}.json`
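The judge prompt is roughly of the following shape (paraphrased; the script's exact wording and category definitions may differ):

```python
CATEGORIES = [
    "Ambiguous Concepts/Pronouns",
    "Missing Comparison",
    "Lack of Condition/Sources",
]

def build_dependency_prompt(claim: str) -> str:
    # Ask the annotator model to judge one claim and justify its label.
    options = "; ".join(CATEGORIES)
    return (
        "Does the claim below require additional context to be fully understood? "
        f"If so, choose one category ({options}) and explain briefly.\n"
        f"Claim: {claim}"
    )
```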
Script: model_annotation/vllm_relation_with_subcategories.py
Identifies temporal and contingency relationships between events in a response that are not captured by the extracted claims.
Key Features:
- Detects PDTB-based discourse relations (Temporal, Contingency, Comparison, Expansion)
- Two-stage detection: relation identification and fact coverage check
- Uses Qwen-2.5-32B-Instruct for annotation
- Identifies spans with missing relationships
Usage:
```bash
python model_annotation/vllm_relation_with_subcategories.py
```
Output: Annotation files in `annotations_missingR_{annotator_model}_{target_model}/annotations_{instance_id}.json`
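A rough outline of the two stages, with invented prompt wording:

```python
RELATION_TYPES = ["Temporal", "Contingency", "Comparison", "Expansion"]

def stage1_prompt(response: str) -> str:
    # Stage 1: ask the judge to list discourse relations between events.
    types = ", ".join(RELATION_TYPES)
    return f"List any {types} relations between events in this response:\n{response}"

def stage2_prompt(relation: str, facts: list[str]) -> str:
    # Stage 2: check whether any extracted fact already covers the relation.
    fact_list = "\n".join(f"- {f}" for f in facts)
    return (
        f"Is the relation '{relation}' captured by any fact below? Answer yes or no.\n"
        f"{fact_list}"
    )
```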
Script: src/llama3.3_refine_single_units_final.py
Refines problematic claims identified in Step 5 by making them self-contained based on annotation feedback.
Key Features:
- Merges dependency and hallucination annotations
- Refines claims based on error types and explanations
- Maintains minimal word count while adding necessary context
- Uses Llama-3.3-70B-Instruct for refinement
Usage:
```bash
python src/llama3.3_refine_single_units_final.py <model_name>
```
Input:
- Data files in `data/{model_name}/result_{idx}.json`
- Annotation files from `model_annotation/annotations_{annotator}_{model_name}/`

Output: Refined data in `data/final_refine_data/{model_name}_refine_units/result_{idx}.json` with a `refined_units` field
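A sketch of gathering the annotations to merge, following the paths above; the grouping key is an assumption:

```python
import glob
import json
from collections import defaultdict

def load_annotations(pattern: str) -> dict:
    """Group annotation records by (instance_id, fact_id)."""
    by_claim = defaultdict(list)
    for path in glob.glob(pattern):
        with open(path) as f:
            for record in json.load(f):
                by_claim[(record["instance_id"], record["fact_id"])].append(record)
    return by_claim

issues = load_annotations(
    "model_annotation/annotations_llama3.3_70b_gemini/annotations_*.json"
)
```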
- `OPENAI_API_KEY`: API key for OpenAI services (if using the OpenAI API)
- `OPENAI_API_BASE`: Base URL for the API (a local vLLM server typically uses `http://localhost:1234/v1`)
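For example (placeholder values; a local vLLM server usually ignores the key):

```bash
export OPENAI_API_KEY="EMPTY"
export OPENAI_API_BASE="http://localhost:1234/v1"
```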
Most scripts use local vLLM servers. Ensure you have:
- A vLLM server running on `localhost:1234`
- Appropriate models loaded (e.g., Llama-3.3-70B-Instruct, Qwen-2.5-32B-Instruct)
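One way to start a compatible server (the model id and port here are assumptions; match them to your setup):

```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 1234
```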
- Input data: HuggingFace datasets or JSONL files
- Output data: `data/` directory with model-specific subdirectories
- Annotations: `model_annotation/` directory
- `aiohttp`: Asynchronous HTTP client
- `asyncio`: Asynchronous programming (standard library)
- `openai`: OpenAI API client
- `transformers`: HuggingFace transformers
- `datasets`: HuggingFace datasets
- `nltk`: Natural language processing utilities
- `tqdm`: Progress bars
- `pandas`: Data manipulation
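A one-line install covering the third-party packages (`asyncio` ships with Python):

```bash
pip install aiohttp openai transformers datasets nltk tqdm pandas
```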
Ensure NLTK data is downloaded:

```python
import nltk
nltk.download('punkt')
```

Input format:

```json
{
"question": "What is the best way to travel from Oakland to Boston?",
"answer": "The best way to travel..."
}
```

Output format:

```json
{
"prompt": "What is...",
"orig_response": "The best way...",
"response": "The <strong>best way</strong> to travel...",
"revised_fact_jsonified_all": [
{
"atomic_unit": "",
"revised_unit": "The best way to travel from Oakland to Boston depends on budget.",
"model_response": "",
"orig_sentence": "The best way to travel..."
}
]
}
```

Annotation format:

```json
[
{
"instance_id": "1",
"model_name": "gemini",
"annotator": "llama3.3_70b",
"fact_id": 0,
"error_type": "dependent",
"dependent_type": "Missing Comparison",
"explanation": "The claim mentions 'a longer journey' but does not specify the comparison."
}
]
```

Quick start:

1. Generate responses:
   ```bash
   python src/llm_generation_api_calls.py
   ```
2. Break down into claims:
   ```bash
   python src/gpt4o_break_down.py data/gemini_answers.json
   ```
3. Make claims self-contained:
   ```bash
   python src/vllm_llama3_modify2self_contained.py data/gemini_answers_break_down_by_llama70b.json
   ```
4. Convert for annotation:
   ```bash
   python data/convert_data_for_annotation.py data/gemini_answers_break_down_by_llama70b.modified.json gemini
   ```
5. Annotate claims:
   ```bash
   python model_annotation/vllm_dependent_with_subcategories.py
   python model_annotation/vllm_relation_with_subcategories.py
   ```
6. Refine problematic claims:
   ```bash
   python src/llama3.3_refine_single_units_final.py gemini
   ```
- All scripts are designed to run with local vLLM servers for cost efficiency
- High concurrency limits may need adjustment based on your hardware
- The pipeline processes large datasets and may take significant time
- Error handling and retries are built into most scripts for robustness
- Some scripts require command-line arguments; check script headers for details
If you use this codebase, please cite the associated EMNLP 2025 paper.
```bibtex
@article{xin2025,
  title={VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts},
  author={Xin Liu and Lechen Zhang and Sheza Munir and Yiyang Gu and Lu Wang},
  year={2025},
  journal={arXiv preprint arXiv:2505.09701},
  url={https://arxiv.org/abs/2505.09701}
}
```