
Code for paper VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

launchnlp/VeriFact


VeriFact

A comprehensive pipeline for fact verification and annotation of LLM-generated responses. This project processes long-form answers from language models, extracts atomic claims, identifies problematic claims (dependent, hallucinated, or self-duplicated) along with missing relations, and refines them for fact-verification tasks.

Overview

This pipeline systematically processes LLM-generated answers to questions, breaking them down into atomic facts and identifying various types of issues:

  • Dependent claims: Claims that require additional context to be fully understood
  • Hallucinated claims: Claims not supported by the original answer
  • Missing relations: Temporal or contingency relationships between events that are missing from extracted facts

Project Structure

emnlp2025_verifact/
├── src/                          # Main pipeline scripts
│   ├── llm_generation_api_calls.py         # Step 1: Generate LLM responses
│   ├── gpt4o_break_down.py                # Step 2: Break down responses into claims
│   ├── vllm_llama3_modify2self_contained.py # Step 3: Make claims self-contained
│   ├── llama3.3_refine_single_units_final.py # Step 6: Refine problematic claims
│   └── utils.py                            # Utility functions and message classes
├── data/                         # Data processing and storage
│   ├── convert_data_for_annotation.py      # Step 4: Convert for annotation format
│   ├── {model_name}/                       # Model-specific data directories
│   └── gemini/                             # Example model data
├── model_annotation/             # LLM-judge annotation scripts
│   ├── vllm_dependent_with_subcategories.py # Step 5a: Annotate dependent claims
│   └── vllm_relation_with_subcategories.py  # Step 5b: Annotate missing relations
└── scripts/                      # Additional utility scripts

Pipeline Steps

Step 1: Generate Model Responses

Script: src/llm_generation_api_calls.py

Generates responses from LLMs (GPT-4o, Llama, Qwen, etc.) for questions in a dataset. Supports both OpenAI API and local vLLM servers.

Key Features:

  • Asynchronous API calls with high concurrency (up to 1096 concurrent requests)
  • Automatic retry mechanism for failed requests
  • Support for HuggingFace datasets (e.g., FactRBench)
  • Configurable model selection (OpenAI API vs local vLLM)

Usage:

python src/llm_generation_api_calls.py

Configuration:

  • Set openai_or_local_api to "openai" or "local"
  • Configure OPENAI_API_KEY and OPENAI_API_BASE environment variables
  • Adjust data_path for your dataset
  • Output saved to data/{model_name}_answers.json
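The concurrency pattern used by this step can be sketched as follows; call_model, call_with_retry, and the prompts are hypothetical stand-ins for the script's actual request logic, not its real function names:

```python
import asyncio

async def call_model(prompt: str) -> str:
    # Stand-in for the real OpenAI/vLLM chat-completion request.
    await asyncio.sleep(0)
    return f"answer to: {prompt}"

async def call_with_retry(prompt: str, retries: int = 3):
    # Retry failed requests with exponential backoff.
    for attempt in range(retries):
        try:
            return await call_model(prompt)
        except Exception:
            await asyncio.sleep(2 ** attempt)
    return None

async def generate_all(prompts, max_concurrency: int = 8):
    # Bound the number of in-flight requests with a semaphore.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt):
        async with sem:
            return await call_with_retry(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))

answers = asyncio.run(generate_all(["q1", "q2"]))
```

Raising max_concurrency only helps up to the server's own request limits; the retry loop absorbs transient failures either way.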

Step 2: Break Down Responses into Claims

Script: src/gpt4o_break_down.py

Decomposes model responses into atomic facts using LLM-based sentence-by-sentence breakdown.

Key Features:

  • Sentence tokenization using NLTK
  • Asynchronous processing with concurrency control
  • Uses Llama-3.3-70B-Instruct for claim extraction
  • Maintains conversation context across sentences

Usage:

python src/gpt4o_break_down.py <input_file.json>

Input: JSONL file with question and answer fields
Output: JSONL file with added break_down field containing atomic facts per sentence
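The JSONL convention can be illustrated with a minimal sketch; the extract callable below is a hypothetical placeholder for the actual LLM-based breakdown:

```python
import io
import json

def add_break_down(in_stream, out_stream, extract):
    # Read one JSON record per line, attach a break_down field,
    # and write the augmented record back out as JSONL.
    for line in in_stream:
        record = json.loads(line)
        record["break_down"] = extract(record["answer"])
        out_stream.write(json.dumps(record) + "\n")

src = io.StringIO('{"question": "Q?", "answer": "A. B."}\n')
dst = io.StringIO()
add_break_down(src, dst, lambda answer: [[s] for s in answer.split(". ")])
```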


Step 3: Make Claims Self-Contained

Script: src/vllm_llama3_modify2self_contained.py

Modifies extracted claims to be self-contained by replacing vague references (pronouns, ambiguous entities) with explicit entities from the original response.

Key Features:

  • Resolves pronouns and vague references
  • Expands abbreviated or partial names to full names
  • Maintains factual accuracy while improving clarity
  • Uses Llama-3.3-70B-Instruct for revision

Usage:

python src/vllm_llama3_modify2self_contained.py <input_file.json>

Input: JSONL file with break_down field
Output: JSONL file with modified_sentences field containing self-contained claims
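As a toy illustration of what "self-contained" means here (the real script delegates the rewriting to Llama-3.3-70B rather than doing string substitution; the names below are invented examples):

```python
def make_self_contained(claim: str, referents: dict) -> str:
    # Toy substitution of vague references with explicit entities;
    # the actual pipeline performs this rewriting with an LLM.
    for vague, explicit in referents.items():
        claim = claim.replace(vague, explicit)
    return claim

revised = make_self_contained("He founded it in 1976.",
                              {"He": "Steve Jobs", "it": "Apple"})
```

A claim like "He founded it in 1976." is useless for verification in isolation; after resolution it can be checked against a reference on its own.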


Step 4: Convert Data for Annotation

Script: data/convert_data_for_annotation.py

Converts processed data into format suitable for LLM-judge annotation, identifying missing spans and formatting claims.

Key Features:

  • Identifies text spans missing from extracted claims
  • Highlights missing information in responses
  • Creates structured JSON format for annotation
  • Maps claims to original sentences

Usage:

python data/convert_data_for_annotation.py <input_file.json> <model_name>

Input: JSONL file with modified_sentences field
Output: Individual JSON files in data/{model_name}/result_{idx}.json containing:

  • prompt: Original question
  • orig_response: Original answer
  • response: Answer with highlighted missing spans
  • revised_fact_jsonified_all: List of self-contained claims with metadata
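One simple way to locate uncovered spans is a character-level diff between the response and the concatenated claims. This is only a sketch of the idea, not the script's exact matching logic:

```python
import difflib

def missing_spans(response: str, claims: list) -> list:
    # Spans of the response absent from the concatenated claims
    # are candidates for "missing information" highlighting.
    covered = " ".join(claims)
    matcher = difflib.SequenceMatcher(None, response, covered)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace") and i2 - i1 > 3:
            spans.append(response[i1:i2])
    return spans

spans = missing_spans("Hello world, notably fast.", ["Hello world."])
```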

Step 5: LLM-Judge Annotation

5a. Dependent Claim Annotation

Script: model_annotation/vllm_dependent_with_subcategories.py

Annotates claims to identify dependent claims (those requiring additional context) and categorizes them into subcategories:

  • Ambiguous Concepts/Pronouns: Vague terms or pronouns lacking clear referents
  • Missing Comparison: Implied comparisons without explicit targets
  • Lack of Condition/Sources: Missing temporal conditions, hypothetical scenarios, or sources/references

Key Features:

  • Uses Llama-3.3-70B-Instruct as annotator
  • High concurrency (200 concurrent requests)
  • Structured annotation format with explanations
  • Categorizes dependency types

Usage:

python model_annotation/vllm_dependent_with_subcategories.py

Configuration:

  • Modify model_need_to_be_annotated list to specify target models
  • Adjust max_concurrency for your hardware

Output: Annotation files in annotations_{annotator_model}_{target_model}/annotations_{instance_id}.json

5b. Relation Annotation

Script: model_annotation/vllm_relation_with_subcategories.py

Identifies missing temporal and contingency relationships between events in responses that are not captured in extracted claims.

Key Features:

  • Detects PDTB-based discourse relations (Temporal, Contingency, Comparison, Expansion)
  • Two-stage detection: relation identification and fact coverage check
  • Uses Qwen-2.5-32B-Instruct for annotation
  • Identifies spans with missing relationships

Usage:

python model_annotation/vllm_relation_with_subcategories.py

Output: Annotation files in annotations_missingR_{annotator_model}_{target_model}/annotations_{instance_id}.json


Step 6: Refine Problematic Claims

Script: src/llama3.3_refine_single_units_final.py

Refines problematic claims identified in Step 5 by making them self-contained based on annotation feedback.

Key Features:

  • Merges dependency and hallucination annotations
  • Refines claims based on error types and explanations
  • Keeps added wording minimal while supplying the necessary context
  • Uses Llama-3.3-70B-Instruct for refinement

Usage:

python src/llama3.3_refine_single_units_final.py <model_name>

Input:

  • Data files in data/{model_name}/result_{idx}.json
  • Annotation files from model_annotation/annotations_{annotator}_{model_name}/

Output: Refined data in data/final_refine_data/{model_name}_refine_units/result_{idx}.json with refined_units field
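The merging step can be sketched roughly as follows; the field names match the annotation format documented in this README, but the grouping logic itself is an assumption about the script's internals:

```python
from collections import defaultdict

def merge_annotations(dependent_anns, hallucination_anns):
    # Group both annotation streams by fact_id so each claim is
    # refined with every error type and explanation that applies.
    merged = defaultdict(list)
    for ann in list(dependent_anns) + list(hallucination_anns):
        merged[ann["fact_id"]].append(
            {"error_type": ann["error_type"],
             "explanation": ann.get("explanation", "")})
    return dict(merged)

merged = merge_annotations(
    [{"fact_id": 0, "error_type": "dependent",
      "explanation": "vague pronoun"}],
    [{"fact_id": 0, "error_type": "hallucinated"}])
```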

Configuration

Environment Variables

  • OPENAI_API_KEY: API key for OpenAI services (if using OpenAI API)
  • OPENAI_API_BASE: Base URL for API (local vLLM typically uses http://localhost:1234/v1)
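In Python these are typically read via os.environ; the defaults below are assumptions matching the local vLLM setup described in this README (a default vLLM server does not validate the key, so any placeholder works):

```python
import os

# Fall back to the local vLLM endpoint when the variables are unset.
api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")
api_base = os.environ.get("OPENAI_API_BASE", "http://localhost:1234/v1")
```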

Model Configuration

Most scripts use local vLLM servers. Ensure you have:

  • A vLLM server running on localhost:1234
  • Appropriate models loaded (e.g., Llama-3.3-70B-Instruct, Qwen-2.5-32B-Instruct)

Data Paths

  • Input data: HuggingFace datasets or JSONL files
  • Output data: data/ directory with model-specific subdirectories
  • Annotations: model_annotation/ directory

Dependencies

Python Packages

  • aiohttp: Asynchronous HTTP client
  • asyncio: Asynchronous programming
  • openai: OpenAI API client
  • transformers: HuggingFace transformers
  • datasets: HuggingFace datasets
  • nltk: Natural language processing utilities
  • tqdm: Progress bars
  • pandas: Data manipulation

NLTK Data

Ensure the required NLTK tokenizer data is downloaded:

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases

Data Format

Input Format (Step 1)

{
  "question": "What is the best way to travel from Oakland to Boston?",
  "answer": "The best way to travel..."
}

Intermediate Format (Step 4)

{
  "prompt": "What is...",
  "orig_response": "The best way...",
  "response": "The <strong>best way</strong> to travel...",
  "revised_fact_jsonified_all": [
    {
      "atomic_unit": "",
      "revised_unit": "The best way to travel from Oakland to Boston depends on budget.",
      "model_response": "",
      "orig_sentence": "The best way to travel..."
    }
  ]
}

Annotation Format (Step 5)

[
  {
    "instance_id": "1",
    "model_name": "gemini",
    "annotator": "llama3.3_70b",
    "fact_id": 0,
    "error_type": "dependent",
    "dependent_type": "Missing Comparison",
    "explanation": "The claim mentions 'a longer journey' but does not specify the comparison."
  }
]
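Downstream scripts can consume these records directly. For example, a hypothetical tally of dependent-claim subcategories (the records below are made-up samples in the format above):

```python
import json
from collections import Counter

annotations = json.loads("""[
  {"fact_id": 0, "error_type": "dependent",
   "dependent_type": "Missing Comparison"},
  {"fact_id": 1, "error_type": "dependent",
   "dependent_type": "Ambiguous Concepts/Pronouns"},
  {"fact_id": 2, "error_type": "dependent",
   "dependent_type": "Missing Comparison"}
]""")

subcategory_counts = Counter(
    ann["dependent_type"] for ann in annotations
    if ann["error_type"] == "dependent")
```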

Example Workflow

  1. Generate responses:

    python src/llm_generation_api_calls.py
  2. Break down into claims:

    python src/gpt4o_break_down.py data/gemini_answers.json
  3. Make claims self-contained:

    python src/vllm_llama3_modify2self_contained.py data/gemini_answers_break_down_by_llama70b.json
  4. Convert for annotation:

    python data/convert_data_for_annotation.py data/gemini_answers_break_down_by_llama70b.modified.json gemini
  5. Annotate claims:

    python model_annotation/vllm_dependent_with_subcategories.py
    python model_annotation/vllm_relation_with_subcategories.py
  6. Refine problematic claims:

    python src/llama3.3_refine_single_units_final.py gemini

Notes

  • All scripts are designed to run with local vLLM servers for cost efficiency
  • High concurrency limits may need adjustment based on your hardware
  • The pipeline processes large datasets and may take significant time
  • Error handling and retries are built into most scripts for robustness
  • Some scripts require command-line arguments; check script headers for details

Citation

If you use this codebase, please cite the associated EMNLP 2025 paper.

@article{liu2025verifact,
      title={VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts}, 
      author={Xin Liu and Lechen Zhang and Sheza Munir and Yiyang Gu and Lu Wang},
      year={2025},
      journal={arXiv preprint arXiv:2505.09701},
      url={https://arxiv.org/abs/2505.09701}, 
}
