A comprehensive pipeline for fact verification and annotation of LLM-generated responses. This project processes long-form answers from language models, extracts atomic claims, identifies problematic claims (dependent, hallucinated, or self-duplicated), and refines them for fact verification tasks.
This pipeline systematically processes LLM-generated answers to questions, breaking them down into atomic facts and identifying various types of issues:
- Dependent claims: Claims that require additional context to be fully understood
- Hallucinated claims: Claims not supported by the original answer
- Missing relations: Temporal or contingency relationships between events that are missing from extracted facts
```
emnlp2025_verifact/
├── src/                                       # Main pipeline scripts
│   ├── llm_generation_api_calls.py            # Step 1: Generate LLM responses
│   ├── gpt4o_break_down.py                    # Step 2: Break down responses into claims
│   ├── vllm_llama3_modify2self_contained.py   # Step 3: Make claims self-contained
│   ├── llama3.3_refine_single_units_final.py  # Step 6: Refine problematic claims
│   └── utils.py                               # Utility functions and message classes
├── data/                                      # Data processing and storage
│   ├── convert_data_for_annotation.py         # Step 4: Convert to annotation format
│   ├── {model_name}/                          # Model-specific data directories
│   └── gemini/                                # Example model data
├── model_annotation/                          # LLM-judge annotation scripts
│   ├── vllm_dependent_with_subcategories.py   # Step 5a: Annotate dependent claims
│   └── vllm_relation_with_subcategories.py    # Step 5b: Annotate missing relations
└── scripts/                                   # Additional utility scripts
```
Script: src/llm_generation_api_calls.py
Generates responses from LLMs (GPT-4o, Llama, Qwen, etc.) for questions in a dataset. Supports both OpenAI API and local vLLM servers.
Key Features:
- Asynchronous API calls with high concurrency (up to 1096 concurrent requests)
- Automatic retry mechanism for failed requests
- Support for HuggingFace datasets (e.g., FactRBench)
- Configurable model selection (OpenAI API vs local vLLM)
Usage:
```bash
python src/llm_generation_api_calls.py
```
Configuration:
- Set `openai_or_local_api` to `"openai"` or `"local"`
- Configure the `OPENAI_API_KEY` and `OPENAI_API_BASE` environment variables
- Adjust `data_path` for your dataset
- Output is saved to `data/{model_name}_answers.json`
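A minimal sketch of the request pattern this step relies on; the client settings, model id, and retry policy here are illustrative assumptions, not the script's exact code:

```python
import asyncio
from openai import AsyncOpenAI

# Assumed local vLLM endpoint; swap in your own base URL and key.
client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(200)  # cap concurrency for your hardware

async def generate(question: str, retries: int = 3) -> str:
    async with semaphore:
        for attempt in range(retries):
            try:
                resp = await client.chat.completions.create(
                    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed model id
                    messages=[{"role": "user", "content": question}],
                )
                return resp.choices[0].message.content
            except Exception:
                if attempt == retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # back off before retrying

async def main(questions: list[str]) -> list[str]:
    return await asyncio.gather(*(generate(q) for q in questions))
```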
Script: src/gpt4o_break_down.py
Decomposes model responses into atomic facts using LLM-based sentence-by-sentence breakdown.
Key Features:
- Sentence tokenization using NLTK
- Asynchronous processing with concurrency control
- Uses Llama-3.3-70B-Instruct for claim extraction
- Maintains conversation context across sentences
Usage:
```bash
python src/gpt4o_break_down.py <input_file.json>
```
Input: JSONL file with `question` and `answer` fields
Output: JSONL file with an added `break_down` field containing the atomic facts for each sentence
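For reference, a sketch of the sentence-level loop, assuming NLTK's punkt tokenizer; the extraction prompt is paraphrased:

```python
import nltk

nltk.download('punkt')

answer = "The best way to travel from Oakland to Boston depends on budget. Flying is fastest."
for sentence in nltk.sent_tokenize(answer):
    prompt = f"Break the following sentence into atomic facts:\n{sentence}"
    # ...send `prompt` to Llama-3.3-70B-Instruct, keeping prior sentences
    # in the conversation so cross-sentence references can be resolved...
```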
Script: src/vllm_llama3_modify2self_contained.py
Modifies extracted claims to be self-contained by replacing vague references (pronouns, ambiguous entities) with explicit entities from the original response.
Key Features:
- Resolves pronouns and vague references
- Replaces non-full names with full names
- Maintains factual accuracy while improving clarity
- Uses Llama-3.3-70B-Instruct for revision
Usage:
```bash
python src/vllm_llama3_modify2self_contained.py <input_file.json>
```
Input: JSONL file with a `break_down` field
Output: JSONL file with a `modified_sentences` field containing self-contained claims
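As an invented illustration of the revision (reusing the travel example from the formats below):

```python
# Before: the claim leans on an unresolved pronoun.
claim_before = "It takes about four days."
# After: vague references are replaced with explicit entities.
claim_after = "Traveling by train from Oakland to Boston takes about four days."
```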
Script: data/convert_data_for_annotation.py
Converts processed data into a format suitable for LLM-judge annotation, identifying missing spans and formatting claims.
Key Features:
- Identifies text spans missing from extracted claims
- Highlights missing information in responses
- Creates structured JSON format for annotation
- Maps claims to original sentences
Usage:
```bash
python data/convert_data_for_annotation.py <input_file.json> <model_name>
```
Input: JSONL file with a `modified_sentences` field
Output: Individual JSON files in `data/{model_name}/result_{idx}.json` containing:
- `prompt`: Original question
- `orig_response`: Original answer
- `response`: Answer with highlighted missing spans
- `revised_fact_jsonified_all`: List of self-contained claims with metadata
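One plausible way to flag uncovered spans is a word-overlap heuristic like the sketch below; the repository's actual matching logic may differ:

```python
def highlight_missing(sentence: str, claims: list[str]) -> str:
    """Wrap words not covered by any extracted claim in <strong> tags."""
    covered = {w.strip(".,").lower() for c in claims for w in c.split()}
    marked = [
        w if w.strip(".,").lower() in covered else f"<strong>{w}</strong>"
        for w in sentence.split()
    ]
    return " ".join(marked)

print(highlight_missing(
    "The best way to travel is by train.",
    ["The best way to travel is by plane."],
))  # -> "The best way to travel is by <strong>train.</strong>"
```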
Script: model_annotation/vllm_dependent_with_subcategories.py
Annotates claims to identify dependent claims (those requiring additional context) and categorizes them into subcategories:
- Ambiguous Concepts/Pronouns: Vague terms or pronouns lacking clear referents
- Missing Comparison: Implied comparisons without explicit targets
- Lack of Condition/Sources: Missing temporal conditions, hypothetical scenarios, or sources/references
Key Features:
- Uses Llama-3.3-70B-Instruct as annotator
- High concurrency (200 concurrent requests)
- Structured annotation format with explanations
- Categorizes dependency types
Usage:
```bash
python model_annotation/vllm_dependent_with_subcategories.py
```
Configuration:
- Modify the `model_need_to_be_annotated` list to specify target models
- Adjust `max_concurrency` for your hardware

Output: Annotation files in `annotations_{annotator_model}_{target_model}/annotations_{instance_id}.json`
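The judge prompt is roughly of the following shape (paraphrased; the script's exact wording and category definitions may differ):

```python
CATEGORIES = [
    "Ambiguous Concepts/Pronouns",
    "Missing Comparison",
    "Lack of Condition/Sources",
]

def build_dependency_prompt(claim: str) -> str:
    # Ask the annotator model to judge one claim and justify its label.
    options = "; ".join(CATEGORIES)
    return (
        "Does the claim below require additional context to be fully understood? "
        f"If so, choose one category ({options}) and explain briefly.\n"
        f"Claim: {claim}"
    )
```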
Script: model_annotation/vllm_relation_with_subcategories.py
Identifies temporal and contingency relationships between events in a response that are not captured by the extracted claims.
Key Features:
- Detects PDTB-based discourse relations (Temporal, Contingency, Comparison, Expansion)
- Two-stage detection: relation identification and fact coverage check
- Uses Qwen-2.5-32B-Instruct for annotation
- Identifies spans with missing relationships
Usage:
```bash
python model_annotation/vllm_relation_with_subcategories.py
```
Output: Annotation files in `annotations_missingR_{annotator_model}_{target_model}/annotations_{instance_id}.json`
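A rough outline of the two stages, with invented prompt wording:

```python
RELATION_TYPES = ["Temporal", "Contingency", "Comparison", "Expansion"]

def stage1_prompt(response: str) -> str:
    # Stage 1: ask the judge to list discourse relations between events.
    types = ", ".join(RELATION_TYPES)
    return f"List any {types} relations between events in this response:\n{response}"

def stage2_prompt(relation: str, facts: list[str]) -> str:
    # Stage 2: check whether any extracted fact already covers the relation.
    fact_list = "\n".join(f"- {f}" for f in facts)
    return (
        f"Is the relation '{relation}' captured by any fact below? Answer yes or no.\n"
        f"{fact_list}"
    )
```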
Script: src/llama3.3_refine_single_units_final.py
Refines problematic claims identified in Step 5 by making them self-contained based on annotation feedback.
Key Features:
- Merges dependency and hallucination annotations
- Refines claims based on error types and explanations
- Maintains minimal word count while adding necessary context
- Uses Llama-3.3-70B-Instruct for refinement
Usage:
```bash
python src/llama3.3_refine_single_units_final.py <model_name>
```
Input:
- Data files in `data/{model_name}/result_{idx}.json`
- Annotation files from `model_annotation/annotations_{annotator}_{model_name}/`

Output: Refined data in `data/final_refine_data/{model_name}_refine_units/result_{idx}.json` with a `refined_units` field
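A sketch of gathering the annotations to merge, following the paths above; the grouping key is an assumption:

```python
import glob
import json
from collections import defaultdict

def load_annotations(pattern: str) -> dict:
    """Group annotation records by (instance_id, fact_id)."""
    by_claim = defaultdict(list)
    for path in glob.glob(pattern):
        with open(path) as f:
            for record in json.load(f):
                by_claim[(record["instance_id"], record["fact_id"])].append(record)
    return by_claim

issues = load_annotations(
    "model_annotation/annotations_llama3.3_70b_gemini/annotations_*.json"
)
```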
- `OPENAI_API_KEY`: API key for OpenAI services (if using the OpenAI API)
- `OPENAI_API_BASE`: Base URL for the API (a local vLLM server typically uses `http://localhost:1234/v1`)
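For example (placeholder values; a local vLLM server usually ignores the key):

```bash
export OPENAI_API_KEY="EMPTY"
export OPENAI_API_BASE="http://localhost:1234/v1"
```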
Most scripts use local vLLM servers. Ensure you have:
- A vLLM server running on `localhost:1234`
- Appropriate models loaded (e.g., Llama-3.3-70B-Instruct, Qwen-2.5-32B-Instruct)
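One way to start a compatible server (the model id and port here are assumptions; match them to your setup):

```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 1234
```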
- Input data: HuggingFace datasets or JSONL files
- Output data: `data/` directory with model-specific subdirectories
- Annotations: `model_annotation/` directory
- `aiohttp`: Asynchronous HTTP client
- `asyncio`: Asynchronous programming (standard library)
- `openai`: OpenAI API client
- `transformers`: HuggingFace transformers
- `datasets`: HuggingFace datasets
- `nltk`: Natural language processing utilities
- `tqdm`: Progress bars
- `pandas`: Data manipulation
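A one-line install covering the third-party packages (`asyncio` ships with Python):

```bash
pip install aiohttp openai transformers datasets nltk tqdm pandas
```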
Ensure NLTK data is downloaded:

```python
import nltk
nltk.download('punkt')
```

Input format:

```json
{
"question": "What is the best way to travel from Oakland to Boston?",
"answer": "The best way to travel..."
}
```

Output format:

```json
{
"prompt": "What is...",
"orig_response": "The best way...",
"response": "The <strong>best way</strong> to travel...",
"revised_fact_jsonified_all": [
{
"atomic_unit": "",
"revised_unit": "The best way to travel from Oakland to Boston depends on budget.",
"model_response": "",
"orig_sentence": "The best way to travel..."
}
]
}
```

Annotation format:

```json
[
{
"instance_id": "1",
"model_name": "gemini",
"annotator": "llama3.3_70b",
"fact_id": 0,
"error_type": "dependent",
"dependent_type": "Missing Comparison",
"explanation": "The claim mentions 'a longer journey' but does not specify the comparison."
}
]
```

Quick start:

1. Generate responses:
   ```bash
   python src/llm_generation_api_calls.py
   ```
2. Break down into claims:
   ```bash
   python src/gpt4o_break_down.py data/gemini_answers.json
   ```
3. Make claims self-contained:
   ```bash
   python src/vllm_llama3_modify2self_contained.py data/gemini_answers_break_down_by_llama70b.json
   ```
4. Convert for annotation:
   ```bash
   python data/convert_data_for_annotation.py data/gemini_answers_break_down_by_llama70b.modified.json gemini
   ```
5. Annotate claims:
   ```bash
   python model_annotation/vllm_dependent_with_subcategories.py
   python model_annotation/vllm_relation_with_subcategories.py
   ```
6. Refine problematic claims:
   ```bash
   python src/llama3.3_refine_single_units_final.py gemini
   ```
- All scripts are designed to run with local vLLM servers for cost efficiency
- High concurrency limits may need adjustment based on your hardware
- The pipeline processes large datasets and may take significant time
- Error handling and retries are built into most scripts for robustness
- Some scripts require command-line arguments; check script headers for details
If you use this codebase, please cite the associated EMNLP 2025 paper.
```bibtex
@article{xin2025,
  title={VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts},
  author={Xin Liu and Lechen Zhang and Sheza Munir and Yiyang Gu and Lu Wang},
  year={2025},
  journal={arXiv preprint arXiv:2505.09701},
  url={https://arxiv.org/abs/2505.09701}
}
```