Code and benchmark for LitCab: Lightweight Calibration of Language Models on Outputs of Varied Lengths
The directories for the CaT datasets are listed below:
| Dataset | Directory |
|---|---|
| NQ | NQ |
| SciQ | sciq |
| TriviaQA | triviaqa |
| TruthfulQA | truthfulqa |
| WikiQA | wikiqa |
| BioGen | name_bio |
| WikiGen | factuality_prompt |
The training and test files within each dataset's directory are:
| Dataset | Train | Test |
|---|---|---|
| NQ | train.txt | test.txt |
| SciQ | train.txt | test.txt |
| TriviaQA | train.txt | test.txt |
| TruthfulQA | train.txt | test.txt |
| WikiQA | train.txt | test.txt |
| BioGen | unlabeled_prompt_entities.txt | prompt_entities.txt |
| WikiGen | train.jsonl | test.jsonl |
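To sanity-check a dataset before running anything, you can peek at the first few examples. This is only an inspection sketch; it assumes the dataset directories sit at the repository root, and the per-line format varies by dataset:

```bash
# Inspect the first few lines of a plain-text split (format is dataset-specific).
head -n 3 NQ/train.txt
# WikiGen uses JSONL, so each line is a standalone JSON object.
head -n 1 factuality_prompt/train.jsonl | python -m json.tool
```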
To evaluate a language model on all phrase- and sentence-level datasets, run the following command:
```bash
cd script
bash get_baselines.sh <model>
```

where `<model>` is the name of the model to evaluate. The script will download the model and evaluate it on all datasets. The results will be saved in the `script/log` directory.
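For example, assuming the script accepts a model name such as a Hugging Face identifier (the name below is a placeholder; substitute whichever identifier your environment expects):

```bash
cd script
# "llama-2-7b" is a placeholder model name, not necessarily one the script supports.
bash get_baselines.sh llama-2-7b
```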
Please note that we call the OpenAI GPT-4 API through Azure for evaluation. Please set the environment variable `AZURE_OPENAI_KEY` to your API key. You can also set the key manually in `src/get_gpt_correctness.py`, Line 13.
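For example, in your shell (the key value below is a placeholder):

```bash
# Set the key for the current shell session; add this line to your shell
# profile (e.g., ~/.bashrc) to make it persistent.
export AZURE_OPENAI_KEY="<your-azure-openai-key>"
```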
Before evaluating models on long-form generation, please run the following command to download the Wikipedia corpus:
```bash
cd FActScore
python -m factscore.download_data
```

To evaluate a language model on BioGen, run the following command:
```bash
cd script
bash get_baseline_long.sh
```

The names of the LLMs to evaluate are set in `script/get_baseline_long.sh`, Line 3. The results will be saved in the `script/log` directory.
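To check which models are currently set without opening the file, you can print that line directly; this assumes you run the command from the repository root:

```bash
# Print Line 3 of the script, where the model names are set.
sed -n '3p' script/get_baseline_long.sh
```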
To evaluate a language model on WikiGen, run the following command:
```bash
cd script
bash get_baseline_fp.sh
```

The names of the LLMs to evaluate are set in `script/get_baseline_fp.sh`, Line 3. The results will be saved in the `script/log` directory.