Semantic parsing, which converts natural language questions into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (Targa), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entities and relations of a given question, we probe for potentially relevant queries through layer-wise expansion and cross-layer combination. We then generate corresponding natural language questions for these constructed queries to jointly serve as synthetic demonstrations for in-context learning. Experiments on multiple knowledge base question answering (KBQA) datasets demonstrate that Targa, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that rely on closed-source models, achieving notable improvements in F1 scores on GrailQA (+7.7) and KBQA-Agent (+12.2). Furthermore, Targa also exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings.
You can follow this guide to deploy Freebase on your local machine:
https://github.com/dki-lab/GrailQA?tab=readme-ov-file#setup
Please remember to modify `SPARQLPATH` to point to your own endpoint.
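Once the Virtuoso service is up, it is worth confirming that the endpoint actually answers SPARQL queries before running the pipeline. The sketch below uses only the standard library; the endpoint URL and helper names are illustrative placeholders, not part of this repository:

```python
import json
from urllib import parse, request

# Illustrative value: replace with your own SPARQLPATH endpoint.
SPARQLPATH = "http://localhost:8890/sparql"

def extract_bindings(payload):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [{var: cell["value"] for var, cell in row.items()}
            for row in payload["results"]["bindings"]]

def run_sparql(query, endpoint=SPARQLPATH):
    """POST a SPARQL query to the endpoint and return its bindings."""
    data = parse.urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    }).encode()
    with request.urlopen(request.Request(endpoint, data=data)) as resp:
        return extract_bindings(json.load(resp))
```

For example, `run_sparql("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")` should return a single binding if the Freebase dump is loaded correctly.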
You can download the datasets, embedding files, and intermediate results from the following link: https://pan.quark.cn/s/7247fd19a451
Please ensure that all downloaded files are placed in their corresponding directories as required by the project structure.
- `parallel_qc.py` is for GrailQA, GraphQ, KBQA-Agent
- `db_qc.py` is for WikiSQL
- `meta_qc.py` is for MetaQA
- set the parameters in `QC/parallel_qc.py`
- set the KB query endpoint to your own in `QC/parallel_qc.py`
- execute `QC/parallel_qc.py` to generate the construction result (our results are also provided in the link above)
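Conceptually, query construction probes candidate queries by expanding hop by hop from the linked entities. The following is a minimal sketch of that layer-wise expansion loop over a toy knowledge graph; the graph, function name, and path representation are assumptions for illustration, not the repository's actual API:

```python
# Toy KB: adjacency list mapping an entity to its (relation, object) pairs.
TOY_KB = {
    "m.einstein": [("profession", "m.physicist"), ("born_in", "m.ulm")],
    "m.physicist": [("field", "m.physics")],
}

def expand_layerwise(start_entities, kb, max_hops=2):
    """Collect candidate query paths by expanding hop by hop from the seeds."""
    paths = [[e] for e in start_entities]  # layer 0: just the seed entities
    candidates = []
    for _ in range(max_hops):
        next_paths = []
        for path in paths:
            tail = path[-1]
            for rel, obj in kb.get(tail, []):
                new_path = path + [rel, obj]
                next_paths.append(new_path)
                candidates.append(new_path)  # every extended path is a candidate
        paths = next_paths
    return candidates
```

In the real pipeline these candidate paths correspond to executable queries, which are then combined across layers to cover multi-constraint questions.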
- `bge_rerank_simulation.py` is for GrailQA, GraphQ, KBQA-Agent
- `bge_rerank_wikisql.py` is for WikiSQL
- `bge_rerank_metaqa.py` is for MetaQA
- based on the construction result, execute `rerank/bge_rerank_simulation.py` to generate the rerank result
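The rerank scripts score each synthetic candidate against the input question with a BGE reranker and keep the most relevant ones as demonstrations. As a minimal sketch of just the selection step (the function name is an assumption, and the scores stand in for the cross-encoder outputs the real scripts compute):

```python
def select_demonstrations(scored_candidates, k=5):
    """Keep the k candidates with the highest relevance scores.

    `scored_candidates` is a list of (candidate, score) pairs; in the real
    pipeline the scores would come from a BGE cross-encoder reranker.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in ranked[:k]]
```

For instance, `select_demonstrations([("a", 0.1), ("b", 0.9), ("c", 0.5)], k=2)` keeps `["b", "c"]`.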
- `QA_concurrent.py` is for GrailQA, GraphQ, KBQA-Agent
- `QA_concurrent_wikisql.py` is for WikiSQL
- `QA_concurrent_metaqa.py` is for MetaQA
- configure the parameters in `QA/QA_concurrent.py`
- modify `QA/utils.py` to make sure the api-key is correctly processed, and set the KB query endpoint to your own
- execute `QA/QA_concurrent.py` to generate the QA result
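At QA time, the selected synthetic (question, logic form) pairs are assembled into an in-context-learning prompt for the model. The sketch below shows one plausible prompt layout; the function name and exact formatting are illustrative assumptions, not the repository's `prompt_list.py` templates:

```python
def build_icl_prompt(question, demonstrations):
    """Assemble an in-context-learning prompt from synthetic demonstrations.

    Each demonstration is a (natural-language question, logic form) pair;
    the target question is appended last with an empty logic-form slot.
    """
    parts = []
    for demo_q, demo_lf in demonstrations:
        parts.append(f"Question: {demo_q}\nLogic form: {demo_lf}\n")
    parts.append(f"Question: {question}\nLogic form:")
    return "\n".join(parts)
```

The model then completes the final `Logic form:` slot, and the predicted query is executed against the KB endpoint to produce the answer.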
Targa/
├── data: datasets used in the experiments
│ ├── GraphQ
│ ├── wikisql
│ ├── metaQA
│ ├── kbqa_agent
│ ├── GrailQA
│ └── freebase
├── linking: for relation linking
│ ├── embedding
│ ├── result: class linking result
│ ├── rel_linking_1hop.py
│ ├── utils.py
│ ├── rel_linking_2hop.py
│ └── rel_linking_embedding.py
├── QC
│ ├── parallel_qc.py: query construction for GrailQA, GraphQ and KBQA-Agent
│ ├── db_qc.py: query construction for WikiSQL
│ ├── meta_qc.py: query construction for MetaQA
│ └── result
├── rerank
│ ├── bge_rerank_train_retrieved.py
│ ├── bge_rerank_simulation_metaqa.py
│ ├── bge_rerank_simulation_wikisql.py
│ ├── bge_rerank_simulation.py
│ ├── retrieve_train_bm25.py
│ ├── utils.py
│ └── result
└── QA
├── QA_concurrent.py: main entrance
├── QA_concurrent_wikisql.py: main entrance
├── QA_concurrent_metaqa.py: main entrance
├── QA_attack.py
├── get_embedding.py
├── result
├── prompt_list.py: prompts used in our experiments
└── utils.py
If you have any questions, please feel free to contact us via issues or email (xianghuang@smail.nju.edu.cn or jyshen@smail.nju.edu.cn)
@inproceedings{huang-etal-2025-targa,
title = "{TARGA}: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data",
author = "Huang, Xiang and
Shen, Jiayu and
Huang, Shanshan and
Cheng, Sitao and
Wang, Xiaxia and
Qu, Yuzhong",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.137/",
pages = "2704--2726",
ISBN = "979-8-89176-251-0",
abstract = "Semantic parsing, which converts natural language queries into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (Targa), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entity and relation of a given question, we probe for the potential relevant queries through layer-wise expansion and cross-layer combination. Then, we generate corresponding natural language questions for these constructed queries to jointly serve as the synthetic demonstration for in-context learning. Experiments on multiple knowledge-based question answering (KBQA) datasets demonstrate that Targa, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that utilize close-sourced model, achieving notable improvements in F1 scores on GrailQA(+7.7) and KBQA-Agent(+12.2). Furthermore, Targa also exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings."
}