Semantic parsing, which converts natural language questions into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (Targa), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entities and relations of a given question, we probe for potentially relevant queries through layer-wise expansion and cross-layer combination. We then generate corresponding natural language questions for these constructed queries to jointly serve as synthetic demonstrations for in-context learning. Experiments on multiple knowledge base question answering (KBQA) datasets demonstrate that Targa, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that rely on closed-source models, achieving notable improvements in F1 scores on GrailQA (+7.7) and KBQA-Agent (+12.2). Furthermore, Targa also exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings.
You can follow this guide to deploy Freebase on your local machine:
https://github.com/dki-lab/GrailQA?tab=readme-ov-file#setup
Please remember to modify `SPARQLPATH` to point to your own endpoint.
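Once the Virtuoso service is up, it is worth confirming that the endpoint actually answers SPARQL queries before running the pipeline. The sketch below uses only the standard library; the endpoint URL and helper names are illustrative placeholders, not part of this repository:

```python
import json
from urllib import parse, request

# Illustrative value: replace with your own SPARQLPATH endpoint.
SPARQLPATH = "http://localhost:8890/sparql"

def extract_bindings(payload):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [{var: cell["value"] for var, cell in row.items()}
            for row in payload["results"]["bindings"]]

def run_sparql(query, endpoint=SPARQLPATH):
    """POST a SPARQL query to the endpoint and return its bindings."""
    data = parse.urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    }).encode()
    with request.urlopen(request.Request(endpoint, data=data)) as resp:
        return extract_bindings(json.load(resp))
```

For example, `run_sparql("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")` should return a single binding if the Freebase dump is loaded correctly.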
You can download the datasets, embedding files, and intermediate results from the following link: https://pan.quark.cn/s/7247fd19a451
Please ensure that all downloaded files are placed in their corresponding directories as required by the project structure.
- `parallel_qc.py` is for GrailQA, GraphQ, KBQA-Agent
- `db_qc.py` is for WikiSQL
- `meta_qc.py` is for MetaQA
- set the parameters in `QC/parallel_qc.py`
- set the KB query endpoint to your own in `QC/parallel_qc.py`
- execute `QC/parallel_qc.py` to generate the construction result (our results are also provided in the link above)
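Conceptually, query construction probes candidate queries by expanding hop by hop from the linked entities. The following is a minimal sketch of that layer-wise expansion loop over a toy knowledge graph; the graph, function name, and path representation are assumptions for illustration, not the repository's actual API:

```python
# Toy KB: adjacency list mapping an entity to its (relation, object) pairs.
TOY_KB = {
    "m.einstein": [("profession", "m.physicist"), ("born_in", "m.ulm")],
    "m.physicist": [("field", "m.physics")],
}

def expand_layerwise(start_entities, kb, max_hops=2):
    """Collect candidate query paths by expanding hop by hop from the seeds."""
    paths = [[e] for e in start_entities]  # layer 0: just the seed entities
    candidates = []
    for _ in range(max_hops):
        next_paths = []
        for path in paths:
            tail = path[-1]
            for rel, obj in kb.get(tail, []):
                new_path = path + [rel, obj]
                next_paths.append(new_path)
                candidates.append(new_path)  # every extended path is a candidate
        paths = next_paths
    return candidates
```

In the real pipeline these candidate paths correspond to executable queries, which are then combined across layers to cover multi-constraint questions.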
- `bge_rerank_simulation.py` is for GrailQA, GraphQ, KBQA-Agent
- `bge_rerank_wikisql.py` is for WikiSQL
- `bge_rerank_metaqa.py` is for MetaQA
- based on the construction result, execute `rerank/bge_rerank_simulation.py` to generate the rerank result
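The rerank scripts score each synthetic candidate against the input question with a BGE reranker and keep the most relevant ones as demonstrations. As a minimal sketch of just the selection step (the function name is an assumption, and the scores stand in for the cross-encoder outputs the real scripts compute):

```python
def select_demonstrations(scored_candidates, k=5):
    """Keep the k candidates with the highest relevance scores.

    `scored_candidates` is a list of (candidate, score) pairs; in the real
    pipeline the scores would come from a BGE cross-encoder reranker.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in ranked[:k]]
```

For instance, `select_demonstrations([("a", 0.1), ("b", 0.9), ("c", 0.5)], k=2)` keeps `["b", "c"]`.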
- `QA_concurrent.py` is for GrailQA, GraphQ, KBQA-Agent
- `QA_concurrent_wikisql.py` is for WikiSQL
- `QA_concurrent_metaqa.py` is for MetaQA
- configure the parameters in `QA/QA_concurrent.py`
- modify `QA/utils.py` to make sure the api-key is correctly processed, and set the KB query endpoint to your own
- execute `QA/QA_concurrent.py` to generate the QA result
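At QA time, the selected synthetic (question, logic form) pairs are assembled into an in-context-learning prompt for the model. The sketch below shows one plausible prompt layout; the function name and exact formatting are illustrative assumptions, not the repository's `prompt_list.py` templates:

```python
def build_icl_prompt(question, demonstrations):
    """Assemble an in-context-learning prompt from synthetic demonstrations.

    Each demonstration is a (natural-language question, logic form) pair;
    the target question is appended last with an empty logic-form slot.
    """
    parts = []
    for demo_q, demo_lf in demonstrations:
        parts.append(f"Question: {demo_q}\nLogic form: {demo_lf}\n")
    parts.append(f"Question: {question}\nLogic form:")
    return "\n".join(parts)
```

The model then completes the final `Logic form:` slot, and the predicted query is executed against the KB endpoint to produce the answer.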
Targa/
├── data: datasets used in the experiments
│ ├── GraphQ
│ ├── wikisql
│ ├── metaQA
│ ├── kbqa_agent
│ ├── GrailQA
│ └── freebase
├── linking: for relation linking
│ ├── embedding
│ ├── result: class linking result
│ ├── rel_linking_1hop.py
│ ├── utils.py
│ ├── rel_linking_2hop.py
│ └── rel_linking_embedding.py
├── QC
│ ├── parallel_qc.py: query construction for GrailQA, GraphQ and KBQA-Agent
│ ├── db_qc.py: query construction for WikiSQL
│ ├── meta_qc.py: query construction for MetaQA
│ └── result
├── rerank
│ ├── bge_rerank_train_retrieved.py
│ ├── bge_rerank_simulation_metaqa.py
│ ├── bge_rerank_simulation_wikisql.py
│ ├── bge_rerank_simulation.py
│ ├── retrieve_train_bm25.py
│ ├── utils.py
│ └── result
└── QA
├── QA_concurrent.py: main entrance
├── QA_concurrent_wikisql.py: main entrance
├── QA_concurrent_metaqa.py: main entrance
├── QA_attack.py
├── get_embedding.py
├── result
├── prompt_list.py: prompts used in our experiments
└── utils.py
If you have any questions, please feel free to contact us via issues or email (xianghuang@smail.nju.edu.cn or jyshen@smail.nju.edu.cn)
@inproceedings{huang-etal-2025-targa,
title = "{TARGA}: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data",
author = "Huang, Xiang and
Shen, Jiayu and
Huang, Shanshan and
Cheng, Sitao and
Wang, Xiaxia and
Qu, Yuzhong",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.137/",
pages = "2704--2726",
ISBN = "979-8-89176-251-0",
abstract = "Semantic parsing, which converts natural language queries into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (Targa), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entity and relation of a given question, we probe for the potential relevant queries through layer-wise expansion and cross-layer combination. Then, we generate corresponding natural language questions for these constructed queries to jointly serve as the synthetic demonstration for in-context learning. Experiments on multiple knowledge-based question answering (KBQA) datasets demonstrate that Targa, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that utilize close-sourced model, achieving notable improvements in F1 scores on GrailQA(+7.7) and KBQA-Agent(+12.2). Furthermore, Targa also exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings."
}