Code and data for TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data (ACL 2025)


cdhx/Targa


TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data


Paper: TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data (ACL 2025)

Semantic parsing, which converts natural language questions into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (Targa), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entities and relations of a given question, we probe for the potential relevant queries through layer-wise expansion and cross-layer combination. Then, we generate corresponding natural language questions for these constructed queries to jointly serve as the synthetic demonstrations for in-context learning. Experiments on multiple knowledge base question answering (KBQA) datasets demonstrate that Targa, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that utilize closed-source models, achieving notable improvements in F1 scores on GrailQA (+7.7) and KBQA-Agent (+12.2). Furthermore, Targa also exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings.
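The layer-wise expansion described above can be sketched on a toy knowledge base. Everything below (the triples, entity names, and function names) is invented for illustration only; the actual construction logic lives in QC/parallel_qc.py and queries a real Freebase endpoint.

```python
# Toy illustration of layer-wise query expansion from seed entities.
# The KB and all names here are made up for this sketch.

TOY_KB = [
    ("Albert_Einstein", "born_in", "Ulm"),
    ("Albert_Einstein", "field", "Physics"),
    ("Ulm", "located_in", "Germany"),
]

def expand_layer(paths, kb):
    """Extend each path by one hop using KB triples whose subject matches the path tail."""
    new_paths = []
    for path in paths:
        tail = path[-1][-1]  # object of the last triple in the path
        for (s, p, o) in kb:
            if s == tail:
                new_paths.append(path + [(s, p, o)])
    return new_paths

def probe_queries(seed_entities, kb, max_hops=2):
    """Collect candidate query paths layer by layer, starting from the seed entities."""
    layer = [[(None, None, e)] for e in seed_entities]  # hop-0 pseudo-paths
    all_paths = []
    for _ in range(max_hops):
        layer = expand_layer(layer, kb)
        all_paths.extend(layer)
    # Each path's triples (dropping the hop-0 seed marker) form one candidate query.
    return [tuple(t for t in p if t[0] is not None) for p in all_paths]

candidates = probe_queries(["Albert_Einstein"], TOY_KB)
# Yields two 1-hop candidates (born_in, field) and one 2-hop
# candidate (born_in followed by located_in).
```

Cross-layer combination in the paper then merges candidates from different layers into richer queries; that step is omitted here for brevity.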

KB and data

KB deployment

Follow this guide to deploy Freebase on your local machine:
https://github.com/dki-lab/GrailQA?tab=readme-ov-file#setup

Remember to set SPARQLPATH to your own endpoint.
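As a minimal sketch of what the endpoint setting involves, the snippet below builds a GET request URL for a Virtuoso-style SPARQL endpoint. The URL, port, and Freebase MID are placeholders, not values taken from this repo; use whatever endpoint your local deployment prints on startup.

```python
# Build (but do not send) a SPARQL GET request for a local endpoint.
# SPARQLPATH and the example query are illustrative placeholders.
import urllib.parse

SPARQLPATH = "http://localhost:3001/sparql"  # replace with your own endpoint

def build_sparql_request(query):
    """Encode a SPARQL query as the GET request a Virtuoso-style endpoint accepts."""
    params = urllib.parse.urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    })
    return f"{SPARQLPATH}?{params}"

url = build_sparql_request("SELECT ?p WHERE { ns:m.0d05w3 ?p ?o } LIMIT 5")
# Once the KB is up, the request can be sent with urllib.request.urlopen(url).
```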

Data

You can download the datasets, embedding files, and intermediate results from the following link: https://pan.quark.cn/s/7247fd19a451

Please ensure that all downloaded files are placed in their corresponding directories as required by the project structure.

Run

Query Construction

parallel_qc.py is for GrailQA, GraphQ, and KBQA-Agent.
db_qc.py is for WikiSQL.
meta_qc.py is for MetaQA.

  1. Set the parameters in QC/parallel_qc.py.
  2. Set the KB query endpoint in QC/parallel_qc.py to your own.
  3. Execute QC/parallel_qc.py to generate the construction results (our results are also provided at the link above).

Rerank

bge_rerank_simulation.py is for GrailQA, GraphQ, and KBQA-Agent.
bge_rerank_wikisql.py is for WikiSQL.
bge_rerank_metaqa.py is for MetaQA.

  1. Based on the construction results, execute rerank/bge_rerank_simulation.py to generate the rerank results.
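The rerank step can be illustrated with a simplified stand-in. The real scripts score each constructed query against the input question with a BGE reranker model; the token-overlap scorer below only mimics that interface, and all strings in it are invented examples.

```python
# Simplified stand-in for reranking constructed queries by relevance to the
# question. The actual scripts (rerank/bge_rerank_*.py) use a BGE reranker
# model; Jaccard token overlap here is purely illustrative.

def overlap_score(question, candidate):
    """Jaccard similarity between the token sets of question and candidate."""
    q, c = set(question.lower().split()), set(candidate.lower().split())
    return len(q & c) / max(len(q | c), 1)

def rerank(question, candidates, top_k=2):
    """Return the top_k candidates, highest-scoring first."""
    return sorted(candidates, key=lambda c: overlap_score(question, c), reverse=True)[:top_k]

question = "where was albert einstein born"
candidates = [
    "albert einstein field physics",
    "albert einstein born in ulm",
    "ulm located in germany",
]
top = rerank(question, candidates)
# The candidate mentioning the birthplace ranks first.
```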

Question Answering

QA_concurrent.py is for GrailQA, GraphQ, and KBQA-Agent.
QA_concurrent_wikisql.py is for WikiSQL.
QA_concurrent_metaqa.py is for MetaQA.

  1. Configure the parameters in QA/QA_concurrent.py.
  2. Modify QA/utils.py to ensure your API key is handled correctly, and set the KB query endpoint to your own.
  3. Execute QA/QA_concurrent.py to generate the QA results.
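Conceptually, the reranked synthetic (question, query) pairs are then assembled into an in-context prompt for the answering model. The template below is a hypothetical simplification (the actual prompts are in QA/prompt_list.py), and the demonstration pair is only an example.

```python
# Sketch of assembling synthetic (question, logical form) pairs into an
# in-context learning prompt. Wording and the example pair are invented;
# the real templates live in QA/prompt_list.py.

def build_prompt(demonstrations, question):
    """Concatenate demonstration pairs, then append the target question."""
    lines = []
    for q, logical_form in demonstrations:
        lines.append(f"Question: {q}\nQuery: {logical_form}\n")
    lines.append(f"Question: {question}\nQuery:")
    return "\n".join(lines)

demos = [
    ("where was albert einstein born",
     "(JOIN (R people.person.place_of_birth) m.0jcx)"),
]
prompt = build_prompt(demos, "what field did albert einstein work in")
# The model completes the final "Query:" line with a logical form.
```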

File Structure

Targa/
├── data: datasets used in the experiments
│   ├── GraphQ
│   ├── wikisql
│   ├── metaQA
│   ├── kbqa_agent
│   ├── GrailQA
│   └── freebase
├── linking: for relation linking
│   ├── embedding
│   ├── result: class linking result
│   ├── rel_linking_1hop.py
│   ├── utils.py
│   ├── rel_linking_2hop.py
│   └── rel_linking_embedding.py
├── QC
│   ├── parallel_qc.py: query construction for GrailQA, GraphQ and KBQA-Agent
│   ├── db_qc.py: query construction for WikiSQL
│   ├── meta_qc.py: query construction for MetaQA
│   └── result
├── rerank
│   ├── bge_rerank_train_retrieved.py
│   ├── bge_rerank_simulation_metaqa.py
│   ├── bge_rerank_simulation_wikisql.py
│   ├── bge_rerank_simulation.py
│   ├── retrieve_train_bm25.py
│   ├── utils.py
│   └── result
└── QA
    ├── QA_concurrent.py: main entry point (GrailQA, GraphQ, KBQA-Agent)
    ├── QA_concurrent_wikisql.py: main entry point (WikiSQL)
    ├── QA_concurrent_metaqa.py: main entry point (MetaQA)
    ├── QA_attack.py
    ├── get_embedding.py
    ├── result
    ├── prompt_list.py: prompts used in our experiments
    └── utils.py

If you have any questions, feel free to contact us via GitHub issues or email (xianghuang@smail.nju.edu.cn or jyshen@smail.nju.edu.cn).

Citation

@inproceedings{huang-etal-2025-targa,
    title = "{TARGA}: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data",
    author = "Huang, Xiang  and
      Shen, Jiayu  and
      Huang, Shanshan  and
      Cheng, Sitao  and
      Wang, Xiaxia  and
      Qu, Yuzhong",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.137/",
    pages = "2704--2726",
    ISBN = "979-8-89176-251-0",
    abstract = "Semantic parsing, which converts natural language queries into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (Targa), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entity and relation of a given question, we probe for the potential relevant queries through layer-wise expansion and cross-layer combination. Then, we generate corresponding natural language questions for these constructed queries to jointly serve as the synthetic demonstration for in-context learning. Experiments on multiple knowledge-based question answering (KBQA) datasets demonstrate that Targa, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that utilize close-sourced model, achieving notable improvements in F1 scores on GrailQA(+7.7) and KBQA-Agent(+12.2). Furthermore, Targa also exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings."
}
