You can access trained models and datasets used in this project from the following links:
- mT5-large trained on translated AMR 3.0 and HuAMR.
- Llama-3.2-1B-HuAMR trained on translated AMR 3.0 and HuAMR.
- HuAMR datasets
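For example, the released dataset and models can be pulled directly from the Hugging Face Hub. The repository identifiers below are placeholders; substitute the actual IDs from the links above:

```python
# Placeholder repository IDs; replace them with the IDs linked above.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

huamr = load_dataset("ORG/HuAMR")                                      # placeholder dataset ID
tokenizer = AutoTokenizer.from_pretrained("ORG/mt5-large-huamr")       # placeholder model ID
model = AutoModelForSeq2SeqLM.from_pretrained("ORG/mt5-large-huamr")   # placeholder model ID
```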
Set up a Python environment and install the project in editable mode:

```bash
conda create --name my-env python=3.12.0
conda activate my-env
pip install -e .
```

Set the `HF_TOKEN` (Hugging Face token) environment variable to enable Hugging Face model loading and saving functionality across all entry points:

```bash
export HF_TOKEN='your_hugging_face_token'
```
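For a quick sanity check that the token is picked up, the `huggingface_hub` client can be used (this snippet is illustrative and not part of the project's entry points):

```python
# Illustrative check that HF_TOKEN is set and valid; not part of the project code.
import os
from huggingface_hub import login, whoami

login(token=os.environ["HF_TOKEN"])  # raises if the token is missing or invalid
print(whoami()["name"])              # prints the account the token belongs to
```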
Here is the recommended sequence for running each script in the project to produce synthetic AMR data, train models, and evaluate results:

- **Generate Input Data for Synthetic AMR:**
  - Run `gen_input_for_synth_data.py` to prepare a CSV of selected sentences for synthetic data generation.
  - Alternatively, use `gen_europarl_synth_data.py` to prepare synthetic data from the Europarl dataset.
- **Generate Synthetic AMR Data:**
  - Run `generate_synthetic_data.py` to produce AMR graphs for the prepared sentences, using an AMR model adapter.
- **Train AMR Models:**
  - Use `train.py` or `train_s2s.py` (for Seq2Seq models) to train models on AMR datasets with configurations as needed.
- **Generate AMR for Evaluation:**
  - After training, run `generate.py` or `generate_s2s.py` to create AMR graphs for evaluation.
- **Evaluate Models:**
  - Run `evaluate.py` to calculate Smatch++ scores (F1, Precision, Recall) between reference and predicted AMR graphs.
  - For machine translation evaluation, use `comet_score.py` to calculate COMET scores between source and translated sentences.
To run `train.py`, provide a configuration file in YAML format (e.g., `llm_config.yaml`) containing the training parameters. Execute the script from the command line as follows:

```bash
python train.py path/to/llm_config.yaml
```

Refer to `llm_config.yaml` for a complete list of configurable options and their descriptions.
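The real configuration keys are those documented in `llm_config.yaml` itself; as a rough illustration only, the entry points consume such a file along these lines:

```python
# Rough sketch of how a YAML training config is read from the command line;
# the actual keys are documented in llm_config.yaml, not invented here.
import sys
import yaml

with open(sys.argv[1], "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(sorted(config.keys()))  # e.g. model name, dataset paths, training hyperparameters
```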
To train a sequence-to-sequence (Seq2Seq) model using `train_s2s.py`, provide a configuration file in YAML format (e.g., `s2s_train_config.yaml`) containing the necessary training and model parameters. Run the script from the command line as follows:

```bash
python train_s2s.py path/to/s2s_train_config.yaml
```

Refer to `s2s_config.yaml` for a complete list of configurable options and their descriptions.
To perform AMR generation using `generate.py`, you need a configuration file (e.g., `generate_config.yaml`) along with a specified adapter and output path. Execute the script as follows:

```bash
python generate.py path/to/generate_config.yaml path/to/adapter path/to/output_directory [batch_size]
```

- `config_path`: Path to the configuration file.
- `adapter_path`: Path to the adapter model for loading into the main model.
- `output_path`: Directory where the output CSV with generated AMR graphs will be saved.
- `batch_size` (optional): Batch size for inference (default is 32).

Refer to `llm_config.yaml` for a complete list of configurable options.

The script will save the output file (`generated.csv`), containing sentences and their corresponding generated AMR graphs, to the specified output directory.
To run `generate_s2s.py`, provide a configuration file in YAML format (e.g., `s2s_config.yaml`) along with an output directory path. This script performs inference and saves the generated results.

```bash
python generate_s2s.py path/to/s2s_config.yaml path/to/output_directory
```

- `config_path`: Path to the configuration file (e.g., `s2s_config.yaml`).
- `output_path`: Directory where the output CSV with generated AMR graphs will be saved.

Refer to `s2s_config.yaml` for a complete list of configurable options and their descriptions.

The script will save the output file (`generated.csv`), containing the test set sentences and their generated AMR graphs, to the specified output directory.
To generate synthetic AMR data with `generate_synthetic_data.py`, provide a configuration file (e.g., `generate_synth_config.yaml`), an adapter, an input file, and an output directory. The script processes the input file in batches and appends generated AMR graphs to the output file.

```bash
python generate_synthetic_data.py path/to/generate_synth_config.yaml path/to/adapter path/to/input_file path/to/output_directory [batch_size]
```

- `config_path`: Path to the configuration file (e.g., `generate_synth_config.yaml`).
- `adapter_path`: Path to the adapter model for loading into the main model.
- `input_file`: Path to a CSV file containing sentences for which synthetic AMR graphs will be generated.
- `output_path`: Directory where the output CSV with generated synthetic AMR data will be saved.
- `batch_size` (optional): Batch size for inference (default is 32).

The `input_file` should be a CSV file with a `sentence` column containing sentences for AMR generation.

The script will generate `generated.csv` in the specified `output_path`, containing:

- `sentence`: Original input sentences.
- `generated_amr`: Corresponding generated AMR graphs.
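As a quick illustration (not part of the repository's code), the output can be inspected with pandas:

```python
# Illustrative only: inspect the synthetic AMR output produced by
# generate_synthetic_data.py (columns as documented above).
import pandas as pd

df = pd.read_csv("path/to/output_directory/generated.csv")
print(df.columns.tolist())         # includes 'sentence' and 'generated_amr'
print(df.loc[0, "sentence"])       # an input sentence
print(df.loc[0, "generated_amr"])  # the generated AMR graph for that sentence
```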
`evaluate.py` calculates the Smatch++ F1, Precision, and Recall scores between reference and predicted AMR graphs in a CSV file. The script processes each row, removing unwanted elements (e.g., `:wiki` attributes) from the graphs and handling errors where graphs cannot be parsed.

```bash
python evaluate.py path/to/data.csv ref_column pred_column path/to/output_directory
```

- `data_path`: Path to the CSV file containing the reference and predicted AMR graphs.
- `ref_column`: Column name in `data.csv` for the reference AMR graphs.
- `pred_column`: Column name in `data.csv` for the predicted AMR graphs.
- `out_path`: Directory where the output CSV with evaluation results will be saved.

The `data.csv` file should contain columns for the reference and predicted AMR graphs, specified by `ref_column` and `pred_column`.

The script will generate `evaluated.csv` in the specified `out_path`, containing:

- `smatchpp_f1`: F1 score for each row.
- `smatchpp_prec`: Precision score for each row.
- `smatchpp_rec`: Recall score for each row.
Additionally, the script outputs the following metrics to the console:
- Mean Smatch++ F1 score across all entries.
- Mean Precision across all entries.
- Mean Recall across all entries.
- Count of unparsable AMR graphs.
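For downstream analysis, the per-row scores in `evaluated.csv` can be aggregated with pandas, mirroring the means printed to the console (illustrative only, not part of the repository's code):

```python
# Illustrative only: recompute corpus-level means from evaluate.py's output.
import pandas as pd

df = pd.read_csv("path/to/output_directory/evaluated.csv")
for col in ["smatchpp_f1", "smatchpp_prec", "smatchpp_rec"]:
    print(f"mean {col}: {df[col].mean():.2f}")  # NaN rows (unparsable graphs) are skipped
```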
`comet_score.py` calculates COMET scores for evaluating machine translation quality. The script uses the `Unbabel/wmt22-cometkiwi-da` model to predict scores based on pairs of source and target sentences.

```bash
python comet_score.py path/to/input_folder
```

- `input_folder`: Path to a directory containing CSV files with source and target sentences.

The `input_folder` should contain one or more CSV files. Each CSV file should include:

- `sentence`: Column for source sentences.
- `hu_sentence`: Column for the target or machine-translated sentences.
The script will load all CSV files in the folder and concatenate them for scoring.
The script prints the COMET scores for each sentence pair to the console.
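For reference, scoring a few sentence pairs with the `unbabel-comet` library looks roughly like the sketch below. This is a minimal illustration, not the script's actual implementation; note that `Unbabel/wmt22-cometkiwi-da` is a gated model, so the `HF_TOKEN` account must have access to it.

```python
# Minimal sketch of COMET QE scoring with the unbabel-comet library;
# comet_score.py may differ in details such as batching and CSV handling.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# wmt22-cometkiwi-da is reference-free: each item needs a source ("src")
# and a machine translation ("mt").
data = [{"src": "The committee approved the proposal.",
         "mt": "A bizottság jóváhagyta a javaslatot."}]

output = model.predict(data, batch_size=8, gpus=0)  # set gpus=1 if a GPU is available
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level average
```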
`gen_input_for_synth_data.py` prepares a dataset for synthetic AMR generation by filtering and tokenizing sentences from the HunSum-2-abstractive dataset. The output is a CSV file with selected sentences ready for use as input in AMR generation tasks.

```bash
python gen_input_for_synth_data.py
```

Ensure the necessary datasets and tokenizers are available. This script uses the `SZTAKI-HLT/HunSum-2-abstractive` dataset, which will be automatically downloaded if not already present.

The script generates a file named `synthetic_data_input.csv`, containing:

- `uuid`: Unique identifier for each entry.
- `url`: URL of the article source.
- `sentence`: First sentence of the tokenized lead section.
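The core idea, sketched below with placeholder logic (the script's actual filtering and Hungarian sentence tokenizer may differ), is to keep the first sentence of each article's lead:

```python
# Rough sketch only: load HunSum-2-abstractive and keep the first sentence of
# each lead. The naive ". " split is a stand-in for a proper sentence tokenizer.
import pandas as pd
from datasets import load_dataset

ds = load_dataset("SZTAKI-HLT/HunSum-2-abstractive", split="train")

rows = []
for item in ds:
    lead = (item.get("lead") or "").strip()
    if not lead:
        continue
    first_sentence = lead.split(". ")[0]  # placeholder sentence splitter
    rows.append({"uuid": item["uuid"], "url": item["url"], "sentence": first_sentence})
    if len(rows) >= 1000:                 # small sample for illustration
        break

pd.DataFrame(rows).to_csv("synthetic_data_input.csv", index=False)
```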
`gen_europarl_synth_data.py` generates AMR graphs from English-Hungarian sentence pairs in the Europarl dataset. This script uses an AMR parser model to create AMR graphs for English sentences, then saves the generated graphs along with the corresponding Hungarian sentences.

```bash
python gen_europarl_synth_data.py path/to/stog_model_dir path/to/europarl_folder path/to/output_directory [batch_size]
```

- `stog_model_dir`: Path to the directory containing the sentence-to-graph (STOG) model for AMR parsing.
- `europarl_folder`: Directory containing the Europarl dataset files (`europarl-v7.hu-en.en` and `europarl-v7.hu-en.hu`).
- `output_path`: Directory where the output file with generated AMR graphs will be saved.
- `batch_size` (optional): Batch size for inference (default is 32).

The `europarl_folder` should contain:

- `europarl-v7.hu-en.en`: File with English sentences.
- `europarl-v7.hu-en.hu`: File with corresponding Hungarian sentences.

The script will create `europarl_generated.txt` in the specified `output_path`. Each entry in this file will contain:
- AMR Graph: Generated from the English sentence.
- Hungarian Sentence: Aligned with the AMR graph for reference.
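The overall flow, sketched below under the assumption that the STOG model is loaded with `amrlib` (the script's actual model wrapper and output format may differ), pairs each English line with its Hungarian counterpart, parses the English side, and writes both out together:

```python
# Rough sketch of the Europarl flow; assumes an amrlib-style STOG model and an
# ad-hoc output layout, neither of which is prescribed by the script itself.
import amrlib

stog = amrlib.load_stog_model(model_dir="path/to/stog_model_dir")

with open("path/to/europarl_folder/europarl-v7.hu-en.en", encoding="utf-8") as f_en, \
     open("path/to/europarl_folder/europarl-v7.hu-en.hu", encoding="utf-8") as f_hu:
    en_sents = [line.strip() for line in f_en]
    hu_sents = [line.strip() for line in f_hu]

graphs = stog.parse_sents(en_sents[:32])  # parse a small batch for illustration

with open("path/to/output_directory/europarl_generated.txt", "w", encoding="utf-8") as out:
    for graph, hu in zip(graphs, hu_sents):  # files are line-aligned, so index i matches
        out.write(f"{graph}\n# hu_sentence: {hu}\n\n")
```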
If you use our dataset or models, please cite the following papers:
TODO