Training and Test
The library includes scripts to prepare training data and to train both the parse and generate
models. These scripts are not part of the pip install, so to use them, download the
GitHub project and run them in place. The code for these is located in scripts/X_yyy.
Scripts Directory
* 10_Misc Miscellaneous scripts such as PlotAMR.py and SpotlightDBServer.sh
* 20_Assemble_LDC2020T02 Collect the LDC training data into test, dev, and train files
* 30_Model_Parse_GSII Scripts to train the parsing model
* 40_Model_Generate_T5 Scripts to train the generation model
* 50_Build_AMR_View Script to build and run the GUI
Most of the files in these directories start with a number. To train the models, open the files
in numerical order, check the directories and other parameters defined under the main
statement, and run them.
Most of these scripts simply set up parameters and then call the associated library
functions to execute the code. As such, they are all relatively short and should be fairly
self-explanatory.
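The pattern these scripts follow can be sketched as below. Note that the function and parameter names here are illustrative stand-ins, not the actual amrlib internals:

```python
import os

def run_training(config):
    # Stand-in for the library function a real script would call;
    # the actual scripts import the trainer code from amrlib itself.
    print('Training with data from %s' % config['data_dir'])

if __name__ == '__main__':
    # Parameters are defined under the __main__ guard, as in the
    # real scripts; check and edit these paths before running.
    config = {'data_dir'  : os.path.join('amrlib', 'data', 'LDC2020T02'),
              'model_dir' : os.path.join('amrlib', 'data', 'model_parse_gsii'),
              'batch_size': 16}
    run_training(config)
```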
Directory Structure and Run Locations
The scripts in the above directories have a statement at the top: import setup_run_dir.
This very simple import causes Python to run the script as if it were launched from two levels up
from the script's directory. This is just a simple way to keep the scripts in an organized directory
structure while still allowing a local import of amrlib and a common path to data. If you move a
script or try to run it from another location, be sure to remove this import and modify the paths
accordingly.
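As a rough illustration (this is a sketch of the behavior, not the actual module's code), setup_run_dir does something like the following at import time:

```python
import os

def setup_run_dir_sketch(script_file):
    # Change the working directory to two levels above the script's own
    # directory, so a script in scripts/30_Model_Parse_GSII/ effectively
    # runs from the project root.  Illustrative sketch only.
    script_dir = os.path.dirname(os.path.abspath(script_file))
    run_dir    = os.path.dirname(os.path.dirname(script_dir))
    os.chdir(run_dir)
    return run_dir
```

Because the real module does this as a side effect of the import itself, relocating a script requires removing the import, not just adjusting paths.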
All training data, models, etc. are expected to reside under amrlib/data. When you run the first
script for data prep (scripts/20_Assemble_LDC2020T02/10_CollateData.py), the only directory present
should be amrlib/data/amr_annotation_3.0. After running all the scripts in 20_, 30_, and 40_, the
data directory layout will look something like...
├── amr_annotation_3.0
├── LDC2020T02
│ ├── dev.txt
│ ├── test.txt
│ └── train.txt
├── model_generate_t5
│ ├── config.json
│ ├── pytorch_model.bin
│ ├── test.txt.generated
│ ├── test.txt.ref_sents
│ └── training_args.bin
├── model_parse_gsii
│ ├── epoch200.pt
│ ├── epoch200.pt.dev_generated
│ ├── epoch200.pt.test_generated
│ ├── epoch200.pt.test_generated.wiki
│ └── vocabs
│ ├── concept_char_vocab
│ ├── concept_vocab
│ ├── lem_char_vocab
│ ├── lem_vocab
│ ├── ner_vocab
│ ├── pos_vocab
│ ├── predictable_concept_vocab
│ ├── rel_vocab
│ ├── tok_vocab
│ └── word_char_vocab
└── tdata_gsii
├── dev.txt.features
├── dev.txt.features.nowiki
├── spotlight_wiki.json
├── test.txt.features
├── test.txt.features.nowiki
├── train.txt.features
└── train.txt.features.nowiki
Note that when downloading models you will get a similar layout, but the model directories generally
have a version number appended and a symlink set up, e.g. model_stog -> model_parse_gsii-v0_1_0.
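If you download or unpack a versioned model directory by hand, the unversioned alias can be recreated with a symlink. A minimal sketch, using the directory names shown above:

```python
import os

def link_model(data_dir, versioned_name, alias):
    # Create e.g. amrlib/data/model_stog -> model_parse_gsii-v0_1_0
    # (relative link, resolved inside data_dir).
    os.symlink(versioned_name, os.path.join(data_dir, alias))
```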
Model Configuration
The configs directory has json files that contain the model parameters used for training. You will
notice that the train scripts for both models load these. If you want to change the location of
the training data or any other model / training parameters (such as batch size), check these
files.
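A simple way to tweak one of these files programmatically is sketched below. The field names shown ("batch_size") are examples only; check the actual json files in configs/ for the parameters each model uses:

```python
import json

def update_config(path, **overrides):
    # Load a json config file, apply the given overrides
    # (e.g. batch_size=8), and write it back.
    with open(path) as f:
        config = json.load(f)
    config.update(overrides)
    with open(path, 'w') as f:
        json.dump(config, f, indent=4)
    return config
```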
Training data
The latest AMR training corpus, LDC2020T02, is available from the Linguistic Data Consortium. It is free for institutions that have a membership or $300 for non-members (for non-commercial use).
This newest corpus contains about 60K AMR graphs. Other versions of LDC data can be used for training and test; however, earlier versions are generally smaller, so expect SMATCH and BLEU scores to be slightly lower on the smaller datasets. The original, freely available "Little Prince" corpus is much smaller than the LDC datasets. It is not big enough to do a good job of training these large models, but it can be used for experimenting; just expect much lower scores during test.
If you want to try training but don't want to buy the LDC data, it's reasonable to use the pre-trained parser (or another existing one such as JAMR) to create a synthetic corpus by parsing a large number of sentences from a free corpus and then using the output AMR graphs as input for training. This technique has been shown to be an effective pre-training method in some papers; however, with the larger LDC2020T02 corpus, pre-training is not generally required.
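A sketch of building such a synthetic corpus with the pre-trained parser is shown below. The amrlib calls follow its documented inference API (load_stog_model / parse_sents); the output file name is illustrative:

```python
def build_synthetic_corpus(sentences, out_fn='synthetic_train.txt'):
    # Import deferred so the sketch can be read (and the function
    # defined) without the parse model downloaded.
    import amrlib
    stog = amrlib.load_stog_model()        # load the default pretrained parser
    graphs = stog.parse_sents(sentences)   # one AMR graph string per sentence
    # AMR files separate graph entries with a blank line
    with open(out_fn, 'w') as f:
        f.write('\n\n'.join(graphs))
```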
If you change the name / location of the training files, be sure to update the associated .json config files.