qaeval-experiments


This directory contains the code to reproduce the experiments that calculate the QA metrics and correlations on the manually labeled subset of the Fabbri (2020) data (Tables 2 and 3).

Required data:

Required environments:

To recalculate the numbers, run:

sh experiments/question-answering/fabbri2020/run.sh

Because the question generation model assigns question IDs randomly, rerunning question generation produces IDs that do not match the IDs we used during manual annotation. To recalculate the scores reported in the paper, we therefore include the intermediate outputs of the script, and run.sh only finishes the processing from those outputs. However, it also contains the commented-out code that we used to run the full pipeline.
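The ID mismatch can be illustrated with a minimal sketch; this is not the repository's actual ID scheme, and the function names below are hypothetical:

    # Hypothetical illustration of why regenerated question IDs fail to match
    # the annotated ones: a random ID changes on every run, while a
    # content-derived ID would not.
    import hashlib
    import uuid

    def random_question_id() -> str:
        # A fresh UUID each run: rerunning generation yields IDs that no
        # longer line up with the IDs stored in the annotation files.
        return uuid.uuid4().hex

    def content_question_id(instance_id: str, question_text: str) -> str:
        # A deterministic alternative: hashing the instance ID and question
        # text produces the same ID on every run.
        key = f"{instance_id}||{question_text}".encode("utf-8")
        return hashlib.sha1(key).hexdigest()[:16]

    if __name__ == "__main__":
        print(random_question_id())                   # differs across runs
        print(content_question_id("d1", "Who won?"))  # stable across runs

Because the released annotations are keyed by the original (random) IDs, the pipeline picks up from the saved intermediate outputs instead of regenerating questions.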

We did the annotation in batches, so you may see batch numbers 1 and 2 in the output directories and code.

Included data:

After the run.sh script finishes, the metric correlations (Table 3) will be written to the following locations:

The QA metrics (Table 2) will be written to output/all/squad-metrics.json and output/all/answer-verification/log.txt. The first file contains the is-answerable F1 (under is_answerable -> unweighted -> f1) and the EM/F1 scores on only the answerable subset of the data (under is-answerable-only -> squad -> exact-match/f1). The second file contains the human-labeled answer accuracy given that the ground-truth question is answerable (the line "Accuracy given ground-truth question is answerable 0.8629737609329446").
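For convenience, the Table 2 numbers can be read with a short script like the following. This is a sketch that assumes output/all/squad-metrics.json is a single JSON object nested exactly as the arrow paths above indicate; adjust the keys if the layout differs.

    # Read the QA metrics reported in Table 2 from the output file.
    import json

    with open("output/all/squad-metrics.json") as f:
        metrics = json.load(f)

    # Is-answerable F1
    is_answerable_f1 = metrics["is_answerable"]["unweighted"]["f1"]

    # EM/F1 on the answerable-only subset
    answerable_only = metrics["is-answerable-only"]["squad"]
    exact_match = answerable_only["exact-match"]
    f1 = answerable_only["f1"]

    print(f"Is-answerable F1: {is_answerable_f1:.4f}")
    print(f"Answerable-only EM: {exact_match:.4f}, F1: {f1:.4f}")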