MRC
Machine Reading Comprehension. The task of understanding a given context and inferring answers to queries/questions.
The ultimate goal is to answer questions that do not appear in the training MRC dataset, by making use of external data.
Extractive Answer Datasets
The answer to a question always exists as a segment (or span) within the given context.
Cloze Tests
e.g., CNN/Daily Mail, CBT
Although cloze tests follow a question-answering format, the "questions" are fill-in-the-blank sentences rather than the complete, natural-language questions we ultimately want for MRC.
Span Extraction
e.g., SQuAD, KorQuAD, NewsQA, Natural Questions
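In these datasets each answer is marked by its character offset in the context. A minimal sketch of a SQuAD-style sample (hypothetical values, not drawn from the actual dataset):

```python
# Hypothetical SQuAD-style sample: the answer is a span of the context,
# located by its character offset (answer_start).
sample = {
    "context": "Seoul is the capital of South Korea.",
    "question": "What is the capital of South Korea?",
    "answers": {"text": ["Seoul"], "answer_start": [0]},
}

# The span recovered from the offset must equal the answer text.
start = sample["answers"]["answer_start"][0]
text = sample["answers"]["text"][0]
assert sample["context"][start:start + len(text)] == text
```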

Descriptive Narrative Answer Datasets
Instead of extracting an answer as a span within the context, the answer is determined as a generated sentence (or free-form) based on the question.
e.g., MS MARCO, Narrative QA

Multiple-choice Datasets
A task where the answer to a question is selected from a set of answer candidates. This setting is considered less suitable for building practical MRC QA models, since real queries rarely come with answer choices. e.g., MCTest (reportedly the first public MRC dataset, released in 2013), RACE, ARC
Challenges in MRC
Paraphrased paragraph
P1 and P2 are paraphrased sentences: they convey the same meaning in different words.
P1 contains keywords from the question, such as 'selected' and 'mission', and its sentence structure is straightforward, so if the model can locate P1 in the context, answering the question is easy.
P2, however, contains none of the question's words, and its sentence structure is more difficult.
An MRC model needs to be able to find the answer in both P1 and P2.
Coreference resolution
Coreference occurs when multiple expressions in a text (e.g., a name and a pronoun) refer to the same entity. Coreference resolution is the task of recognizing these mentions as referring to the same entity.
ref: Blog
Unanswerable questions
There are clearly cases where the answer cannot be determined from the context alone, yet a poorly trained model will force out an answer anyway.
For unanswerable questions, the model should instead respond that it cannot provide an answer.
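SQuAD 2.0, for example, marks such questions with an is_impossible flag and an empty answer list; a hypothetical sample in that style:

```python
# Hypothetical SQuAD 2.0-style unanswerable sample: the model should
# predict an empty answer instead of forcing a span from the context.
sample = {
    "context": "Seoul is the capital of South Korea.",
    "question": "What is the population of Busan?",
    "is_impossible": True,
    "answers": {"text": [], "answer_start": []},
}
```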
Multi-hop reasoning
A task where supporting facts from multiple documents must be found to answer the question.
e.g., HotpotQA, QAngaroo

Evaluation methods
Exact Match, F1 score
Evaluation methods used when the answer exists within the passage (extractive answer) and for multiple-choice datasets.
- Exact Match (EM) or Accuracy
- The proportion of predictions that exactly match the ground truth
- (number of exactly matching predictions) / (total number of samples)
- F1 score
- Computed from token overlap between the predicted answer and the ground truth: precision = (overlapping tokens) / (prediction length), recall = (overlapping tokens) / (ground-truth length), and F1 is their harmonic mean
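The EM and F1 computations above can be sketched as follows (a simplified version: the official SQuAD evaluation script additionally strips punctuation and articles before comparing):

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> int:
    # 1 if the normalized prediction matches the ground truth exactly, else 0.
    return int(prediction.strip().lower() == ground_truth.strip().lower())

def f1_score(prediction: str, ground_truth: str) -> float:
    # Token-overlap F1 between prediction and ground truth.
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# "the Seoul city" vs "Seoul": precision 1/3, recall 1 -> F1 = 0.5
print(f1_score("the Seoul city", "Seoul"))
```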
ROUGE-L, BLEU
Evaluation methods for descriptive answers.
- ROUGE-L Score
- Recall based on the longest common subsequence (LCS) between prediction and ground truth
- BLEU
- n-gram precision of the predicted answer against the ground truth
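ROUGE-L can be sketched with a standard LCS dynamic program (a simplified, single-reference version of the metric):

```python
def lcs_length(a: list, b: list) -> int:
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(prediction: str, ground_truth: str) -> dict:
    # LCS-based precision, recall, and F1 over whitespace tokens.
    pred = prediction.lower().split()
    gt = ground_truth.lower().split()
    lcs = lcs_length(pred, gt)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    p, r = lcs / len(pred), lcs / len(gt)
    return {"precision": p, "recall": r, "f1": 2 * p * r / (p + r)}

# LCS = 3 tokens: precision 3/3 = 1.0, recall 3/6 = 0.5
print(rouge_l("the cat sat", "the cat sat on the mat"))
```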