[Paper Reading] BI-DIRECTIONAL ATTENTION FLOW FOR MACHINE COMPREHENSION

Ya-Liang Allen Chang
4 min read · May 9, 2019

Problem Definition

The paper improves the attention mechanism for machine comprehension: answering a query about a given context paragraph.

Introduction

  • Machine comprehension (MC) and question answering (QA) have gained significant popularity over the past few years
  • One of the key factors behind this advancement has been the use of neural attention mechanisms, which enable the system to focus on the targeted area within a context paragraph (for MC) or within an image (for Visual QA) that is most relevant to answering the question

Previous works

  • The computed attention weights are often used to extract the most relevant information from the context for answering the question by summarizing the context into a fixed-size vector
  • In the text domain, they are often temporally dynamic, whereby the attention weights at the current time step are a function of the attended vector at the previous time step (see the sketch after this list)
  • Usually uni-directional, wherein the query attends to the context paragraph or the image
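
For contrast, here is a minimal PyTorch sketch (mine, not from the paper) of that kind of temporally dynamic, uni-directional attention: the state that produces the weights at step t itself depends on the attended vector from step t-1, so an attention mistake can propagate forward. Names and sizes are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy Bahdanau-style attention: the attended vector at each step feeds the next
# decoder state, so the attention weights are temporally dynamic.
d = 8
W_s, W_h, v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, 1)

def attend(s_t, H):
    # s_t: (batch, d) current decoder state; H: (batch, T, d) encoder states
    scores = v(torch.tanh(W_s(s_t).unsqueeze(1) + W_h(H))).squeeze(-1)  # (batch, T)
    alpha = F.softmax(scores, dim=-1)
    return torch.bmm(alpha.unsqueeze(1), H).squeeze(1)                  # (batch, d)

H = torch.randn(2, 5, d)       # 5 context positions
s = torch.zeros(2, d)
for t in range(3):             # each step's attention depends on the evolving state
    c = attend(s, H)
    s = torch.tanh(s + c)      # stand-in for a decoder RNN cell update
```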

Contributions

  • Introduce the Bi-Directional Attention Flow (BIDAF) network, a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity
  • BIDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation

The proposed attention layer:

  • Not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization
  • Use a memory-less attention mechanism. That is, while we iteratively compute attention through time as in Bahdanau et al. (2015), the attention at each time step is a function of only the query and the context paragraph at the current time step and does not directly depend on the attention at the previous time step. We hypothesize that this simplification leads to a division of labor between the attention layer and the modeling layer. It forces the attention layer to focus on learning the attention between the query and the context, and enables the modeling layer to focus on learning the interaction within the query-aware context representation (the output of the attention layer). It also allows the attention at each time step to be unaffected by incorrect attendances at previous time steps
  • Use attention mechanisms in both directions, query-to-context and context-to-query, which provide complementary information to each other (a minimal sketch of this bi-directional attention follows this list)
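
Below is a minimal PyTorch sketch of this memory-less, bi-directional attention, following the paper's formulation (similarity S_tj = w^T[h; u; h ∘ u], then context-to-query and query-to-context attention). The class and variable names are mine; d is the hidden size, so the contextual embeddings H and U have dimension 2d.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the BiDAF attention flow layer (memory-less, bi-directional).
# H: contextual context embeddings (B, T, 2d); U: contextual query embeddings (B, J, 2d).
class AttentionFlow(nn.Module):
    def __init__(self, d):
        super().__init__()
        # trainable weight for the similarity S_tj = w^T [h; u; h * u]
        self.w = nn.Linear(6 * d, 1, bias=False)

    def forward(self, H, U):
        B, T, _ = H.shape
        J = U.size(1)
        h = H.unsqueeze(2).expand(B, T, J, -1)                     # (B, T, J, 2d)
        u = U.unsqueeze(1).expand(B, T, J, -1)                     # (B, T, J, 2d)
        S = self.w(torch.cat([h, u, h * u], dim=-1)).squeeze(-1)   # (B, T, J)

        # Context-to-query: which query words matter for each context word
        a = F.softmax(S, dim=-1)                                   # (B, T, J)
        U_tilde = torch.bmm(a, U)                                  # (B, T, 2d)

        # Query-to-context: which context words matter for some query word
        b = F.softmax(S.max(dim=-1).values, dim=-1)                # (B, T)
        H_tilde = torch.bmm(b.unsqueeze(1), H).expand(B, T, -1)    # tiled across time

        # Query-aware representation G, passed on without early summarization
        return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)  # (B, T, 8d)
```

Note that the attended vectors are computed for every context position and concatenated with the inputs, rather than being collapsed into a single fixed-size vector.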

Model

Six layers:

  1. Character Embedding Layer maps each word to a vector space using character-level CNNs.
  2. Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.
  3. Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embedding of the words. These first three layers are applied to both the query and context.
  4. Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
  5. Modeling Layer employs a Recurrent Neural Network to scan the context.
  6. Output Layer provides an answer to the query. (A sketch of the modeling and output layers follows this list.)
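
Here is a minimal PyTorch sketch (mine, not the authors' code) of layers 5 and 6: an LSTM scans the query-aware representation G from the attention flow layer, and two linear heads produce start and end distributions over context positions, from which the answer span is chosen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the modeling and output layers of BiDAF.
# G: query-aware context representation (B, T, 8d) from the attention flow layer.
class ModelingAndOutput(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.modeling = nn.LSTM(8 * d, d, num_layers=2, bidirectional=True, batch_first=True)
        self.end_rnn = nn.LSTM(2 * d, d, bidirectional=True, batch_first=True)
        self.start_head = nn.Linear(10 * d, 1, bias=False)   # scores [G; M]
        self.end_head = nn.Linear(10 * d, 1, bias=False)     # scores [G; M2]

    def forward(self, G):
        M, _ = self.modeling(G)                               # (B, T, 2d)
        p_start = F.softmax(self.start_head(torch.cat([G, M], dim=-1)).squeeze(-1), dim=-1)
        M2, _ = self.end_rnn(M)                               # (B, T, 2d)
        p_end = F.softmax(self.end_head(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=-1)
        return p_start, p_end   # answer span = (i, j) maximizing p_start[i] * p_end[j], i <= j

d = 8
G = torch.randn(2, 30, 8 * d)            # e.g., the output of the attention sketch above
p_start, p_end = ModelingAndOutput(d)(G)
```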

Related Works

Machine comprehension

  • Contributions from large-scale datasets: massive cloze test datasets, the Children's Book Test, the Stanford Question Answering Dataset (SQuAD) …
  • Three main groups of attention mechanisms

Visual question answering

  • RNN
  • Different attention mechanisms

QA Experiments

Dataset: SQuAD

BIDAF (ensemble) achieves an EM score of 73.3 and an F1 score of 81.1, outperforming all previous approaches.
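
EM and F1 here are the standard SQuAD metrics (exact match and token-level overlap with the reference answer). Below is a simplified sketch of how they are computed; the official evaluation script additionally normalizes punctuation and articles, which is omitted here.

```python
from collections import Counter

# Simplified SQuAD-style metrics: exact match and token-level F1.
def exact_match(prediction: str, truth: str) -> float:
    return float(prediction.strip().lower() == truth.strip().lower())

def f1(prediction: str, truth: str) -> float:
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)   # overlapping tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("10th century", "in the 10th century"))   # 0.0
print(f1("10th century", "in the 10th century"))            # partial credit
```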

Cloze Test Experiments

The paper also evaluates BIDAF on the CNN/DailyMail cloze test datasets.

Conclusion

The paper introduces BIDAF, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization.

Reference

Seo, Minjoon, et al. “Bidirectional attention flow for machine comprehension.” arXiv preprint arXiv:1611.01603 (2016).
