Reading Wikipedia to Answer Open Domain Questions (2017)
April 26, 2025

Reading Wikipedia to Answer Open Domain Questions proposes DrQA, a system for open-domain question answering. DrQA consists of two components: the Document Retriever and the Document Reader.
Given a question, the Document Retriever uses bigram TF-IDF matching to search Wikipedia and retrieve the five most relevant articles. These articles, along with the question, are then passed to the Document Reader, a neural network model that embeds the question and the paragraphs of the retrieved articles into vector representations. The model then compares the paragraph embeddings with the question embedding to identify the text span most likely to be the answer.
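To make the retrieval step concrete, here is a minimal sketch using scikit-learn. It is not the original implementation (which hashes bigrams into a fixed-size feature space); instead, `TfidfVectorizer` with unigrams and bigrams over a hypothetical toy corpus captures the same idea: score each article against the question by TF-IDF similarity and keep the top results.

```python
# A minimal sketch of bigram TF-IDF retrieval using scikit-learn.
# The real Document Retriever hashes bigrams into a fixed-size vocabulary;
# here TfidfVectorizer with ngram_range=(1, 2) stands in for that step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy corpus standing in for Wikipedia articles.
articles = [
    "Paris is the capital and most populous city of France.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Mount Everest is Earth's highest mountain above sea level.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_matrix = vectorizer.fit_transform(articles)        # (n_docs, n_features)

def retrieve(question: str, k: int = 5):
    """Return indices of the k articles most similar to the question."""
    q_vec = vectorizer.transform([question])
    scores = linear_kernel(q_vec, doc_matrix).ravel()  # dot products of normalized TF-IDF vectors
    return scores.argsort()[::-1][:k]

print(retrieve("What is the capital of France?", k=2))
```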
The Document Reader splits each retrieved article into paragraphs and produces an embedding for every token in each paragraph. Let \(p_i\) denote the \(i\)th token in a paragraph of \(m\) tokens. Each token \(p_i\) is first mapped to a feature vector \(\tilde{\mathbf{p}}_i\), resulting in the sequence \(\{\tilde{\mathbf{p}}_1, \dots, \tilde{\mathbf{p}}_m\}\). A multi-layer bidirectional LSTM processes this sequence, and for each token \(p_i\), the concatenated hidden states across all layers form its final embedding \(\mathbf{p}_i\).
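A minimal sketch of this paragraph encoder, assuming PyTorch. Because `nn.LSTM` only returns the outputs of its top layer, the sketch stacks single-layer bidirectional LSTMs and concatenates the per-token outputs of every layer; the dimensions are illustrative, not the paper's.

```python
# A minimal sketch of the paragraph encoder, assuming PyTorch.
# Each single-layer bidirectional LSTM is applied in turn, and the per-token
# outputs of *all* layers are concatenated to form the final embedding p_i.
import torch
import torch.nn as nn

class ParagraphEncoder(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = input_dim
        for _ in range(num_layers):
            self.layers.append(
                nn.LSTM(dim, hidden_dim, bidirectional=True, batch_first=True)
            )
            dim = 2 * hidden_dim  # next layer consumes the BiLSTM output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, m, input_dim) feature vectors p~_i
        outputs = []
        for lstm in self.layers:
            x, _ = lstm(x)        # (batch, m, 2 * hidden_dim)
            outputs.append(x)
        # Concatenate hidden states of every layer:
        # (batch, m, 2 * hidden_dim * num_layers)
        return torch.cat(outputs, dim=-1)

encoder = ParagraphEncoder(input_dim=300, hidden_dim=128)
p_tilde = torch.randn(1, 40, 300)   # one paragraph of 40 tokens (toy input)
p = encoder(p_tilde)                # (1, 40, 768)
print(p.shape)
```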
The input embedding \(\tilde{\mathbf{p}}_i\) is constructed from several features: the GloVe word embedding \(\mathbf{E}(p_i)\), a binary feature indicating whether \(p_i\) appears in the question, part-of-speech (POS) tags, named entity recognition (NER) tags, normalized term frequency, and an aligned question embedding \(f_{\textit{align}}(p_i)\). The aligned question embedding \(f_{\textit{align}}(p_i)\) is calculated as follows. Let \(\alpha(\cdot)\) represent a single dense layer followed by a ReLU activation. Given a question with tokens \(q_j\), the aligned embedding is computed by: $$ f_{\textit{align}}(p_i)=\sum_j \frac{\exp\left(\alpha(\mathbf{E}(p_i))\cdot \alpha(\mathbf{E}(q_j))\right)}{\sum_{j'}\exp\left(\alpha(\mathbf{E}(p_i))\cdot \alpha(\mathbf{E}(q_{j'}))\right)} \mathbf{E}(q_j) $$
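Below is a small numpy sketch of \(f_{\textit{align}}(p_i)\). The dense layer \(\alpha\) is shown with random, untrained weights purely for illustration.

```python
# A minimal numpy sketch of the aligned question embedding f_align.
# The dense layer alpha uses random (untrained) weights for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 300                                     # GloVe embedding dimension
W_alpha = rng.standard_normal((d, d)) * 0.01
b_alpha = np.zeros(d)

def alpha(x: np.ndarray) -> np.ndarray:
    """Single dense layer followed by ReLU."""
    return np.maximum(x @ W_alpha + b_alpha, 0.0)

def f_align(E_p_i: np.ndarray, E_q: np.ndarray) -> np.ndarray:
    """E_p_i: (d,) embedding of token p_i; E_q: (l, d) question token embeddings."""
    scores = alpha(E_q) @ alpha(E_p_i)       # (l,) dot products
    weights = np.exp(scores - scores.max())  # softmax over question tokens
    weights /= weights.sum()
    return weights @ E_q                     # (d,) weighted sum of the E(q_j)

E_p_i = rng.standard_normal(d)
E_q = rng.standard_normal((5, d))            # a 5-token question
print(f_align(E_p_i, E_q).shape)             # (300,)
```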
To encode the question, a slightly different method is used. Let \(l\) be the number of tokens in the question. The word embeddings of the tokens \(q_1, \dots, q_l\) are passed through a recurrent network, producing hidden vectors \(\mathbf{q}_j\). The final question embedding \(\mathbf{q}\) is obtained by a weighted sum: $$ \mathbf{q}=\sum_j\frac{\exp(\mathbf{w}\cdot \mathbf{q}_j)}{\sum_{j'}\exp(\mathbf{w}\cdot\mathbf{q}_{j'})}\mathbf{q}_j $$ where \(\mathbf{w}\) is a learnable weight vector.
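A numpy sketch of this self-attentive pooling, with an assumed hidden size and a random \(\mathbf{w}\) standing in for learned parameters:

```python
# A minimal numpy sketch of the question encoding: a learned vector w scores
# each RNN hidden state q_j, and the softmax weights give the final embedding q.
import numpy as np

rng = np.random.default_rng(0)
h = 256                                 # hidden-state dimension (assumed)
Q = rng.standard_normal((6, h))         # hidden vectors q_j for a 6-token question
w = rng.standard_normal(h)              # learnable weight vector (random here)

scores = Q @ w                          # (l,) one score per question token
b = np.exp(scores - scores.max())       # softmax over tokens
b /= b.sum()
q = b @ Q                               # (h,) weighted sum of the q_j
print(q.shape)
```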
Using these paragraph and question embeddings, DrQA predicts the start and end positions of the answer span independently. Let \(\mathbf{W}_s, \mathbf{W}_e\) be learned weight matrices. For each token \(p_i\), the probabilities for the start and end positions are given by: $$ \begin{align*} P_{\text{start}}(i)&\propto \exp(\mathbf{p}_i\mathbf{W}_s\mathbf{q})\\ P_{\text{end}}(i)&\propto \exp(\mathbf{p}_i\mathbf{W}_e\mathbf{q}) \end{align*} $$ The model selects the span \((i, i')\) that maximizes \(P_{\text{start}}(i)\times P_{\text{end}}(i')\) under the constraint \(i \le i' \le i+15\).
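The following numpy sketch puts the span scoring together: bilinear start/end scores, softmax normalization, and a brute-force search over spans subject to the length constraint. All weights and dimensions are illustrative, not trained values.

```python
# A minimal numpy sketch of span prediction with bilinear start/end scores
# and the constraint i <= i' <= i + 15. Weights are random for illustration.
import numpy as np

rng = np.random.default_rng(0)
m, d_p, d_q = 40, 768, 256              # paragraph length and embedding sizes (assumed)
P = rng.standard_normal((m, d_p))       # paragraph token embeddings p_i
q = rng.standard_normal(d_q)            # question embedding
W_s = rng.standard_normal((d_p, d_q)) * 0.01
W_e = rng.standard_normal((d_p, d_q)) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_start = softmax(P @ W_s @ q)          # P_start(i)  for every token i
p_end = softmax(P @ W_e @ q)            # P_end(i')   for every token i'

# Pick the span (i, i') maximizing P_start(i) * P_end(i') with i <= i' <= i + 15.
best_score, best_span = -1.0, (0, 0)
for i in range(m):
    for j in range(i, min(i + 16, m)):
        score = p_start[i] * p_end[j]
        if score > best_score:
            best_score, best_span = score, (i, j)

print(best_span)
```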