Dense Passage Retrieval for Open-Domain Question Answering (2020)
Open-domain question answering (QA) involves answering fact-based questions using a large collection of documents. An open-domain QA system can be divided into two components: one that retrieves relevant passages and another that extracts the answer spans from those passages (Chen et al., 2017). While traditional approaches use sparse vector space models like BM25 for the retrieval step, Dense Passage Retrieval for Open-Domain Question Answering shows that the retrieval component can be practically implemented using dense representations instead.
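As a rough illustration of the retrieval step, the sketch below ranks precomputed passage embeddings against a question embedding by inner product. The array shapes and the random vectors standing in for encoder outputs are assumptions for illustration; the paper itself uses BERT-based dual encoders.

```python
# Minimal sketch: dense retrieval by inner product over a passage index.
import numpy as np

def retrieve(question_vec: np.ndarray, passage_matrix: np.ndarray, k: int = 5):
    """Return indices of the top-k passages by inner-product similarity.

    question_vec:   (d,) dense embedding of the question
    passage_matrix: (n_passages, d) precomputed passage embeddings
    """
    scores = passage_matrix @ question_vec          # (n_passages,)
    return np.argsort(-scores)[:k]

# Toy usage with random vectors in place of real encoder outputs.
rng = np.random.default_rng(0)
passages = rng.normal(size=(1000, 768))             # hypothetical passage index
question = rng.normal(size=768)                      # hypothetical question embedding
top_idx = retrieve(question, passages, k=5)
```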
The embeddings are learned from the training data by maximizing the inner product between each question vector and the vector of its relevant (positive) passage. The training is essentially metric learning, so each question also needs irrelevant (negative) passages. In experiments, the best performance came from combining two kinds of negatives: top passages returned by BM25 that do not contain the answer, and positive passages paired with the other questions in the same training batch (in-batch negatives); a sketch of this objective follows.
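The sketch below illustrates this objective with in-batch negatives plus one BM25-style hard negative per question: the scores over all candidate passages are fed to a softmax, and the loss is the negative log-likelihood of the true positive. The tensor shapes and the random embeddings standing in for the question and passage encoders are assumptions, not the paper's exact implementation.

```python
# Sketch of the training objective: for each question, its negatives are the
# other questions' positive passages in the batch plus one hard negative.
import torch
import torch.nn.functional as F

def dpr_loss(q_emb, pos_emb, hard_neg_emb):
    """q_emb, pos_emb, hard_neg_emb: (batch, dim) outputs of the two encoders."""
    batch = q_emb.size(0)
    # Candidate passages for every question: all positives + all hard negatives.
    candidates = torch.cat([pos_emb, hard_neg_emb], dim=0)   # (2*batch, dim)
    scores = q_emb @ candidates.t()                           # (batch, 2*batch) inner products
    # The correct passage for question i is its own positive at column i.
    targets = torch.arange(batch, device=q_emb.device)
    return F.cross_entropy(scores, targets)                   # softmax over all candidates

# Toy usage with random embeddings standing in for encoder outputs.
q = torch.randn(8, 768, requires_grad=True)
pos = torch.randn(8, 768)
neg = torch.randn(8, 768)
loss = dpr_loss(q, pos, neg)
loss.backward()
```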