Phrase Retrieval in ODQA
Current limitations of Retriever-Reader ODQA
- Error propagation
- No matter how good the Reader is, if the Retriever fails to deliver proper context, the entire pipeline’s performance drops.
- Query-dependent encoding
- The encoding of the answer span changes depending on the query.
- e.g., when using a BERT retriever, the query and context are concatenated as model input, so a different query means:
- The encoding of the concatenated embedding must be recomputed every time.
- The encoding result changes each time.
Phrase retrieval was proposed as a method to solve these problems without going through the Retriever-Reader pipeline.
Solution

- Enumerate all phrases in the context.
- Map each phrase using its embedding vector as a key.
- The query vector is computed whenever a query comes in.
- The problem reduces to comparing the query vector against the key vectors.
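The steps above can be sketched as a nearest-key lookup. The phrases and vectors below are made up for illustration (not from a real encoder); the point is that keys are computed once, and answering a question is just a similarity comparison:

```python
import numpy as np

# Toy sketch: every phrase gets a pre-computed key vector, and answering
# reduces to nearest-key search. Vectors here are hand-picked stand-ins,
# not the output of an actual phrase encoder.
phrases = ["Barack Obama", "August 4, 1961", "Honolulu, Hawaii"]
phrase_keys = np.array([  # pre-computed offline, one key per phrase
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

def answer(query_vec: np.ndarray) -> str:
    """Compare the query vector against all key vectors and return
    the phrase whose key scores highest (inner product)."""
    scores = phrase_keys @ query_vec
    return phrases[int(np.argmax(scores))]

# At query time only the query vector is encoded; the keys never change.
query_vec = np.array([0.1, 0.9, 0.05])  # closest to the date phrase
print(answer(query_vec))  # August 4, 1961
```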
Comparing the training process
Conventional approach
- Feed phrase, question, and document into a score function to get all score combinations.
- The Retriever-Reader is expressed as a single function F.
- F takes a (an answer candidate), d (a document), and q (a question) as input and produces the predicted answer.
- The score function must be recomputed whenever the query changes.
Phrase retrieval
- Instead of F, the score is computed by combining a question encoder G and a phrase encoder H.
- H takes a and d as input, maps them into a vector space, and similarity between vectors is measured via dot product, distance calculation, etc.
- All outputs of H are pre-computed.
- So only G needs to be computed each time a query arrives.
- Problem
- The assumption that F can be decomposed into G and H might be wrong.
- Rather than mathematically decomposing F, we define G and H and try to approximate F as closely as possible.
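The decomposition F(q, d, a) ≈ G(q) · H(d, a) can be sketched as below. This is a hypothetical toy, not the paper's model: the "learned" behavior is faked by mapping the question word "when" and year-like answer tokens onto a shared "date" dimension, which is the kind of alignment a trained encoder pair would learn:

```python
import numpy as np

# Dimensions: [obama, born, hawaii, date]. A joint score F(q, d, a) is
# approximated by two independent encoders: G(q) . H(d, a).
DIMS = {"obama": 0, "born": 1, "hawaii": 2}

def _embed(tokens):
    vec = np.zeros(4)
    for tok in tokens:
        if tok in DIMS:
            vec[DIMS[tok]] += 1.0
        elif tok == "when" or tok.isdigit():
            vec[3] += 1.0  # question word and years share the date dimension
    return vec

def H(document: str, answer: str) -> np.ndarray:
    """Phrase encoder: embeds a candidate answer in its document context.
    All of these vectors can be pre-computed offline."""
    return _embed(document.lower().split()) + 2.0 * _embed(answer.lower().split())

def G(question: str) -> np.ndarray:
    """Question encoder: the only thing computed at query time."""
    return _embed(question.lower().split())

doc = "obama was born in 1961 in hawaii"
q = "when was obama born"
scores = {a: float(G(q) @ H(doc, a)) for a in ["1961", "hawaii"]}
# The date-typed question scores "1961" above "hawaii".
```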
Key Challenge
How do we map phrases well into vector space? -> Try using both dense and sparse embeddings.
Dense-sparse representation for Phrase
- Dense vectors: good at capturing syntactic and semantic information.
- Sparse vectors: good at capturing lexical information.
Concat

The approach to combining both methods: for phrase-level ODQA, compute a dense and a sparse embedding vector for each phrase, then concatenate them.
Dense representation

A phrase vector (dense vector) is generated using the hidden state vectors at the start and end tokens of the answer span.

Coherency vector
- Indicates whether a phrase is a valid constituent of a sentence.
- Used to filter out phrases that don’t form proper units.
- Computed from the start and end vectors.
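A minimal sketch of the dense side, assuming per-token hidden states from some encoder (shapes and the random states below are illustrative; the real model uses BERT hidden states and learned projections for the start, end, and coherency components):

```python
import numpy as np

# Illustrative stand-in for encoder output: one hidden vector per token.
seq_len, hidden = 10, 16
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(seq_len, hidden))

def phrase_vector(start: int, end: int) -> np.ndarray:
    """Build a dense phrase vector from the hidden states at the span's
    start and end tokens, plus a coherency scalar computed from them."""
    h_start = hidden_states[start]
    h_end = hidden_states[end]
    # Coherency component: signals whether the span is a well-formed
    # constituent (a simple scaled dot product here, for illustration).
    coherency = np.array([h_start @ h_end / hidden])
    return np.concatenate([h_start, h_end, coherency])

vec = phrase_vector(2, 5)
print(vec.shape)  # (33,) = start (16) + end (16) + coherency (1)
```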
Question embedding
- Generated from the CLS token.
- Same as standard document embedding.
Sparse representation
Uses contextualized embeddings to build sparse vectors from the most relevant n-grams.
- Measure similarity with surrounding words of the target phrase.
- Insert the similarity scores into the sparse vector corresponding to each word.
- Similar to TF-IDF. The difference is that weights change dynamically per phrase and sentence.
- Unigrams and bigrams can both be used to capture overlapping information.
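The sparse side can be sketched as follows. Everything here is a toy stand-in (random vectors instead of contextualized embeddings, a four-word vocabulary), but it shows the key contrast with TF-IDF: each vocabulary slot holds a similarity score that depends on the phrase, not a fixed corpus statistic:

```python
import numpy as np

# Toy vocabulary and per-occurrence "contextualized" embeddings
# (random stand-ins for real encoder outputs).
vocab = ["obama", "born", "1961", "hawaii"]
rng = np.random.default_rng(1)
word_embs = {w: rng.normal(size=8) for w in vocab}

def sparse_vector(phrase_emb: np.ndarray) -> np.ndarray:
    """One slot per vocabulary word, holding the similarity between the
    phrase and that word. Unlike TF-IDF, the weights change dynamically
    with the phrase (and, via the embeddings, with the sentence)."""
    vec = np.zeros(len(vocab))
    for i, w in enumerate(vocab):
        vec[i] = float(phrase_emb @ word_embs[w])
    return vec

sv_date = sparse_vector(word_embs["1961"])
sv_place = sparse_vector(word_embs["hawaii"])
# Different phrases in the same sentence get different sparse weights.
```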
Scalability Challenge
Using Wikipedia data typically means dealing with around 60 billion phrases. Indexing and searching at this scale requires scalability considerations.
- Storage: techniques such as pointers, filtering, and scalar quantization can reduce the index from 240TB down to 1.4TB.
- Search: uses FAISS.
- FAISS can only search dense vectors, not sparse ones.
- Since phrase ODQA combines dense and sparse vectors, dense vectors are searched first.
- Sparse vector scores are then used to rerank the FAISS results.
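The two-stage search can be sketched like this. In practice the dense stage would use a FAISS index (e.g. an inner-product index); plain numpy stands in here so the sketch is self-contained, and all vectors are random placeholders:

```python
import numpy as np

# Random stand-ins for a pre-built phrase index.
rng = np.random.default_rng(0)
n_phrases, dense_dim, sparse_dim = 1000, 32, 100
dense_index = rng.normal(size=(n_phrases, dense_dim))
sparse_index = rng.normal(size=(n_phrases, sparse_dim))

def search(q_dense: np.ndarray, q_sparse: np.ndarray, k: int = 10):
    # Stage 1: dense inner-product search over all phrases
    # (this is the part FAISS handles at scale).
    dense_scores = dense_index @ q_dense
    topk = np.argsort(-dense_scores)[:k]
    # Stage 2: rerank only the k dense hits using the sparse scores.
    final = dense_scores[topk] + sparse_index[topk] @ q_sparse
    return topk[np.argsort(-final)]

ranked = search(rng.normal(size=dense_dim), rng.normal(size=sparse_dim))
```

The design point: sparse scores never touch the full index, only the short candidate list from the dense search, which is what keeps sparse reranking cheap.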
Results & Analysis
On SQuAD, phrase retrieval showed a 3.6% performance improvement over DrQA (a Retriever-Reader model), with end-to-end inference 68x faster than other Retriever-Reader approaches. It also runs on CPU, so there is the side benefit of not needing a GPU.
Limitation

- Requires a lot of RAM.
- Performance is lower than state-of-the-art Retriever-Reader models.
- Lower performance on Natural Questions.
- Caused by the decomposability gap (decomposing F into G and H).
Decomposability gap
Approximating F through G and H is inherently error-prone: F represents a very complex Retriever-Reader function, so the approximation via G and H inevitably falls short.