Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019)
September 23, 2023

Sentence-BERT derives semantically meaningful sentence embeddings that can be compared using cosine-similarity. BERT achieved new state-of-the-art performance on various sentence-pair regression tasks using a cross-encoder: both sentences are passed together to the transformer network and the target value is predicted. Semantic textual similarity is one such sentence-pair regression task. However, this setup does not scale to tasks that require scoring many pairs; finding the most similar pair in a collection of 10,000 sentences already requires about 50 million cross-encoder inference computations. Mapping each sentence to a vector space in which semantically similar sentences are close avoids this combinatorial explosion, since each embedding is computed once and comparisons are cheap. Sentence-BERT uses a siamese network in which the two BERT networks have tied weights, so the produced sentence embeddings can be semantically compared using cosine-similarity.
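As a quick usage sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (a later model in the same family, not the exact one from the paper):

```python
from sentence_transformers import SentenceTransformer, util

# Encode two sentences independently, then compare with cosine-similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "A man is eating food.",
    "A man is eating a piece of bread.",
])
print(util.cos_sim(embeddings[0], embeddings[1]))  # high score for similar sentences
```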
Sentence-BERT adds a pooling operation to the output of BERT to derive a fixed-sized sentence embedding.
The authors of Sentence-BERT propose three pooling strategies: using the output of the [CLS] token, computing the mean of all output vectors (the default configuration), and computing a max-over-time of the output vectors. The default strategy is sketched below.
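Here is a minimal sketch of mean pooling, assuming the Hugging Face transformers library with a bert-base-uncased checkpoint; the mean_pool helper is illustrative, not code from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average the remaining token vectors.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

batch = tokenizer(["A man is playing a guitar."], padding=True, return_tensors="pt")
with torch.no_grad():
    output = model(**batch)
embedding = mean_pool(output.last_hidden_state, batch["attention_mask"])  # shape (1, 768)
```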
Which objective Sentence-BERT is trained with depends on the available training data. Three different objective functions are proposed.
The classification objective function concatenates the sentence embeddings \(u\) and \(v\) with the element-wise difference \(|u-v|\) and multiplies the result with the trainable weight \(W_t \in \mathbb{R}^{3n\times k}\): $$ o = \text{softmax}(W_t(u, v, |u-v|)) $$ where \(n\) is the dimension of the sentence embeddings and \(k\) the number of labels. The weights are optimized with cross-entropy loss, as sketched below.
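A minimal PyTorch sketch of this objective; the names classification_loss and W_t are illustrative, and \(n\) and \(k\) follow the formula above:

```python
import torch
import torch.nn as nn

n, k = 768, 3                      # embedding dimension and number of labels
W_t = nn.Linear(3 * n, k)          # trainable weight W_t in R^{3n x k}
loss_fn = nn.CrossEntropyLoss()    # applies softmax + cross-entropy internally

def classification_loss(u, v, labels):
    # Concatenate (u, v, |u - v|) and project to k label logits.
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
    return loss_fn(W_t(features), labels)

u, v = torch.randn(8, n), torch.randn(8, n)   # a batch of sentence-pair embeddings
labels = torch.randint(0, k, (8,))
print(classification_loss(u, v, labels))
```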
The regression objective function computes the cosine similarity between the two sentence embeddings \(u\) and \(v\) and regresses it onto the gold similarity score; the authors use mean-squared-error loss.
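A corresponding sketch, assuming gold scores are scaled to the range of the cosine similarity:

```python
import torch
import torch.nn.functional as F

def regression_loss(u, v, gold_scores):
    # Cosine similarity between the embeddings, regressed onto gold scores with MSE.
    sim = F.cosine_similarity(u, v, dim=-1)
    return F.mse_loss(sim, gold_scores)
```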
The triplet objective function takes an anchor sentence embedding \(s_a\), a positive sentence embedding \(s_p\), and a negative sentence embedding \(s_n\). It updates the weights such that the distance between \(s_a\) and \(s_p\) is smaller than the distance between \(s_a\) and \(s_n\), minimizing the following loss: $$ \max(||s_a - s_p|| - ||s_a - s_n|| + \epsilon, 0) $$ The margin \(\epsilon\) ensures that \(s_p\) is at least \(\epsilon\) closer to \(s_a\) than \(s_n\); the authors use Euclidean distance and set \(\epsilon = 1\).
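A sketch of the triplet objective, using Euclidean distance and \(\epsilon = 1\) as in the paper; the function name is illustrative:

```python
import torch

def triplet_loss(s_a, s_p, s_n, eps=1.0):
    # Hinge loss on the gap between anchor-positive and anchor-negative distances.
    d_pos = torch.norm(s_a - s_p, dim=-1)  # ||s_a - s_p||
    d_neg = torch.norm(s_a - s_n, dim=-1)  # ||s_a - s_n||
    return torch.clamp(d_pos - d_neg + eps, min=0.0).mean()
```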