Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019)
September 23, 2023

Sentence-BERT derives semantically meaningful sentence embeddings that can be compared using cosine-similarity. BERT achieved new state-of-the-art performance on various sentence-pair regression tasks using a cross-encoder: both sentences are passed together to the transformer network and the target value is predicted. Semantic textual similarity is one such sentence-pair regression task. However, this setup does not scale to tasks that require scoring many pairs; finding the most similar pair in a collection of 10,000 sentences already requires about 50 million cross-encoder inference computations. Mapping each sentence to a vector space in which semantically similar sentences are close avoids this combinatorial explosion, since each embedding is computed once and comparisons are cheap. Sentence-BERT uses a siamese network in which the two BERT networks have tied weights, so the produced sentence embeddings can be semantically compared using cosine-similarity.
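As a quick usage sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (a later model in the same family, not the exact one from the paper):

```python
from sentence_transformers import SentenceTransformer, util

# Encode two sentences independently, then compare with cosine-similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "A man is eating food.",
    "A man is eating a piece of bread.",
])
print(util.cos_sim(embeddings[0], embeddings[1]))  # high score for similar sentences
```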
Sentence-BERT adds a pooling operation to the output of BERT to derive a fixed-sized sentence embedding.
The authors of Sentence-BERT propose three pooling strategies: using the output of the [CLS] token, computing the mean of all output vectors (the default configuration), and computing a max-over-time of the output vectors. The default strategy is sketched below.
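Here is a minimal sketch of mean pooling, assuming the Hugging Face transformers library with a bert-base-uncased checkpoint; the mean_pool helper is illustrative, not code from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average the remaining token vectors.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

batch = tokenizer(["A man is playing a guitar."], padding=True, return_tensors="pt")
with torch.no_grad():
    output = model(**batch)
embedding = mean_pool(output.last_hidden_state, batch["attention_mask"])  # shape (1, 768)
```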
Which objective Sentence-BERT is trained with depends on the available training data. Three different objective functions are proposed.
The classification objective function concatenates the sentence embeddings \(u\) and \(v\) with the element-wise difference \(|u-v|\) and multiplies the result with the trainable weight \(W_t \in \mathbb{R}^{3n\times k}\): $$ o = \text{softmax}(W_t(u, v, |u-v|)) $$ where \(n\) is the dimension of the sentence embeddings and \(k\) the number of labels. The weights are optimized with cross-entropy loss, as sketched below.
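A minimal PyTorch sketch of this objective; the names classification_loss and W_t are illustrative, and \(n\) and \(k\) follow the formula above:

```python
import torch
import torch.nn as nn

n, k = 768, 3                      # embedding dimension and number of labels
W_t = nn.Linear(3 * n, k)          # trainable weight W_t in R^{3n x k}
loss_fn = nn.CrossEntropyLoss()    # applies softmax + cross-entropy internally

def classification_loss(u, v, labels):
    # Concatenate (u, v, |u - v|) and project to k label logits.
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
    return loss_fn(W_t(features), labels)

u, v = torch.randn(8, n), torch.randn(8, n)   # a batch of sentence-pair embeddings
labels = torch.randint(0, k, (8,))
print(classification_loss(u, v, labels))
```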
The regression objective function computes the cosine similarity between the two sentence embeddings \(u\) and \(v\) and regresses it onto the gold similarity score; the authors use mean-squared-error loss.
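A corresponding sketch, assuming gold scores are scaled to the range of the cosine similarity:

```python
import torch
import torch.nn.functional as F

def regression_loss(u, v, gold_scores):
    # Cosine similarity between the embeddings, regressed onto gold scores with MSE.
    sim = F.cosine_similarity(u, v, dim=-1)
    return F.mse_loss(sim, gold_scores)
```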
The triplet objective function takes an anchor sentence embedding \(s_a\), a positive sentence embedding \(s_p\), and a negative sentence embedding \(s_n\). It updates the weights such that the distance between \(s_a\) and \(s_p\) is smaller than the distance between \(s_a\) and \(s_n\), minimizing the following loss: $$ \max(||s_a - s_p|| - ||s_a - s_n|| + \epsilon, 0) $$ The margin \(\epsilon\) ensures that \(s_p\) is at least \(\epsilon\) closer to \(s_a\) than \(s_n\); the authors use Euclidean distance and set \(\epsilon = 1\).
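A sketch of the triplet objective, using Euclidean distance and \(\epsilon = 1\) as in the paper; the function name is illustrative:

```python
import torch

def triplet_loss(s_a, s_p, s_n, eps=1.0):
    # Hinge loss on the gap between anchor-positive and anchor-negative distances.
    d_pos = torch.norm(s_a - s_p, dim=-1)  # ||s_a - s_p||
    d_neg = torch.norm(s_a - s_n, dim=-1)  # ||s_a - s_n||
    return torch.clamp(d_pos - d_neg + eps, min=0.0).mean()
```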