Introduction to Sentence Transformers: Enhancing Sentence Embeddings for NLP
Table of contents
- 1. The Evolution of NLP Models: From RNNs to Transformers
- 2. Introducing BERT: A Milestone in NLP
- 3. BERT vs. SBERT: A Comparative Overview
- 4. Understanding Sentence Transformers (SBERT)
- 5. Training and Fine-Tuning SBERT
  - Training SBERT from Scratch
  - Fine-Tuning SBERT
- 6. Fine-Tuning SBERT for Natural Language Inference (NLI)
- 7. Case Study: Quora Similar Questions
- References
- Conclusion
In Natural Language Processing (NLP), Sentence Transformers are powerful models that convert entire sentences into high-dimensional vectors, or embeddings. Unlike traditional word embeddings such as Word2Vec and GloVe, which represent individual words, sentence transformers capture the full context of a sentence. This enables more accurate semantic search, clustering, and sentence-similarity comparisons, making them a valuable asset for modern NLP applications.
In this article, we’ll dive into the fundamentals of Sentence Transformers, exploring how they work, their advantages over traditional word embeddings, and how they are transforming various NLP tasks like semantic search, clustering, and sentence similarity.
1. The Evolution of NLP Models: From RNNs to Transformers
Traditional Approaches: RNNs and Their Limitations
Historically, Recurrent Neural Networks (RNNs) were the go-to models for sequential data tasks in NLP. Despite their ability to handle sequences, RNNs suffered from significant drawbacks:
- Slow Training: RNNs process data sequentially, making them computationally intensive and slow to train.
- Limited Context Understanding: They struggled to capture long-range dependencies, often failing to understand the broader context of a sentence.
To address these issues, variants like Long Short-Term Memory (LSTM) and Bidirectional LSTMs were introduced. While they improved context understanding by capturing information from both past and future tokens, they still fell short in scalability and efficiency for more complex tasks.
The Advent of Transformers
The introduction of the Transformer architecture by Vaswani et al. in 2017 revolutionized NLP. Transformers leverage self-attention mechanisms to process entire sequences in parallel, significantly enhancing training speed and scalability. They excelled in various tasks, including:
- Sequence-to-Sequence Tasks: Such as translation and summarization.
- Token-Level Tasks: Including Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
However, transformers like BERT (Bidirectional Encoder Representations from Transformers) primarily focused on token-level embeddings, which limited their effectiveness in tasks requiring comprehensive sentence-level understanding, such as semantic similarity and information retrieval.
2. Introducing BERT: A Milestone in NLP
BERT marked a pivotal shift in NLP by enabling models to understand language contextually. Its key features include:
- Pretraining on Large Text Corpora: BERT is pretrained on massive datasets, allowing it to grasp the nuances of natural language.
- Fine-Tuning for Specific Tasks: After pretraining, BERT can be fine-tuned on task-specific datasets, enhancing its performance across various NLP applications.
While BERT excels in general NLP tasks, it encounters challenges in sentence similarity tasks, where understanding the semantic relationship between entire sentences is crucial. This limitation paved the way for the development of Sentence-BERT (SBERT).
3. BERT vs. SBERT: A Comparative Overview
| Aspect | BERT | SBERT |
| --- | --- | --- |
| Purpose | General NLP tasks (e.g., NER, POS tagging, Q/A) focused on token-level understanding | Sentence-level tasks (e.g., sentence similarity, clustering) focused on generating fixed-size sentence embeddings |
| Output | Token-level embeddings and the [CLS] token for sentence-level tasks | Fixed-size, dense sentence vectors |
| Computational Efficiency | Computationally intensive, especially for large-scale sentence comparisons | More efficient due to the use of Siamese/triplet network architectures |
| Architecture | Standard transformer encoders | Siamese or triplet network structures built upon transformer encoders |
4. Understanding Sentence Transformers (SBERT)
What Are Sentence Transformers?
Sentence Transformers are specialized models that transform entire sentences into fixed-size vector representations, known as sentence embeddings. These embeddings encapsulate the semantic essence of sentences, enabling effective comparison, clustering, and classification.
How Do Sentence Transformers Work?
A simple SBERT architecture follows these steps:
1. BERT processes the input tokens and outputs token embeddings.
2. A pooling operation (mean, max, or the [CLS] token) aggregates these token embeddings into a single, fixed-size sentence embedding.
While this approach provides a basic sentence vector, the quality of embeddings is often subpar. To enhance the semantic richness of sentence embeddings, SBERT employs more sophisticated training mechanisms and architectures.
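To make this basic pipeline concrete, here is a minimal sketch of BERT token embeddings followed by mean pooling, using the Hugging Face transformers library; the checkpoint name and example sentences are just placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The cat sits on the mat.", "A dog plays in the garden."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean pooling: average the token embeddings, ignoring padding via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768])
```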
Architecture Details
SBERT typically builds upon models like BERT or RoBERTa, incorporating:
- Siamese Network Structure: Utilizes two identical (weight-sharing) transformer encoders to process sentence pairs, facilitating the learning of relationships between them.
- Triplet Networks: Incorporates an additional encoder to handle triplet-loss scenarios, optimizing the embeddings further.
Training Mechanism
- Contrastive Loss
Contrastive loss is commonly used for training models in tasks like sentence similarity or semantic textual similarity (STS). It aims to bring similar sentences closer together in the embedding space and push dissimilar sentences farther apart.
  - How it works: The model is given pairs of sentences labeled as similar or dissimilar. For similar pairs, the goal is to minimize the distance between their embeddings (pull them closer in the vector space); for dissimilar pairs, the goal is to maximize the distance (push them further apart).
  - Formula: The loss is a function of the distance d between the two embeddings (e.g., Euclidean distance or a cosine-based distance), typically L = y * d^2 + (1 - y) * max(0, m - d)^2, where y = 1 for similar pairs, y = 0 for dissimilar pairs, and m is a margin.
  - Use Case: This loss is particularly useful for tasks that require measuring sentence similarity, for instance checking whether two sentences are paraphrases of each other.
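As an illustration, here is a small, self-contained sketch of this contrastive loss in plain PyTorch, assuming Euclidean distance and a margin hyperparameter; the random embeddings stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, labels, margin=0.5):
    # Euclidean distance between the two embeddings of each pair.
    distances = F.pairwise_distance(emb_a, emb_b)
    # Similar pairs (label 1) are pulled together; dissimilar pairs (label 0)
    # are pushed apart until they exceed the margin.
    loss_similar = labels * distances.pow(2)
    loss_dissimilar = (1 - labels) * F.relu(margin - distances).pow(2)
    return (loss_similar + loss_dissimilar).mean()

# Toy usage with random embeddings.
emb_a, emb_b = torch.randn(4, 768), torch.randn(4, 768)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(emb_a, emb_b, labels))
```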
- Triplet Loss
Triplet loss works with three sentences: anchor, positive, and negative. It’s mainly used in tasks where you need to learn a ranking of sentence similarity, such as sentence retrieval or semantic search.
  - How it works:
    - Anchor: A reference sentence.
    - Positive: A sentence similar or related to the anchor.
    - Negative: A sentence dissimilar or unrelated to the anchor.
  - The goal is to minimize the distance between the anchor and the positive while maximizing the distance between the anchor and the negative, typically via L = max(0, d(a, p) - d(a, n) + m), where m is a margin. This helps the model learn a meaningful distance metric in which similar sentences are close and dissimilar sentences are far apart.
  - Use Case: This loss is useful for tasks like semantic search or question answering where you want to rank sentences by relevance. For example, given a query, you want to retrieve the most relevant document or sentence from a database of candidates.
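A comparable sketch of triplet loss, again in plain PyTorch with random embeddings standing in for model outputs; PyTorch's built-in TripletMarginLoss computes essentially the same quantity.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = F.pairwise_distance(anchor, positive)  # distance anchor -> positive
    d_neg = F.pairwise_distance(anchor, negative)  # distance anchor -> negative
    # The positive should be closer to the anchor than the negative by at least `margin`.
    return F.relu(d_pos - d_neg + margin).mean()

anchor, positive, negative = torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768)
print(triplet_loss(anchor, positive, negative))
print(torch.nn.TripletMarginLoss(margin=1.0)(anchor, positive, negative))  # built-in equivalent
```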
- Multiple Negatives Ranking Loss (MNRL)
MNRL is an extension of contrastive loss that is more efficient for training on large datasets. Instead of training with one positive and one negative sample at a time, MNRL works with multiple negative samples in each batch.
  - How it works: The model is given a positive sample and a batch of negative samples (instead of just one negative, as in contrastive loss). The goal is to bring the positive sample closer to the anchor sentence and push the negative samples farther away.
  - Formula: Similar to contrastive loss, but computed over multiple negative samples in one batch; in practice it is implemented as a cross-entropy over the similarity scores between the anchor and all candidates in the batch.
  - Use Case: MNRL is particularly beneficial for efficient training on large datasets, especially in ranking or retrieval tasks with many negative samples (e.g., for each query, you may have thousands of irrelevant documents).
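Below is a hedged sketch of training with MultipleNegativesRankingLoss using the sentence-transformers fit API. The checkpoint name and the tiny in-memory dataset are placeholders; a real run would use many thousands of (query, positive) pairs, and the other positives in each batch act as in-batch negatives.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # example pretrained checkpoint

# With MNRL, each training example only needs a (query, positive) pair.
train_examples = [
    InputExample(texts=["How do I learn Python?", "What is the best way to study Python?"]),
    InputExample(texts=["What is the capital of France?", "Which city is France's capital?"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```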
5. Training and Fine-Tuning SBERT
Training SBERT from Scratch
Training SBERT from scratch involves the following components (a minimal sketch follows the list):
- Pretrained Language Model: Start with a model like BERT or RoBERTa.
- Pooling Layer: Aggregate token embeddings into a sentence embedding using mean, max, or [CLS] pooling.
- Sentence Embedding: Generate a fixed-size vector representing the sentence.
- Loss Function: Apply a suitable loss function (e.g., Contrastive Loss, Triplet Loss) to optimize the embeddings.
- Backpropagation: Adjust the model weights based on the loss to improve embedding quality.
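The sketch below assembles the first three pieces with the sentence-transformers modules API; the base checkpoint and pooling mode are example choices, and the loss function plus backpropagation would follow as in the loss sketches above.

```python
from sentence_transformers import SentenceTransformer, models

# 1. Pretrained language model producing token embeddings.
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)

# 2. Pooling layer: mean pooling over token embeddings -> fixed-size sentence embedding.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)

# 3. Stack the modules into a single SBERT-style model.
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# A loss function and model.fit(...) would then optimize these embeddings.
embeddings = model.encode(["A sentence embedded before any similarity training."])
print(embeddings.shape)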
Fine-Tuning SBERT
Why Fine-Tune?
While pretrained SBERT models are versatile, fine-tuning them on task-specific datasets enhances their performance for particular applications. Fine-tuning adapts the general language understanding of SBERT to the nuances of specific tasks or domains.
Fine-Tuning Steps: Start from a pretrained SBERT checkpoint, prepare labeled sentence pairs (or triplets) for your task, choose a loss function that matches that data format, and train for a few epochs. A minimal sketch appears below, and the next section walks through a concrete example with NLI.
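As one possible illustration, here is a fine-tuning loop using CosineSimilarityLoss on labeled sentence pairs; the checkpoint name, the similarity labels, and the toy data are assumptions made for the sketch.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from a pretrained SBERT checkpoint (name is an example).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Task-specific data: sentence pairs with a similarity label in [0, 1].
train_examples = [
    InputExample(texts=["A man is playing guitar.", "Someone plays an instrument."], label=0.8),
    InputExample(texts=["A man is playing guitar.", "The stock market fell today."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```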
6. Fine-Tuning SBERT for Natural Language Inference (NLI)
NLI is a subtask of NLP where the model determines the relationship between two sentences: a premise and a hypothesis. The possible relationships are:
- Entailment: The hypothesis logically follows from the premise.
- Contradiction: The hypothesis contradicts the premise.
- Neutral: The relationship is neither entailment nor contradiction.
Fine-Tuning SBERT for NLI (a minimal sketch follows these steps):
- Siamese Network Setup: Use two identical (weight-sharing) BERT encoders to generate embeddings for both the premise and the hypothesis.
- Embedding Generation: Apply pooling to obtain sentence vectors u and v.
- Concatenation: Combine the embeddings, for example as (u, v, |u - v|).
- Feedforward Neural Network: Pass the concatenated vector through a classification layer to produce logits.
- Softmax Layer: Convert the logits into a probability distribution over the classes (entailment, contradiction, neutral).
- Loss Function: Apply cross-entropy loss, comparing the predicted probabilities with the true labels.
- Backpropagation: Update the model weights to minimize the loss.
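The sentence-transformers library packages the concatenation, feedforward, softmax, and cross-entropy steps into SoftmaxLoss. A minimal sketch follows, with a toy three-example dataset and an assumed integer label mapping (0 = contradiction, 1 = entailment, 2 = neutral).

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("bert-base-uncased")  # wrapped with mean pooling automatically

# NLI-style premise/hypothesis pairs with integer labels (mapping is a convention).
train_examples = [
    InputExample(texts=["A man is sleeping.", "A man is awake."], label=0),
    InputExample(texts=["A soccer game is in progress.", "People are playing a sport."], label=1),
    InputExample(texts=["A woman is reading.", "The book was published in 2010."], label=2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=3)

# SoftmaxLoss builds the (u, v, |u - v|) -> feedforward -> softmax classification head.
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```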
Sentence Similarity and Metric Learning
Sentence Similarity tasks involve determining how similar two sentences are in meaning. SBERT enhances these tasks by providing robust sentence embeddings that can be compared using metrics like cosine similarity.
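For example, comparing a few sentences with cosine similarity takes only a couple of lines with sentence-transformers; the checkpoint name is an example pretrained model.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

emb = model.encode([
    "How can I learn machine learning?",
    "What is the best way to study ML?",
    "What is the tallest mountain on Earth?",
])

# Cosine similarity between all pairs of sentence embeddings.
similarities = util.cos_sim(emb, emb)
print(similarities)  # values near 1 indicate semantically similar sentences
```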
Metric Learning in SBERT focuses on learning an embedding space where semantic similarity is reflected in the geometric distance between vectors.
Loss Functions:
- Contrastive Loss: Pulls similar sentence embeddings closer while pushing dissimilar ones apart.
- Triplet Loss: Uses triplets of anchor, positive, and negative sentences to ensure the anchor is closer to the positive than to the negative.
7. Case Study: Quora Similar Questions
Objective: Identify if two questions on Quora are semantically similar.
Challenges with BERT:
- Word-Level Embeddings: BERT provides word (token) vectors, which require aggregation (e.g., mean pooling) to form sentence embeddings.
- Poor-Quality Embeddings: Simple pooling over raw BERT outputs yields suboptimal sentence vectors; in the original SBERT paper, they were sometimes outperformed by much simpler averaged GloVe embeddings.
Solution with SBERT:
- Sentence Embeddings: Use SBERT to generate high-quality sentence embeddings.
- Training SBERT: Fine-tune SBERT on relevant tasks (e.g., NLI, sentence similarity) to enhance embedding quality.
Inference (a sketch of this flow follows the steps):
- Embed All Questions: Generate embeddings for all existing questions in the Quora database.
- New Question Embedding: Embed the new question.
- Similarity Computation: Calculate cosine similarity between the new question's embedding and the existing embeddings.
- Retrieve Similar Questions: Identify and present the most similar questions using similarity scores or a k-Nearest Neighbors (k-NN) search.
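A compact sketch of this inference flow with the sentence-transformers semantic_search utility; the corpus, query, and checkpoint are toy placeholders rather than real Quora data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint; a Quora-tuned model would fit better

# 1. Embed all existing questions (a tiny stand-in corpus here).
corpus = [
    "How do I improve my English speaking skills?",
    "What is the best way to learn Python?",
    "How can I lose weight quickly?",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# 2. Embed the new question.
query_embedding = model.encode("What's a good way to get better at speaking English?", convert_to_tensor=True)

# 3./4. Compute cosine similarity and retrieve the top-k most similar questions.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```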
Further Reading: sbert.net/examples/training/quora_duplicate..
References
- Sentence Transformers - EXPLAINED! by CodeEmporium (video on YouTube)
- Fine-Tune Embedding Models for Semantic Search (course by Marqo)
- Training and Finetuning Embedding Models with Sentence Transformers v3 (Hugging Face blog)
- Sentence-Transformers Documentation: official training overview and guidelines
Conclusion
That concludes the article on Sentence-Transformers. I've included as much detail as possible to give you a comprehensive understanding of how to use these models for sentence embeddings. If you notice any inaccuracies or have suggestions for improvement, feel free to share them! Your feedback will not only help me refine my knowledge but will also benefit future readers.
The code in this article is limited to brief illustrative sketches, so I strongly encourage you to check out the Hugging Face guide on Sentence Transformers for a more hands-on experience with the library. Additionally, the course Fine-Tune Embedding Models for Semantic Search on Marqo is an excellent resource for diving deeper into the subject.
Thank you for reading, and happy experimenting with Sentence-Transformers!