
Understanding Vector Embeddings and how to work with them in PEACH

Introduction to embeddings

Vector embeddings are numerical representations of data that capture the inherent characteristics and relationships between items in a high-dimensional space. They play a crucial role in various machine learning tasks, enabling systems to understand and process complex information more effectively.

What are Vector Embeddings?

At its core, a vector embedding is a set of numbers arranged in a particular order. Each number in the set represents a feature or aspect of the item being encoded. By arranging these numbers in a specific way, embeddings can encode rich information about the item's attributes, context, and relationships with other items.
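
For intuition, here is a minimal sketch (the numbers are made up for illustration) of how the closeness of two such vectors can be measured with cosine similarity:

import numpy as np

# toy 4-dimensional embeddings with made-up values; real models produce hundreds of dimensions
cat = np.array([0.9, 0.1, 0.3, 0.0])
kitten = np.array([0.85, 0.15, 0.25, 0.05])
car = np.array([0.1, 0.9, 0.0, 0.4])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, kitten))  # close to 1.0: semantically similar
print(cosine_similarity(cat, car))     # much lower: semantically distant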

How are Vector Embeddings Computed?

Vector embeddings are often computed using machine learning models, such as Transformer architectures like BERT or GPT, or similar techniques. These models are trained on vast amounts of data to learn meaningful representations of input items, effectively learning to "understand" the content.

During training, the model learns to map input items, such as text, images, or audio, to dense numerical vectors in a high-dimensional space. This mapping captures semantic similarities, so that similar items end up closer together in the embedding space.
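
As a concrete sketch, using the open-source sentence-transformers library and a public model (neither is part of the PEACH tooling; the model name is just an example):

from sentence_transformers import SentenceTransformer

# any pretrained sentence encoder works; this public model is just an example
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'A cat sits on the mat.',
    'A kitten rests on a rug.',
    'Stock markets fell sharply today.',
]
embeddings = model.encode(sentences)  # numpy array of shape (3, 384)

# the first two sentences end up much closer together in the embedding space than the third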

An important paper from Google Brain, Attention Is All You Need, introduced the Transformer architecture and is worth getting familiar with, as it became the foundation for much of the recent progress in the field.

What Can Vector Embeddings Be Used For?

Vector embeddings have diverse applications across various domains, including:

  1. Semantic Similarity: Embeddings enable systems to measure the semantic similarity between items. For example, in natural language processing, embeddings can capture the similarity between words, sentences or even longer texts based on their semantic meaning.
  2. Recommendation Systems: Embeddings power recommendation systems by representing items and users in a unified embedding space. Similar items or users are clustered together, enabling personalized recommendations based on similarity of content or user tastes.
  3. Search and Retrieval: Embeddings facilitate efficient search and retrieval of similar items. By indexing embeddings in a database like Milvus, systems can quickly retrieve items that closely match a given query.
  4. Clustering and Classification: Embeddings can be used for clustering similar items together or classifying items into predefined categories. This is particularly useful in tasks like image recognition and document classification (see the clustering sketch after this list).
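
As a sketch of the clustering use case, using scikit-learn with random vectors standing in for real embeddings:

import numpy as np
from sklearn.cluster import KMeans

# random vectors standing in for real document embeddings (100 documents, 768 dims)
embeddings = np.random.rand(100, 768)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)  # one cluster id per document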

Examples of Vector Embeddings

  1. Word Embeddings: In natural language processing, word embeddings represent words as dense vectors in a continuous space. Words with similar meanings are mapped to nearby points in the embedding space. Examples: Word2Vec, GloVe.

  2. Document Embeddings: Document embeddings capture the semantic content of entire documents, enabling systems to compare and retrieve similar documents. Examples: Doc2Vec, BERT, Universal Sentence Encoder, Language-agnostic BERT Sentence Embeddings (LaBSE), or LLM-based embeddings.

  3. Image Embeddings: Image embeddings encode visual features of images, allowing systems to recognize similar images or detect visual patterns; a common scenario is facial recognition. Example: Convolutional Neural Network embeddings.

  4. Audio Embeddings: Audio embeddings represent audio signals as numerical vectors, enabling tasks such as speech recognition and audio classification. Example: VGGish embeddings for audio classification.

Why a Vector Database?


  1. Efficient Storage: Vector databases are specifically designed to store and manage high-dimensional vector data efficiently. Traditional relational databases are not optimized for this type of data, leading to suboptimal performance and scalability issues when dealing with large-scale vector datasets.
  2. Fast Retrieval: Vector databases employ specialized indexing structures and algorithms tailored for similarity search tasks. These optimizations enable fast retrieval of nearest neighbors for a given query vector, even in high-dimensional spaces.
  3. Scalability: As the size of the dataset grows, performing brute-force similarity search becomes increasingly impractical due to its computational complexity (see the brute-force sketch after this list for contrast). Vector databases are designed to scale efficiently with large volumes of vector data, allowing for real-time or near-real-time retrieval of similar items, even across tens of millions of embeddings.
  4. Approximate Nearest Neighbors (ANN): Vector databases often support approximate nearest neighbor search algorithms, which trade a small degree of accuracy for significant gains in search speed. These algorithms provide fast and scalable solutions for similarity search tasks, making them suitable for many real-world applications.
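
For contrast, here is what brute-force similarity search looks like in plain NumPy; it scores the query against every vector in the collection on every request, which is exactly the linear scan that ANN indexes avoid:

import numpy as np

collection = np.random.rand(1_000_000, 768)  # one million embeddings
query = np.random.rand(768)

# brute force: compute the inner product against every vector, O(n * d) per query
scores = collection @ query
top10 = np.argsort(scores)[-10:][::-1]  # indices of the 10 most similar vectors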

In PEACH we adopted the open-source vector database Milvus, which currently powers many different use cases for PEACH members.

How to use Milvus in PEACH

First, install a compatible version of pymilvus; any 2.3.x version should suffice:

pip install pymilvus==2.3.5

Now let's set the correct environment variables (they will already be set up for your organisation's tasks and endpoints environments; you only need to do this in PeachLab) and learn how to connect with the client:

import os

from pipe_algorithms_lib.similarity_2 import connect

os.environ['CODOPS'] = 'ltlrt'  # codops of your organisation
os.environ['CODOPS_LTLRT_MILVUS_PASSWORD'] = 'xxx'  # secret for your organisation; already configured for tasks and endpoints. Contact the PEACH Core team if you don't have it yet. Make sure not to push the secret to git!

# needs to be executed only once, so for endpoints it is better to call it outside your endpoint entry-point function
connect()

Then, let's create a schema for our embeddings collection. Milvus allows us to add fields beyond the vector field itself, so we can later query data not just by vector similarity but also filter items on other metadata fields (common fields include category, publication_date, language, etc.). We should also create the required indexes to get the best querying performance. Make sure this code is executed only once, as we don't want to create a new collection for each task execution:

from pymilvus import DataType, FieldSchema

from pipe_algorithms_lib.similarity_2 import create_collection


COLLECTION_NAME = 'my_collection'  # the name here is just an example; choose your own

create_collection(
    collection_name=COLLECTION_NAME,
    dimension=768,
    extra_fields=[
        FieldSchema(name='publication_date', dtype=DataType.INT64),
        FieldSchema(name='language', dtype=DataType.VARCHAR, max_length=16),
        FieldSchema(name='category', dtype=DataType.VARCHAR, max_length=255),
    ],
    indexes={
        'vec': {
            'metric_type': 'IP',
            'index_type': 'HNSW',
            'M': 32,
            'efConstruction': 128,
        },
        'publication_date': None,
        'language': None,
        'category': None,
    }
)

Here we create scalar indexes for all non-vector fields (useful if we plan to filter on them; we can always add them later separately) and an index for the vector field. We use an HNSW index type with some reasonable hyperparameters, but don't hesitate to tune them as appropriate. You can read more on different vector index types and their hyperparameters here.
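
For example, if HNSW memory usage becomes a concern, an IVF_FLAT index is a common alternative. Assuming the create_collection wrapper accepts its parameters in the same flattened form as above, the indexes argument might look like this (values are starting points, not recommendations):

indexes={
    'vec': {
        'metric_type': 'IP',
        'index_type': 'IVF_FLAT',
        'nlist': 1024,  # number of coarse clusters; a common heuristic is ~4 * sqrt(num_vectors)
    },
    'publication_date': None,
    'language': None,
    'category': None,
}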

Once the collection has been created in your namespace, you can insert documents into it:

from random import random
from datetime import datetime

from pipe_algorithms_lib.similarity_2 import index

# normally comes as output from an embedding model, but let's use a random vector for now
vector = [random() for _ in range(768)]

document = {
    'id': 'my_document_42',
    'vec': vector,
    'publication_date': int(datetime(2024, 3, 1, 12, 0, 0).timestamp()),
    'language': 'en',
    'category': 'sports',
}

# here we simply insert a single document. If you have many documents to index, it is better to send them in larger batches (see the batching sketch below)
documents_to_index = [document]

index(
    collection_name=COLLECTION_NAME,
    data=documents_to_index,
    normalize=True,  # whether to normalize input vectors; usually required when using cosine similarity
    upsert=True,  # when upserting, existing data is overwritten by `id` to avoid duplicates
    flush=True,  # flushing makes the insert slower, but guarantees the data is immediately queryable
    timeout=5,
)
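
If you do have many documents, a minimal batching sketch might look like this (the batch size is an arbitrary choice, and we assume get_collection, used later in this guide, returns a pymilvus Collection so we can flush once at the end):

from pipe_algorithms_lib.similarity_2 import get_collection

BATCH_SIZE = 500  # arbitrary; tune to your document sizes and memory budget

for start in range(0, len(documents_to_index), BATCH_SIZE):
    index(
        collection_name=COLLECTION_NAME,
        data=documents_to_index[start:start + BATCH_SIZE],
        normalize=True,
        upsert=True,
        flush=False,  # skip per-batch flushing for speed
    )

get_collection(COLLECTION_NAME).flush()  # flush once after all batches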

If we want to retrieve some documents (for example, to quickly get metadata for the current item, or to get vectors for clustering):

from pipe_algorithms_lib.similarity_2 import get_documents

document = get_documents(COLLECTION_NAME, ['my_document_42'], output_fields=['id', 'language', 'vec'])

# document -> {
#   'id': 'my_document_42',
#   'language': 'en',
#   'vec': [-0.041, 0.214, 0.012, -0.112, ...]
# }

And the most important step is actually querying the data to find the items most similar to a given vector (in practice you would use a meaningful embedding encoder and have more items in the collection):

from pipe_algorithms_lib.similarity_2 import retrieve_similar

filter_expr = 'language == "fr" and publication_date >= 1704067200'  # 1704067200 = 2024-01-01 00:00:00 UTC

recs = retrieve_similar(
    COLLECTION_NAME,
    ids=['my_document_42'],  # if you pass multiple IDs, the mean of their vectors is computed and used as the query vector
    size=10,
    filter_expr=filter_expr,
)

# alternatively, you can pass a query vector instead of `ids`, if you want to find similar vectors without inserting the current document into the database
recs = retrieve_similar(
    COLLECTION_NAME,
    query_vectors=[vector],
    size=10,
    filter_expr=filter_expr,
)

# recs -> [
#   {'id': 'my_document_42', 'distance': 0.99999},
#   {'id': 'my_document_73', 'distance': 0.86124},
#   {'id': 'my_document_14', 'distance': 0.74124},
#   ...
# ]
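
Note that when querying by ids, the query document itself typically comes back as the top hit (as in the output above), so you usually want to filter it out:

# drop the query document itself from the recommendations
query_ids = {'my_document_42'}
recs = [rec for rec in recs if rec['id'] not in query_ids]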

Quantization

If you are dealing with a large dataset (millions of embeddings) or trying to squeeze out some milliseconds of inference latency, you can consider applying quantization. It is a technique where we take each dimension of an embedding and cast it into a smaller data type, for example float32 into int8 or even a single bit. This way we gain performance when searching for similar vectors, and at the same time get massive savings in storage. While rather counterintuitive, by keeping only the direction of each dimension (i.e. positive or negative, assuming a [-1.0, 1.0] range) we can often maintain very high recall while compressing the data 32 times: a 768-dimensional float32 embedding takes 768 * 4 = 3072 bytes, whereas its binary version takes only 768 bits = 96 bytes. For that we need to normalize values into [0.0, 1.0] and cast each dimension to 0 or 1 according to a selected threshold (0.5 is typically a good starting point).

A complete example of how to create a Milvus collection with a binary vector index, convert an embedding into a binary-encoded embedding, and insert it into Milvus:

import numpy as np

from pymilvus import (
    FieldSchema,
    DataType,
    CollectionSchema,
    Collection,
)
from pipe_algorithms_lib.similarity_2 import index, connect, get_collection, get_vectors


COLLECTION_NAME = 'my_binary_collection'
NUM_DIMS = 768
THRESHOLD = 0.5

def create_collection():
    fields = [
        FieldSchema(name='id', dtype=DataType.VARCHAR, max_length=255, is_primary=True),
        FieldSchema(name='vec', dtype=DataType.BINARY_VECTOR, dim=NUM_DIMS),
    ]
    schema = CollectionSchema(fields=fields, description=COLLECTION_NAME)
    collection = Collection(name=COLLECTION_NAME, schema=schema)
    # create a binary vector index; BIN_IVF_FLAT with the HAMMING metric is a common choice
    collection.create_index(
        field_name='vec',
        index_params={
            'index_type': 'BIN_IVF_FLAT',
            'metric_type': 'HAMMING',
            'params': {'nlist': 128},
        },
    )

connect()
create_collection()


# values here are already in [0.0, 1.0]; for real embeddings you might need to apply min-max normalization first
random_vector = np.random.rand(NUM_DIMS)
binary_vector = np.where(random_vector > THRESHOLD, 1, 0)
binary_encoded_vector = bytes(np.packbits(binary_vector, axis=-1).tolist())

document = {
    'id': '42',
    'vec': binary_encoded_vector,
}
res = index(COLLECTION_NAME, normalize=False, data=[document])

Then, to query this binary collection, we also need to provide a properly encoded vector (note that with the HAMMING metric, a lower distance means more similar, unlike the inner-product scores above):

def revert_to_binary_vector(binary_encoded_vector):
    np_array = np.frombuffer(binary_encoded_vector, dtype=np.uint8)
    binary_vector = np.unpackbits(np_array) 
    return binary_vector


collection = get_collection(COLLECTION_NAME)

# fetch the stored encoded vector for document '42', decode it back to bits,
# then re-pack it into the byte format expected by the search API
vec = revert_to_binary_vector(get_vectors(COLLECTION_NAME, ['42'])[0][0])
binary_vector = bytes(np.packbits(vec, axis=-1).tolist())

res = collection.search([binary_vector], 'vec', param={"metric_type": "HAMMING"}, limit=10)[0]

Using binary quantization typically improves inference performance by around 20% and decreases storage cost by 32x! However, it is best to start without quantization, and only once your collection is large enough, experiment to see whether binary-quantized embeddings still give you good results.

Conclusions

  • Vector embeddings are powerful tools for representing and understanding complex data in machine learning applications. By capturing semantic similarities between items in a high-dimensional space, embeddings enable tasks such as recommendation, search, clustering, and classification to be performed efficiently and effectively.
  • Vector databases like Milvus provide efficient storage, fast retrieval, scalability, and support for approximate nearest neighbor search algorithms, making them indispensable tools for content-based similarity search and recommendation systems in various domains. They offer significant advantages over brute-force search methods, enabling organizations to build high-performance and scalable solutions for similarity-based tasks.
  • We learned how to interact with the existing Milvus setup within the PEACH environment: how to create a collection with a defined schema and indexes, add vectors with metadata, and query for similar vectors representing semantically similar content.