What Are Vector Embeddings?

Vector embeddings are a core concept in artificial intelligence: they transform complex data into a format that AI models can process. An embedding represents a word, phrase, image, or even an entire document as a multi-dimensional vector, which is essentially a point in a high-dimensional space.

If this sounds complicated, think of it like this: vector embeddings allow AI systems to turn messy, unstructured data into neat, structured numbers that can be analyzed more easily. This technique is used in natural language processing (NLP), image recognition, and many other AI applications to make sense of patterns in data.

In essence, vector embeddings are like AI’s version of translating language, images, or concepts into math, which makes it possible to compare and process complex data more efficiently.

How Do You Create Embeddings?

A neural network creates embeddings.

You can think of a neural network as a smart organizer trying to arrange a messy pile of items into neat, understandable categories.

These items could be anything: words, pictures, or sounds. The process it uses to do this is called representation learning. What it does is take something big and complicated—like a long sentence or a detailed image—and shrink it down to a simpler, more compact form, while still keeping the important details.

Let’s break this down step by step. At first, the neural network starts with raw data. For example, let’s say it’s dealing with words. The network doesn’t understand words directly, so it translates them into numbers (kind of like putting labels on different objects). These numbers are called embeddings. The trick is that these embeddings are not random—they’re designed to capture the essence of the word or image.

To make these embeddings meaningful, the network trains itself through practice. It starts by guessing how to categorize things and then checks how well it did. If it’s wrong, it adjusts its approach using something called backpropagation, which is just a fancy way of saying it learns from its mistakes and tries again. Over time, it gets better and better at understanding the relationships between things.

Try to picture it this way: imagine you’re trying to organize a bunch of books in a library. You start by randomly placing them on shelves, but as you read and understand them more, you start moving them around. Books on similar topics (like adventure stories) end up next to each other, while books on totally different topics (like cookbooks) are placed farther apart. The more you organize, the clearer the categories become.

In the neural network’s world, it’s doing something similar. Initially, it places things randomly, but through training, it starts grouping similar items together (like words that have similar meanings) and pushing unrelated items apart. Over time, these groups become more defined, making it easier for the network to recognize patterns, like how certain words or images should be categorized.
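
To make the "group similar items, push unrelated items apart" idea concrete, here is a minimal sketch. It is a deliberate simplification: there is no neural network, and the gradient steps are applied directly to four made-up 2-D vectors. The book names, the margin, and the learning rate are all illustrative assumptions.

```python
import math
import random

rng = random.Random(0)
books = ["adventure_1", "adventure_2", "cookbook_1", "cookbook_2"]
# Random 2-D starting positions, like books shelved at random.
emb = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in books]

similar = [(0, 1), (2, 3)]                      # pairs that should end up close
dissimilar = [(0, 2), (0, 3), (1, 2), (1, 3)]   # pairs that should stay apart
margin, lr = 2.0, 0.05                          # illustrative values

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

for _ in range(500):
    grad = [[0.0, 0.0] for _ in emb]
    # Similar pairs: loss = distance^2, so the gradient pulls them together.
    for i, j in similar:
        for d in range(2):
            g = 2 * (emb[i][d] - emb[j][d])
            grad[i][d] += g
            grad[j][d] -= g
    # Dissimilar pairs: loss = (margin - distance)^2 while closer than the margin,
    # so the gradient pushes them apart.
    for i, j in dissimilar:
        dist = distance(emb[i], emb[j])
        if 0 < dist < margin:
            for d in range(2):
                g = -2 * (margin - dist) * (emb[i][d] - emb[j][d]) / dist
                grad[i][d] += g
                grad[j][d] -= g
    # Gradient descent step: adjust every vector a little to reduce the loss.
    for i in range(len(emb)):
        for d in range(2):
            emb[i][d] -= lr * grad[i][d]

print("adventure_1 vs adventure_2:", round(distance(emb[0], emb[1]), 3))
print("adventure_1 vs cookbook_1:", round(distance(emb[0], emb[2]), 3))
```

In a real model, the same kind of correction flows back through the network's weights via backpropagation instead of being applied to the vectors directly.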

In the end, these embeddings help the network understand and process information faster and more accurately, whether it’s identifying objects in pictures or understanding the meaning behind words.

How Do Vector Embeddings Work?

Now that we know how neural networks create embeddings, let’s break down how these embeddings actually function.

Understanding the Vector Space

Picture vector embeddings as points in a multidimensional space, where the dimensions together capture features or attributes of the data (in learned embeddings, a single dimension rarely maps to one human-readable feature). For example, if you're embedding words, the space represents relationships between words, with similar words placed close together.

Think of it like placing cities on a map: cities that are close geographically (like New York and Boston) are near each other, while cities farther apart (like New York and Tokyo) have more distance between them. Embeddings work similarly, but instead of geographic distance, they capture semantic or structural similarities.

Distance Measures

How do we quantify similarity between vectors? This is where distance metrics come in. A couple of common ones include:

Euclidean Distance: This is the straight-line distance between two points. It's useful when you want a literal sense of how far two vectors are from each other in space.
Cosine Similarity: Instead of distance, cosine similarity measures the cosine of the angle between two vectors. If the angle is small (close to 0 degrees), the similarity is close to 1 and the vectors are considered more similar, even if their magnitudes (or lengths) differ.
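
As a rough illustration of the difference between the two measures, the sketch below compares two made-up vectors that point in the same direction but have different lengths. The function names and the vectors are illustrative choices, not part of any library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v_short = [1.0, 2.0]
v_long = [2.0, 4.0]   # same direction as v_short, twice the length

# The Euclidean distance is clearly non-zero (about 2.24) ...
print(round(euclidean_distance(v_short, v_long), 2))
# ... while the cosine similarity is (up to rounding) 1: same direction.
print(round(cosine_sim(v_short, v_long), 2))
```

This is why cosine similarity is a popular choice for text: it ignores how long a vector is and looks only at where it points.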

These distance measures help the system figure out relationships between different data points. For example, in a word embedding space:

The words "cat" and "dog" might be close together because they share a semantic relationship (both are animals).
Meanwhile, the word "car" would be farther from "cat" since they belong to different categories.

Capturing Meaning and Context

Embeddings are powerful because they capture subtle nuances. Consider these examples:

In a word embedding model, the words “king” and “queen” will be close together, as they share royal and human-related features.
If you perform arithmetic on vectors (e.g., king - man + woman), the result often lands close to the vector for queen. This shows how embeddings can preserve relationships like gender and hierarchy.
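
Here is a toy version of that arithmetic. The vectors are hand-made 2-D stand-ins (one dimension for "royalty", one for "gender"), which is an assumption for illustration only; real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors: [royalty, gender]. Not taken from any real model.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

# king - man + woman, computed dimension by dimension
result = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the nearest remaining word, skipping the three input words.
candidates = {word: vec for word, vec in vectors.items() if word not in {"king", "man", "woman"}}
nearest = min(candidates, key=lambda word: math.dist(result, candidates[word]))

print(result, nearest)  # [1.0, -1.0] queen
```

With learned embeddings the match is approximate rather than exact, so the result is usually looked up with a nearest-neighbor search like the one above.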

Let’s consider an example where words are represented in this vector space. Imagine three words: a laptop, a tablet, and a refrigerator. In the vector space:

The laptop might be represented by the vector [1.5, 0.8]
The tablet by [1.4, 0.9]
The refrigerator by [0.2, -1.3]

In this case, the vectors for the laptop and tablet are close to each other because they’re similar—they’re both electronic devices with similar features. On the other hand, the refrigerator’s vector is much farther away because it’s completely different in function and category.

Just like how products are grouped in a store based on their type, embeddings group similar items closer together in a way that reflects their relationships. The closer two items’ vectors are, the more related they are.
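
To check the claim with numbers, this short sketch measures the distances between the three example vectors above. Only the standard library is used; `math.dist` computes the Euclidean distance.

```python
import math

products = {
    "laptop": [1.5, 0.8],
    "tablet": [1.4, 0.9],
    "refrigerator": [0.2, -1.3],
}

for other in ("tablet", "refrigerator"):
    d = math.dist(products["laptop"], products[other])
    print(f"laptop vs {other}: {d:.3f}")
# laptop vs tablet: 0.141
# laptop vs refrigerator: 2.470
```

The laptop sits roughly seventeen times closer to the tablet than to the refrigerator, which matches the intuition that the first two are related products.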

Why Does This Matter?

Once data is embedded into this vector space, you can use it for various applications:

Chatbots: Tools like Chatbase use embeddings to provide more relevant answers by understanding the relationships between different words or phrases.
Search: If a user searches for “dog,” an embedding-based search engine can retrieve results for “puppy,” “canine,” and “pet” because they are close in the vector space.
Recommendation Systems: Embeddings allow systems to recommend similar content (movies, products, etc.) by finding items that are close in the embedding space.
Data Preprocessing Tools: Use embeddings in tasks such as language translation, sentiment analysis, or entity recognition to streamline and enhance the performance of your systems.

An Example in Action

Imagine you're using a music streaming service. The system has embeddings for different songs, capturing characteristics like genre, mood, and tempo. If you frequently listen to rock music with a fast tempo, the system will recommend songs that are nearby in the vector space—songs with similar rhythms, beats, or genres.
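
A minimal sketch of that recommendation logic might look like this. The song names and their three-number vectors (tempo, energy, acoustic feel) are invented for the example; a real service would learn much richer embeddings.

```python
import math

# Hypothetical song vectors: [tempo, energy, acoustic feel], all made up.
songs = {
    "fast_rock_1": [0.9, 0.8, 0.1],
    "fast_rock_2": [0.8, 0.9, 0.2],
    "punk_anthem": [0.95, 0.85, 0.05],
    "slow_ballad": [0.2, 0.3, 0.9],
    "ambient_piece": [0.1, 0.1, 0.8],
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def recommend(history, k=1):
    """Return the k unheard songs closest to the average of the listening history."""
    dims = len(songs[history[0]])
    taste = [sum(songs[name][d] for name in history) / len(history) for d in range(dims)]
    unheard = [name for name in songs if name not in history]
    return sorted(unheard, key=lambda name: cosine_sim(taste, songs[name]), reverse=True)[:k]

print(recommend(["fast_rock_1", "fast_rock_2"]))  # ['punk_anthem']
```

Averaging the listening history into a single "taste" vector is just one simple design choice; it is not the only way to build a recommender.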

In short, vector embeddings work by placing data points in a multidimensional space where distances between points capture their relationships.

A powerful use of embeddings is in applications that use retrieval-augmented generation (RAG). This approach blends the ability of large language models (LLMs) to generate content with the precision of retrieving relevant information. For example, in a support assistant, embeddings can help pull up relevant customer data, which the LLM can then use to generate a highly personalized and accurate response. This results in a smarter, more helpful interaction for the user.
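
The retrieval half of RAG can be sketched in a few lines. Everything here is hypothetical: the support snippets, their 3-D vectors, and the prompt wording are placeholders, and the call to an actual LLM is left out.

```python
import math

# Hypothetical support snippets paired with made-up 3-D embeddings.
knowledge_base = [
    ("Refunds are processed within 5 business days.", [0.9, 0.1, 0.0]),
    ("Reset your password from the account settings page.", [0.1, 0.9, 0.1]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
]

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vector, k=1):
    """Return the k snippets whose embeddings are most similar to the query."""
    ranked = sorted(knowledge_base, key=lambda item: cosine_sim(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, query_vector, k=1):
    """Combine the retrieved snippets and the question into one prompt string."""
    context = "\n".join(f"- {text}" for text in retrieve(query_vector, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# In a real system the query vector would come from the same embedding model
# that produced the document vectors; here it is written by hand.
prompt = build_prompt("How long do refunds take?", [0.8, 0.2, 0.1])
print(prompt)
```

The resulting prompt is what would be sent to the language model, so the model answers from retrieved facts rather than from memory alone.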

Creating Your First Embeddings: A Step-by-Step Guide

We’ve talked a lot about the theory behind embeddings, but now it’s time to dive into a practical example to see embeddings in action. In this guide, we’ll create a simple chatbot that compares the lyrics of songs using embeddings. The idea is to see how close or similar song lyrics are, based on the way the embeddings capture their semantic meaning. We’ll start with just four song lyrics for demonstration purposes.

Let’s jump into how to embed these song lyrics and compare them using Hugging Face's Inference API.

Embedding Song Lyrics

Our goal is to embed these song lyrics, transforming them into numerical representations so that the chatbot can determine how similar they are. Each song lyric will be turned into a vector of numbers, where similar lyrics will have embeddings that are close together in space. For this demonstration, we will use a pre-trained model from Hugging Face called "sentence-transformers/all-MiniLM-L6-v2," which is well-suited for text-based embeddings.

Here’s a quick overview of the steps we’ll take:

Select a Pre-trained Model: We’ll pick a Sentence Transformers model that can generate embeddings for text and call it through the Inference API.
Embed the Lyrics: We’ll send our song lyrics through the model and generate embeddings for each.
Compare the Lyrics: Once we have embeddings, we’ll use similarity measures to find out how closely related each song lyric is to the others.

To get started, let’s begin by embedding the lyrics.

Preparing the Model and API Token

First, you’ll need a Hugging Face write token, which you can create by logging into Hugging Face and generating it in your Account Settings. This token will be used to authenticate our requests to the API.

snippet 1

model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "your_token_here"  # replace with your own token

# Feature-extraction endpoint of the Inference API for the chosen model
api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

Song Lyrics for Embedding

We’ll use the following four songs for this demonstration:

Song A: "Here comes the sun, shining bright on a brand-new day. The clouds have rolled away, and the skies are clear. Birds are singing sweet melodies, and I feel a warm embrace. It’s a beautiful day, and everything seems alright."
Song B: "We’re on a road to nowhere, driving through the open plains. The wind is in our hair, and the horizon seems endless. Every mile we go, the world feels wide and free. Join me on this journey, let’s see where it leads."
Song C: "I can see clearly now, the rain has finally stopped. The sun is casting golden rays, painting everything with light. The streets are sparkling clean, and the sky is a brilliant blue. It’s a perfect day to start anew, with hope and joy ahead."
Song D: "Is this the real life, or just a dream that’s passing by? We’re floating through the clouds, lost in a world of wonder. Reality and fantasy are intertwined in this magical moment. Let’s savor the illusion while it lasts, and embrace the mystery."

We’ll take these lyrics and run them through the model to generate their embeddings.

snippet 2

# Example song lyrics (Songs A to D, in order)
texts = [
    "Here comes the sun, shining bright on a brand-new day. The clouds have rolled away, and the skies are clear. Birds are singing sweet melodies, and I feel a warm embrace. It's a beautiful day, and everything seems alright.",
    "We're on a road to nowhere, driving through the open plains. The wind is in our hair, and the horizon seems endless. Every mile we go, the world feels wide and free. Join me on this journey, let's see where it leads.",
    "I can see clearly now, the rain has finally stopped. The sun is casting golden rays, painting everything with light. The streets are sparkling clean, and the sky is a brilliant blue. It's a perfect day to start anew, with hope and joy ahead.",
    "Is this the real life, or just a dream that's passing by? We're floating through the clouds, lost in a world of wonder. Reality and fantasy are intertwined in this magical moment. Let's savor the illusion while it lasts, and embrace the mystery.",
]

And here’s the function to generate the embeddings:

snippet 3

import requests

# Function to generate embeddings
def query(texts):
    # "wait_for_model" asks the API to wait for the model to load instead of failing right away
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options": {"wait_for_model": True}})
    return response.json()

# Generate embeddings
output = query(texts)

Once the embeddings are generated, the API will return a list containing one vector per song lyric. These vectors represent the lyrics in numerical form, allowing us to compare their semantic similarity. The output will look something like this, although each real vector is much longer (384 values for this model):

snippet 4

[[0.051, -0.123, 0.321, -0.045, 0.056, -0.234, 0.098, 0.221, -0.087, 0.134, ...],
 [-0.032, 0.089, -0.145, 0.212, -0.076, 0.029, -0.061, 0.144, -0.093, -0.065, ...],
 [0.112, -0.084, 0.278, -0.091, 0.067, -0.223, 0.102, 0.169, -0.122, 0.087, ...],
 [-0.072, 0.045, -0.092, 0.139, -0.053, 0.111, -0.042, 0.198, -0.083, -0.070, ...]]
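
Before going further, it can help to confirm that the response really is a list of vectors. The helper below is a hypothetical sanity check, not part of any library; it assumes that a failed request comes back as a JSON object (for example one carrying an "error" field) rather than a list.

```python
def check_embeddings(output, expected_count):
    """Return the vector length if `output` looks like a valid list of embeddings."""
    if isinstance(output, dict):
        # Assumption: failures arrive as a JSON object instead of a list of vectors.
        raise RuntimeError(f"API returned an error payload: {output}")
    if not isinstance(output, list):
        raise ValueError(f"Expected a list of vectors, got {type(output).__name__}")
    if len(output) != expected_count:
        raise ValueError(f"Expected {expected_count} vectors, got {len(output)}")
    lengths = {len(vector) for vector in output}
    if len(lengths) != 1:
        raise ValueError(f"Vectors have inconsistent lengths: {sorted(lengths)}")
    return lengths.pop()

# Stand-in for a real API response: two short made-up vectors.
sample_output = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_embeddings(sample_output, expected_count=2))  # 3
```

With the real response you would call `check_embeddings(output, len(texts))`, which should report 384 for the model used here.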

In the next step, we’ll upload these embeddings and use them to compare the song lyrics.

Processing the Embeddings

Once you have received the embeddings from the Hugging Face Inference API, the output will be a list of lists. Each inner list is the embedding vector for the corresponding song lyric. Here's how to convert this output to a Pandas DataFrame for easier handling and analysis:

snippet 5

import pandas as pd

# Convert the list of embeddings to a DataFrame (one row per lyric, one column per dimension)
embeddings_df = pd.DataFrame(output)

You can inspect the DataFrame to ensure it has been created correctly and to understand its structure.

snippet 6

print(embeddings_df.head())  # Display the first few rows of the DataFrame

To store the embeddings for later use or for uploading to Hugging Face Hub, save the DataFrame as a CSV file.

snippet 7

embeddings_df.to_csv("embeddings.csv", index=False)
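
If you want to confirm that the CSV format preserves the vectors, a quick round trip is enough. This sketch uses two short made-up vectors and an in-memory buffer in place of the real output and the "embeddings.csv" file, so it runs without calling the API.

```python
import io

import pandas as pd

# Stand-in for the API output: two short made-up vectors.
sample_output = [[0.051, -0.123, 0.321], [-0.032, 0.089, -0.145]]

buffer = io.StringIO()  # in-memory stand-in for "embeddings.csv"
pd.DataFrame(sample_output).to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.shape)  # (rows, columns) = (number of lyrics, vector length)

# Back to a plain list of lists, ready for similarity calculations.
vectors = restored.to_numpy().tolist()
```

Because `index=False` was used when saving, the file holds only the vector values plus a header row of column numbers, which `read_csv` treats as column names.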

Upload the CSV File to Hugging Face Hub:

Steps:

Go to the Hugging Face Hub and log in.
Create a new dataset and upload the CSV file.
Commit the changes to host the dataset on the Hub.

Comparing Song Lyrics

After hosting the embeddings, you can retrieve them from the Hub and compare a new query (song lyric) against them to find the most similar one. Start with a function that downloads the stored embeddings:

snippet 8

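import io
import pandas as pd

def retrieve_embeddings():
    # The CSV was uploaded to a dataset repository, so download it from there.
    # Placeholder path: replace your_username/your_dataset with your own dataset ID.
    retrieve_url = "https://huggingface.co/datasets/your_username/your_dataset/resolve/main/embeddings.csv"
    response = requests.get(retrieve_url, headers=headers)
    if response.status_code == 200:
        # Parse the CSV text back into a list of embedding vectors
        return pd.read_csv(io.StringIO(response.text)).values.tolist()
    else:
        print(f"Error retrieving embeddings: {response.status_code} - {response.text}")
        return None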

Convert Query Text to Embedding:

Use the same model to convert the new query (song lyric) into an embedding.

snippet 9

query_text = ["A sunny day with clear skies and cheerful birds."]
query_embedding = query(query_text)[0]  # the API returns one vector per input text

Compare Embeddings:

Calculate the similarity between the query embedding and each stored embedding using cosine similarity.

snippet 10

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def compare_query(query_embedding):
    stored_embeddings = retrieve_embeddings()
    if stored_embeddings is None:
        return None

    # Convert stored embeddings to numpy array for similarity computation
    stored_embeddings_array = np.array(stored_embeddings)

    # Calculate cosine similarity
    similarities = cosine_similarity([query_embedding], stored_embeddings_array)

    # Find the index of the most similar embedding
    most_similar_index = np.argmax(similarities)
    return most_similar_index

most_similar_index = compare_query(query_embedding)
print(f"The most similar song lyric is at index: {most_similar_index}")

Interpret Results:

Use the index to retrieve the closest song lyric.

snippet 11

similar_lyric = texts[most_similar_index]
print(f"The most similar song lyric is: {similar_lyric}")
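
Sometimes the single best match is less informative than the full ranking. As a rough sketch, the function below scores every stored vector against the query and sorts the results. The 3-D vectors are made-up stand-ins for the real embeddings, and `rank_by_similarity` is an illustrative name, not a library function.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rank_by_similarity(query_vector, stored_vectors):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_sim(query_vector, vec)) for i, vec in enumerate(stored_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up 3-D stand-ins for the four lyric embeddings (Songs A to D) and the query.
toy_embeddings = [
    [0.9, 0.1, 0.1],  # Song A
    [0.1, 0.9, 0.1],  # Song B
    [0.7, 0.2, 0.4],  # Song C
    [0.2, 0.3, 0.9],  # Song D
]
toy_query = [0.9, 0.1, 0.15]

for index, score in rank_by_similarity(toy_query, toy_embeddings):
    print(f"Song {'ABCD'[index]}: {score:.3f}")
```

With the real data you would pass `query_embedding` and the stored embeddings instead of the toy values.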

By following these steps, you will be able to effectively use embeddings to compare and find the most similar song lyrics or any other text data.

This guide has given you a basic introduction to vector embeddings and how they can be used to create smarter systems, like chatbots that understand relationships between song lyrics. However, building embeddings and handling vector spaces manually can get complex, especially if you’re looking for a streamlined solution for your retrieval-augmented generation (RAG) tasks.

If you'd rather skip the technical work and focus on results, Chatbase is your go-to platform. Chatbase uses more advanced embedding tools to vectorize your data and serves those results seamlessly through an API or an embeddable chatbot that you can integrate directly into your website.

Ready to make your data smarter and more actionable? Sign up for Chatbase today and experience how effortlessly you can manage your RAG needs!