A vector store is a type of data store designed to store text and then, when given a new piece of text, find the stored texts that are most similar to it.
Vector stores are widely used in the field of NLP (Natural Language Processing). They allow us to search for text matches.
DISCLAIMER: The goal of this article is to explain vector databases to the average person. If you’re looking for a technically accurate description, see: Vector Database.
Why not a normal database?
A normal database stores data as either:
- Tables and columns for a relational database (like Microsoft SQL Server, PostgreSQL, MySQL etc.)
- JSON documents for a document database (like MongoDB, or Redis with its JSON support, etc.)
You can query normal databases via exact match queries:
SELECT name FROM Person WHERE birthDate = '2001-01-01';
What if you just have a document of text with the person’s birth date:
Jim lived a long life. He was born in 2001 on January the 1st.
And the user asked the question:
What is the date of birth for Jim?
This question would be really hard to answer in a normal database.
Vector databases are designed exactly for this kind of problem.
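To make the problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (Person, Document) and the column names are assumptions made up for this illustration. It shows the exact-match query finding Jim in a structured table, while a plain substring search for the user's wording finds nothing in the free-text document.

```python
import sqlite3

# In-memory database with one structured table and one free-text table.
# Table and column names are hypothetical, chosen for this example only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (name TEXT, birthDate TEXT)")
conn.execute("INSERT INTO Person VALUES ('Jim', '2001-01-01')")
conn.execute("CREATE TABLE Document (body TEXT)")
conn.execute(
    "INSERT INTO Document VALUES "
    "('Jim lived a long life. He was born in 2001 on January the 1st.')"
)

# Exact match on a structured column works.
structured = conn.execute(
    "SELECT name FROM Person WHERE birthDate = '2001-01-01'"
).fetchall()

# A substring search for the user's wording finds nothing, because the
# document says "born", not "date of birth".
free_text = conn.execute(
    "SELECT body FROM Document WHERE body LIKE '%date of birth%'"
).fetchall()

print(structured)  # one row containing 'Jim'
print(free_text)   # empty list
```

The document clearly contains the answer, but the query has no way to know that "born" and "date of birth" mean the same thing.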
How do Vector Databases work?
Vector databases take a given piece of text and:
- Tokenize it, that is, break it into smaller units called tokens. In this simplified picture, different forms of the same word are treated the same (e.g., “birth date” and “date of birth” will likely end up as the same tokens, “birth” and “date”; similarly “eating”, “ate” and “eat” will likely reduce to the same token, “eat”). The exact result depends on the tokenizer used and on any normalization (such as stemming or lemmatization) applied with it.
- For details on tokenization: What Is Tokenization?
- Tokens in the sentence are combined to create an “embedding”. An embedding is a vector representation of the tokenized sentence.
- Technically, embeddings capture the semantic meaning of words and the relationships between them. Words with similar meanings will have similar vectors, allowing models to generalize across different instances of a word or phrase.
- For more details on embeddings: Guide on Embeddings in NLP.
- Vector databases store these embedding vectors.
- When a text is supplied as a query, that text is also converted to an embedding vector.
- Finally the vector database can match the provided text to “similar” text by running similarity searches on the vectors stored in the vector databases.
- There are many similarity metrics that can be used to find similar text. A common one is cosine similarity (or its complement, cosine distance), which compares the direction of two vectors.
This is a highly simplified explanation of what vector databases do. For more details: Comprehensive Guide to Vector Databases.
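The steps above can be sketched in a few lines of Python. This is a toy under loud assumptions, not how a real vector database is built: the stop-word list and the BASE_FORMS table are hand-written for this example, and the "embedding" is just a count of tokens, whereas real systems use embeddings learned by a model. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and word-form table, hand-written for this toy.
STOP_WORDS = {"a", "an", "the", "of", "is", "was", "in", "on", "for", "what", "he"}
BASE_FORMS = {"born": "birth", "ate": "eat", "eating": "eat"}


def tokenize(text):
    """Lowercase, split into words, drop stop words, map words to base forms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [BASE_FORMS.get(w, w) for w in words if w not in STOP_WORDS]


def embed(text):
    """Toy 'embedding': a sparse vector of token counts (not a learned one)."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "Jim lived a long life. He was born in 2001 on January the 1st.",
    "The cat ate the fish.",
    "Vector stores are used in NLP.",
]

# "Store" each text together with its embedding vector.
store = [(text, embed(text)) for text in texts]


def most_similar(query):
    """Embed the query and return the stored text with the highest similarity."""
    query_vector = embed(query)
    return max(store, key=lambda item: cosine_similarity(query_vector, item[1]))[0]


print(most_similar("What is the date of birth for Jim?"))
```

With these assumptions the question shares the tokens "jim" and "birth" with the first text and nothing with the other two, so the first text is returned.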
Semantic Relationships
Vector databases also capture the semantic relationships between words that don’t share the same tokens.
Let’s start with the same text in the vector database:
Jim lived a long life. He was born in 2001 on January the 1st.
In the example above, “birth date” and “date of birth” would tokenize to the same tokens.
Now however let’s say the user asked the following question:
What is the age of Jim?
The text “age” would not tokenize to the same tokens as “birth date”.
Age is semantically related to birth date though. This will be represented in the embedding vector so the vector database will still match the second text to the first text.
This semantic matching is the superpower of vector databases.
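As a rough illustration of what "semantically related" means in vector terms, the sketch below uses three tiny embedding vectors. The numbers are invented by hand for this example (real embedding models learn vectors with hundreds or thousands of dimensions), and the phrase "pizza" is only there as an unrelated contrast.

```python
import math

# Hypothetical 3-dimensional embeddings, written by hand for illustration.
# Loosely, the dimensions stand for: [time-related, person-related, food-related].
EMBEDDINGS = {
    "age": [0.9, 0.4, 0.0],
    "birth date": [0.8, 0.5, 0.1],
    "pizza": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


age_vs_birth = cosine_similarity(EMBEDDINGS["age"], EMBEDDINGS["birth date"])
age_vs_pizza = cosine_similarity(EMBEDDINGS["age"], EMBEDDINGS["pizza"])

print(age_vs_birth)  # close to 1: the vectors point in nearly the same direction
print(age_vs_pizza)  # close to 0: the vectors are nearly unrelated
```

Even though "age" and "birth date" share no tokens, their vectors point in almost the same direction, which is why a similarity search can still connect them.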
Examples of Vector Databases
Depending on your needs there are three types of vector databases:
- Vector databases offered by normal database vendors. These are designed to integrate with your existing databases. These are great to begin with since you don’t need to switch database vendors.
- Examples are Elasticsearch, Redis, PostgreSQL and MongoDB.
- Dedicated vector databases. These are designed from scratch for vector database functionality. They offer more similarity metrics to use and advanced vector capabilities. If your use cases outgrow the first option, you can upgrade to these.
- Store vectors in a normal database or in memory. I don’t recommend this solution since you will have a hard time achieving the performance needed as your data size increases.
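To see why the last option struggles as data grows, here is a minimal sketch of an in-memory store. The class name InMemoryVectorStore and its methods are hypothetical, and the two-dimensional vectors are made up. The point to notice is that every search compares the query against every stored vector, so the work grows in step with the amount of data; purpose-built vector search typically relies on indexes that avoid scanning everything.

```python
import heapq
import math


class InMemoryVectorStore:
    """Hypothetical minimal store: keeps every vector in a Python list."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=1):
        """Return the k most similar texts by scanning every stored vector."""

        def score(item):
            _, vector = item
            dot = sum(q * v for q, v in zip(query_vector, vector))
            norm_q = math.sqrt(sum(q * q for q in query_vector))
            norm_v = math.sqrt(sum(v * v for v in vector))
            norms = norm_q * norm_v
            return dot / norms if norms else 0.0

        # nlargest calls score() once per stored item: a full linear scan.
        return [text for text, _ in heapq.nlargest(k, self._items, key=score)]


store = InMemoryVectorStore()
store.add("doc about birthdays", [0.9, 0.1])
store.add("doc about cooking", [0.1, 0.9])

results = store.search([0.8, 0.2], k=1)
print(results)
```

This is fine for a handful of documents, but with millions of vectors each query would repeat the same comparison millions of times.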
Benefits of Vector Databases
Vector databases allow us to store knowledge as documents of text and then, given a new piece of text such as a question, find all the stored knowledge that is related to it.
You can read Introducing the Knowledge Store to learn how vector stores are a key part of a knowledge store.