AI Primer for Business People: What the heck is a Vector Store?

Chetan Kumar
Jan 23, 2024

Quick Answer: A vector store is a data structure or database that stores and manages vectors, which are mathematical representations of objects or data points used in machine learning and information retrieval.

Imagine you have a set of documents, perhaps PDFs, related to a deal or contract, and you would like to use AI to extract information from them.

Since an LLM can only work with a limited amount of text at a time (roughly 4,000 words for many models), we cannot give it an entire 50,000-word document. This means we need to feed it only the relevant sections and paragraphs that may contain the answer to our question.
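To make the size limit concrete, here is a minimal sketch that splits a document into paragraphs and checks whether a piece of text fits within a word budget. The function names and the blank-line splitting rule are assumptions made for illustration, and real models measure their limits in tokens rather than words.

```python
MAX_WORDS = 4000  # rough limit quoted in the article; real limits vary by model


def split_into_paragraphs(document: str) -> list[str]:
    """Split a document on blank lines and drop empty pieces."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def fits_in_one_request(text: str, limit: int = MAX_WORDS) -> bool:
    """True if the text is short enough to send to the LLM in one go."""
    return word_count(text) <= limit


# Hypothetical, heavily shortened sample document.
sample = (
    "Warranty Duration\n\n"
    "The Vendor warrants the goods for twelve months.\n\n"
    "Governing Law"
)
paragraphs = split_into_paragraphs(sample)
print(len(paragraphs), fits_in_one_request(sample))
```

A 50,000-word contract would fail the `fits_in_one_request` check, which is why the rest of the pipeline works paragraph by paragraph.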

The first step in the process is figuring out which paragraphs and sections are most likely to contain the information that is relevant to our question.

Let's take the example of a common scenario. You're onboarding a new client or vendor, and you have a deal document to vet.

The document has the following sections.

Introduction/Parties
Description of Goods or Services
Pricing and Payment Terms
Delivery and Acceptance
Quality Assurance and Compliance
Warranties and Representations
Confidentiality Clause
Liability and Indemnification
Term and Termination
Dispute Resolution
Force Majeure
Miscellaneous Provisions
Governing Law
Signatures

Let's say you want to ask a question related to warranties:

"How long does the warranty last? What are the terms for extending or renewing it?"

As a human, you would find the relevant section title and read through until you get the answer to your question. Similarly, the first step in the AI insight pipeline is to narrow the document down to a set of paragraphs that most probably contain the answer.

In the document, this information may be spread out among various disconnected paragraphs.

Warranty Duration 
"The Vendor warrants that all goods and services provided under this Agreement shall be free from material defects in materials and workmanship for a period of twelve (12) months from the date of delivery to the Buyer. This warranty shall expire at the end of the said twelve-month period, unless otherwise specified herein."

...followed by some other paragraphs, and then:

Conditions for Warranty Validity 
"For the warranty to remain valid, the Buyer must use and maintain the goods or services in accordance with the Vendor's instructions and must not modify, misuse, or subject them to unauthorized repairs. Failure to adhere to these conditions will void the warranty."

So we need a mechanism where we feed in the question ("How long does the warranty last? What are the terms for extending or renewing it?") and it returns the paragraphs above.

In the past, we did this with keyword searching, similar to what Google did: extract the terms "Warranty", "Extending", "Renewing", "Terms", etc. from the question, and then scan the document for sentences and paragraphs that contain those keywords.
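The keyword approach can be sketched in a few lines. This is only an illustration: the function name and the shortened sample paragraphs are made up, and real search engines do considerably more (stemming, ranking, and so on).

```python
import re


def keyword_search(paragraphs: list[str], keywords: list[str]) -> list[str]:
    """Return the paragraphs that contain at least one keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for paragraph in paragraphs:
        words = set(re.findall(r"[a-z]+", paragraph.lower()))
        if words & wanted:  # any keyword present?
            hits.append(paragraph)
    return hits


# Shortened, hypothetical versions of the contract paragraphs.
paragraphs = [
    "The Vendor warrants that all goods shall be free from material defects for twelve (12) months.",
    "For the warranty to remain valid, the Buyer must maintain the goods as instructed.",
    "This Agreement is governed by the laws of the State.",
]
hits = keyword_search(paragraphs, ["warranty", "extending", "renewing"])
print(hits)
```

Notice that the first paragraph is missed because it says "warrants" rather than "warranty". Exact keyword matching is brittle in exactly this way.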

Now that LLMs have a better understanding of language, we can do better. Embedding vectors are a kind of mathematical construct that allows more meaning to be captured for each word. For example, the words "king" and "queen" are completely disconnected from each other as keywords, but when written as vectors they are very similar.

Vectors are also better at contextual meaning. If you search for "Where is the nearest bank from my location, I need to make a withdrawal", it will match more closely with paragraphs that contain references to a bank as a financial institution, rather than to "river banks", because of the added context "I need to make a withdrawal".

This advantage makes it better to index the document's paragraphs as vectors rather than as keywords.

A keyword index is modelled on the one that you find at the back of a book. It's a database of words and where in the document they are located.
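As a rough sketch, a back-of-the-book style index can be built as a dictionary that maps each word to the paragraph numbers where it appears. The function name and the sample paragraphs are assumptions for illustration.

```python
import re
from collections import defaultdict


def build_keyword_index(paragraphs: list[str]) -> dict[str, list[int]]:
    """Map each word to the (1-based) numbers of the paragraphs containing it."""
    index = defaultdict(list)
    for number, paragraph in enumerate(paragraphs, start=1):
        # set() so a word repeated in one paragraph is listed only once
        for word in set(re.findall(r"[a-z]+", paragraph.lower())):
            index[word].append(number)
    return dict(index)


# Hypothetical sample paragraphs.
paragraphs = [
    "Warranty Duration: twelve months from delivery.",
    "Payment is due within thirty days of delivery.",
    "The warranty is void after unauthorized repairs.",
]
index = build_keyword_index(paragraphs)
print(index["warranty"])  # paragraphs that mention "warranty"
print(index["delivery"])  # paragraphs that mention "delivery"
```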

Our vector index is similar, but instead of the words, we store the mathematical vector for the word, along with the page/section/line number where those words were found.

A vector is a list of around 1,500 numbers, and it's not important to go into how these are calculated or what the numbers mean. All you need to understand is that the numbers have a special property: words with similar meanings have similar vectors, and words typically found together are, as vectors, similar or closely grouped.
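One common way to measure how "similar" two vectors are is cosine similarity, which is close to 1 when two vectors point in the same direction and close to 0 when they are unrelated. The sketch below uses made-up three-number vectors purely for illustration; real embedding vectors are much longer and come from a model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors, NOT real embeddings.
king = [0.90, 0.80, 0.10]
queen = [0.85, 0.82, 0.15]
river = [0.10, 0.05, 0.95]

print(cosine_similarity(king, queen))  # high: similar meaning
print(cosine_similarity(king, river))  # low: unrelated meaning
```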

In the past, we had text search engines, which worked on keywords. Examples include Apache Solr and Elasticsearch.

With the onset of LLMs, we need a new "database" of sorts to store these vectors. These are vector databases.

A vector database:

Efficiently stores the vector representations of the words.
Has a query mechanism to search for similar vectors, given a question vector or word.
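Those two jobs, storing and querying, can be sketched as a tiny in-memory class. This is a toy under stated assumptions: the class name, the record fields, and the three-number vectors are all invented, and it compares the query against every stored vector, whereas real vector databases use specialised indexes to stay fast at scale.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    vector: list[float]
    text: str
    location: str  # e.g. section and paragraph number


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class TinyVectorStore:
    def __init__(self):
        self.records: list[Record] = []

    def add(self, vector, text, location):
        """Store a vector together with its text and where it was found."""
        self.records.append(Record(vector, text, location))

    def query(self, vector, top_k=2):
        """Return the top_k records most similar to the given vector."""
        ranked = sorted(
            self.records,
            key=lambda r: cosine_similarity(vector, r.vector),
            reverse=True,
        )
        return ranked[:top_k]


store = TinyVectorStore()
store.add([0.9, 0.1, 0.0], "Goods are warranted for twelve months from delivery.", "Warranties, paragraph 1")
store.add([0.1, 0.9, 0.0], "Invoices are payable within thirty days.", "Pricing and Payment Terms, paragraph 2")
store.add([0.0, 0.1, 0.9], "Disputes go to arbitration.", "Dispute Resolution, paragraph 1")

question_vector = [0.8, 0.2, 0.1]  # pretend this came from embedding a warranty question
results = store.query(question_vector, top_k=1)
print(results[0].location)
```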

Now, going back to the example of the PDF deal document, we would calculate the vector associated with each sentence or paragraph (OpenAI has an API that does this very cheaply) and store it in the vector database, along with information about where it's located in the document.

When we want to ask a question, we first give the question to the vector database, which finds the most similar paragraph vectors and returns those paragraphs and their locations to you.

You can then send these highly relevant paragraphs to the LLM, along with your question, and the LLM would be happy to give you a very good answer.
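Putting the steps together, the flow looks roughly like the sketch below. Everything here is an assumption made for illustration: `toy_embed` is a crude stand-in that counts words from hand-picked topic lists (a real system would call an embedding model instead), the paragraphs are shortened, and the final prompt wording is invented. The sketch stops at building the prompt and does not call an LLM.

```python
import math
import re

# Hand-picked word groups standing in for "meaning". NOT a real embedding model.
TOPICS = {
    "warranty": {"warranty", "warrants", "defects", "repairs"},
    "payment": {"payment", "invoice", "price", "fees"},
    "legal": {"law", "governed", "jurisdiction", "courts"},
}


def toy_embed(text: str) -> list[float]:
    """Toy embedding: one number per topic, counting that topic's words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(sum(w in vocab for w in words)) for vocab in TOPICS.values()]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def retrieve(question, indexed, top_k=2):
    """Rank stored (location, text, vector) entries by similarity to the question."""
    q = toy_embed(question)
    ranked = sorted(indexed, key=lambda e: cosine_similarity(q, e[2]), reverse=True)
    return [(location, text) for location, text, _ in ranked[:top_k]]


def build_prompt(question, hits):
    """Combine the retrieved paragraphs and the question into one LLM prompt."""
    context = "\n\n".join(f"[{location}] {text}" for location, text in hits)
    return f"Answer using only the excerpts below.\n\n{context}\n\nQuestion: {question}"


document = [
    ("Warranties, para 1", "The Vendor warrants that all goods shall be free from material defects for twelve (12) months."),
    ("Warranties, para 4", "For the warranty to remain valid, the Buyer must not subject the goods to unauthorized repairs."),
    ("Pricing and Payment Terms, para 2", "Payment of each invoice is due within thirty days."),
    ("Governing Law, para 1", "This Agreement is governed by the law of the Buyer's jurisdiction."),
]
# Step 1: embed every paragraph and keep its location alongside the vector.
indexed = [(location, text, toy_embed(text)) for location, text in document]

# Step 2: embed the question, find the closest paragraphs, build the prompt.
question = "How long does the warranty last? What are the terms for extending or renewing it?"
hits = retrieve(question, indexed)
prompt = build_prompt(question, hits)
print(prompt)
```

Unlike the plain keyword search, the paragraph that says "warrants" is retrieved here, because the toy embedding places it in the same topic as "warranty".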

Now which Vector database should you choose? It's best to leave it to your technical team, but you should know some basics.

Here is a list of the current Vector Databases with some pros and cons.

Milvus:

Pros:

High performance and scalability for vector similarity search.
Supports multiple index types for efficient querying.
Easy integration with machine learning models and frameworks.

Cons:

Relatively new in the market, so might lack some advanced features.
Community and support are growing but not as extensive as some established databases.

Pinecone:

Pros:

Managed service, easy to set up and use.
Good for real-time vector search applications.
Scalable and supports large-scale data.

Cons:

Being a managed service, it may offer less control over certain configurations.
Cost can be higher compared to self-hosted solutions.

FAISS (Facebook AI Similarity Search):

Pros:

Highly efficient for similarity search in large datasets.
Open-sourced by Facebook, with strong community support.
Good integration with Python and machine learning libraries.

Cons:

Primarily a library, not a full-fledged database, so requires additional setup for database-like functionality.
Less out-of-the-box functionality compared to other vector databases.

Weaviate:

Pros:

Combines full-text search with vector search.
GraphQL-based query language for easy data retrieval.
Supports semantic search and automatic classification.

Cons:

Being relatively new, might lack some advanced features.
Smaller community and ecosystem compared to more established databases.

ChromaDB:

Pros:

Designed specifically for handling high-dimensional vector data, making it suitable for AI and ML applications.
Offers efficient and scalable similarity search capabilities.
Can handle large-scale datasets and is optimized for quick query responses.

Cons:

As a more specialized database, it might have a steeper learning curve compared to more general-purpose databases.
The community and ecosystem might be smaller compared to more established databases like Elasticsearch, which could affect the availability of resources and support.
Depending on its release and development cycle, it may not have as comprehensive a feature set as more mature products in the field.

Core Maitri is an enterprise software consultancy specializing in Excel-to-Web, AI Integration, Custom Software Programming, and Enterprise Application Development services. Our approach is deeply consultative, rooted in understanding problems at their core and then validating our solutions through iterative feedback.

If you're interested in exploring how AI can benefit your business, contact us to find out more!
