- Oct 31, 2023

Understanding RAG, without complex math.

Updated: Nov 28, 2023

If you're active in the AI space, you have probably come across the term RAG, which stands for Retrieval-Augmented Generation. In this article I'll help you understand what RAG is, and how and when you should use it.

Use Case

As a user, you have a corpus of documents, and you want to use AI to answer questions about them. Prompt engineering allows you to ask the following:

You are a helpful assistant who is tasked with reading text and answering questions.
Here is the text:
<Insert document text>

Here is the question I would like to ask:
<Insert question>

You can copy the above prompt, paste it into ChatGPT or Bard, and replace the placeholders with the actual document text and your question, and it will work.
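If you would rather fill in the placeholders programmatically, here is a minimal sketch. The function name build_prompt and the sample sentence are illustrative assumptions, not part of any library; the template text is the prompt shown above.

```python
# The prompt from the article, with the two placeholders as format fields.
PROMPT_TEMPLATE = """You are a helpful assistant who is tasked with reading text and answering questions.
Here is the text:
{document_text}

Here is the question I would like to ask:
{question}"""


def build_prompt(document_text: str, question: str) -> str:
    """Fill the two placeholders of the prompt template (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(document_text=document_text, question=question)


# Stand-in document text; in practice this would be your own document.
prompt = build_prompt(
    document_text="Hurricane Willa made landfall in Mexico in October 2018.",
    question="Where did Hurricane Willa make landfall?",
)
print(prompt)
```

The resulting string is what you would paste into the chat window or send to a model.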

Here is a link to a chat asking about Hurricane Willa using the above prompt.

https://chat.openai.com/share/377a0986-d70b-40ad-a5b8-112c46293d8e

Problem Statement

Large language models have an important limitation: the amount of input they can accept. For example, GPT-3.5 can only take around 4,000 tokens, and what's worse is that this limit is shared between the input and the output. So if your input is 2,000 tokens, your output cannot exceed 4000 - 2000 = 2000 tokens. If your input is 3,500 tokens, your output cannot exceed 4000 - 3500 = 500 tokens. (A token is a piece of text, roughly 4 characters of English on average, so usually a bit shorter than a word.)
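The budget arithmetic can be sketched in a few lines. This is only an approximation: the 4,000 limit is the round figure used in this article, the "4 characters per token" rule is a rough heuristic for English text, and both function names are made up for illustration.

```python
CONTEXT_LIMIT = 4000  # round figure from the article; shared by input and output


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def max_output_tokens(input_tokens: int, context_limit: int = CONTEXT_LIMIT) -> int:
    """Tokens left for the answer once the input has been counted."""
    return max(0, context_limit - input_tokens)


print(max_output_tokens(2000))  # 2000
print(max_output_tokens(3500))  # 500
print(max_output_tokens(5000))  # 0, the input alone is already too large
```

For exact counts you would use the model's real tokenizer rather than a character-based estimate.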

GPT-4 is somewhat better with around 8,000 tokens, and some models can accept 32,000 tokens. But larger inputs generally make the model slower to respond.

Now if you have a large PDF document, it will almost certainly exceed 4,000 tokens. So you can't "fit" the entire document in the prompt and say "I want to know the summary of this book".

Solution

Now what if we were to break the document into chunks? Sections and paragraphs, for example, can be chunks. Each page might yield 3-4 chunks, each approximately 200-400 tokens long.

So a typical document may have anywhere from 50 to 1,000 chunks.
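As a rough illustration of chunking, the sketch below packs paragraphs (separated by blank lines) into chunks of at most about 400 tokens. The function name chunk_document, the blank-line splitting rule, and the "4 characters per token" estimate are all assumptions made for this example; real pipelines often split on sections or sentences instead.

```python
def chunk_document(text: str, max_tokens: int = 400) -> list[str]:
    """Greedily pack paragraphs into chunks of at most ~max_tokens tokens."""
    max_chars = max_tokens * 4  # rough heuristic: ~4 characters per token
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate  # the paragraph still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; a paragraph longer than the limit is kept whole.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Ten synthetic paragraphs of roughly 128 tokens each.
document = "\n\n".join(f"Paragraph {i}. " + "word " * 100 for i in range(10))
chunks = chunk_document(document, max_tokens=400)
print(len(chunks), [len(c) // 4 for c in chunks])
```

Note that this sketch never splits a single paragraph, so an unusually long paragraph would produce a chunk above the limit.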

Now, when we need to ask a question, we can insert only the "relevant chunks" into the prompt and get our answer.

Example:

You are a helpful assistant who is tasked with reading text and answering questions.
Here is the text:
<Chunk-1>
<Chunk-2>
<Chunk-3>

Here is the question I would like to ask:
<Insert question>

But how do we make sure that the chunks that we have selected actually contain the answer to our question?

Enter embeddings.

Embeddings

Embeddings are by-products of training large language models. Like the valuable husk that remains after sugar extraction, they have use cases of their own.

Let's understand what embeddings are. Machine learning models are computer algorithms, and computers don't understand anything except 0s and 1s. Everything needs to be converted to a "number".

So a sentence like "I took my cat to the vet" needs to become a sequence of numbers like [1223, 342, 34, 7635, 23, 37, 5723] before an algorithm can process it. And that's what early machine learning algorithms did. They took the dictionary and associated a number against each word.

Word to Number Mapping

I => 1223
took => 342
my => 34
cat => 7635
to => 23
the => 37
vet => 5723
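In code, this kind of word-to-number mapping is just a dictionary lookup. The sketch below uses the numbers from the mapping above; the function name encode is illustrative.

```python
# Word-to-number mapping taken from the article's example.
vocabulary = {
    "I": 1223,
    "took": 342,
    "my": 34,
    "cat": 7635,
    "to": 23,
    "the": 37,
    "vet": 5723,
}


def encode(sentence: str) -> list[int]:
    """Replace each whitespace-separated word with its number."""
    return [vocabulary[word] for word in sentence.split()]


print(encode("I took my cat to the vet"))
# [1223, 342, 34, 7635, 23, 37, 5723]
```

Each word always gets the same number here, no matter what surrounds it.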

But alas, it did not work out very well. The word "bank", for example, can be a river bank or a money bank, and depending on the context it can mean two different things.

A single number cannot represent a word. A word is too rich in meaning. It can be a noun, a verb, or a pronoun. It can be past, present, or future. It can be happy or sad.

So instead of a single number for each word, we need multiple numbers. Let's look at a very crude example:

Word	Type	Mood	Tense
Laughing	verb	Happy	Present
Cried	verb	Sad	Past
Running	verb	Neutral	Present
Cat	noun	Happy	None
I	pronoun	Neutral	None

The benefit of the above is that now we are capturing the meaning and the richness of the word more effectively.

So instead of a single number, we express each word as a sequence of attributes:

Word = [Type, Mood, Tense]

So Laughing = [verb, happy, present]

We can add more attributes, such as:

Gender: (King = Male, Prince = Male, Princess = Female, Queen = Female)
Socio-Economic Standing: (Peasant = Poor, Lord = Rich)
Ethical Standing: (Thief = Bad, Farmer = Good, Robber = Bad, Murderer = Bad)
Royalty: (King = Royal, Queen = Royal, Peasant = Common)

Instead of mapping a whole word to a number, we can map each value of an "attribute" to a number. For example, let's take two attributes, Gender and Royalty.

Gender:

Gender-Neutral => 0
Male => 1
Female => 2

Royalty:

Neutral => 0
Royal => 1
Common => 2

So the word "king", expressed as the list of attributes [gender, royalty], becomes [Male, Royal] or [1,1].

This number sequence of attributes [1,1] is called a "vector". It's a list of numbers. The length of this list is as long as the number of attributes we choose, where each number in the vector represents an attribute.
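Here is a small sketch of turning attribute values into a vector using the Gender and Royalty codes above. The attribute values for "king" and "queen" come from the lists earlier in the article; treating "peasant" as gender-neutral is an assumption made only for this example, and the names to_vector and WORD_ATTRIBUTES are illustrative.

```python
# Number codes for each attribute value, as listed in the article.
GENDER = {"neutral": 0, "male": 1, "female": 2}
ROYALTY = {"neutral": 0, "royal": 1, "common": 2}

# (gender, royalty) per word; "peasant" as gender-neutral is an assumption.
WORD_ATTRIBUTES = {
    "king": ("male", "royal"),
    "queen": ("female", "royal"),
    "peasant": ("neutral", "common"),
}


def to_vector(word: str) -> list[int]:
    """Express a word as the vector [gender, royalty]."""
    gender, royalty = WORD_ATTRIBUTES[word]
    return [GENDER[gender], ROYALTY[royalty]]


print(to_vector("king"))     # [1, 1]
print(to_vector("queen"))    # [2, 1]
print(to_vector("peasant"))  # [0, 2]
```

Adding a third attribute would simply make every vector one number longer.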

If we do this correctly, we begin to get a very interesting result. Vectors can be added and subtracted element by element, much the way ordinary numbers are:

Example:

[1,1,1] + [2,2,2] = [3,3,3]

[4,3,2] - [1,1,1] = [3,2,1]

What's interesting about that?

Let's take 4 words. King, Queen, Man, Woman.

The number sequence or vectors for king, queen, man and woman will start to share interesting "mathematical" properties.

king - man + woman = queen 
king - man = queen - woman

In plain words:

# If I remove "man" from king, and add "woman" what will I get? Queen
# What man is to king, woman is to queen
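The analogy can be checked with the toy [gender, royalty] encoding from earlier (Male = 1, Female = 2, Royal = 1, Common = 2). Encoding "man" and "woman" as Common is an assumption made for this sketch. With real embeddings the equality holds only approximately; in this hand-built example it happens to be exact.

```python
def add(a: list[int], b: list[int]) -> list[int]:
    """Element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]


def subtract(a: list[int], b: list[int]) -> list[int]:
    """Element-wise vector subtraction."""
    return [x - y for x, y in zip(a, b)]


# Toy vectors in the form [gender, royalty].
vectors = {
    "king": [1, 1],   # male, royal
    "queen": [2, 1],  # female, royal
    "man": [1, 2],    # male, common (assumed)
    "woman": [2, 2],  # female, common (assumed)
}

result = add(subtract(vectors["king"], vectors["man"]), vectors["woman"])
print(result)                      # [2, 1]
print(result == vectors["queen"])  # True

# "What man is to king, woman is to queen": the two differences match.
print(subtract(vectors["king"], vectors["man"]))     # [0, -1]
print(subtract(vectors["queen"], vectors["woman"]))  # [0, -1]
```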

So these vectors are remarkably good at capturing the richness of words. What's more, even phrases or sentences can have their own vectors.

How are these vectors determined? They are calculated by the machine learning algorithm itself while the LLM is being trained. They are a by-product of the model training.

OpenAI's text-embedding-ada-002 model, for example, produces vectors of length 1536. This means the word "Laughing" is represented by a sequence of 1536 numbers, illustratively something like [1.34, 2.13, 3.56 ..... (1536 numbers) ... 4.56].

Imagine the richness of each word that it is able to capture in 1536 attributes!

Back to RAG

So if we take each of our chunks and get its embedding vector from OpenAI, we can save it in a database, typically a "vector database".

Now when we want to ask a question, we again get the embedding vector for it from OpenAI. But this time, we go to the vector database and ask: which chunks have vectors similar to the vector of this question? Because vectors behave mathematically much like numbers, the vector database can easily find the vectors "close" to a given vector.
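One common way to measure how "close" two vectors are is cosine similarity, which is what the sketch below uses; the article does not say which measure a given vector database applies, so treat this choice as an assumption. The three-number vectors are made up for illustration (real embeddings are far longer), and the list named store merely stands in for a vector database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


# A tiny in-memory stand-in for a vector database: (chunk text, made-up vector).
store = [
    ("Chunk about hurricanes", [0.9, 0.1, 0.0]),
    ("Chunk about banking", [0.0, 0.2, 0.9]),
    ("Chunk about storms at sea", [0.8, 0.3, 0.1]),
]

query_vector = [1.0, 0.2, 0.0]  # pretend this is the embedding of the question
print(top_k(query_vector, store, k=2))
# ['Chunk about hurricanes', 'Chunk about storms at sea']
```

A real vector database does the same kind of ranking, but with indexing tricks so it stays fast across many thousands of chunks.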

The database returns a list of vectors, each linked to the chunk it came from. Now that we have the chunks that are "relevant to the question", we can use our previously created prompt and voila! We have broken out of the limitations of the token length.
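Putting the steps together, here is a toy end-to-end sketch. The function toy_embed is a hypothetical stand-in for a real embeddings API: it just counts words, so it captures word overlap rather than meaning, and a real system would call an embedding model instead. The other names (similarity, rag_prompt) are likewise illustrative.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embeddings API: lowercase word counts."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rag_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Embed the chunks, pick the k most similar to the question, build the prompt."""
    index = [(chunk, toy_embed(chunk)) for chunk in chunks]  # our "vector database"
    question_vector = toy_embed(question)
    ranked = sorted(index, key=lambda item: similarity(question_vector, item[1]), reverse=True)
    relevant = "\n".join(chunk for chunk, _ in ranked[:k])
    return (
        "You are a helpful assistant who is tasked with reading text and answering questions.\n"
        f"Here is the text:\n{relevant}\n\n"
        f"Here is the question I would like to ask:\n{question}"
    )


chunks = [
    "The cat was taken to the vet on Monday.",
    "The bank raised its interest rates.",
    "The vet said the cat is healthy.",
]
prompt = rag_prompt("What did the vet say about the cat?", chunks, k=2)
print(prompt)
```

Only the two cat-related chunks end up in the prompt; the chunk about the bank is left out, which is exactly the saving in tokens that RAG is after.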

I hope this article cleared up your understanding of RAG. Thanks for reading!

Core Maitri is an enterprise software consultancy specializing in Excel-to-Web, AI Integration, and Enterprise Application Development services. Our approach is deeply consultative, rooted in understanding problems at their core and then validating our solutions through iterative feedback.