
Fundamentals of Retrieval Augmented Generation


Formal Metadata

Title
Fundamentals of Retrieval Augmented Generation
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Retrieval Augmented Generation (RAG) has emerged in recent years as a popular technique at the crossroads of Information Retrieval and Natural Language Generation. It represents a promising new approach that combines the strengths of both retrieval-based systems and generative AI models, aiming to address the limitations of each, while enhancing their overall performance on document intelligence tasks. This talk will introduce the key frameworks, methodologies and advancements in RAG, exploring its ability to empower Large Language Models with a deeper comprehension of context, by leveraging pre-existing knowledge from external corpora. We will review the theoretical foundations, practical applications, and technical challenges associated with RAG, showcasing its potential to impact various fields, such as document summarization or database management. Through this talk, attendees will gain insights into the most relevant topics related to RAG, including token embedding, vector indexing and semantic similarity search.
Transcript: English (auto-generated)
Okay, thank you very much for the introduction. So in the last two years, especially since the release of ChatGPT, we have seen language models become increasingly capable on a variety of NLP tasks.
For example, they can generate coherent text, which is almost indistinguishable from that of a human. They can accurately translate from one foreign language to another, and they can even write code in most of the major programming languages. However, despite all these impressive abilities, we also know that LLMs suffer from some important
limitations. The most well-known issue is that of hallucinations, which is when a model gives an answer which is totally incorrect, but in a very confident and convincing way. Another common problem is that of outdated or incomplete knowledge.
This happens because LLMs were trained on data that was collected up to a certain point in the past, and so they don't have any knowledge about more recent events that happened beyond that point. We should also mention their inability to perform actions in the sense that a language
model by itself cannot directly access information stored in an external database. And of course, we know that all LLMs have a maximum token limit or context window, which means that it wouldn't be possible to give an entire archive of documents as input to
a language model because we would go above the token limit. Now, these are inherent issues which cannot be solved by just increasing the size of the model with more parameters or scaling the computation power with more GPUs. Instead, there is a technique that was developed a few years ago by researchers at Meta, which addresses these challenges.
It's called Retrieval Augmented Generation, or RAG for short, and it works in the following way. Let's say we have a language model and a database with documents which are either private or very recent. So in any case, these documents were not used during the training of the LLM, and it's
not possible to directly ask the model any questions about their content. The first step in the RAG process is to query the database using a type of algorithm called semantic search. This will look for fragments of text which are semantically similar to the user question
and return the top results. The motivation for doing this is that the documents which are the most similar to the question are usually the ones that also contain the answer. This is what we mean by retrieval.
There is a common misconception that the LLM is somehow involved in the retrieval process, but this is actually wrong. The retrieval can be done entirely without the help of the LLM. Then the next step is to append these retrieved documents to the initial question and create a larger prompt which contains both the question and the potential answer.
This is called augmentation, and it's basically a form of prompt engineering. And finally, we give this enhanced prompt to the LLM, which will generate an answer based on the retrieved information and not on its original training data.
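As a minimal sketch in Python, with `vector_store`, its `search()` method, and `llm()` as hypothetical placeholders rather than any specific library API, the three steps could look like this:

```python
def rag_answer(question: str, vector_store, llm, k: int = 4) -> str:
    # Retrieval: semantic search returns the k chunks most similar to the question.
    chunks = vector_store.search(question, k=k)
    # Augmentation: put the retrieved chunks and the question into a single prompt.
    prompt = "\n\n".join(chunks) + f"\n\nQuestion: {question}\nAnswer:"
    # Generation: the LLM answers from the retrieved context, not from its training data.
    return llm(prompt)
```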
This is essentially how RAG allows an LLM to answer questions about topics which it hasn't seen during training. In addition to the answer, we could also provide the retrieved documents as a reference list so that the user can check how well the generated answer matches with the original
sources. Out of all these steps, the retrieval is the most important and difficult one, and in order for the retrieval to be successful, there are a few data preprocessing steps that have to be done in advance.
Specifically, we have to split every document from the database into smaller fragments of text, which are called chunks. This is necessary because we want to be sure that in the end, the LLM will receive only the most relevant pieces of information, and not entire documents, which might also
contain redundant information or might be too large to fit in the context window. Ideally, every chunk should contain information that is semantically related, and there are different strategies for how best to split a document. For example, we could split it into chunks of fixed equal size, or we could do the
splitting based on a list of special characters. After splitting, we use an embedding model to convert every chunk of text into a vector of real numbers. We do this because the algorithm for semantic search is a mathematical operation which has to be performed on numerical vectors. It cannot be applied directly on plain text. The vectors are then stored and indexed into a special type of database which is optimized for handling high-dimensional vectors, and when the user asks a question, the question
doesn't have to be split into chunks. It can be directly embedded and compared with the rest of the vectors. Now, in order for these vector embeddings to be useful, they have to satisfy a few basic properties. The most important one is that the geometric relationship between any two vectors should
reflect the semantic relationship between the corresponding chunks of text. So this means that chunks that are semantically similar should be embedded into vectors that are located next to each other, and on the other hand, if two vectors are far away,
then they should correspond to chunks which are not related. For example, in this image, we can see that chunks which are referring to the Python snake are grouped next to each other in one part of the vector space, and the chunks which are related to the Python programming language are clustered in a separate region.
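To make the preprocessing side concrete, here is a minimal sketch of splitting documents into fixed-size chunks, embedding them, and keeping the vectors together with the chunk texts. The `embed()` function is a hypothetical placeholder for whatever embedding model is used, and the chunk size and overlap are arbitrary example values:

```python
import numpy as np

def split_fixed_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size splitting: consecutive windows of characters with a small overlap,
    # so that a sentence cut at a boundary still appears whole in one of the chunks.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def build_index(documents: list[str], embed) -> tuple[np.ndarray, list[str]]:
    # Split every document, embed every chunk, and keep the vectors aligned with
    # the chunk texts so that retrieved vectors can be mapped back to their text.
    chunks = [chunk for doc in documents for chunk in split_fixed_size(doc)]
    vectors = np.array([embed(chunk) for chunk in chunks])  # shape: (n_chunks, dim)
    return vectors, chunks
```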
And if the user asks a question about one of these two topics, the question is embedded into a vector which will be located in the corresponding cluster, hopefully somewhere close to the answer. So we need to have a way of finding out which vectors from the database are the
nearest neighbors to the query vector. This is not an easy task because in reality, the vector space is not two-dimensional like in this picture. Depending on which embedding model we're using, the vectors could have hundreds or even thousands
of dimensions, so it's not possible to identify the nearest neighbor by direct visualization. Instead, we have to use a metric that can measure distance between vectors in high-dimensional space, and there are several options for doing this.
Let's say we have two vectors, A and B. One metric that can be used to measure vector similarity is the Euclidean distance, which is given by the straight line distance between A and B. This is an intuitive metric because it depends on the distance between the corresponding coordinates.
If the coordinates are close to each other on all the axes, then the vectors are also close and vice versa. Another metric is the dot product, which is given by the product between the lengths of the vectors and the cosine of the angle between them. This is a metric that depends on the orientation of the vectors in the sense that the dot
product of two vectors will be larger if the vectors are pointing in the same direction and it will be smaller if they're pointing in opposite directions. And probably the most popular similarity metric is the cosine similarity, which is
just the cosine of the angle between the vectors. It depends only on the angle between the vectors and not on their lengths. And in particular, if the vectors are normalized, that is, if they have length equal to one, then the dot product is the same thing as the cosine similarity.
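In NumPy, the three metrics can be written directly from their definitions:

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance between the two points (smaller means more similar).
    return float(np.linalg.norm(a - b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # |a| * |b| * cos(angle): depends on both the lengths and the angle.
    return float(np.dot(a, b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle only; equal to the dot product when both vectors have length 1.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```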
Now, the choice of the similarity metric is quite important because it can have a direct influence over the final results of the semantic search. Consider the following example. We have three vectors, A, B, and C, and we would like to know which one is closer
to A. Is it B or C? Well, the answer depends on which metric we're using to measure the distance. If we use Euclidean distance, we see that the distance between A and B is the same as the distance between A and C. So both B and C are equally close to A. But at the
same time, we see that the angle between A and C is much smaller than the angle between A and B. So if we use cosine similarity, then A is closer to C than it is to B. And there are also cases where it can happen the other way around. So in this example, we see that the angle between A and B is equal to the angle between
A and C. So the vectors A and B have the same cosine similarity as the vectors A and C. But on the other hand, we notice that the Euclidean distance between A and B is much smaller than the Euclidean distance between A and C.
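Both situations are easy to reproduce numerically; the vectors below are made up just for this illustration:

```python
import numpy as np

cos = lambda u, v: float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# First situation: equal Euclidean distances, but C is closer to A by cosine similarity.
A, B, C = np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])
print(np.linalg.norm(A - B), np.linalg.norm(A - C))  # 1.0   1.0    (a tie)
print(cos(A, B), cos(A, C))                          # ~0.71 1.0    (C is closer)

# Second situation: equal cosine similarities, but B is closer to A by Euclidean distance.
B, C = np.array([1.0, 1.0]), np.array([10.0, 10.0])
print(cos(A, B), cos(A, C))                          # ~0.71 ~0.71  (a tie)
print(np.linalg.norm(A - B), np.linalg.norm(A - C))  # 1.0   ~13.5  (B is closer)
```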
So the main point is that in both examples, if we're using one metric, then the vectors are equally close to A, while if we switch to the other one, then one becomes closer than the other. That's why we say that the concept of the nearest neighbor can
change depending on our choice of the similarity metric. After we've selected a similarity metric, we can use it to identify which vectors from the database are the nearest neighbors of the query. The easiest approach would be the following.
Compute all the distances between the query vector and all the other elements of the database, sort them in order of increasing distance, and then just return the vectors that have the smallest distance. This approach is called k-nearest neighbors, or KNN for short.
It's a brute force approach because we are iterating over all the elements of the vector space. It has the advantage that it's always guaranteed to be 100% accurate, so we can use it in cases where perfect accuracy is a mandatory requirement. But it also has the disadvantage that it's very inefficient in terms of time
performance. This is because the runtime complexity of this algorithm grows linearly with the number of vectors. So, if we have a very large database with millions of vectors, then the KNN would not be a good choice because it would be too slow and it would take too long.
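A brute-force KNN search over an array of stored vectors is only a few lines of NumPy:

```python
import numpy as np

def knn_search(query: np.ndarray, vectors: np.ndarray, k: int = 4) -> np.ndarray:
    # Compute the distance from the query to every stored vector: O(n) work per query.
    distances = np.linalg.norm(vectors - query, axis=1)
    # Return the indices of the k smallest distances, nearest first.
    return np.argsort(distances)[:k]
```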
Instead, in such cases, we have to make a compromise and use a different kind of algorithm called approximate nearest neighbors, or ANN. These are algorithms which are much faster but less precise, in the sense that they will return vectors from the neighborhood of the query vector, but they're not guaranteed to be the absolute closest. They're just close enough. One example of an ANN algorithm is product quantization.
Let's say these are the initial vectors from the database. The first step is to partition these vectors into sub-vectors, and each partition will be considered separately from the other ones.
Then, on each partition, we apply the K-means algorithm and we group the sub-vectors into various clusters. Every cluster will have a centroid, and these centroids can be considered as approximations of the vectors from that cluster.
And then, when we have a query vector, this one is also partitioned into sub-vectors. Each sub-vector is assigned to one of the clusters from that partition. Specifically, it's assigned to the cluster that has the nearest centroid.
The idea is that instead of comparing the query vector with all the initial vectors, we're comparing it with the cluster centroids, which are fewer in number. Then, going back to the initial vectors, those whose sub-vectors fall in the same clusters as the query sub-vectors are considered to be the nearest
neighbors and are returned by the algorithm. This is how product quantization works. There is a trade-off between speed and precision, determined by the number of clusters, because as we increase the number of clusters, the
precision will improve, but it will also take longer because we are iterating over all the clusters. And of course, there are also other ANN algorithms which are even more efficient. This was just one example.
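A rough sketch of this idea with scikit-learn's KMeans is shown below. It is a simplified illustration, not a tuned implementation: it assumes the vector dimension is divisible by the number of partitions, the number of clusters per partition is an arbitrary example value, and the candidates are ranked by summing the per-partition distances from the query sub-vectors to the assigned centroids, which is one common way to realize the comparison described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors: np.ndarray, n_subvectors: int = 4, n_clusters: int = 16):
    # Split every vector into equal-sized sub-vectors and run K-means on each partition.
    parts = np.split(vectors, n_subvectors, axis=1)
    codebooks = [KMeans(n_clusters=n_clusters, n_init=10).fit(p) for p in parts]
    # Encode each database vector as the id of its nearest centroid in every partition.
    codes = np.stack([kb.predict(p) for kb, p in zip(codebooks, parts)], axis=1)
    return codebooks, codes

def pq_search(query: np.ndarray, codebooks, codes: np.ndarray, k: int = 4) -> np.ndarray:
    # For each partition, precompute the distances from the query sub-vector to all
    # centroids, then approximate every database distance by summing table lookups.
    sub_queries = np.split(query, len(codebooks))
    tables = [np.linalg.norm(kb.cluster_centers_ - q, axis=1) ** 2
              for kb, q in zip(codebooks, sub_queries)]
    approx = sum(tables[j][codes[:, j]] for j in range(len(codebooks)))
    return np.argsort(approx)[:k]
```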
Then, after we've finished the retrieval, we will have a list of vectors, and we know that these vectors are associated with chunks of text that are similar to the user question. Now, we would like to have a prompt that somehow combines the question with
these chunks of text that were retrieved. Just like in the case of retrieval, there are different strategies for how to do the augmentation. The simplest approach is called the stuffing strategy; this is the name used in LangChain.
So, in this case, we just concatenate all the retrieved chunks, we give them to the LLM, together with the question, and the following instruction. Use the following pieces of context to answer the question at the end. If you don't know the answer, just say you don't know. Don't try to make up an answer.
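A minimal sketch of building that stuffed prompt, as a plain helper function rather than any particular library's prompt template:

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    # Concatenate all retrieved chunks and wrap them in the instruction quoted above.
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know. "
        "Don't try to make up an answer.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```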
It's a simple approach, but for many use cases, this can be good enough. And the stuffing strategy is actually the default used in LangChain. Another augmentation strategy, which is a little bit more complex, is
the map reduce strategy. So, in this one, again, we start with chunks of text which have been retrieved. But this time, instead of combining them, we give them to the LLM one by one, together with the question and the following instruction.
Use the following portion of a long document to see if any of the text is relevant to answer the question. Return any relevant text verbatim. So, here we are instructing the LLM to reduce the size of the chunks by keeping only the most relevant pieces of information and discarding anything else which is not related to the question.
And then, after we get these reduced chunks, we just repeat the steps from the stuffing strategy. We concatenate them, and we give them to the LLM, and we ask it to use them for answering the initial question.
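A minimal sketch of the map-reduce idea, again with `llm()` as a hypothetical completion function and reusing the `build_stuff_prompt()` helper sketched earlier:

```python
def map_reduce_answer(question: str, chunks: list[str], llm) -> str:
    # Map step: ask the LLM to keep only the text in each chunk that is relevant.
    map_prompt = ("Use the following portion of a long document to see if any of the "
                  "text is relevant to answer the question. Return any relevant text "
                  "verbatim.\n\n{chunk}\n\nQuestion: {question}")
    reduced = [llm(map_prompt.format(chunk=chunk, question=question)) for chunk in chunks]
    # Reduce step: stuff the shortened chunks into one final prompt and answer from it.
    return llm(build_stuff_prompt(question, reduced))
```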
This is a strategy that can be useful if the language model has a small token limit or if the chunks are too large and don't fit in the context window. It also has the downside that in this case we are making multiple calls to the LLM API, so the usage cost might be higher, and the waiting time for the response might be a little bit longer. Okay, then after we've selected all the different components
of the RAG system, the similarity metric, the semantic search algorithm, the augmentation strategy, we would like to have a way to evaluate its performance. The evaluation could be done manually by a human comparing
the questions with the generated answer and the retrieved documents, but if we want to do this efficiently at scale, again, we will have to use an LLM. The LLM used for evaluation doesn't necessarily have to be the same one as the one used for the RAG system, but it
should be at least as advanced. So, for example, we could use GPT-3.5 to perform the RAG and GPT-4 to evaluate it, but not the other way around. So, we give the initial documents to this LLM, and we
ask it to generate a test set of question and answer pairs. In this case, the answers can be considered to be correct, because the LLM isn't searching for anything. It receives the information as input, and it just has to
convert it into question and answer pairs, which is something an LLM can easily do. Then the questions are given to the RAG system one by one, and for each question, as we know, the retrieval will provide
a few relevant documents from the database, and the LLM will use them to generate an answer. And now that we have these four items, the question, the retrieved documents, the generated answer, and the reference answer, we can give them to the LLM, and the
LLM can use them to compute various metrics for the performance of the RAG system. Here, I'm going to focus on two kinds of metrics, metrics for retrieval and metrics for generation.
So, on the retrieval side, we have two metrics, context relevance and context recall. The context relevance metric measures if the information from the retrieved documents is addressing the user question, and context recall evaluates how good the match is between the retrieved documents and the reference answer. On the generation side, we have three metrics, answer relevance, faithfulness, and answer accuracy. Answer relevance measures if the generated answer is
providing the information that was requested in the question. Faithfulness tells us if the generated answer is based on the information that was retrieved and doesn't contain any kind of made-up information.
So, this is the metric that can be used to check if the model is hallucinating. And answer accuracy evaluates if the generated answer is consistent with the reference answer, which is considered the ground truth, as I mentioned.
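A rough sketch of the shape of this evaluation loop, with `rag_system()` and `judge_llm()` as hypothetical placeholders (the judge is assumed to return a numeric score as text; specialized libraries implement these metrics much more carefully):

```python
def evaluate_rag(test_set, rag_system, judge_llm) -> dict:
    # test_set: list of (question, reference_answer) pairs generated from the documents.
    scores = {"faithfulness": [], "answer_accuracy": []}
    for question, reference in test_set:
        # rag_system is assumed to return the concatenated retrieved context and the answer.
        context, answer = rag_system(question)
        scores["faithfulness"].append(float(judge_llm(
            "Score from 0 to 1 how well the answer is supported by the context.\n"
            f"Context:\n{context}\n\nAnswer:\n{answer}\n\nScore:")))
        scores["answer_accuracy"].append(float(judge_llm(
            "Score from 0 to 1 how consistent the answer is with the reference answer.\n"
            f"Reference:\n{reference}\n\nAnswer:\n{answer}\n\nScore:")))
    return {name: sum(values) / len(values) for name, values in scores.items()}
```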
Okay, so to conclude, let's review the ways in which RAG is addressing the limitations of LLMs that I mentioned at the beginning on the first slide. So, the problem of outdated or incomplete knowledge is solved
by storing the new information in a database and retrieving only what is necessary or relevant through semantic search. The inability to perform actions is addressed by the fact that the retrieval is done by the vector database, and the top results are just given to the LLM inside the prompt.
The limited context window is solved by splitting the documents into smaller chunks. And in case the chunks are still too big, we can also use an augmentation strategy like MapReduce to shrink the prompt even more.
And the risk of hallucination is reduced by explicitly requesting the model to admit that it doesn't know the answer when that is the case. Okay, this was all. I'm available if there are any questions, and thank you
for your attention. Thank you, Kathleen. Thank you for the great presentation. So, yes, please go to the first mic to ask your questions. Yeah, hi. Thank you for this very interesting overview.
In your opinion, how mature is the tooling for using RAG, and do you have any suggestions of which tools are now the state of the art and should be used, or is it just a case of, well, to each their own: you gave the theory, now go implement it? So are you familiar, for example, with frameworks like
LangChain or LlamaIndex? Somewhat. Okay, because a lot of the things which are relevant for retrieval for RAG have already been implemented in LangChain, for example. So you can implement a simple RAG system with only a few lines of code in LangChain.
So I would recommend either LangChain or LlamaIndex, which is also pretty good. Okay, thank you. You're welcome. Thank you. My question is, is the MapReduce strategy the same as agentic RAG, and if not, what's the difference? And also, in the MapReduce strategy, you have two LLMs.
Would you suggest that you use the same LLM at both stages or different LLMs? To start with the second question: in this case, again, this is an easy task.
So you can use the same LLM as you're using in the RAG system. It doesn't necessarily have to be a different one. It can be the same one. And so, can you repeat the first part of the question? Is this what is referred to as agentic RAG systems,
as in using the LLMs as agents? No, I'm not sure exactly what you mean by an agentic RAG system. So it's a relatively new term that's thrown around, and the idea is that different LLMs are doing different things in the system before it goes to the final LLM that generates the answer. No, I would say that is something much more general, much more complex. Here you're just using the same LLM with a few extra steps before generating the final answer. So I mean, this is just an augmentation strategy, so it's much more basic than agentic RAG. Hey, thanks for the great talk. I have a question regarding KNN or the approximate nearest neighbors.
I'm curious about the number of relevant chunks and how that's determined. I know that by default it's up to the scientist or the engineer to determine the best K. So is it a manual step here, where you need to play with it
and find a K, or is there some, let's say, process on top of it that determines it for you? Well, for example, the number of chunks and also the size of the chunks are related to each other,
because I mean, remember that the LLM has a context window, so that is giving you an upper limit of how many chunks you can give it inside the prompt. So that's like an upper limit.
And apart from that, I don't know. A typical number of chunks is between four and six, but I've also seen cases where you don't use only the chunks that are the most similar,
but you also use the chunks which are directly before and directly after the retrieved chunks, just to give it even more context and make sure you're not missing any relevant information. All right, thank you. You're welcome.
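A small sketch of that neighbor-expansion idea, assuming the chunks are kept in document order so that positions i-1 and i+1 are the chunks immediately before and after chunk i:

```python
def expand_with_neighbors(hit_indices: list[int], chunks: list[str], window: int = 1) -> list[str]:
    # For every retrieved chunk, also include the chunks directly before and after it.
    selected = set()
    for i in hit_indices:
        for j in range(i - window, i + window + 1):
            if 0 <= j < len(chunks):
                selected.add(j)
    # Return the selected chunks in their original document order.
    return [chunks[j] for j in sorted(selected)]
```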
Hello, thanks for the great talk. I was just wondering, you suggest using an LLM to do the evaluation of RAG, and I'm always a bit skeptical about such approaches, especially because if you do it at scale, you can't really manually check everything that the LLM has evaluated. So my question would be, did you check
if the judgment of the LLM is somehow correlated with the judgment of experts to evaluate the performance of a RAG system? So first of all, that's why I suggested that the LLM used for evaluation should be more advanced than the one used for the RAG system,
because that's already giving you some assurance that the evaluation is meaningful. But there are a few libraries which are specialized in this kind of evaluation task, like RAGAS, for example, which is short for RAG assessment.
And so there is a paper written by the creators of RAGAS, and yes, they claim that they've done this kind of evaluation, and that it does match with human evaluators.
I see. Thank you. You're welcome. Hi, thanks for the talk. I wanted to ask about something that may be a little out of the scope of your talk, but it would be interesting to have. In a RAG system, to have it be able to cite its sources, do you know of any systems where the chunks of data are tagged
so that the final answer can refer back to the original source of the information? So you want to have, inside the answer, numbers referring to the sources? A link to a web page that the original information came from,
or something like that? So when you give the answer to the user, you can definitely give it not just the generated answer, but also the sources, the entire list of documents that were retrieved.
And for example, you could maybe change the prompt a little bit and ask the language model to insert numbers after every claim it makes, so that it tells you which chunk each claim is coming from.
Or another approach would be to just make one more LLM call with the generated answer and these chunks, and again to ask the LLM to insert numbers telling you which part of the answer is coming from which chunk.
So if you know the chunks that you gave it, and you ask it to say, for example, that this was from chunk number three, then you can trace it back. Exactly, yeah, something like that. Sorry, we are out of time. Thank you so much for your questions. You can ask your question later. Thank you, Kathleen, for your presentation.