
Is RAG all you need? A look at the limits of retrieval augmented generation


Formal Metadata

Title
Is RAG all you need? A look at the limits of retrieval augmented generation
Number of Parts
131
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Retrieval-Augmented Generation (RAG) is a widely adopted technique to expand the knowledge of LLMs within a specific domain while mitigating hallucinations. However, it is not the silver bullet it is often claimed to be. A chatbot for developer documentation and one for medical advice may be based on the same architecture, but they have vastly different quality, transparency and consistency requirements. Getting RAG to work well on both can be far from trivial. In this talk we will first understand what RAG is, where it shines and why it works so well in these applications. Then we are going to see the most common failure modes and walk through a few of them to evaluate whether RAG is a suitable solution at all, how to improve the quality of the output, and when it's better to go for a different approach entirely.
Transcript: English (auto-generated)
Okay, so thank you everybody for showing up. This talk will be about RAG. So, is RAG really all you need? We're gonna have a look at the limits of retrieval augmented generation: what it is, what it is used for, what it is very good at, and also what it is not very good at, meaning where it fails, what can go wrong with it, and how to mitigate a little bit the problems that you might have with your RAG system. So let's get started. First of all, a quick outline.
First, we're gonna see what RAG is: what does RAG stand for, and how does it work, more or less. We're gonna see the failure modes of these sorts of systems, so in which cases it fails, why, and what you can do about it. We're gonna see the evaluation strategies that you can put in place in order to understand if your system is failing in this way. And then at the end, we're gonna have a little bit of an overview of what you can do on top of it. RAG itself, as we're gonna see, is pretty simple in its most basic form, so we're gonna see how you can improve on it and what you can do to go further than basic RAG. Okay.
So first of all, what is it? RAG stands for retrieval augmented generation. RAG is a technique to augment the LLM's knowledge beyond its training data by retrieving contextual information before generating an answer. These steps, really a retrieval step and a generation step as we're gonna see, are the core of the whole technique, and you're gonna see that it's much easier than a lot of other explanations make it seem.
First of all, RAG is a technique that is very suitable for question answering. LLMs can do many things, like summarization or translation, but RAG is a technique that focuses on answering questions that users might have. So for tasks that look like search systems or document search, anything in the question answering family, RAG is really good.
When users have a question, in a RAG system this question first goes to a component that is called a retriever, which is the retrieval step we've seen in the acronym. This retriever is basically any sort of software system that can do search. A lot of people think immediately, because we're talking about LLMs and the like, that a retriever must be a vector database, like Weaviate or Chroma or something like that. But in practice, any system that, given a question or even a keyword, can return some related documents, some related context, can be used as a retriever here. So for example Google, any search API that you might have, Elasticsearch, or even something much more of a corner case: imagine searching through Discord messages, or GitHub search. Whatever you might want to use to search for snippets of text in whatever corpus you have can work as a retriever.
After this step, so once you have somehow retrieved from your question some sort of context that you might want to use to answer it, the next step is to build a prompt for the LLM. Before generating, you have to take the information you retrieved at this stage and put it together with the question in a way that the LLM can understand.
RAG prompts are normally very easy. In their most basic form, they really read like this: "Read the text below and answer the question." This is already a valid RAG prompt, and as soon as you attach the context you retrieved before and the question the user has asked, it is ready to go to the LLM. After you have this prompt, you send it to the generator, which nowadays mostly means an LLM, because that's the system that can most reliably generate text from a prompt like this, and this prompt given to an LLM will normally result in an answer that can be given back to the user.
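To make these three steps concrete, here is a minimal sketch in Python. It is not the implementation shown in the talk: the search() function is a placeholder for whatever retriever you use (a vector database, Elasticsearch, a web search API, and so on), the model name is only an example, and it assumes the OpenAI Python client with an API key set in the environment.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def search(question: str) -> list[str]:
    """Placeholder retriever: return text snippets related to the question."""
    raise NotImplementedError

def rag_answer(question: str) -> str:
    # 1. Retrieval: fetch some context related to the question.
    snippets = search(question)
    # 2. Augmentation: put the context and the question together in a simple RAG prompt.
    prompt = (
        "Read the text below and answer the question.\n\n"
        + "\n\n".join(snippets)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    # 3. Generation: let the LLM produce an answer from the prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content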
Yeah, so RAG systems, at their very core, in their most basic essence, are something like this. They work in three steps: retrieving, building a prompt, and then generating an answer from this prompt. Why use RAG? When would you want to do this?
First of all, one of the most advertised benefits is that RAG reduces hallucinations. That's actually true, because when the LLM is given a RAG prompt, it doesn't have to know the answer to your question. It can just read the context that is in the prompt and basically rephrase it, which is a lot easier for the LLM to do, because LLMs are much better at manipulating text than at storing information. They're quite good at that too, but reading text, understanding it and rewriting it is a much easier task for them. It can also be handled by simpler and smaller LLMs. However, keep in mind that it does not remove hallucinations completely.
We're gonna see more about this, because a lot of people seem to imagine that RAG is a silver bullet: you implement RAG and you have no hallucinations whatsoever. This is not true. RAG reduces hallucinations, but it doesn't remove them, so you always have to pay attention to that. RAG is also great to let an LLM access fresh data.
For example, if you want to use LLMs to discuss the news, you cannot really do it with a model that doesn't have RAG implemented on top of it, because LLMs are trained on data that is necessarily older than the moment when you ask the question. Training an LLM is a very long process, and you also have to collect the data beforehand, so by the time you have collected the data, used it to train the LLM, and then asked your question, even the most recent data you added will be old. So the only real way to let the model reason about the news, or about very fluctuating information like the weather forecast or the stock market, is to give it access to recent data. This is something that RAG really shines at, because it can retrieve data at the time the question is asked and give that data to the LLM so it can reason properly. And last but not least, RAG can be seen as a way to increase the transparency of how the LLM responds to users. When the LLM responds in a specific way, maybe not the way you expected, it's easier to go and see why it replied that way by checking what the prompt contained.
If the prompt contained a source that indeed seems to contradict the reply you were expecting, then you know it was the retriever that found something that made the LLM reply this way. Of course, this is still subject to mistakes, but it normally helps a lot in understanding what's going on in the LLM and why it replied that way.
So let's have an example of two systems, one that doesn't use RAG and one that does, to see what the difference can be in practice. For a system that doesn't have RAG implemented, I have an example from ChatGPT, the really basic version, 3.5, which has no RAG implemented. For a RAG system I could have used a later GPT, but just for a change I decided to use Perplexity here, to see what a system that really implements RAG can do. If I ask ChatGPT 3.5 "Where does EuroPython 2024 take place?", ChatGPT very confidently tells you that it takes place in Dublin, because back when it was trained the most recent EuroPython had been in Dublin, so as far as it knows, of course that's where it is. Interestingly, ChatGPT doesn't even realize that it's inferring something wrong. It just assumes very confidently that this is the answer, because this is all it can take from its internal knowledge. That's really the best it can do. Instead, if I ask Perplexity where EuroPython 2024 takes place, first of all the answer is correct: EuroPython takes place in Prague, so Perplexity at least gives me a proper answer. Not only that, it also tells me where it found this information. You can see at the top there is a sources list, where you can see the EuroPython website, and you know that most likely this is correct, because it's the official EuroPython website. So RAG really reduces hallucinations, it increases the transparency, and it also makes the LLM able to access fresh data.
You can see here really clearly that all three of these take place. Again, some of you probably know that Perplexity AI is not infallible by any means either, but at least for these simple tasks the improvement is really strong. Okay, so now that we've talked about what RAG is good at doing,
let's see the ways in which it can fail. We've seen that RAG has two critical steps: first there is retrieval, and then there is generation. Each of these can fail in different ways. A retrieval failure is when the retriever component, whatever it is, fails to find the relevant context needed to answer the question. The retriever returns some data that doesn't really help answering the question, so the LLM doesn't know what to do with it and most likely returns a weird answer. A generation failure, instead, is when the retriever did find the correct context, so all the data is in the prompt, this prompt is given to the LLM, and the LLM still replies with something that is just not related to the question, or even a hallucination. So there are these two different types of failures, and they have to be evaluated separately.
This is an example of a successful query that you can do with RAG. Let's ask a question that ChatGPT normally doesn't know out of the box, for example: what was the official language of the Republic of Rose Island? I suppose most of you don't even know what the answer would be, so you cannot just double-check for yourself.
In the case of a successful RAG run, the retriever finds you some snippets from wherever it's looking, for example the internet. You can see in the snippets that it's very clear what the language is: all of them really clarify that the official language was Esperanto. Then the LLM reads this context and says, okay, the official language was Esperanto. This is how you expect the system to work in practice. A retrieval failure looks like this. Given the same question, the retriever finds some information about the topic, so it's not entirely wrong, but the information it retrieved doesn't really contain what you expected it to. For example, it tells you where this sea platform was: near the coast of Italy. So the LLM already starts thinking, okay, it's there, that might help inferring the language. Then the context tells it that the platform had problems with the Italian police. So okay, it was really close to Italy somehow, it had some relationship with it, and the snippets say nothing else. So the LLM takes its best guess here: it reads the context and concludes that most likely the official language must have been Italian, which we now know is not true. This is clearly a retrieval failure. The LLM is doing its best, but the retriever just confuses the LLM into providing a wrong answer. Why does retrieval fail? There are several reasons, also because there are several ways to do retrieval in the first place. One of the most common is that sometimes the relevant data just does not exist in the database. Maybe you were looking in a very restricted set of web pages: in this example English Wikipedia contains the information, but Wikipedia in other languages might not. So if your system was only looking at that sort of corpus, the data might just not be there, and then retrieval will fail simply because the data is not retrievable, because it doesn't exist.
So in that case, first of all you want to make sure that the data is there. In other cases, the retrieval algorithm might just be too naive. There are a lot of fairly naive retrieval algorithms, for example ones based on keyword search. For RAG applications you have to be a bit careful with them, because, for example, they are unable to retrieve context that contains synonyms. If the user poses a question using a synonym and you pass it through, say, TF-IDF, the ranking will not be what you expect, because the context that only contains the synonym will not be retrieved. That makes retrieval harder, and therefore more prone to these sorts of failures. There are more advanced methods, like doing a vector search, but those rely on embedding models for both the question and the context, and sometimes the embedding model might be too small, too naive, or simply not suitable for the data.
This happens especially if you have a very niche context, say medical or legal or something very complicated and scientific with a lot of unusual terminology; embedding models might just not be up to the task. So you might want to choose an embedding model that is suitable for the data you have, in case you're already using vector search and retrieval is still a weak spot.
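To make the synonym problem concrete, here is a small illustrative sketch contrasting keyword-based scoring (TF-IDF) with embedding-based similarity. The sentences and the model name are arbitrary examples; it assumes scikit-learn and sentence-transformers are installed.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

corpus = ["The sedan accelerates quickly and is easy to park."]
query = "Is the car fast?"

# Keyword-based: TF-IDF has no notion of synonyms, so "car" never matches "sedan".
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])
print("TF-IDF similarity:", cosine_similarity(query_vector, doc_vectors)[0][0])  # 0.0

# Embedding-based: a sentence embedding model can still match them by meaning.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
doc_embeddings = model.encode(corpus)
query_embedding = model.encode([query])
print("Embedding similarity:", cosine_similarity(query_embedding, doc_embeddings)[0][0])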
On top of that, the data might not be chunked properly. This is something that not everybody knows, but most retrieval techniques are not very good at handling huge chunks of text, or very tiny ones. Some people think that just having a list of PDFs, maybe huge PDFs, is enough for doing search, but most of the time you want to pre-process your data a little, just to chunk it into manageable pieces that each contain a small amount of information. This way, retrieving the right piece of information is much easier for most of the retrieval methods out there, and it's also easier for the LLM to parse, because the LLM has to read less text and there are fewer occasions to get confused. So in general, chunking the data into small bites is very helpful for retrievers; a rough sketch of this kind of pre-processing is shown below. And then, last but not least, the data and the questions might be in different languages, which throws off keyword search and is sometimes also challenging for vector embeddings. So just keep in mind that if you are handling questions and context in different languages, you really want to make sure that you can handle multilingual queries.
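A rough sketch of that kind of pre-processing, using a naive word-count approximation of tokens (a real pipeline would more likely split on document structure and use the embedder's own tokenizer):

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split a long text into overlapping chunks of roughly `chunk_size` words,
    so each chunk stays well within the embedding model's context length and
    carries one small 'bite' of information."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a long PDF extracted to plain text becomes a list of small,
# individually retrievable chunks.
# chunks = chunk_text(open("long_document.txt").read())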
All right, next. Oh, and one thing I almost forgot: sometimes there is a tricky situation in which retrieval is failing, but you don't see a failure at the end of the pipeline, because if you're using a very smart LLM, one with a large knowledge base, like a GPT or something like that, sometimes the LLM is smart enough that it sees the question, it sees the retrieved context, decides the context is bogus, and just invents something that is not in the context. So sometimes the LLM is smart enough to understand that there was a failure in between, and it hides the failure. If you only check your system from question to answer, without checking what's happening in the middle, you might be missing a lot of retrieval failures, because your LLM is too smart to let you notice they were there. So this is a bit tricky to spot, and it's something to keep in mind especially for evaluation. Yes, so moving on: there are also generation failures.
They go a bit like this. We have the same question as before: what is the official language of the Republic of Rose Island? The retrieval is spot on: it gives you all the snippets that contain this information. But somehow your LLM sees this context and this question and doesn't understand what the question or the context is really about. It reads the context without understanding it, and then it makes up an answer. These sorts of failures are the most puzzling, because even if you look at the prompt, the LLM is just not following it. So they can be really confusing, and they are also, to be honest, the most opaque to debug.
Why do they happen? Most often, it's because you chose a model that is too small and just cannot follow instructions. Sometimes people go for really tiny models that can barely follow instructions at all, so if you have a slightly more complicated prompt, the LLM just cannot follow it very well and is prone to hallucinating somehow. Sometimes the model is not that small; this is rare with really large models, but maybe the model is, say, eight billion parameters or so, and it's still too small to understand the domain, because your domain is so specialized and has such unusual terminology that the model doesn't understand the question and makes up something generic. For example, if you're asking about the side effects of a specific medicine or a specific condition, a model trained on Wikipedia is maybe not going to know most of these terms. So sometimes it's just prone to making stuff up because it doesn't understand what it's being asked to do. Another potential issue is that the RAG prompt is not built properly. Some small models especially are very sensitive to how the prompt is built: if you have typos, weird formatting or special characters, sometimes the LLM gets confused by those. So you have to check what's in your RAG prompt and maybe slightly clean your data to avoid this sort of confusion. And then, last but not least, again make sure that the model is multilingual, because if you ask a question in a language that the LLM doesn't understand, just like a human, it will not be able to answer you. So yeah, these are the most common causes.
And now that we've seen the failures we might come across, let's have a look at evaluation strategies: how we can evaluate a system like RAG in a way that makes sense and gives us the most insight. There are two main ways to evaluate a RAG system; you can do both, they are not exclusive. One is called isolated evaluation. In this case, you evaluate the retrieval and the generation separately: you take the two components in isolation and test them as if they were working on their own. This is useful to understand which one is weaker and which one is stronger. The other method, end-to-end evaluation, means evaluating the entire system from the question to the answer, as if the RAG application were a black box. That is also very useful, and it can spot issues that you might not be able to spot in isolated evaluation, because it can catch integration issues: for example, issues with the RAG prompt, or with the way you pass the context to the LLM. So both are useful, in different ways. Especially when you're starting out, you might want to do a little bit of both, and then you can choose which one is most important for the type of application you're making. For isolated evaluation, let's see how to evaluate the retrieval step. This is honestly a huge topic, because there are a lot of retrieval methods, all of them very different, and most of them have their own evaluation techniques.
What I can say here is that there are two main categories of retrieval methods: keyword-based and vector-based. Keyword-based algorithms are a bit less powerful, because they have problems with multilingual corpora and, most of the time, with synonyms, but they are much easier to evaluate, because you can evaluate them with traditional techniques like measuring recall, precision and F1. Those are pretty standard, it's a much better explored area, so this is easier to do and there are many more tools available for this sort of evaluation. On the flip side, the search methods of vector databases, based on embedding similarity, are a bit newer, and evaluating them is much harder, because the context and the question might have nearly nothing in common except for the meaning connecting the two.
So measuring whether the context you retrieve through vector search is relevant or not, and how relevant it is, is not very easy. Some people do this with other algorithms, which is expensive, so it really depends on what you're doing and how much of it you're doing. Or they use metrics like semantic answer similarity, which we'll see how to compute in a second. But the evaluation itself is a bit less clear-cut, a bit less meaningful, because understanding how relevant the context is to a question is also somewhat subjective, so it's really hard to assess in a truly objective way.
Evaluating generation is even harder, because the output of an LLM is really hard to evaluate. A simple way of doing it is to use another LLM to evaluate your main LLM: especially if you plan to run RAG with a small model, you can evaluate it by sending its answers to a larger model. This is something a lot of people do, because it's fundamentally very easy and it provides quite decent results. Another benefit of this method is that responses can be evaluated according to different criteria, so you can choose the criteria you care about: for example completeness, conciseness, relevance. You can define many metrics and then ask another LLM to evaluate against them. The criteria you want to evaluate generation on really depend on what your RAG application is for. For example, if it's a scientific application, you might want to focus especially on factual accuracy; for customer support, conversation quality is also very important, because you want people to feel helped; for a personal assistant, you want to be a bit more concise. So you can also evaluate criteria that would normally be very hard to assess with just a numeric metric like F1.
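A minimal LLM-as-a-judge sketch along these lines. The criteria, the grading prompt and the model name are only examples of what such a judge could look like; it assumes the OpenAI Python client with an API key set.

from openai import OpenAI

client = OpenAI()

def judge(question: str, context: str, answer: str) -> str:
    # Ask a (larger) model to grade the RAG answer against a few example criteria.
    prompt = (
        "You are evaluating the answer of a question-answering system.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n\n"
        "Rate the answer from 1 to 5 on each of these criteria: factual accuracy "
        "(is it supported by the context?), relevance (does it address the "
        "question?), and conciseness. Reply as JSON."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # example: a larger model than the one being evaluated
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content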
Next, end-to-end evaluation. This is honestly the hardest in general, because whether a question and an answer actually match often depends on the person. Sometimes it's easy to measure: if I ask "What's the capital of France?", the answer has to contain Paris somehow. But if I'm asking something open-ended, it's really hard to assess. So one of the techniques being used here is measuring semantic similarity, which is nothing more than the cosine similarity between the answer and the ground truth you were expecting for that question. You have a dataset with questions and the reasonable answers you expect, and then you measure the cosine similarity. This is not foolproof, because there might be different valid ways to answer a question and cosine similarity may not be perfect in those cases, so you have to take it with a pinch of salt, but right now it is one of the most popular techniques, and it's a way to measure the performance of your RAG pipeline. Some libraries also use a weighted average between the semantic answer similarity and an F1 score, because some questions do need to contain keywords: as I said, if you're asking what the capital of France is, you can measure with F1-style metrics whether Paris is present, and that's already a bit better than semantic similarity alone.
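A sketch of how such a combined score could be computed, assuming sentence-transformers for the embeddings and a simple token-level F1; the 0.7/0.3 weighting is an arbitrary example.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def token_f1(prediction: str, ground_truth: str) -> float:
    # Crude keyword overlap between the generated and the expected answer.
    pred_tokens = set(prediction.lower().split())
    truth_tokens = set(ground_truth.lower().split())
    common = pred_tokens & truth_tokens
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def answer_score(prediction: str, ground_truth: str, w_sas: float = 0.7) -> float:
    # Semantic answer similarity: cosine similarity of the two embeddings.
    embeddings = model.encode([prediction, ground_truth])
    sas = float(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
    # Weighted average of semantic similarity and token-level F1.
    return w_sas * sas + (1 - w_sas) * token_f1(prediction, ground_truth)

print(answer_score("The capital of France is Paris.", "Paris is France's capital."))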
So you might also make a weighted average, depending on whether your application is supposed to answer in a specific way. On top of that, to add even more generated material to this evaluation pile, sometimes you can use synthetic evaluation datasets. You don't even need to build the dataset of questions and answers to evaluate against yourself: some libraries, like Ragas for example, can create the synthetic evaluation dataset for you. So you just generate the dataset and then evaluate against it. Again, each of these steps needs to be taken with a pinch of salt: you should check the generated data before trusting the scores. But this is something that can save you a lot of time and money, depending on how often you have to run these sorts of evaluations. Next, for the example that I want to show you real quick, I'm using a framework called Haystack.
Haystack is an open source Python framework that can be used to build RAG applications. The good part about it, compared with other LLM frameworks, is that you can use it to build your RAG application and evaluate it at the same time: several evaluation libraries are already integrated, and you can also integrate your own fairly easily.
So it's a pretty nice framework. It's much simpler to use than other popular LLM frameworks out there, it's really lightweight, it's pretty cool. If you haven't heard of it, I recommend you check it out. In the QR code here you can see their tutorial about how to evaluate pipelines, basically RAG pipeline evaluation.
If you're curious about how they do it, just have a look there and check it out. For the example, though, I'm not using any of the frameworks they integrate natively; I'm using another one called continuous-eval. We're gonna see how to use it in a second.
First of all, this is how I build a Haystack pipeline. It might look like a lot of code, but in practice there are three steps. First, you create all your components. In this case, you create a text embedder and a retriever, which work together to embed the question and then retrieve related documents from a corpus that you have put in your document store. Then you have a prompt builder, which really just does string interpolation to build the RAG prompt we've seen before. And then you have a generator, which in this case is GPT-3.5 Turbo. Pretty easy: these are the components of the pipeline. Then you create a pipeline object and build the pipeline with two sets of calls. First you add the components to the pipeline; at the moment you add them, they are not connected to each other. You connect them in the last step with connect calls, where you basically tell the text embedder to send its output to the retriever, the retriever to send its output to the prompt builder, and the prompt builder to send its output to the LLM.
You build pipelines this way, which is maybe a bit verbose, but at least it's very declarative: you can see what's going on, and it's pretty simple. Haystack actually supports very complicated pipelines, so you can make branches and loops, but for RAG you mostly don't need that. So this is a simple RAG pipeline that you can build with it; a rough sketch of what such a pipeline looks like is below.
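As a rough sketch of what such a pipeline looks like (not the exact code shown in the talk), assuming Haystack 2.x, an in-memory document store that has already been filled with embedded documents, and an OpenAI API key in the environment:

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Assumed to be populated beforehand with documents embedded by a matching
# document embedder (for example in a separate indexing pipeline).
document_store = InMemoryDocumentStore()

template = """
Read the text below and answer the question.
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:
"""

# 1. Create the components.
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
prompt_builder = PromptBuilder(template=template)
llm = OpenAIGenerator(model="gpt-3.5-turbo")

# 2. Add them to a pipeline, then 3. connect their outputs to each other's inputs.
pipeline = Pipeline()
pipeline.add_component("text_embedder", text_embedder)
pipeline.add_component("retriever", retriever)
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("llm", llm)
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "llm.prompt")

question = "What was the official language of the Republic of Rose Island?"
result = pipeline.run({
    "text_embedder": {"text": question},
    "prompt_builder": {"question": question},
})
print(result["llm"]["replies"][0])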
From this pipeline, how do you evaluate it? Here is how you evaluate it with a library called continuous-eval. This is built by Relari AI, and it's a framework that allows you to evaluate pipelines both in isolation and end to end, so you have several ways to evaluate the pipeline.
You can see here what's happening. First of all, you decide which outputs you want to measure. For example, you want to measure the LLM answer and the retrieved context, because we're going to do isolated evaluation, so you basically select those. Then you create a pipeline evaluator object and add metrics to it. For the LLM, for example, you can add several metrics. In this case there are deterministic metrics, something checking for F1, for example whether a keyword is present in the answer or not. There are metrics that are LLM-based, for example this faithfulness one, which makes sure that the answer really is answering the question, so it's not talking about something merely related or going off on a tangent. And then you can have custom metrics like conciseness; it's not in the example shown, but you can see it defined in the code behind the QR code. For example, maybe I want to build a personal assistant, so I don't want GPT to talk for three hours every time I ask a question, so I make sure that conciseness is also measured. For the retrieval, you do basically the same. In this case, for example, we're measuring F1 and precision and recall, which are deterministic. Again, it really depends on what questions you expect, but there are several metrics for retrievers as well. This is just part of the example; see the full example at the link here. It's pretty easy to run, so just have a try and check it out.
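continuous-eval's exact API is best taken from the linked example; as a library-agnostic illustration of the deterministic retrieval metrics mentioned here (precision, recall and F1 over the retrieved documents), a small sketch:

def retrieval_metrics(retrieved_ids: list[str], relevant_ids: set[str]) -> dict:
    # Compare the IDs of the retrieved documents against the documents known
    # to be relevant for this question (the retrieval ground truth).
    hits = [doc_id for doc_id in retrieved_ids if doc_id in relevant_ids]
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: three documents retrieved, two of them actually relevant.
print(retrieval_metrics(["doc1", "doc7", "doc9"], {"doc1", "doc7", "doc4"}))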
Okay, so this is more or less the evaluation part. Once your RAG system works and has been evaluated properly, so you know how it's performing, you may want to improve on it. And there are several ways to do this.
Actually, there are too many ways to do it, so I've selected four that I found the most used and the most representative of the field right now. One of the easiest, let's say the lowest-hanging opportunity to improve your RAG system, is to have multiple retrievers, because sometimes your application needs to access data in very different formats. For example, you might need to search through long text PDFs, but at the same time also through semi-structured text, or maybe even through images. In this case it's very hard to find one retriever that works for all these types of data at the same time, so you might want to have specialized retrievers, let them all run, let them all retrieve the context they think is relevant from their specific dataset, and then use a re-ranker to take these few results and really understand which ones are best.
Re-rankers are normally much more accurate than retrievers, because they are designed to work with very little data and to be much more precise. Retrievers normally focus on speed and throughput at a small expense of precision; re-rankers do the opposite. They expect very little input data, maybe three times as much as they are expected to output, and then shrink it down a little to re-rank it precisely. So this is one of the most straightforward ways to improve your RAG system; a rough sketch of this retrieve-then-re-rank step follows.
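A rough sketch of this retrieve-then-re-rank pattern, assuming two placeholder retrievers and a cross-encoder re-ranker from sentence-transformers (the model name is just an example):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

def search_pdfs(query: str) -> list[str]:
    raise NotImplementedError  # placeholder: retriever specialised on PDF text

def search_tables(query: str) -> list[str]:
    raise NotImplementedError  # placeholder: retriever specialised on semi-structured data

def retrieve_and_rerank(query: str, top_k: int = 3) -> list[str]:
    # Let every specialised retriever propose candidates, then let the slower
    # but more accurate cross-encoder decide which ones are actually best.
    candidates = search_pdfs(query) + search_tables(query)
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]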
Another fairly straightforward one is self-correction. For example, at the end of the pipeline, before sending the answer to the user, you might want to first send it to another LLM and check whether that LLM also thinks the answer is correct. If the answer is not correct, you may want to send it back to the retriever and do another loop. Self-correction is designed to catch these sorts of hallucinations by double-checking. It's not foolproof either, but it can catch some errors: you are again reducing the probability of hallucinations a little more, so it might be worth it depending on your application.
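A minimal sketch of such a self-correction loop; the checker prompt, the model name and the retry logic are illustrative only, and retrieve() and generate_answer() stand in for the retriever and the RAG generation step.

from openai import OpenAI

client = OpenAI()

def retrieve(query: str) -> list[str]:
    raise NotImplementedError  # placeholder retriever

def generate_answer(question: str, context: list[str]) -> str:
    raise NotImplementedError  # placeholder RAG generation step

def answer_with_check(question: str, max_attempts: int = 2) -> str:
    answer = ""
    for _ in range(max_attempts):
        context = retrieve(question)
        answer = generate_answer(question, context)
        # A second LLM checks whether the answer is supported by the context.
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{
                "role": "user",
                "content": (
                    "Context:\n" + "\n".join(context) + "\n\n"
                    f"Question: {question}\nAnswer: {answer}\n"
                    "Is the answer fully supported by the context? Reply YES or NO."
                ),
            }],
        ).choices[0].message.content.strip().upper()
        if verdict.startswith("YES"):
            return answer
        # Otherwise loop and try another retrieval round (kept simple here).
    return answer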
Another one is agentic systems. This is just a fancy word for applications that need to handle both RAG and non-RAG queries. For example, people might be asking your application both "What's the capital of France?" and "What is 20 plus 30?". The first one might well need some context; the second one clearly does not. Actually, giving context to the second question might even confuse the LLM and make it produce an answer that is really wrong. So at the start of these pipelines you might want to put a classifier, maybe still an LLM or maybe another classifier that you trained, that decides whether the question needs context or not, and only run the retriever if it does. This is nice because you can improve the scope of the questions that your system can handle, and as far as I know it's actually quite widely used, so that's a nice one. A minimal sketch of such a router is shown below.
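A minimal sketch of such a router, using an LLM as the classifier (a small trained classifier would be used the same way); the prompt and the model name are illustrative.

from openai import OpenAI

client = OpenAI()

def needs_retrieval(question: str) -> bool:
    # Ask a small model whether this question needs retrieved context at all.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{
            "role": "user",
            "content": (
                "Does answering the following question require looking up external "
                "or up-to-date information? Reply YES or NO.\n"
                f"Question: {question}"
            ),
        }],
    ).choices[0].message.content.strip().upper()
    return reply.startswith("YES")

# Only run the retriever when the classifier says the question needs context:
# needs_retrieval("What's the capital of France?")  -> may well need a lookup
# needs_retrieval("What is 20 plus 30?")            -> should not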
And the last one, which I think is the fanciest and the most complicated, is multi-hop. In this case, you use chain-of-thought to make a series of retrievals based on the context. Maybe the question is really complicated, something like: when was the sister of the current king of Sweden born? Something like this, where you really need to retrieve in steps. You need to retrieve who the current king of Sweden is, whether this person has a sister, and when this sister was born. So you need to do three retrieval runs. The LLM needs to understand that it has to ask three questions here, send three retrieval queries, and then aggregate all the results together. Multi-hop is much more complicated, and it's useful if you have really complex questions or you need to explore a topic. But in practice it's also very expensive, and it can be a bit brittle, so it needs to be used with care.
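A compact sketch of what a multi-hop loop could look like: the LLM proposes one follow-up search query at a time, we retrieve for each, and the accumulated evidence feeds the final answer. The prompts, model name and stop condition are illustrative, and search() is a placeholder retriever.

from openai import OpenAI

client = OpenAI()

def search(query: str) -> str:
    raise NotImplementedError  # placeholder retriever

def multi_hop_answer(question: str, max_hops: int = 3) -> str:
    evidence: list[str] = []
    for _ in range(max_hops):
        evidence_text = "\n".join(evidence) or "(none yet)"
        # Ask the model what to look up next, given what we know so far.
        plan = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {question}\nEvidence so far:\n{evidence_text}\n"
                    "If more information is needed, reply with one search query. "
                    "If the evidence is sufficient, reply exactly DONE."
                ),
            }],
        ).choices[0].message.content.strip()
        if plan == "DONE":
            break
        evidence.append(search(plan))
    # Final answer grounded in the accumulated evidence.
    all_evidence = "\n".join(evidence)
    final = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Read the evidence below and answer the question.\n"
                f"Evidence:\n{all_evidence}\nQuestion: {question}"
            ),
        }],
    )
    return final.choices[0].message.content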
Okay, last but not least, a word on fine-tuning. A lot of people seem to believe that fine-tuning is an alternative to RAG, that you need to do one or the other. In practice, this is really not true. Fine-tuning sometimes should be used with RAG; the two can really help each other. In particular, fine-tuning can be used together with RAG if your domain is really complex, like medical, legal or scientific domains, because it can help the LLM understand what the question is about. It can help in situations where your LLM has a lot of generation failures because it doesn't understand the question at all. Fine-tuning is really good in those situations.
A caveat here is that sometimes you need to fine-tune both the LLM and your embedding model, because in order to do retrieval with embedding similarity, for example, the embedder also needs to understand what the question means. It needs to understand the meaning of the keywords you're using in order to embed them properly and find more relevant results. This is a bit less well known, but it's something that can improve a lot of situations in which your retriever is underperforming and you haven't found a way to improve it otherwise.
Last but not least, fine-tuning can also help to alter the behavior of your LLM a little. If you want a specific voice, for example, or you want to improve conciseness or something like that, you can do it with prompt engineering, but it's often not very consistent. So if you want to be really sure that your LLM improves in that direction, and you want to do it strongly, fine-tuning is normally the best idea.
All right, so that's all. As you can see here, I also have a talk summary on my personal blog, so if you want to check it out, it's there. That's it, thank you for listening.
All right, thanks a lot, Sara, for this very interesting talk. We still have a good five minutes left, so for Q&A we would ask you to queue up at the two microphones in the middle of the floor. If there are any questions to be asked, feel free to just come up to the microphone and ask them.
Hi, Sara, thanks for your talk. I have a question about the slide on end-to-end testing. You mentioned cosine similarity between the answer and the ground truth, and my question is: is that similarity computed on embeddings, and of what exactly?
Yes, yes, semantic similarity here, cosine similarity here, really means between the embedding of the answer and the embedding of your ground truth. Normally, in this case, when you do end-to-end evaluation, you have a dataset of the questions and the answers you expect, so what you do is embed the generated answer, embed the ground-truth answer, and then compute the cosine similarity between them. That is what is meant by semantic similarity, or semantic answer similarity: you just do cosine similarity between these two. Okay, got it. And is there any advice on which model to use to generate these embeddings? That really depends a lot on what you're doing,
because there are some normal embedding models, for example the Jina embedding models, that are pretty powerful and have a big context length; you can find them on Hugging Face. But honestly, there are so many options. There are OpenAI embeddings as well. The advantage of using one that you can find on Hugging Face is often that you can fine-tune it if you need to, while OpenAI, on the other hand, is easier to use if you're in a hurry, let's say, because it's just hosted, it's there. It depends on the level of control you need and how specific your application is. There are really many, many ways to select the embedding model, yeah. Okay, got it, thanks. Okay.
Once again, thanks for the talk. I thought your slide about the different approaches to solving this was brilliant. When working with business stakeholders, you often find people are very worried about hallucinations. So what kind of accuracy jump can you expect from using these techniques, and how do you talk to the business people to reassure them that the hallucinations aren't going to stymie the whole project? Yeah, I mean, having no hallucinations, or at least, let's say, being able to guarantee that there will be very few,
is really a bit of a holy grail, I would say. What you can say with RAG is that, if you want, you can go full scale with everything: the best embedding model, the best search, the best models, and you can stay, let's say, cutting edge. That way you can guarantee that you have the lowest hallucination rate possible with the state of the art. That's kind of the best you can do right now. Normally, especially with the very latest models, like GPT-4o, at least in my experience, hallucinations, especially when paired with RAG, tend to be really, really rare. But it depends a lot on what you're doing, because GPT-4o is also not gonna be an expert doctor, so you can try your best, but giving numbers makes little sense until you have a full system, and as we've seen, retrievers can vary a lot. Their performance can be really variable, and it also depends on the data you have.
Are you searching the internet? Are you searching your own dataset? So giving numbers here, over such a wide range of possible implementations, is really hard. I can say that in my experience RAG really reduces hallucinations drastically, and it's also easy to show: when you make a proof of concept and you let stakeholders try it with and without RAG, it's normally very easy to tell the difference. But again, you cannot guarantee zero hallucinations, or zero dangerous hallucinations, or tell them, yeah, we're not gonna have any, it's all good. The state of the art is improving, but we're not at zero hallucinations yet.
Perfect, thank you. Hi, Andrea. I have a question about the pre-retrieval step of chunking. As you said, it can affect retrieval. So in your opinion, what would be some of the best practices for chunking the context in order to improve retrieval? Yeah, wow, okay. This is an interesting question, because again, it depends on your retrieval. I will give you my experience with mid-size or relatively large embedding models. The most important part, I would say, is to really stay within the context length of the embedding model. Some people seem not to realize that embedding models also have a context length, and it's normally much smaller than that of LLMs. Not always: for example, the Jina embeddings have a very large context window, but normal embedding models actually have a very tiny one.
It might be as low as 500 tokens. So first of all, you want to chunk lower than that. And next, honestly, I've found that around 500 tokens can be pretty good, maybe 1,000 tokens, depending a lot on your embedding model. But going very big, or going close to, for example, the 8K that the Jina models can embed, makes it harder for the retriever, just because there is so much information in the chunk you're embedding that all the meaning averages out. It's harder for the embedding model to tell whether the information is really there, because there is so much other information that got averaged out with it. So maybe small paragraphs, 500 to 1,000 tokens, depending on your model. That's more or less the size I have in mind when I talk about this. Thank you. Ah, yeah. I'm very sorry to have to interrupt,
but we have to stop now because we're already out of time. So thank you all for the nice questions. Thanks.