
Fact Checking Rocks: how to build a fact-checking system


Formal Metadata

Title: Fact Checking Rocks: how to build a fact-checking system
Number of Parts: 60
License: CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
In this infodemic era, fact-checking is becoming a vital task. However, it is a complex and time-consuming activity. In this talk, we will see how to combine Information Retrieval tools with modern Language Models to easily implement a fact-checking baseline with low human effort. I will show you how to build a fun use case around rock music. The application is based on several Python open-source libraries: Haystack, FAISS, Hugging Face Transformers, Sentence Transformers. This step-by-step implementation will be an opportunity to learn more about dense retrieval and Natural Language Inference models in a hands-on way. I will share some insights into developing modern Natural Language applications.

**Why it's relevant:** Fact-checking is significant to society, although it is still difficult to do automatically. Using modern NLP tools can help speed up and automate part of this task.

**What the audience will learn:**
- Dense retrieval for semantic search
- Natural Language Inference models
- How to build a fact-checking system using Haystack, FAISS, Hugging Face Transformers, Sentence Transformers
- How to integrate powerful (Large) Language Models in your NLP applications, conditioning them to operate on your knowledge base
- How to efficiently combine tools from Information Retrieval, NLP, and Vector Search
Transcript: English (auto-generated)
Hello, everybody. Thanks for having me here. This is Fact-Checking Rocks, How to Build a Fact-Checking Application Baseline Using Python Open Source Libraries.
I am Stefano Fiorucci, a machine learning engineer focused on natural language processing. I am interested in semantic, neural, and vector search. And I recently started to work for deepset after being a long-time contributor to the Haystack NLP framework.
In this infodemic era, fact-checking is becoming a vital task while being difficult and time-consuming. And with the latest wave of generative AI,
we can expect more and more noise on the web in the near future. So fact-checking will become more and more important. In this simple project, I build a fact-checking baseline about rock music, where the user enters a factual statement to check.
And my system tries to see if the knowledge base confirms or contradicts the statement. I use the knowledge contained in Wikipedia, modern natural language processing tools, and with a very low effort on manual annotation.
And the final outcome of my project is a simple Streamlit web app which is hosted on Hugging Face Spaces. And we will see it better later on. This is the agenda of this talk. Fact-checking rocks is built upon the Haystack NLP framework.
So we'll cover the basics of Haystack. Then I'll show you the idea and the implementation of this simple system. So let's start. Haystack by deepset is an end-to-end open source NLP framework
that you can use to build search systems, including semantic search systems, multimodal search applications, also extractive question answering, but it is also a large language model framework.
So you can build generative question answering applications and other, more complex decision-making tools that use large language models. It is modular, so you can choose among several databases
and nodes for various NLP tasks. And it is meant for production, so it can easily scale to millions of documents, and it also provides a REST API. And it is very customizable and developer-friendly;
in particular, I developed a custom node, so we will see how to customize it. In Haystack, the document stores play a central role, and there is a wide range of document stores; in the document store, you save the documents,
metadata, and vector representations. So the document stores can range from SQL databases to search engines such as Elasticsearch or OpenSearch, to modern vector databases, including Milvus, Weaviate, or Qdrant.
In Haystack, nodes are the components that perform a single operation on data: there are nodes called data connectors to convert files into documents and to pre-process these documents.
And there are also several nodes that cover NLP tasks, including retrieval, question answering, summarization, text classification, named entity recognition, and so on. And you can combine these nodes
to form directed acyclic graphs, which are called pipelines in Haystack, and you commonly use them for indexing or querying purposes. Now, let's go to the idea of my project.
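To make the node-and-pipeline idea concrete, here is a toy sketch. This is not Haystack's real API, just the concept: each node performs one operation on the data, and the pipeline runs them in order.

```python
# Toy illustration of Haystack-style nodes and pipelines (not the real API):
# each node performs a single operation; the pipeline chains them.
class LowercaseNode:
    def run(self, data):
        # Normalize every text to lowercase
        return [d.lower() for d in data]

class FilterShortNode:
    def __init__(self, min_len=5):
        self.min_len = min_len

    def run(self, data):
        # Drop texts that are too short to be informative
        return [d for d in data if len(d) >= self.min_len]

class ToyPipeline:
    def __init__(self):
        self.nodes = []

    def add_node(self, node):
        self.nodes.append(node)

    def run(self, data):
        # Run the nodes in the order they were added
        for node in self.nodes:
            data = node.run(data)
        return data

pipe = ToyPipeline()
pipe.add_node(LowercaseNode())
pipe.add_node(FilterShortNode(min_len=8))
print(pipe.run(["Green Day", "U2", "Jeff Buckley"]))
# → ['green day', 'jeff buckley']
```

A real Haystack pipeline works on richer objects (documents with metadata and embeddings) and supports branching DAGs, but the chaining principle is the same.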
We have an indexing part, where the data are crawled from Wikipedia, then pre-processed and encoded using a language model, so they are transformed into vectors and then saved in a document store. In the querying part,
the user enters the statement to check; then the retriever tries to find the most relevant textual passages, and then this new custom node, the entailment checker, tries to compare the statement to check with the most relevant passages
and produces a summary score that indicates entailment, neutral or contradiction. Now, let's focus on the most important parts of this system, starting from the embedding retriever. Generally, a retriever is a component
that sweeps through a document store index and tries to find the most relevant documents, the most relevant textual passages. In Haystack, the embedding retriever is based on Transformer models, and among Transformer models,
sentence transformers are a family of models that are trained in such a way that similar texts are embedded near each other in a shared vector space, a shared embedding space.
So, what happens, in short? At indexing time, all the textual passages, all the documents, are converted into vectors using this language model.
In this image, we can see three different documents, two speaking about Green Day and one about Jeff Buckley. Then, at query time, the query, or in this case the statement to check, is also represented as a vector. And then,
using vector search, the most relevant textual passages, the most relevant documents, are found by semantic similarity. In this case, the query is about Green Day being a punk rock band, so probably the retriever will return
the two documents speaking about Green Day. Another important part of this project is natural language inference models. In NLP, natural language inference is the task of determining whether a hypothesis is true, false, or undetermined, given a certain premise.
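Before looking at NLI more closely, the dense-retrieval step just described can be sketched with toy vectors. The three-dimensional "embeddings" below are made up for illustration; a real sentence-transformers model produces vectors with hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hand-written "embeddings" for three documents.
doc_vectors = {
    "Green Day formed in the late 1980s": [0.9, 0.1, 0.0],
    "Green Day play punk rock":           [0.8, 0.2, 0.1],
    "Jeff Buckley released Grace":        [0.1, 0.9, 0.2],
}
query_vector = [0.85, 0.15, 0.05]  # embedding of the statement to check

# Rank documents by semantic similarity to the query vector
ranked = sorted(doc_vectors,
                key=lambda d: cosine(doc_vectors[d], query_vector),
                reverse=True)
# the two Green Day passages rank above the Jeff Buckley one
```

Vector databases such as FAISS do exactly this ranking, but with approximate nearest-neighbor indexes so it stays fast over millions of documents.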
And in this slide, we can see a sample of a natural language inference data set where, in each row, we have the premise, the hypothesis, and the label, which can be contradiction, neutral, or entailment. So, once such a model is trained,
we can pass to it a premise and a hypothesis, and get back the probabilities of contradiction, neutral, and entailment. Nowadays, these models are very commonly used for zero-shot classification tasks.
For example, if I have some sentence and I want to understand if it speaks about sports without training a model, I can pass the original sentence as the premise and, as the hypothesis, "This sample is about sports."
And then I get back the probabilities of entailment or contradiction. But in my project, I used natural language inference models for their original task. In particular, I integrated them in the entailment checker node. This node takes as input
the most relevant textual passages and the statement to check, and performs several operations. First, we compute the textual entailment between each passage and the statement to check. Then these scores are aggregated
by computing a weighted average, where the weight is the relevance score given by the retriever, which expresses, in this case, the similarity between the single passage and the statement to check. Then I also applied an empirical consideration.
If, in the first N passages, where N is less than K, there is strong evidence of entailment or contradiction, it is better not to consider the less relevant passages. Otherwise they would add noise; and by ignoring them, we are also faster.
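The aggregation just described, a relevance-weighted average with an early stop, can be sketched as follows. The label names and the idea of a threshold mirror the talk; the exact function shape is an illustrative assumption, not the project's actual code.

```python
def aggregate_entailment(per_passage_probs, relevance_scores, threshold=0.5):
    """Relevance-weighted average of per-passage NLI probabilities, with an
    empirical early stop: once entailment or contradiction is strong enough,
    the less relevant passages are ignored. Assumes at least one passage."""
    agg = {"entailment": 0.0, "neutral": 0.0, "contradiction": 0.0}
    total_weight = 0.0
    for probs, weight in zip(per_passage_probs, relevance_scores):
        # Accumulate each label's probability, weighted by retriever relevance
        for label in agg:
            agg[label] += probs[label] * weight
        total_weight += weight
        current = {label: score / total_weight for label, score in agg.items()}
        if max(current["entailment"], current["contradiction"]) > threshold:
            break  # strong evidence found: skip the remaining passages
    return current
```

With a first passage scoring 0.9 entailment, the loop stops immediately and later, less relevant passages never dilute the verdict.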
So, this is the rough idea of my project, but now let's go to the implementation using Haystack. I start with the indexing phase, and the first step is loading and pre-processing data.
I already crawled some JSON files from Wikipedia. You can find the crawling part in the GitHub repository. So, now I have some JSON files, each one with the text and some metadata, including the page title and the URL.
So, I am putting all these dictionaries in a list. Then I'm initializing a preprocessor in order to split every
text file into chunks of two sentences. And then I only keep the documents with at least a minimum number of words; otherwise, they are not very informative. In the second step, we encode the documents and write them in the document store.
So, we first initialize a document store, a FAISS document store, which is based on FAISS, the Meta library for vector similarity and vector search. But in this case, I could also use
other, more performant vector databases, such as Weaviate, Milvus, or Qdrant. Then, by calling the method write_documents, I'm writing the text of these documents in the document store. I also initialize an embedding retriever,
which is based on a sentence transformers model. And I chose to embed the meta field "name", containing the title of the page, because it is important information to be encoded in the vectors. Then, by calling update_embeddings,
I'm generating these vectors, these embeddings, and storing them in the document store. So, now we have done the indexing phase, and we already defined the embedding retriever. Now, we have to develop
our custom entailment checker node. But how can you add a custom node to Haystack? There is a simple guide in the Haystack documentation. So, you have to create a new class that inherits from BaseComponent.
Then, you have to set outgoing_edges as a class attribute; and if it's not a decision node with multiple edges, there is only one outgoing edge. Then, in the run method, the actual action on the data happens.
And this method should return a tuple, which consists of a dictionary containing the data and a string that indicates the outgoing edge. Then, you can also define a run_batch method for the case when you want to submit
several queries at the same time. But this is the idea of adding a custom node in Haystack. Let's go to my actual implementation. So, I started with the __init__ method, the constructor, where I defined several attributes to be reused later.
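The custom-node recipe above can be sketched structurally. To keep the example self-contained and runnable, a plain class stands in for Haystack's BaseComponent, and the NLI result is a placeholder.

```python
# Structural sketch of a Haystack (v1) custom node. In the real project the
# class inherits from haystack.nodes.BaseComponent; a plain class is used
# here so the example runs without Haystack installed.
class EntailmentCheckerSketch:
    outgoing_edges = 1  # not a decision node, so a single outgoing edge

    def run(self, query, documents):
        # The real node runs NLI between each document and the query and
        # aggregates the scores; a placeholder result stands in here.
        aggregate = {"entailment": 0.5, "neutral": 0.3, "contradiction": 0.2}
        output = {"documents": documents,
                  "aggregate_entailment_info": aggregate}
        # A node's run() returns a tuple: (data dict, outgoing edge name)
        return output, "output_1"

    def run_batch(self, queries, documents):
        # Optional: process several queries at the same time
        results = [self.run(q, documents)[0] for q in queries]
        return {"results": results}, "output_1"
```

The `(dict, "output_1")` return contract is what lets the pipeline route a node's output to the next node in the DAG.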
In particular, I give the user the possibility to use the GPU or not, which can speed up the inference. Then, I am loading the tokenizer and the model from Hugging Face Transformers.
I'm also setting the batch size, the entailment/contradiction threshold to be used later for that empirical consideration, and some other attributes. Then, in the get_entailment method, the actual natural language inference happens.
It takes as input two strings, the premise and the hypothesis. And in this method, we first prepare the input of our model, which consists of the premise, the separator, and the hypothesis. I'm tokenizing this input. Then, I make a forward pass
in our natural language inference model, and it returns the logits, which are unnormalized scores. So, I apply the softmax function in order to turn them into probabilities. And this function returns a dictionary
which contains the labels as keys and the probabilities as values. We can see a little example of the output of this function: passing "I have a blue-eyed cat" as the premise and "My cat has yellow eyes" as the hypothesis, I get a dictionary showing a probability
of contradiction of 0.99. Then, I define the run method. This method takes as input the query, so the statement to check, and a list of textual passages, the documents.
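The logits-to-probabilities step can be sketched as follows. The logit values are made up to resemble the blue-eyed cat example, and the label order is an assumption: it depends on the specific NLI model's configuration.

```python
import math

def softmax(logits):
    """Normalize raw logits into probabilities that sum to 1."""
    exps = [math.exp(logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_label_probs(logits,
                          labels=("contradiction", "neutral", "entailment")):
    """Final step of the get_entailment logic: map the model's three raw
    logits to a {label: probability} dict. The label order here is an
    assumption; check the model's config for the real mapping."""
    return dict(zip(labels, softmax(logits)))

# Made-up logits where the first (contradiction) logit dominates,
# as in the blue-eyed cat vs. yellow-eyed cat example
probs = logits_to_label_probs([5.2, -0.3, -1.1])
# probs["contradiction"] is about 0.99
```

In the real node, the logits come from a forward pass over the tokenized "premise [SEP] hypothesis" input; everything after that is exactly this normalization.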
And in this for loop, we are calling the get_entailment method several times, passing the passage text as the premise and the user statement as the hypothesis. And we are computing the weighted average.
I'm also applying the empirical consideration that if there is strong evidence of entailment or contradiction, I don't want to consider less relevant documents, so I break the for loop. And then, in the end, I return a dictionary
containing the used documents and aggregate entailment information with summary scores about entailment, neutral, or contradiction. So, now, we can simply build a fact-checking pipeline,
and we start with the new entailment checker node, which is based on a strong natural language inference model. And then, I create a pipeline, and I add two nodes, the retriever and the entailment checker. Now, I can test this pipeline.
For example, passing a statement about one band's influence on other bands, it returns the used documents, but the most important information is the aggregate entailment information,
which shows a probability of neutral of 0.48 and of entailment of 0.52. So, probably, the statement about that influence is true. Now, I want to show you
my application in a little video. So, we start with a very simple statement.
Elvis Presley is alive. And the knowledge base seems to contradict our statement. We also get some summary scores and a triangle plot.
Wikipedia says that he is dead. And it also shows the relevant snippets from Wikipedia. And I recently implemented an experimental feature
that explains the verdict using a large language model. I'm using a T5 large model from Google, an open source model. And the explanation is that Elvis Presley, according to Wikipedia, died in 1977.
Therefore, the final answer is no. This T5 large is a good model, but not so strong; still, the explanation can be good. So, let's go to other examples.
Now, we try another example: the Beach Boys were involved with the Manson Family. In this case, the knowledge base confirms our statement. And the interesting fact is that in this case,
four different textual passages were used, and so this information was somehow aggregated. The explanation in this case is not very good, but acceptable. The last example is about King Crimson,
who played progressive rock. And the knowledge base seems to confirm this statement. What is interesting in this case is that our knowledge base does not contain information
about King Crimson, because I only crawled a small part of Wikipedia. And so, it uses the information from the Tool Wikipedia page, with a good explanation: King Crimson are progressive rock pioneers,
and Tool have always expressed the massive impact that progressive rock pioneers have had on their music. Which is good. Let's go back to the presentation. My system is very simple.
It's a baseline, so it has several limitations. First, for real-world fake news detection, for example, you need statement detection. In this case, the user was entering the statement to check, but that's not a real-world setup.
The second point is that Wikipedia is taken as a source of truth, but Wikipedia does not contain universal knowledge. And there are also other interesting approaches. For example, last year, Meta released a paper
where they introduced a data set called Sphere, where they take a snapshot of the entire web as a non-curated source of knowledge. A practical consideration is that there is also
no guarantee that the best textual passages for natural language inference emerge from semantic similarity. For example, if I say the Beatles were influenced by U2, I probably get back several passages
about the Beatles and some about U2, but none of them is useful to understand if my statement is true or not. Finally, I should say that no organic evaluation of this project was performed, only experiments. What I can say is that some open source users
forked my project and used it for the energy sector in the French language. So maybe sometimes it can be useful, and it is adaptable. I can easily think of some quick
and easy ways to improve this very simple project. The first way is to expand the knowledge base: due to infrastructure limitations, I only considered a small portion of Wikipedia. But another more important and easy way
to improve this system is to adapt the retriever to the domain using generative pseudo-labeling or other similar techniques. It is a technique from Nils Reimers
and other researchers, and several times in this conference I heard about it or similar techniques. What is interesting is that it doesn't need manual effort.
And so, you can adapt the retriever to the domain using this technique. Building this project, I learned some lessons that I want to share. Nowadays, language models and large language models
have strong abilities in text comprehension and generation, but their knowledge is generic, not domain-specific, and not easily updated over time. And speaking of generative models, we know that they can easily hallucinate.
So a good idea that emerged several times in this conference is to combine them with retrieval systems. And along these lines, I want to speak for a few minutes
about the LLM support in the Haystack project. In particular, in Haystack you have the PromptNode, which is a thin abstraction over large language models. So you can have this PromptNode
and connect it to several different large language models, including those from OpenAI, Cohere, and Anthropic, but also open source models: you can load your local model
through Hugging Face or call the Hugging Face Inference API. And with this abstraction, you can easily change the model used. What you can also do with large language models
and with Haystack is agents; you have probably heard of agents in other projects too. An agent is a large language model with a specific prompt and with some tools.
These tools are focused on specific tasks, and they can be deterministic, such as a calculator, but also statistical. And in Haystack, you can use other nodes or pipelines as tools.
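The agent loop can be sketched conceptually. This is not Haystack's Agent API: the tool names, the stand-in pipelines, and the keyword-based "reasoner" are all made up to illustrate the idea that an LLM picks tools and observes their answers.

```python
# Conceptual sketch of an agent, NOT Haystack's Agent API. A real agent asks
# an LLM which tool to use; here a keyword heuristic fakes that decision.
def reading_list_qa(question):
    # Stand-in for a question answering pipeline over my reading list
    return "The shortest book in the list has about 120 pages."

def web_qa(question):
    # Stand-in for a question answering pipeline over web search results
    return "A web search would answer this."

TOOLS = {
    "ReadingListQA": reading_list_qa,  # "questions about my reading list"
    "WebQA": web_qa,                   # "questions that need a web search"
}

def choose_tool(question):
    # Fake reasoning step: a real agent lets the LLM pick the tool based
    # on each tool's natural-language description
    return "WebQA" if "amazon" in question.lower() else "ReadingListQA"

def run_agent(question):
    tool = choose_tool(question)
    observation = TOOLS[tool](question)
    return f"[{tool}] {observation}"
```

The key idea survives the simplification: the reasoning engine is swappable, and the tools are just callables with descriptions, which is why pipelines can serve as tools.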
So, about this aspect and about agents, I want to show you a little experiment that I did some time ago, playing with agents.
So, in this example, I'm importing my reading list,
the books that I want to read, in CSV format. And this is my reading list. You have very little information: the pages, the topic.
Then I first build a pipeline for question answering on my reading list. So I'm storing my reading list in a document store, and then I'm defining a pipeline for question answering
based on open source models. I also create a pipeline for question answering on the web, using SerpAPI and a PromptNode. In this case, I use an OpenAI model.
Then I want to start building my agent. The reasoning engine of this agent is another PromptNode, also based on an OpenAI model in this case.
And then I want to use the already defined pipelines as tools: the reading list question answering, which is "useful for when you need to answer questions about the books in my reading list". This description is for the large language model.
The web question answering pipeline is "useful for when you need to Google questions". Now I can finally define my agent, which is based on the PromptNode and on the already defined tools.
And then I can ask a question. For example, can you provide me with information about the shortest book on my reading list, including author and price on Amazon? And so it starts calling the tools when it needs it.
This example is very simple and naive. What is interesting about this approach, I think, is that you can use the large language model
as a reasoning engine, and you can use your data, but also the data present on the web. In this case, I used the OpenAI models because they are known to work well. But with the current situation
of new open source models being released every week, the idea is that you can also use open source models on your data. And so there are fewer problems
with the privacy of data. I wanted to show you this little example. You can find Haystack on the web; it is an open source project. You can find my project on GitHub and on Hugging Face Spaces.
You can play with the demo, and you can find me on several social networks. Thank you for your attention, and I look forward to your questions.
Thank you. Thank you very much, Stefano. It was a very nice talk, very informative. We have time for questions. So again, Josue, as always, don't be shy. Thank you for a wonderful talk. Congratulations on having most of the talk
being devoid of LLMs. It was refreshing. Some years ago, there was this tweet going around which went something like, tell a horror story about your industry with four words. And my reply to that tweet was, set the confidence threshold. So what was your strategy to set the confidence threshold
for the textual entailment task? Yeah, as I said, it is a toy project. In this case, I did some experiments,
and I saw that for my entailment checker node, since I am using a natural language inference model that cannot apply sophisticated techniques,
for example chain of thought, having too much information was bad. And so, in my experiments, I ended up fixing the threshold, setting it at 0.5.
But I should experiment more. Thank you for your talk, Stefano. So I have an observation, probably more of a philosophical question, that maybe suggests additional steps
in your pipeline. Some fake news or fake statements don't necessarily have something you can use to contradict them, right? So if I say that around Alpha Centauri there is a blue metal unicorn flying, you don't really have a document that can prove
that it's false, because it would be up to me to prove there is a unicorn flying around Alpha Centauri. At the same time, you can guess that what I'm saying is not really likely to be true. So when you don't find anything in your data store to contradict,
or, at the same time, to be in line with the statement, have you thought of adding an additional step? Because I guess at this point you will just get a neutral response. Potentially you can use classifiers based on large language models, and maybe hoax-debunking kinds of websites,
where you find hoax texts labeled as "this is a hoax for these reasons". I was just guessing; I mean, have you thought about these kinds of scenarios, about how to refine neutrality? Yeah, I haven't tried,
but it can absolutely make sense, I think. And I think also that, while I spoke a bit about fake news, there are several subtle types of fake news.
With this approach, I was trying to use some source of truth, if available, while some other approaches to fake news detection focus on classification
and on training. But probably, if you try to combine the two things, you can take the best from both worlds. Thank you.
Anyone else? No? Well, if not, then we can thank Stefano again for the great talk. Thank you.