
This is not the search you are looking for


Formal Metadata

Title
This is not the search you are looking for
Subtitle
Searching is easy, finding the right stuff is hard.
Title of Series
Number of Parts
163
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Searching for stuff seems like it should be easy; after all, it’s something we do every day. Take your favourite search engine, type something in the box, click search, and you’re done! All the answers you are looking for, ready for you to consume. With these systems, you generally have no idea what answers you should be getting; just that you’ll get something. Your universe of knowledge is now whatever Google just told you. When the person searching knows the results they should be getting, producing the expected search results is much harder. This talk is an introduction to text searching: some search terminology, some basics on how text search engines work, the questions you should ask at the start, and some of the common problems you run into when designing search systems.
Transcript: English (auto-generated)
Can you all hear me? Good morning, and welcome to the first session. This is a non-coding session: it's just me talking about some of the things I've done with text in the past and still do nowadays. So if you're looking for a coding talk, this is not the session for you. I'm mainly going to be talking about the things you need to do before you put anything into a text search engine, so if you're looking for specific text-search technologies, this is not the talk for you either. The agenda is just the basics of text searching: what an index is, indexing and how you create one, getting and extracting the text, doing some things with that text, and then a little bit on queries. That's all it is. I'm kind of setting up the track for the rest of the conference, so if this is not the talk you're looking for, you probably should go to another room; please feel free to move on.
OK, who am I? I'm Toby Henderson. I'm an application architect at Huddle, a big UK company, past the startup stage now. These are my Twitter and GitHub details. Normally I talk about OpenRasta or distributed systems, so this is a bit of a different talk for me. But if you're interested in some of the projects: OpenRasta is a framework for creating REST APIs, and Brighter is a command processor for building event-driven systems, which just made it onto the ThoughtWorks tech radar. OK, so: document retrieval.
Who here has created or used text searching or text search engines? OK, how many of you are indexing metadata? And how many of you are actually indexing large bodies of text? OK, so there's a mix. This talk is mainly focused on large bodies of text, because indexing metadata is quite easy: it already has its place. You heard Bruce talking about it, that metadata is probably more important than big bodies of text. Big bodies of text are hard. They have tons of information inside them; how do you get any meaning out, or even know what a document is talking about? And when results come back, why am I getting these results? It's difficult.
When it comes down to it, text searching is, at the end of the day, searching strings, and there are two kinds: string search and full-text search. There is a difference between the two. (There is also searching the metadata, which is basically matching keywords and the like.) With string searching, you're looking for a pattern in a body of text. When you're using grep or find-in-files, you give it a pattern, it goes through the entire piece of text character by character, and it sees whether the pattern matches anywhere. Regular expressions do fancier things with look-aheads and look-behinds, but fundamentally you're still trying to find a pattern in an entire body of text.
We do this all the time, in our IDEs and things like that. What we're doing there involves no pre-processing: we open the text file, start at the beginning, read all the way through it, and try to find our pattern. To get a set of matching documents, we would have to read all the documents, and obviously, as your document collection grows, that gets slower and slower. This is called serial scanning: you're scanning all the way through the text. And this is where text search engines came from. People said, well, this is really slow; we need to make this better.
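As a minimal illustration (not from the talk; the file paths are hypothetical), serial scanning is essentially this: read every file end to end and test for the pattern.

```python
from pathlib import Path

def serial_scan(root: str, pattern: str) -> list[Path]:
    """Find documents containing `pattern` by reading every file in full."""
    matches = []
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="replace")
        if pattern in text:  # scans the whole text, character by character
            matches.append(path)
    return matches

# Every query re-reads the whole collection, so query time grows with
# corpus size -- exactly the problem full-text indexes were built to solve.
print(serial_scan("./docs", "inverted index"))
```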
How do we make it better? That's where full-text search comes from. Full-text search does not serially scan the text inside each document; instead it extracts all the words from the documents and searches that collection of words rather than the entire body of text. There are basically two parts to full-text search: the indexing part, which is getting all the words out of the text, and the searching part, which queries the index you've now created. Another name for an index is a concordance. Years and years ago, people did this with the Bible: they took every word in it and created a reference, like the ones in the back of books, telling you where a word appeared and on which page. That's a concordance, and people built them manually, so this isn't something we've invented; it's been around for a very long time. The searching part is just referencing that index.
So what is indexing? The first step is to break your piece of text down into individual terms; for simplicity's sake I'll just call them words. The easiest way to do that is to break on whitespace, and you get a whole set of words. It gets a little more complicated than that, because you don't want the special characters: you don't want quotes, you don't want dashes. But sometimes those characters have special meanings in different languages, so you have to think about these things right at the beginning. If you don't, you end up like a lot of people who just go: great, I need a text search engine; we'll take the text over here, put it into the engine, search it, and it'll all work great. If that actually works for you, you're very lucky.
Generally, it doesn't work as well as you want it to; you end up with special characters in your index. So the usual steps in indexing are: break everything into words, get rid of the special characters you don't need, keep the words themselves, and lowercase them. The reason for lowercasing is that when you compare against what people are searching for, you want a common form to compare with; even capitalization will cause you problems. There's also stemming, which I'll talk about a bit later. And what do you do with all these words? What most text search engines use is what's called an inverted index.
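As a rough sketch of those normalization steps (illustrative Python, not any particular engine's analyzer):

```python
import re

def tokenize(text: str) -> list[str]:
    """Break text into lowercase word terms, dropping punctuation."""
    # \w+ keeps letters, digits and underscore; real analyzers are far
    # more careful about hyphens, apostrophes and non-English scripts.
    return [token.lower() for token in re.findall(r"\w+", text)]

print(tokenize("Hello, Toby -- welcome!"))  # ['hello', 'toby', 'welcome']
```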
The most basic way to picture an inverted index: take a list of sentences, with numbers down one side as if each were a document, so documents one, two, three, four, five, each with a piece of text. When you index, you break everything on the whitespace, get rid of the commas and the full stop at the end, and you're left with every one of those words. Put that into an inverted index and it looks like this: you have "hello", and then every document that word appears in. The nice thing is that searching now becomes very easy. When you type in the word "hello", the engine looks into the alphabetically ordered index, does what amounts to a binary search straight down it, finds "hello" very quickly, and reads off all the documents the word is inside. This is basically text search 101, how text search engines work. From there you can take it a lot further: if I search for "hello OR like", I take the union of the two sets of documents those words appear in; if I do an AND, it's the intersection of those two sets.
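A minimal sketch of that structure and the two set operations, in illustrative Python (the documents here are made up):

```python
import re
from collections import defaultdict

docs = {
    1: "Hello, world.",
    2: "Hello Toby, welcome.",
    3: "I like searching text.",
}

# term -> set of document ids containing it
index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in re.findall(r"\w+", text.lower()):
        index[term].add(doc_id)

def search_or(*terms):   # union of the posting sets
    return set().union(*(index.get(t, set()) for t in terms))

def search_and(*terms):  # intersection of the posting sets
    return set.intersection(*(index.get(t, set()) for t in terms))

print(search_or("hello", "like"))   # {1, 2, 3}
print(search_and("hello", "toby"))  # {2}
```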
So it's quite easy to get going. The other thing an inverted index makes easy, because it's in alphabetical order, is wildcards: for starts-with style searching it's very easy to scan the index and find exactly what you want. But if you wanted an ends-with search, you would basically need two indexes: one for starts-with, and another with the terms stored in a different order, reversed, for ends-with.
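A sketch of why sorted order makes starts-with cheap, and how a second, reversed-term index handles ends-with (illustrative Python):

```python
import bisect

def prefix_scan(sorted_terms: list[str], prefix: str) -> list[str]:
    """Binary-search to the first term >= prefix, then walk forward
    while the prefix still matches -- cheap because the list is sorted."""
    i = bisect.bisect_left(sorted_terms, prefix)
    out = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        out.append(sorted_terms[i])
        i += 1
    return out

terms = sorted(["goodbye", "hello", "helmet", "help", "world"])
reversed_terms = sorted(t[::-1] for t in terms)  # the second index, for ends-with

print(prefix_scan(terms, "hel"))  # starts-with: ['hello', 'helmet', 'help']
print([t[::-1] for t in prefix_scan(reversed_terms, "lo"[::-1])])  # ends-with: ['hello']
```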
I just wanted to refresh people's minds on how indexes work, because this is what we need to build before we can actually search anything, and getting to this point is not always easy. With searching, you have different types of search. First there's term search: if you just put in the word "hello", that's a term search; you're searching for one term and one term only, combined with the different operators. If you stick with the basics, you've got AND, OR, and NOT. Modern text search engines give you fancier things too: distance, so you can say this word within so many words of that word; fuzziness, meaning any word roughly like this word, to whatever degree of fuzziness you choose; and many support regular expressions, character replacements, and the like to make it all easier. Then you've got phrase searching. Phrase searching is not looking for a single term; you're looking for multiple terms. If you're using Google, or really most text search engines, phrase searching has become known as "the search within the quotes". If I put in a few words, say "hello Toby, welcome", I expect to see that exact match. But remember, your index looks like the one we just built.
How does that work? If I've got three terms and just put them in without quotes, most search engines will do an OR by default, so you'll get a lot of documents back. If you make it an AND, you'll get back documents in which all three words appear, but that says nothing about the words appearing in order. So one really important thing when you're scanning your text is to capture extra information about it: the position each word appears at, its length, and how many times it appears inside the document. All of this is stored in the index too: for every word you put in, there's a position, a length, and a frequency. You can turn this on and off in most search engines depending on whether you need it; if you do add it, your index gets a lot larger. But with positions, when I do a phrase search, the engine can check that the three words appear one after another. It doesn't matter if someone put a comma or a full stop in, because you threw the special characters away; you just need to know that the three words come directly after each other. Phrase searching was a massive change to text searching. Most searching in the past was basically term-based, and a lot of us still use terms, but I'm sure many of you nowadays just copy and paste a piece of text in, put quotes around it, and get your document back.
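A minimal sketch of a positional index and the consecutive-position check behind phrase search (illustrative Python):

```python
import re
from collections import defaultdict

docs = {1: "Hello Toby, welcome.", 2: "Toby said hello. Welcome!"}

# term -> {doc_id: [positions]}
positions: dict[str, dict[int, list[int]]] = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, term in enumerate(re.findall(r"\w+", text.lower())):
        positions[term][doc_id].append(pos)

def phrase_search(phrase: str) -> set[int]:
    terms = phrase.lower().split()
    # Documents containing every term...
    candidates = set.intersection(*(set(positions[t]) for t in terms))
    hits = set()
    for doc_id in candidates:
        # ...where each following term sits exactly one position later.
        for start in positions[terms[0]][doc_id]:
            if all(start + i in positions[t][doc_id]
                   for i, t in enumerate(terms[1:], start=1)):
                hits.add(doc_id)
    return hits

print(phrase_search("hello toby welcome"))  # {1}: doc 2 has the words, not the order
```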
A lot happens in the background to get that result, though. And the most important thing about text searching is the text you start with, and the quality of that text; getting that text is not easy.
If it's already in a database, it's pretty much going to be in a known format, and that's quite easy. But we've got different sources of text. Plain text is the nice, easy one: you read it straight out and it's all good. RTF is a text format, but it has special control characters inside it; you don't want those in your text search, you're only interested in the text between the control characters, so now you need something that actually parses the format. Another thing that comes as a shock to a lot of people is that a search engine doesn't extract text for them; it only takes text. You can't give it a Word document; it doesn't work. You've got to get the text out of the Word document first. If you have markup languages, XML, SGML, HTML, you have to use parsers to get at the text inside, and it's not just the text: sometimes there's metadata too, like the title of the page, and you've got to extract that as well. What I've found when people design text search systems is that they forget about all the extra information that text extractors can give them.
Some of the easier formats to work with are Word, Excel, and PowerPoint. The easiest of all the Office documents is Word; Word is fantastically structured. You can basically just use Word itself, say "save as text", and you'll get most of the information you want. There's also the Apache Tika project, a Java text extractor which handles most of the Office formats for you. But one problem I've found with all the open-source projects is keeping up with changes in the formats. Word doesn't just have the current format, docx, which underneath is a zip file containing XML; you could write a program to unzip that, go into the XML, and read it out yourself, but that doesn't always work that well. Sometimes you've got an entire history of documents from 15 years ago in the old Word binary format, and that's not easy; there's probably a lot of information in there, people want all of it indexed as well, and I don't think Apache Tika actually supports it. It was a proprietary format, a stream with streams inside the stream, so it was quite painful to work with. But once you get stuff out of Word, you have a lot of richness besides the text.
The reason you don't want just the raw text is that sometimes the information about the text is important: that something was a heading is important, that something was bold is important, links within the document are important. If you can get that extra data out at the same time, you can add richness to your text, and you can then create other metadata from the text itself.
If you move on to something like Excel: Excel is great, you can read through each sheet and each cell and extract all the text. But the problem with Excel is that you get a lot of repeated data; people fill a column by dragging it down, and you get the same word repeated a thousand times. When a word appears a thousand times in one field in a text search engine, the relevance of that word drops massively, because relevance scoring assumes that a word appearing in a short piece of text matters more than a word appearing many times in a much bigger body of text.
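A toy version of that length-normalization idea (illustrative Python; not any engine's actual ranking formula, though real schemes like TF-IDF and BM25 share the shape):

```python
import math

def score(term: str, field_text: str) -> float:
    """Toy relevance: term frequency damped by sqrt, divided by field length."""
    tokens = field_text.lower().split()
    tf = tokens.count(term)
    if tf == 0:
        return 0.0
    return math.sqrt(tf) / math.sqrt(len(tokens))

print(score("widget", "annual widget report"))  # short field, one hit: ~0.58
# A dragged-down column: 1000 hits lost in a 10000-token sheet: ~0.32
print(score("widget", " ".join(["widget"] * 1000 + ["filler"] * 9000)))
```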
PowerPoint is another one. A lot of people use text extractors for it; among the free options you've got the Tika project and the iFilters from Microsoft, which you can build on top of to extract text. I really encourage people to look at the text that actually gets extracted, because sometimes it misses a lot of information. One thing you'll find with the iFilter is that it misses the notes on your PowerPoint: it extracts everything from the slides, but you don't get any notes. And which is the most important text? The notes are sometimes more important than what's on the slide itself. You can also extract text from audio and video using speech-to-text, and one of the nice things there is that you can extract extra information.
For instance, at what point was that word said within the video stream? Then, when you return results to people, you can take them straight back to the moment the word was said. It's quite easy to do this kind of thing. Images you can OCR to get text out of.
With text search engines you can even build image comparison or image search systems. If you're wondering how that works: you take the image and convert it to a text representation. You look at the image and say, OK, it's got a colour palette, so I'll extract the palette as a set of letters or numbers, in some colour model like RGB or CMY, and index those. Then you can compare two palettes for similarity: if these two have almost all the same numbers, they must have similar colours. You can keep breaking the image down the same way, brightness and all the other properties of the image, creating text representations of each and then working out similarity.
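A very rough sketch of that trick (illustrative Python; the quantization scheme here is made up for the example):

```python
def palette_tokens(pixels: list[tuple[int, int, int]]) -> set[str]:
    """Quantize RGB pixels into coarse buckets and emit them as text terms."""
    tokens = set()
    for r, g, b in pixels:
        # 4 levels per channel -> terms like 'c310' that a text index can store
        tokens.add(f"c{r // 64}{g // 64}{b // 64}")
    return tokens

def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of the two palettes' term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

sunset = palette_tokens([(250, 120, 30), (240, 100, 20), (30, 30, 80)])
beach = palette_tokens([(250, 125, 35), (200, 180, 140), (30, 40, 90)])
print(similarity(sunset, beach))  # ~0.67: mostly the same coarse colours
```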
One of the important things you have to deal with across all these different text sources is encoding, and I'll talk about that in a little bit. But one format I've left out is my least favourite format: the PDF.
I absolutely despise this format. PDFs were originally designed for printing, so there are no sections like in a Word document. Words are just characters that happen to be close to each other, and paragraphs are bunches of characters close to each other, evenly spaced out. Many of the text extractors don't take into account that there is a layout involved. When you open a page of a PDF and read the text, the first item might say: this text is positioned in the bottom left-hand corner. The page you're viewing may be talking about something at the top, but the first text you actually read out is from the bottom left-hand corner. So positioning is a problem, and layout is a problem, particularly what I call flow: columns, which are basically text blocks. If you've broken everything up to look like a nice newspaper with columns, the PDF doesn't know that this column continues into that column and then into the next one. You can handle all of this with PDF; part of it is PDF accessibility, which is what screen readers use to work out what to read to you first and in what order, a logical order. You may think: what does order matter, I just care about the text. Well, if you want to do anything beyond AND, OR, and NOT searches, if you want to say "find these words close to each other", order starts becoming extraordinarily important.
You want to say: I want this word within 10 or 12 words of that one. Who uses that type of search? Lawyers are one example: they know that in their world, if one word appears in the same paragraph as another word, it's some piece of legalese they're interested in. You can use this kind of thing to pre-process text, get more intelligence from the words, work out patterns, and create more meaningful metadata on top of big bodies of text. But PDF? No. So, once you've got the text out of the documents,
you then have to deal with a whole set of other problems. The first is character encoding. You need to know whether it's ASCII, UTF-8, UTF-16, one of the ISO character sets (I think there are about 20 of them), or even EBCDIC, which is a really old one. A lot of the time you don't know where your data is coming from, and you've got to make sure you read it in with the right encoding, because what you want is to write it out, ready for your text search engine, in one format and one format only. Another thing that can matter is the byte order itself: the endianness, whether it's big-endian or little-endian. Why would you choose one? Some people just like the big end better than the little end; it's basically a choice, but it can cause problems. If you write in one endianness on one system, don't convert it, and open it on another, it can actually swap the bytes around within the characters. And then there are the other classic problems, like LF versus CRLF between Linux and Windows, which can cause you all sorts of parsing trouble.
Then there's the byte order mark. A BOM identifies that a document is, say, UTF-8, and what its byte order is, which makes life a lot easier, but a lot of systems don't actually honour it. Take Java, for example: there was a bug in its text stream reader for a very long time (I'm not sure if it's fixed even now) where it didn't understand the byte order mark. For UTF-8, the BOM is three bytes at the beginning of the stream; when something that understands it reads those bytes, it goes: OK, this is the byte order, this is the character encoding, and I know how to interpret everything after these marks. Most things on Windows do this, but a lot of old Linux systems don't, and Java didn't understand it either. That cost me about a week of pain, because I didn't realize there was a bug in the Java system. I was building something where I read all the text, and the position of that text was extraordinarily important to me, and I couldn't work out why a lot of my documents were always three characters out, continuously highlighting in the wrong places. It's a bug in the Java text stream reader (or text reader, I'm not sure which class): it was created before the byte order mark existed, and they never fixed it, because everyone started building patches on top of the problem: if there are these three dodgy characters at the beginning, just compensate for them. So the Java team couldn't actually fix it; if they did, they'd break too many systems. Knowing this could save a lot of people a lot of time. That's how I felt, anyway.
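A sketch of handling the BOM yourself when reading bytes of unknown provenance (illustrative Python; Python's "utf-8-sig" codec does this stripping for you):

```python
import codecs

def decode_with_bom(data: bytes) -> str:
    """Detect a BOM, pick the matching encoding, and strip the mark."""
    for bom, encoding in [
        (codecs.BOM_UTF8, "utf-8"),        # b'\xef\xbb\xbf': the "three characters"
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode("utf-8")            # assume UTF-8 when no BOM is present

text = decode_with_bom(b"\xef\xbb\xbfhello")
print(text, len(text))  # 'hello' 5: positions no longer off by three
```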
OK, now we have text and we've dealt with encoding, so we can move on to the words themselves.
When we talk about the words themselves, things that come from the language can cause you a lot of problems. Take the basic approach of stripping special characters and breaking on whitespace: if you have a hyphen, you'll delete it and break the two words apart, so "toll-free" becomes "toll" and "free", two separate words that appear in the index with no relation to each other. That can be a problem. There are basically three hyphen-like characters in English: the hyphen, the en dash, and the em dash. They're slightly different lengths and have slightly different meanings and purposes. The hyphen is the small one, and it means the two words belong together and have a meaning together. So when you're going through your text and extracting it, you have to watch for these nuances in each language. What you want to do is both: break the word apart and keep a version of it together, because both have meaning. If someone types "toll free" as separate words, you want to give them a result back, and if they type "toll-free" with the hyphen, you want to give that back too. It's the same as "pre-war": if you're searching on "world war", you still want to get "world war", but if someone says "pre-world war", you want to give them the more exact result back, and if you simply broke "pre-world war" apart, it wouldn't always work the way you want it to.
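A sketch of that "index both forms" idea for hyphenated words (illustrative Python):

```python
def tokenize_with_hyphens(text: str) -> list[str]:
    """Emit the joined form of a hyphenated word as well as its parts."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?\"'()")
        if "-" in word:
            tokens.append(word)  # keep 'toll-free' as one term
            tokens.extend(p for p in word.split("-") if p)  # plus 'toll', 'free'
        elif word:
            tokens.append(word)
    return tokens

print(tokenize_with_hyphens("Call our toll-free number."))
# ['call', 'our', 'toll-free', 'toll', 'free', 'number']
```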
The next thing is stemming. Stemming is when you take a word and reduce it to its root: in my example here you've got "stemmer", "stemming", "stemmed", and you want them all down to one word, "stem". This handles things like people saying: I searched for "dogs", but I want every article with "dog" to come back, or the other way round; if I search for "cat", I also want articles containing "cats". Sounds simple, doesn't it? A couple of algorithms were created for this, stripping the "ies" off the ends of words, handling pluralization and so on: the Porter stemming algorithm, and a more advanced one called Snowball. There are a lot of commercial ones now, tools that look at the text, know a lot about a language, and reduce words down to their roots. The problem with reducing to root words is that it doesn't always make sense, and most stemming-algorithm research has been done mainly on English.
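For instance, with NLTK installed (one common open-source implementation, not necessarily what the speaker used), the Porter stemmer behaves like this:

```python
from nltk.stem.porter import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["stemming", "stemmed", "dogs", "ponies"]:
    print(word, "->", stemmer.stem(word))
# stemming -> stem, stemmed -> stem, dogs -> dog, ponies -> poni
# Note 'poni': stems are index terms, not dictionary words, which is
# what "doesn't always make sense" looks like in practice.
```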
There are some stemmers for German and other languages, but most of the research you'll find has been on English, though it's getting better and better. Another thing you have to look at is expanding abbreviations in your text back out to their full words. If you index an ampersand and someone types "and", you want the result to come back. People get confused: "but you know it's the same thing". Well, nobody told the engine you wanted that, and by default search engines don't do it for you; you've got to do a lot of the work yourself. So you want to identify "Co." and expand it out to "company", "Ltd" to "limited", ampersands to "and". You're basically trying to reduce the text you're going to put into the search engine to a common denominator, one set of truth. And then, on the other side, when people put a search in, you have to do the same thing again and transform their query: if they put an ampersand into the search, you want to change it to "and" so it matches what's in your index.
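A sketch of one normalization map applied on both sides, index time and query time (illustrative Python; the expansion table is made up):

```python
# Hypothetical expansion table: both documents and queries pass through it,
# so '&' in a query matches 'and' in the index, and vice versa.
EXPANSIONS = {"&": "and", "ltd": "limited", "ltd.": "limited", "co.": "company"}

def normalize(text: str) -> list[str]:
    return [EXPANSIONS.get(tok, tok) for tok in text.lower().split()]

print(normalize("Smith & Sons Ltd"))        # ['smith', 'and', 'sons', 'limited']
print(normalize("smith and sons limited"))  # identical terms, so the query matches
```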
But one of the big problems here is that you can't just blindly expand everything in every piece of text; you have to understand where your text comes from. What becomes very important is understanding the context of the document it belongs to: is it a science document, a geography document, a biology document, a specialized section within one? If you understand what your text is about, or who you got it from, you can do much more intelligent expansion of the words, even things like political parties: for "ANC" you may want "African National Congress", but in a different country that abbreviation may mean something completely different. So you have to understand the context of the text, and if you can expand the words out, it creates a much richer search for people to get results from. If you don't, you'll get a lot of conflicting results, and people will just say: well, I want it to work like Google.
Google doesn't really work that way; Google tells you what the results are, and you believe it. Another thing is the context of the topic. Take "vet": if you've got a piece of text with the word "vet" in it, how do you know it means veterinarian? Or is it a veteran? It's very difficult, and text search engines can't do this for you; you have to build it yourself or pay people a lot of money to do it for you. There is a lot of software nowadays that, prior to indexing your text, will look at the context, read the whole document, and try to work out, say: this is generally about war, so they must be talking about veterans. But if you had a document about veterans who were becoming veterinarians, that's going to make things very difficult for your search.
Doing all of this, people will struggle. And one of the other problems is that you've got to do it in multiple languages. Doing it in English is not comparable to doing it in Swedish or Danish or Japanese or Chinese: text may read left to right, right to left, or top to bottom, and you've got to make sure your text is extracted correctly and in the right order. The rules are different in every language. So again, if you're dealing with English or a single language, creating text search engines is nice and easy; if you're dealing with multiple languages, you have to stop, really think, and plan, and before you index your text you have to try to know what its language is.
OK, the next thing: once you have this text and you've broken it down nice and neatly, remember what was mentioned during the keynote, that metadata is more important than these big bodies of text. If you've got huge documents with lots and lots of text inside, the relevance of the individual words inside them becomes very weak. There are a lot of research projects nowadays trying to work out ways to get meaning from big bodies of text, and one thing you can do to make it a little easier is topic extraction: trying to work out the general topics within a document.
There are a lot of different ways to do it. There are commercial algorithms, and I don't know what they're doing internally, but you give them a body of text and they give you a list of topics. It's not always accurate, but it works pretty well. If you're wondering how that works: they often have taxonomies, or information from different taxonomies, that give them clues about things. One approach we've been looking at is to just take Wikipedia, which is kind of the worldwide knowledge. Given a body of text, a text search engine can create a vector from it: essentially a set of numbers in a multi-dimensional space that represents the entire body of text. You then index the whole of Wikipedia, so every article gets its own vector, and you ask it to find the vectors closest to your document's vector. Those are the similar things, the documents most like your original body of text. Wikipedia has a very light hierarchy, where each article sits under some type of heading, so by comparing your text against the whole of Wikipedia you can get some general, high-level topics.
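A small sketch of the vector idea using TF-IDF and cosine similarity from scikit-learn (illustrative; a real system would vectorize the full Wikipedia dump, and might use proper embeddings instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for Wikipedia articles, each labelled with a topic heading.
articles = {
    "World War II": "war troops battle veterans army invasion",
    "Veterinary medicine": "animals veterinarian surgery pets clinic",
    "Colour theory": "palette brightness hue pigment colour",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(articles.values())  # one vector per article

doc = "the clinic treated injured pets and other animals"
doc_vec = vectorizer.transform([doc])

scores = cosine_similarity(doc_vec, matrix)[0]
best = max(zip(articles, scores), key=lambda pair: pair[1])
print(best)  # ('Veterinary medicine', ...): the nearest vector as a crude topic
```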
It's quite a nice, easy way to get topic extraction without paying a lot of money for proprietary systems which, I would say, don't always work exactly the way you want. Another approach is to start breaking your documents down into sections. Something like a company annual report has a chairman's statement, a letter from the CEO, profit and loss accounts, a balance sheet, cash flow statements: multiple parts. Parsing raw text to work all of that out is very difficult; you're basically hunting for patterns in headings. But if you go back to the extraction step, and you have a Word version of the document, working out what is a heading is much easier. With something like Aspose, a commercial library, you can load the entire Word document model into memory, walk through it, and say: give me all the headings in this document, then break out the text between this heading and that heading. Instead of one big body of text, you can break each part into its own index, and that becomes more interesting, because now you can compare genuinely similar items with each other and get much richer search and much richer feedback. So don't throw away the extraction part of text searching, because at that point you can get a lot of metadata and a lot of information about your text,
which can then influence how you build your indexes in the future, and what types of index. Reducing things down and creating more metadata at the beginning helps you build a much richer system. That includes things like going through your text, finding all the email addresses, and indexing them separately against the same document ID; or pulling out lists of company names and any URIs within the document and putting them in a different index, rather than just leaving them buried in the body of text.
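A sketch of pulling such entities into their own fields at extraction time (illustrative Python; the regexes are deliberately simplistic):

```python
import re

def extract_fields(doc_id: int, text: str) -> dict:
    """Pull emails and URIs out of the body into separate index fields."""
    return {
        "id": doc_id,
        "emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text),
        "uris": re.findall(r"https?://\S+", text),
        "body": text,
    }

doc = "Contact sales@example.com or see https://example.com/pricing for details."
print(extract_fields(42, doc))
# {'id': 42, 'emails': ['sales@example.com'],
#  'uris': ['https://example.com/pricing'], 'body': ...}
```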
This lets you build much better search engines than just having one big blob of text and getting a bunch of results out, which is useful, but not that useful. So I really encourage you to go and look at the different extraction tools you can get, because you don't want just a big blob of text: you want to look at that blob, break it down into meaningful information that's easier to work with, and create more metadata from it, topic extraction being about the easiest route. When you've extracted your topics, you put them into the index alongside the original body of text, and then you can do similarity on them. You don't do similarity against the big blob of text; you do it against the topics you've extracted, and now similarity is a lot quicker as well.
Now, we've talked about extracting the text; the other part is that you have to do the same with the query. When someone puts a query in, a single word is fine, but people aren't going to get the results they want from just one word: I have no idea what context they got this word from, or what they're searching within. With the big anything-goes search box, you can't get any context. If you have a product, and the user is in a certain area of it when they type the text in, you should probably keep the context of where they are, because they almost certainly thought of the search while in that area, and you can then weight the results towards it. Say they were in the science journals within Wikipedia and searched for a single word: you probably want to give more weight to science-journal material than to, say, Disney.
Also, for any query you get, including anything within quotes, you want to break that text down exactly the same way as when you put things into your index: break on the whitespace, get rid of the special characters, expand all the keywords. This, again, makes your search more accurate and brings back more of the results people expect. One item I forgot earlier
about words: sometimes you want to break words up on their capitals, like Pascal-cased type names in programming. Break those on the capital letters, and also keep the whole thing, so people can search for the complete name or for the individual parts of the word.
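A sketch of that Pascal-case splitting (illustrative Python):

```python
import re

def split_pascal(identifier: str) -> list[str]:
    """Index a Pascal-cased name both whole and as its parts."""
    parts = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", identifier)
    return [identifier.lower()] + [p.lower() for p in parts]

print(split_pascal("InvertedIndexBuilder"))
# ['invertedindexbuilder', 'inverted', 'index', 'builder']
```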
Splitting like that gets people a lot more results, probably including exactly what they were after, and it saves you from having to implement things like wildcards. And that is about it. As I say, strings aren't always the easiest things to work with, and they will frustrate you; it's remarkable that a string is never just a string. It's also remarkable how many people in business say: just index everything, it'll be fine, we'll find what we want. We're getting away from that.
In the early 90s we had a world that was very much about classification and hierarchy: when people searched, they'd say, I want this company, this sector, and within that I want to find this piece of text. Then the noughties came along, and that generation was more like: I'll just put something into Google, give me 200 pages of results, I'll search through them myself and decide which ones are relevant. Now we're moving out of that phase into a mixture of the two: work out where I am and make what I'm searching for more relevant, and give me back extra information so I can make more decisions about my results. Most online shopping sites work this way now: you put a piece of text in, and if you're searching for laptops, say, you get a list down the side with 12-inch, 13-inch, 17-inch screens, memory sizes, price ranges. All of these are extra metadata, and you can click on them to narrow your result set much more quickly. To get all that extra metadata when all you have is a piece of text, you have to parse that information out first, and you have to be able to give it back to the user, so they can decide whether to narrow things down themselves or page through pages and pages of results. But parsing, again, is not as easy as everyone would like; the worst part of text searching is probably producing the text in the first place, just to get it into your text search engine. OK, that was me, and I hope you enjoyed it. Are there any questions?
OK, so in text search engines the indexes are basically made of fields. Once you've extracted your metadata, you add it in as extra fields, and when an engine returns these extra pieces of metadata alongside results, that's what's called a facet, a faceted search; some search engines have renamed it nowadays and call it an aggregation. What it's doing is some processing after the search itself: it finds your results, pulls them out along with the metadata on those documents, then looks at that extra metadata and works out the counts. It's pivoting the data and saying: OK, this metadata value appears in 20 of the documents, that one in five. That tells you this word appeared in these documents under this topic, which lets you drill down a lot faster into what you want.
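A sketch of that post-search pivot (illustrative Python):

```python
from collections import Counter

# Hypothetical search results, each carrying a metadata field.
results = [
    {"id": 1, "folder": "/reports"},
    {"id": 2, "folder": "/reports"},
    {"id": 3, "folder": "/contracts"},
    {"id": 4, "folder": "/reports"},
]

# Facet/aggregation step: count results per metadata value.
facets = Counter(doc["folder"] for doc in results)
print(facets)  # Counter({'/reports': 3, '/contracts': 1})
```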
Sometimes people don't know what they're searching for, and what you have to do is hint the topics back at them, so they can go: oh yes, that's the thing I'm looking for. Say you've indexed a file system and you give back a list of documents with, on the side: 20 of these documents are in this folder, five in that folder, two in that one. You were searching documents, not folders, but by showing which folders held the documents, someone can go: oh, that's the folder. People will actually search for a piece of text from a document they remember just to find the folder it lived in, and then navigate from there to the document they were really looking for. People use search for some really weird navigation. But it helps to give this context and these topics back, and to be able to do that, you have to parse the text correctly in the first place.
Any other questions? I was aiming for 50 minutes, so I left about 10 minutes for questions. Nothing? OK, great. Remember the new traffic light system on the way out: if you enjoyed this, just tap the green one; if not, I don't mind. I hope you enjoyed it. Thanks very much.