
The Search Technology Behind WorldWideScience


Formal Metadata

Title
The Search Technology Behind WorldWideScience
Title of Series
Number of Parts
19
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Lecture/Conference
Meeting/Interview
Transcript: English (auto-generated)
So my personal background: I started working with OSTI back around 1997 or so, when Dave Henderson asked me if I could build a federated search application to search environmental research sources. I said sure, and at the time I was working out of my house; I built that first federated search Explorit application in about six weeks. You may be happy to know that since about 2003 or 2004 I haven't written a single line of code, and probably not a single line of code that I have written is used in our current products. I leave that to engineers more state-of-the-art than me. So I've been involved with Science.gov since the beginning, and with WorldWideScience since the beginning as well.
I attended that first meeting at the British Library. I attended, I think the following year, the meeting in Seoul where WorldWideScience was put under the auspices of the Alliance, and I attended this last meeting in September at the British Library. All right, I will try not to duplicate too much of what Dobie and Lori have said so far.

So what is federated search? Federated search is the ability, with one search, to search many sources. To elaborate: federated search is an application or service that lets users submit a query, in real time and in parallel, to multiple distributed information sources, and then retrieve those results, aggregate them, rank them, and de-duplicate them. In WorldWideScience, we're searching over 100 sources in real time. We have other applications, like at Stanford University, where we search more than 300 sources at the same time.

So what does a search look like? Most users start out by filling in a query in the simple search form; in some cases they'll go to an advanced search. WorldWideScience then submits searches to all of these sources in parallel, and the results are aggregated and ranked. One of the novel capabilities we've had for a very long time is that we display results incrementally as they are received, so a user doesn't have to wait for the slowest source to return its results. Typically, within two or three seconds we'll have results from maybe half of the sources. What amazes me, having been doing this for such a long time, is how much faster sources have gotten over the last decade or more.

One thing I want to point out, on the ranking step: we apply our own ranking to results. For Science.gov we developed a capability we call QuickRank, which I'll talk more about later, that looks at the titles and snippets of results and assigns its own rank to them. When I first came up with that idea there was a lot of skepticism, but it's proven to work very well. As Lori mentioned, we have the ability to narrow and filter results, both when we submit a search and after the results are brought up, and we can cluster results and look at subsets of them. We also have the ability, through alerts, to monitor new results that come in on a topic of interest. And as Dobie and Lori mentioned, one of the very novel features we've developed is a multilingual search capability that I will talk a little more about in a few minutes.

So one of the, I think, novel capabilities we have is that we can search lots of sources in parallel and display results incrementally as they become available from each source, and at all points we display results, typically in ranked order. As I said, we typically wait two or three seconds, though that's all configurable, so that we get an initial set of results, and we display those. Then, as new results come in, we intentionally don't merge them until the user asks to merge the new results, or until the search finishes, because otherwise your results could be changing every few seconds while the search completes.

All of the core Explorit capabilities behind Science.gov are also available via an asynchronous JSON-based API, and a number of other customers have used our API to access some of our applications. One of the things I've mentioned periodically to the folks at OSTI is perhaps making our API available for WorldWideScience, as another way the technology could get integrated: maybe there is a government agency, or some agency focused on a particular subject, that wants to automatically bring results on robotics or some other topic into their site. I think it's a way to expand usage, and there are sites like ProgrammableWeb that encourage that.

Okay, so a key, very key part of federated search, and of what we do, are the connectors.
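That parallel search with held-back merging might be sketched, very roughly, like this. Everything here is a toy: the source functions, names, and timings are made up, and the real system is not written in Python. It only illustrates the idea of showing an early batch and holding later arrivals until the user asks to merge.

```python
import concurrent.futures
import time

def make_source(name, delay, hits):
    """Stand-in for a real connector: sleep to simulate latency, then
    return that source's result strings for the query."""
    def search(query):
        time.sleep(delay)
        return [f"{name}: {h} [{query}]" for h in hits]
    return search

# Three pretend sources with different response times.
sources = [
    make_source("fast-db", 0.02, ["hit1", "hit2"]),
    make_source("medium-db", 0.1, ["hitA"]),
    make_source("slow-db", 0.6, ["hitX"]),
]

def federated_search(query, first_batch_window=0.25):
    """Search all sources in parallel. Results arriving inside the first
    window are shown immediately; later arrivals are held in 'pending'
    until the user chooses to merge them."""
    shown, pending = [], []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(s, query) for s in sources]
        deadline = time.monotonic() + first_batch_window
        for f in concurrent.futures.as_completed(futures):
            bucket = shown if time.monotonic() < deadline else pending
            bucket.extend(f.result())
    return shown, pending

shown, pending = federated_search("diabetes")
# fast-db and medium-db land in the first batch; slow-db is held back
```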
We have a group of people at Deep Web Technologies whose job it is to create and maintain connectors. Connectors are what I call custom, small pieces of software; for those that care, they're written in JRuby. A connector takes the query syntax that Explorit supports, including Boolean operators, wildcards, and so on, combined, if you've done an advanced search, with fields like authors, titles, or specific date ranges. It takes that query, knows how to translate it into the syntax supported by each of the resources that we federate, and parses the results that come back. And we can search information sources in a variety of ways.
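The two halves of a connector's job, translating the query outward and normalizing the results inward, can be sketched like this. The field prefixes (`au:`, `ti:`) and the raw field names are invented for an imaginary source; the real connectors are JRuby, not Python.

```python
def to_source_syntax(query):
    """Translate an internal fielded query (a dict) into the syntax of an
    imagined source that uses au:/ti: prefixes and uppercase AND."""
    parts = []
    if "author" in query:
        parts.append(f'au:"{query["author"]}"')
    if "title" in query:
        parts.append(f'ti:"{query["title"]}"')
    if "terms" in query:
        parts.append(query["terms"])
    return " AND ".join(parts)

def normalize(raw_record):
    """Map one source's (invented) field names onto a common result
    schema, so results from many sources can share one result list."""
    return {
        "title": raw_record.get("dc_title", ""),
        "authors": raw_record.get("creator_list", []),
        "url": raw_record.get("link", ""),
        "date": raw_record.get("pub_year", ""),
    }

q = to_source_syntax({"author": "Smith", "terms": "solar wind"})
# q == 'au:"Smith" AND solar wind'
```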
Obviously, for sources that have an API we prefer to use it, whether a web services API or an XML gateway. But as I believe Lori mentioned, we do not require content owners to do anything in particular to support us searching them. When we got started 20 years ago, nobody had APIs, so the only way to search a source was to do what is called screen scraping: we basically mimic a user at a browser submitting a search to a source, and when the results come back we use sophisticated extraction and parsing techniques to pull out the metadata returned in the result list. Those fields, things like authors, titles, URLs, dates, publishers, and so on, are then used to build our result list; in effect we are normalizing the result lists. The same fields also get displayed as the clusters that you'll see in a minute.

A lot of our effort goes into not catering to a lowest common denominator. We don't want to support only databases that support author searching; that would probably cut things down a lot. So if a user does an author search, we have to decide, and we work with our customers to decide, how they want to handle it: in the case of an author search, we can either skip a source that doesn't support author searching, or put the author name into the full-text field, which isn't ideal but may be the best you can do for such a source.

The other thing we have to do, and we're very good at it, is that we typically like to retrieve about a hundred results from a source. That's all configurable.
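Retrieving those hundred results usually means walking a source's result pages behind the scenes, the way a user would press "next page". A rough sketch, where `fetch_page` is a made-up stand-in for a real source serving 47 matches at 20 per page:

```python
PAGE_SIZE = 20

def fetch_page(query, page):
    """Pretend source: 47 matching records, served 20 per page."""
    total = 47
    start = page * PAGE_SIZE
    return [f"{query}-{i}" for i in range(start, min(start + PAGE_SIZE, total))]

def retrieve(query, want=100):
    """Page through the source until we have 'want' results or the
    source runs out, whichever comes first."""
    results, page = [], 0
    while len(results) < want:
        batch = fetch_page(query, page)
        if not batch:   # source exhausted before reaching the target
            break
        results.extend(batch)
        page += 1
    return results[:want]

hits = retrieve("diabetes")
# the pretend source only has 47 results, so we stop there
```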
And so what we need to do in many cases, in the same way that you as a user would go to a particular website, do a search, and then go look at the next page of results, is page through results behind the scenes. One of the questions I get asked a lot, and I think we've written some blog articles about it, is: can you bring back all the results from a source? For one thing, it's not feasible when you're searching a large database like the USPTO or PubMed; if you search for diabetes, you don't want to bring back 200,000 results. The way I explain it is this: we're bringing back, typically, the hundred best results from each source, so each source is telling us what its best results are. And the reality is that if you went to a source directly, you wouldn't be looking at 10,000 results either; if you got past the first 10 or 20, that would probably be plenty. So what we're doing, of course, is combining the best results from each of the sources we're searching into the best results across all of those sources.

Clustering has already been talked about. We use a technology from a company called Carrot2. Clusters are generated dynamically, and they're hierarchical, so the technology organizes typically 1,000 or 1,200 or however many results come back into meaningful clusters that share common terms.
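A toy version of that share-common-terms idea, flat rather than hierarchical, and nothing like the real Carrot2 algorithms, just the intuition: group result titles under the non-stopword terms they have in common.

```python
from collections import defaultdict

STOP = {"the", "of", "and", "in", "a", "for"}

def cluster_by_term(titles):
    """Group titles under each non-stopword term they contain, keeping
    only terms shared by at least two titles. Carrot2 builds labeled,
    hierarchical clusters; this only shows the shared-term intuition."""
    clusters = defaultdict(list)
    for t in titles:
        for word in set(t.lower().split()):
            if word not in STOP:
                clusters[word].append(t)
    return {w: ts for w, ts in clusters.items() if len(ts) >= 2}

titles = [
    "Diabetes treatment in adults",
    "Insulin therapy for diabetes",
    "Adults and insulin resistance",
]
c = cluster_by_term(titles)
# clusters form around "diabetes", "insulin", and "adults"
```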
Clustering typically works well when you've done a really broad search, say diabetes; it then gives you suggestions about how to narrow down what you're interested in. One of these days, hopefully soon, I want to be able to say: okay, I found this subcluster interesting, let me go do a new search on the terms that generated it. Both Dobie and Lori have talked quite a bit about translation.
And so here is a diagram of the workflow. A user starts by entering a query, and again, it's in their language; it doesn't need to be in English. Somebody in China can enter a query in Chinese. We take that query and translate it into the languages of the databases we're going to search, which might be as many as 10 different languages; we're using Microsoft's translation services in real time to do this. The results come back to the server at OSTI, where WorldWideScience lives, and then we submit those translated queries. So if a user entered a query in Chinese and we're searching the TIB database in Germany, we will translate that query into German. The results come back, and we rank each set of results in its native language so that we can display an aggregated list of results from different languages; the top German result may rank better than the top Japanese result, et cetera.
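The query fan-out step might look like the sketch below. The dictionary lookup stands in for a real translation service (WorldWideScience uses Microsoft's), and the source-to-language pairing is invented for the example.

```python
# Toy translation table standing in for a real translation service.
TOY_DICT = {
    ("zh", "de"): {"糖尿病": "Diabetes"},
    ("zh", "ja"): {"糖尿病": "糖尿病"},
}

# Hypothetical pairing of sources with their native languages.
SOURCES = [("TIB", "de"), ("J-STAGE", "ja")]

def translate(text, src, dst):
    return TOY_DICT.get((src, dst), {}).get(text, text)

def fan_out(query, query_lang):
    """Translate the user's query once per target language, then pair
    each source with the query in that source's own language."""
    langs = {lang for _, lang in SOURCES}
    translated = {lang: translate(query, query_lang, lang) for lang in langs}
    return [(name, translated[lang]) for name, lang in SOURCES]

plan = fan_out("糖尿病", "zh")
# TIB gets the German query, J-STAGE the Japanese one
```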
So we're doing all that, and for performance and cost reasons we're doing what I call translation on demand. You get the first page of results, which may contain some results that are not in your language; then, when the user presses a translate button, we translate the results that need to be translated, and when you go to the next page of results, those get translated too. With the new Microsoft translation services, all that translation work has moved to the server level; it used to be done at the browser level. One of the benefits we're also getting now is that we cache the translated results for some small period of time.
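Translation on demand plus a short-lived cache can be sketched as below. The class name, TTL value, and the lambda standing in for the translation service are all invented for the example.

```python
import time

class TranslationCache:
    """Translate result text only when it is displayed, and cache each
    translation for a short TTL so paging back doesn't re-translate
    (and re-bill) the same snippets."""

    def __init__(self, translate_fn, ttl=300.0):
        self.translate_fn = translate_fn
        self.ttl = ttl
        self._cache = {}   # (text, lang) -> (translation, timestamp)
        self.calls = 0     # how many real service calls we made

    def get(self, text, target_lang):
        key = (text, target_lang)
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]                      # cache hit: no service call
        self.calls += 1
        out = self.translate_fn(text, target_lang)
        self._cache[key] = (out, time.monotonic())
        return out

cache = TranslationCache(lambda text, lang: f"[{lang}] {text}")
a = cache.get("Titel des Artikels", "en")
b = cache.get("Titel des Artikels", "en")   # second request hits the cache
```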
So if a user goes to the next page and then comes back to the first page, we're not translating things again. You've seen some examples already. Here is one; I don't recall exactly what the query was, but we have a search with results in, I think, German, Russian, and Japanese, and you can see both the original titles and abstracts and the translated titles and abstracts.
And if you click on a translated title, we will again use Microsoft's translation services to translate that particular page, typically a web page, from, say, Japanese into English.
Microsoft used to have a service where you could see the original page and the translated page side by side, but it seems that has gone away. That was a very cool capability.
So again, part of what my connector folks do, with tools we've developed for the purpose, is actively monitor each of the sources we federate to make sure they're all still working properly. Sometimes a content owner will change what their result list looks like; we are not necessarily notified in advance, but our tools will catch it, and very quickly somebody in the connector group will fix that connector.

The final thing I want to talk about, very briefly, is our concept of hierarchical federated search, or federation. The idea is that from Explorit we can federate other applications that are themselves federating sources. Obviously we like federating our own stuff best, but we could federate other things. So, Lori, when you talk about how WorldWideScience searches, whatever, 107 sources now, one of those sources is Science.gov, which itself searches, I think, 60 or 70 sources now. So you could start saying that WorldWideScience actually searches about 170 sources.
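Hierarchical federation in miniature: because a federator exposes the same search interface as a plain source, one federator can sit inside another as just another "source". The class names and counts below are illustrative, not the actual architecture.

```python
class Source:
    """A single searchable database."""
    def __init__(self, name, hits):
        self.name, self.hits = name, hits
    def search(self, query):
        return [f"{self.name}: {h}" for h in self.hits]
    def source_count(self):
        return 1

class Federator:
    """Searches a list of children, each a Source or another Federator,
    so federations nest naturally."""
    def __init__(self, name, children):
        self.name, self.children = name, children
    def search(self, query):
        out = []
        for c in self.children:
            out.extend(c.search(query))
        return out
    def source_count(self):
        return sum(c.source_count() for c in self.children)

# Toy numbers: a 60-source inner federation inside a larger one.
science_gov = Federator("Science.gov",
                        [Source(f"agency-{i}", ["hit"]) for i in range(60)])
wws = Federator("WorldWideScience",
                [science_gov] + [Source(f"intl-{i}", ["hit"]) for i in range(106)])
# wws has 107 direct children but searches 166 underlying sources
```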
And so back about 10 years ago, I did a presentation at an SLA 100th-anniversary conference titled Journey to 10,000 Sources. If we were interested in searching thousands of sources at the same time, this is what I was talking about with some of you at my poster on greyhub: we would put together multiple federated search applications, each searching perhaps a few hundred to 500 sources, and combine 10, 20 or more of these into one massive search of 10,000 sources. And that is it. Thank you.