HBS Knowledge: A Knowledge Graph and Semantic Search for HBS
Formal Metadata
Title: HBS Knowledge: A Knowledge Graph and Semantic Search for HBS
Number of Parts: 20
Author: Erin Wise
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/58073 (DOI)
Language: English
Transcript: English (auto-generated)
00:00
She is the manager of the information management team in Baker Library at Harvard Business School (HBS). And her topic is HBS Knowledge: a knowledge graph and semantic search for HBS. So we are now waiting for Erin Wise to join us. She's here now.
00:22
I've just heard from our director here that she's coming. As soon as you're in, Erin, go ahead and just give us a quick thumbs up and a hello so that we know you're there. Hello and thumbs up, I'm here. Fantastic, wonderful. And I see and hear you excellently. So Erin, real quick, before we begin, could you just tell us where you're physically at right now
00:42
and something that's a tourist attraction or something that should be seen when we visit that area? I am in Boston, Massachusetts, and I recommend visiting the Seaport District when you're visiting Boston. And I guess having some clam chowder soup as well.
01:01
Yes, of course, always. Excellent, good. So glad you could join us. As I mentioned before, Erin is the manager of the information management team in Baker Library at Harvard Business School, known as HBS. Her topic today is HBS Knowledge: a knowledge graph and semantic search for HBS. Again, to our viewers, there are two ways to ask questions.
01:21
We're doing great so far. You can either scroll down to your interactive tool, or you'll see the QR code coming up; you can put your phone up and go ahead and use that to put in your question as well. And Erin, we have about 15 minutes for the presentation and then about five minutes for the Q&A. So if you've shared your screen and you're good to go,
01:40
then I would say the digital stage is yours. Okay, thank you. So I am sharing my screen. Hopefully you all can see it. Yep. Great. Looks good here, yeah. Okay, excellent. Thank you. So I am pleased to present about a website
02:00
that we have been developing at Harvard Business School's Baker Library. We launched it as a proof of concept last June. And the concept that we are proving is that we can use a knowledge graph plus semantic search as a way for HBS users to uncover connections across our data silos.
02:21
Our experience shows that we can do a lot with a small handful of librarians and developers, and that librarians have a critical role to play in any semantic data project. So first, what is our product, the HBS Knowledge website?
02:41
So the HBS Knowledge website is our solution to a problem that many of us are struggling with: bringing together information about a resource using data across repositories. With our website, we are proposing a knowledge graph as a way to integrate data across repositories, a semantic search application
03:02
for effective entity identification and findability of resources, and vocabularies for consistent and unambiguous language. And when I say vocabularies, I'm defining it very loosely as any controlled list of terms or entities that aids consistent indexing
03:21
and accurate retrieval of resources. So who we are, we are a small team, just a few librarians and developers, and we were fortunate to be given time and autonomy to act on our ideas to develop a site that we think solves significant problems for the school.
03:42
So we are a product owner, a taxonomy and ontology specialist, a semantic technology lead, a semantic search specialist, and a UI designer. So some definitions, what is a knowledge graph? We are defining it simply as a model of a specific domain
04:03
or sphere of activity. In our case, we're modeling the domain of business as practiced at HBS. The knowledge graph gives us a common structure for data and allows us to create relationships across repositories to provide a holistic view of entities.
04:21
So in this example, and I'm afraid it probably looks quite small too, so I apologize for that, but hopefully you can at least see that we have Amy Edmondson in the middle here in green, and she is a faculty member. And by combining data from various sources, we can see that she is an alum of HBS,
04:43
and we can link to her alumni profile. We see that she's a faculty member, and we can link to her faculty profile. We know that she's the author of multiple cases and publications. For example, she wrote Teaming Up to Win the Rail Deal at GE.
05:00
We know that that case is about General Electric Company, and we have alums who are currently employed by General Electric Company, and we have alumni stories that are about those alums.
05:21
Those stories have topics of leadership, et cetera, et cetera. So the idea is that we just continue to make these connections, and one can start at any given point in this graph and expand outwards and explore.
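[Editor's note: to make the graph structure concrete, here is a minimal Python sketch of the kind of entity-and-relationship data described in this example, using plain tuples rather than a real triple store. Only the Amy Edmondson facts come from the talk; the "Jane Doe" alum and the alumni story are invented placeholders.]

```python
# Plain-Python sketch of the graph neighborhood described above.
EDGES = [
    ("Amy Edmondson", "is alum of", "HBS"),
    ("Amy Edmondson", "is faculty member of", "HBS"),
    ("Amy Edmondson", "is author of", "Teaming Up to Win the Rail Deal at GE"),
    ("Teaming Up to Win the Rail Deal at GE", "is about", "General Electric Company"),
    ("Jane Doe", "is employed by", "General Electric Company"),  # hypothetical alum
    ("Alumni story 123", "is about", "Jane Doe"),                # hypothetical story
    ("Alumni story 123", "has topic", "Leadership"),
]

def neighbors(node):
    """Yield (relation, other_node) pairs touching `node`, in either direction."""
    for subject, relation, obj in EDGES:
        if subject == node:
            yield relation, obj
        elif obj == node:
            yield f"(inverse of) {relation}", subject

# Start at any entity and expand outward one hop:
for relation, other in neighbors("General Electric Company"):
    print(f"General Electric Company -- {relation} --> {other}")
```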
05:42
Next, a definition for semantic search: we define it simply as search with meaning. Semantic search uses our vocabularies and our ontology to understand the intent of queries. So in this example, in our ontology, Michael E. Porter is identified as a person and as a faculty member.
06:01
Our faculty vocabulary lets us know that Michael Porter is an alternate name for Michael E. Porter. So when a user enters the combined search term Michael Porter, it is recognized as a named entity as opposed to two separate keywords, Michael plus Porter.
06:27
So search with meaning gives us Michael E. Porter as opposed to literal matches on query strings.
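[Editor's note: as an illustration of the alias lookup just described, the following Python sketch maps alternate name forms to a canonical entity before falling back to keyword search. The vocabulary entry and field names are assumptions, not HBS's actual data model.]

```python
# Sketch of alias-aware query interpretation using an invented vocabulary entry.
FACULTY_VOCAB = {
    "Michael E. Porter": {
        "types": ["Person", "FacultyMember"],
        "alt_names": ["Michael Porter", "M. E. Porter"],
    },
}

# Index every known form of each name to its canonical entity.
ALIAS_INDEX = {}
for canonical, entry in FACULTY_VOCAB.items():
    ALIAS_INDEX[canonical.lower()] = canonical
    for alt in entry["alt_names"]:
        ALIAS_INDEX[alt.lower()] = canonical

def interpret_query(query: str):
    """Return a recognized entity if the whole query is a known name;
    otherwise fall back to plain keywords."""
    canonical = ALIAS_INDEX.get(query.strip().lower())
    if canonical is not None:
        return ("entity", canonical)
    return ("keywords", query.split())

print(interpret_query("Michael Porter"))  # ('entity', 'Michael E. Porter')
print(interpret_query("supply chains"))   # ('keywords', ['supply', 'chains'])
```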
06:40
We are using our library-managed vocabularies, such as company names and topics, along with entity vocabularies managed at the school and university levels, such as faculty and alumni names. In the case of alumni names, we took the extra step to disambiguate them
07:00
by appending additional metadata for degrees and class years. Otherwise we would end up with cases where we have about 10 different people with the exact same first-name-plus-last-name combination and no way to distinguish between them when selecting from a list of terms.
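[Editor's note: a minimal sketch of this disambiguation step, assuming invented alumni records: degree and class year are appended to otherwise identical names to build unambiguous labels for pick lists.]

```python
# Invented records illustrating duplicate names resolved by degree + class year.
alumni = [
    {"name": "John Smith", "degree": "MBA", "class_year": 1995},
    {"name": "John Smith", "degree": "MBA", "class_year": 2000},
    {"name": "John Smith", "degree": "Exec Ed", "class_year": 2000},
]

def display_label(record):
    """Build an unambiguous label for pick lists, e.g. 'John Smith (MBA 1995)'."""
    return f"{record['name']} ({record['degree']} {record['class_year']})"

for r in alumni:
    print(display_label(r))
# John Smith (MBA 1995)
# John Smith (MBA 2000)
# John Smith (Exec Ed 2000)
```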
07:23
So again, I apologize for how small this slide is, but hopefully you can see the main points in blue. So we're providing an experience that probably looks familiar to you from Google. We're returning info boxes or knowledge panels on the right hand side
07:41
that focus on information about a particular entity and show that entity's relationships to other resources that are important to HBS. On the left side in this view, we are showing keyword search results from our selected sources. So on the right, we have explicitly related data objects from the graph.
08:03
And on the left, we have keyword search results, and there's likely some overlap between the two, but the right-hand side is showing explicitly related things that we know to be true, while the left is relying on a keyword search to show you results.
08:26
So why did we do this? We have multiple data sources, schemes, and vocabularies at HBS. So the situation is that we have metadata that's inconsistent across sources. We have metadata fields that, even when
08:43
the names of the fields are consistent across sources, have been interpreted and applied in different ways from source to source. We have entities that are not uniquely identified across sources. So companies in some data sources are text strings, for example, and sources may be using
09:03
different vocabularies. So local topic vocabularies, for example, would be used in different sources: one source would use one local vocabulary and another would use a different one. So we're essentially cataloging HBS assets in a consistent, systematic, and controlled way.
09:23
And our goal is to be able to take any data source, normalize it, scrub it, process it, and add it to our graph. So basically we're taking all the disparate sources and aggregating them into a single intelligible and searchable source.
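[Editor's note: a rough Python sketch of that "any source in, graph out" idea, under the assumption that each source gets a small field map onto a common schema. The function and field names are illustrative, not the team's actual pipeline.]

```python
def normalize(raw_record, field_map):
    """Rename source-specific fields to the common schema."""
    return {common: raw_record.get(source_field)
            for source_field, common in field_map.items()}

def scrub(record):
    """Basic cleanup: trim whitespace and drop empty values."""
    return {k: v.strip() for k, v in record.items()
            if isinstance(v, str) and v.strip()}

def to_statements(record):
    """Emit (subject, predicate, object) statements for the graph."""
    if record.get("author") and record.get("title"):
        yield (record["author"], "is author of", record["title"])

# Hypothetical Working Knowledge record and its field map:
wk_field_map = {"wk_faculty_name": "author", "wk_title": "title"}
raw = {"wk_faculty_name": " Amy C. Edmondson ",
       "wk_title": "Teaming Up to Win the Rail Deal at GE"}
for statement in to_statements(scrub(normalize(raw, wk_field_map))):
    print(statement)
```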
09:42
So to go a little deeper, here's an example of different metadata for the same entity. Faculty names are represented in multiple ways in the source data. So using publications in Working Knowledge,
10:00
which is the library's online publication that features faculty work, we have a field or property called WK faculty name. And the values in that field use a byline format: first name, middle initial, last name.
10:20
And the relationship is not clear. There's no stated relationship; it just says faculty name. Does that mean that this article is about the faculty member? Did the faculty member write it? It's not clear from the data. The Faculty and Research site, on the other hand, uses a last-name, first-name format
10:41
and identifies that the faculty member contributed to the work. So they have a contributor name field, which shows us that the names are consistently identified and consistently formatted, and the relationship to the work is identified. They also have a field called HBS suggestion,
11:00
which we think is related to the implementation of enterprise search at HBS, but it includes faculty names along with other data. Then again, we have Alumni Stories, which uses a JSON string that concatenates multiple facts about the faculty member. So we have ID, name, and title.
11:23
The relationship is identified as featured. So HBS story featured faculty. Is that the same as about? Possibly, maybe even probably, but we're not sure. And finally, in all three sources, there is a field called HBS faculty, which uses a username format
11:42
and doesn't identify a specific relationship to the publication at all. So A. Brooks, A. Moreno, J. Macomber, no relationship identified. It's just picking out that a faculty member is somehow related to this publication.
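[Editor's note: to illustrate how such divergent name fields might be reconciled, here is a hedged Python sketch that converts each source's format into one (person, relationship, work) statement. The field names approximate those mentioned in the talk; the sample records and helper names are invented.]

```python
import json

def from_working_knowledge(record):
    # Working Knowledge: "First M. Last" byline; the relationship is unstated,
    # so "contributor" here is an assumed placeholder pending a decision
    # with the data owners.
    return (record["wk_faculty_name"], "contributor (assumed)", record["title"])

def from_faculty_research(record):
    # Faculty and Research: "Last, First" with an explicit contributor field.
    last, first = [part.strip() for part in
                   record["contributor_name"].split(",", 1)]
    return (f"{first} {last}", "contributor", record["title"])

def from_alumni_stories(record):
    # Alumni Stories: a JSON string concatenating id, name, and title.
    facts = json.loads(record["featured_faculty"])
    return (facts["name"], "featured", record["title"])

print(from_working_knowledge({"wk_faculty_name": "Amy C. Edmondson", "title": "A story"}))
print(from_faculty_research({"contributor_name": "Edmondson, Amy C.", "title": "A paper"}))
print(from_alumni_stories({"featured_faculty": '{"id": 1, "name": "Amy C. Edmondson"}',
                           "title": "A profile"}))
```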
12:03
And then another example of the need to resolve meaning in the source data: we have different interpretations of one attribute. So HBS content type is an attribute that's common across all sources. The Faculty and Research data uses a vocabulary
12:21
of publication types that we recognize: books, book chapters, articles. External Relations, who owns the Alumni Stories content, calls all of their stories news. So there's only one value, and it's all called news. Maybe that's an article, maybe not.
12:41
Working Knowledge uses the content type field to describe categories of articles. So everything that they produce we would consider an article, but they have categories that indicate the specific focus of an article. So it might be about a podcast.
13:00
It might be about a working paper that a faculty member wrote. It might be about a book, et cetera, et cetera.
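[Editor's note: a small sketch of how these differing interpretations of HBS content type could be resolved against a common vocabulary. The mapping values are illustrative guesses at the kind of decisions the team describes, not their actual rules.]

```python
CONTENT_TYPE_MAP = {
    # Faculty and Research already uses recognizable publication types.
    "faculty_research": {"Book": "Book", "Book Chapter": "BookChapter",
                         "Article": "Article"},
    # External Relations calls every alumni story "news".
    "alumni_stories": {"news": "Article"},
    # Working Knowledge items are all articles; the field holds the focus.
    "working_knowledge": {"Podcast": "Article", "Working Paper": "Article",
                          "Book": "Article"},
}

def resolve_content_type(source, raw_value):
    """Map a source-specific content-type value to the common vocabulary."""
    return CONTENT_TYPE_MAP.get(source, {}).get(raw_value, "Unknown")

print(resolve_content_type("alumni_stories", "news"))        # Article
print(resolve_content_type("working_knowledge", "Podcast"))  # Article
```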
13:21
So how did we do it? It's a very iterative process, but essentially the way this works is that the information management team of librarians analyzes source data. We specify how to translate the source data to our ontology, and we were simultaneously developing the ontology as we went. We translate source data
13:40
and give those specifications to the technical team, essentially just our semantic technology lead and developer. He converts the source data into a triple structure with URIs for the knowledge graph. So the triple structure is subject, predicate, object, and the knowledge graph is essentially a series
14:04
of statements that look like this. So Amy Edmondson is author of Teaming Up to Win the Rail Deal. And when we apply URIs to that, we have the little piece that you see here where it's a URL and then a specified relationship
14:22
and then another URL to identify the publication. So the knowledge graph is still, I think, relatively small, at over a million triple statements, but this is what the data looks like.
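[Editor's note: for readers who want to see such triple statements in practice, here is a minimal sketch using the open-source rdflib library (pip install rdflib). The namespace and URI scheme are invented for illustration; the talk does not show HBS's actual URIs.]

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

HBS = Namespace("https://example.org/hbs/")  # hypothetical base URI

g = Graph()
g.bind("hbs", HBS)

person = HBS["person/amy-edmondson"]
work = HBS["publication/teaming-up-to-win-the-rail-deal"]

# Subject-predicate-object statements of the form described in the talk:
g.add((person, RDF.type, HBS.FacultyMember))
g.add((person, RDFS.label, Literal("Amy Edmondson")))
g.add((person, HBS.isAuthorOf, work))
g.add((work, RDFS.label, Literal("Teaming Up to Win the Rail Deal")))

# Serialize the graph so the URI-based statements are visible:
print(g.serialize(format="turtle"))
```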
14:42
So since launching the proof of concept and demoing the site to colleagues at the business school and beyond, the use cases have been coming out of the woodwork at HBS. Some examples include using the graph to give the MBA program a view into the cases being taught in their required curriculum,
15:01
and giving HBS initiatives, which are topic-focused research areas at HBS, a view into the publications, events, faculty, and activities of the school related to their topic. They're essentially interested in promoting the activity of the school in relation to a particular topic or industry.
15:21
And these are all excellent use cases for an HBS knowledge graph, and we plan to implement them. So, some of our takeaways. There were many, but the highlights are that there are many principles and conventions
15:42
from the library cataloging world that informed our work on the construction of a knowledge graph. So we borrowed from vocabularies for specifying relationships among entities and roles people have in relation to publications. We relied on our knowledge of library conventions for tracking name changes over time
16:02
and disambiguating conflicting names. For resource description purposes, we start with the basic principle of clarifying what it is that we're trying to describe and whether and how references to the resource or entity appear in our sources.
16:21
Scoping our project was key. So we focused on specific use cases and problems that we were trying to solve, just a few, and the types of data that address those use cases, as well as the HBS repositories that best represented those types of data. So we chose use cases, specific content types
16:41
or class types or types of data that address those use cases. And then from there, what are the specific repositories that would best help us address those use cases? And finally, if there's one thing that we learned above all, it is the importance of having unique identifiers
17:01
for entities across data sources. We knew this going in, of course, but we really knew it by the time we were finished. It was so helpful to have unique identifiers anywhere we could find them in the source data. And I would say, in sum, that any data-driven project we undertake is highly dependent on clean, trustworthy,
17:22
uniquely identified data. And when undertaking a semantic data project, you would do well to ask your friendly neighborhood librarian for help. Thank you. Thank you very much.
17:41
And I'd like to give you an applause, as is tradition here. We know we're all at home and in our offices. Yes, and thank you very much for that. We have a few questions waiting. Before we begin, I wanted to ask you a quick question. I find it interesting: as more and more information and data goes digital, is cybersecurity an issue for you all as well?
18:01
Not only in that someone could come in and steal information, but also change that information that's being stored. Yeah, so it is an issue for us. And one of the other considerations for scoping our data was being very careful about what kind of data
18:22
we were exposing to the outside world. So in security, in that sense, we weren't thinking so much about people coming in and changing our data, but we were definitely thinking about exposing internal data to the outside world. So for now this site is behind login. We aren't using anything that people can't already find
18:45
if they have an HBS login. So as long as we're behind login, we're sort of bypassing that question, but the goal is to open this up more once we have more permissions and security
19:03
things in place. Okay, great, thank you. Okay, let me get to our audience questions. The first one we have here is, do you have feedback from users? Do they understand the knowledge graph without introduction? To me, it sounds great.
19:20
Great, thanks. We have feedback from users. So the idea is not that people should have to understand the knowledge graph in order to be able to use the site. When we demo the site, we try to explain what it is that we're doing and why it's different from
19:43
what others may be doing at HBS right now. But really we've had great feedback from our users. And I didn't include a slide about comments. I could have, and I guess I should have, but some of the feedback is that, wow, you've solved a really big problem for us.
20:04
This is what everyone wants to do at HBS: they want to have this view across sources. Demoing this and talking to people about it has brought out all these use cases, and people can really see themselves in it, which is fabulous for us.
20:21
That's kind of what makes it exciting. Super, that's a big plus. Good, our next viewer, we have two questions from this person. It says here: an excellent presentation. I can only agree. I would love to understand the project development phase and the proof of concept. Can you speak more about the proof of concept?
20:40
How did you decide what data was useful? What data sources did you use for the knowledge graph? So there were several considerations. One consideration, which was actually pretty big, was what data can we get access to pretty quickly? For some of this, we had to request permission,
21:01
and it wasn't so easy to get access. But data that's available publicly on the HBS website was easy to get: the Faculty and Research data, meaning the citations and data about faculty publications and the work that they're doing. The Alumni Stories data,
21:21
which talks about alums and also faculty members, is also on the public website, so that was easy to get. And the Working Knowledge data, again, was available on the public website, so that was easy to get. So we were focused primarily on publication data,
21:40
people data, and organization data. Okay, good. That answers the question. Yeah, that does. A follow-up to that is, is this knowledge graph available to the public? Sadly, no, not yet. Any plans? It's in the future.
22:01
We have lots of plans. It's definitely something that we want to do in the future. We have a lot of permissions functionality to put into place before we can do that. But definitely, this was our proof of concept, and the goal now is to scale it up
22:23
and to start expanding the knowledge graph and incorporating data sets and making it available to more people, more users than it's currently available to. Okay, great. Good, all right. Next question here. How exactly does the disambiguation work
22:42
with same-name or common-name researchers? The viewer says here: I did not get that, sorry. Oh yes, okay. So when we were working with the alumni data, if we were just looking at the name data, we saw all these duplicate values.
23:01
So we had to look at other metadata related to the alums and bring that in to try to disambiguate between different names. So for example, we might have had a lot of John Smiths, but
23:22
maybe they were not all MBAs. Some of them were MBA alums, and others were exec ed alums. So we included their degrees, and then we further disambiguated by including their class year. So if they graduated in 1995 versus 2000,
23:43
so we included the degree plus class year from other metadata that was already there and just appended it to the names. Wonderful. And Erin, that's all the time we have now. So once again, I'd like to thank you for your excellent presentation and also for taking the time to answer some of our questions.
24:01
A round of applause from around the world. We have over 50 countries represented with over 300 participants. So thank you for that, and continued success at Harvard. Thank you. Thank you. Bye-bye.