
Matscholar: The search engine for materials science researchers


Formal Metadata

Title
Matscholar: The search engine for materials science researchers
Title of Series
Number of Parts
56
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Matscholar (Matscholar.com) is a scientific knowledge search engine for materials science researchers. We have indexed information about materials, their properties, and the applications they are used in for millions of materials by text mining the abstracts of more than 5 million materials science research papers. Using a combination of traditional and AI-based search technologies, our system extracts the key pieces of information and makes it possible for researchers to do queries that were previously impossible. Matscholar, which utilizes Vespa.ai and our own bespoke language models, greatly accelerates the speed at which energy and climate tech researchers can make breakthroughs and can even help them discover insights about materials and their properties that have gone unnoticed.
Transcript: English(auto-generated)
Hi everybody, how are you doing this morning? Thanks for coming. Hi everybody online. So my name is John Dagdelen. I'm a PhD student at UC Berkeley and at Lawrence Berkeley National Laboratory, which is just up the hill from UC Berkeley. We're a Department
of Energy research lab, so that means we work on trying to invent new things to make energy cheaper and cleaner. And today I want to talk about Matscholar, a project we've been working on for a few years now. This is sort of a bait and switch: we are a search engine, but that's
not our main research, and I actually want to pitch you all on how language technologies can completely revolutionize science and energy. I think there are a lot of contributions that this community can bring back to science that we're missing right now. So I'm going to show you a couple of examples of how that can work, and then also how we're trying to put that into the hands of researchers. So first I want to pitch you on why materials are important.
So all of the technology that you use today, and all of the technology you'll use in the future, it's not just the inventiveness of engineers that brings it forth. Actually, engineers are so creative, so intelligent, they can design anything; they'll make you an elevator to the moon, except that the materials that they have to work
with cannot withstand the forces, or hold enough energy, or do all the things they want them to do. So they're bottlenecked by the materials that they have to work with: the alloys, the compounds. Here are some examples of technologies that were enabled purely by materials advances.
We knew in theory it could be done, but we didn't know what materials would enable it. You have high-temperature superalloys in the engines of planes that allow them to be much more efficient. You have hybrid solar cells, so these are things like perovskites that allow solar cells to be flexible, much cheaper, can be manufactured as thin films and put on top of other
solar panels and make them 50% more efficient. Lithium-ion batteries, which are enabling electric vehicles and green transportation. And things like catalysts for chemical reactions. The only issue is that, as materials researchers, we don't have access to all
the information that we need in order to make these advances. Most of the human knowledge about materials science is trapped, basically, in unstructured forms: the text, tables, and figures of millions of research papers. And I'm sure many of you have been there. We can't read everything, and so we miss stuff.
And we also can't leverage this larger picture that the literature provides on things. So we do our best, but the goal that we have as a project is to try to use natural language processing algorithms to harness all of this information and do material science. So let's talk about some of the things that we wish we could do as materials
researchers that are not possible right now. Some of these are small things: if I do a search for titanium nickel tin on Google Scholar or something like that, a slightly different ordering of those elements is still the same chemical formula, but it's not going to link to the same material.
And also, there are different ways that chemists and materials scientists write things that are basically the same, but conventional ways of doing search or language knowledge discovery wouldn't surface them. For medium-difficulty or medium-importance things, it's really hard to ask certain questions,
such as: what are all the materials that have been tried for storing lithium in a battery? Or just facts. Sometimes just looking up facts is actually pretty hard for us. And for big things, we can't use this big body of information to make predictions
about what we should study next or maybe how to synthesize something. Those are actually really big challenges for our field. You want to take, here, you can grab it, okay. My slides will be up online, too, afterwards, so you can... So that's where this project comes in: Matscholar. It's, I think, about three or four years old now, since we began.
And it's twofold, right? First, we want to use supervised natural language processing to just extract this data and make data sets that we can use to do machine learning analysis or answer some of these bigger questions that we have. And then also, we want to use self-supervised or unsupervised natural language processing
to kind of uncover the emergent properties of large bodies of scientific knowledge. There's a lot of untapped potential within the literature that we're trying to leverage. So I'm going to give you some pictures of both of those today, and then how we're trying to start making these things into tools that researchers can use in their day-to-day work. Okay, first, let's talk about named entity recognition.
This might be a task that many of you are familiar with. Materials researchers are usually not familiar with it because it's still new to us. We want to extract knowledge and facts out of research papers, and then make databases of this knowledge that are searchable. So in this case, we started with recurrent neural network (LSTM) type approaches,
and they did pretty well. We were getting F1 scores around 0.85 or 0.86, something around there. To do this, you annotate thousands of abstracts by hand (usually it's a PhD student like me doing this). Then we train these models, and then we can run them on our whole dataset.
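To make that setup concrete, here is a minimal sketch of what the hand-annotated data and the entity-level F1 evaluation can look like. The BIO label scheme (MAT for materials, PRO for properties) and the use of the seqeval library are illustrative assumptions, not the exact Matscholar pipeline.

```python
# Minimal sketch: BIO-tagged tokens for one annotated abstract fragment, and
# an entity-level F1 computed with seqeval. Label names are illustrative.
from seqeval.metrics import classification_report, f1_score

# Tokens:      The   band    gap     of   TiNiSn   was  measured  .
gold_tags = [["O", "B-PRO", "I-PRO", "O", "B-MAT", "O", "O", "O"]]

# Tags predicted by a trained sequence model (LSTM, BERT, ...); here the
# model missed the material entity.
pred_tags = [["O", "B-PRO", "I-PRO", "O", "O", "O", "O", "O"]]

# seqeval scores whole entities rather than single tokens, which is the
# kind of F1 quoted above (~0.85-0.86 for the LSTM-based models).
print(f1_score(gold_tags, pred_tags))
print(classification_report(gold_tags, pred_tags))
```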
So this allows us to start building these search indices. This is kind of what that pipeline looks like. So you would take a big database of text that you get from the internet. We have about 5 million full research papers in our corpus that we've collected in collaboration with some other groups.
You tokenize them. So this is another place where conventional language technology approaches fall short, and we had to do our own developments. Because when you tokenize chemical formulas and other ways of writing things in science, it's not quite straightforward sometimes. Then you label them, you turn them into a training set, you train your model,
you maybe get some word embeddings, or models like BERT, that you can then use to create dense representations of these concepts and words. Then you can use those to do useful things like extracting data or other downstream tasks.
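Since the tokenization step mentioned above is where generic tools tend to break, here is a rough sketch of the idea of chemistry-aware tokenization. The regexes are simplified assumptions and only scratch the surface of what the real Matscholar tokenizer handles (hydrates, dopants, mixed stoichiometries, and so on).

```python
import re

# Keep chemical formulas such as "TiNiSn" or "LiCoO2" together as single
# tokens instead of letting a generic tokenizer split them apart.
FORMULA = re.compile(r"^(?:[A-Z][a-z]?\d*(?:\.\d+)?){2,}$")  # simplified!
WORD = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|[^\sA-Za-z\d]")

def tokenize(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():
        stripped = chunk.rstrip(".,;:")
        if FORMULA.match(stripped):
            tokens.append(stripped)               # whole formula, one token
            tokens.extend(chunk[len(stripped):])  # keep trailing punctuation
        else:
            tokens.extend(WORD.findall(chunk))
    return tokens

print(tokenize("The thermoelectric TiNiSn and LiCoO2 were synthesized at 700 K."))
# ['The', 'thermoelectric', 'TiNiSn', 'and', 'LiCoO2', 'were', 'synthesized',
#  'at', '700', 'K', '.']
```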
Now, the big development that's been happening, and I think it's affecting a lot of us, is these big large language models that use billions of parameters and have state-of-the-art performance on many downstream tasks: things like question answering, information extraction, and so on. In our case, we find that these models do improve our scores, though not by as much as you might think.
But we also found an interesting feature: if you pre-train these on scientific text in general, they do almost as well as pre-training them on a similar-sized corpus of just materials science. The materials-specific models slightly edge them out on the task of named entity recognition and a couple of other things. But the good news, basically, is that for scientific purposes,
a general model trained on general scientific text can be used almost as effectively as one that you pre-trained yourself. So for a lot of researchers in various fields, pre-training a BERT model might be a little bit outside their ability, because they may not have the hardware to do it or the experience to do it.
So luckily they can just download a pre-trained model, use it for their own tasks, and maybe only fine-tune it on the downstream task rather than pre-training their own model.
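As a sketch of that workflow, here is roughly what downloading a pre-trained scientific BERT and fine-tuning it for token classification looks like with the Hugging Face transformers library. The checkpoint name, the label set, and the toy dataset are placeholders; the real training data would be the hand-annotated abstracts.

```python
# Sketch only: fine-tune a downloaded, pre-trained scientific BERT for NER.
# The checkpoint, labels, and toy dataset below are placeholder assumptions.
import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["O", "B-MAT", "I-MAT", "B-PRO", "I-PRO"]
checkpoint = "allenai/scibert_scivocab_uncased"   # a general scientific BERT

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels))

class ToyNERDataset(torch.utils.data.Dataset):
    """Tiny stand-in for the real annotated corpus (all tags set to 'O')."""
    def __init__(self):
        texts = ["TiNiSn is a half-Heusler thermoelectric material.",
                 "LiCoO2 is used as a cathode in lithium-ion batteries."]
        self.enc = tokenizer(texts, truncation=True, padding=True)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.zeros(len(item["input_ids"]), dtype=torch.long)
        return item

args = TrainingArguments(output_dir="matscholar-ner", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=3e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=ToyNERDataset(), eval_dataset=ToyNERDataset())
trainer.train()
```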
Using these kinds of techniques, we've extracted properties from millions of abstracts. We started with the abstracts because they usually contain a concise description of what was done, what the big conclusions were, and what materials were studied, so they're a good proving ground for testing out ideas. And this represents a really significant step change in the amount of data available to researchers.
Previously there were only maybe 100,000 to 1 million compounds and their properties in materials science databases, and these were usually made with first-principles simulations, so they weren't even from the experimental literature. And the data set sizes that we tend to work with from the experimental literature are in the hundreds of data points.
So this is a really important transition period where we start getting things at scale. And science has not traditionally scaled well, and we're trying to learn from all of you about how we can try to do that. And so we're using Vespa as our search backend.
So we want to start turning this data set into a tool that people can use to surface ideas and understand the literature at scale. And Vespa has been a really nice tool for that purpose. We've got users all over the world. And we actually have not launched the site. We just built it and then gave it to one or two of our collaborators to test out.
And it sort of just, by word of mouth, started being used by hundreds of people all over the world. We actually don't know who they are; it's just a couple of degrees of separation from us. Mostly the United States, Germany, and China. OK, so I just want to show a quick example of something you might want to do.
So a search that might have been difficult with traditional ways of doing it, on maybe Semantic Scholar, Google Scholar, something like that, is: find papers that are about gold nanoparticles but not nanorods. Nanorods have an aspect ratio to them.
And then narrow it down to papers that are only about gold nanoparticles as catalysts. Doing that search might be kind of difficult, and you would also want to see the bigger research picture there. So let me just step through the GIF. We have our first text query with a couple of filters: must include nanoparticles, must not include nanorods.
We can filter it down by material that's mentioned in those research papers. And we can end up getting to our paper. So I'll just show it one more time because I went kind of fast. So we're going to select the material, scroll down to the applications, select catalyst. So now all the research results are catalysts with gold nanoparticles, no nanorods.
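Under the hood, a query along those lines might look roughly like this against Vespa's HTTP search API. The field names here ("abstract", "materials", "applications") and the endpoint are hypothetical stand-ins for whatever schema Matscholar actually uses.

```python
# Rough sketch of the query against a Vespa search endpoint. Field names and
# the endpoint URL are hypothetical; the real Matscholar schema will differ.
import requests

yql = ('select * from sources * where '
       '(abstract contains "nanoparticles") and '
       '!(abstract contains "nanorods") and '
       '(materials contains "Au") and '
       '(applications contains "catalyst")')

resp = requests.post("http://localhost:8080/search/",
                     json={"yql": yql, "hits": 10})
for hit in resp.json().get("root", {}).get("children", []):
    fields = hit.get("fields", {})
    print(fields.get("title"), "-", fields.get("doi"))
```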
And you can actually go to the research papers. That's just one way of doing information discovery. And one thing that we've noticed is that when we use our way of indexing these research papers, chemistry-aware indexing and things like that, rather than the more conventional search way of doing it,
you can avoid problems that might have hit you before. So for example, in this paper, they never use the actual Au element symbol for gold or a chemical formula; they just say the name. This happens all the time in chemistry research. We have a word for something, and then we have a chemical representation, a chemical formula.
And those two things are the same concept, the same thing. But if you index one, you might not get papers on the other if you do a search for that. And so this is an example. This is, I think, the second or third result. This is a paper. If you do the same search on Google Scholar, you're going to be actually served with millions of research papers about gold nanoparticles used in drugs because the biology research
is so much bigger than the materials science research, and that's not useful to us. And even if you append "materials science" to it, this paper wouldn't surface for you. So it's sometimes hard to find the things we're looking for. Something else we can do is train these language models on this corpus of data that we have. And what we find is that the embeddings, the dense representations for words, actually encode
information in a way that is consistent with how we think about materials. So for example, the word embedding for lithium cobalt oxide that our model learns from just reading research abstracts is really, really similar. It's very nearby in this vector space to a bunch of other materials that are used
for the same application, which is the cathode, the electrode in a lithium-ion battery. And if you look at something like a property of materials, like ferromagnetic, all of the embeddings in this vector space that are closest to that word are ferromagnetic or antiferromagnetic things. So to people who work with these technologies, this makes sense,
but it blew the minds of all the materials researchers I've shown it to, because they didn't realize that you could learn representations for these things. And this is extremely useful, because now what we can do is turn it into basically a big knowledge graph, a big knowledge map, and then start at one node and work your way out around it
and try to understand the relationships between these things. So this is a t-SNE visualization of that. Our vectors are 200-dimensional, and when you visualize them in two dimensions, you can see, this is, I think, the 10,000 most common materials in our corpus, and they cluster based on their application, their properties.
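Here is a sketch of both of those steps, nearest-neighbor queries on the embeddings and a small 2-D t-SNE map, assuming word vectors trained on the abstracts are available in gensim's Word2Vec format; the file name and the short material list are placeholders.

```python
# Sketch: nearest neighbors in the embedding space, plus a tiny t-SNE map.
# The model file and the material list are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

wv = Word2Vec.load("matsci_word2vec.model").wv   # 200-d vectors (placeholder)

# Materials nearest to LiCoO2 should come out as other Li-ion cathodes,
# and words nearest to "ferromagnetic" as other magnetic-ordering terms.
print(wv.most_similar("LiCoO2", topn=5))
print(wv.most_similar("ferromagnetic", topn=5))

materials = ["LiCoO2", "LiFePO4", "Bi2Te3", "PbTe", "GaAs", "TiO2"]
xy = TSNE(n_components=2, perplexity=2.0, random_state=0).fit_transform(
    np.array([wv[m] for m in materials]))
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), name in zip(xy, materials):
    plt.annotate(name, (x, y))
plt.show()
```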
So this blue cluster in the middle here are the thermoelectrics. They turn heat into electricity. They're used on the Voyager spacecraft, for example, that is starting to send back some weird data. I think you might have seen it in the news. That's what's powering it. There's a nuclear material that's producing heat, and then that turns into electricity.
And this is really interesting because you can start mapping the space of all human knowledge about materials and try to understand what are the origins of how properties are manifested in things. So this is something that we've been trying to do for many, many hundreds of years. So for example, Mendeleev really changed chemistry
by noticing patterns in how the elements behave, and this was before we really had strong knowledge of quantum mechanics or even the structure of matter. What he did was lay out the elements based on how they react with each other.
The metals tend to react with acid and produce gas, and stuff like that. And the notable thing he did, though, and lots of people had tried something like that before, was to leave holes based on where he thought new elements should be. And indeed, that's where elements were later discovered to be. So when we train on this and map out the word embeddings for the elements themselves, what we find is that they have an extremely similar structure to the periodic table.
And it has never seen the periodic table. It doesn't know anything; the inputs to this model are one-hot encodings of just the words themselves. And so you can see in this plot, if you actually trace down the columns of the periodic table here, those are the lines I'm showing in gray there.
Actually, the same columns are represented even in a t-SNE, which is a sort of warped view of the thing. So the nearest neighbors of elements are their nearest neighbors on the periodic table, and the reason for that is that elements behave similarly. They manifest in the literature and in chemical formulas based on their properties, and the periodic table is laid out based on their properties.
So you can kind of rediscover concepts that scientists use in these word embeddings. Yeah, so what I did, I'm just using the lines to help you see what's going on here. So if you trace down this column, that's that arrow.
Yeah, yeah. So in this case, cadmium and lead are probably mentioned together a lot.
For one, this is a t-SNE plot, which kind of warps the vector space, so they might actually be closer together in the higher-dimensional space than this suggests. But also, cadmium and lead are both toxic, so a lot of the time in the literature they're mentioned together as things we should not use in materials. So that might be one reason why they're closer here.
But there is some sort of similar structure to the overall data. Yeah, yeah, mercury is also poisonous. Okay, and so we wanted to see how good these representations actually are at encoding the chemical knowledge in there. So we trained just a simple linear model on a couple of properties of these things.
So this is the embedding of the word, the element name, and then we're mapping it onto properties of the elements. And we found really, really strong correlations there. The areas where it kind of fails are things like Tc or Ra, which are also used as abbreviations for other quantities, like critical temperature.
So that's the reason they're kind of off the distribution, and that's where some of the noise comes in; F, for example, collides with Fahrenheit, that kind of thing. And so now what we're trying to do is use contextualized language models like BERT and things like that, which can remove that noise, or the degeneracy there.
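For a flavor of that probe, here is a sketch: regress a tabulated element property (melting point, say) on the element's word embedding and look at the cross-validated correlation. It reuses the `wv` vectors from the earlier sketch, and the tiny property table is just illustrative, not the full set of properties used.

```python
# Sketch of the linear probe: element embedding -> tabulated element property.
# Reuses `wv` from the earlier sketch; melting points (K) are approximate.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

melting_points = {"Fe": 1811, "Cu": 1358, "Al": 933, "Si": 1687,
                  "Ti": 1941, "Ni": 1728, "Sn": 505, "Pb": 601}

X = np.array([wv[el] for el in melting_points])
y = np.array(list(melting_points.values()), dtype=float)

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=4, scoring="r2")
print("mean cross-validated R^2:", scores.mean())
```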
So now we can map the space of materials, and we can start asking questions: what is the similarity between a material and an application? We can get a heat map of which materials are similar to that application and, basically, where they lie. So this is the heat map for thermoelectric, and you can see a lot of the best-known thermoelectrics here are very strongly correlated with that term.
And same with luminescent. And for us, having the ability to ask why something is more similar to both thermoelectric and luminescent, and less similar to some other property, like, what chemically is going on there, means we can actually start interrogating these questions a little more deeply. So one thing we wanted to do is try to put this in the hands of people.
So we actually built a search interface that uses the word embedding visualization. You can isolate things by similarity and then retrain a t-SNE to re-visualize just those points and their relationships to each other and start segmenting the data.
And then you can actually click on these individual ones or highlight a group of them and then rerun it and then eventually get yourself to a search. So in this case, we went to the materials cluster, the top results for similarity to thermoelectric, and then grabbed bismuth telluride search results.
You can also do things like get insights from the compositions of these materials. So if you look at the top materials that are most similar to the word organic, you can find that the elements in those materials are distributed how you might expect. And then you see the same thing with batteries. So oxides are really prevalent in battery materials
and with piezoelectrics you see a slightly different distribution. So we're starting to be able to actually get at the correlations between composition and properties, which is the ultimate question in materials science.
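A sketch of that composition-level analysis: take the terms most similar to an application word, keep the ones that look like chemical formulas, and count which elements show up. It reuses `wv` from the earlier sketches, and the formula-matching regex is deliberately simplistic.

```python
# Sketch: element distribution among the materials most similar to a word.
# Reuses `wv`; the formula regex is a simplification, not a real parser.
import re
from collections import Counter

FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*(?:\.\d+)?){2,}")
ELEMENT = re.compile(r"[A-Z][a-z]?")

def element_counts(word, topn=200):
    counts = Counter()
    for term, _score in wv.most_similar(word, topn=topn):
        if FORMULA.fullmatch(term):
            counts.update(ELEMENT.findall(term))
    return counts

# Per the talk, oxygen (oxides) should dominate the battery distribution,
# while piezoelectrics show a different mix of elements.
print(element_counts("battery").most_common(10))
print(element_counts("piezoelectric").most_common(10))
```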
If you can predict all of something's properties from just the combination of the elements, then you can just start mixing and matching elements, predicting their properties, and find your way into an area where there might be new materials that can help society. As for the larger impact these embeddings have had: they're actually used as the representations for elements in the state of the art, in all of the best models for predicting materials properties from composition.
Those two are Roost and CrabNet. And we are actually using these embeddings ourselves, and we're trying to take it to the next level. We haven't published that research yet, and I can't get into it today. But we're finding that when you pretrain models on big datasets of text, you basically can use some of that background knowledge about stuff,
about materials, about matter, about chemistry, and distill that knowledge into the model that you're then going to fine-tune on some sort of predictive task that might be more scientifically rigorous, or more specific, I guess I mean. So that's all good. We can try to understand what's already been done.
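As a naive illustration of that idea, and emphatically not Roost or CrabNet themselves, here is how element embeddings can serve as a composition representation: average the element vectors, weighted by stoichiometry, and fit any off-the-shelf regressor on top. The `element_vectors` dictionary and the labeled dataset are placeholders.

```python
# Naive sketch (not Roost or CrabNet): composition = stoichiometry-weighted
# mean of element embeddings, fed into an off-the-shelf regressor.
import re
import numpy as np
from sklearn.ensemble import RandomForestRegressor

ELEMENT_AMOUNT = re.compile(r"([A-Z][a-z]?)(\d*\.?\d*)")

def featurize(formula: str, element_vectors: dict) -> np.ndarray:
    """Weighted average of element embeddings for a chemical formula."""
    vecs, weights = [], []
    for el, amt in ELEMENT_AMOUNT.findall(formula):
        if el in element_vectors:
            vecs.append(element_vectors[el])
            weights.append(float(amt) if amt else 1.0)
    return np.average(np.array(vecs), axis=0, weights=weights)

# element_vectors: e.g. {"Li": <200-d vector>, ...} taken from the trained
# word embeddings; formulas / y are a labeled property dataset (placeholders).
# X = np.stack([featurize(f, element_vectors) for f in formulas])
# model = RandomForestRegressor(n_estimators=300).fit(X, y)
```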
But can you actually use this to make anything new? And usually this is where the second half of my talk would begin, but I don't have time in this short one. So if you'd like to check that out, we had a pretty interesting paper that we published a couple years ago. People like Vinod Khosla and Andrew Yang were tweeting it out, and it was pretty fun. But here's a QR code that will take you to the paper,
and you can find it, there's some news articles about it for maybe a non-material science audience. But basically we showed that you could discover new materials using these methods that scientists had missed. So these are materials that we know exist, but we didn't know that they had these important properties.
And so this is kind of the direction our research is going now: using these self-supervised models, combining them with other data that we might have from simulations and things like that, and then predicting new materials. This is a big effort with a lot of people involved. Here are just some of them. And we also have some...
So our work is supported by the Toyota Research Institute, the Advanced Materials Discovery and Design Group. And then these are some other folks who have helped in meaningful ways. Olga wrote the materials parser, this really complicated, essentially regex nightmare of parsing chemical formulas; she solved that. And then also, our work is hosted at the supercomputing center
at LBNL, NERSC, and SPIN specifically is what hosts our website. So thank you very much for having me at this conference, I really enjoyed meeting all of you, and I'll take a couple questions. Yeah. Sure. We don't have much time, unfortunately,
so only one question. I'll be around afterwards too, and you can always email me. This is really exciting stuff, this is great stuff. Thanks. So you mentioned earlier that science doesn't scale, and I also wonder about search engine scaling. You mentioned some of the... Semantic Scholar and things like this, they have a different approach, they do a lot more with citation analysis,
and there are others... So how can you take your work and make it into a modular part of a larger ecosystem of search engines? Could you plug your chemical parser into some other system? Have you been thinking about that kind of thing? Have you talked to the Semantic Scholar people? Yeah, I've talked to Semantic Scholar a bit.
Actually, so when the pandemic happened, we transitioned to trying to do a lot of this for the COVID-19 literature. So actually we were involved in... We had some meetings with them about how we could help and work together. I think that things like the language models that we're training, and the tokenizers we're using, and the chemical parsing, a lot of that, we're trying to write it as modular pieces
that you could then plug in. So for example, our tokenizer is something that you can use, and we have it on a GitHub repo. So that's something that I think is a really good direction to go. What we've found is, this is kind of in our work on COVID-19, we realized that domain expertise is really, really important for making this stuff.
If you're just an expert in language technology trying to solve some of these issues with those tools, unless you're working extremely closely with the scientists who know that field really well, you might run into potholes you didn't know existed, basically. So that's what we're trying to do, is build things that can then be adopted more broadly by people.
Thanks. Thank you very much. Thank you again, everybody.