
Parsing and deduplicating Scientific Text at Scale for AuroraGPT


Formal Metadata

Title
Parsing and deduplicating Scientific Text at Scale for AuroraGPT
Number of Parts
3
Author
Robert Underwood
License
CC Attribution - NonCommercial 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Language
English

Content Metadata

Abstract
In this talk, Robert Underwood will share the recent progress of the AuroraGPT Team, how data contributes to the project of building a science-focused LLM with AuroraGPT, and the topics that his team sees as open questions. He will discuss the systems and data quality challenges the team tackles to prepare terabytes of scientific data and text to produce high-quality text and data for training.
Transcript: English (auto-generated)
Today I'm going to be talking about my work on the Aurora GPT project. Aurora GPT as a project is organized into a number of different sub-teams. The specific sub-team that I work on is called the Data Team, so we're responsible for identifying, curating, parsing, analyzing,
and deduplicating the various corpora of documents that we receive, and then making them available for training and for systems like the one presented by the previous speaker. I want to begin with an overview of why Argonne is interested in building a science-focused LLM
as opposed to general-purpose LLMs. I'll talk a little bit about my team's role with respect to data and where we fit in that process. We'll then talk about what the Data Team does specifically and why that can be very challenging,
and then we'll talk about some recent research highlights from my team. So why should we build a science-focused LLM? AI models are already producing fundamental changes in the way that we search for and access information; control and operate processes and simulations; operate and steer robots, machines, vehicles,
and experimental facilities; and solve optimization problems. Even today, we're already seeing that AI is profoundly changing how we approach research, in some ways more so than even the internet has. So as a research institution, Argonne is interested in developing AI foundation models
that are specifically optimized for science. So here's one way that we can kind of think about this as applied to the domain of computer science. So first, we would say that we want the system to be what we would describe as a scientific assistant. We're not trying to make it replace the scientist,
we're trying to have it assist the scientist across the different stages of the scientific method: from problem formulation, such as identifying optimization objectives or areas where you may need to do additional uncertainty quantification on a particular problem, expressed through natural language but potentially also through mathematical formulations of particular problems.
We see it affecting the modeling stage: how are we applying the constraints, how are we mapping a problem description down to a particular problem instantiation. Retrieval: how would we find related scholarly research in this area, and I think the last presentation did a really good job of showing how well that can be done.
The solver stage: how do we solve the problem, implementing the specific software components or mathematical model components that will allow us to tackle a particular problem. And then implementation. Okay, but why specifically Argonne? Argonne is a leading community of scientists across a variety of different fields
that has a unique capability to evaluate and understand scientific outcomes from models. Argonne has about 3,000 scientists across the enterprise working on different problems and domains. If you compare that to the scientific workforce that a company like OpenAI or Meta has, yes, they may have computer scientists, or even biologists or chemists, working on specific areas,
but we have some of the best scientists in the world, and we have a lot more of them than most of these corporate entities are able to bring to bear on specific problems. And that expertise is critical both for understanding what data sources we need to incorporate for particular problems in order to solve them efficiently, and for understanding how well we're doing at solving those problems.
So that matched pairing of data and evaluation, driven by expertise, is really critical, and it really positions national laboratories such as Argonne to work on this problem. Two, we have unique access to data from large-scale scientific simulations and instruments.
So Argonne has a number of user facilities, things like the Advanced Photon Source or the ALCF, the Argonne Leadership Computing Facility, which hosts one of the largest machines in the world, Aurora, and these facilities are used to produce large-scale scientific data sets. So as we produce these data sets, we in some ways have more native access to them
than people who are outside of Argonne would necessarily have. Additionally, with this unique access to data, we also have unique access to the expertise that both produced that data and is used in analyzing that data.
So having this access to data gives us a foundation that we can build on to understand how to apply these models in particular contexts. Additionally, we have the exascale computing required to achieve this, through systems like Aurora as well as Polaris. And it's well aligned with broader US Department of Energy initiatives such as FAST, which is the large AI effort that's being proposed before Congress.
So what role do data and the Data Team play in this process? In the context of Aurora GPT, we have on the order of eight teams, which I've depicted here. We have Data, the team I'm going to be talking about today; Training, which is responsible for taking the data produced by the Data Team
and then feeding it through pre-training. There's usually then what's called a post-training process, which aligns a trained model to do things like question answering; I think Alan mentioned this notion of alignment or instruct models, and that's what post-training is responsible for. Evaluation asks: how well are we doing?
And Serving, or Inference, is responsible for making these capabilities available to the larger lab community. There are also two non-technical teams, Distribution and Communication. Distribution is functionally our legal group, which helps us navigate some legal challenges, and Communication helps us distribute this information to the broader general public. So here's just a brief picture of some of the people
that are involved with the project. I'll now transition to the goals for our team specifically. The mandate that we have is to generate on the order of 20 trillion tokens of high-quality scientific text and structured data, with strong quality control, deduplication, provenance, and subsetting. As the first presenter said, a token is not necessarily a word; it could be a portion of a word or a combination of words, depending on the specific case.
This mandate manifests in a number of concrete tasks, such as PDF parsing, or LaTeX and XML to text, depending on what the data source is; preparing that data for training across a variety of different fields; and tools and workflows to understand how contamination occurs.
A good example of the latter is one of the major benchmarks that is still actively used, GSM8K. The Allen Institute for AI has said that that benchmark is probably completely leaked at this point and is basically no longer useful for evaluating LLMs.
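As a rough, hypothetical illustration of what a contamination check can look like, here is a minimal sketch that flags training documents sharing long word n-grams with a benchmark. The function names, the n-gram lengths, and the toy strings are assumptions for illustration; this is not the team's actual tooling.

```python
# Hypothetical sketch: flag training documents that share long n-grams with a
# benchmark such as GSM8K. Illustrative only -- not the AuroraGPT tooling.

def ngrams(tokens, n):
    """Return the set of all n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts, n=13):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text.lower().split(), n)
    return index

def is_contaminated(document, benchmark_index, n=13):
    """A document is flagged if it shares any long n-gram with the benchmark."""
    return bool(ngrams(document.lower().split(), n) & benchmark_index)

if __name__ == "__main__":
    benchmark = ["Natalia sold clips to 48 of her friends in April"]
    index = build_benchmark_index(benchmark, n=5)
    doc = "answer: Natalia sold clips to 48 of her friends in April, and then she ..."
    print(is_contaminated(doc, index, n=5))  # True -> likely leaked
```

In practice a check like this would run over token IDs at corpus scale, but the underlying idea is the same.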
Detecting that that's happened matters so that, when we're doing evaluation of LLMs, we understand which evaluations are fair and accurately reflect the state of the art. Then, beyond preparing the data for the training team, we also work with the inference team on things like retrieval-augmented generation, as discussed before,
as well as with the evaluation team, so that they can understand which evaluations are appropriate for the particular scientific data sets we produce. And as we continue this work, we're looking at how we go beyond traditional sequences and start to incorporate things like graphs,
climate data, images, structured grid data, unstructured grids, and tensor data into the system. Okay, so how does the Data Team do this specifically, and why can it be hard? We can think of the work as three broad categories of tasks:
converting and pre-processing, deduplication, and integration. Within conversion and pre-processing, we're converting text and data to structured, curated, and traceable formats for use in training. Why is this important? We want the data to be consistent so that it's easy to use, not just for training but also on the back end
for things like retrieval-augmented generation. As you likely know, every data source vendor produces data in completely different formats with different structures. Normalizing across the different schemas used to describe that data is a really important step for understanding which specific subsets of the data are potentially relevant for particular questions.
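To make the normalization point concrete, here is a minimal, hypothetical sketch of mapping two invented publisher layouts onto one common record; the field names and layouts are made up for illustration and are not any real vendor's schema.

```python
# Hypothetical sketch: normalize per-publisher metadata into one common
# record format. The publisher layouts and field names here are invented;
# real feeds each arrive with their own schema.
from dataclasses import dataclass

@dataclass
class Record:
    doc_id: str
    title: str
    year: int
    source: str

def from_publisher_a(raw: dict) -> Record:
    # Publisher A nests metadata under "article" and uses "pub_year".
    art = raw["article"]
    return Record(doc_id=art["doi"], title=art["title"],
                  year=int(art["pub_year"]), source="publisher_a")

def from_publisher_b(raw: dict) -> Record:
    # Publisher B is flat and stores the date as an ISO string.
    return Record(doc_id=raw["identifier"], title=raw["name"],
                  year=int(raw["date"][:4]), source="publisher_b")

if __name__ == "__main__":
    a = {"article": {"doi": "10.1000/x", "title": "Example", "pub_year": "2024"}}
    b = {"identifier": "osti:123", "name": "Another", "date": "2023-05-01"}
    print(from_publisher_a(a), from_publisher_b(b), sep="\n")
```

The value of a step like this is that everything downstream, whether training, deduplication, or retrieval, can assume a single schema.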
Two, if you have artifacts in your data (and I'll talk a little bit more about why you might), those artifacts in your training data can reduce the quality of the inferences you're able to produce from it. And lastly, traceability, reproducibility, and auditability
are really important. As we iterate on our design approach, we need to understand what experiments we have tried before and what impact they had, and to be able to trace all the way from the data set, and the individual articles that were used, to the outputs of the system, and to understand how that whole pipeline plays a role in the quality of the models we're able to produce.
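One way to picture that traceability requirement is a small provenance record carried alongside every processed document; the fields below are hypothetical and purely illustrative, not the team's actual schema.

```python
# Hypothetical sketch: a provenance record attached to each processed document
# so model outputs can be traced back to sources, parser versions, and
# processing decisions. Field names are illustrative only.
import datetime
import hashlib
import json

def provenance(source_path: str, parser: str, parser_version: str, text: str) -> dict:
    return {
        "source_path": source_path,
        "parser": parser,
        "parser_version": parser_version,
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "processed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    record = provenance("corpus/example_0001.pdf", "lightweight", "0.3.1",
                        "parsed document text ...")
    print(json.dumps(record, indent=2))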
Deduplication. Deduplication has a number of different implications here, but in short, excessive duplication in your training set can lead to a variety of problems. It can lead to memorization, which means that your model, instead of learning how to multiply or divide as described in the first presentation, learns how to regurgitate specific answers.
So for example, if you look at math specifically, you'll notice that the first presenter showed math examples that were longer form, with four digits or more of multiplication or addition. If you actually try simple multiplications and additions, things with two or three digits, in many cases the LLM can do them correctly.
What's going on there is that the model has memorized these simple cases, because there are enough examples of them out in the wild that models have effectively absorbed them. And if you look at more advanced models like Llama 3.2, or some of the other recent models, they specifically include a synthetically generated data set
with many examples of common multiplications and additions for small-digit cases, specifically to improve performance on those cases. So this memorization is potentially a problem for evaluation, but it's also a problem for generalizability. Because you're learning these specific examples,
generalizing to five- or six-digit multiplication becomes much more difficult, for some of the reasons described in the first presentation. Also, from a legal perspective, copyright concerns are a major aspect we have to think about when we're talking about duplicated data. There are a number of results showing that if your model has observed a particular substring
in your training data multiple times, I believe it's more than five or ten times, the likelihood of regurgitation becomes really, really high. The legal questions on this are not at all settled; there are major court cases going on right now. The New York Times in the United States is suing OpenAI, the maker of ChatGPT,
and there are other very similar lawsuits going on with major publishing houses. So understanding whether you're likely to produce exact output versus merely similar output actually has a lot of bearing on whether the information produced by these models counts as a facsimile for purposes of copyright law.
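To make that "seen more than a handful of times" point concrete, here is a hypothetical sketch that counts how many documents each long n-gram appears in so heavily repeated spans can be flagged; the n-gram length and threshold are illustrative assumptions, not established cutoffs.

```python
# Hypothetical sketch: count how many documents each long n-gram occurs in,
# so frequently repeated spans (higher risk of verbatim regurgitation) can be
# flagged for removal or down-weighting. Thresholds here are illustrative.
from collections import Counter

def doc_ngrams(text: str, n: int = 10) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def repeated_spans(corpus, n: int = 10, threshold: int = 10):
    counts = Counter()
    for doc in corpus:
        counts.update(doc_ngrams(doc, n))  # one count per document, not per hit
    return [(gram, c) for gram, c in counts.items() if c >= threshold]

if __name__ == "__main__":
    corpus = ["the same boilerplate license paragraph appears here word for word"] * 12
    print(repeated_spans(corpus, n=6, threshold=10)[:2])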
So this is a very interesting and important question that we have to think about. Additionally, we need to integrate with other teams to build things like services that end users will actually use. So why can this potentially be hard? While LaTeX source is available for many scientific documents, PDFs are probably the most common form of scientific text, and they're very expensive and error-prone to process.
You're very likely to get translation errors from optical character recognition (OCR) techniques, and even more advanced techniques like vision transformers still tend to produce artifacts when translating the text in PDFs into the plain text formats that are more easily processed by computers.
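A common way to cope with that cost, sketched hypothetically below, is to score the cheap parser's output and route only the suspicious-looking documents to an expensive OCR or vision-transformer pass; the heuristic and threshold here are invented for illustration and are not the AdaParse method described later.

```python
# Hypothetical sketch: route a document to an expensive parser only when the
# cheap parser's output looks artifact-laden. Heuristic and threshold are
# illustrative, not the actual AdaParse approach.

def parse_quality(text: str) -> float:
    """Crude quality score: fraction of characters that are alphanumeric,
    whitespace, or common punctuation. Low scores suggest OCR-style artifacts."""
    if not text:
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() or ch in ".,;:()'\"-" for ch in text)
    return ok / len(text)

def choose_parser(cheap_text: str, threshold: float = 0.95) -> str:
    """Keep the fast parse if it looks clean; otherwise fall back to a
    heavyweight parser (OCR / vision transformer)."""
    return "cheap" if parse_quality(cheap_text) >= threshold else "expensive"

if __name__ == "__main__":
    clean = "Deduplication at scale requires approximate methods."
    garbled = "De~du#pl1c@ti0n at sc@le re~qu!res ap#prox im@te me~th0ds."
    print(choose_parser(clean), choose_parser(garbled))  # cheap expensive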
Those parsing artifacts can then appear when you start to generate text based on the parsed documents, especially if you make the same consistent parsing error across a large number of documents. Nearly every text format is different, meaning there are a lot of different things we potentially need to integrate with. We have 14 terabytes of raw text alone, so processing anything based on 14 terabytes
of raw text is already somewhat difficult. And if you go beyond the raw text to the sources that aren't plain text, we have hundreds of terabytes, if not more, of documents that need to go through these processing pipelines. And all of that is just for text; once we talk about multimodal data, the process becomes progressively more complex
and more expensive to conduct at scale. Additionally, deduplication is computationally very expensive. A naive approach is an order n-squared algorithm, meaning we have to compare every document pairwise, which is not really scalable when you're talking about the hundreds of billions of documents used to train modern LLMs.
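To give a feel for how the all-pairs comparison is usually avoided, here is a minimal, self-contained MinHash-plus-LSH-banding sketch in which near-duplicate documents collide in shared buckets, so only colliding documents ever need a closer look. The parameters (128 hash seeds, 5-word shingles, 16 bands) are illustrative assumptions, not the pipeline's actual settings.

```python
# Hypothetical sketch: MinHash signatures plus LSH-style banding, so that
# near-duplicate documents land in the same buckets and only bucket-mates are
# ever compared, avoiding the O(n^2) all-pairs scan. Parameter choices are
# illustrative, not the AuroraGPT pipeline's settings.
import hashlib
from collections import defaultdict

NUM_PERM, SHINGLE, BANDS = 128, 5, 16
ROWS = NUM_PERM // BANDS  # 8 signature values per band

def shingles(text):
    """5-word shingles of a document."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + SHINGLE]) for i in range(len(toks) - SHINGLE + 1)}

def minhash(text):
    """One min-hash value per seeded hash function ('permutation')."""
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}|{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingles(text)))
    return sig

def lsh_buckets(docs):
    """Map (band index, band hash) -> doc ids; near-duplicates tend to collide."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, hash(band))].append(doc_id)
    return buckets

if __name__ == "__main__":
    docs = {
        "a": "parsing and deduplicating scientific text at scale for model training",
        "b": "parsing and deduplicating scientific text at scale for model training runs",
        "c": "a completely different abstract about photon source experiments run today",
    }
    candidate_pairs = {tuple(sorted(ids))
                       for ids in lsh_buckets(docs).values() if len(ids) > 1}
    print(candidate_pairs)  # expect {('a', 'b')} with high probability
```

With banding, the probability that two documents share a bucket rises sharply with their shingle overlap, which is what keeps the comparison cost roughly linear in the number of documents.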
And then, how do you do deduplication for multimodal data? That's actually a completely open question, whereas for text we at least have some idea of how to do it and which approaches are better or worse than others. And what role does data reduction play in this? For example, if you look at data coming out of CERN or the APS, they can generate from one to hundreds
of terabytes of data every day. As you're producing that volume of data, is the native format the most intuitive way to understand the data, or should we be reducing it to statistical summaries or to semantic descriptions that help us better understand
what's being represented there? And then, with multimodality, this is also a challenge for integration, because you're not just looking at text pipelines; you have to figure out, okay, how do I adapt RAG, retrieval-augmented generation, when my input is unstructured grid data? So as we look at scientific data sets
that represent the things actual scientists use, these problems become progressively more complex and more difficult to scale to large document collections. I'll end the presentation with some recent highlights from my team. In fiscal year 24, we successfully identified, parsed,
and deduplicated 12 data sets of scientific text and code. This includes the document collection that we have from OSTI, the DOE's Office of Scientific and Technical Information, as well as DOE CODE, which is a similar collection containing the code information from OSTI.
We're also incorporating data sets from scientific publishers: we have memorandums of understanding with the Association for Computing Machinery and the American Society for Microbiology, and we have additional agreements in process with vendors who are able to provide us with data sets that are otherwise
not available on the open web. (One minute, sorry, one minute. Yep, yes, thank you.) This will help us understand to what extent these sources are important, which plays into things like fair use arguments in this context. Additionally, we've done some work on scalable parsing for high-quality data sets; that work is called AdaParse.
There's more that I could say here, but the key takeaway is that PDF parsing can be very expensive, especially good PDF parsing, and you don't have to use a high-quality PDF parser in every case; when you do, though, it's important to be able to discern which cases those are. So we created a hybrid parsing technique
that combines lightweight parsers with heavyweight parsers to achieve the highest BLEU score, the highest ROUGE score, and the highest rate of accepted tokens of any technique developed so far. We've also done some work on scalable deduplication with a technique that we call LSH Bloom, which allows us to handle extremely large data sets.
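As I understand the idea from the description above, and only as a hypothetical sketch rather than the actual LSH Bloom implementation, the trick is to replace the usual per-band hash tables of an LSH deduplicator with Bloom filters, so that asking "have we seen this band signature before?" costs a few bits per document instead of storing every key. The filter sizes and the decision rule below are my own illustrative assumptions.

```python
# Minimal sketch of the *idea* behind combining LSH banding with Bloom
# filters: each band gets a Bloom filter, and a document is flagged as a
# duplicate if any of its band signatures has been seen before. Sizes and
# the decision rule are illustrative, not the actual LSH Bloom design.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=7):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, item: bytes):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(item, salt=i.to_bytes(8, "big"), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.n_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def is_duplicate(band_signatures, filters):
    """band_signatures[i] is the bytes-encoded signature of LSH band i (for
    example, derived from the MinHash sketch shown earlier). A document is
    treated as a duplicate if any band was seen before; otherwise its bands
    are recorded."""
    seen = any(sig in filters[i] for i, sig in enumerate(band_signatures))
    if not seen:
        for i, sig in enumerate(band_signatures):
            filters[i].add(sig)
    return seen

if __name__ == "__main__":
    filters = [BloomFilter() for _ in range(16)]             # one filter per band
    doc1 = [f"band{i}:sigA".encode() for i in range(16)]     # toy signatures
    doc2 = [f"band{i}:sigA".encode() for i in range(16)]     # identical bands
    print(is_duplicate(doc1, filters), is_duplicate(doc2, filters))  # False True
```

Bloom filters can return false positives, so occasionally a non-duplicate would be discarded; sizing the filters keeps that rate low, and the payoff is the kind of storage reduction described next.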
LSH Bloom produces some of the most accurate deduplication results of all the methods we've tried, while using substantially less computing time and storage than any of the other techniques we've used. If you apply it to Dolma, we would take the deduplication process from 77 days down to 30,
and we can take the amount of storage required from hundreds of terabytes down to tens of terabytes, which is a really dramatic saving. There's still headroom for improving performance even further by porting some of our key components from Python to Rust: porting the hottest loop of LSH Bloom
from Python to Rust resulted in a 200x improvement on that operation, which works out to an 11x improvement overall. So there's a lot more room to further optimize performance here. And with that, I'll turn it back over to our hosts.