Parsing and deduplicating Scientific Text at Scale for AuroraGPT

Formal Metadata

Title
Parsing and deduplicating Scientific Text at Scale for AuroraGPT
Title of Series
Number of Parts
3
Author
License
CC Attribution - NonCommercial 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In this talk, Robert Underwood will share the recent progress of the AuroraGPT team, explain how data contributes to the AuroraGPT project of building a science-focused LLM, and outline the topics that his team sees as open questions. He will discuss the systems and data-quality challenges the team tackles in turning terabytes of scientific data and text into high-quality training data.