
Parsing and deduplicating Scientific Text at Scale for AuroraGPT


Formal Metadata

Title
Parsing and deduplicating Scientific Text at Scale for AuroraGPT
Number of Parts
3
Author
Robert Underwood
License
CC Attribution - NonCommercial 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Language
English

Content Metadata

Abstract
In this talk, Robert Underwood will share the recent progress of the AuroraGPT Team, how data contributes to the project of building a science-focused LLM with AuroraGPT, and the topics that his team sees as open questions. He will discuss the systems and data quality challenges the team tackles to prepare terabytes of scientific data and text to produce high-quality text and data for training.
Transcript: English (auto-generated)
Today I'm going to be talking about my work on the Aurora GPT project. Aurora GPT as a project is organized into a number of different sub-teams. The specific sub-team that I work on is called the Data Team, so we're responsible for identifying, curating, parsing, analyzing,
and deduplicating the various corpora of documents that we receive, and then making them available for training and for systems like the one presented by the previous speaker. I want to begin with an overview of why Argonne is interested in building a science-focused LLM
as opposed to general-purpose LLMs. I'll talk a little bit about my team's role with respect to data and where we fit in that process. We'll then talk about what the Data Team does specifically and why that can be very challenging,
and then we'll talk about some recent research highlights from my team. So why should we build a science-focused LLM? AI models are already producing fundamental changes in the way that we search for and access information; control and operate processes and simulations; operate and steer robots, machines, vehicles,
and experimental facilities; and solve optimization problems. Even today, we're already seeing that AI is profoundly changing how we approach research, in some ways more so than even the internet has. So as a research institution, Argonne is interested in developing AI foundation models
that are specifically optimized for science. So here's one way that we can kind of think about this as applied to the domain of computer science. So first, we would say that we want the system to be what we would describe as a scientific assistant. We're not trying to make it replace the scientist,
we're trying to have it assist the scientist across the different stages of the scientific method: from problem formulation, such as identifying optimization objectives or areas where you may need to do additional uncertainty quantification on a particular problem, expressed through natural language but potentially also through mathematical formulations of particular problems.
We see it affecting the modeling stage: how are we applying the constraints, how are we mapping a problem description down to a particular problem instantiation. Retrieval: how would we find related scholarly research in this area, and I think the last presentation did a really good job of showing how well that can be done.
The solver stage: how do we solve the problem, implementing the specific software components or mathematical model components that will allow us to tackle a particular problem. And then implementation. Okay, but why specifically Argonne? Argonne is a leading community of scientists across a variety of different fields
that has a unique capability to evaluate and understand scientific outcomes from models. Argonne has about 3,000 scientists across the enterprise working on different problems and domains. If you compare that to the scientific workforce that a company like OpenAI or Meta has, yes, they may have computer scientists, or even biologists or chemists, working on specific areas,
but we have some of the best scientists in the world, and we have a lot more of them than most of these corporate entities are able to bring to bear on specific problems. And that expertise is critical both for understanding what data sources we need to incorporate for particular problems in order to solve them efficiently, and for understanding how well we're doing at solving those problems.
So that matched pairing of data and evaluation, driven by expertise, is really critical, and it really positions national laboratories such as Argonne to work on this problem. Two, we have unique access to data from large-scale scientific simulations and instruments.
So Argonne has a number of user facilities, things like the Advanced Photon Source or the ALCF, the Argonne Leadership Computing Facility, which hosts one of the largest machines in the world, Aurora, and these facilities are used to produce large-scale scientific data sets. So as we produce these data sets, we in some ways have more native access to them
than people who are outside of Argonne would necessarily have. Additionally, with this unique access to data, we also have unique access to the expertise that both produced that data and is used in analyzing that data.
So having this access to data gives us a foundation that we can build on to understand how to apply these models in particular contexts. Additionally, we have the exascale computing required to achieve this, through systems like Aurora as well as Polaris. And it's well aligned with broader US Department of Energy initiatives such as FAST, which is the large AI effort that's being proposed before Congress.
So what role do data and the Data Team play in this process? In the context of Aurora GPT, we have on the order of eight teams, which I've depicted here. We have Data, the team I'm going to be talking about today; Training, which is responsible for taking the data produced by the Data Team
and then feeding it through pre-training. There's usually then what's called a post-training process, which aligns a trained model to do things like question answering; I think Alan mentioned this notion of alignment or instruct models, and that's what post-training is responsible for. Evaluation asks: how well are we doing?
And Serving, or Inference, is responsible for making these capabilities available to the larger lab community. There are also two non-technical teams, Distribution and Communication. Distribution is functionally our legal group, which helps us navigate some legal challenges, and Communication helps us distribute this information to the broader general public. So here's just a brief picture of some of the people
that are involved with the project. I'll now transition to the goals for our team specifically. The mandate that we have is to generate on the order of 20 trillion tokens of high-quality scientific text and structured data, with strong quality control, deduplication, provenance, and subsetting. As the first presenter said, a token is not necessarily a word; it could be a portion of a word or a combination of words, depending on the specific case.
This mandate manifests in a number of concrete tasks, such as PDF parsing, or LaTeX and XML to text, depending on what the data source is; preparing that data for training across a variety of different fields; and tools and workflows to understand how contamination occurs.
A good example of the latter is one of the major benchmarks that is still actively used, GSM8K. The Allen Institute for AI has said that that benchmark is probably completely leaked at this point and is basically no longer useful for evaluating LLMs.
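As a rough, hypothetical illustration of what a contamination check can look like, here is a minimal sketch that flags training documents sharing long word n-grams with a benchmark. The function names, the n-gram lengths, and the toy strings are assumptions for illustration; this is not the team's actual tooling.

```python
# Hypothetical sketch: flag training documents that share long n-grams with a
# benchmark such as GSM8K. Illustrative only -- not the AuroraGPT tooling.

def ngrams(tokens, n):
    """Return the set of all n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts, n=13):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text.lower().split(), n)
    return index

def is_contaminated(document, benchmark_index, n=13):
    """A document is flagged if it shares any long n-gram with the benchmark."""
    return bool(ngrams(document.lower().split(), n) & benchmark_index)

if __name__ == "__main__":
    benchmark = ["Natalia sold clips to 48 of her friends in April"]
    index = build_benchmark_index(benchmark, n=5)
    doc = "answer: Natalia sold clips to 48 of her friends in April, and then she ..."
    print(is_contaminated(doc, index, n=5))  # True -> likely leaked
```

In practice a check like this would run over token IDs at corpus scale, but the underlying idea is the same.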
Detecting that that's happened matters so that, when we're doing evaluation of LLMs, we understand which evaluations are fair and accurately reflect the state of the art. Then, beyond preparing the data for the training team, we also work with the inference team on things like retrieval-augmented generation, as discussed before,
as well as with the evaluation team, so that they can understand which evaluations are appropriate for the particular scientific data sets we produce. And as we continue this work, we're looking at how we go beyond traditional sequences and start to incorporate things like graphs,
climate data, images, structured grid data, unstructured grids, and tensor data into the system. Okay, so how does the Data Team do this specifically, and why can it be hard? We can think of the work as three broad categories of tasks:
converting and pre-processing, deduplication, and integration. Within conversion and pre-processing, we're converting text and data to structured, curated, and traceable formats for use in training. Why is this important? We want the data to be consistent so that it's easy to use, not just for training but also on the back end
for things like retrieval-augmented generation. As you likely know, every data source vendor produces data in completely different formats with different structures. Normalizing across the different schemas used to describe that data is a really important step for understanding which specific subsets of the data are potentially relevant for particular questions.
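To make the normalization point concrete, here is a minimal, hypothetical sketch of mapping two invented publisher layouts onto one common record; the field names and layouts are made up for illustration and are not any real vendor's schema.

```python
# Hypothetical sketch: normalize per-publisher metadata into one common
# record format. The publisher layouts and field names here are invented;
# real feeds each arrive with their own schema.
from dataclasses import dataclass

@dataclass
class Record:
    doc_id: str
    title: str
    year: int
    source: str

def from_publisher_a(raw: dict) -> Record:
    # Publisher A nests metadata under "article" and uses "pub_year".
    art = raw["article"]
    return Record(doc_id=art["doi"], title=art["title"],
                  year=int(art["pub_year"]), source="publisher_a")

def from_publisher_b(raw: dict) -> Record:
    # Publisher B is flat and stores the date as an ISO string.
    return Record(doc_id=raw["identifier"], title=raw["name"],
                  year=int(raw["date"][:4]), source="publisher_b")

if __name__ == "__main__":
    a = {"article": {"doi": "10.1000/x", "title": "Example", "pub_year": "2024"}}
    b = {"identifier": "osti:123", "name": "Another", "date": "2023-05-01"}
    print(from_publisher_a(a), from_publisher_b(b), sep="\n")
```

The value of a step like this is that everything downstream, whether training, deduplication, or retrieval, can assume a single schema.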
Two, if you have artifacts in your data (and I'll talk a little bit more about why you might), those artifacts in your training data can reduce the quality of the inferences you're able to produce from it. And lastly, traceability, reproducibility, and auditability
are really important. As we iterate on our design approach, we need to understand what experiments we have tried before and what impact they had, and to be able to trace all the way from the data set, and the individual articles that were used, to the outputs of the system, and to understand how that whole pipeline plays a role in the quality of the models we're able to produce.
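One way to picture that traceability requirement is a small provenance record carried alongside every processed document; the fields below are hypothetical and purely illustrative, not the team's actual schema.

```python
# Hypothetical sketch: a provenance record attached to each processed document
# so model outputs can be traced back to sources, parser versions, and
# processing decisions. Field names are illustrative only.
import datetime
import hashlib
import json

def provenance(source_path: str, parser: str, parser_version: str, text: str) -> dict:
    return {
        "source_path": source_path,
        "parser": parser,
        "parser_version": parser_version,
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "processed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    record = provenance("corpus/example_0001.pdf", "lightweight", "0.3.1",
                        "parsed document text ...")
    print(json.dumps(record, indent=2))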
Deduplication. Deduplication has a number of different implications here, but in short, excessive duplication in your training set can lead to a variety of problems. It can lead to memorization, which means that your model, instead of learning how to multiply or divide as described in the first presentation, learns how to regurgitate specific answers.
So for example, if you look at math specifically, you'll notice that the first presenter showed math examples that were longer form, with four digits or more of multiplication or addition. If you actually try simple multiplications and additions, things with two or three digits, in many cases the LLM can do them correctly.
What's going on there is that the model has memorized these simple cases, because there are enough examples of them out in the wild that models have effectively absorbed them. And if you look at more advanced models like Llama 3.2, or some of the other recent models, they specifically include a synthetically generated data set
with many examples of common multiplications and additions for small-digit cases, specifically to improve performance on those cases. So this memorization is potentially a problem for evaluation, but it's also a problem for generalizability. Because you're learning these specific examples,
generalizing to five- or six-digit multiplication becomes much more difficult, for some of the reasons described in the first presentation. Also, from a legal perspective, copyright concerns are a major aspect we have to think about when we're talking about duplicated data. There are a number of results showing that if your model has observed a particular substring
in your training data multiple times, I believe it's more than five or ten times, the likelihood of regurgitation becomes really, really high. The legal questions on this are not at all settled; there are major court cases going on right now. The New York Times in the United States is suing OpenAI, the maker of ChatGPT,
and there are other very similar lawsuits going on with major publishing houses. So understanding whether you're likely to produce exact output versus merely similar output actually has a lot of bearing on whether the information produced by these models counts as a facsimile for purposes of copyright law.
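To make that "seen more than a handful of times" point concrete, here is a hypothetical sketch that counts how many documents each long n-gram appears in so heavily repeated spans can be flagged; the n-gram length and threshold are illustrative assumptions, not established cutoffs.

```python
# Hypothetical sketch: count how many documents each long n-gram occurs in,
# so frequently repeated spans (higher risk of verbatim regurgitation) can be
# flagged for removal or down-weighting. Thresholds here are illustrative.
from collections import Counter

def doc_ngrams(text: str, n: int = 10) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def repeated_spans(corpus, n: int = 10, threshold: int = 10):
    counts = Counter()
    for doc in corpus:
        counts.update(doc_ngrams(doc, n))  # one count per document, not per hit
    return [(gram, c) for gram, c in counts.items() if c >= threshold]

if __name__ == "__main__":
    corpus = ["the same boilerplate license paragraph appears here word for word"] * 12
    print(repeated_spans(corpus, n=6, threshold=10)[:2])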
So this is a very interesting and important question that we have to think about. Additionally, we need to integrate with other teams to build things like services that end users will actually use. So why can this potentially be hard? While LaTeX source is available for many scientific documents, PDFs are probably the most common form of scientific text, and they're very expensive and error-prone to process.
You're very likely to get translation errors from optical character recognition (OCR) techniques, and even more advanced techniques like vision transformers still tend to produce artifacts when translating the text in PDFs into the plain text formats that are more easily processed by computers.
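A common way to cope with that cost, sketched hypothetically below, is to score the cheap parser's output and route only the suspicious-looking documents to an expensive OCR or vision-transformer pass; the heuristic and threshold here are invented for illustration and are not the AdaParse method described later.

```python
# Hypothetical sketch: route a document to an expensive parser only when the
# cheap parser's output looks artifact-laden. Heuristic and threshold are
# illustrative, not the actual AdaParse approach.

def parse_quality(text: str) -> float:
    """Crude quality score: fraction of characters that are alphanumeric,
    whitespace, or common punctuation. Low scores suggest OCR-style artifacts."""
    if not text:
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() or ch in ".,;:()'\"-" for ch in text)
    return ok / len(text)

def choose_parser(cheap_text: str, threshold: float = 0.95) -> str:
    """Keep the fast parse if it looks clean; otherwise fall back to a
    heavyweight parser (OCR / vision transformer)."""
    return "cheap" if parse_quality(cheap_text) >= threshold else "expensive"

if __name__ == "__main__":
    clean = "Deduplication at scale requires approximate methods."
    garbled = "De~du#pl1c@ti0n at sc@le re~qu!res ap#prox im@te me~th0ds."
    print(choose_parser(clean), choose_parser(garbled))  # cheap expensive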
Those parsing artifacts can then appear when you start to generate text based on the parsed documents, especially if you make the same consistent parsing error across a large number of documents. Nearly every text format is different, meaning there are a lot of different things we potentially need to integrate with. We have 14 terabytes of raw text alone, so processing anything based on 14 terabytes
of raw text is already somewhat difficult. And if you go beyond the raw text to the sources that aren't plain text, we have hundreds of terabytes, if not more, of documents that need to go through these processing pipelines. And all of that is just for text; once we talk about multimodal data, the process becomes progressively more complex
and more expensive to conduct at scale. Additionally, deduplication is computationally very expensive. A naive approach is an order n-squared algorithm, meaning we have to compare every document pairwise, which is not really scalable when you're talking about the hundreds of billions of documents used to train modern LLMs.
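To give a feel for how the all-pairs comparison is usually avoided, here is a minimal, self-contained MinHash-plus-LSH-banding sketch in which near-duplicate documents collide in shared buckets, so only colliding documents ever need a closer look. The parameters (128 hash seeds, 5-word shingles, 16 bands) are illustrative assumptions, not the pipeline's actual settings.

```python
# Hypothetical sketch: MinHash signatures plus LSH-style banding, so that
# near-duplicate documents land in the same buckets and only bucket-mates are
# ever compared, avoiding the O(n^2) all-pairs scan. Parameter choices are
# illustrative, not the AuroraGPT pipeline's settings.
import hashlib
from collections import defaultdict

NUM_PERM, SHINGLE, BANDS = 128, 5, 16
ROWS = NUM_PERM // BANDS  # 8 signature values per band

def shingles(text):
    """5-word shingles of a document."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + SHINGLE]) for i in range(len(toks) - SHINGLE + 1)}

def minhash(text):
    """One min-hash value per seeded hash function ('permutation')."""
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}|{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingles(text)))
    return sig

def lsh_buckets(docs):
    """Map (band index, band hash) -> doc ids; near-duplicates tend to collide."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, hash(band))].append(doc_id)
    return buckets

if __name__ == "__main__":
    docs = {
        "a": "parsing and deduplicating scientific text at scale for model training",
        "b": "parsing and deduplicating scientific text at scale for model training runs",
        "c": "a completely different abstract about photon source experiments run today",
    }
    candidate_pairs = {tuple(sorted(ids))
                       for ids in lsh_buckets(docs).values() if len(ids) > 1}
    print(candidate_pairs)  # expect {('a', 'b')} with high probability
```

With banding, the probability that two documents share a bucket rises sharply with their shingle overlap, which is what keeps the comparison cost roughly linear in the number of documents.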
And then, how do you do deduplication for multimodal data? That's actually a completely open question, whereas for text we at least have some idea of how to do it and which approaches are better or worse than others. And what role does data reduction play in this? For example, if you look at data coming out of CERN or the APS, they can generate from one to hundreds
of terabytes of data every day. As you're producing that volume of data, is the native format the most intuitive way to understand the data, or should we be reducing it to statistical summaries or to semantic descriptions that help us better understand
what's being represented there? And then, with multimodality, this is also a challenge for integration, because you're not just looking at text pipelines; you have to figure out, okay, how do I adapt RAG, retrieval-augmented generation, when my input is unstructured grid data? So as we look at scientific data sets
that represent the things actual scientists use, these problems become progressively more complex and more difficult to scale to large document collections. I'll end the presentation with some recent highlights from my team. In fiscal year 24, we successfully identified, parsed,
and deduplicated 12 data sets of scientific text and code. This includes the document collection that we have from OSTI, the DOE's Office of Scientific and Technical Information, as well as DOE CODE, which is a similar collection containing the code information from OSTI.
We're also incorporating data sets from scientific publishers: we have memorandums of understanding with the Association for Computing Machinery and the American Society for Microbiology, and we have additional agreements in process with vendors who are able to provide us with data sets that are otherwise
not available on the open web. (One minute, sorry, one minute. Yep, yes, thank you.) This will help us understand to what extent these sources are important, which plays into things like fair use arguments in this context. Additionally, we've done some work on scalable parsing for high-quality data sets; that work is called AdaParse.
There's more that I could say here, but the key takeaway is that PDF parsing can be very expensive, especially good PDF parsing, and you don't have to use a high-quality PDF parser in every case; when you do, though, it's important to be able to discern which cases those are. So we created a hybrid parsing technique
that combines lightweight parsers with heavyweight parsers to achieve the highest BLEU score, the highest ROUGE score, and the highest rate of accepted tokens of any technique developed so far. We've also done some work on scalable deduplication with a technique that we call LSH Bloom, which allows us to handle extremely large data sets.
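As I understand the idea from the description above, and only as a hypothetical sketch rather than the actual LSH Bloom implementation, the trick is to replace the usual per-band hash tables of an LSH deduplicator with Bloom filters, so that asking "have we seen this band signature before?" costs a few bits per document instead of storing every key. The filter sizes and the decision rule below are my own illustrative assumptions.

```python
# Minimal sketch of the *idea* behind combining LSH banding with Bloom
# filters: each band gets a Bloom filter, and a document is flagged as a
# duplicate if any of its band signatures has been seen before. Sizes and
# the decision rule are illustrative, not the actual LSH Bloom design.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=7):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, item: bytes):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(item, salt=i.to_bytes(8, "big"), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.n_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def is_duplicate(band_signatures, filters):
    """band_signatures[i] is the bytes-encoded signature of LSH band i (for
    example, derived from the MinHash sketch shown earlier). A document is
    treated as a duplicate if any band was seen before; otherwise its bands
    are recorded."""
    seen = any(sig in filters[i] for i, sig in enumerate(band_signatures))
    if not seen:
        for i, sig in enumerate(band_signatures):
            filters[i].add(sig)
    return seen

if __name__ == "__main__":
    filters = [BloomFilter() for _ in range(16)]             # one filter per band
    doc1 = [f"band{i}:sigA".encode() for i in range(16)]     # toy signatures
    doc2 = [f"band{i}:sigA".encode() for i in range(16)]     # identical bands
    print(is_duplicate(doc1, filters), is_duplicate(doc2, filters))  # False True
```

Bloom filters can return false positives, so occasionally a non-duplicate would be discarded; sizing the filters keeps that rate low, and the payoff is the kind of storage reduction described next.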
LSH Bloom produces some of the most accurate deduplication results of all the methods we've tried, while using substantially less computing time and storage than any of the other techniques we've used. If you apply it to Dolma, we would take the deduplication process from 77 days down to 30,
and we can take the amount of storage required from hundreds of terabytes down to tens of terabytes, which is a really dramatic saving. There's still headroom for improving performance even further by porting some of our key components from Python to Rust: porting the hottest loop of LSH Bloom
from Python to Rust resulted in a 200x improvement on that operation, which works out to an 11x improvement overall. So there's a lot more room to further optimize performance here. And with that, I'll turn it back over to our hosts.