
Electronic Theses and Dissertations: A Research Corpus of Scholarly Big Data


Formal Metadata

Title
Electronic Theses and Dissertations: A Research Corpus of Scholarly Big Data
Title of Series
Number of Parts
30
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Computer animation
Transcript: English (auto-generated)
Hello, my name is Bill Ingram, from the University Libraries at Virginia Tech. I'm presenting our poster, "Electronic Theses and Dissertations: A Research Corpus of Scholarly Big Data," for the 24th International Conference on Grey Literature.
Our work is motivated by the need for a large multidisciplinary corpus of scholarly text to facilitate research in text mining and natural language processing. Thanks to the efforts of university libraries, graduate programs, and the open repository movement, millions of electronic theses and dissertations (ETDs) are publicly disseminated online.
This enormous volume of scholarship exhibits many interesting characteristics that make it valuable for developing new technologies based on computational analysis of academic writing. We have constructed a large document corpus consisting of full-text PDFs and metadata for more than half a million ETDs retrieved from university institutional repositories across the United States.
Digital archives of scholarly publications have been used to support research, but ETDs are unique in that they are much longer than most conference papers and journal articles. ETDs contain novel ideas and findings that contribute significantly to their authors' subject areas. They often contain useful figures, tables, and equations, as well as extensive literature reviews, bibliographies, and links to other publications.
Because ETDs are gray literature, access to them is not controlled by commercial publishers; copyright belongs to the authors, and most are disseminated under permissive copyright licenses.
Our ETD corpus supports research projects conducted by librarians, computer science faculty, undergraduates, master's students, and doctoral students studying natural language processing, information retrieval, bibliometrics, language modeling, and other areas of investigation related to scholarly big data. So far, analysis of the ETD corpus has aided the creation of new models for extracting figures and tables from academic papers and for segmenting long documents into chapters and sections, as well as work on topic modeling, document classification, summarization algorithms, and improved digital library user interfaces.
Thank you for your time. Please feel free to contact us with questions.