Electronic Theses and Dissertations: A Research Corpus of Scholarly Big Data
Formal Metadata
Number of Parts | 30
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/59869 (DOI)
Transcript: English (auto-generated)
00:02
Hello, my name is Bill Ingram, from the University Libraries at Virginia Tech. I'm presenting our poster, "Electronic Theses and Dissertations: A Research Corpus of Scholarly Big Data," for the 24th International Conference on Grey Literature. Our work is motivated by the need for a large multidisciplinary corpus of scholarly
00:24
text to facilitate research investigations in text mining and natural language processing. Thanks to the efforts of university libraries, graduate programs, and the open repository movement, millions of electronic theses and dissertations are publicly disseminated online.
00:45
This enormous volume of scholarship exhibits many interesting characteristics that make it valuable for developing new technologies based on computational analysis of academic writing. We have constructed a large document corpus consisting of full-text PDFs and metadata
01:05
for more than half a million ETDs retrieved from university institutional repositories across the United States. Digital archives of scholarly publications have been used to support research, but ETDs
01:20
are unique in that they are much longer than most conference papers and journal articles. ETDs contain novel ideas and findings that contribute significantly to their authors' subject areas. They often contain useful figures, tables, and equations, as well as extensive literature reviews, bibliographies, and links to other publications.
01:43
Because ETDs are gray literature, access to them is not controlled by commercial publishers; copyright belongs to the authors, and most are disseminated under permissive licenses. Our ETD corpus supports research projects conducted by librarians, computer science
02:02
faculty, undergraduates, master's students, and doctoral students studying natural language processing, information retrieval, bibliometrics, language modeling, and other areas of investigation related to scholarly big data. So far, analysis of the ETD corpus has aided the creation of new models for extracting
02:23
figures and tables from academic papers, segmenting long documents into chapters and sections, topic modeling, document classification, and summarization, as well as improved digital library user interfaces.
02:41
Thank you for your time. Please feel free to contact us with questions.