Classification and extraction of information from ETD documents

Identifying methods to accurately extract information (citations, tables, figures, etc.) from ETDs and to classify the documents into the ProQuest subject category system.

Abstract

In recent years, advances in natural language processing, machine learning, and neural networks have led to powerful tools for digital libraries, allowing library collections to be discovered, used, and reused in exciting new ways. However, these new tools and techniques are not well adapted to long documents such as electronic theses and dissertations (ETDs). This report describes three areas of study aimed at improving access to ETDs. Our first goal is to use machine learning to automatically assign subject categories to these documents. Our second goal is to employ a neural network approach to parsing bibliographic data from reference strings. Our third goal is to use deep learning to identify and extract figures and their captions from ETDs.

Approach:

Our approach to parsing citations from ETDs is to research and use two kinds of state-of-the-art NLP tools: (1) those that already aim to accomplish the same goal, and (2) those that provide information that can be used as features to “define” the context of citations (for example, dependency and semantic parsers, or word embeddings).
Our approach to figure, table, and caption extraction is to evaluate how current state-of-the-art tools built for the same goal perform on our dataset of ETDs. We will then try to improve the model by identifying the instances where it fails.
Our approach to classification is to drop the topmost level of the ProQuest subject categories and keep the next two levels. We will train a neural network on ETD metadata and abstracts, and test whether adding full-text data improves classification.
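The label preparation step for classification (drop the topmost ProQuest level, keep the next two) could be sketched roughly as follows. This is only an illustration: the `>`-separated path format, the function name `trim_category`, and the example paths are assumptions made for this sketch and are not taken from the project code or from the actual ProQuest category list.

```python
def trim_category(path, sep=">", keep_levels=2):
    """Drop the topmost level of a category path and keep the next levels.

    `path` is assumed to be a string such as "Top > Middle > Leaf".
    Returns a tuple with at most `keep_levels` entries.
    """
    levels = [part.strip() for part in path.split(sep) if part.strip()]
    # levels[0] is the topmost category, which is discarded.
    return tuple(levels[1:1 + keep_levels])


# Made-up example paths (not actual ProQuest entries).
paths = [
    "Area A > Field B > Subfield C",
    "Area A > Field B > Subfield D > Detail E",
    "Area F > Field G",
]

# Each ETD would then be labeled with the trimmed tuple.
labels = [trim_category(p) for p in paths]
for path, label in zip(paths, labels):
    print(path, "->", label)
```

Paths with fewer than three levels yield shorter labels, so a real pipeline would need to decide how to handle them (for example, by padding or by filtering those documents out).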

Related Projects:

Big Data Text Summarization (Fall 2018):
Ashish Baghudana's text summarization project
Neural-ParsCit

Description: Many existing techniques for processing digital documents do not extend well to book-length documents such as theses and dissertations. There is therefore a need for techniques that can extract information from book-length documents.

Our project will consist of three areas:

Citation Parsing: As part of the project, we aim to accurately extract citations from ETDs using various NLP tools. Furthermore, we aim to identify particular pieces of information within the citations, such as the author names. Ideally, we hope to use and adapt Neural-ParsCit to accomplish these tasks.
Figure and caption extraction: As part of the project, we aim to accurately extract the figures, tables, and corresponding captions from our collection of ETDs. Ideally, we hope to use and adapt DeepFigures to accomplish this task.
Categorization: As part of the project, we aim to perform multi-class classification of ETD documents using the ProQuest subject categories as the target classification system.
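To make the citation parsing goal concrete, the sketch below shows the kind of output we are after: a reference string split into fields such as author names, year, and title. It is a naive regular-expression baseline, not Neural-ParsCit, and it assumes one rigid, hypothetical reference style ("Authors (Year). Title. Venue."). The pattern, the function name `parse_reference`, and the sample reference are all made up for illustration.

```python
import re

# Assumed reference style: "A. Author, B. Author, and C. Author (2019). Title. Venue."
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+(?P<venue>.+?)\.?$"
)

# Authors are assumed to be separated by commas and/or the word "and".
AUTHOR_SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")


def parse_reference(reference):
    """Return a dict of fields, or None if the string does not fit the style."""
    match = REFERENCE_PATTERN.match(reference.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["authors"] = [
        name for name in AUTHOR_SEPARATOR.split(fields["authors"]) if name
    ]
    return fields


# Made-up reference string used only to exercise the sketch.
sample = "J. Doe, A. Smith, and B. Lee (2019). A made-up title. Journal of Examples."
print(parse_reference(sample))
```

A rule like this breaks as soon as the citation style changes, which is common across ETDs from different departments and years. That brittleness is the motivation for a learned sequence-labeling approach such as Neural-ParsCit.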

Data: The Virginia Tech collection of ETDs, downloaded from "ETDs: Virginia Tech Electronic Theses and Dissertations".

Tools:

Project Team

John Aromando (@JAromando)
Bipasha Banerjee (@Bipasha-banerjee)
Bill Ingram (@waingram)
Palakh Jude (@palakhjude)
Sampanna Kahu (@sampannakahu)
