DBLP Dataset Parser

This is a Python parser for the DBLP dataset. The XML dump of the dataset can be downloaded here from the DBLP homepage.

This parser requires the DTD file, so make sure you have both the dblp-XXX.xml (dataset) and dblp-XXX.dtd files. Both files must be in the same directory, and the name of the DTD file must match the name given in the <!DOCTYPE> tag of the XML file. You can check that name with the head dblp-XXX.xml command, as shown below:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
<dblp>
<phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
<author>Carmen Heine</author>
<title>Modell zur Produktion von Online-Hilfen.</title>
...
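The same check can be scripted. The following is a minimal sketch, not part of the parser itself: the helper names `find_dtd` and `dtd_is_present` are hypothetical, and it assumes the DOCTYPE declaration appears within the first few kilobytes of the file, as in the sample above.

```python
import re
from pathlib import Path

# Matches e.g. <!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd"> and captures the DTD name.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+\w+\s+SYSTEM\s+["\']([^"\']+)["\']')


def find_dtd(xml_path, max_bytes=4096):
    """Return the expected DTD path (next to the XML file), or None if no DOCTYPE is found.

    Hypothetical helper: only the first `max_bytes` bytes of the file are inspected.
    """
    xml_path = Path(xml_path)
    with open(xml_path, 'rb') as f:
        # The DBLP dump declares ISO-8859-1, which can decode any byte sequence.
        head = f.read(max_bytes).decode('iso-8859-1')
    match = DOCTYPE_RE.search(head)
    if match is None:
        return None
    return xml_path.parent / match.group(1)


def dtd_is_present(xml_path):
    """Return True if the DTD named in the DOCTYPE exists in the XML file's directory."""
    dtd_path = find_dtd(xml_path)
    return dtd_path is not None and dtd_path.is_file()
```

Calling `dtd_is_present('dataset/dblp.xml')` before parsing would then report whether the DTD named in the dump is in place.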

Sample usage of the parser:

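def main():
    dblp_path = 'dataset/dblp.xml'  # DBLP XML dump; the DTD must sit in the same directory
    save_path = 'article.json'      # output file for the extracted articles
    try:
        # Check that the dump can be loaded before starting the actual parsing
        context_iter(dblp_path)
        log_msg("LOG: Successfully loaded \"{}\".".format(dblp_path))
    except IOError:
        log_msg("ERROR: Failed to load file \"{}\". Please check your XML and DTD files.".format(dblp_path))
        exit()
    # Extract the articles and save them; JSON is the default output format
    parse_article(dblp_path, save_path, save_to_csv=False)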

Some extracted results:

Number of publications of each type:

Number of occurrences of each attribute across all publications:

Counts of five different features of articles:

Distribution of articles by publication year:
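As an illustration of how a count like the first one above could be produced, here is a minimal sketch that tallies the top-level record types in a streaming fashion. It is not the parser's own code: the function name `count_publication_types` is hypothetical, and it uses the standard library's `xml.etree.ElementTree`, which does not load the DTD. A real DBLP dump contains character entities defined in the DTD, so it would need a DTD-aware parser instead (for example lxml's `iterparse` with `load_dtd=True`).

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_publication_types(source):
    """Count the direct children of the root element by tag name.

    Hypothetical helper. `source` is a file path or a binary file object.
    """
    counts = Counter()
    depth = 0
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            depth += 1
        else:
            depth -= 1
            if depth == 1:
                # A record directly under the root (article, phdthesis, ...) just closed.
                counts[elem.tag] += 1
                elem.clear()  # drop the record's children and text to limit memory use
    return counts


# Tiny in-memory sample without entities, so the standard library parser can read it.
sample = io.BytesIO(
    b'<dblp>'
    b'<article key="a"><title>T1</title></article>'
    b'<phdthesis key="b"><title>T2</title></phdthesis>'
    b'<article key="c"><title>T3</title></article>'
    b'</dblp>'
)
type_counts = count_publication_types(sample)
print(type_counts)
```

Tracking the nesting depth keeps nested elements such as `<title>` or `<author>` out of the tally, so only whole publication records are counted.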

26hzhang/DBLPParser