Datasets and code used in my DMKD Provenance Network Analytics paper

trungdong/datasets-provanalytics-dmkd

Provenance Network Analytics Datasets

This repository provides the datasets used in the Provenance Network Analytics paper and the code for its analyses. The code was also used to generate the charts shown in our paper. Please note that the information provided here is meant to accompany the paper, where the analytic method is described in more detail.

Overview

Provenance network analytics is a novel data analytics approach that helps infer properties of data, such as quality or trustworthiness, from their provenance. Instead of analysing application data, which are typically domain-dependent, it analyses the data's provenance as represented using the World Wide Web Consortium's domain-agnostic PROV data model. Specifically, the approach proposes a number of provenance network metrics (PNM) and applies machine learning techniques over these metrics to build predictive models for some key properties of data. Applying this method to the provenance of real-world data from three different applications, we show that provenance network analytics can, with high levels of accuracy, identify the owners of provenance documents, assess the trustworthiness of crowdsourced data, and identify instructions from chat messages in an alternate-reality game.

The notebooks and the accompanying datasets provided in this repository demonstrate how the method can be applied in a number of domains as a useful and generic tool for data analytics.
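To make the idea of "machine learning over metrics" concrete, here is a minimal sketch of the general pattern: each provenance graph is reduced to a vector of metrics, and a classifier is trained on those vectors. It is not the method used in the paper. A simple nearest-centroid classifier is swapped in purely for illustration, and the metric values and the three-column layout are made up (only the `u_1`, `u_2` label style comes from this repository).

```python
from collections import defaultdict


def train_centroids(rows, labels):
    """Average the metric vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vector, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((c - v) ** 2 for c, v in zip(centroid, vector))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Made-up vectors of three hypothetical metrics per graph; NOT real data.
toy_metrics = [[10, 12, 3], [11, 14, 3], [40, 75, 6], [42, 80, 7]]
toy_labels = ["u_1", "u_1", "u_2", "u_2"]

centroids = train_centroids(toy_metrics, toy_labels)
prediction = predict(centroids, [12, 13, 3])
```

The notebooks remain the reference for the models and evaluation procedure actually reported in the paper.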

Installation

You do not need to install anything to view the notebooks provided in this repository (linked below). However, if you want to re-run the code on the datasets, you will need to install the required Python packages listed in requirements.txt.

The code provided with the datasets was run on Python 3.6. It may run on other Python versions, but this is not guaranteed. To install the required packages, run the following command with pip.

pip install -r requirements.txt
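Since the code was only run on Python 3.6, it can help to check the interpreter version before re-running the experiments. The following is a small sketch of such a check. The helper name `is_tested_version` is hypothetical and not part of this repository.

```python
import sys

TESTED_VERSION = (3, 6)  # the Python version this README says the code was run on


def is_tested_version(version_info=None):
    """Return True if the major.minor version matches the tested one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == TESTED_VERSION


if not is_tested_version():
    # Only a warning: the code may still work on other versions.
    print(
        "Warning: running on Python %d.%d; the code was run on Python 3.6."
        % tuple(sys.version_info[:2]),
        file=sys.stderr,
    )
```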

Provenance Datasets

We use three datasets in our paper, which are listed below. Each dataset contains a number of provenance graphs and their labels. Due to privacy issues, we do not provide the actual provenance graphs; we provide only the provenance network metrics calculated from those graphs (which are used in our analyses).

  1. Provenance documents on ProvStore:
    • provstore/data.csv: the PNM of provenance documents uploaded to ProvStore and their corresponding owners (anonymised as u_1, u_2, ...)
  2. Provenance of CollabMap data:
    • collabmap/trust_values.csv: the trust value of each data entity from CollabMap (identified by the id column).
    • collabmap/depgraphs.csv: the PNM of the provenance dependency graph of each data entity. (See our paper for the definition of a provenance dependency graph)
    • collabmap/ancestor-graphs.csv: the PNM of the (historical) provenance graph of each data entity (i.e. the graph that records how it was generated).
  3. Provenance from the Radiation Response Game (RRG):
    • rrg/depgraphs-k.csv, e.g. rrg/depgraphs-5.csv: the PNM of the level-k provenance dependency graph of an RRG chat message (k = 1..18).
    • rrg/depgraphs.csv: the PNM of the full dependency graph of an RRG chat message (i.e. without restricting a dependency graph to k edges away from a message entity).
    • rrg/ancestor-graphs.csv: the PNM of the (historical) provenance graph of the messages.
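As a rough sketch of how these files fit together, the code below reads CSV rows, joins metric rows to trust values on a shared identifier, and builds the list of per-level RRG file names. It rests on several assumptions: the source states only that collabmap/trust_values.csv has an `id` column, so the presence of `id` in the metric files and the other column names (`nodes`, `edges`, `trust_value`) are hypothetical. In-memory strings stand in for the real files.

```python
import csv
import io


def read_rows(source):
    """Read an open CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(source))


def join_on_id(metrics_rows, trust_rows, key="id"):
    """Attach trust-value columns to the metric rows that share the same key."""
    trust_by_id = {row[key]: row for row in trust_rows}
    joined = []
    for row in metrics_rows:
        match = trust_by_id.get(row[key])
        if match is not None:  # keep only entities present in both files
            joined.append({**row, **match})
    return joined


def rrg_depgraph_paths(max_k=18):
    """File names of the level-k RRG dependency graph metrics, k = 1..max_k."""
    return [f"rrg/depgraphs-{k}.csv" for k in range(1, max_k + 1)]


# In-memory stand-ins with made-up values and assumed column names.
metrics_csv = io.StringIO("id,nodes,edges\ne1,10,12\ne2,40,75\n")
trust_csv = io.StringIO("id,trust_value\ne1,0.9\ne3,0.2\n")

joined = join_on_id(read_rows(metrics_csv), read_rows(trust_csv))
paths = rrg_depgraph_paths()
```

With the real data, the same helpers would be called on files opened with `open("collabmap/depgraphs.csv", newline="")`, after checking the actual column names in each file's header row.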

IPython Notebooks

The notebooks below provide the code for the analysis of the above datasets as reported in our paper. They detail the steps we took in our experiments and also show their results.

We also provide extra materials here to help with replicating the experiments and to document additional experiments we carried out, which are not included in the paper due to space constraints.
