Visual Exploration of Association Rules

The goal of this visualization is to support interactive browsing of association rules between the named entities of an RDF graph through three synchronized, complementary visualization techniques.



Demo Information

About the Dataset

The input dataset corresponds to the output of a mining algorithm that discovers association rules within an RDF graph. This algorithm is applied to the CovidOnTheWeb dataset, derived from the CORD-19 dataset, which, at the time of this study, contained around 50,000 scientific publications on coronavirus-related diseases. The CovidOnTheWeb dataset comprises two graphs: the CORD-19 Named Entities Knowledge Graph and the CORD-19 Argumentative Knowledge Graph.
To extract the association rules, we used the CORD-19 Named Entities Knowledge Graph, which describes the named entities found in the publications and links them to the DBpedia, Wikidata, and BioPortal datasets. At the moment, we use only the named entities linked to Wikidata, and we consider only publications from 1990 to 2020.

About the Algorithm (Cadorel and Tettamanzi, 2020)

Input data

The dataset is pre-processed and transformed into a transaction matrix whose rows represent the publications and whose columns represent the named entities. Each cell holds a binary value indicating whether the named entity appears in the publication (1) or not (0).
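As a rough illustration of this layout, the sketch below builds such a binary matrix from a toy mapping of publications to named entities. The data and the helper name build_transaction_matrix are hypothetical and are not taken from the project's code.

```python
# Hypothetical toy data: each publication maps to the named entities found in it.
publications = {
    "paper_1": {"coronavirus", "China", "pneumonia"},
    "paper_2": {"coronavirus", "China"},
    "paper_3": {"coronavirus", "vaccine"},
    "paper_4": {"influenza", "vaccine"},
}


def build_transaction_matrix(pubs):
    """Return (row labels, column labels, matrix of 1/0 cells)."""
    rows = sorted(pubs)                             # one row per publication
    columns = sorted(set().union(*pubs.values()))   # one column per named entity
    matrix = [
        [1 if entity in pubs[paper] else 0 for entity in columns]
        for paper in rows
    ]
    return rows, columns, matrix


rows, columns, matrix = build_transaction_matrix(publications)

# Print the matrix with its row labels for inspection.
print(columns)
for paper, cells in zip(rows, matrix):
    print(paper, cells)
```

Each row can then be read as one transaction: the set of named entities whose cell is 1.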

The algorithm

The main goal of the algorithm is to derive association rules between named entities. As an illustrative example, take the named entity coronavirus: in 70% of the publications where it appears, one also finds the named entity China.
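The 70% figure in this example corresponds to what is usually called the confidence of the rule coronavirus -> China. A minimal sketch of that computation, on made-up transactions chosen so that the result is 0.7, could look like this (the data and function names are assumptions for illustration only):

```python
# Hypothetical transactions: each set holds the named entities of one publication.
transactions = (
    [{"coronavirus", "China"}] * 7   # both entities appear together
    + [{"coronavirus"}] * 3          # coronavirus without China
    + [{"influenza"}] * 2            # neither entity
)


def count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


conf = confidence(transactions, {"coronavirus"}, {"China"})
print(f"confidence(coronavirus -> China) = {conf:.2f}")
```

Here 7 of the 10 publications mentioning coronavirus also mention China, so the rule has a confidence of 0.7.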
We use the FP-Growth implementation from the Python library mlxtend (module mlxtend.frequent_patterns), created by Sebastian Raschka.

Output data

The output data (represented in the visualization) consists of a table whose rows correspond to rules and whose columns contain descriptive variables.
In order to keep only non-evident rules, we applied filtering methods.
The resulting dataset contains a total of 1772 rules extracted through different clustering approaches.

Publications


People involved in the project