KGTopologyToolbox walk-through

In this notebook we give a general overview of the classes and methods included in the kg-topology-toolbox library and explain how to use them to extract topological data from any knowledge graph. As an example, we use the open-source biomedical dataset ogbl-biokg.

Dependencies

[1]:

import sys
!{sys.executable} -m pip uninstall -y kg_topology_toolbox
!pip install -q git+https://github.com/graphcore-research/kg-topology-toolbox.git --no-cache-dir
!pip install -q jupyter ipywidgets ogb seaborn

Found existing installation: kg-topology-toolbox 1.0.0
Uninstalling kg-topology-toolbox-1.0.0:
  Successfully uninstalled kg-topology-toolbox-1.0.0

[1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import ogb.linkproppred
from kg_topology_toolbox import KGTopologyToolbox

dataset_directory = "../../../data/ogb-biokg/"

Data preparation

We load the OGBL-BioKG dataset using the ogb.linkproppred.LinkPropPredDataset class and store all (h, r, t) triples in a pandas DataFrame.

[2]:

dataset = ogb.linkproppred.LinkPropPredDataset(
    name="ogbl-biokg", root=dataset_directory
)

all_triples = []
for split in dataset.get_edge_split().values():
    all_triples.append(np.stack([split["head"], split["relation"], split["tail"]]).T)
biokg_df = pd.DataFrame(
    np.concatenate(all_triples).astype(np.int32), columns=["h", "r", "t"]
)
biokg_df

[2]:

	h	r	t
0	1718	0	3207
1	4903	0	13662
2	5480	0	15999
3	3148	0	7247
4	10300	0	16202
...	...	...	...
5088429	2451	50	5097
5088430	6456	50	8833
5088431	9484	50	15873
5088432	6365	50	496
5088433	13860	50	6368

5088434 rows × 3 columns

Based on this representation of the knowledge graph, we can proceed to instantiate the KGTopologyToolbox class to compute topological properties.

[6]:

kgtt = KGTopologyToolbox(biokg_df)

/nethome/albertoc/research/knowledge_graphs/kg-topology-toolbox/.venv/lib/python3.10/site-packages/kg_topology_toolbox/utils.py:42: UserWarning: The Knowledge Graph contains duplicated edges -- some functionalities may produce incorrect results
  warnings.warn(

Notice the warning raised by the constructor, which detects duplicated edges in the biokg_df DataFrame: to ensure optimal functionalities, duplicated edges should be removed before instantiating the KGTopologyToolbox class.

[8]:

# find duplicated edges
biokg_df.loc[biokg_df.duplicated(keep=False)]

[8]:

	h	r	t
3854407	1972	45	1972
4000534	1972	45	1972

Node-level analysis

The method node_degree_summary provides a summary of the degrees of each individual node in the knowledge graph. The returned DataFrame is indexed on the node ID.

h_degree is the number of edges coming out from the node;
t_degree is the number of edges going into the node;
tot_degree is the number of edges that use the node as either head or tail;
h_unique_rel (resp., t_unique_rel) is the number of unique relation types that come out from (resp., go into) the node;
n_loops is the number of loop edges around the node.

[5]:

node_ds = kgtt.node_degree_summary()
node_ds

[5]:

	h_degree	t_degree	tot_degree	h_unique_rel	t_unique_rel	n_loops
0	27	72	99	4	4	0
1	14	94	108	3	6	0
2	208	95	303	5	7	0
3	28999	26154	55153	10	11	0
4	362	302	664	11	12	0
...	...	...	...	...	...	...
45080	21	22	43	1	1	0
45081	29	32	61	1	1	0
45082	28	30	58	1	1	0
45083	17	19	36	1	1	0
45084	28	31	59	1	1	0

45085 rows × 6 columns

[6]:

metrics = [
    "h_degree",
    "t_degree",
]
fig, ax = plt.subplots(1, len(metrics), figsize=(4.5 * len(metrics), 4))

for i, metric in enumerate(metrics):
    x = np.log2(node_ds[metric])
    sns.histplot(
        x=x, stat="probability", binwidth=1, binrange=[0, x.max() + 1], ax=ax[i]
    )
    ax[i].set_xlabel(f"log2({metric})")
plt.tight_layout()

/nethome/albertoc/research/knowledge_graphs/kg-topology-toolbox/.venv/lib/python3.10/site-packages/pandas/core/arraylike.py:399: RuntimeWarning: divide by zero encountered in log2
  result = getattr(ufunc, method)(*inputs, **kwargs)

../_images/notebooks_ogb_biokg_demo_11_1.png

[7]:

metrics = [
    "h_unique_rel",
    "t_unique_rel",
]
fig, ax = plt.subplots(1, len(metrics), figsize=(4.5 * len(metrics), 4))

for i, metric in enumerate(metrics):
    x = node_ds[metric]
    sns.histplot(
        x=x, stat="probability", binwidth=1, binrange=[0, x.max() + 1], ax=ax[i]
    )
    ax[i].set_xlabel(f"{metric}")
plt.tight_layout()

../_images/notebooks_ogb_biokg_demo_12_0.png

Edge-level analysis

Edge degrees and cardinality

The method edge_degree_cardinality_summary provides, for each edge (h, r, t) in the KG, detailed information on the connectivity patterns of the head and tail nodes:

h_unique_rel (resp. t_unique_rel) is the number of unique relation types coming out of the head node (resp. going into the tail node);
h_degree is the out-degree of the head node and h_degree_same_rel is the degree when only considering edges of the same relation type r;
t_degree is the in-degree of the tail node and t_degree_same_rel is the degree when only considering edges of the same relation type r;
tot_degree is the total number of edges with either head entity h or tail entity t (in particular, tot_degree <= h_degree + t_degree); tot_degree_same_rel is computed only considering edges of the same relation type r;
triple_cardinality is the cardinality type of the edge:
- one-to-one (1:1) if h_degree = 1, t_degree = 1;
- one-to-many (1:M) if h_degree > 1, t_degree = 1;
- many-to-one (M:1) if h_degree = 1, t_degree > 1;
- many-to-many (M:M) if h_degree > 1, t_degree > 1.
triple_cardinality_same_rel is defined as triple_cardinality but using h_degree_same_rel, t_degree_same_rel.

[8]:

edge_dcs = kgtt.edge_degree_cardinality_summary()
edge_dcs

[8]:

	h	r	t	h_degree	h_unique_rel	h_degree_same_rel	t_degree	t_unique_rel	t_degree_same_rel	tot_degree	tot_degree_same_rel	triple_cardinality	triple_cardinality_same_rel
0	1718	0	3207	191	5	116	46	6	14	236	129	M:M	M:M
1	4903	0	13662	544	8	33	1975	9	50	2518	82	M:M	M:M
2	5480	0	15999	108	3	5	72	4	22	179	26	M:M	M:M
3	3148	0	7247	110	4	99	673	11	271	782	369	M:M	M:M
4	10300	0	16202	414	4	315	148	6	31	561	345	M:M	M:M
...	...	...	...	...	...	...	...	...	...	...	...	...	...
5088429	2451	50	5097	636	5	272	803	10	272	1437	543	M:M	M:M
5088430	6456	50	8833	743	10	259	371	10	100	1111	358	M:M	M:M
5088431	9484	50	15873	652	8	213	486	6	163	1135	375	M:M	M:M
5088432	6365	50	496	922	9	277	618	19	173	1537	449	M:M	M:M
5088433	13860	50	6368	485	7	175	455	8	147	939	321	M:M	M:M

5088434 rows × 13 columns

The data on the distribution of degrees and cardinalities can be then easily visualized.

[9]:

# Edge frequency when binning by head and tail degree

metrics = [("h_degree", "t_degree"), ("h_degree_same_rel", "t_degree_same_rel")]
fig, ax = plt.subplots(1, len(metrics), figsize=[5 * len(metrics), 4.5])

for i, (group_metric_1, group_metric_2) in enumerate(metrics):
    df_empty = pd.DataFrame(
        columns=np.int32(2 ** np.arange(15)), index=np.int32(2 ** np.arange(15))
    )
    df_tmp = edge_dcs[[group_metric_1, group_metric_2]]
    df_tmp.insert(
        0,
        f"log_{group_metric_1}",
        np.int32(2 ** np.floor(np.log2(df_tmp[group_metric_1]))),
    )
    df_tmp.insert(
        0,
        f"log_{group_metric_2}",
        np.int32(2 ** np.floor(np.log2(df_tmp[group_metric_2]))),
    )
    df_tmp = (
        df_tmp.groupby([f"log_{group_metric_1}", f"log_{group_metric_2}"])
        .count()
        .reset_index()
    )
    df_tmp[group_metric_1] /= df_tmp[group_metric_1].sum()
    sns.heatmap(
        df_tmp.reset_index()
        .pivot(
            columns=f"log_{group_metric_2}",
            index=f"log_{group_metric_1}",
            values=group_metric_1,
        )
        .combine_first(df_empty),
        annot=False,
        vmin=0,
        vmax=0.05,
        ax=ax[i],
    )
    ax[i].set_xlabel(group_metric_2)
    ax[i].set_ylabel(group_metric_1)
    ax[i].invert_yaxis()
fig.suptitle("Edge frequency")
plt.tight_layout()

../_images/notebooks_ogb_biokg_demo_16_0.png

[10]:

metrics = ["triple_cardinality", "triple_cardinality_same_rel"]
fig, ax = plt.subplots(1, len(metrics), figsize=[4 * len(metrics), 4])

for i, metric in enumerate(metrics):
    sns.countplot(
        x=edge_dcs[metric],
        order=["1:1", "1:M", "M:1", "M:M"],
        stat="probability",
        ax=ax[i],
    )
fig.suptitle("Cardinality distribution")
plt.tight_layout()

../_images/notebooks_ogb_biokg_demo_17_0.png

Edge topological patterns

KGTopologyToolbox also allows us to perform a topological analysis at the edge level, using the method edge_pattern_summary, which extracts information on several significant edge topological patterns. In particular, it detects whether the edge (h,r,t) is a loop, is symmetric or has inverse, inference, composition (directed and undirected):

For inverse/inference, the method also provides the number and types of unique relations r' realizing the counterpart edges; for composition, the number of triangles supported by the edge is provided.

[11]:

edge_eps = kgtt.edge_pattern_summary()
edge_eps

[11]:

	h	r	t	is_loop	is_symmetric	has_inverse	n_inverse_relations	inverse_edge_types	has_inference	n_inference_relations	inference_edge_types	has_composition	has_undirected_composition	n_triangles	n_undirected_triangles
0	1718	0	3207	False	False	False	0	[]	False	0	[0]	False	True	0	15
1	4903	0	13662	False	False	False	0	[]	False	0	[0]	True	True	44	153
2	5480	0	15999	False	False	False	0	[]	False	0	[0]	False	True	0	1
3	3148	0	7247	False	False	False	0	[]	False	0	[0]	True	True	10	29
4	10300	0	16202	False	False	False	0	[]	False	0	[0]	True	True	3	79
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5088429	2451	50	5097	False	False	True	1	[46]	True	1	[46, 50]	True	True	1532	5722
5088430	6456	50	8833	False	False	True	2	[45, 46]	True	2	[45, 46, 50]	True	True	234	913
5088431	9484	50	15873	False	False	True	1	[46]	True	2	[46, 45, 50]	True	True	1326	5004
5088432	6365	50	496	False	False	True	2	[45, 46]	True	2	[45, 46, 50]	True	True	1433	5554
5088433	13860	50	6368	False	False	False	0	[]	False	0	[50]	True	True	119	489

5088434 rows × 15 columns

If we need to identify the different metapaths [r_1, r_2] that give triangles (h,r1,x) - (x,r2,t) over an edge (h,r,t), we can do so by setting return_metapath_list=True in the call of edge_pattern_summary. In order to disaggregate the total number of triangles over an edge into separate counts for each existing metapath, the edge_metapath_count method should be used instead.

We can now easily produce a global view of the distribution of topological properties.

[12]:

print("Fraction of triples with property:")
edge_eps[
    ["is_loop", "is_symmetric", "has_inverse", "has_inference", "has_composition"]
].mean()

Fraction of triples with property:

[12]:

is_loop            0.000011
is_symmetric       0.713743
has_inverse        0.409704
has_inference      0.410111
has_composition    0.997605
dtype: float64

[13]:

metrics = [
    "n_inverse_relations",
    "n_inference_relations",
    "n_triangles",
    "n_undirected_triangles",
]
fig, ax = plt.subplots(2, 2, figsize=(9, 7))

for axn, metric in zip(ax.flatten(), metrics):
    x = np.sqrt(edge_eps[metric])
    sns.histplot(x=x, stat="probability", binwidth=1, binrange=[0, x.max() + 1], ax=axn)
    axn.set_xlabel(f"sqrt({metric})")
plt.tight_layout()

../_images/notebooks_ogb_biokg_demo_22_0.png

Filtering relation types

The edge-level methods presented in the previous section simultaneously compute statistics for all edges in the KG, and this can be expensive on larger graphs. Moreover, in many practical cases the user might be interested in looking only at the properties of edges of one or few specific relation types.

The methods edge_degree_cardinality_summary, edge_pattern_summary and edge_metapath_count can be passed a list of relation type IDs to restrict computations of their outputs to edges of those specific relation types.

[7]:

filtered_metapath_counts = kgtt.edge_metapath_count(filter_relations=[7, 24])
filtered_metapath_counts

[7]:

	index	h	r	t	r1	r2	n_triangles
0	334382	732	7	1225	41	2	10
1	334382	732	7	1225	39	2	123
2	334382	732	7	1225	38	2	200
3	334382	732	7	1225	37	2	27
4	334382	732	7	1225	36	2	6
...	...	...	...	...	...	...	...
732149	4953327	1529	24	2492	13	41	2
732150	4953327	1529	24	2492	11	41	2
732151	4953327	1529	24	2492	6	41	2
732152	4953327	1529	24	2492	4	41	1
732153	4953327	1529	24	2492	2	41	2

732154 rows × 7 columns

The previous cell computes the number of triangles of each existing (r1, r2) metapath, but only over (h,r,t) edges of the two relation types with ID 7 and 24 (the column index gives the index of the edge in the biokkg_df DataFrame). This is the same as calling kgtt.edge_metapath_count().query('r==7 or r==24'), but the computation is much cheaper and faster.

[8]:

filtered_metapath_counts.r.value_counts()

[8]:

r
24    413366
7     318788
Name: count, dtype: int64

Relation-level analysis

All edge topological properties seen in the previous section can be aggregated over triples of the same relation type, to produce relation-level statistics. To do so, we can either set the option aggregate_by_r = True when calling the methods edge_degree_cardinality_summary, edge_pattern_summary, or - if edge topological metrics have already been precomputed - use the utility function aggregate_by_relation, which converts DataFrames indexed on the KG edges to DataFrames indexed on the IDs of the unique relation types.

[15]:

from kg_topology_toolbox.utils import aggregate_by_relation

aggregate_by_relation(edge_dcs).head()

[15]:

	num_triples	frac_triples	unique_h	unique_t	h_degree_mean	h_degree_std	h_degree_quartile1	h_degree_quartile2	h_degree_quartile3	h_unique_rel_mean	...	tot_degree_same_rel_quartile1	tot_degree_same_rel_quartile2	tot_degree_same_rel_quartile3	triple_cardinality_1:M_frac	triple_cardinality_M:1_frac	triple_cardinality_M:M_frac	triple_cardinality_same_rel_1:1_frac	triple_cardinality_same_rel_1:M_frac	triple_cardinality_same_rel_M:1_frac	triple_cardinality_same_rel_M:M_frac
r
0	81066	0.015931	9742	9337	569.252202	1083.315332	111.0	222.0	521.0	8.110293	...	45.0	112.0	211.0	0.0	0.0	1.0	0.001628	0.023586	0.064959	0.909827
1	5669	0.001114	698	1536	2518.765391	2186.452620	435.0	2087.0	4028.0	27.048157	...	14.0	32.0	60.0	0.0	0.0	1.0	0.002822	0.104251	0.027518	0.865408
2	66954	0.013158	612	612	4129.511919	1935.630599	2548.0	3968.0	5649.0	36.404307	...	332.0	404.0	482.0	0.0	0.0	1.0	0.000000	0.000254	0.000239	0.999507
3	19585	0.003849	491	491	4527.399592	1943.714179	2925.0	4507.0	6161.0	37.095941	...	114.0	157.0	202.0	0.0	0.0	1.0	0.000000	0.000868	0.000970	0.998162
4	32034	0.006295	526	525	4511.067834	1905.395180	2931.0	4507.0	6148.0	37.319567	...	188.0	243.0	299.0	0.0	0.0	1.0	0.000062	0.000531	0.000593	0.998814

5 rows × 51 columns

Notice on the extra columns num_triples, frac_triples, unique_h, unique_t giving additional statistics for relation types (number of edges and relative frequency, number of unique entities used as heads/tails by triples of the relation type).

Similarly, by aggregating the edge_eps DataFrame we can look at the distribution of edge topological patterns within each relation type.

[16]:

aggregate_by_relation(edge_eps).head()

[16]:

	num_triples	frac_triples	unique_h	unique_t	is_loop_frac	is_symmetric_frac	has_inverse_frac	n_inverse_relations_mean	n_inverse_relations_std	n_inverse_relations_quartile1	...	n_triangles_mean	n_triangles_std	n_triangles_quartile1	n_triangles_quartile2	n_triangles_quartile3	n_undirected_triangles_mean	n_undirected_triangles_std	n_undirected_triangles_quartile1	n_undirected_triangles_quartile2	n_undirected_triangles_quartile3
r
0	81066	0.015931	9742	9337	0.000012	0.000222	0.009474	0.018762	0.336120	0.0	...	49.615572	816.776738	3.0	7.0	16.00	136.452841	1421.830008	18.00	36.0	68.0
1	5669	0.001114	698	1536	0.000000	0.000353	0.061563	0.527783	2.502323	0.0	...	1630.912154	6563.522736	13.0	84.0	234.00	2864.104428	9520.116812	54.00	224.0	586.0
2	66954	0.013158	612	612	0.000000	0.947367	0.998253	11.019118	4.707246	8.0	...	27666.694925	15797.649746	14990.0	25934.0	38868.50	32678.993563	18619.016056	16691.00	32647.5	48637.0
3	19585	0.003849	491	491	0.000000	0.947358	0.999592	13.417258	4.585150	10.0	...	30250.858974	17053.925410	16204.0	28873.0	43798.00	32696.125351	18685.281686	16563.00	32808.0	48653.0
4	32034	0.006295	526	525	0.000000	0.947368	0.999376	13.299588	4.427898	10.0	...	30942.231192	16888.956656	17303.0	30137.5	44161.25	32685.210464	18685.267154	16645.25	32580.0	48767.0

5 rows × 32 columns

Additional methods are provided in the KGTopologyToolbox class for analysis at the relation level: jaccard_similarity_relation_sets to compute the Jaccard similarity of the sets of head/tail entities used by each relation; relational_affinity_ingram to compute the InGram pairwise relation similarity (see paper).

[17]:

kgtt.jaccard_similarity_relation_sets()

[17]:

	r1	r2	num_triples_both	frac_triples_both	num_entities_both	num_h_r1	num_h_r2	num_t_r1	num_t_r2	jaccard_head_head	jaccard_head_tail	jaccard_tail_head	jaccard_tail_tail	jaccard_both
1	0	1	86735	0.017046	14338	9742	698	9337	1536	0.064112	0.055301	0.037317	0.079635	0.112289
2	0	2	148020	0.029089	13934	9742	612	9337	612	0.056531	0.056531	0.031947	0.031947	0.041768
3	0	3	100651	0.019780	13929	9742	491	9337	491	0.045037	0.045037	0.026530	0.026530	0.033527
4	0	4	113100	0.022227	13931	9742	526	9337	525	0.048290	0.048188	0.027610	0.027506	0.035819
5	0	5	132276	0.025995	13931	9742	576	9337	578	0.053287	0.053491	0.029815	0.029916	0.039624
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2446	47	49	18021	0.003542	2414	806	1885	809	1886	0.131148	0.131568	0.132409	0.132353	0.135874
2447	47	50	374193	0.073538	5592	806	5224	809	5228	0.082391	0.082526	0.083318	0.083453	0.084764
2497	48	49	43122	0.008475	3407	2728	1885	2729	1886	0.371284	0.369952	0.371989	0.371064	0.370707
2498	48	50	399294	0.078471	6201	2728	5224	2729	5228	0.287356	0.287379	0.286061	0.286084	0.289147
2549	49	50	388092	0.076269	6169	1885	5224	1886	5228	0.156876	0.156773	0.156850	0.156748	0.158373

1275 rows × 14 columns

[18]:

kgtt.relational_affinity_ingram()

[18]:

	h_relation	t_relation	edge_weight
0	0	1	5.565931
1	0	2	0.244410
2	0	3	0.049564
3	0	4	0.079068
4	0	5	0.159787
...	...	...	...
2545	50	45	393.082900
2546	50	46	421.818843
2547	50	47	1.194898
2548	50	48	18.124874
2549	50	49	5.420267

2550 rows × 3 columns