kg_topology_toolbox.topology_toolbox.KGTopologyToolbox

class kg_topology_toolbox.topology_toolbox.KGTopologyToolbox(kg_df, head_column='h', relation_column='r', tail_column='t')

Toolbox class to compute Knowledge Graph topology statistics.

Instantiate the Topology Toolbox for a Knowledge Graph defined by the list of its edges (h,r,t).

Parameters:
  • kg_df (DataFrame) – A Knowledge Graph represented as a pd.DataFrame. Must contain at least three columns, which specify the IDs of head entity, relation type and tail entity for each edge.

  • head_column (str) – The name of the column with the IDs of head entities. Default: “h”.

  • relation_column (str) – The name of the column with the IDs of relation types. Default: “r”.

  • tail_column (str) – The name of the column with the IDs of tail entities. Default: “t”.

edge_cardinality()

Classify the cardinality of each edge in the KG: one-to-one (out-degree=in-degree=1), one-to-many (out-degree>1, in-degree=1), many-to-one (out-degree=1, in-degree>1) or many-to-many (in-degree>1, out-degree>1).

Return type:

DataFrame

Returns:

The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):

  • triple_cardinality (int): Cardinality type of the edge.

  • triple_cardinality_same_rel (int): Cardinality type of the edge in the subgraph of edges with relation type r.
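The four cardinality classes above can be illustrated with a minimal pure-Python sketch. This is not the toolbox's implementation: the helper name `classify_cardinality` and the string labels are assumptions made for illustration, and the encoding actually stored in the `triple_cardinality` column is not specified here.

```python
from collections import Counter


def classify_cardinality(triples):
    """Return one cardinality label per (h, r, t) triple, in input order.

    Hypothetical helper: labels are illustrative strings, not the
    values produced by the toolbox.
    """
    out_degree = Counter(h for h, _, _ in triples)  # triples sharing each head
    in_degree = Counter(t for _, _, t in triples)   # triples sharing each tail
    labels = []
    for h, _, t in triples:
        many_out = out_degree[h] > 1
        many_in = in_degree[t] > 1
        if many_out and many_in:
            labels.append("many-to-many")
        elif many_out:
            labels.append("one-to-many")
        elif many_in:
            labels.append("many-to-one")
        else:
            labels.append("one-to-one")
    return labels


triples = [(0, 0, 1), (0, 1, 2), (3, 0, 2), (4, 0, 5)]
labels = classify_cardinality(triples)
for triple, label in zip(triples, labels):
    print(triple, label)
```

Restricting `triples` to a single relation type before calling the helper would give the analogue of the `_same_rel` variant.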

edge_degree_cardinality_summary(aggregate_by_r=False)

For each edge in the KG, compute the number of edges with the same head (head-degree, or out-degree), the same tail (tail-degree, or in-degree) or one of the two (total-degree). Based on entity degrees, each triple is classified as either one-to-one (out-degree=in-degree=1), one-to-many (out-degree>1, in-degree=1), many-to-one (out-degree=1, in-degree>1) or many-to-many (in-degree>1, out-degree>1).

The output dataframe maintains the same indexing and ordering of triples as the original Knowledge Graph dataframe.

Parameters:

aggregate_by_r (bool) – If True, return metrics aggregated by relation type (the output DataFrame will be indexed over relation IDs).

Return type:

DataFrame

Returns:

The results dataframe. Contains the following columns (in addition to h, r, t):

  • h_unique_rel (int): Number of distinct relation types among edges with head entity h.

  • h_degree (int): Number of triples with head entity h.

  • h_degree_same_rel (int): Number of triples with head entity h and relation type r.

  • t_unique_rel (int): Number of distinct relation types among edges with tail entity t.

  • t_degree (int): Number of triples with tail entity t.

  • t_degree_same_rel (int): Number of triples with tail entity t and relation type r.

  • tot_degree (int): Number of triples with head entity h or tail entity t.

  • tot_degree_same_rel (int): Number of triples with head entity h or tail entity t, and relation type r.

  • triple_cardinality (int): Cardinality type of the edge.

  • triple_cardinality_same_rel (int): Cardinality type of the edge in the subgraph of edges with relation type r.
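Because `tot_degree` counts triples with head h or tail t, triples that have both must be counted only once. A minimal sketch of that inclusion-exclusion, assuming triples are plain `(h, r, t)` tuples (the helper `edge_total_degree` is hypothetical and is not part of the toolbox):

```python
from collections import Counter


def edge_total_degree(triples):
    """Per-edge count of triples sharing the head OR the tail of the edge."""
    out_degree = Counter(h for h, _, _ in triples)
    in_degree = Counter(t for _, _, t in triples)
    # Triples with the same head AND the same tail appear in both counts.
    same_pair = Counter((h, t) for h, _, t in triples)
    return [
        out_degree[h] + in_degree[t] - same_pair[(h, t)]
        for h, _, t in triples
    ]


triples = [(0, 0, 1), (0, 1, 2), (3, 0, 2), (4, 0, 5)]
tot_degree = edge_total_degree(triples)
print(tot_degree)
```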

edge_head_degree()

For each edge in the KG, compute the number of edges (in total or of the same relation type) with the same head node.

Return type:

DataFrame

Returns:

The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):

  • h_unique_rel (int): Number of distinct relation types among edges with head entity h.

  • h_degree (int): Number of triples with head entity h.

  • h_degree_same_rel (int): Number of triples with head entity h and relation type r.
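The three head-degree columns can be reproduced on a toy edge list with the standard library. This is only a sketch of the definitions above, under the assumption that triples are `(h, r, t)` tuples; `head_degree_rows` is a hypothetical helper, not a toolbox function.

```python
from collections import Counter, defaultdict


def head_degree_rows(triples):
    """Return one dict per triple with the head-degree statistics."""
    h_degree = Counter(h for h, _, _ in triples)
    h_degree_same_rel = Counter((h, r) for h, r, _ in triples)
    relations_of_head = defaultdict(set)
    for h, r, _ in triples:
        relations_of_head[h].add(r)
    return [
        {
            "h_unique_rel": len(relations_of_head[h]),
            "h_degree": h_degree[h],
            "h_degree_same_rel": h_degree_same_rel[(h, r)],
        }
        for h, r, _ in triples
    ]


triples = [(0, 0, 1), (0, 0, 2), (0, 1, 3), (4, 1, 3)]
rows = head_degree_rows(triples)
for triple, row in zip(triples, rows):
    print(triple, row)
```

The tail-degree columns follow the same pattern with the roles of h and t exchanged.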

edge_pattern_summary(return_metapath_list=False, composition_chunk_size=256, composition_workers=32, aggregate_by_r=False)

Analyse structural properties of each edge in the KG: symmetry, presence of inverse and inference (i.e. parallel) edges, and triangles supported on the edge.

The output dataframe maintains the same indexing and ordering of triples as the original Knowledge Graph dataframe.

Parameters:
  • return_metapath_list (bool) – If True, return the list of unique metapaths for all triangles supported over one edge. WARNING: very expensive for large graphs.

  • composition_chunk_size (int) – Size of column chunks of sparse adjacency matrix to compute the triangle count.

  • composition_workers (int) – Number of workers to compute the triangle count.

  • aggregate_by_r (bool) – If True, return metrics aggregated by relation type (the output DataFrame will be indexed over relation IDs).

Return type:

DataFrame

Returns:

The results dataframe. Contains the following columns (in addition to h, r, t):

  • is_loop (bool): True if the triple is a loop (h == t).

  • is_symmetric (bool): True if the triple (t, r, h) is also contained in the graph (assuming t and h are different).

  • has_inverse (bool): True if the graph contains one or more triples (t, r', h) with r' != r.

  • n_inverse_relations (int): The number of inverse relations r'.

  • inverse_edge_types (list): All relations r' (including r if the edge is symmetric) such that (t, r', h) is in the graph.

  • has_inference (bool): True if the graph contains one or more triples (h, r', t) with r' != r.

  • n_inference_relations (int): The number of inference relations r'.

  • inference_edge_types (list): All relations r' (including r) such that (h, r', t) is in the graph.

  • has_composition (bool): True if the graph contains one or more triangles supported on the edge: (h, r1, x) + (x, r2, t).

  • n_triangles (int): The number of triangles.

  • has_undirected_composition (bool): True if the graph contains one or more undirected triangles supported on the edge.

  • n_undirected_triangles (int): The number of undirected triangles (considering all edges as bidirectional).

  • metapath_list (list): The list of unique metapaths “r1-r2” for the directed triangles.
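The boolean pattern flags can be illustrated on a toy graph without loops. The sketch below is an illustration of the definitions above, not the toolbox's sparse-matrix implementation; `edge_patterns` is a hypothetical helper, and it assumes that the intermediate node x of a triangle differs from both h and t, which the description above does not state.

```python
from collections import defaultdict


def edge_patterns(triples):
    """Return one dict of pattern flags per (h, r, t) triple."""
    relations_by_pair = defaultdict(set)  # (h, t) -> relation types
    successors = defaultdict(set)         # h -> tails reachable in one hop
    for h, r, t in triples:
        relations_by_pair[(h, t)].add(r)
        successors[h].add(t)

    rows = []
    for h, r, t in triples:
        inverse = relations_by_pair.get((t, h), set())
        parallel = relations_by_pair[(h, t)]
        # Directed triangle: (h, r1, x) + (x, r2, t), with x not in {h, t}
        # (assumption of this sketch).
        has_composition = any(
            x != h and x != t and t in successors.get(x, ())
            for x in successors[h]
        )
        rows.append(
            {
                "is_loop": h == t,
                "is_symmetric": h != t and r in inverse,
                "has_inverse": any(r2 != r for r2 in inverse),
                "has_inference": any(r2 != r for r2 in parallel),
                "has_composition": has_composition,
            }
        )
    return rows


triples = [(0, 0, 1), (1, 0, 0), (0, 1, 1), (0, 2, 2), (2, 2, 1)]
rows = edge_patterns(triples)
for triple, row in zip(triples, rows):
    print(triple, row)
```

In this toy graph the edge (0, 0, 1) is symmetric because (1, 0, 0) exists, has an inference edge (0, 1, 1), and supports the triangle 0 → 2 → 1.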

edge_tail_degree()

For each edge in the KG, compute the number of edges (in total or of the same relation type) with the same tail node.

Return type:

DataFrame

Returns:

The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):

  • t_unique_rel (int): Number of distinct relation types among edges with tail entity t.

  • t_degree (int): Number of triples with tail entity t.

  • t_degree_same_rel (int): Number of triples with tail entity t and relation type r.

jaccard_similarity_relation_sets()

Compute the similarity between relations defined as the Jaccard Similarity between sets of entities (heads and tails) for all pairs of relations in the graph.

Return type:

DataFrame

Returns:

The results dataframe. Contains the following columns:

  • r1 (int): Index of the first relation.

  • r2 (int): Index of the second relation.

  • num_triples_both (int): Number of triples with relation r1 or r2.

  • frac_triples_both (float): Fraction of triples with relation r1 or r2.

  • num_entities_both (int): Number of unique entities (h or t) for triples with relation r1 or r2.

  • num_h_r1 (int): Number of unique head entities for relation r1.

  • num_h_r2 (int): Number of unique head entities for relation r2.

  • num_t_r1 (int): Number of unique tail entities for relation r1.

  • num_t_r2 (int): Number of unique tail entities for relation r2.

  • jaccard_head_head (float): Jaccard similarity between the head set of r1 and the head set of r2.

  • jaccard_tail_tail (float): Jaccard similarity between the tail set of r1 and the tail set of r2.

  • jaccard_head_tail (float): Jaccard similarity between the head set of r1 and the tail set of r2.

  • jaccard_tail_head (float): Jaccard similarity between the tail set of r1 and the head set of r2.

  • jaccard_both (float): Jaccard similarity between the full entity sets of r1 and r2.
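The five Jaccard columns compare the head set, the tail set and the full entity set of two relations. A minimal sketch for a single pair of relations, assuming `(h, r, t)` tuples; the helpers `jaccard` and `relation_entity_sets` are hypothetical, and returning 0.0 for two empty sets is a choice of this sketch.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relation_entity_sets(triples):
    """Map each relation to its set of head entities and tail entities."""
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
    return heads, tails


triples = [(0, 0, 1), (2, 0, 1), (0, 1, 3), (1, 1, 0)]
heads, tails = relation_entity_sets(triples)
r1, r2 = 0, 1
similarities = {
    "jaccard_head_head": jaccard(heads[r1], heads[r2]),
    "jaccard_tail_tail": jaccard(tails[r1], tails[r2]),
    "jaccard_head_tail": jaccard(heads[r1], tails[r2]),
    "jaccard_tail_head": jaccard(tails[r1], heads[r2]),
    "jaccard_both": jaccard(heads[r1] | tails[r1], heads[r2] | tails[r2]),
}
print(similarities)
```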

loop_count()

For each entity in the KG, compute the number of loops around the entity (i.e., the number of edges having the entity as both head and tail).

Return type:

DataFrame

Returns:

Loop count DataFrame, indexed on the IDs of the graph entities.

node_degree_summary(return_relation_list=False)

For each entity in the KG, compute the number of edges having it as a head (head-degree, or out-degree), as a tail (tail-degree, or in-degree) or one of the two (total-degree). The in-going and out-going relation types are also identified.

The output dataframe is indexed on the IDs of the graph entities.

Parameters:

return_relation_list (bool) – If True, return the list of unique relations going in/out of an entity. WARNING: expensive for large graphs.

Return type:

DataFrame

Returns:

The results dataframe, indexed on the IDs e of the graph entities, with columns:

  • h_degree (int): Number of triples with head entity e.

  • t_degree (int): Number of triples with tail entity e.

  • tot_degree (int): Number of triples with head entity e or tail entity e.

  • h_unique_rel (int): Number of distinct relation types among edges with head entity e.

  • h_rel_list (Optional[list]): List of unique relation types among edges with head entity e. Only returned if return_relation_list = True.

  • t_unique_rel (int): Number of distinct relation types among edges with tail entity e.

  • t_rel_list (Optional[list]): List of unique relation types among edges with tail entity e. Only returned if return_relation_list = True.

  • n_loops (int): Number of loops around entity e.
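For a node, a loop (e, r, e) contributes to both the head-degree and the tail-degree, so the total degree is the sum of the two minus the number of loops. A minimal sketch of these node-level counts, assuming `(h, r, t)` tuples (`node_degrees` is a hypothetical helper, not the toolbox's implementation):

```python
from collections import Counter


def node_degrees(triples):
    """Return {entity: {h_degree, t_degree, tot_degree, n_loops}}."""
    h_degree = Counter(h for h, _, _ in triples)
    t_degree = Counter(t for _, _, t in triples)
    n_loops = Counter(h for h, _, t in triples if h == t)
    entities = sorted(set(h_degree) | set(t_degree))
    return {
        e: {
            "h_degree": h_degree[e],
            "t_degree": t_degree[e],
            # A loop is counted once as head and once as tail: subtract it.
            "tot_degree": h_degree[e] + t_degree[e] - n_loops[e],
            "n_loops": n_loops[e],
        }
        for e in entities
    }


triples = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
degrees = node_degrees(triples)
for entity, stats in degrees.items():
    print(entity, stats)
```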

node_head_degree(return_relation_list=False)

For each entity in the KG, compute the number of edges having it as head (head-degree, or out-degree of the head node). The relation types going out of the head node are also identified.

Parameters:

return_relation_list (bool) – If True, return the list of unique relations going out of the head node. WARNING: expensive for large graphs. Default: False.

Return type:

DataFrame

Returns:

The result DataFrame, indexed on the IDs e of the graph entities, with columns:

  • h_degree (int): Number of triples with head entity e.

  • h_unique_rel (int): Number of distinct relation types among edges with head entity e.

  • h_rel_list (Optional[list]): List of unique relation types among edges with head entity e. Only returned if return_relation_list = True.

node_tail_degree(return_relation_list=False)

For each entity in the KG, compute the number of edges having it as tail (tail-degree, or in-degree of the tail node). The relation types going into the tail node are also identified.

Parameters:

return_relation_list (bool) – If True, return the list of unique relation types going into the tail node. WARNING: expensive for large graphs. Default: False.

Return type:

DataFrame

Returns:

The result DataFrame, indexed on the IDs e of the graph entities, with columns:

  • t_degree (int): Number of triples with tail entity e.

  • t_unique_rel (int): Number of distinct relation types among edges with tail entity e.

  • t_rel_list (Optional[list]): List of unique relation types among edges with tail entity e. Only returned if return_relation_list = True.

relational_affinity_ingram(min_max_norm=False)

Compute the similarity between relations based on the approach proposed in InGram: Inductive Knowledge Graph Embedding via Relation Graphs, https://arxiv.org/abs/2305.19987.

Only the pairs of relations with affinity > 0 are included in the returned dataframe.

Parameters:

min_max_norm (bool) – If True, apply min-max normalization to the edge weights. Default: False.

Return type:

DataFrame

Returns:

The results dataframe. Contains the following columns:

  • h_relation (int): Index of the head relation.

  • t_relation (int): Index of the tail relation.

  • edge_weight (float): Weight for the affinity between the head and the tail relation.