kg_topology_toolbox.topology_toolbox.KGTopologyToolbox
- class kg_topology_toolbox.topology_toolbox.KGTopologyToolbox(kg_df, head_column='h', relation_column='r', tail_column='t')
Toolbox class to compute Knowledge Graph topology statistics.
Instantiate the Topology Toolbox for a Knowledge Graph defined by the list of its edges (h,r,t).
- Parameters:
kg_df (DataFrame) – A Knowledge Graph represented as a pd.DataFrame. Must contain at least three columns, which specify the IDs of head entity, relation type and tail entity for each edge.
head_column (str) – The name of the column with the IDs of head entities. Default: “h”.
relation_column (str) – The name of the column with the IDs of relation types. Default: “r”.
tail_column (str) – The name of the column with the IDs of tail entities. Default: “t”.
- edge_cardinality()
Classify the cardinality of each edge in the KG: one-to-one (out-degree=in-degree=1), one-to-many (out-degree>1, in-degree=1), many-to-one (out-degree=1, in-degree>1) or many-to-many (in-degree>1, out-degree>1). Here out-degree is the degree of the edge's head entity and in-degree is the degree of its tail entity.
- Return type:
DataFrame
- Returns:
The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):
triple_cardinality (int): cardinality type of the edge.
triple_cardinality_same_rel (int): cardinality type of the edge in the subgraph of edges with relation type r.
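The four classes above can be illustrated with a minimal pure-Python sketch. It is not the toolbox's implementation: the function name `classify_edges`, the `same_rel` flag and the string labels are hypothetical, chosen only to make the definition concrete.

```python
from collections import Counter

def classify_edges(triples, same_rel=False):
    """Label each (h, r, t) triple by the degrees of its head and tail.

    Hypothetical helper: with same_rel=True, degrees are counted only
    among edges that share the triple's relation type.
    """
    def key(entity, rel):
        return (entity, rel) if same_rel else entity

    out_degree = Counter(key(h, r) for h, r, _ in triples)  # edges sharing the head
    in_degree = Counter(key(t, r) for _, r, t in triples)   # edges sharing the tail

    labels = []
    for h, r, t in triples:
        many_out = out_degree[key(h, r)] > 1
        many_in = in_degree[key(t, r)] > 1
        if many_out and many_in:
            labels.append("many-to-many")
        elif many_out:
            labels.append("one-to-many")
        elif many_in:
            labels.append("many-to-one")
        else:
            labels.append("one-to-one")
    return labels

triples = [(0, 0, 1), (0, 0, 2), (3, 1, 2), (4, 1, 5)]
labels = classify_edges(triples)
labels_same_rel = classify_edges(triples, same_rel=True)
```

In this toy graph the edge (0, 0, 2) is many-to-many over the whole graph but one-to-many within relation 0, because the other edge into entity 2 has a different relation type.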
- edge_degree_cardinality_summary(aggregate_by_r=False)
For each edge in the KG, compute the number of edges with the same head (head-degree, or out-degree), the same tail (tail-degree, or in-degree) or one of the two (total-degree). Based on entity degrees, each triple is classified as either one-to-one (out-degree=in-degree=1), one-to-many (out-degree>1, in-degree=1), many-to-one (out-degree=1, in-degree>1) or many-to-many (in-degree>1, out-degree>1).
The output dataframe maintains the same indexing and ordering of triples as the original Knowledge Graph dataframe.
- Parameters:
aggregate_by_r (bool) – If True, return metrics aggregated by relation type (the output DataFrame will be indexed over relation IDs).
- Return type:
DataFrame
- Returns:
The results dataframe. Contains the following columns (in addition to h, r, t):
h_unique_rel (int): Number of distinct relation types among edges with head entity h.
h_degree (int): Number of triples with head entity h.
h_degree_same_rel (int): Number of triples with head entity h and relation type r.
t_unique_rel (int): Number of distinct relation types among edges with tail entity t.
t_degree (int): Number of triples with tail entity t.
t_degree_same_rel (int): Number of triples with tail entity t and relation type r.
tot_degree (int): Number of triples with head entity h or tail entity t.
tot_degree_same_rel (int): Number of triples with head entity h or tail entity t, and relation type r.
triple_cardinality (int): cardinality type of the edge.
triple_cardinality_same_rel (int): cardinality type of the edge in the subgraph of edges with relation type r.
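As a rough illustration of how the three per-edge degrees relate, the sketch below counts them with the standard library. The helper name `edge_degrees` is hypothetical and the code only follows the column descriptions above; it makes no claim about how the toolbox computes them.

```python
from collections import Counter

def edge_degrees(triples):
    """Per-edge head, tail and total degree for a list of (h, r, t) triples."""
    h_degree = Counter(h for h, _, _ in triples)
    t_degree = Counter(t for _, _, t in triples)
    # Triples sharing both the head and the tail are counted in both
    # degrees, so they are subtracted once (inclusion-exclusion).
    same_pair = Counter((h, t) for h, _, t in triples)
    return [
        {
            "h_degree": h_degree[h],
            "t_degree": t_degree[t],
            "tot_degree": h_degree[h] + t_degree[t] - same_pair[(h, t)],
        }
        for h, _, t in triples
    ]

triples = [(0, 0, 1), (0, 1, 1), (0, 0, 2), (3, 0, 1)]
degrees = edge_degrees(triples)
```

For the first edge, all four triples have head 0 or tail 1, so its total degree is 4 even though the head and tail degrees sum to 6.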
- edge_head_degree()
For each edge in the KG, compute the number of edges (in total or of the same relation type) with the same head node.
- Return type:
DataFrame
- Returns:
The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):
h_unique_rel (int): Number of distinct relation types among edges with head entity h.
h_degree (int): Number of triples with head entity h.
h_degree_same_rel (int): Number of triples with head entity h and relation type r.
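The three head-side columns can be reproduced on a toy edge list as follows. This is a sketch under the column definitions above, with a hypothetical helper name `edge_head_stats`; the tail-side columns would follow by swapping the roles of h and t.

```python
from collections import Counter, defaultdict

def edge_head_stats(triples):
    """Head-side statistics for each (h, r, t) triple, in input order."""
    h_degree = Counter(h for h, _, _ in triples)
    h_degree_same_rel = Counter((h, r) for h, r, _ in triples)
    head_relations = defaultdict(set)  # head entity -> relation types leaving it
    for h, r, _ in triples:
        head_relations[h].add(r)
    return [
        {
            "h_unique_rel": len(head_relations[h]),
            "h_degree": h_degree[h],
            "h_degree_same_rel": h_degree_same_rel[(h, r)],
        }
        for h, r, _ in triples
    ]

triples = [(0, 0, 1), (0, 0, 2), (0, 1, 3), (4, 1, 3)]
stats = edge_head_stats(triples)
```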
- edge_pattern_summary(return_metapath_list=False, composition_chunk_size=256, composition_workers=32, aggregate_by_r=False)
Analyse structural properties of each edge in the KG: symmetry, presence of inverse/inference(=parallel) edges and triangles supported on the edge.
The output dataframe maintains the same indexing and ordering of triples as the original Knowledge Graph dataframe.
- Parameters:
return_metapath_list (bool) – If True, return the list of unique metapaths for all triangles supported over one edge. WARNING: very expensive for large graphs.
composition_chunk_size (int) – Size of column chunks of sparse adjacency matrix to compute the triangle count.
composition_workers (int) – Number of workers to compute the triangle count.
aggregate_by_r (bool) – If True, return metrics aggregated by relation type (the output DataFrame will be indexed over relation IDs).
- Return type:
DataFrame
- Returns:
The results dataframe. Contains the following columns (in addition to h, r, t):
is_loop (bool): True if the triple is a loop (h == t).
is_symmetric (bool): True if the triple (t, r, h) is also contained in the graph (assuming t and h are different).
has_inverse (bool): True if the graph contains one or more triples (t, r', h) with r' != r.
n_inverse_relations (int): The number of inverse relations r'.
inverse_edge_types (list): All relations r' (including r if the edge is symmetric) such that (t, r', h) is in the graph.
has_inference (bool): True if the graph contains one or more triples (h, r', t) with r' != r.
n_inference_relations (int): The number of inference relations r'.
inference_edge_types (list): All relations r' (including r) such that (h, r', t) is in the graph.
has_composition (bool): True if the graph contains one or more triangles supported on the edge: (h, r1, x) + (x, r2, t).
n_triangles (int): The number of triangles.
has_undirected_composition (bool): True if the graph contains one or more undirected triangles supported on the edge.
n_undirected_triangles (int): The number of undirected triangles (considering all edges as bidirectional).
metapath_list (list): The list of unique metapaths “r1-r2” for the directed triangles.
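To make the pattern definitions concrete, the following sketch applies them literally to a small edge list using only the standard library. It is an illustration, not the toolbox's algorithm (which, per the parameters above, works on chunks of a sparse adjacency matrix). The helper name `edge_patterns` is hypothetical, and two choices are assumptions of this sketch: a triangle is counted once per pair of triples (h, r1, x), (x, r2, t), and the intermediate entity x must differ from both h and t.

```python
from collections import defaultdict

def edge_patterns(triples):
    """Apply the pattern definitions literally to a list of (h, r, t) triples."""
    rels = defaultdict(set)        # (h, t) -> relation types on that ordered pair
    successors = defaultdict(set)  # h -> entities reachable in one hop
    for h, r, t in triples:
        rels[(h, t)].add(r)
        successors[h].add(t)

    rows = []
    for h, r, t in triples:
        reverse = rels.get((t, h), set())  # relation types on edges t -> h
        parallel = rels[(h, t)]            # relation types on edges h -> t
        # Two-hop paths (h, r1, x) + (x, r2, t); assumption: x is not h or t.
        n_triangles = sum(
            len(rels[(h, x)]) * len(rels.get((x, t), ()))
            for x in successors[h]
            if x != h and x != t
        )
        rows.append({
            "is_loop": h == t,
            "is_symmetric": h != t and r in reverse,
            "has_inverse": bool(reverse - {r}),
            "n_inverse_relations": len(reverse - {r}),
            "has_inference": bool(parallel - {r}),
            "n_inference_relations": len(parallel - {r}),
            "has_composition": n_triangles > 0,
            "n_triangles": n_triangles,
        })
    return rows

# (0, 0, 1) and (1, 0, 0) form a symmetric pair, (0, 1, 1) is parallel to
# (0, 0, 1), and the path 0 -> 2 -> 1 closes a triangle over the edges 0 -> 1.
triples = [(0, 0, 1), (1, 0, 0), (0, 1, 1), (0, 2, 2), (2, 3, 1)]
patterns = edge_patterns(triples)
```

Note that (0, 0, 1) is symmetric but has no inverse in this sketch, since the only reverse edge carries the same relation type, whereas (1, 0, 0) has the inverse relation 1.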
- edge_tail_degree()
For each edge in the KG, compute the number of edges (in total or of the same relation type) with the same tail node.
- Return type:
DataFrame
- Returns:
The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):
t_unique_rel (int): Number of distinct relation types among edges with tail entity t.
t_degree (int): Number of triples with tail entity t.
t_degree_same_rel (int): Number of triples with tail entity t and relation type r.
- jaccard_similarity_relation_sets()
Compute the similarity between relations defined as the Jaccard Similarity between sets of entities (heads and tails) for all pairs of relations in the graph.
- Return type:
DataFrame
- Returns:
The results dataframe. Contains the following columns:
r1 (int): Index of the first relation.
r2 (int): Index of the second relation.
num_triples_both (int): Number of triples with relation r1/r2.
frac_triples_both (float): Fraction of triples with relation r1/r2.
num_entities_both (int): Number of unique entities (h or t) for triples with relation r1/r2.
num_h_r1 (int): Number of unique head entities for relation r1.
num_h_r2 (int): Number of unique head entities for relation r2.
num_t_r1 (int): Number of unique tail entities for relation r1.
num_t_r2 (int): Number of unique tail entities for relation r2.
jaccard_head_head (float): Jaccard similarity between the head set of r1 and the head set of r2.
jaccard_tail_tail (float): Jaccard similarity between the tail set of r1 and the tail set of r2.
jaccard_head_tail (float): Jaccard similarity between the head set of r1 and the tail set of r2.
jaccard_tail_head (float): Jaccard similarity between the tail set of r1 and the head set of r2.
jaccard_both (float): Jaccard similarity between the full entity set of r1 and r2.
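The five Jaccard columns compare head and tail entity sets of two relations. A minimal sketch for a single pair of relations is shown below; the helper names `jaccard` and `relation_entity_sets` are hypothetical, and returning 0.0 for two empty sets is an assumption of this sketch.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def relation_entity_sets(triples):
    """Map each relation type to its set of head entities and of tail entities."""
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
    return heads, tails

triples = [(0, 0, 1), (2, 0, 1), (0, 1, 3), (1, 1, 0)]
heads, tails = relation_entity_sets(triples)

r1, r2 = 0, 1
jaccard_head_head = jaccard(heads[r1], heads[r2])
jaccard_tail_tail = jaccard(tails[r1], tails[r2])
jaccard_head_tail = jaccard(heads[r1], tails[r2])
jaccard_tail_head = jaccard(tails[r1], heads[r2])
jaccard_both = jaccard(heads[r1] | tails[r1], heads[r2] | tails[r2])
```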
- loop_count()
For each entity in the KG, compute the number of loops around the entity (i.e., the number of edges having the entity as both head and tail).
- Return type:
DataFrame
- Returns:
Loop count DataFrame, indexed on the IDs of the graph entities.
- node_degree_summary(return_relation_list=False)
For each entity in the KG, compute the number of edges having it as a head (head-degree, or out-degree), as a tail (tail-degree, or in-degree) or one of the two (total-degree). The in-going and out-going relation types are also identified.
The output dataframe is indexed on the IDs of the graph entities.
- Parameters:
return_relation_list (bool) – If True, return the list of unique relations going in/out of an entity. WARNING: expensive for large graphs.
- Return type:
DataFrame
- Returns:
The results dataframe, indexed on the IDs e of the graph entities, with columns:
h_degree (int): Number of triples with head entity e.
t_degree (int): Number of triples with tail entity e.
tot_degree (int): Number of triples with head entity e or tail entity e.
h_unique_rel (int): Number of distinct relation types among edges with head entity e.
h_rel_list (Optional[list]): List of unique relation types among edges with head entity e. Only returned if return_relation_list = True.
t_unique_rel (int): Number of distinct relation types among edges with tail entity e.
t_rel_list (Optional[list]): List of unique relation types among edges with tail entity e. Only returned if return_relation_list = True.
n_loops (int): number of loops around entity e.
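The degree and loop columns can be checked by hand on a small graph. The sketch below follows the column descriptions above with a hypothetical helper `node_degrees`; a loop contributes to both the head and the tail degree of its entity, so it is subtracted once from the total.

```python
from collections import Counter

def node_degrees(triples):
    """Head, tail and total degree plus loop count for every entity."""
    h_degree = Counter(h for h, _, _ in triples)
    t_degree = Counter(t for _, _, t in triples)
    n_loops = Counter(h for h, _, t in triples if h == t)
    entities = sorted(set(h_degree) | set(t_degree))
    return {
        e: {
            "h_degree": h_degree[e],
            "t_degree": t_degree[e],
            # Triples with e as head or tail: loops would otherwise count twice.
            "tot_degree": h_degree[e] + t_degree[e] - n_loops[e],
            "n_loops": n_loops[e],
        }
        for e in entities
    }

triples = [(0, 0, 1), (1, 0, 1), (1, 1, 2), (2, 0, 0)]
summary = node_degrees(triples)
```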
- node_head_degree(return_relation_list=False)
For each entity in the KG, compute the number of edges having it as head (head-degree, or out-degree of the head node). The relation types going out of the head node are also identified.
- Parameters:
return_relation_list (bool) – If True, return the list of unique relations going out of the head node. WARNING: expensive for large graphs. Default: False.
- Return type:
DataFrame
- Returns:
The result DataFrame, indexed on the IDs e of the graph entities, with columns:
h_degree (int): Number of triples with head entity e.
h_unique_rel (int): Number of distinct relation types among edges with head entity e.
h_rel_list (Optional[list]): List of unique relation types among edges with head entity e. Only returned if return_relation_list = True.
- node_tail_degree(return_relation_list=False)
For each entity in the KG, compute the number of edges having it as tail (tail-degree, or in-degree of the tail node). The relation types going into the tail node are also identified.
- Parameters:
return_relation_list (bool) – If True, return the list of unique relation types going into the tail node. WARNING: expensive for large graphs. Default: False.
- Return type:
DataFrame
- Returns:
The result DataFrame, indexed on the IDs e of the graph entities, with columns:
t_degree (int): Number of triples with tail entity e.
t_unique_rel (int): Number of distinct relation types among edges with tail entity e.
t_rel_list (Optional[list]): List of unique relation types among edges with tail entity e. Only returned if return_relation_list = True.
- relational_affinity_ingram(min_max_norm=False)
Compute the similarity between relations based on the approach proposed in InGram: Inductive Knowledge Graph Embedding via Relation Graphs, https://arxiv.org/abs/2305.19987.
Only the pairs of relations with affinity > 0 are shown in the returned dataframe.
- Parameters:
min_max_norm (bool) – If True, apply min-max normalization to the edge weights. Default: False.
- Return type:
DataFrame
- Returns:
The results dataframe. Contains the following columns:
h_relation (int): Index of the head relation.
t_relation (int): Index of the tail relation.
edge_weight (float): Weight for the affinity between the head and the tail relation.