kg_topology_toolbox.topology_toolbox.KGTopologyToolbox
- class kg_topology_toolbox.topology_toolbox.KGTopologyToolbox(kg_df, head_column='h', relation_column='r', tail_column='t')[source]
Toolbox class to compute Knowledge Graph topology statistics.
Instantiate the Topology Toolbox for a Knowledge Graph defined by the list of its edges (h,r,t).
- Parameters:
kg_df (DataFrame) – A Knowledge Graph represented as a pd.DataFrame. Must contain at least three columns, which specify the IDs of head entity, relation type and tail entity for each edge.
head_column (str) – The name of the column with the IDs of head entities. Default: “h”.
relation_column (str) – The name of the column with the IDs of relation types. Default: “r”.
tail_column (str) – The name of the column with the IDs of tail entities. Default: “t”.
- edge_cardinality()[source]
Classify the cardinality of each edge in the KG: one-to-one (out-degree=1, in-degree=1), one-to-many (out-degree>1, in-degree=1), many-to-one (out-degree=1, in-degree>1) or many-to-many (out-degree>1, in-degree>1).
- Return type:
DataFrame
- Returns:
The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):
triple_cardinality (int): cardinality type of the edge.
triple_cardinality_same_rel (int): cardinality type of the edge in the subgraph of edges with relation type r.
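The classification rule above can be sketched in plain Python. This is only an illustration of the definition, not the library's implementation: the helper name `classify_edges`, the `same_rel` flag and the string labels are all assumptions made for this example.

```python
from collections import Counter


def classify_edges(triples, same_rel=False):
    """Label each (h, r, t) triple by cardinality (hypothetical helper).

    With same_rel=True, degrees are counted only among edges that share
    the relation type of the edge being classified.
    """
    def key(entity, relation):
        return (entity, relation) if same_rel else entity

    out_deg = Counter(key(h, r) for h, r, _ in triples)  # edges sharing the head
    in_deg = Counter(key(t, r) for _, r, t in triples)   # edges sharing the tail

    labels = []
    for h, r, t in triples:
        many_out = out_deg[key(h, r)] > 1
        many_in = in_deg[key(t, r)] > 1
        if many_out and many_in:
            labels.append("many-to-many")
        elif many_out:
            labels.append("one-to-many")
        elif many_in:
            labels.append("many-to-one")
        else:
            labels.append("one-to-one")
    return labels


triples = [(0, 0, 1), (0, 1, 2), (3, 0, 2), (4, 0, 5)]
labels = classify_edges(triples)
labels_same_rel = classify_edges(triples, same_rel=True)
```

In this toy graph the four edges receive four different labels when all edges are considered, while every edge becomes one-to-one once degrees are restricted to its own relation type.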
- edge_degree_cardinality_summary(filter_relations=[], aggregate_by_r=False)[source]
For each edge in the KG, compute the number of edges with the same head (head-degree, or out-degree), the same tail (tail-degree, or in-degree) or either of the two (total-degree). Based on entity degrees, each triple is classified as either one-to-one (out-degree=1, in-degree=1), one-to-many (out-degree>1, in-degree=1), many-to-one (out-degree=1, in-degree>1) or many-to-many (out-degree>1, in-degree>1).
The output dataframe maintains the same indexing and ordering of triples as the original Knowledge Graph dataframe.
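- Parameters:
filter_relations (list[int]) – If not empty, compute the output only for the edges with relation in this list of relation IDs.
aggregate_by_r (bool) – If True, return metrics aggregated by relation type (the output DataFrame will be indexed over relation IDs).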
- Return type:
DataFrame
- Returns:
The results dataframe. Contains the following columns (in addition to h, r, t):
h_unique_rel (int): Number of distinct relation types among edges with head entity h.
h_degree (int): Number of triples with head entity h.
h_degree_same_rel (int): Number of triples with head entity h and relation type r.
t_unique_rel (int): Number of distinct relation types among edges with tail entity t.
t_degree (int): Number of triples with tail entity t.
t_degree_same_rel (int): Number of triples with tail entity t and relation type r.
tot_degree (int): Number of triples with head entity h or tail entity t.
tot_degree_same_rel (int): Number of triples with head entity h or tail entity t, and relation type r.
triple_cardinality (int): cardinality type of the edge.
triple_cardinality_same_rel (int): cardinality type of the edge in the subgraph of edges with relation type r.
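The tot_degree column counts triples that share the head or the tail of the edge. Reading "or" as a set union, edges that share both endpoints must be counted once. The sketch below works under that assumption; the helper name `total_degree` is hypothetical and the code is not taken from the library.

```python
from collections import Counter


def total_degree(triples):
    """Per-edge count of triples with head h or tail t (hypothetical helper)."""
    h_degree = Counter(h for h, _, _ in triples)
    t_degree = Counter(t for _, _, t in triples)
    # Triples sharing BOTH head and tail would otherwise be counted twice.
    both = Counter((h, t) for h, _, t in triples)
    return [h_degree[h] + t_degree[t] - both[(h, t)] for h, _, t in triples]


triples = [(0, 0, 1), (0, 1, 1), (0, 0, 2), (3, 0, 1)]
tot = total_degree(triples)
```

For the edge (0, 0, 2), three triples have head 0 and one has tail 2, but the edge itself appears in both groups, so the total is 3 rather than 4.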
- edge_head_degree()[source]
For each edge in the KG, compute the number of edges (in total or of the same relation type) with the same head node.
- Return type:
DataFrame
- Returns:
The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):
h_unique_rel (int): Number of distinct relation types among edges with head entity h.
h_degree (int): Number of triples with head entity h.
h_degree_same_rel (int): Number of triples with head entity h and relation type r.
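As an illustration of the three columns, the following plain-Python sketch recomputes them for a list of triples. It mirrors the column definitions only; the helper name `edge_head_stats` is an assumption, and the library itself returns a DataFrame rather than a list of dictionaries.

```python
from collections import Counter, defaultdict


def edge_head_stats(triples):
    """Head-side statistics for each (h, r, t) triple (hypothetical helper)."""
    h_degree = Counter(h for h, _, _ in triples)
    h_degree_same_rel = Counter((h, r) for h, r, _ in triples)
    rels_by_head = defaultdict(set)
    for h, r, _ in triples:
        rels_by_head[h].add(r)
    # One row per input triple, in the original order.
    return [
        {
            "h_unique_rel": len(rels_by_head[h]),
            "h_degree": h_degree[h],
            "h_degree_same_rel": h_degree_same_rel[(h, r)],
        }
        for h, r, _ in triples
    ]


triples = [(0, 0, 1), (0, 0, 2), (0, 1, 2), (3, 0, 2)]
stats = edge_head_stats(triples)
```

The tail-side columns follow the same pattern with the roles of h and t swapped.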
- edge_metapath_count(filter_relations=[], composition_chunk_size=256, composition_workers=3)[source]
For each edge in the KG, compute the number of triangles supported on it distinguishing between different metapaths (i.e., the unique ordered tuples (r1, r2) of relation types of the two additional edges of the triangle).
- Parameters:
filter_relations (list[int]) – If not empty, compute the output only for the edges with relation in this list of relation IDs.
composition_chunk_size (int) – Size of column chunks of sparse adjacency matrix to compute the triangle count. Reduce the parameter if running OOM. Default: 2**8.
composition_workers (int) – Number of workers to compute the triangle count. By default, assigned based on number of available threads (max: 32).
- Return type:
DataFrame
- Returns:
The output dataframe has one row for each (h, r, t, r1, r2) such that there exists at least one triangle of metapath (r1, r2) over (h, r, t). The number of metapath triangles is given in the column n_triangles. The column index provides the index of the edge (h, r, t) in the original Knowledge Graph dataframe.
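To make the notion of a metapath triangle concrete, here is a brute-force sketch that counts, for each edge (h, r, t), the two-hop paths (h, r1, x) + (x, r2, t), grouped by (r1, r2). It is an illustration only: the helper name `metapath_counts` is hypothetical, the library is described as using chunked sparse adjacency matrices instead, and the sketch gives loops no special treatment (the example graph contains none).

```python
from collections import Counter, defaultdict


def metapath_counts(triples):
    """Count triangles per (edge index, r1, r2) by brute force (hypothetical helper)."""
    successors = defaultdict(list)  # h -> [(r, t), ...]
    for h, r, t in triples:
        successors[h].append((r, t))

    counts = Counter()
    for index, (h, _, t) in enumerate(triples):
        for r1, x in successors[h]:        # first leg: (h, r1, x)
            for r2, y in successors[x]:    # second leg: (x, r2, y)
                if y == t:                 # closes the triangle over (h, r, t)
                    counts[(index, r1, r2)] += 1
    return counts


triples = [(0, 0, 2), (0, 1, 1), (1, 2, 2), (0, 1, 3), (3, 2, 2), (0, 3, 1)]
counts = metapath_counts(triples)
```

Only the edge at index 0 supports triangles here: two of metapath (1, 2), through entities 1 and 3, and one of metapath (3, 2).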
- edge_pattern_summary(return_metapath_list=False, filter_relations=[], aggregate_by_r=False, composition_chunk_size=256, composition_workers=3)[source]
Analyse structural properties of each edge in the KG: symmetry, presence of inverse/inference(=parallel) edges and triangles supported on the edge.
The output dataframe maintains the same indexing and ordering of triples as the original Knowledge Graph dataframe.
- Parameters:
return_metapath_list (bool) – If True, return the list of unique metapaths for all triangles supported over each edge. WARNING: very expensive for large graphs.
filter_relations (list[int]) – If not empty, compute the output only for the edges with relation in this list of relation IDs.
aggregate_by_r (bool) – If True, return metrics aggregated by relation type (the output DataFrame will be indexed over relation IDs).
composition_chunk_size (int) – Size of column chunks of sparse adjacency matrix to compute the triangle count. Reduce the parameter if running OOM. Default: 2**8.
composition_workers (int) – Number of workers to compute the triangle count. By default, assigned based on number of available threads (max: 32).
- Return type:
DataFrame
- Returns:
The results dataframe. Contains the following columns (in addition to h, r, t):
is_loop (bool): True if the triple is a loop (h == t).
is_symmetric (bool): True if the triple (t, r, h) is also contained in the graph (assuming t and h are different).
has_inverse (bool): True if the graph contains one or more triples (t, r', h) with r' != r.
n_inverse_relations (int): The number of inverse relations r'.
inverse_edge_types (list): All relations r' (including r if the edge is symmetric) such that (t, r', h) is in the graph.
has_inference (bool): True if the graph contains one or more triples (h, r', t) with r' != r.
n_inference_relations (int): The number of inference relations r'.
inference_edge_types (list): All relations r' (including r) such that (h, r', t) is in the graph.
has_composition (bool): True if the graph contains one or more triangles supported on the edge: (h, r1, x) + (x, r2, t).
n_triangles (int): The number of triangles.
has_undirected_composition (bool): True if the graph contains one or more undirected triangles supported on the edge.
n_undirected_triangles (int): The number of undirected triangles (considering all edges as bidirectional).
metapath_list (list): The list of unique metapaths “r1-r2” for the directed triangles.
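The loop, symmetry, inverse and inference flags can be illustrated with a short plain-Python sketch. The helper name `edge_patterns` is hypothetical and the code follows the column descriptions rather than the library's source. In particular, how the inverse flags behave for loops is an assumption of this sketch.

```python
from collections import defaultdict


def edge_patterns(triples):
    """Loop/symmetry/inverse/inference flags per triple (hypothetical helper)."""
    rels = defaultdict(set)  # (h, t) -> relation types linking h to t
    for h, r, t in triples:
        rels[(h, t)].add(r)

    rows = []
    for h, r, t in triples:
        forward = rels[(h, t)]                 # parallel edges, same direction
        backward = rels.get((t, h), set())     # edges in the opposite direction
        rows.append({
            "is_loop": h == t,
            "is_symmetric": h != t and r in backward,
            "has_inverse": bool(backward - {r}),
            "n_inverse_relations": len(backward - {r}),
            "has_inference": bool(forward - {r}),
            "n_inference_relations": len(forward - {r}),
        })
    return rows


triples = [(0, 0, 1), (1, 0, 0), (0, 1, 1), (2, 0, 3), (3, 1, 2), (4, 0, 4)]
rows = edge_patterns(triples)
```

Here the edge (0, 0, 1) is symmetric because (1, 0, 0) exists, and it has an inference edge because (0, 1, 1) connects the same pair with a different relation type.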
- edge_tail_degree()[source]
For each edge in the KG, compute the number of edges (in total or of the same relation type) with the same tail node.
- Return type:
DataFrame
- Returns:
The result DataFrame, with the same indexing and ordering of triples as the original KG DataFrame, with columns (in addition to h, r, t):
t_unique_rel (int): Number of distinct relation types among edges with tail entity t.
t_degree (int): Number of triples with tail entity t.
t_degree_same_rel (int): Number of triples with tail entity t and relation type r.
- jaccard_similarity_relation_sets()[source]
Compute the similarity between relations defined as the Jaccard Similarity between sets of entities (heads and tails) for all pairs of relations in the graph.
- Return type:
DataFrame
- Returns:
The results dataframe. Contains the following columns:
r1 (int): Index of the first relation.
r2 (int): Index of the second relation.
num_triples_both (int): Number of triples with relation r1/r2.
frac_triples_both (float): Fraction of triples with relation r1/r2.
num_entities_both (int): Number of unique entities (h or t) for triples with relation r1/r2.
num_h_r1 (int): Number of unique head entities for relation r1.
num_h_r2 (int): Number of unique head entities for relation r2.
num_t_r1 (int): Number of unique tail entities for relation r1.
num_t_r2 (int): Number of unique tail entities for relation r2.
jaccard_head_head (float): Jaccard similarity between the head set of r1 and the head set of r2.
jaccard_tail_tail (float): Jaccard similarity between the tail set of r1 and the tail set of r2.
jaccard_head_tail (float): Jaccard similarity between the head set of r1 and the tail set of r2.
jaccard_tail_head (float): Jaccard similarity between the tail set of r1 and the head set of r2.
jaccard_both (float): Jaccard similarity between the full entity set of r1 and r2.
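The five Jaccard columns compare entity sets of two relations. A minimal sketch for a single pair of relations is shown below, assuming the usual definition |A ∩ B| / |A ∪ B|. The helper names `jaccard` and `relation_jaccard` are hypothetical, and the choice to return 0.0 for two empty sets is an assumption of this example.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relation_jaccard(triples, r1, r2):
    """Jaccard scores between the entity sets of r1 and r2 (hypothetical helper)."""
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
    return {
        "jaccard_head_head": jaccard(heads[r1], heads[r2]),
        "jaccard_tail_tail": jaccard(tails[r1], tails[r2]),
        "jaccard_head_tail": jaccard(heads[r1], tails[r2]),
        "jaccard_tail_head": jaccard(tails[r1], heads[r2]),
        "jaccard_both": jaccard(heads[r1] | tails[r1], heads[r2] | tails[r2]),
    }


triples = [(0, 0, 1), (2, 0, 1), (0, 1, 3), (1, 1, 4)]
scores = relation_jaccard(triples, r1=0, r2=1)
```

In this example relation 0 has heads {0, 2} and tails {1}, while relation 1 has heads {0, 1} and tails {3, 4}, so the head sets overlap in one of three entities.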
- loop_count()[source]
For each entity in the KG, compute the number of loops around the entity (i.e., the number of edges having the entity as both head and tail).
- Return type:
DataFrame
- Returns:
Loop count DataFrame, indexed on the IDs of the graph entities.
- node_degree_summary(return_relation_list=False)[source]
For each entity in the KG, compute the number of edges having it as a head (head-degree, or out-degree), as a tail (tail-degree, or in-degree) or one of the two (total-degree). The in-going and out-going relation types are also identified.
The output dataframe is indexed on the IDs of the graph entities.
- Parameters:
return_relation_list (bool) – If True, return the list of unique relations going in/out of an entity. WARNING: expensive for large graphs.
- Return type:
DataFrame
- Returns:
The results dataframe, indexed on the IDs e of the graph entities, with columns:
h_degree (int): Number of triples with head entity e.
t_degree (int): Number of triples with tail entity e.
tot_degree (int): Number of triples with head entity e or tail entity e.
h_unique_rel (int): Number of distinct relation types among edges with head entity e.
h_rel_list (Optional[list]): List of unique relation types among edges with head entity e. Only returned if return_relation_list = True.
t_unique_rel (int): Number of distinct relation types among edges with tail entity e.
t_rel_list (Optional[list]): List of unique relation types among edges with tail entity e. Only returned if return_relation_list = True.
n_loops (int): number of loops around entity e.
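The per-entity columns can be reproduced on a small example with the sketch below. It is a plain-Python illustration of the definitions, not the library's code. The helper name `node_degree_summary_sketch` is hypothetical, and subtracting n_loops in tot_degree (so that a loop is counted once) is this sketch's reading of "head entity e or tail entity e".

```python
from collections import Counter, defaultdict


def node_degree_summary_sketch(triples):
    """Per-entity degree statistics keyed by entity ID (hypothetical helper)."""
    h_degree, t_degree, n_loops = Counter(), Counter(), Counter()
    h_rels, t_rels = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        h_degree[h] += 1
        t_degree[t] += 1
        h_rels[h].add(r)
        t_rels[t].add(r)
        if h == t:
            n_loops[h] += 1

    entities = sorted(set(h_degree) | set(t_degree))
    return {
        e: {
            "h_degree": h_degree[e],
            "t_degree": t_degree[e],
            # A loop appears in both counts; count it once (assumption).
            "tot_degree": h_degree[e] + t_degree[e] - n_loops[e],
            "h_unique_rel": len(h_rels[e]),
            "t_unique_rel": len(t_rels[e]),
            "n_loops": n_loops[e],
        }
        for e in entities
    }


triples = [(0, 0, 0), (0, 1, 1), (2, 0, 1), (1, 0, 0)]
summary = node_degree_summary_sketch(triples)
```

Entity 0 is the head of two triples and the tail of two, but one of them is the loop (0, 0, 0), so only three distinct triples touch it.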
- node_head_degree(return_relation_list=False)[source]
For each entity in the KG, compute the number of edges having it as head (head-degree, or out-degree of the head node). The relation types going out of the head node are also identified.
- Parameters:
return_relation_list (bool) – If True, return the list of unique relations going out of the head node. WARNING: expensive for large graphs. Default: False.
- Return type:
DataFrame
- Returns:
The result DataFrame, indexed on the IDs e of the graph entities, with columns:
h_degree (int): Number of triples with head entity e.
h_unique_rel (int): Number of distinct relation types among edges with head entity e.
h_rel_list (Optional[list]): List of unique relation types among edges with head entity e. Only returned if return_relation_list = True.
- node_tail_degree(return_relation_list=False)[source]
For each entity in the KG, compute the number of edges having it as tail (tail-degree, or in-degree of the tail node). The relation types going into the tail node are also identified.
- Parameters:
return_relation_list (bool) – If True, return the list of unique relation types going into the tail node. WARNING: expensive for large graphs. Default: False.
- Return type:
DataFrame
- Returns:
The result DataFrame, indexed on the IDs e of the graph entities, with columns:
t_degree (int): Number of triples with tail entity e.
t_unique_rel (int): Number of distinct relation types among edges with tail entity e.
t_rel_list (Optional[list]): List of unique relation types among edges with tail entity e. Only returned if return_relation_list = True.
- relational_affinity_ingram(min_max_norm=False)[source]
Compute the similarity between relations based on the approach proposed in InGram: Inductive Knowledge Graph Embedding via Relation Graphs, https://arxiv.org/abs/2305.19987.
Only the pairs of relations with affinity > 0 are shown in the returned dataframe.
- Parameters:
min_max_norm (bool) – If True, apply min-max normalization to the edge weights. Default: False.
- Return type:
DataFrame
- Returns:
The results dataframe. Contains the following columns:
h_relation (int): Index of the head relation.
t_relation (int): Index of the tail relation.
edge_weight (float): Weight for the affinity between the head and the tail relation.