kg_topology_toolbox.utils.aggregate_by_relation

kg_topology_toolbox.utils.aggregate_by_relation(edge_topology_df)

Aggregate the topology metrics of all triples of the same relation type. Apply it to a DataFrame of metrics that has at least the columns h, r, t (e.g., the output of KGTopologyToolbox.edge_degree_cardinality_summary() or KGTopologyToolbox.edge_pattern_summary()).

The returned dataframe is indexed by relation type ID, with columns giving the aggregated statistics of the triples of the corresponding relation. Column names have the form column_name_in_input_df + suffix. Depending on the type of the metric, the aggregation returns:

  • for numerical metrics: mean, standard deviation and quartiles (suffix = “_mean”, “_std”, “_quartile1”, “_quartile2”, “_quartile3”);

  • for boolean metrics: the fraction of triples of the relation type with metric = True (suffix = “_frac”);

  • for string metrics: for each possible label, the fraction of triples of the relation type with that metric value (suffix = “_{label}_frac”);

  • for list metrics: the unique metric values across triples of the relation type (suffix = “_unique”).
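
The aggregation rules above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the toy data and the metric column names (h_degree, is_symmetric, triple_cardinality) are chosen for illustration only, and the use of the sample standard deviation (pandas' default) is an assumption. List metrics are left out.

```python
import pandas as pd

# Toy input; the metric column names and values are illustrative assumptions.
df = pd.DataFrame(
    {
        "h": [0, 0, 1, 2],
        "r": [0, 0, 0, 1],
        "t": [1, 2, 2, 0],
        "h_degree": [2, 2, 1, 1],  # numerical metric
        "is_symmetric": [True, False, False, True],  # boolean metric
        "triple_cardinality": ["1:M", "1:M", "1:1", "1:1"],  # string metric
    }
)

grouped = df.groupby("r")

# Numerical metric: mean, standard deviation and quartiles.
# pandas uses ddof=1 here, so a relation with a single triple gets NaN;
# whether the library behaves the same way is an assumption.
numeric = pd.DataFrame(
    {
        "h_degree_mean": grouped["h_degree"].mean(),
        "h_degree_std": grouped["h_degree"].std(),
        "h_degree_quartile1": grouped["h_degree"].quantile(0.25),
        "h_degree_quartile2": grouped["h_degree"].quantile(0.5),
        "h_degree_quartile3": grouped["h_degree"].quantile(0.75),
    }
)

# Boolean metric: fraction of triples of the relation with metric = True.
boolean = grouped["is_symmetric"].mean().rename("is_symmetric_frac")

# String metric: for each label, the fraction of triples with that value.
string = pd.crosstab(df["r"], df["triple_cardinality"], normalize="index")
string.columns = [f"triple_cardinality_{label}_frac" for label in string.columns]

# One row per relation type ID, one column per (metric, suffix) pair.
sketch = pd.concat([numeric, boolean, string], axis=1)
print(sketch)
```

With the toolbox installed, the corresponding call would be aggregate_by_relation(edge_topology_df) on a DataFrame of per-triple metrics.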

Parameters:

edge_topology_df (DataFrame) – pd.DataFrame of edge topology metrics. Must contain at least three columns h, r, t.

Return type:

DataFrame

Returns:

The results dataframe. In addition to the columns with the metrics aggregated by relation type, it also contains the columns:

  • num_triples (int): Number of triples for each relation type.

  • frac_triples (float): Fraction of overall triples represented by each relation type.

  • unique_h (int): Number of unique head entities used by triples of each relation type.

  • unique_t (int): Number of unique tail entities used by triples of each relation type.
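
As a rough illustration of what these four columns measure, they can be reproduced with plain pandas on a toy set of triples. This is a sketch under the assumption that the input has integer-coded h, r, t columns; it is not the library's code.

```python
import pandas as pd

# Toy triples (h, r, t); values are illustrative only.
triples = pd.DataFrame(
    {"h": [0, 0, 1, 2], "r": [0, 0, 0, 1], "t": [1, 2, 2, 0]}
)

grouped = triples.groupby("r")
summary = pd.DataFrame(
    {
        # Number of triples for each relation type.
        "num_triples": grouped.size(),
        # Share of all triples that belong to each relation type.
        "frac_triples": grouped.size() / len(triples),
        # Distinct head and tail entities used by each relation type.
        "unique_h": grouped["h"].nunique(),
        "unique_t": grouped["t"].nunique(),
    }
)
print(summary)
```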