besskge.dataset.KGDataset

class besskge.dataset.KGDataset(n_entity, n_relation_type, triples, original_triple_ids, entity_dict=None, relation_dict=None, type_offsets=None, neg_heads=None, neg_tails=None)[source]

Represents a complete knowledge graph dataset of (head, relation, tail) triples.

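Parameters:
  • n_entity (int)

  • n_relation_type (int)

  • triples (Dict[str, ndarray[Any, dtype[int32]]])

  • original_triple_ids (Dict[str, ndarray[Any, dtype[int32]]])

  • entity_dict (Optional[List[str]])

  • relation_dict (Optional[List[str]])

  • type_offsets (Optional[Dict[str, int]])

  • neg_heads (Optional[Dict[str, ndarray[Any, dtype[int32]]]])

  • neg_tails (Optional[Dict[str, ndarray[Any, dtype[int32]]]])
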
classmethod build_ogbl_biokg(root)[source]

Build the ogbl-biokg dataset [HFZ+20]

Parameters:

root (Path) – Local path to the dataset. If the dataset is not present in this location, then it is downloaded and stored here.

Return type:

KGDataset

Returns:

The ogbl-biokg KGDataset.

classmethod build_ogbl_wikikg2(root)[source]

Build the ogbl-wikikg2 dataset [HFZ+20]

Parameters:

root (Path) – Local path to the dataset. If the dataset is not present in this location, then it is downloaded and stored here.

Return type:

KGDataset

Returns:

The ogbl-wikikg2 KGDataset.

Build the high-quality version of the OpenBioLink2020 dataset [ASAM20]

Parameters:

root (Path) – Local path to the dataset. If the dataset is not present in this location, then it is downloaded and stored here.

Return type:

KGDataset

Returns:

The HQ OpenBioLink2020 KGDataset.

classmethod build_yago310(root)[source]

Build the YAGO3-10 dataset. This is the subgraph of the YAGO3 knowledge graph [MBS15] containing only entities that have at least 10 relations associated with them. First used in [DPPR18].

Parameters:

root (Path) – Local path to the dataset. If the dataset is not present in this location, then it is downloaded and stored here.

Return type:

KGDataset

Returns:

The YAGO3-10 KGDataset.

entity_dict: Optional[List[str]] = None

Entity labels by ID; str[n_entity]

classmethod from_dataframe(df, head_column, relation_column, tail_column, entity_types=None, split=(0.7, 0.15, 0.15), seed=1234)[source]

Build a KGDataset from a pandas DataFrame of labeled (h,r,t) triples. IDs for entities and relations are automatically assigned based on labels in such a way that entities of the same type have contiguous IDs.

Parameters:
  • df (Union[DataFrame, Dict[str, DataFrame]]) – Pandas DataFrame of all triples in the knowledge graph dataset, or dictionary of DataFrames of triples for each part of the dataset split

  • head_column (Union[int, str]) – Name of the DataFrame column storing head entities

  • relation_column (Union[int, str]) – Name of the DataFrame column storing relations

  • tail_column (Union[int, str]) – Name of the DataFrame column storing tail entities

  • entity_types (Union[Series, Dict[str, str], None]) – If entities have types, a dictionary or pandas Series mapping each entity label to its entity type (both as strings).

  • split (Tuple[float, float, float]) – Tuple to set the train/validation/test split. Only used if no pre-defined dataset split is specified, i.e. if df is not a dictionary.

  • seed (int) – Random seed for the train/validation/test split. Only used if no pre-defined dataset split is specified, i.e. if df is not a dictionary.

Return type:

KGDataset

Returns:

Instance of the KGDataset class.
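
The ID assignment described above (entities of the same type receive contiguous IDs) can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the toy triples, the sort order within a type, and the names entity_to_id and relation_to_id are assumptions made for illustration only.

```python
# Hypothetical toy data: labeled (head, relation, tail) triples
# and a mapping entity label -> entity type.
labeled_triples = [
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "fever"),
    ("aspirin", "interacts_with", "ibuprofen"),
]
entity_types = {
    "aspirin": "drug",
    "ibuprofen": "drug",
    "headache": "disease",
    "fever": "disease",
}

# Sort labels by (type, label) so that each type occupies one contiguous
# block of IDs. The ordering inside a block is an arbitrary choice here.
entity_dict = sorted(entity_types, key=lambda e: (entity_types[e], e))
entity_to_id = {label: i for i, label in enumerate(entity_dict)}

# First ID of each type's block.
type_offsets = {}
for i, label in enumerate(entity_dict):
    type_offsets.setdefault(entity_types[label], i)

# Relations have no types, so any fixed ordering works.
relation_dict = sorted({r for _, r, _ in labeled_triples})
relation_to_id = {label: i for i, label in enumerate(relation_dict)}

# Replace labels with integer IDs.
id_triples = [
    (entity_to_id[h], relation_to_id[r], entity_to_id[t])
    for h, r, t in labeled_triples
]
```

With this layout, entity_dict and relation_dict list labels by ID, and type_offsets records where each type's ID block starts.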

classmethod from_triples(data, split=(0.7, 0.15, 0.15), seed=1234, entity_dict=None, relation_dict=None, type_offsets=None)[source]

Build a dataset from an array of triples in which entity and relation IDs have already been assigned. If entities have types, entities of the same type must have contiguous IDs. Triples are randomly split into train/validation/test sets. The attribute KGDataset.original_triple_ids stores the IDs of the triples in each split with respect to the original ordering in data.

If a pre-defined train/validation/test split is wanted, the KGDataset class should be instantiated manually.

Parameters:
  • data (ndarray[Any, dtype[int32]]) – Numpy array of triples [head_id, relation_id, tail_id]. Shape (num_triples, 3).

  • split (Tuple[float, float, float]) – Tuple to set the train/validation/test split.

  • seed (int) – Random seed for the train/validation/test split.

  • entity_dict (Optional[List[str]]) – Optional entity labels by ID.

  • relation_dict (Optional[List[str]]) – Optional relation labels by ID.

  • type_offsets (Optional[Dict[str, int]]) – If entities have types, offset of each entity type in the entity ID range.

Return type:

KGDataset

Returns:

Instance of the KGDataset class.
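
To make the relationship between the random split and original_triple_ids concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the library's code: the helper random_split, the part names "train"/"valid"/"test", the rounding of the split sizes, and the use of numpy.random.default_rng are all choices made for this example.

```python
import numpy as np


def random_split(data, split=(0.7, 0.15, 0.15), seed=1234):
    """Shuffle row indices and cut them into three parts (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    perm = rng.permutation(n)
    n_train = int(round(n * split[0]))
    n_valid = int(round(n * split[1]))
    # Row indices into the original array, one array per part.
    ids = {
        "train": perm[:n_train],
        "valid": perm[n_train:n_train + n_valid],
        "test": perm[n_train + n_valid:],
    }
    # The triples of each part are the rows selected by those indices.
    parts = {part: data[idx] for part, idx in ids.items()}
    return parts, ids


# Placeholder array of shape (num_triples, 3); the values are not meaningful.
data = np.arange(60, dtype=np.int32).reshape(20, 3)
triples, original_triple_ids = random_split(data)
```

Indexing data with original_triple_ids[part] recovers triples[part], which is the property the attribute is documented to provide.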

property ht_types: Dict[str, ndarray[Any, dtype[int32]]] | None

If entities have types, type IDs of triples’ heads/tails; {part: int32[n_triple, {h_type, t_type}]}
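
When entity IDs are clustered by type, the type of a head or tail can be recovered from its ID and the type offsets alone. The sketch below shows one way to do this with numpy.searchsorted. It assumes that each offset is the first ID of its type and that type IDs follow the order of the offsets; the library may number types differently.

```python
import numpy as np

# Hypothetical offsets: IDs 0-1 are diseases, IDs 2 and above are drugs.
type_offsets = {"disease": 0, "drug": 2}

# Toy triples [head_id, relation_id, tail_id].
triples = np.array([[2, 1, 1], [3, 1, 0], [2, 0, 3]], dtype=np.int32)

# Sorted block starts; position in this array is used as the type ID.
offsets = np.array(sorted(type_offsets.values()))

# For each entity ID, find the last offset that is <= the ID.
h_type = np.searchsorted(offsets, triples[:, 0], side="right") - 1
t_type = np.searchsorted(offsets, triples[:, 2], side="right") - 1

# Shape (n_triple, 2): one (h_type, t_type) pair per triple.
ht_types = np.stack([h_type, t_type], axis=1).astype(np.int32)
```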

classmethod load(path)[source]

Load a KGDataset object saved with KGDataset.save().

Parameters:

path (Path) – Path to saved KGDataset object.

Return type:

KGDataset

Returns:

The saved KGDataset object.

n_entity: int

Number of entities (nodes) in the knowledge graph

n_relation_type: int

Number of relation types (edge labels) in the knowledge graph

neg_heads: Optional[Dict[str, ndarray[Any, dtype[int32]]]] = None

IDs of (possibly triple-specific) negative heads; {part: int32[n_triple or 1, n_neg_heads]}

neg_tails: Optional[Dict[str, ndarray[Any, dtype[int32]]]] = None

IDs of (possibly triple-specific) negative tails; {part: int32[n_triple or 1, n_neg_tails]}
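
The "n_triple or 1" first dimension means negatives can be either triple-specific (one row per triple) or shared by all triples (a single row). A minimal NumPy sketch of how both layouts can be handled uniformly follows; the helper negatives_for and the toy arrays are hypothetical and only illustrate the shape convention.

```python
import numpy as np

n_triple, n_neg = 4, 3

# One row of negatives shared by every triple: shape (1, n_neg).
shared = np.array([[5, 6, 7]], dtype=np.int32)

# A separate row of negatives for each triple: shape (n_triple, n_neg).
specific = np.arange(n_triple * n_neg, dtype=np.int32).reshape(n_triple, n_neg)


def negatives_for(neg, n_triple):
    """Return a read-only (n_triple, n_neg) view of either layout."""
    return np.broadcast_to(neg, (n_triple, neg.shape[1]))


shared_view = negatives_for(shared, n_triple)
specific_view = negatives_for(specific, n_triple)
```

Broadcasting avoids copying the shared row n_triple times.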

original_triple_ids: Dict[str, ndarray[Any, dtype[int32]]]

IDs of the triples in KGDataset.triples with respect to the ordering in the original array/DataFrame from which the triples originate.

relation_dict: Optional[List[str]] = None

Relation type labels by ID; str[n_relation_type]

save(out_file)[source]

Save the dataset to a .pkl file.

Parameters:

out_file (Path) – Path to output file.

Return type:

None
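
For readers unfamiliar with .pkl files, the sketch below shows a generic pickle round trip using only the standard library and NumPy. It does not reproduce the on-disk layout used by save(), which is not specified here; the dictionary toy and the file name are placeholders.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Placeholder object standing in for a dataset; not the real KGDataset.
toy = {
    "n_entity": 4,
    "n_relation_type": 2,
    "triples": {"train": np.array([[2, 1, 1], [3, 1, 0]], dtype=np.int32)},
}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "toy_dataset.pkl"
    # Serialize the object to disk.
    with open(out_file, "wb") as f:
        pickle.dump(toy, f)
    # Read it back into a new object.
    with open(out_file, "rb") as f:
        restored = pickle.load(f)
```

Unpickling can execute arbitrary code, so only load .pkl files from sources you trust.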

triples: Dict[str, ndarray[Any, dtype[int32]]]

List of (h_ID, r_ID, t_ID) triples, for each part of the dataset; {part: int32[n_triple, {h,r,t}]}

type_offsets: Optional[Dict[str, int]] = None

If entities have types, IDs are assumed to be clustered by type; {entity_type: int}