Graph Plotters
Here you can find the different graph loading, storing, and plotting functions in the package.
Graph Generator
- pyBiodatafuse.graph.generator.build_networkx_graph(combined_df: DataFrame, disease_compound=None, pathway_compound=None, homolog_df_list=None) → MultiDiGraph
Construct a NetworkX graph from a pandas DataFrame of genes and their multi-source annotations.
- Parameters:
combined_df – the input DataFrame to be converted into a graph.
disease_compound – the input DataFrame containing disease-compound relationships.
pathway_compound – the input DataFrame containing pathway-compound relationships from KEGG.
homolog_df_list – a list of DataFrames generated by querying homologs.
- Returns:
a NetworkX MultiDiGraph
- Raises:
ValueError – if the target type is not supported.
Graph Savers
- pyBiodatafuse.graph.saver.save_graph(combined_df: DataFrame, combined_metadata: Dict[Any, Any], disease_compound: DataFrame | None = None, graph_name: str = 'combined', graph_dir: str = 'examples/usecases/')
Save the graph to a file.
- Parameters:
combined_df – the input df to be converted into a graph.
combined_metadata – the metadata of the graph.
disease_compound – the input df containing disease-compound relationships.
graph_name – the name of the graph.
graph_dir – the directory to save the graph.
- Returns:
a NetworkX MultiDiGraph
- pyBiodatafuse.graph.saver.save_graph_to_graphml(g: MultiDiGraph, output_path: str)
Convert a NetworkX graph to a Neo4j-compatible GraphML file.
- Parameters:
g – the NetworkX graph object.
output_path – the output path of the graphml file
- pyBiodatafuse.graph.saver.save_graph_to_edgelist(g: MultiDiGraph, output_path: str)
Convert a NetworkX graph to an edgelist file.
- Parameters:
g – the NetworkX graph object.
output_path – the output path of the edgelist file
Cytoscape
Neo4J
- pyBiodatafuse.graph.neo4j.exporter(network, uri, username, password, neo4j_import_folder, network_name: str = 'Network')
Import the network into Neo4j.
- Parameters:
network – NetworkX network
uri – URI for Neo4j database
username – user name for Neo4j database
password – password for Neo4j database
neo4j_import_folder – exact path to neo4j database import folder
network_name – network name given by users
Usage example:
    network = nxGraph
    uri = "neo4j://localhost:7687"
    username = "neo4j"
    password = "biodatafuse"
    neo4j_import_folder = "../../neo4j-community-5.13.0/import/"
    network_name = "Network"
    exporter(
        network, uri, username, password, neo4j_import_folder, network_name
    )
RDF and GraphDB
- class pyBiodatafuse.graph.rdf.rdf.BDFGraph(base_uri: str, version_iri: str | None = None, title: str | None = None, description: str | None = None, author: str | None = None, orcid: str | None = None, creators: List[Dict[str, str]] | None = None)
Main class for a BioDatafuse RDF graph, a subclass of rdflib.Graph.
Initialize a new instance of the class with the provided metadata and URIs.
- Parameters:
base_uri – The base URI for the RDF graph.
version_iri – The version IRI for the RDF graph (optional).
title – The title of the BDF graph (optional).
description – A description of the BDF graph (optional).
author – The author of the BDF graph (optional, use creators for multiple).
orcid – The ORCID identifier for the author (optional, use creators for multiple).
creators – A list of creator dictionaries, each with ‘name’ (required), ‘orcid’ (optional), and ‘url’ (optional) keys.
- generate_rdf(df: DataFrame, metadata: Dict[str, Any], open_only: bool = False) → None
Generate an RDF graph from the provided DataFrame and metadata.
- Parameters:
df – The DataFrame containing the data to be converted into RDF.
metadata – A dictionary containing the metadata information to be added to the RDF graph.
open_only – A flag indicating whether to process only open data. Defaults to False.
- record_datasource(datasource: str, node: URIRef | None = None, interaction_type: str | None = None) → None
Record that data from a specific data source was added to the graph.
This method is called by process_* methods when they successfully add data from a data source to the RDF graph.
- Parameters:
datasource – Name of the data source (e.g., “StringDB”, “Bgee”).
node – Optional URIRef of a node created from this data source.
interaction_type – Optional type of interaction being added.
- process_row(row: Series, i: int, datasources: DataFrame) → None
Process a single row of the DataFrame and update the RDF graph.
- Parameters:
row – A dictionary-like object representing a single row of the DataFrame.
i – An integer representing the index of the row.
datasources – The BDF datasource table.
- collect_disease_data(row: Series) → List[Dict[str, Any]]
Collect disease data from the row.
- Parameters:
row – A dictionary representing a row of data.
- Returns:
A list of collected disease data.
- valid_indices(source_idx: str | None, source_namespace: str | None, target_idx: str | None, target_namespace: str | None) → bool
Check if the row is valid.
This method verifies that none of the provided indices or namespaces are NaN (Not a Number).
- Parameters:
source_idx – The index of the source node.
source_namespace – The namespace of the source node.
target_idx – The index of the target node.
target_namespace – The namespace of the target node.
- Returns:
True if all indices and namespaces are valid (not NaN), False otherwise.
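The NaN check described for valid_indices can be sketched with the standard library alone. This is an illustrative stand-in, not the method's actual body: the names is_missing and valid_indices_sketch are hypothetical, and treating None as missing is an assumption the documentation does not confirm.

```python
import math
from typing import Optional, Union

Value = Optional[Union[str, float]]


def is_missing(value: Value) -> bool:
    """Return True for a float NaN (how pandas marks empty cells).

    Assumption: None is also treated as missing here.
    """
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


def valid_indices_sketch(
    source_idx: Value,
    source_namespace: Value,
    target_idx: Value,
    target_namespace: Value,
) -> bool:
    """Hypothetical stand-in: True only if none of the four values is missing."""
    values = (source_idx, source_namespace, target_idx, target_namespace)
    return not any(is_missing(v) for v in values)


# A complete row passes; a row with a NaN target index does not.
complete = valid_indices_sketch("ENSG00000141510", "Ensembl", "7157", "NCBI Gene")
incomplete = valid_indices_sketch("ENSG00000141510", "Ensembl", float("nan"), "NCBI Gene")
print(complete, incomplete)
```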
- add_gene_node(row: Series) → URIRef | None
Get gene node.
Dynamically creates gene nodes based on the target source. The target source must have a corresponding entry in NODE_URI_PREFIXES.
- Parameters:
row – A series containing the data for a single row. It must include the key “target.source”.
- Returns:
A URIRef for the gene, else None.
- add_compound_node(row: Series) → URIRef | None
Get compound node.
- Parameters:
row – A series containing the data for a single row. It must include the key “target.source”.
- Returns:
A URIRef for the compound, else None.
- process_disease_data(disease_data: List[Dict[str, Any]], id_number: str, source_idx: str, gene_node: URIRef) → None
Process disease data and add to the RDF graph.
- Parameters:
disease_data – List of disease data to be processed.
id_number – Identifier number for the gene.
source_idx – Source index for the data.
gene_node – RDF node representing the gene.
- process_expression_data(expression_data, id_number: str, source_idx: str, gene_node: URIRef) → None
Process gene expression data and add to the RDF graph.
- Parameters:
expression_data – The gene expression and experimental process data.
id_number – The identifier number for the gene.
source_idx – The source index for the data.
gene_node – The RDF node representing the gene.
- process_processes_data(processes_data: List[Dict[str, Any]] | None, gene_node: URIRef) → None
Process Gene Ontology (GO) terms and add to the RDF graph.
- Parameters:
processes_data – A list of GO terms related to a gene.
gene_node – The RDF node representing the gene.
- process_compound_data(compound_data: List[Dict[str, Any]] | None, gene_node: URIRef) → None
Process compound data and add to the RDF graph.
- Parameters:
compound_data – List of compounds to be processed.
gene_node – URIRef of gene node.
- process_pubchem_assay_data(assay_data: List[Dict[str, Any]], gene_node: URIRef) → None
Process PubChem assay data and add to the RDF graph.
- Parameters:
assay_data – List of PubChem assay entries to be processed.
gene_node – URIRef of gene node.
- process_compoundwiki_data(target_node: URIRef, row: Series) → None
Process CompoundWiki annotation data and add to the RDF graph.
- Parameters:
target_node – URIRef of the target node (gene or protein).
row – Data row containing CompoundWiki annotations.
- process_literature_data(literature_based_data: Dict[str, Any] | List[Dict[str, Any]] | None, gene_node: URIRef, id_number: str, source_idx: str, new_uris: Dict[str, str], i: int) → None
Process literature-based data and add to the RDF graph.
- Parameters:
literature_based_data – Data derived from literature sources. Can be a single entry or a list of entries.
gene_node – The gene node to which the literature-based data will be added.
id_number – Unique identifier for the expression data.
source_idx – Identifier for the source of the expression data.
new_uris – Node URIs for the graph.
i – Row index.
- process_transporter_inhibitor_data(gene_node, transporter_inhibitor_data: List[Dict[str, Any]] | None) → None
Process transporter inhibitor data and add to the RDF graph.
- Parameters:
gene_node – An RDF node representing the gene.
transporter_inhibitor_data – A list of transporter inhibitor data entries to be processed.
- process_inhibitor_transporter_data(compound_node, inhibitor_transporter_data: List[Dict[str, Any]] | None) → None
Process inhibitor transporter data and add to the RDF graph.
- Parameters:
compound_node – An RDF node representing the compound.
inhibitor_transporter_data – A list of inhibitor transporter data entries to be processed.
- process_pathways(row: Series, identifier_node: URIRef, protein_nodes: List[URIRef]) → None
Process pathway data and add to the RDF graph.
This method processes pathway data from various sources and adds the relevant information to the RDF graph. It creates pathway nodes and establishes relationships between gene nodes, protein nodes, and pathway nodes.
- Parameters:
row – A dictionary containing pathway data from different sources.
identifier_node – An RDF node representing the identifier.
protein_nodes – A list of RDF nodes representing proteins associated with the identifier.
- process_molecular_pathway(molecular_data, identifier, id_number) → None
Process molecular pathway data and add to the RDF graph.
- Parameters:
molecular_data – A list of dicts containing pathway data.
identifier – An RDF node representing the gene or compound in the row.
id_number – The identifier number for the row.
- process_ppi_data(stringdb_data: List[Dict[str, Any]] | None, gene_node: URIRef) → None
Process protein-protein interaction (PPI) data and add to the RDF graph.
- Parameters:
stringdb_data – List of dictionaries containing PPI data from the STRING database.
gene_node – The gene URIRef.
- process_aop_data(aop_data: List[Dict[str, Any]] | None = None, gene_node: URIRef | None = None, compound_node: URIRef | None = None) → None
Process AOP-Wiki data and add to the RDF graph.
- Parameters:
aop_data – List of dictionaries containing AOP data. Defaults to None.
gene_node – The gene URIRef. Defaults to None.
compound_node – The compound URIRef. Defaults to None.
- shex(path: str | None = None, threshold: float = 0.001, uml_figure_path: str | None = None, print_string_output: bool = True, additional_namespaces: Dict[str, str] | None = None) → Any
Get ShEx shapes with optional parameters.
- Parameters:
path – Path to save the ShEx results.
threshold – Validation threshold.
uml_figure_path – Path to save UML diagram for shapes.
print_string_output – Whether to print the output string.
additional_namespaces – Additional namespaces for shapes.
- Returns:
ShEx graph result.
- shacl(path: str | None = None, threshold: float = 0.001, uml_figure_path: str | None = None, print_string_output: bool = True, additional_namespaces: Dict[str, str] | None = None) → Any
Get SHACL shapes with optional parameters.
- Parameters:
path – Path to save the SHACL results.
threshold – Validation threshold.
uml_figure_path – Path to save UML diagram for shapes.
print_string_output – Whether to print the output string.
additional_namespaces – Additional namespaces for shapes.
- Returns:
SHACL graph result.
- shacl_prefixes(path: str | None = None, namespaces: Dict[str, str] | None = None, print_string_output: bool = False) → Any
Get a SHACL prefixes graph, optionally binding additional namespaces to it.
- Parameters:
path – Path to save the SHACL prefixes.
namespaces – Namespaces for the prefixes.
print_string_output – Whether to print the generated TTL as a string.
- Returns:
SHACL prefixes.
- class pyBiodatafuse.graph.rdf.graphdb.GraphDBManager
A class to manage GraphDB repositories via REST API.
- static create_repository(base_url: str, repository_name: str = 'default', username: str | None = None, password: str | None = None)
Create a new repository in GraphDB.
- Parameters:
base_url – The base URL of the GraphDB instance.
repository_name – The name of the repository to create.
username – The username for authentication.
password – The password for authentication.
- static list_repositories(base_url: str, username: str | None = None, password: str | None = None)
List all repositories in the GraphDB instance.
- Parameters:
base_url – The base URL of the GraphDB instance.
username – Optional username for authentication.
password – Optional password for authentication.
- Returns:
List of repositories as JSON.
- static get_repository_info(base_url: str, repository_id: str, username: str | None = None, password: str | None = None)
Retrieve detailed information about a specific repository.
- Parameters:
base_url – The base URL of the GraphDB instance.
repository_id – The ID of the repository.
username – Optional username for authentication.
password – Optional password for authentication.
- Returns:
Repository information as JSON.
- Raises:
HTTPError – If the request to retrieve repository info fails.
- static count_triples(base_url: str, repository_id: str, username: str | None = None, password: str | None = None)
Count the number of triples in a repository.
- Parameters:
base_url – The base URL of the GraphDB instance.
repository_id – The ID of the repository.
username – Optional username for authentication.
password – Optional password for authentication.
- Returns:
Number of triples in the repository.
- static restart_repository(base_url: str, repository_id: str, username: str | None = None, password: str | None = None)
Restart a specific repository.
- Parameters:
base_url – The base URL of the GraphDB instance.
repository_id – The ID of the repository.
username – Optional username for authentication.
password – Optional password for authentication.
- static delete_repository(base_url: str, repository_id: str, username: str | None = None, password: str | None = None)
Delete a specific repository.
- Parameters:
base_url – The base URL of the GraphDB instance.
repository_id – The ID of the repository.
username – Optional username for authentication.
password – Optional password for authentication.
- static upload_to_graphdb(base_url: str, repository_id: str, username: str, password: str, bdf_graph, file_format: str = 'turtle')
Upload an RDF graph to a GraphDB repository.
- Parameters:
base_url – The base URL of the GraphDB instance.
repository_id – The ID of the repository to upload the graph to.
username – The username for authentication.
password – The password for authentication.
bdf_graph – The RDF graph to upload.
file_format – The format of the RDF graph (default is “turtle”).
- Raises:
HTTPError – If the upload request fails.
- static query_graphdb(base_url: str, repository_name: str, username: str, password: str, query: str, response_format: str = 'json')
Execute a SPARQL query on a GraphDB repository.
- Parameters:
base_url – The base URL of the GraphDB instance.
repository_name – The name of the repository to query.
username – The username for authentication.
password – The password for authentication.
query – The SPARQL query to execute.
response_format – The format of the query response (default is “json”).
- Returns:
Query results as a dictionary or pandas DataFrame.
- Raises:
HTTPError – If the request to execute the query fails.
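To make the REST interaction concrete, here is a minimal sketch of how a SPARQL query could be addressed to one repository. It assumes the RDF4J-style endpoint layout (`/repositories/<id>` with a `query` parameter) that GraphDB exposes; whether GraphDBManager builds its requests this way is not stated here. The helper build_sparql_request is hypothetical, the localhost URL is a placeholder, authentication is omitted, and the request is only built, never sent.

```python
import urllib.parse
import urllib.request


def build_sparql_request(base_url: str, repository_name: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a SPARQL GET request for one repository."""
    # Assumed RDF4J-style layout: <base_url>/repositories/<repository id>
    repo = urllib.parse.quote(repository_name, safe="")
    endpoint = f"{base_url.rstrip('/')}/repositories/{repo}"
    url = f"{endpoint}?{urllib.parse.urlencode({'query': query})}"
    # Ask for SPARQL results serialised as JSON.
    return urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})


# A query that counts every triple, similar in spirit to count_triples.
count_query = "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"
request = build_sparql_request("http://localhost:7200", "default", count_query)
print(request.full_url)
print(request.get_header("Accept"))
```

Sending the request (for example with urllib.request.urlopen) requires a running GraphDB instance and, where security is enabled, credentials.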
- pyBiodatafuse.graph.rdf.metadata.add_creator_node(g: Graph, graph_resource: URIRef, name: str, orcid: str | None = None, url: str | None = None) → URIRef
Create and add a properly modeled creator node to the graph.
Models the creator as both foaf:Person and schema:Person with foaf:name. If an ORCID is provided, uses it as the node URI. Otherwise uses a provided URL or generates a blank node.
- Parameters:
g – The RDF graph to which the creator node is added.
graph_resource – The graph resource URI to link the creator to.
name – The name of the creator.
orcid – The ORCID URL of the creator (e.g., ‘https://orcid.org/0000-0001-2345-6789’).
url – An alternative URL to identify the creator (if no ORCID provided).
- Returns:
The URIRef of the created person node.
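The identifier precedence described for add_creator_node (ORCID first, then URL, then a blank node) can be sketched as follows. This only illustrates the decision order; the function choose_creator_identifier is hypothetical, and it returns plain strings rather than rdflib terms so that it runs without third-party packages.

```python
import uuid
from typing import Optional, Tuple


def choose_creator_identifier(
    orcid: Optional[str] = None, url: Optional[str] = None
) -> Tuple[str, str]:
    """Return (kind, identifier) following the ORCID > URL > blank node order."""
    if orcid:
        return ("uri", orcid)
    if url:
        return ("uri", url)
    # A blank node has no global identity; a random local label stands in here.
    return ("bnode", f"_:b{uuid.uuid4().hex}")


with_orcid = choose_creator_identifier(
    orcid="https://orcid.org/0000-0001-2345-6789", url="https://example.org/people/jane"
)
with_url = choose_creator_identifier(url="https://example.org/people/jane")
anonymous = choose_creator_identifier()
print(with_orcid, with_url, anonymous[0])
```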
- pyBiodatafuse.graph.rdf.metadata.add_metadata(g: Graph, graph_uri: str, metadata: dict, version_iri: str | None = None, title: str | None = None, description: str | None = None, author: str | None = None, orcid: str | None = None, creators: List[Dict[str, str]] | None = None)
Add metadata to the RDF graph, including creation date, version, title, and creators.
- Parameters:
g – The RDF graph to which metadata is added.
graph_uri – URI identifying the RDF graph.
metadata – Combined metadata for a BioDatafuse query.
version_iri – Version IRI to add (optional).
title – Title of the graph (optional).
description – Description of the graph (optional).
author – Author’s name (optional, deprecated - use creators instead).
orcid – Author’s ORCID (optional, deprecated - use creators instead).
creators – List of creator dictionaries with ‘name’ (required), ‘orcid’ (optional), and ‘url’ (optional) keys.
RDF Utilities
- pyBiodatafuse.graph.rdf.utils.replace_na_none(item)
Replace occurrences of NA values (such as ‘na’, ‘nan’, ‘none’) with None.
- Parameters:
item – Item to process. Can be a string, float, list, dict, or numpy array.
- Returns:
Processed item with NA values replaced by None.
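A rough sketch of this kind of recursive NA replacement is shown below. It is not the package's implementation: replace_na_none_sketch is a hypothetical name, the case-insensitive and whitespace-tolerant matching is an assumption, and numpy arrays are left out so that the example needs only the standard library.

```python
import math
from typing import Any

# NA markers named in the description above; case-insensitive matching is assumed.
NA_STRINGS = {"na", "nan", "none"}


def replace_na_none_sketch(item: Any) -> Any:
    """Recursively replace NA-like strings and float NaN with None."""
    if isinstance(item, str):
        return None if item.strip().lower() in NA_STRINGS else item
    if isinstance(item, float):
        return None if math.isnan(item) else item
    if isinstance(item, list):
        return [replace_na_none_sketch(value) for value in item]
    if isinstance(item, dict):
        return {key: replace_na_none_sketch(value) for key, value in item.items()}
    return item  # any other type is returned unchanged


record = {"name": "aspirin", "score": float("nan"), "ids": ["CHEMBL25", "NaN", "none"]}
cleaned = replace_na_none_sketch(record)
print(cleaned)
```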
- pyBiodatafuse.graph.rdf.utils.extract_curie(prefix, identifier)
Generate a CURIE by normalizing a prefix and identifier.
- Parameters:
prefix – Prefix string, such as a registry identifier.
identifier – Identifier to be appended to the prefix.
- Returns:
Normalized CURIE or None if normalization fails.
- pyBiodatafuse.graph.rdf.utils.construct_uri(base_uri, identifier)
Construct a URIRef from a base URI and an identifier.
- Parameters:
base_uri – Base URI string for the RDF resource.
identifier – Identifier to append to the base URI.
- Returns:
A URIRef representing the constructed URI.
- pyBiodatafuse.graph.rdf.utils.add_data_source_node(g: Graph, source: str) → URIRef
Create and add a data source node to the RDF graph.
Uses DCAT Dataset and VoID Dataset types, aligned with dataset_provenance.py.
- Parameters:
g – RDF graph to which the data source node will be added.
source – String containing the name of the source of the data.
- Returns:
URIRef for the created data source node.
- pyBiodatafuse.graph.rdf.utils.get_shapes(g, base_uri, path, threshold=1e-09, graph_type='shex', uml_figure_path=None, print_string_output=True, additional_namespaces=None)
Use shexer (https://github.com/DaniFdezAlvarez/shexer) on the BDF graph to generate ShEx or SHACL shapes.
- Parameters:
g – RDF graph to generate shapes from.
base_uri – The graph IRI to be added to the shaper namespaces.
path – Relative path in which the graph TTL will be saved, if provided.
threshold – Float in the range [0, 1] used to accept shapes based on frequency.
graph_type – “shex” or “shacl”, to specify which graph type to generate.
uml_figure_path – Path (str) where the generated UML is stored.
print_string_output – Whether to print the generated TTL as a string.
additional_namespaces – Dictionary containing {namespace: prefix} pairs.
- Raises:
ValueError – If the graph type is not a valid string or not in [‘shex’, ‘shacl’].
- Returns:
shaper shex or shacl graph
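The two documented behaviours of get_shapes, rejecting an unknown graph_type and accepting shapes by frequency, can be illustrated with a small sketch. Both helpers below are hypothetical, the inclusive comparison against the threshold is an assumption, and the frequencies are made-up illustration values rather than output from shexer.

```python
from typing import Dict

VALID_GRAPH_TYPES = ("shex", "shacl")


def check_graph_type(graph_type: object) -> str:
    """Raise ValueError unless graph_type is the string 'shex' or 'shacl'."""
    if not isinstance(graph_type, str) or graph_type not in VALID_GRAPH_TYPES:
        raise ValueError(f"graph_type must be one of {VALID_GRAPH_TYPES}, got {graph_type!r}")
    return graph_type


def accept_constraints(frequencies: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Keep constraints whose observed frequency reaches the threshold (assumed inclusive)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return {name: freq for name, freq in frequencies.items() if freq >= threshold}


# Made-up frequencies: the fraction of nodes of one class that carry each property.
observed = {"rdfs:label": 0.98, "skos:altLabel": 0.0005}
kept = accept_constraints(observed, threshold=0.001)
print(check_graph_type("shacl"), kept)
```

Under this reading, the much lower default here (1e-09) keeps nearly every observed constraint, while the 0.001 default of the shex and shacl methods drops constraints seen in fewer than 0.1% of instances.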
- pyBiodatafuse.graph.rdf.utils.get_shacl_prefixes(namespaces, path, new_uris, print_string_output)
Generate SHACL prefix declarations and save them in Turtle format.
- Parameters:
namespaces – Optional dictionary of prefix to namespace URI mappings to include in the SHACL declarations.
path – Optional path to a file where the Turtle data will be written. If not provided, the data is not written to disk.
new_uris – Dictionary of prefix to namespace URI mappings to include in the SHACL declarations.
print_string_output – Whether to print the generated TTL as a string.
- Returns:
An RDFLib Graph containing the SHACL prefix declarations.
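For readers unfamiliar with SHACL prefix declarations, the sketch below shows what such declarations look like in Turtle, using the sh:declare, sh:prefix, and sh:namespace terms from the SHACL vocabulary. It builds a plain string with the standard library instead of an RDFLib Graph. The function name shacl_prefix_turtle and the example.org subject IRI are hypothetical, and the exact triples emitted by get_shacl_prefixes may differ.

```python
from typing import Dict


def shacl_prefix_turtle(
    namespaces: Dict[str, str], subject: str = "https://example.org/prefixes"
) -> str:
    """Render prefix-to-namespace mappings as SHACL sh:declare statements in Turtle."""
    lines = [
        "@prefix sh: <http://www.w3.org/ns/shacl#> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<{subject}> a owl:Ontology",
    ]
    # One sh:declare blank node per prefix, sorted for stable output.
    for prefix, namespace in sorted(namespaces.items()):
        lines.append(
            f'    ; sh:declare [ sh:prefix "{prefix}" ; sh:namespace "{namespace}"^^xsd:anyURI ]'
        )
    lines.append("    .")
    return "\n".join(lines)


turtle_text = shacl_prefix_turtle(
    {
        "obo": "http://purl.obolibrary.org/obo/",
        "sio": "http://semanticscience.org/resource/",
    }
)
print(turtle_text)
```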
- pyBiodatafuse.graph.rdf.utils.get_node_label(g, node)
Retrieve the label of a given node from an RDF graph.
- Parameters:
g – The RDF graph containing the data.
node – The node whose label is to be retrieved.
- Returns:
The label of the node if it exists, otherwise None.
- pyBiodatafuse.graph.rdf.utils.discover_prefixes_from_graph(g: Graph) → dict
Discover namespace prefixes from all URIs in a graph using bioregistry.
This function collects all URIs from subjects, predicates, and objects in the graph, then uses bioregistry to identify prefixes and their corresponding namespace URIs.
- Parameters:
g – The RDF graph to analyze.
- Returns:
Dictionary mapping prefix names to namespace URIs.
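The URI collection step can be approximated without bioregistry by splitting each URI at its last "#" or "/". This is a deliberately naive substitute for the registry lookup described above, shown only to illustrate the idea of scanning subjects, predicates, and objects. Triples are represented as plain string tuples, and both helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def split_namespace(uri: str) -> str:
    """Return everything up to and including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1] if cut >= 0 else uri


def count_namespaces(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, int]:
    """Count how often each candidate namespace occurs across all triple terms."""
    counts: Counter = Counter()
    for triple in triples:
        for term in triple:
            # Only terms that look like HTTP(S) URIs are counted; literals are skipped.
            if term.startswith(("http://", "https://")):
                counts[split_namespace(term)] += 1
    return dict(counts)


triples = [
    (
        "http://identifiers.org/ensembl/ENSG00000141510",
        "http://www.w3.org/2000/01/rdf-schema#label",
        "TP53",
    ),
    (
        "http://identifiers.org/ensembl/ENSG00000012048",
        "http://www.w3.org/2000/01/rdf-schema#label",
        "BRCA1",
    ),
]
namespace_counts = count_namespaces(triples)
print(namespace_counts)
```

A registry-backed lookup additionally maps each namespace to a conventional prefix name, which this split-based approach cannot do.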