Data sources

This page lists the data sources that can be queried with the pyBiodatafuse package.

Bgee

Bgee <https://www.bgee.org/> is a database for retrieval and comparison of gene expression patterns across multiple animal species.

pyBiodatafuse.annotators.bgee.get_gene_expression(bridgedb_df: DataFrame)

Query gene-tissue expression information from Bgee with SPARQL.

Parameters: bridgedb_df – BridgeDb output for creating the list of gene ids to query
Returns: a DataFrame containing the Bgee output and dictionary of the Bgee metadata.
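
Most annotators on this page follow the pattern shown by get_gene_expression: they take the BridgeDb output and return a DataFrame together with a metadata dictionary. The sketch below shows one way to collect both from several annotators. The helper name run_annotators is hypothetical and not part of pyBiodatafuse. It assumes that each annotator can be called with bridgedb_df alone and returns a two-element tuple, which the Bgee signature above does not state explicitly.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def run_annotators(
    bridgedb_df: Any,
    annotators: Iterable[Callable[[Any], Tuple[Any, dict]]],
) -> Tuple[Dict[str, Any], List[dict]]:
    """Call each annotator on the same BridgeDb output.

    Hypothetical helper (not part of pyBiodatafuse). Assumes every annotator
    returns a (DataFrame, metadata dict) pair.
    """
    outputs: Dict[str, Any] = {}
    metadata: List[dict] = []
    for annotator in annotators:
        annotated_df, annotator_metadata = annotator(bridgedb_df)
        outputs[annotator.__name__] = annotated_df  # keyed by function name
        metadata.append(annotator_metadata)
    return outputs, metadata
```

With pyBiodatafuse installed, a call could look like run_annotators(bridgedb_df, [bgee.get_gene_expression, molmedb.get_gene_compound_inhibitor]) after importing bgee and molmedb from pyBiodatafuse.annotators.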

DisGeNET

DisGeNET <https://www.disgenet.org/home/> is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases.

pyBiodatafuse.annotators.disgenet.get_gene_disease(api_key: str, bridgedb_df: DataFrame) → Tuple[DataFrame, dict]

Query gene-disease associations from DisGeNET.

Parameters:

api_key – DisGeNET API key (more details can be found at https://disgenet.com/plans)
bridgedb_df – BridgeDb output for creating the list of gene ids to query.

Returns:

a DataFrame containing the DisGeNET output and dictionary of the DisGeNET metadata.
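
Because get_gene_disease needs an API key, it is usually safer to read the key from the environment than to hard-code it. The sketch below shows one way to do that. The environment variable name DISGENET_API_KEY and both helper names are assumptions made for illustration; pyBiodatafuse does not prescribe them. Only the call to get_gene_disease uses the signature documented above.

```python
import os


def load_disgenet_api_key(env_var: str = "DISGENET_API_KEY") -> str:
    """Read the DisGeNET API key from an environment variable.

    The variable name is an assumption for this sketch; pyBiodatafuse does
    not prescribe one.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your DisGeNET API key.")
    return api_key


def annotate_with_disgenet(bridgedb_df):
    """Query DisGeNET (needs pyBiodatafuse installed and network access)."""
    # Imported lazily so that load_disgenet_api_key works without the package.
    from pyBiodatafuse.annotators import disgenet

    return disgenet.get_gene_disease(
        api_key=load_disgenet_api_key(),
        bridgedb_df=bridgedb_df,
    )
```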

MolMeDB

MolMeDB <https://molmedb.upol.cz/detail/intro> is an open chemistry database about interactions of molecules with membranes.

pyBiodatafuse.annotators.molmedb.get_gene_compound_inhibitor(bridgedb_df: DataFrame) → Tuple[DataFrame, dict]

Query MolMeDB for inhibitors of transporters encoded by genes in input.

Parameters: bridgedb_df – BridgeDb output for creating the list of gene ids to query
Returns: a DataFrame containing the MolMeDB output and dictionary of the MolMeDB metadata.

pyBiodatafuse.annotators.molmedb.get_compound_gene_inhibitor(bridgedb_df: DataFrame) → Tuple[DataFrame, dict]

Query MolMeDB for transporters inhibited by the input molecules.

Parameters: bridgedb_df – BridgeDb output for creating the list of compound ids to query.
Returns: a DataFrame containing the MolMeDB output and dictionary of the MolMeDB metadata.

OpenTargets

OpenTargets database uses human genetics and genomics data for systematic drug target identification and prioritisation.

pyBiodatafuse.annotators.opentargets.get_gene_go_process(bridgedb_df: DataFrame) → Tuple[DataFrame, dict]

Get information about GO pathways associated with genes of interest.

Parameters: bridgedb_df – BridgeDb output for creating the list of gene ids to query
Returns: a DataFrame containing the OpenTargets output and dictionary of the query metadata.

pyBiodatafuse.annotators.opentargets.get_gene_reactome_pathways(bridgedb_df: DataFrame) → Tuple[DataFrame, dict]

Get information about Reactome pathways associated with a gene.

Parameters: bridgedb_df – BridgeDb output for creating the list of gene ids to query
Returns: a DataFrame containing the OpenTargets output and dictionary of the query metadata.

pyBiodatafuse.annotators.opentargets.get_gene_compound_interactions(bridgedb_df: DataFrame, cache_pubchem_cid: bool = True) → Tuple[DataFrame, dict]

Get information about drugs associated with genes of interest.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query
cache_pubchem_cid – whether to cache the PubChem CID for the ChEMBL ID

Returns:

a DataFrame containing the OpenTargets output and dictionary of the query metadata.

pyBiodatafuse.annotators.opentargets.get_disease_compound_interactions(bridgedb_df: DataFrame, cache_pubchem_cid: bool = False) → Tuple[DataFrame, dict]

Get information about drugs associated with diseases of interest.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query.
cache_pubchem_cid – If True, the PubChem CID will be cached for future use.

Returns:

a DataFrame containing the OpenTargets output and dictionary of the query metadata.

StringDB

StringDB <https://string-db.org/> aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions.

pyBiodatafuse.annotators.stringdb.get_ppi(bridgedb_df: DataFrame, species: str = 'human') → Tuple[DataFrame, Dict[str, Any]]

Annotate genes with protein-protein interactions from STRING-DB.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query
species – The species to query. (Try ‘Homo sapiens’ if ‘human’ is not working.)

Returns:

a tuple (DataFrame containing the StringDB output, metadata dictionary)
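
The note on the species parameter suggests retrying with ‘Homo sapiens’ when ‘human’ does not work. The retry can be sketched as below. The helper query_with_species_fallback is hypothetical. It assumes that a species name that is not working surfaces as a raised exception, which may not match the actual behaviour of get_ppi (it could, for example, return an empty DataFrame instead).

```python
from typing import Any, Callable, Sequence, Tuple


def query_with_species_fallback(
    query_fn: Callable[..., Tuple[Any, dict]],
    bridgedb_df: Any,
    species_names: Sequence[str] = ("human", "Homo sapiens"),
) -> Tuple[Any, dict]:
    """Try each species name in turn and return the first successful result.

    Hypothetical helper. Assumes a failing species name raises an exception.
    """
    if not species_names:
        raise ValueError("species_names must not be empty")
    last_error = None
    for species in species_names:
        try:
            return query_fn(bridgedb_df=bridgedb_df, species=species)
        except Exception as error:  # broad on purpose: the failure type is unknown
            last_error = error
    raise RuntimeError(f"No species name worked: {list(species_names)}") from last_error
```

With pyBiodatafuse installed, query_fn would be stringdb.get_ppi imported from pyBiodatafuse.annotators.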

Wikidata

Wikidata <https://www.wikidata.org/> acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.

pyBiodatafuse.annotators.wikidata.get_gene_cellular_component(bridgedb_df: DataFrame)

Get cellular component information and Wikidata identifiers for a gene’s encoded protein.

Parameters: bridgedb_df – BridgeDb output for creating the list of gene ids to query
Returns: a DataFrame containing the Wikidata output and dictionary of the query metadata.

WikiPathways

WikiPathways <https://www.wikipathways.org/> is an open science platform for biological pathways contributed, updated, and used by the research community.

pyBiodatafuse.annotators.wikipathways.get_gene_wikipathways(bridgedb_df: DataFrame, query_interactions: bool = False, organism: str = 'Homo sapiens') → DataFrame

Query WikiPathways for pathways associated with genes.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query
query_interactions – Set whether to retrieve gene part_of pathways relationships (False) or all molecular interactions (True).
organism – The organism to query. Default is “Homo sapiens”.

Returns:

a DataFrame containing the WikiPathways output and dictionary of the WikiPathways metadata.

MINERVA

MINERVA is a standalone webserver for visual exploration, analysis and management of molecular networks encoded in systems biology formats.

pyBiodatafuse.annotators.minerva.get_minerva_components(map_name: str, get_elements: bool | None = True, get_reactions: bool | None = True) → Tuple[str, dict]

Get information about MINERVA components from a specific project.

Parameters:

map_name – MINERVA map name. The extensive list can be found at https://minerva-net.lcsb.uni.lu/table.html.
get_elements – boolean to get elements of the chosen diagram
get_reactions – boolean to get reactions of the chosen diagram

Returns:

a tuple of map endpoint and dictionary containing:

- ‘map_elements’ contains a list for each of the pathways in the model. Those lists provide information about Compartment, Complex, Drug, Gene, Ion, Phenotype, Protein, RNA and Simple molecules involved in that pathway.
- ‘map_reactions’ contains a list for each of the pathways in the model. Those lists provide information about the reactions involved in that pathway.
- ‘models’ is a list containing pathway-specific information for each of the pathways in the model.

Raises:

ValueError – if the provided map_name is not valid.
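
The returned dictionary can be inspected with a small summary helper, sketched below. Both function names are hypothetical. The sketch assumes that ‘map_elements’ and ‘map_reactions’ are sequences holding one list per pathway, as the description above suggests, and treats missing keys as empty (for example, if elements or reactions were not requested).

```python
from typing import Any, Dict


def summarize_minerva_components(components: Dict[str, Any]) -> Dict[str, int]:
    """Count entries under the keys described above.

    Assumes 'map_elements' and 'map_reactions' hold one list per pathway;
    keys that are absent are counted as empty.
    """
    return {
        "n_models": len(components.get("models", [])),
        "n_elements": sum(len(group) for group in components.get("map_elements", [])),
        "n_reactions": sum(len(group) for group in components.get("map_reactions", [])),
    }


def fetch_minerva_summary(map_name: str):
    """Fetch one map and summarise it (needs pyBiodatafuse and network access)."""
    # Imported lazily so that the summary helper works without the package.
    from pyBiodatafuse.annotators import minerva

    try:
        map_endpoint, components = minerva.get_minerva_components(map_name=map_name)
    except ValueError as error:
        # Documented above: raised when map_name is not valid.
        raise ValueError(f"Unknown MINERVA map name: {map_name!r}") from error
    return map_endpoint, summarize_minerva_components(components)
```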

pyBiodatafuse.annotators.minerva.get_gene_pathways(bridgedb_df: DataFrame, map_name: str, get_elements: bool | None = True, get_reactions: bool | None = True) → Tuple[DataFrame, dict]

Get information about MINERVA pathways associated with a gene.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query
map_name – name of the map you want to retrieve the information from. The extensive list can be found at https://minerva-net.lcsb.uni.lu/table.html.
get_elements – boolean to get elements of the chosen diagram.
get_reactions – boolean to get reactions of the chosen diagram.

Returns:

a tuple containing a DataFrame with the MINERVA output and a dictionary of the MINERVA metadata.

AOP-Wiki

AOP-Wiki is a database for Adverse Outcome Pathways (AOPs), which describe mechanistic information about the linkage between a molecular initiating event and an adverse outcome.

pyBiodatafuse.annotators.aopwiki.get_aops_gene(bridgedb_df: DataFrame, pathway: bool = False) → Tuple[DataFrame, dict]

Query for AOPs associated with genes from AOP Wiki RDF.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query
pathway – if True, retrieve full pathway information including upstream/downstream key events. If False (default), retrieve simplified AOP information.

Returns:

a DataFrame containing the AOP Wiki RDF output and dictionary of the AOP Wiki RDF metadata

pyBiodatafuse.annotators.aopwiki.get_aops_compound(bridgedb_df: DataFrame, pathway: bool = False) → Tuple[DataFrame, dict]

Query for AOPs associated with compounds from AOP Wiki RDF.

Parameters:

bridgedb_df – BridgeDb output for creating the list of compound ids to query
pathway – if True, retrieve full pathway information including upstream/downstream key events. If False (default), retrieve simplified AOP information.

Returns:

a DataFrame containing the AOP Wiki RDF output and dictionary of the AOP Wiki RDF metadata

pyBiodatafuse.annotators.aopwiki.get_aops(bridgedb_df: DataFrame, pathway: bool = False) → Tuple[DataFrame, dict]

Query for AOPs associated with genes or compounds.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene/compound ids to query
pathway – if True, retrieve full pathway information including upstream/downstream key events. If False (default), retrieve simplified AOP information.

Raises:

ValueError – if the input identifiers are not recognized or if they are not admitted gene or compound identifiers

Returns:

a DataFrame containing the AOP Wiki RDF output and dictionary of the AOP Wiki RDF metadata

IntAct

IntAct provides a freely available, open source database system and analysis tools for molecular interaction data.

pyBiodatafuse.annotators.intact.get_gene_interactions(bridgedb_df: DataFrame, interaction_type: str = 'both')

Annotate genes with interaction data from IntAct.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query.
interaction_type – Either ‘gene_gene’, ‘gene_compound’ or ‘both’. If the input is ‘both’, ‘gene_gene’ and ‘gene_compound’ will be queried.

Raises:

ValueError – If an invalid interaction_type is provided.

Returns:

a tuple (DataFrame containing the IntAct output, metadata dictionary)

pyBiodatafuse.annotators.intact.get_compound_interactions(bridgedb_df: DataFrame, interaction_type: str = 'both')

Annotate compounds with interaction data from IntAct.

Parameters:

bridgedb_df – BridgeDb output for creating the list of compound ids to query.
interaction_type – Either ‘compound_compound’, ‘compound_gene’ or ‘both’. If the input is ‘both’, ‘compound_compound’ and ‘compound_gene’ will be queried.

Raises:

ValueError – If an invalid interaction_type is provided.

Returns:

a tuple (DataFrame containing the IntAct output, metadata dictionary)
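
Both IntAct functions accept a different set of interaction_type values, so it is easy to pass a value meant for the other function. A client-side check can catch this before any query is sent. The sketch below uses the values listed above; the constant and helper names are hypothetical, and the library's own validation may be worded or implemented differently.

```python
# Allowed values as listed in the parameter descriptions above.
GENE_INTERACTION_TYPES = ("gene_gene", "gene_compound", "both")
COMPOUND_INTERACTION_TYPES = ("compound_compound", "compound_gene", "both")


def check_interaction_type(interaction_type: str, allowed: tuple) -> str:
    """Return interaction_type unchanged if allowed, otherwise raise ValueError."""
    if interaction_type not in allowed:
        raise ValueError(
            f"Invalid interaction_type {interaction_type!r}; "
            f"expected one of: {', '.join(allowed)}."
        )
    return interaction_type
```

For example, check_interaction_type('compound_gene', GENE_INTERACTION_TYPES) raises ValueError, because get_gene_interactions expects ‘gene_compound’.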

KEGG

KEGG is a database resource for understanding high-level functions and utilities of biological systems from genomic and molecular-level information.

pyBiodatafuse.annotators.kegg.get_pathways(bridgedb_df: DataFrame)

Annotate genes with KEGG pathway information.

Parameters: bridgedb_df – input DataFrame.
Returns: a DataFrame including the KEGG pathways, as well as the metadata.

pyBiodatafuse.annotators.kegg.get_compounds(kegg_df: DataFrame)

Get compound names for KEGG compounds in the dataframe.

Parameters: kegg_df – BridgeDb DataFrame.
Returns: Updated DataFrame with KEGG compounds and their names.

PubChem

PubChem is a database of chemical molecules and their activities against biological assays.

pyBiodatafuse.annotators.pubchem.get_protein_compound_screened(bridgedb_df: DataFrame) → Tuple[DataFrame, dict]

Query PubChem for molecules screened on proteins as targets.

Parameters: bridgedb_df – BridgeDb output for creating the list of gene ids to query.
Returns: a DataFrame containing the PubChem output and dictionary of the PubChem metadata.

CompoundWiki

CompoundWiki is a crowd-sourced database that provides comprehensive information about chemical compounds.

pyBiodatafuse.annotators.compoundwiki.get_compound_annotations(combined_df: DataFrame, kegg_compound_df: DataFrame | None = None) → Tuple[DataFrame, dict]

Annotate compounds in the input DataFrame using CompoundWiki data.

Parameters:

combined_df – Main DataFrame containing compound identifiers and interaction columns
kegg_compound_df – Optional DataFrame for KEGG annotations (if available)

Returns:

Tuple of (annotated DataFrame, metadata dictionary for provenance)

gProfiler

gProfiler is a web server for functional enrichment analysis and conversions of gene lists.

pyBiodatafuse.annotators.gprofiler.get_gene_enrichment(bridgedb_df: DataFrame, species: str = 'hsapiens', padj_colname: str = 'padj', padj_filter: float = 0.05) → tuple[DataFrame, dict]

Run enrichment analysis using g:Profiler and retrieve version info.

Parameters:

bridgedb_df – DataFrame containing gene data with columns “identifier” and a significance column.
species – species for both version retrieval and g:Profiler query (default is “hsapiens”).
padj_colname – Name of the column used to filter significant genes (default is “padj”).
padj_filter – Significance threshold to filter genes (default is 0.05).

Returns:

A tuple containing:

- Processed DataFrame from g:Profiler analysis.
- Dictionary with g:Profiler version information.
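
To preview which genes would pass the significance filter before running the enrichment, the padj_colname and padj_filter parameters can be mimicked as sketched below. Plain dictionaries stand in for DataFrame rows so that the sketch runs without pandas. The helper name is hypothetical, and the comparison (values at or below the threshold pass) is an assumption; the library may use a strict comparison instead.

```python
from typing import Dict, Iterable, List


def significant_identifiers(
    rows: Iterable[Dict[str, object]],
    padj_colname: str = "padj",
    padj_filter: float = 0.05,
) -> List[str]:
    """Return identifiers whose adjusted p-value is at or below the threshold.

    Assumption: a row passes when value <= padj_filter. Rows with a missing
    or NaN value are skipped.
    """
    selected = []
    for row in rows:
        value = row.get(padj_colname)
        # NaN compares False against any number, so it is skipped as well.
        if value is not None and float(value) <= padj_filter:
            selected.append(str(row["identifier"]))
    return selected
```

With pandas, the rows could be obtained from a DataFrame via bridgedb_df.to_dict("records").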

TFLink

TFLink is a comprehensive database of transcription factor-target interactions.

pyBiodatafuse.annotators.tflink.get_tf_target(tf_file: str, bridgedb_df: DataFrame, filter_deg: bool, padj_filter: float | None = 0.01, padj_colname: str | None = None) → Tuple[DataFrame, dict]

Add transcription factors (TFs) and their targets from TFLink.

Parameters:

tf_file – Path of the TF-Target dataset. See the links in the docstring for downloading.
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
filter_deg – Whether to filter the data based on differential expression analysis (DEA) output. If True, make sure that padj_colname and padj_filter are set to the intended column and threshold.
padj_filter – The adjusted p-value threshold for filtering DEGs (default is 0.01).
padj_colname – The name of the column containing adjusted p-values (default is None).

Returns:

A TFLink DataFrame and dictionary of the TFLink metadata.

Raises:

ValueError – If the specified column for filtering DEGs is not found in the DataFrame.
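
Since padj_colname defaults to None while filter_deg has no default, it is easy to request filtering without naming a column. A pre-flight check along the following lines can catch that early. The helper is hypothetical; it mirrors the documented ValueError but is not the library's own validation, and the range check on padj_filter is an added assumption.

```python
from typing import Iterable, Optional


def check_deg_filter_args(
    columns: Iterable[str],
    filter_deg: bool,
    padj_colname: Optional[str] = None,
    padj_filter: Optional[float] = 0.01,
) -> None:
    """Fail early if DEG filtering is requested without usable settings."""
    if not filter_deg:
        return  # nothing to validate when filtering is off
    if padj_colname is None or padj_colname not in set(columns):
        raise ValueError(f"Column {padj_colname!r} not found; cannot filter DEGs.")
    # Assumption: an adjusted p-value threshold should lie in (0, 1].
    if padj_filter is None or not 0 < padj_filter <= 1:
        raise ValueError("padj_filter must be a number in the interval (0, 1].")
```

With a pandas DataFrame, the columns argument would be bridgedb_df.columns.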

MitoCarta

MitoCarta is an inventory of mammalian mitochondrial genes.

pyBiodatafuse.annotators.mitocarta.get_gene_mito_pathways(bridgedb_df: DataFrame, mitocarta_file: str, filename: str, species: str = 'hsapiens', sheet_name: str = 'A Human MitoCarta3.0') → Tuple[DataFrame, dict]

Get gene and mitochondrial pathways from MitoCarta.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query.
mitocarta_file – Name of the remote MitoCarta file to download.
filename – The local file path to save the downloaded dataset.
species – Species for which to process the data; defaults to “hsapiens”.
sheet_name – Excel sheet name to read from the file; defaults to “A Human MitoCarta3.0”.

Returns:

A tuple containing the processed DataFrame and a metadata dictionary.