Data sources
This page lists the data sources that you can query with the pyBiodatafuse package.
Bgee
Bgee <https://www.bgee.org/> is a database for retrieval and comparison of gene expression patterns across multiple animal species.
- pyBiodatafuse.annotators.bgee.get_gene_expression(bridgedb_df: DataFrame)
Query gene-tissue expression information from Bgee with SPARQL.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
- Returns:
a DataFrame containing the Bgee output and dictionary of the Bgee metadata.
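Most annotators on this page share one calling pattern: they take the BridgeDb output and return the annotated DataFrame together with a metadata dictionary. The sketch below illustrates that pattern. `run_annotator` and `annotate_with_bgee` are hypothetical helpers, not part of pyBiodatafuse, and the tuple return of `get_gene_expression` is an assumption based on the description above and on the signatures of the other annotators.

```python
from typing import Any, Callable, Dict, Tuple


def run_annotator(
    annotator: Callable[..., Tuple[Any, Dict[str, Any]]],
    bridgedb_df: Any,
    **kwargs: Any,
) -> Tuple[Any, Dict[str, Any]]:
    """Call an annotator and check that it returned (data, metadata).

    Hypothetical helper for illustration; not part of pyBiodatafuse.
    """
    result = annotator(bridgedb_df=bridgedb_df, **kwargs)
    if not (isinstance(result, tuple) and len(result) == 2):
        raise TypeError("expected a (DataFrame, metadata) tuple")
    data, metadata = result
    if not isinstance(metadata, dict):
        raise TypeError("expected the metadata to be a dictionary")
    return data, metadata


def annotate_with_bgee(bridgedb_df: Any) -> Tuple[Any, Dict[str, Any]]:
    """Query Bgee for the genes in bridgedb_df.

    The import is deferred so this sketch loads without the package installed.
    """
    from pyBiodatafuse.annotators import bgee

    return run_annotator(bgee.get_gene_expression, bridgedb_df)
```

Keeping the metadata dictionary from each call makes it possible to record which data source and version produced each annotation.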
DisGeNET
DisGeNET <https://www.disgenet.org/home/> is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases.
- pyBiodatafuse.annotators.disgenet.get_gene_disease(api_key: str, bridgedb_df: DataFrame) -> Tuple[DataFrame, dict]
Query gene-disease associations from DisGeNET.
- Parameters:
api_key – DisGeNET API key (more details can be found at https://disgenet.com/plans)
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
- Returns:
a DataFrame containing the DisGeNET output and dictionary of the DisGeNET metadata.
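Because `get_gene_disease` needs an API key, it is usually safer to read the key from the environment than to hard-code it. The following is a minimal sketch under stated assumptions: the variable name `DISGENET_API_KEY` and both helper functions are made up for this example and are not defined by pyBiodatafuse.

```python
import os


def read_api_key(env_var: str = "DISGENET_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing.

    The environment variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key


def annotate_with_disgenet(bridgedb_df):
    """Query DisGeNET for the genes in bridgedb_df.

    The import is deferred so this sketch loads without the package installed.
    """
    from pyBiodatafuse.annotators import disgenet

    return disgenet.get_gene_disease(
        api_key=read_api_key(),
        bridgedb_df=bridgedb_df,
    )
```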
MolMeDB
MolMeDB <https://molmedb.upol.cz/detail/intro> is an open chemistry database about interactions of molecules with membranes.
- pyBiodatafuse.annotators.molmedb.get_gene_compound_inhibitor(bridgedb_df: DataFrame) -> Tuple[DataFrame, dict]
Query MolMeDB for inhibitors of transporters encoded by genes in input.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
- Returns:
a DataFrame containing the MolMeDB output and dictionary of the MolMeDB metadata.
- pyBiodatafuse.annotators.molmedb.get_compound_gene_inhibitor(bridgedb_df: DataFrame) -> Tuple[DataFrame, dict]
Query MolMeDB for transporters inhibited by molecule.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
- Returns:
a DataFrame containing the MolMeDB output and dictionary of the MolMeDB metadata.
OpenTargets
The OpenTargets database uses human genetics and genomics data for systematic drug target identification and prioritisation.
- pyBiodatafuse.annotators.opentargets.get_gene_go_process(bridgedb_df: DataFrame) -> Tuple[DataFrame, dict]
Get information about GO pathways associated with genes of interest.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
- Returns:
a DataFrame containing the OpenTargets output and dictionary of the query metadata.
- pyBiodatafuse.annotators.opentargets.get_gene_reactome_pathways(bridgedb_df: DataFrame) -> Tuple[DataFrame, dict]
Get information about Reactome pathways associated with a gene.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
- Returns:
a DataFrame containing the OpenTargets output and dictionary of the query metadata.
- pyBiodatafuse.annotators.opentargets.get_gene_compound_interactions(bridgedb_df: DataFrame, cache_pubchem_cid: bool = True) -> Tuple[DataFrame, dict]
Get information about drugs associated with genes of interest.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
cache_pubchem_cid – whether to cache the PubChem CID for the ChEMBL ID
- Returns:
a DataFrame containing the OpenTargets output and dictionary of the query metadata.
- pyBiodatafuse.annotators.opentargets.get_disease_compound_interactions(bridgedb_df: DataFrame, cache_pubchem_cid: bool = False) -> Tuple[DataFrame, dict]
Get information about drugs associated with diseases of interest.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
cache_pubchem_cid – If True, the PubChem CID will be cached for future use.
- Returns:
a DataFrame containing the OpenTargets output and dictionary of the query metadata.
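The `cache_pubchem_cid` flag controls whether the PubChem CID found for a ChEMBL ID is kept for later reuse. How the library stores its cache is not described here, so the sketch below only illustrates the general idea with an in-memory dictionary. `CidCache` and the `lookup` callable are hypothetical and do not reflect the package's implementation.

```python
from typing import Callable, Dict, Optional


class CidCache:
    """In-memory cache for ChEMBL ID -> PubChem CID lookups (illustrative only)."""

    def __init__(
        self,
        lookup: Callable[[str], Optional[str]],
        enabled: bool = True,
    ) -> None:
        self._lookup = lookup  # hypothetical function performing the real lookup
        self._enabled = enabled
        self._store: Dict[str, Optional[str]] = {}

    def get(self, chembl_id: str) -> Optional[str]:
        """Return the CID for chembl_id, reusing a stored value when caching is on."""
        if self._enabled and chembl_id in self._store:
            return self._store[chembl_id]
        cid = self._lookup(chembl_id)
        if self._enabled:
            self._store[chembl_id] = cid
        return cid
```

With caching enabled, repeated requests for the same ChEMBL ID trigger only one lookup, which matters when the same compound appears for many genes.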
StringDB
StringDB <https://string-db.org/> aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions.
- pyBiodatafuse.annotators.stringdb.get_ppi(bridgedb_df: DataFrame, species: str = 'human') -> Tuple[DataFrame, Dict[str, Any]]
Annotate genes with protein-protein interactions from STRING-DB.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
species – The species to query. (Try ‘Homo sapiens’ if ‘human’ is not working.)
- Returns:
a tuple (DataFrame containing the StringDB output, metadata dictionary)
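The note on `species` suggests retrying with ‘Homo sapiens’ when ‘human’ does not work. A minimal sketch of such a retry is shown below. `get_ppi_with_fallback` is a hypothetical helper, and it assumes that "not working" shows up either as an exception or as an empty DataFrame; the actual failure mode may differ.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


def get_ppi_with_fallback(
    get_ppi: Callable[..., Tuple[Any, Dict[str, Any]]],
    bridgedb_df: Any,
    species_names: Sequence[str] = ("human", "Homo sapiens"),
) -> Tuple[Any, Dict[str, Any]]:
    """Try each species name in turn and return the first non-empty result."""
    last_error = None
    for species in species_names:
        try:
            ppi_df, metadata = get_ppi(bridgedb_df=bridgedb_df, species=species)
        except Exception as error:  # assumption: failure surfaces as an exception
            last_error = error
            continue
        # Assumption: an empty DataFrame also counts as "not working".
        if not getattr(ppi_df, "empty", False):
            return ppi_df, metadata
    raise RuntimeError(
        f"No result for any of {list(species_names)}"
    ) from last_error
```

It would be called as `get_ppi_with_fallback(stringdb.get_ppi, bridgedb_df)` after importing `stringdb` from `pyBiodatafuse.annotators`.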
Wikidata
Wikidata <https://www.wikidata.org/> acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
- pyBiodatafuse.annotators.wikidata.get_gene_cellular_component(bridgedb_df: DataFrame)
Get cellular component information and Wikidata identifiers for a gene’s encoded protein.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
- Returns:
a DataFrame containing the Wikidata output and dictionary of the query metadata.
WikiPathways
WikiPathways <https://www.wikipathways.org/> is an open science platform for biological pathways contributed, updated, and used by the research community.
- pyBiodatafuse.annotators.wikipathways.get_gene_wikipathways(bridgedb_df: DataFrame, query_interactions: bool = False, organism: str = 'Homo sapiens') -> DataFrame
Query WikiPathways for pathways associated with genes.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
query_interactions – Set whether to retrieve gene part_of pathways relationships (False) or all molecular interactions (True).
organism – The organism to query. Default is “Homo sapiens”.
- Returns:
a DataFrame containing the WikiPathways output and dictionary of the WikiPathways metadata.
MINERVA
MINERVA is a standalone webserver for visual exploration, analysis and management of molecular networks encoded in systems biology formats.
- pyBiodatafuse.annotators.minerva.get_minerva_components(map_name: str, get_elements: bool | None = True, get_reactions: bool | None = True) -> Tuple[str, dict]
Get information about MINERVA components from a specific project.
- Parameters:
map_name – MINERVA map name. The extensive list can be found at https://minerva-net.lcsb.uni.lu/table.html.
get_elements – boolean to get elements of the chosen diagram
get_reactions – boolean to get reactions of the chosen diagram
- Returns:
a tuple of the map endpoint and a dictionary containing:
- ‘map_elements’: a list for each of the pathways in the model. Each list provides information about the Compartment, Complex, Drug, Gene, Ion, Phenotype, Protein, RNA and Simple molecule elements involved in that pathway.
- ‘map_reactions’: a list for each of the pathways in the model. Each list provides information about the reactions involved in that pathway.
- ‘models’: a list containing pathway-specific information for each of the pathways in the model.
- Raises:
ValueError – if the provided map_name is not valid.
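The returned dictionary can be inspected by key. The sketch below counts its contents and shows one way to handle an invalid `map_name`. Both functions are hypothetical helpers, and the counting assumes that ‘map_elements’ and ‘map_reactions’ are lists holding one list per pathway, as described above.

```python
from typing import Any, Dict, Optional, Tuple


def summarize_minerva(components: Dict[str, Any]) -> Dict[str, int]:
    """Count models, elements and reactions in the returned dictionary."""
    elements = components.get("map_elements", [])
    reactions = components.get("map_reactions", [])
    return {
        "models": len(components.get("models", [])),
        "elements": sum(len(per_pathway) for per_pathway in elements),
        "reactions": sum(len(per_pathway) for per_pathway in reactions),
    }


def load_map(map_name: str) -> Optional[Tuple[str, dict]]:
    """Fetch a map, returning None when the name is rejected with ValueError.

    The import is deferred so this sketch loads without the package installed.
    """
    from pyBiodatafuse.annotators import minerva

    try:
        return minerva.get_minerva_components(map_name=map_name)
    except ValueError:
        return None
```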
- pyBiodatafuse.annotators.minerva.get_gene_pathways(bridgedb_df: DataFrame, map_name: str, get_elements: bool | None = True, get_reactions: bool | None = True) -> Tuple[DataFrame, dict]
Get information about MINERVA pathways associated with a gene.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
map_name – name of the map you want to retrieve the information from. The extensive list can be found at https://minerva-net.lcsb.uni.lu/table.html.
get_elements – boolean to get elements of the chosen diagram.
get_reactions – boolean to get reactions of the chosen diagram.
- Returns:
a tuple containing MINERVA outputs and dictionary of the MINERVA metadata.
AOP-Wiki
AOP-Wiki is a database for Adverse Outcome Pathways (AOPs), which describe mechanistic information about the linkage between a molecular initiating event and an adverse outcome.
- pyBiodatafuse.annotators.aopwiki.get_aops_gene(bridgedb_df: DataFrame, pathway: bool = False) -> Tuple[DataFrame, dict]
Query for AOPs associated with genes from AOP Wiki RDF.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query
pathway – if True, retrieve full pathway information including upstream/downstream key events. If False (default), retrieve simplified AOP information.
- Returns:
a DataFrame containing the AOP Wiki RDF output and dictionary of the AOP Wiki RDF metadata
- pyBiodatafuse.annotators.aopwiki.get_aops_compound(bridgedb_df: DataFrame, pathway: bool = False) -> Tuple[DataFrame, dict]
Query for AOPs associated with compounds from AOP Wiki RDF.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of compound ids to query
pathway – if True, retrieve full pathway information including upstream/downstream key events. If False (default), retrieve simplified AOP information.
- Returns:
a DataFrame containing the AOP Wiki RDF output and dictionary of the AOP Wiki RDF metadata
- pyBiodatafuse.annotators.aopwiki.get_aops(bridgedb_df: DataFrame, pathway: bool = False) -> Tuple[DataFrame, dict]
Query for AOPs associated with genes or compounds.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene/compound ids to query
pathway – if True, retrieve full pathway information including upstream/downstream key events. If False (default), retrieve simplified AOP information.
- Raises:
ValueError – if the input identifiers are not recognized or if they are not admitted gene or compound identifiers
- Returns:
a DataFrame containing the AOP Wiki RDF output and dictionary of the AOP Wiki RDF metadata
IntAct
IntAct provides a freely available, open source database system and analysis tools for molecular interaction data.
- pyBiodatafuse.annotators.intact.get_gene_interactions(bridgedb_df: DataFrame, interaction_type: str = 'both')
Annotate genes with interaction data from IntAct.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
interaction_type – Either ‘gene_gene’, ‘gene_compound’ or ‘both’. If the input is ‘both’, ‘gene_gene’ and ‘gene_compound’ will be queried.
- Raises:
ValueError – If an invalid interaction_type is provided.
- Returns:
a tuple (DataFrame containing the IntAct output, metadata dictionary)
- pyBiodatafuse.annotators.intact.get_compound_interactions(bridgedb_df: DataFrame, interaction_type: str = 'both')
Annotate compounds with interaction data from IntAct.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of compound ids to query.
interaction_type – Either ‘compound_compound’, ‘compound_gene’ or ‘both’. If the input is ‘both’, ‘compound_compound’ and ‘compound_gene’ will be queried.
- Raises:
ValueError – If an invalid interaction_type is provided.
- Returns:
a tuple (DataFrame containing the IntAct output, metadata dictionary)
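Both IntAct functions raise ValueError for an unsupported `interaction_type`. Checking the value before a long-running query can surface typos early. The sketch below uses the values listed above; the constants and `check_interaction_type` are hypothetical and only mirror the documented behaviour.

```python
from typing import Tuple

# Values accepted by get_gene_interactions, as listed in its parameters.
GENE_INTERACTION_TYPES: Tuple[str, ...] = ("gene_gene", "gene_compound", "both")
# Values accepted by get_compound_interactions, as listed in its parameters.
COMPOUND_INTERACTION_TYPES: Tuple[str, ...] = (
    "compound_compound",
    "compound_gene",
    "both",
)


def check_interaction_type(value: str, allowed: Tuple[str, ...]) -> str:
    """Return value unchanged if it is allowed, otherwise raise ValueError."""
    if value not in allowed:
        raise ValueError(
            f"interaction_type must be one of {allowed}, got {value!r}"
        )
    return value
```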
KEGG
KEGG is a database resource for understanding high-level functions and utilities of biological systems from genomic and molecular-level information.
PubChem
PubChem is a database of chemical molecules and their activities against biological assays.
- pyBiodatafuse.annotators.pubchem.get_protein_compound_screened(bridgedb_df: DataFrame) -> Tuple[DataFrame, dict]
Query PubChem for molecules screened on proteins as targets.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
- Returns:
a DataFrame containing the PubChem output and dictionary of the PubChem metadata.
CompoundWiki
CompoundWiki is a crowd-sourced database that provides comprehensive information about chemical compounds.
- pyBiodatafuse.annotators.compoundwiki.get_compound_annotations(combined_df: DataFrame, kegg_compound_df: DataFrame | None = None) -> Tuple[DataFrame, dict]
Annotate compounds in the input DataFrame using CompoundWiki data.
- Parameters:
combined_df – Main DataFrame containing compound identifiers and interaction columns
kegg_compound_df – Optional DataFrame for KEGG annotations (if available)
- Returns:
Tuple of (annotated DataFrame, metadata dictionary for provenance)
gProfiler
gProfiler is a web server for functional enrichment analysis and conversions of gene lists.
- pyBiodatafuse.annotators.gprofiler.get_gene_enrichment(bridgedb_df: DataFrame, species: str = 'hsapiens', padj_colname: str = 'padj', padj_filter: float = 0.05) -> tuple[DataFrame, dict]
Run an enrichment analysis using g:Profiler and retrieve version information.
- Parameters:
bridgedb_df – DataFrame containing gene data with columns “identifier” and a significance column.
species – species for both version retrieval and g:Profiler query (default is “hsapiens”).
padj_colname – Name of the column used to filter significant genes (default is “padj”).
padj_filter – Significance threshold to filter genes (default is 0.05).
- Returns:
A tuple containing:
- the processed DataFrame from the g:Profiler analysis
- a dictionary with g:Profiler version information
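The `padj_colname` and `padj_filter` parameters select the significant genes that are sent for enrichment. The sketch below shows the idea on plain dictionaries that stand in for DataFrame rows, so it runs without pandas. `significant_identifiers` is a hypothetical helper, and the `<=` comparison is an assumption; the library may treat the threshold differently.

```python
from typing import Any, Dict, Iterable, List


def significant_identifiers(
    rows: Iterable[Dict[str, Any]],
    padj_colname: str = "padj",
    padj_filter: float = 0.05,
) -> List[str]:
    """Return the identifiers whose adjusted p-value passes the threshold."""
    selected = []
    for row in rows:
        padj = row.get(padj_colname)
        # Rows without a value are skipped; "<=" is an assumption of this sketch.
        if padj is not None and padj <= padj_filter:
            selected.append(row["identifier"])
    return selected
```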
TFLink
TFLink is a comprehensive database of transcription factor-target interactions.
- pyBiodatafuse.annotators.tflink.get_tf_target(tf_file: str, bridgedb_df: DataFrame, filter_deg: bool, padj_filter: float | None = 0.01, padj_colname: str | None = None) -> Tuple[DataFrame, dict]
Add transcription factors (TFs) and their targets from TFLink.
- Parameters:
tf_file – Path of the TF-Target dataset. See the links in the docstring for downloading.
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
filter_deg – Whether to filter the data based on differential expression analysis (DEA) output. If True, make sure that padj_colname and padj_filter are set to the intended column and threshold.
padj_filter – The adjusted p-value threshold for filtering DEGs (default is 0.01).
padj_colname – The name of the column containing adjusted p-values (default is None).
- Returns:
A TFLink DataFrame and dictionary of the TFLink metadata.
- Raises:
ValueError – If the specified column for filtering DEGs is not found in the DataFrame.
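When `filter_deg` is True, the column named by `padj_colname` has to exist in the input, otherwise a ValueError is raised. A minimal sketch of a pre-flight check is given below. `check_deg_filter` is a hypothetical helper that mirrors the documented behaviour; it is not the library's own validation.

```python
from typing import Iterable, Optional, Tuple


def check_deg_filter(
    columns: Iterable[str],
    filter_deg: bool,
    padj_colname: Optional[str] = None,
    padj_filter: Optional[float] = 0.01,
) -> Optional[Tuple[str, float]]:
    """Validate DEG filter settings against the available column names.

    Returns None when no filtering is requested, otherwise the
    (column name, threshold) pair that would be used.
    """
    if not filter_deg:
        return None
    if padj_colname is None or padj_colname not in set(columns):
        raise ValueError(f"Column {padj_colname!r} not found in the DataFrame")
    if padj_filter is None:
        raise ValueError("padj_filter must be set when filter_deg is True")
    return padj_colname, padj_filter
```

With a pandas DataFrame, `columns` would be `bridgedb_df.columns`.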
MitoCarta
MitoCarta is an inventory of mammalian mitochondrial genes.
- pyBiodatafuse.annotators.mitocarta.get_gene_mito_pathways(bridgedb_df: DataFrame, mitocarta_file: str, filename: str, species: str = 'hsapiens', sheet_name: str = 'A Human MitoCarta3.0') -> Tuple[DataFrame, dict]
Get gene and mitochondrial pathways from MitoCarta.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
mitocarta_file – Name of the remote MitoCarta file to download.
filename – The local file path to save the downloaded dataset.
species – Species for which to process the data; defaults to “hsapiens”.
sheet_name – Excel sheet name to read from the file; defaults to “A Human MitoCarta3.0”.
- Returns:
A tuple containing the processed DataFrame and a metadata dictionary.