Core Utilities

Here you can find the core utility functions for data loading, ID mapping, and annotation orchestration.

Data Loaders

pyBiodatafuse.data_loader.create_df_from_file(file_path: str) → DataFrame[source]

Create a DataFrame from a file containing a list of identifiers.

Parameters: file_path – path to the file containing the list of identifiers
Returns: a DataFrame containing the list of identifiers

pyBiodatafuse.data_loader.create_df_from_text(text_input: str) → DataFrame[source]

Create a DataFrame from a text containing a list of identifiers.

Parameters: text_input – text containing the list of identifiers, with each identifier on a new line
Returns: a DataFrame containing the list of identifiers
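
As a rough illustration of the expected input shape, the sketch below splits newline-separated text into one record per identifier using only the standard library. The helper name `parse_identifier_text` is hypothetical and not part of pyBiodatafuse, and dropping blank lines and surrounding whitespace is an assumption. The `identifier` key mirrors the column name that `bridgedb_xref` expects.

```python
from typing import Dict, List


def parse_identifier_text(text_input: str) -> List[Dict[str, str]]:
    """Split newline-separated text into one record per identifier.

    Hypothetical helper for illustration only. Blank lines and
    surrounding whitespace are dropped (an assumption of this sketch).
    """
    return [
        {"identifier": line.strip()}
        for line in text_input.splitlines()
        if line.strip()
    ]


records = parse_identifier_text("TP53\nBRCA1\n\n  EGFR  \n")
print(records)
```

Each record corresponds to one row of the single-column table that the loader functions are described as returning.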

pyBiodatafuse.data_loader.create_df_from_dea(file_path: str) → DataFrame[source]

Read a dataframe containing the result of the differential expression analysis (DEA).

Parameters: file_path – path to the file containing the result of DEA
Returns: the DEA dataframe with the proper column names
Raises: ValueError – if the file is not valid

pyBiodatafuse.data_loader.filter_dea(data: DataFrame, column_name: str, min_value: float | None = None, max_value: float | None = None, abs_value: float | None = None) → DataFrame[source]

Filter the differential expression analysis (DEA) table.

Parameters:

data – DEA dataframe
column_name – the column to filter
min_value – the minimum value
max_value – the maximum value
abs_value – the absolute value (when filtering for LogFoldChange)

Returns:

the filtered DEA dataframe

Raises:

ValueError – if the parameters are invalid
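
The filtering parameters can be pictured with a minimal sketch over plain dictionaries. This is not the library's implementation: the helper name `filter_rows`, the column name `log2FC`, the exact conditions that raise `ValueError`, and the reading of `abs_value` as "keep rows whose magnitude is at least this value" are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def filter_rows(
    rows: List[Dict[str, Any]],
    column_name: str,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
    abs_value: Optional[float] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for DEA filtering, using a list of dicts."""
    # Assumed validation rules; the real function may check different things.
    if min_value is None and max_value is None and abs_value is None:
        raise ValueError("Provide at least one of min_value, max_value, abs_value.")
    if min_value is not None and max_value is not None and min_value > max_value:
        raise ValueError("min_value must not exceed max_value.")

    kept = []
    for row in rows:
        value = row[column_name]
        if min_value is not None and value < min_value:
            continue
        if max_value is not None and value > max_value:
            continue
        # Assumption: abs_value keeps rows whose magnitude is at least abs_value.
        if abs_value is not None and abs(value) < abs_value:
            continue
        kept.append(row)
    return kept


dea = [
    {"gene": "A", "log2FC": 2.5},
    {"gene": "B", "log2FC": -1.8},
    {"gene": "C", "log2FC": 0.3},
]
strong = filter_rows(dea, "log2FC", abs_value=1.0)
print([row["gene"] for row in strong])
```

Under these assumptions, an absolute threshold keeps both strongly up- and down-regulated genes, which is why it is the natural choice for a log fold change column.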

ID Mapping

pyBiodatafuse.id_mapper.read_datasource_file() → DataFrame[source]

Read the datasource file.

Returns: a DataFrame containing the data from the datasource file

pyBiodatafuse.id_mapper.match_input_datasource(identifiers) → str[source]

Check if the input identifiers match the datasource.

This function attempts to match the provided identifiers against known patterns in the datasource file and returns the corresponding data source.

Parameters: identifiers – a pandas DataFrame containing the identifiers to be matched
Returns: the matching data source
Raises: ValueError – if the identifiers series is empty, no match is found, or multiple matches are found
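
The matching behaviour described above can be sketched with regular expressions. The patterns below are hypothetical examples rather than the contents of the real datasource file, the helper name `guess_datasource` is invented for this sketch, and requiring every identifier to match a pattern is an assumption.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration; the real ones come from the datasource file.
EXAMPLE_PATTERNS: Dict[str, str] = {
    "Ensembl": r"ENSG\d{11}",
    "HGNC Accession Number": r"HGNC:\d+",
    "ChEBI": r"CHEBI:\d+",
}


def guess_datasource(identifiers: List[str], patterns: Dict[str, str]) -> str:
    """Return the single datasource whose pattern matches every identifier."""
    if not identifiers:
        raise ValueError("The identifier list is empty.")
    matches = [
        name
        for name, pattern in patterns.items()
        if all(re.fullmatch(pattern, identifier) for identifier in identifiers)
    ]
    if not matches:
        raise ValueError("No datasource pattern matches all identifiers.")
    if len(matches) > 1:
        raise ValueError(f"Multiple datasources match: {matches}")
    return matches[0]


source = guess_datasource(["ENSG00000141510", "ENSG00000012048"], EXAMPLE_PATTERNS)
print(source)
```

The three error branches correspond to the three `ValueError` conditions listed for `match_input_datasource`.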

pyBiodatafuse.id_mapper.get_version_webservice_bridgedb() → dict[source]

Get version of BridgeDb web service.

Returns: a dictionary containing the version information
Raises: ValueError – if failed to retrieve data

pyBiodatafuse.id_mapper.get_version_datasource_bridgedb(input_species: str | None = None) → List[str][source]

Get version of BridgeDb datasource.

Parameters: input_species – the species to query; currently only human is supported
Returns: a list containing the version information
Raises: ValueError – if failed to retrieve data

pyBiodatafuse.id_mapper.bridgedb_xref(identifiers: DataFrame, input_species: str | None = None, output_datasource: list | None = None, input_datasource: Literal['Ensembl', 'NCBI Gene', 'HGNC', 'HGNC Accession Number', 'MGI', 'miRBase mature sequence', 'miRBase Sequence', 'OMIM', 'RefSeq', 'Rfam', 'RGD', 'SGD', 'UCSC Genome Browser', 'NCBI Protein', 'PDB', 'Pfam', 'Uniprot-TrEMBL', 'Uniprot-SwissProt', 'Affy', 'Agilent', 'Illumina', 'Gene Ontology', 'CAS', 'ChEBI', 'ChemSpider', 'ChEMBL compound', 'DrugBank', 'HMDB', 'Guide to Pharmacology Ligand ID', 'InChIKey', 'KEGG Compound', 'KEGG Drug', 'KEGG Glycan', 'LIPID MAPS', 'LipidBank', 'PharmGKB Drug', 'PubChem Compound', 'PubChem Substance', 'SwissLipids', 'TTD Drug', 'Wikidata', 'Wikipedia'] = 'HGNC') → Tuple[DataFrame, dict][source]

Map input identifiers using BridgeDb.

Parameters:

identifiers – A pandas DataFrame with one column named ‘identifier’.
input_species – Optional species name. Only ‘Homo sapiens’ is currently supported.
input_datasource – The type of identifier in the input DataFrame. Expected formats by datasource:

- “HGNC”: e.g. “TP53”
- “HGNC Accession Number”: e.g. “HGNC:11998”
- “Ensembl”: e.g. “ENSG00000141510”
- “NCBI Gene”: e.g. “7157”
- “MGI”: e.g. “MGI:104874”
- “miRBase mature sequence”: e.g. “hsa-miR-21-5p”
- “miRBase Sequence”: e.g. “MI0000077”
- “OMIM”: e.g. “191170”
- “RefSeq”: e.g. “NM_000546”
- “Rfam”: e.g. “RF00001”
- “RGD”: e.g. “RGD:620474”
- “SGD”: e.g. “YAL001C”
- “UCSC Genome Browser”: e.g. “uc001aaa.3”
- “NCBI Protein”: e.g. “NP_000537”
- “PDB”: e.g. “1TUP”
- “Pfam”: e.g. “PF00069”
- “Uniprot-SwissProt”: e.g. “P04637”
- “Uniprot-TrEMBL”: e.g. “Q9H0H5”
- “Affy”: e.g. “202763_at”
- “Agilent”: e.g. “A_23_P61180”
- “Illumina”: e.g. “ILMN_1803030”
- “Gene Ontology”: e.g. “GO:0006915”
- “CAS”: e.g. “50-00-0”
- “ChEBI”: e.g. “CHEBI:15377”
- “ChemSpider”: e.g. “5798”
- “ChEMBL compound”: e.g. “CHEMBL25”
- “DrugBank”: e.g. “DB01050”
- “HMDB”: e.g. “HMDB0000122”
- “Guide to Pharmacology Ligand ID”: e.g. “1234”
- “InChIKey”: e.g. “BSYNRYMUTXBXSQ-UHFFFAOYSA-N”
- “KEGG Compound”: e.g. “C00031”
- “KEGG Drug”: e.g. “D00001”
- “KEGG Glycan”: e.g. “G00001”
- “LIPID MAPS”: e.g. “LMFA01010001”
- “LipidBank”: e.g. “LBID0001”
- “PharmGKB Drug”: e.g. “PA449053”
- “PubChem Compound”: e.g. “2244”
- “PubChem Substance”: e.g. “12345678”
- “SwissLipids”: e.g. “SLM:000000001”
- “TTD Drug”: e.g. “D000001”
- “Wikidata”: e.g. “Q18216”
- “Wikipedia”: e.g. “Aspirin”

output_datasource – Optional list of identifier types to map to.

Returns:

A tuple of:

- DataFrame with mapped identifiers.
- Dictionary of data resource metadata.

Raises:

ValueError – If required inputs are missing or the mapping fails.

pyBiodatafuse.id_mapper.check_smiles(smile: str | None) → str | None[source]

Canonicalize the SMILES string of a compound.

Parameters: smile – SMILES string
Returns: canonicalized SMILES string

pyBiodatafuse.id_mapper.get_cid_from_data(idx: str | None, idx_type: str) → str | None[source]

Get PubChem ID from any query using PubChempy.

Parameters:

idx – identifier to query
idx_type – type of identifier to query. Possible values include: smiles, inchikey, inchi, name

Returns:

PubChem ID

pyBiodatafuse.id_mapper.get_cid_from_pugrest(idx: str | None, idx_type: str) → str | None[source]

Get PubChem ID from any query through PubChem PUG REST.

Parameters:

idx – identifier to query
idx_type – type of identifier to query. Possible values include: smiles, inchikey, inchi, name

Returns:

PubChem ID
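
For orientation, a PUG REST lookup of this kind can be sketched as a URL builder plus a response parser. The helper names `build_cid_url` and `first_cid` are hypothetical, no network request is made, and whether pyBiodatafuse builds its request this way is an assumption. Identifiers that contain `/`, such as InChI strings, may need to be sent differently than shown here.

```python
import json
from typing import Optional
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_cid_url(idx: str, idx_type: str) -> str:
    """Build a PUG REST URL that asks for the CIDs of one compound."""
    allowed = {"smiles", "inchikey", "inchi", "name"}
    if idx_type not in allowed:
        raise ValueError(f"idx_type must be one of {sorted(allowed)}")
    # Percent-encode the identifier so spaces and symbols are URL-safe.
    return f"{PUG_REST_BASE}/compound/{idx_type}/{quote(idx, safe='')}/cids/JSON"


def first_cid(response_text: str) -> Optional[str]:
    """Extract the first CID from a PUG REST JSON response, if any."""
    payload = json.loads(response_text)
    cids = payload.get("IdentifierList", {}).get("CID", [])
    return str(cids[0]) if cids else None


url = build_cid_url("aspirin", "name")
print(url)

# Sample response body in the shape PUG REST returns for a CID lookup.
print(first_cid('{"IdentifierList": {"CID": [2244]}}'))
```

Returning `None` when no CID is present matches the `str | None` return type in the signature above.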

pyBiodatafuse.id_mapper.pubchem_xref(identifiers: list, identifier_type: str = 'name', cache_res: bool = False) → Tuple[DataFrame, dict][source]

Map chemical names, SMILES, or InChIKeys to PubChem identifiers.

Parameters:

identifiers – a list of identifiers to query
identifier_type – type of identifier to query. Possible values include: smiles, inchikey, inchi, name
cache_res – whether to cache the results

Raises:

ValueError – if the input_datasource is not provided or if the request fails

Returns:

a DataFrame containing the mapped identifiers and dictionary of the data resource metadata.

pyBiodatafuse.id_mapper.cid2chembl(cids: list) → dict[source]

Map PubChem CIDs to ChEMBL identifiers.

Parameters: cids – a list of CID identifiers to query
Raises: ValueError – if the input_datasource is not provided or if the request fails
Returns: a dictionary of ChEMBL identifiers mapped to CID identifiers and a dictionary of the data resource metadata.

Homologs

pyBiodatafuse.human_homologs.check_endpoint_ensembl() → bool[source]

Check if the endpoint of the Ensembl API is available.

Returns: True if the endpoint is available, otherwise False

pyBiodatafuse.human_homologs.check_version_ensembl() → str[source]

Check the current version of the REST API.

Returns: the current version of the REST API, as a string

pyBiodatafuse.human_homologs.get_human_homologs(row)[source]

Retrieve human homologs for mouse genes using Ensembl API.

Parameters: row – row from input dataframe.
Returns: dictionary mapping mouse genes to human homologs.

pyBiodatafuse.human_homologs.get_homologs(bridgedb_df)[source]

Retrieve homologs for input DataFrame.

Parameters: bridgedb_df – input dataframe.
Returns: dataframe including the human homologs as well as the metadata.

Annotation Orchestrator

pyBiodatafuse.id_annotator.run_gene_selected_sources(bridgedb_df: DataFrame, selected_sources_list: list, api_key: str | None = None, map_name: str | None = None) → Tuple[DataFrame, DefaultDict[str, dict]][source]

Query the selected databases and convert the output to a dataframe.

Parameters:

bridgedb_df – BridgeDb output for creating the list of gene ids to query.
selected_sources_list – list of selected databases.
api_key – DisGeNET API key (more details can be found at https://disgenet.com/plans).
map_name – name of the map you want to retrieve the information from. The extensive list can be found at https://minerva-net.lcsb.uni.lu/table.html.

Returns:

a DataFrame containing the combined output and dictionary of the metadata.

Raises:

ValueError – If ‘disgenet’ is in selected_sources_list and api_key is not provided, or if ‘minerva’ is in selected_sources_list and map_name is not provided.
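
The argument checks described under Raises can be sketched as a small pre-flight validator. The function name `validate_source_selection` is hypothetical and the error messages are invented. Only the two source names mentioned above are used, and the map name in the example is a placeholder.

```python
from typing import List, Optional


def validate_source_selection(
    selected_sources_list: List[str],
    api_key: Optional[str] = None,
    map_name: Optional[str] = None,
) -> None:
    """Raise ValueError when a selected source lacks its required argument."""
    if "disgenet" in selected_sources_list and not api_key:
        raise ValueError("An api_key is required when 'disgenet' is selected.")
    if "minerva" in selected_sources_list and not map_name:
        raise ValueError("A map_name is required when 'minerva' is selected.")


# Passes: the only selected source has its required argument.
validate_source_selection(["minerva"], map_name="example-map")

# Fails: 'disgenet' is selected without an API key.
try:
    validate_source_selection(["disgenet"])
except ValueError as error:
    message = str(error)
    print(message)
```

Running a check like this before calling `run_gene_selected_sources` surfaces a missing key or map name before any query is sent.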

Utility Functions

pyBiodatafuse.utils.get_identifier_of_interest(bridgedb_df: DataFrame, db_source: str, keep: List | None = None) → DataFrame[source]

Get identifier of interest from BridgeDb output file.

Parameters:

bridgedb_df – DataFrame containing the output from BridgeDb
db_source – identifier of interest from BridgeDb (e.g. “NCBI Gene”)
keep – list of additional identifier sources to keep in the output

Returns:

a DataFrame containing the identifiers of interest

pyBiodatafuse.utils.collapse_data_sources(data_df: DataFrame, source_namespace: str, target_df: DataFrame, common_cols: list, target_specific_cols: list, col_name: str) → DataFrame[source]

Collapse data sources into a single column.

Parameters:

data_df – BridgeDb DataFrame containing identifiers from all sources
source_namespace – identifier of interest from BridgeDb (e.g. “NCBI Gene”)
target_df – DataFrame containing data from an external source
common_cols – list of columns that are common to both dataframes and can be used to merge
target_specific_cols – list of columns that are specific to the external source
col_name – name of the new column to be created

Returns:

a DataFrame containing the new data columns for a new resource

pyBiodatafuse.utils.create_or_append_to_metadata(data: dict, prev_entry: List[dict]) → List[dict][source]

Create and/or append data to a metadata file.

Parameters:

data – dictionary of data to be saved to the metadata file.
prev_entry – list of dictionaries containing the previous data.

The metadata file has the following schema:

    {
        "datasource": name_of_datasource,
        "metadata": {
            "source_version": {source_version_info},
            "data_version": {data_version_info} (Optional)
        },
        "query": {
            "size": number_of_results_queried,
            "time": time_taken_to_run_the_query, (using datetime.datetime.now())
            "date": date_of_query,
            "url": url_of_query,
            "request_string": post_request_string (Optional)
        }
    }

Returns:

the updated list of metadata dictionaries
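
A minimal sketch of a metadata entry that follows the schema above, together with an append helper. The helper name `append_metadata`, the datasource name, the version, and the URL are placeholders. Skipping an entry that is already present is an assumption about the behaviour, not a statement of what the library does.

```python
from datetime import datetime
from typing import Dict, List


def append_metadata(data: Dict, prev_entry: List[Dict]) -> List[Dict]:
    """Return prev_entry with data appended, skipping exact duplicates.

    Hypothetical helper; the duplicate check is an assumption of this sketch.
    """
    if data in prev_entry:
        return prev_entry
    return prev_entry + [data]


start = datetime.now()
# ... the query against the data source would run here ...
elapsed = datetime.now() - start

entry = {
    "datasource": "ExampleSource",  # placeholder name
    "metadata": {"source_version": {"version": "1.0"}},  # placeholder version
    "query": {
        "size": 3,
        "time": str(elapsed),
        "date": start.strftime("%Y-%m-%d %H:%M:%S"),
        "url": "https://example.org/api",  # placeholder URL
    },
}

metadata = append_metadata(entry, [])
metadata = append_metadata(entry, metadata)  # second call adds nothing
print(len(metadata))
```

The optional `data_version` and `request_string` keys are omitted here, which the schema marks as allowed.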

pyBiodatafuse.utils.combine_sources(bridgedb_df: DataFrame, df_list: List[DataFrame]) → DataFrame[source]

Combine multiple dataframes into a single dataframe.

Parameters:

bridgedb_df – BridgeDb output.
df_list – list of dataframes to be combined.

Returns:

a single dataframe combining the data from the list of dataframes

pyBiodatafuse.utils.combine_with_homologs(df: DataFrame, homolog_dfs: list) → DataFrame[source]

Merge a DataFrame with a list of homolog dataframes.

Parameters:

df – An already combined df containing output of non-homolog annotators.
homolog_dfs – List of homolog dataframes to be combined.

Returns:

Merged DataFrame with homolog-derived data added and temporary columns removed.

pyBiodatafuse.utils.check_columns_against_constants(data_df: DataFrame, output_dict: dict, check_values_in: list)[source]

Check if columns in the data source output DataFrame match expected types and values from a dictionary of constants.

Parameters:

data_df – DataFrame to check.
output_dict – Dictionary containing expected types for columns.
check_values_in – List of column names to check values against constants.

pyBiodatafuse.utils.create_harmonized_input_file(annotated_df: DataFrame, target_col: str, target_source: str, identifier_source: str | None = None) → DataFrame[source]

Create a harmonized input DataFrame by extracting specific identifiers from a complex nested structure within a target column.

Parameters:

annotated_df – DataFrame containing the initial data with nested dictionaries.
target_col – Name of the column containing the nested dictionaries.
target_source – The specific identifier source to extract (e.g., ‘EFO’, ‘OMIM’).
identifier_source – The main identifier in the output.

Returns:

A DataFrame with original identifiers and the extracted target identifiers.

pyBiodatafuse.utils.give_annotator_warning(annotator_name: str) → None[source]

Get the warning message for an annotator.