Core Utilities
Here you can find the core utility functions for data loading, ID mapping, and annotation orchestration.
Data Loaders
- pyBiodatafuse.data_loader.create_df_from_file(file_path: str) -> DataFrame
Create a DataFrame from a file containing a list of identifiers.
- Parameters:
file_path – path to the file containing the list of identifiers
- Returns:
a DataFrame containing the list of identifiers
- pyBiodatafuse.data_loader.create_df_from_text(text_input: str) -> DataFrame
Create a DataFrame from a text containing a list of identifiers.
- Parameters:
text_input – text containing the list of identifiers with each identifier on a new line.
- Returns:
a DataFrame containing the list of identifiers
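The one-identifier-per-line input format can be illustrated with a minimal standard-library sketch. This is not the library's implementation: the helper name parse_identifier_text is hypothetical, and trimming whitespace and skipping blank lines are assumptions. The "identifier" key mirrors the column name that bridgedb_xref expects in its input DataFrame.

```python
from typing import Dict, List


def parse_identifier_text(text_input: str) -> List[Dict[str, str]]:
    """Split newline-separated text into one record per identifier.

    Hypothetical helper: whitespace trimming and blank-line skipping are
    assumptions, not documented behaviour of create_df_from_text.
    """
    records = []
    for line in text_input.splitlines():
        identifier = line.strip()
        if identifier:  # ignore empty lines (assumption)
            records.append({"identifier": identifier})
    return records


rows = parse_identifier_text("TP53\nBRCA1\n\n  EGFR  \n")
print(rows)
```

A list of records in this shape could then be passed to a DataFrame constructor to obtain the single "identifier" column.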
- pyBiodatafuse.data_loader.create_df_from_dea(file_path: str) -> DataFrame
Read a dataframe containing the result of the differential expression analysis (DEA).
- Parameters:
file_path – path to the file containing the result of DEA
- Returns:
the DEA dataframe with proper column name
- Raises:
ValueError – if the file is not valid
- pyBiodatafuse.data_loader.filter_dea(data: DataFrame, column_name: str, min_value: float | None = None, max_value: float | None = None, abs_value: float | None = None) -> DataFrame
Filter the differential expression analysis (DEA) table.
- Parameters:
data – DEA dataframe
column_name – the column to filter
min_value – the minimum value
max_value – the maximum value
abs_value – the absolute value (when filtering for LogFoldChange)
- Returns:
the filtered DEA dataframe
- Raises:
ValueError – if the parameters are invalid
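How the three thresholds might interact can be sketched with plain Python records in place of a DataFrame. Everything here is an assumption made for illustration: the helper name filter_rows, the column names "logFC" and "padj", the inclusive comparisons, the reading of abs_value as "keep rows whose absolute value is at least this large", and the two conditions treated as invalid parameters. The library's actual rules may differ.

```python
from typing import Dict, List, Optional

Row = Dict[str, float]


def filter_rows(
    rows: List[Row],
    column_name: str,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
    abs_value: Optional[float] = None,
) -> List[Row]:
    """Keep rows whose value in column_name passes every given threshold."""
    # Assumed validity rules; the library may check different conditions.
    if min_value is None and max_value is None and abs_value is None:
        raise ValueError("Provide at least one of min_value, max_value, abs_value.")
    if min_value is not None and max_value is not None and min_value > max_value:
        raise ValueError("min_value must not exceed max_value.")

    kept = []
    for row in rows:
        value = row[column_name]
        if min_value is not None and value < min_value:
            continue
        if max_value is not None and value > max_value:
            continue
        # Assumption: abs_value keeps strong changes in either direction.
        if abs_value is not None and abs(value) < abs_value:
            continue
        kept.append(row)
    return kept


# Hypothetical DEA records with placeholder column names.
dea = [
    {"logFC": 2.5, "padj": 0.001},
    {"logFC": -1.8, "padj": 0.03},
    {"logFC": 0.2, "padj": 0.6},
]
strong = filter_rows(dea, "logFC", abs_value=1.0)
significant = filter_rows(dea, "padj", max_value=0.05)
```

Under these assumptions, abs_value suits a log fold change column, where both strongly up- and down-regulated genes are of interest, while max_value suits a p-value column.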
ID Mapping
- pyBiodatafuse.id_mapper.read_datasource_file() -> DataFrame
Read the datasource file.
- Returns:
a DataFrame containing the data from the datasource file
- pyBiodatafuse.id_mapper.match_input_datasource(identifiers) -> str
Check if the input identifiers match the datasource.
This function attempts to match the provided identifiers against known patterns in the datasource file and returns the corresponding data source.
- Parameters:
identifiers – a pandas DataFrame containing the identifiers to be matched
- Returns:
data source
- Raises:
ValueError – if the identifiers series is empty, no match is found, or multiple matches are found
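The matching rule described above (exactly one data source must fit, otherwise a ValueError is raised) can be sketched with regular expressions. The pattern table below is hypothetical and is not taken from the datasource file, the helper name match_datasource is made up for this example, and requiring every identifier to match is an assumption.

```python
import re
from typing import Dict, List

# Hypothetical pattern table for illustration; the real datasource file
# defines its own patterns, which may differ.
EXAMPLE_PATTERNS: Dict[str, str] = {
    "Ensembl": r"^ENSG\d{11}$",
    "HGNC Accession Number": r"^HGNC:\d+$",
    "ChEBI": r"^CHEBI:\d+$",
}


def match_datasource(identifiers: List[str], patterns: Dict[str, str]) -> str:
    """Return the single datasource whose pattern matches every identifier."""
    if not identifiers:
        raise ValueError("The list of identifiers is empty.")
    matches = [
        name
        for name, pattern in patterns.items()
        if all(re.match(pattern, identifier) for identifier in identifiers)
    ]
    if not matches:
        raise ValueError("No datasource matches the identifiers.")
    if len(matches) > 1:
        raise ValueError(f"Multiple datasources match: {matches}")
    return matches[0]


source = match_datasource(["ENSG00000141510", "ENSG00000012048"], EXAMPLE_PATTERNS)
print(source)
```

In this sketch a mixed list such as an Ensembl ID together with a ChEBI ID matches no single pattern and therefore raises ValueError.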
- pyBiodatafuse.id_mapper.get_version_webservice_bridgedb() -> dict
Get version of BridgeDb web service.
- Returns:
a dictionary containing the version information
- Raises:
ValueError – if failed to retrieve data
- pyBiodatafuse.id_mapper.get_version_datasource_bridgedb(input_species: str | None = None) -> List[str]
Get version of BridgeDb datasource.
- Parameters:
input_species – the species to query; currently only human is supported
- Returns:
a list containing the version information
- Raises:
ValueError – if failed to retrieve data
- pyBiodatafuse.id_mapper.bridgedb_xref(identifiers: DataFrame, input_species: str | None = None, output_datasource: list | None = None, input_datasource: Literal['Ensembl', 'NCBI Gene', 'HGNC', 'HGNC Accession Number', 'MGI', 'miRBase mature sequence', 'miRBase Sequence', 'OMIM', 'RefSeq', 'Rfam', 'RGD', 'SGD', 'UCSC Genome Browser', 'NCBI Protein', 'PDB', 'Pfam', 'Uniprot-TrEMBL', 'Uniprot-SwissProt', 'Affy', 'Agilent', 'Illumina', 'Gene Ontology', 'CAS', 'ChEBI', 'ChemSpider', 'ChEMBL compound', 'DrugBank', 'HMDB', 'Guide to Pharmacology Ligand ID', 'InChIKey', 'KEGG Compound', 'KEGG Drug', 'KEGG Glycan', 'LIPID MAPS', 'LipidBank', 'PharmGKB Drug', 'PubChem Compound', 'PubChem Substance', 'SwissLipids', 'TTD Drug', 'Wikidata', 'Wikipedia'] = 'HGNC') -> Tuple[DataFrame, dict]
Map input identifiers using BridgeDb.
- Parameters:
identifiers – A pandas DataFrame with one column named ‘identifier’.
input_species – Optional species name. Only ‘Homo sapiens’ is currently supported.
input_datasource – The type of identifier in the input DataFrame. Expected formats by datasource:
  - “HGNC”: e.g. “TP53”
  - “HGNC Accession Number”: e.g. “HGNC:11998”
  - “Ensembl”: e.g. “ENSG00000141510”
  - “NCBI Gene”: e.g. “7157”
  - “MGI”: e.g. “MGI:104874”
  - “miRBase mature sequence”: e.g. “hsa-miR-21-5p”
  - “miRBase Sequence”: e.g. “MI0000077”
  - “OMIM”: e.g. “191170”
  - “RefSeq”: e.g. “NM_000546”
  - “Rfam”: e.g. “RF00001”
  - “RGD”: e.g. “RGD:620474”
  - “SGD”: e.g. “YAL001C”
  - “UCSC Genome Browser”: e.g. “uc001aaa.3”
  - “NCBI Protein”: e.g. “NP_000537”
  - “PDB”: e.g. “1TUP”
  - “Pfam”: e.g. “PF00069”
  - “Uniprot-SwissProt”: e.g. “P04637”
  - “Uniprot-TrEMBL”: e.g. “Q9H0H5”
  - “Affy”: e.g. “202763_at”
  - “Agilent”: e.g. “A_23_P61180”
  - “Illumina”: e.g. “ILMN_1803030”
  - “Gene Ontology”: e.g. “GO:0006915”
  - “CAS”: e.g. “50-00-0”
  - “ChEBI”: e.g. “CHEBI:15377”
  - “ChemSpider”: e.g. “5798”
  - “ChEMBL compound”: e.g. “CHEMBL25”
  - “DrugBank”: e.g. “DB01050”
  - “HMDB”: e.g. “HMDB0000122”
  - “Guide to Pharmacology Ligand ID”: e.g. “1234”
  - “InChIKey”: e.g. “BSYNRYMUTXBXSQ-UHFFFAOYSA-N”
  - “KEGG Compound”: e.g. “C00031”
  - “KEGG Drug”: e.g. “D00001”
  - “KEGG Glycan”: e.g. “G00001”
  - “LIPID MAPS”: e.g. “LMFA01010001”
  - “LipidBank”: e.g. “LBID0001”
  - “PharmGKB Drug”: e.g. “PA449053”
  - “PubChem Compound”: e.g. “2244”
  - “PubChem Substance”: e.g. “12345678”
  - “SwissLipids”: e.g. “SLM:000000001”
  - “TTD Drug”: e.g. “D000001”
  - “Wikidata”: e.g. “Q18216”
  - “Wikipedia”: e.g. “Aspirin”
output_datasource – Optional list of identifier types to map to.
- Returns:
A tuple of a DataFrame with the mapped identifiers and a dictionary of data resource metadata.
- Raises:
ValueError – If required inputs are missing or the mapping fails.
- pyBiodatafuse.id_mapper.check_smiles(smile: str | None) -> str | None
Canonicalize the SMILES string of a compound.
- Parameters:
smile – SMILES string
- Returns:
canonicalized SMILES string
- pyBiodatafuse.id_mapper.get_cid_from_data(idx: str | None, idx_type: str) -> str | None
Get PubChem ID from any query using PubChemPy.
- Parameters:
idx – identifier to query
idx_type – type of identifier to query. Possible values include: smiles, inchikey, inchi, name
- Returns:
PubChem ID
- pyBiodatafuse.id_mapper.get_cid_from_pugrest(idx: str | None, idx_type: str) -> str | None
Get PubChem ID from any query through PubChem PUG REST.
- Parameters:
idx – identifier to query
idx_type – type of identifier to query. Possible values include: smiles, inchikey, inchi, name
- Returns:
PubChem ID
- pyBiodatafuse.id_mapper.pubchem_xref(identifiers: list, identifier_type: str = 'name', cache_res: bool = False) -> Tuple[DataFrame, dict]
Map chemical names, SMILES, or InChIKeys to PubChem identifiers.
- Parameters:
identifiers – a list of identifiers to query
identifier_type – type of identifier to query. Possible values include: smiles, inchikey, inchi, name
cache_res – whether to cache the results
- Raises:
ValueError – if the input_datasource is not provided or if the request fails
- Returns:
a DataFrame containing the mapped identifiers and a dictionary of the data resource metadata.
- pyBiodatafuse.id_mapper.cid2chembl(cids: list) -> dict
Map PubChem CIDs to ChEMBL identifiers.
- Parameters:
cids – a list of PubChem CIDs to query
- Raises:
ValueError – if the input_datasource is not provided or if the request fails
- Returns:
a dictionary of ChEMBL identifiers mapped to CIDs and a dictionary of the data resource metadata.
Homologs
- pyBiodatafuse.human_homologs.check_endpoint_ensembl() -> bool
Check if the endpoint of the Ensembl API is available.
- Returns:
True if the endpoint is available, otherwise False
- pyBiodatafuse.human_homologs.check_version_ensembl() -> str
Check the current version of the REST API.
- Returns:
the current version of the REST API
Annotation Orchestrator
- pyBiodatafuse.id_annotator.run_gene_selected_sources(bridgedb_df: DataFrame, selected_sources_list: list, api_key: str | None = None, map_name: str | None = None) -> Tuple[DataFrame, DefaultDict[str, dict]]
Query the selected databases and convert the output to a dataframe.
- Parameters:
bridgedb_df – BridgeDb output for creating the list of gene ids to query.
selected_sources_list – list of selected databases.
api_key – DisGeNET API key (more details can be found at https://disgenet.com/plans).
map_name – name of the map you want to retrieve the information from. The extensive list can be found at https://minerva-net.lcsb.uni.lu/table.html.
- Returns:
a DataFrame containing the combined output and a dictionary of the metadata.
- Raises:
ValueError – if ‘disgenet’ is in selected_sources_list and api_key is not provided, or if ‘minerva’ is in selected_sources_list and map_name is not provided.
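The two error conditions above can be checked before any query is sent. The following is a minimal sketch of such a pre-flight check, not the library's code: the helper name validate_source_arguments is hypothetical, and the key and map name in the usage lines are placeholders.

```python
from typing import List, Optional


def validate_source_arguments(
    selected_sources_list: List[str],
    api_key: Optional[str] = None,
    map_name: Optional[str] = None,
) -> None:
    """Raise ValueError when a selected source lacks its required argument."""
    if "disgenet" in selected_sources_list and not api_key:
        raise ValueError("An api_key is required when 'disgenet' is selected.")
    if "minerva" in selected_sources_list and not map_name:
        raise ValueError("A map_name is required when 'minerva' is selected.")


# Placeholder values; a real call needs a valid DisGeNET key and map name.
validate_source_arguments(["disgenet"], api_key="my-key")

try:
    validate_source_arguments(["minerva"])
except ValueError as error:
    message = str(error)
    print(message)
```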
Utility Functions
- pyBiodatafuse.utils.get_identifier_of_interest(bridgedb_df: DataFrame, db_source: str, keep: List | None = None) -> DataFrame
Get identifier of interest from BridgeDb output file.
- Parameters:
bridgedb_df – DataFrame containing the output from BridgeDb
db_source – identifier of interest from BridgeDb (e.g. “NCBI Gene”)
keep – list of additional identifier sources to keep in the output
- Returns:
a DataFrame containing the identifiers of interest
- pyBiodatafuse.utils.collapse_data_sources(data_df: DataFrame, source_namespace: str, target_df: DataFrame, common_cols: list, target_specific_cols: list, col_name: str) -> DataFrame
Collapse data sources into a single column.
- Parameters:
data_df – BridgeDb DataFrame containing identifiers from all sources
source_namespace – identifier of interest from BridgeDb (e.g. “NCBI Gene”)
target_df – DataFrame containing data from an external source
common_cols – list of columns that are common to both dataframes and can be used to merge
target_specific_cols – list of columns that are specific to the external source
col_name – name of the new column to be created
- Returns:
a DataFrame containing the new data columns for a new resource
- pyBiodatafuse.utils.create_or_append_to_metadata(data: dict, prev_entry: List[dict]) -> List[dict]
Create and/or append data to a metadata file.
- Parameters:
data – dictionary of data to be saved to the metadata file.
prev_entry – list of dictionaries containing the previous data. The metadata file has the following schema:
{
    "datasource": name_of_datasource,
    "metadata": {
        "source_version": {source_version_info},
        "data_version": {data_version_info} (Optional)
    },
    "query": {
        "size": number_of_results_queried,
        "time": time_taken_to_run_the_query, (using datetime.datetime.now())
        "date": date_of_query,
        "url": url_of_query,
        "request_string": post_request_string (Optional)
    }
}
- Returns:
the updated list of metadata dictionaries
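An entry that follows the schema above can be assembled and appended as in this minimal sketch. It is not the library's implementation: the helper name append_metadata is hypothetical, skipping exact duplicates is an assumption, and the datasource name, version, and URL are placeholders.

```python
from datetime import datetime
from typing import Dict, List


def append_metadata(data: Dict, prev_entry: List[Dict]) -> List[Dict]:
    """Return prev_entry with data appended unless an equal entry exists."""
    # Assumption: exact duplicates are skipped; the library may differ.
    if data not in prev_entry:
        prev_entry.append(data)
    return prev_entry


start = datetime.now()
# ... a query against the data source would run here ...
elapsed = datetime.now() - start

entry = {
    "datasource": "ExampleDB",  # placeholder name
    "metadata": {"source_version": {"version": "1.0"}},  # placeholder version
    "query": {
        "size": 3,
        "time": str(elapsed),
        "date": start.strftime("%Y-%m-%d %H:%M:%S"),
        "url": "https://example.org/api",  # placeholder URL
    },
}

metadata = append_metadata(entry, [])
metadata = append_metadata(entry, metadata)  # second call adds nothing
print(len(metadata))
```

The optional "data_version" and "request_string" keys are left out here; they would be added only when the data source reports them.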
- pyBiodatafuse.utils.combine_sources(bridgedb_df: DataFrame, df_list: List[DataFrame]) -> DataFrame
Combine multiple dataframes into a single dataframe.
- Parameters:
bridgedb_df – BridgeDb output.
df_list – list of dataframes to be combined.
- Returns:
a single dataframe combining the data from the list of dataframes
- pyBiodatafuse.utils.combine_with_homologs(df: DataFrame, homolog_dfs: list) -> DataFrame
Merge a DataFrame with a list of homolog dataframes.
- Parameters:
df – An already combined df containing output of non-homolog annotators.
homolog_dfs – List of homolog dataframes to be combined.
- Returns:
Merged DataFrame with homolog-derived data added and temporary columns removed.
- pyBiodatafuse.utils.check_columns_against_constants(data_df: DataFrame, output_dict: dict, check_values_in: list)
Check if columns in the data source output DataFrame match expected types and values from a dictionary of constants.
- Parameters:
data_df – DataFrame to check.
output_dict – Dictionary containing expected types for columns.
check_values_in – List of column names to check values against constants.
- pyBiodatafuse.utils.create_harmonized_input_file(annotated_df: DataFrame, target_col: str, target_source: str, identifier_source: str | None = None) -> DataFrame
Create a harmonized input DataFrame by extracting specific identifiers from a complex nested structure within a target column.
- Parameters:
annotated_df – DataFrame containing the initial data with nested dictionaries.
target_col – Name of the column containing the nested dictionaries.
target_source – The specific identifier source to extract (e.g., ‘EFO’, ‘OMIM’).
identifier_source – The main identifier in the output.
- Returns:
A DataFrame with original identifiers and the extracted target identifiers.