Helper functions
For lack of a better name, hpo3 comes with a helper submodule. It contains methods that use Rust’s multithreading to run large batch operations in parallel.
This is especially useful when analyzing large datasets.
Methods
- batch_similarity(comparisons, kind, method)
Calculate the similarity between pairs of HPOTerm in batches.
This method runs in parallel on all available CPUs.
- Parameters
comparisons (list[tuple[pyhpo.HPOTerm, pyhpo.HPOTerm]]) – A list of HPOTerm tuples. The two HPOTerm within one tuple will be compared to each other.
kind (str, default: omim) – Which kind of information content to use for the similarity calculation.
Available options:
omim
gene
method (str, default: graphic) – The method to use to calculate the similarity.
Available options:
resnik - Resnik P, Proceedings of the 14th IJCAI, (1995)
lin - Lin D, Proceedings of the 15th ICML, (1998)
jc - Jiang J, Conrath D, ROCLING X, (1997). This is different from PyHPO.
jc2 - Jiang J, Conrath D, ROCLING X, (1997). Same as jc, but kept for backwards compatibility.
rel - Relevance measure - Schlicker A, et al., BMC Bioinformatics, (2006)
ic - Information coefficient - Li B, et al., arXiv, (2010)
graphic - Graph-based Information coefficient - Deng Y, et al., PLoS One, (2015)
dist - Distance between terms
- Returns
The similarity scores of each comparison
- Return type
list[float]
- Raises
KeyError – Invalid kind provided
RuntimeError – Invalid method
Examples
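    import itertools
    from pyhpo import Ontology, HPOSet, helper

    # Load the ontology; this is required before any other call
    Ontology()

    # Build every pairwise combination of terms
    terms = [t for t in Ontology]
    term_combinations = list(itertools.combinations(terms, 2))

    # Score the first 10,000 pairs in one parallel batch
    similarities = helper.batch_similarity(
        term_combinations[0:10000],
        kind="omim",
        method="graphic"
    )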
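Because batch_similarity returns a flat list of floats, it can help to map the scores back onto the pairs that produced them. The following is a minimal sketch, assuming the scores come back in the same order as the input comparisons (the documentation states this explicitly only for the enrichment helpers). The function `scores_to_matrix` and the placeholder labels and scores are hypothetical and not part of hpo3.

```python
import itertools


def scores_to_matrix(items, scores, diagonal=None):
    """Arrange flat pairwise scores into a symmetric matrix.

    `scores` must hold one value per pair produced by
    itertools.combinations(items, 2), in that order.
    The self-similarity on the diagonal depends on the chosen
    method, so it is left as a caller-supplied value.
    """
    n = len(items)
    expected = n * (n - 1) // 2
    if len(scores) != expected:
        raise ValueError(f"expected {expected} scores, got {len(scores)}")
    matrix = [[diagonal] * n for _ in range(n)]
    index_pairs = itertools.combinations(range(n), 2)
    for (i, j), score in zip(index_pairs, scores):
        matrix[i][j] = score
        matrix[j][i] = score
    return matrix


# Placeholder labels and scores stand in for HPOTerm objects and
# for the output of helper.batch_similarity.
labels = ["A", "B", "C"]
flat_scores = [0.5, 0.2, 0.8]  # pairs: (A, B), (A, C), (B, C)
matrix = scores_to_matrix(labels, flat_scores)
```

The length check guards against passing a sliced batch (such as the first 10,000 pairs) together with the full list of items.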
- batch_set_similarity(comparisons, kind, method, combine)
Calculate the similarity between pairs of HPOSet in batches.
This method runs in parallel on all available CPUs.
- Parameters
comparisons (list[tuple[pyhpo.HPOSet, pyhpo.HPOSet]]) – A list of HPOSet tuples. The two HPOSet within one tuple will be compared to each other.
kind (str, default: omim) – Which kind of information content to use for the similarity calculation.
Available options:
omim
gene
method (str, default: graphic) – The method to use to calculate the similarity.
Available options:
resnik - Resnik P, Proceedings of the 14th IJCAI, (1995)
lin - Lin D, Proceedings of the 15th ICML, (1998)
jc - Jiang J, Conrath D, ROCLING X, (1997). This is different from PyHPO.
jc2 - Jiang J, Conrath D, ROCLING X, (1997). Same as jc, but kept for backwards compatibility.
rel - Relevance measure - Schlicker A, et al., BMC Bioinformatics, (2006)
ic - Information coefficient - Li B, et al., arXiv, (2010)
graphic - Graph-based Information coefficient - Deng Y, et al., PLoS One, (2015)
dist - Distance between terms
- Returns
The similarity scores of each comparison
- Return type
list[float]
- Raises
NameError – Ontology not yet constructed
KeyError – Invalid kind provided
RuntimeError – Invalid method or combine
Examples
    import itertools
    from pyhpo import Ontology, HPOSet, helper

    # Load the ontology; this is required before any other call
    Ontology()

    # One HPOSet per gene, then every pairwise combination of sets
    gene_sets = [g.hpo_set() for g in Ontology.genes]
    gene_set_combinations = list(itertools.combinations(gene_sets, 2))

    # Score the first 100 pairs in one parallel batch
    similarities = helper.batch_set_similarity(
        gene_set_combinations[0:100],
        kind="omim",
        method="graphic",
        combine="funSimAvg"
    )
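The combine argument in the example controls how term-level similarities are merged into a single score per pair of sets. As an illustration only, here is a pure-Python sketch of the averaging strategy commonly called funSimAvg: the best match for each term is taken in both directions and the two means are averaged. Whether hpo3 computes it in exactly this way is an assumption; `fun_sim_avg` and the toy `overlap` similarity are hypothetical names and not part of the library.

```python
def fun_sim_avg(set_a, set_b, similarity):
    """Combine pairwise similarities with a funSimAvg-style average."""
    if not set_a or not set_b:
        return 0.0
    # Best match in set_b for every item of set_a, and vice versa.
    best_a = [max(similarity(a, b) for b in set_b) for a in set_a]
    best_b = [max(similarity(a, b) for a in set_a) for b in set_b]
    mean_a = sum(best_a) / len(best_a)
    mean_b = sum(best_b) / len(best_b)
    return (mean_a + mean_b) / 2


def overlap(x, y):
    """Toy similarity: Jaccard index of the characters of two strings."""
    sx, sy = set(x), set(y)
    return len(sx & sy) / len(sx | sy)


# Strings stand in for HPO terms; lists stand in for HPOSet objects.
score = fun_sim_avg(["ab", "cd"], ["ab", "ce"], overlap)
```

Taking the best match in both directions keeps the result symmetric for a symmetric term similarity, even when the two sets differ in size.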
- batch_disease_enrichment(hposets)
Calculate enriched diseases in a list of HPOSet.
This method runs in parallel on all available CPUs.
Calculate the hypergeometric enrichment of diseases associated with the terms of each set. Each set is calculated individually, and the returned list has the same order as the input data.
- Parameters
hposets (list[pyhpo.HPOSet]) – A list of HPOSets. The enrichment of all diseases is calculated separately for each HPOSet in the list.
- Returns
The enrichment result for every disease. See pyhpo.stats.EnrichmentModel.enrichment() for details.
- Return type
list[list[dict]]
- Raises
NameError – Ontology not yet constructed
Examples
    from pyhpo import Ontology, helper

    Ontology()

    genes = [g for g in Ontology.genes[0:100]]
    gene_sets = [g.hpo_set() for g in genes]
    enrichments = helper.batch_disease_enrichment(gene_sets)

    for (gene, enriched_diseases) in zip(genes, enrichments):
        print(
            "The top enriched diseases for {} are: {}".format(
                gene.name,
                ", ".join([
                    f"{disease['item'].name}, ({disease['enrichment']})"
                    for disease in enriched_diseases[0:5]
                ])
            )
        )

    # >>> The top enriched diseases for C7 are: C7 deficiency, (3.6762699175625894e-42), C6 deficiency, (3.782313673973149e-37), C5 deficiency, (2.6614254464758174e-33), Complement factor B deficiency, (4.189056541495023e-32), Complement component 8 deficiency, type II, (8.87368759499919e-32)
    # >>> The top enriched diseases for WNT5A are: Robinow syndrome, autosomal recessive, (0.0), Robinow syndrome, autosomal dominant 1, (0.0), Pallister-Killian syndrome, (1.2993558687813034e-238), Robinow syndrome, autosomal dominant 3, (1.2014167106834296e-223), Peters-plus syndrome, (2.5163107554882648e-216)
    # >>> The top enriched diseases for TYMS are: Dyskeratosis congenita, X-linked, (5.008058437787544e-192), Dyskeratosis congenita, digenic, (2.703378203105612e-184), Dyskeratosis congenita, autosomal dominant 2, (1.3109083102058795e-150), Bloom syndrome, (3.965926308699221e-141), Dyskeratosis congenita, autosomal dominant 3, (1.123439117889186e-131)
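For readers unfamiliar with hypergeometric enrichment, the sketch below computes the textbook upper-tail probability using only Python's standard library. It illustrates the statistic and nothing more: the exact counts and conventions hpo3 uses internally are not described in this section, so the function name `hypergeom_sf` and all numbers are illustrative assumptions.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """Return P(X >= k) for a hypergeometric distribution.

    population: total number of items
    successes:  items in the population that carry the annotation
    draws:      size of the sample, drawn without replacement
    k:          annotated items observed in the sample
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the probability mass of every outcome at least as extreme as k.
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return hits / total


# Toy numbers: 10 items, 4 annotated, 3 drawn, at least 2 annotated seen.
p_value = hypergeom_sf(2, 10, 4, 3)  # 40 / 120
```

A small value means the observed overlap is unlikely under random sampling, which matches the very small numbers shown for the top hits in the example output.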
- batch_gene_enrichment(hposets)
Calculate enriched genes in a list of HPOSet.
This method runs in parallel on all available CPUs.
Calculate the hypergeometric enrichment of genes associated with the terms of each set. Each set is calculated individually, and the returned list has the same order as the input data.
- Parameters
hposets (list[pyhpo.HPOSet]) – A list of HPOSets. The enrichment of all genes is calculated separately for each HPOSet in the list.
- Returns
The enrichment result for every gene. See pyhpo.stats.EnrichmentModel.enrichment() for details.
- Return type
list[list[dict]]
- Raises
NameError – Ontology not yet constructed
Examples
    from pyhpo import Ontology, helper

    Ontology()

    diseases = [d for d in Ontology.omim_diseases[0:100]]
    disease_sets = [d.hpo_set() for d in diseases]
    enrichments = helper.batch_gene_enrichment(disease_sets)

    for (disease, enriched_genes) in zip(diseases, enrichments):
        print(
            "The top enriched genes for {} are: {}".format(
                disease.name,
                ", ".join([
                    f"{gene['item'].name}, ({gene['enrichment']})"
                    for gene in enriched_genes[0:5]
                ])
            )
        )

    # >>> The top enriched genes for Immunodeficiency 85 and autoimmunity are: TOM1, (7.207370728788139e-45), PIK3CD, (1.9560156243742087e-17), IL2RG, (1.0000718026169596e-16), BACH2, (3.373013104581288e-15), IL6ST, (3.760565282680126e-15)
    # >>> The top enriched genes for CODAS syndrome are: LONP1, (4.209128613268585e-80), EXTL3, (5.378742851736401e-23), SMC1A, (5.338807361962185e-22), FLNA, (1.0968887647112733e-21), COL2A1, (1.1029731783630839e-21)
    # >>> The top enriched genes for Rhizomelic chondrodysplasia punctata, type 1 are: PEX7, (9.556919089648523e-54), PEX5, (7.030392607093173e-22), PEX1, (3.7973830291601626e-19), PEX11B, (4.318791413029623e-19), HSPG2, (7.108950838424571e-19)
    # >>> The top enriched genes for Oculopharyngodistal myopathy 4 are: RILPL1, (1.4351489331895004e-49), LRP12, (2.168165858699749e-30), GIPC1, (3.180801819975307e-27), NOTCH2NLC, (1.0700847991253517e-23), VCP, (2.8742020666947536e-20)
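The enrichment examples pick the top hits by slicing each result list. A small helper can make that selection explicit. This is a sketch that assumes each result is a dict with the item and enrichment keys used in the examples, and that a smaller enrichment value means stronger enrichment, as the example output suggests. The function `top_hits`, its threshold, and the sample records are hypothetical.

```python
def top_hits(results, n=5, max_value=0.05):
    """Return up to n records with the smallest 'enrichment' value."""
    kept = [r for r in results if r["enrichment"] <= max_value]
    return sorted(kept, key=lambda r: r["enrichment"])[:n]


# Plain strings stand in for the gene or disease objects returned by hpo3.
sample = [
    {"item": "GENE_B", "enrichment": 2e-5},
    {"item": "GENE_C", "enrichment": 0.4},
    {"item": "GENE_A", "enrichment": 1e-12},
]
best = top_hits(sample, n=2)
names = [r["item"] for r in best]
```

Sorting explicitly avoids relying on the order in which the results happen to be returned.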