Classes for performing gene set enrichment analysis (genometools.enrichment)

GeneSetEnrichmentAnalysis – Test a set of genes or a ranked list of genes for gene set enrichment.
StaticGSEResult – Result of a hypergeometric test for gene set enrichment.
RankBasedGSEResult – Result of an XL-mHG-based test for gene set enrichment.

class genometools.enrichment.GeneSetEnrichmentAnalysis(genome, gene_set_coll)[source]

Test a set of genes or a ranked list of genes for gene set enrichment.

Parameters:
_genome

ExpGenome object – The universe of genes.

_gene_set_coll

GeneSetCollection object – The list of gene sets to be tested.

Notes

The class is initialized with a set of valid gene names (an ExpGenome object), as well as a set of gene sets (a GeneSetCollection object). During initialization, a binary “gene-by-gene set” matrix is constructed, which records which genes are contained in each gene set. This matrix is quite sparse, yet it requires a significant amount of memory. As an example, for p = 10,000 genes and n = 10,000 gene sets, the matrix occupies 100 MB of memory (i.e., p x n bytes).
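The memory figure above follows directly from the p x n bytes rule. A minimal sketch of that arithmetic, assuming one byte per matrix entry as stated in the note (the helper name `dense_matrix_bytes` is hypothetical and not part of genometools):

```python
def dense_matrix_bytes(num_genes, num_gene_sets, bytes_per_entry=1):
    """Estimate the memory needed for a gene-by-gene set matrix.

    Hypothetical helper for illustration; assumes one byte per entry,
    as implied by the "p x n bytes" figure in the note above.
    """
    return num_genes * num_gene_sets * bytes_per_entry


# p = 10,000 genes and n = 10,000 gene sets
size_in_bytes = dense_matrix_bytes(10_000, 10_000)
print(size_in_bytes / 1e6, "MB")
```

Because the estimate grows with the product of both dimensions, halving the number of gene sets halves the memory required.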

Once the class has been initialized, the method get_static_enrichment can be used to test a set of genes for gene set enrichment, and the method get_rank_based_enrichment can be used to test a ranked list of genes for gene set enrichment.

Note also that two conventions get mixed here: In the GeneSet class, a gene simply corresponds to a string containing the gene name, whereas in the ExpGenome class, a gene is an ExpGene object. Here, we represent genes as simple strings, since the user can always obtain the corresponding ExpGene object from the ExpGenome genome.

get_rank_based_enrichment(ranked_genes, pval_thresh=0.05, X_frac=0.25, X_min=5, L=None, adjust_pval_thresh=True, escore_pval_thresh=None, exact_pval=u'always', gene_set_ids=None, table=None)[source]

Test for gene set enrichment at the top of a ranked list of genes.

This function uses the XL-mHG test to identify enriched gene sets.

This function also calculates XL-mHG E-scores for the enriched gene sets, using escore_pval_thresh as the p-value threshold “psi”.

Parameters:
  • ranked_genes (list of str) – The ranked list of genes.
  • pval_thresh (float, optional) – The p-value threshold used to determine significance. See also adjust_pval_thresh. [0.05]
  • X_frac (float, optional) – The min. fraction of genes from a gene set required for enrichment. [0.25]
  • X_min (int, optional) – The min. no. of genes from a gene set required for enrichment. [5]
  • L (int, optional) – The lowest cutoff to test for enrichment. If None, int(0.25*(no. of genes)) will be used. [None]
  • adjust_pval_thresh (bool, optional) – Whether to adjust the p-value threshold for multiple testing, using the Bonferroni method. [True]
  • escore_pval_thresh (float or None, optional) – The “psi” p-value threshold used in calculating E-scores. [None]
  • exact_pval (str) – Choices are: “always”, “if_significant”, “if_necessary”. Parameter will be passed to xlmhg.get_xlmhg_test_result. [“always”]
  • gene_set_ids (list of str or None, optional) – A list of gene set IDs to specify which gene sets should be tested for enrichment. If None, all gene sets will be tested. [None]
  • table (2-dim numpy.ndarray of type numpy.longdouble or None, optional) – The dynamic programming table used by the algorithm for calculating XL-mHG p-values. Passing this avoids memory re-allocation when calling this function repeatedly. [None]
Returns:

A list of all significantly enriched gene sets.

Return type:

list of RankBasedGSEResult
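The defaults described for get_rank_based_enrichment can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions, not the library's code: the helper names are hypothetical, the Bonferroni correction is assumed to divide the threshold by the number of gene sets tested, and combining X_frac and X_min by taking the larger of the two is an assumption about how the two minimums interact.

```python
import math


def bonferroni_threshold(pval_thresh, num_tests):
    """Bonferroni-adjusted threshold: divide by the number of tests."""
    return pval_thresh / num_tests


def default_L(num_genes):
    """Default for L when None is passed: int(0.25*(no. of genes))."""
    return int(0.25 * num_genes)


def effective_X(gene_set_size, X_frac=0.25, X_min=5):
    """ASSUMPTION: the stricter of the two minimums applies."""
    return max(X_min, math.ceil(X_frac * gene_set_size))


print(bonferroni_threshold(0.05, 1000))  # threshold for 1,000 gene sets
print(default_L(10_000))                 # lowest cutoff for 10,000 genes
print(effective_X(40))                   # minimum hits for a 40-gene set
```

With the Bonferroni adjustment enabled, testing more gene sets makes the per-set threshold stricter, which is why restricting gene_set_ids can affect which sets are reported.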

get_static_enrichment(genes, pval_thresh, adjust_pval_thresh=True, K_min=3, gene_set_ids=None)[source]

Find enriched gene sets in a set of genes.

Parameters:
  • genes (set of str) – The set of genes to test for gene set enrichment.
  • pval_thresh (float) – The significance level (p-value threshold) to use in the analysis.
  • adjust_pval_thresh (bool, optional) – Whether to adjust the p-value threshold using a Bonferroni correction. (Warning: This is a very conservative correction!) [True]
  • K_min (int, optional) – The minimum number of gene set genes present in the analysis. [3]
  • gene_set_ids (Iterable or None) – A list of gene set IDs to test. If None, all gene sets are tested that meet the K_min criterion.
Returns:

A list of all significantly enriched gene sets.

Return type:

list of StaticGSEResult
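The static test is described as a hypergeometric test. A minimal, standard-library sketch of an upper-tail hypergeometric p-value is shown below; it is not taken from genometools. N and n follow the StaticGSEResult parameters (total genes, genes selected), while K (gene set genes in the analysis) and k (gene set genes among those selected) are notation assumed here for illustration.

```python
from math import comb


def hypergeom_pval(N, K, n, k):
    """Upper-tail hypergeometric p-value, P(X >= k).

    N: total number of genes in the analysis
    K: number of gene set genes in the analysis (assumed notation)
    n: number of genes selected
    k: number of gene set genes among the selected genes (assumed notation)
    """
    total = comb(N, n)
    upper = min(K, n)
    # Sum the probabilities of observing k, k+1, ..., min(K, n) hits.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy example: 10 genes, 5 in the gene set, 5 selected, all 5 are hits.
print(hypergeom_pval(10, 5, 5, 5))
```

A gene set would then be reported when this p-value falls at or below the (optionally Bonferroni-adjusted) pval_thresh.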

class genometools.enrichment.StaticGSEResult(gene_set, N, n, selected_genes, pval)[source]

Result of a hypergeometric test for gene set enrichment.

Parameters:
gene_set

genometools.basic.GeneSet – The gene set.

N

int – The total number of genes in the analysis.

n

int – The number of genes selected.

selected_genes

set of ExpGene – The genes from the gene set found present.

pval

float – The hypergeometric p-value.

fold_enrichment

Returns the fold enrichment of the gene set.

Fold enrichment is defined as the ratio between the observed and the expected number of gene set genes present.
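The definition above can be written out as a short calculation. This is a sketch assuming the expected count is the hypergeometric mean K * n / N, where K (the number of gene set genes in the analysis) and k (the observed number among the selected genes) are notation introduced here; the function is illustrative and not the library's property.

```python
def fold_enrichment(N, K, n, k):
    """Observed over expected number of gene set genes present.

    ASSUMPTION: the expected count is the hypergeometric mean K * n / N.
    """
    expected = K * n / N
    return k / expected


# 1,000 genes, 50 in the gene set, 100 selected, 20 of them in the set:
# 5 hits are expected by chance, so 20 observed hits is a 4-fold enrichment.
print(fold_enrichment(1000, 50, 100, 20))  # 4.0
```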

get_pretty_format(max_name_length=0)[source]

Returns a nicely formatted string describing the result.

Parameters: max_name_length (int [0]) – The maximum length of the gene set name (in characters). If the gene set name is longer than this number, it will be truncated and "..." will be appended to it, so that the final string exactly meets the length requirement. If 0 (default), no truncation is performed. If not 0, must be at least 3.
Returns: The formatted string.
Return type: str
Raises: ValueError – If an invalid length value is specified.

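The truncation rule for max_name_length can be sketched as follows. This is one possible reading of the description above, not the library's implementation; `truncate_name` is a hypothetical helper, and treating every value other than 0 that is below 3 as invalid is an assumption.

```python
def truncate_name(name, max_name_length=0):
    """Truncate a gene set name to at most max_name_length characters.

    Hypothetical helper mirroring the documented behaviour: 0 disables
    truncation, and any other value must leave room for the "..." suffix.
    """
    if max_name_length == 0:
        return name
    if max_name_length < 3:
        raise ValueError("max_name_length must be 0 or at least 3.")
    if len(name) > max_name_length:
        # Keep room for the three-character ellipsis.
        return name[:max_name_length - 3] + "..."
    return name


print(truncate_name("regulation of cell cycle", 10))  # regulat...
```

The truncated result is exactly max_name_length characters long, which keeps columns aligned when several results are printed one per line.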
class genometools.enrichment.RankBasedGSEResult(gene_set, N, indices, ind_genes, X, L, stat, cutoff, pval, pval_thresh=None, escore_pval_thresh=None, escore_tol=None)[source]

Result of an XL-mHG-based test for gene set enrichment.

This class inherits from xlmhg.mHGResult.

Parameters:
  • gene_set (genometools.basic.GeneSet) – See gene_set attribute.
  • N (int) – The total number of genes in the ranked list. See also xlmhg.mHGResult.N.
  • indices (numpy.ndarray of integers) – The indices of the gene set genes in the ranked list.
  • ind_genes (list of str) – See ind_genes attribute.
  • X (int) – The XL-mHG X parameter.
  • L (int) – The XL-mHG L parameter.
  • stat (float) – The XL-mHG test statistic.
  • cutoff (int) – The cutoff at which the XL-mHG test statistic was attained.
  • pval (float) – The XL-mHG p-value.
  • pval_thresh (float, optional) – The p-value threshold used in the analysis. [None]
  • escore_pval_thresh (float, optional) – The hypergeometric p-value threshold used for calculating the E-score. If not specified, the XL-mHG p-value will be used, resulting in a conservative E-score. [None]
  • escore_tol (float, optional) – The tolerance used for calculating the E-score. [None]
gene_set

genometools.basic.GeneSet – The gene set.

ind_genes

list of str – The names of the genes corresponding to the indices.