Classes for performing gene set enrichment analysis (genometools.enrichment)
GeneSetEnrichmentAnalysis | Test a set of genes or a ranked list of genes for gene set enrichment.
StaticGSEResult | Result of a hypergeometric test for gene set enrichment.
RankBasedGSEResult | Result of an XL-mHG-based test for gene set enrichment.
-
class genometools.enrichment.GeneSetEnrichmentAnalysis(genome, gene_set_coll)
Test a set of genes or a ranked list of genes for gene set enrichment.
Parameters:
- genome (ExpGenome object) – See _genome attribute.
- gene_set_coll (GeneSetCollection object) – See _gene_set_coll attribute.
-
_gene_set_coll
GeneSetCollection object – The list of gene sets to be tested.
Notes
The class is initialized with a set of valid gene names (an ExpGenome object) and a collection of gene sets (a GeneSetCollection object). During initialization, a binary “gene-by-gene set” matrix is constructed, which records which genes are contained in each gene set. Although this matrix is quite sparse, it requires a significant amount of memory: for p = 10,000 genes and n = 10,000 gene sets, it occupies 100 MB (i.e., p x n bytes).
Once the class has been initialized, the function get_static_enrichment can be used to test a set of genes for gene set enrichment, and the function get_rank_based_enrichment can be used to test a ranked list of genes for gene set enrichment.
Note also that two conventions get mixed here: In the GeneSet class, a gene is simply a string containing the gene name, whereas in the ExpGenome class, a gene is an ExpGene object. Here, we represent genes as simple strings, since the user can always obtain the corresponding ExpGene object from the ExpGenome genome.
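The memory estimate in the notes can be checked with a small sketch. The helper names build_membership_matrix and dense_matrix_bytes are hypothetical and are not part of genometools; the sketch only assumes one byte per matrix entry, as the "p x n bytes" figure suggests, and says nothing about how the library stores the matrix internally.

```python
def build_membership_matrix(genes, gene_sets):
    """Return a dense binary matrix with one bytearray row per gene.

    Entry [i][j] is 1 if genes[i] belongs to gene_sets[j], and 0 otherwise.
    (Hypothetical helper for illustration only.)
    """
    index = {gene: i for i, gene in enumerate(genes)}
    matrix = [bytearray(len(gene_sets)) for _ in genes]
    for j, members in enumerate(gene_sets):
        for gene in members:
            i = index.get(gene)
            if i is not None:  # ignore genes that are not in the gene list
                matrix[i][j] = 1
    return matrix


def dense_matrix_bytes(num_genes, num_gene_sets, bytes_per_entry=1):
    """Memory needed for a dense matrix with one entry per (gene, gene set)."""
    return num_genes * num_gene_sets * bytes_per_entry


genes = ["TP53", "MYC", "EGFR", "KRAS"]
gene_sets = [{"TP53", "MYC"}, {"EGFR"}, {"KRAS", "TP53", "BRCA1"}]
matrix = build_membership_matrix(genes, gene_sets)

# p = 10,000 genes and n = 10,000 gene sets at one byte per entry
estimate_mb = dense_matrix_bytes(10_000, 10_000) / 1e6
print(estimate_mb, "MB")
```

Because the cost grows with the product p x n, restricting the gene set collection before initialization is the most direct way to reduce memory use.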
get_rank_based_enrichment(ranked_genes, pval_thresh=0.05, X_frac=0.25, X_min=5, L=None, adjust_pval_thresh=True, escore_pval_thresh=None, exact_pval=u'always', gene_set_ids=None, table=None)
Test for gene set enrichment at the top of a ranked list of genes.
This function uses the XL-mHG test to identify enriched gene sets.
This function also calculates XL-mHG E-scores for the enriched gene sets, using escore_pval_thresh as the p-value threshold “psi”.
Parameters:
- ranked_genes (list of str) – The ranked list of genes.
- pval_thresh (float, optional) – The p-value threshold used to determine significance. See also adjust_pval_thresh. [0.05]
- X_frac (float, optional) – The minimum fraction of genes from a gene set required for enrichment. [0.25]
- X_min (int, optional) – The minimum number of genes from a gene set required for enrichment. [5]
- L (int, optional) – The lowest cutoff to test for enrichment. If None, int(0.25*(no. of genes)) will be used. [None]
- adjust_pval_thresh (bool, optional) – Whether to adjust the p-value threshold for multiple testing, using the Bonferroni method. [True]
- escore_pval_thresh (float or None, optional) – The “psi” p-value threshold used in calculating E-scores. [None]
- exact_pval (str) – Choices are: “always”, “if_significant”, “if_necessary”. This parameter is passed to xlmhg.get_xlmhg_test_result. [“always”]
- gene_set_ids (list of str or None, optional) – A list of gene set IDs specifying which gene sets should be tested for enrichment. If None, all gene sets will be tested. [None]
- table (2-dim numpy.ndarray of type numpy.longdouble or None, optional) – The dynamic programming table used by the algorithm for calculating XL-mHG p-values. Passing this avoids memory re-allocation when calling this function repeatedly. [None]
Returns: A list of all significantly enriched gene sets.
Return type: list of RankBasedGSEResult
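Two of the defaults described above are simple arithmetic and can be sketched in isolation. The function names below are hypothetical and are not part of genometools. The sketch assumes that the Bonferroni correction divides pval_thresh by the number of gene sets tested, which is the standard form of the method; how the library counts tests is not stated here.

```python
def bonferroni_threshold(pval_thresh, num_tests):
    """Bonferroni-adjusted p-value threshold for num_tests tests.

    Assumption: num_tests is the number of gene sets being tested.
    """
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return pval_thresh / num_tests


def default_L(num_genes):
    """Default for L when L=None, as documented: int(0.25 * no. of genes)."""
    return int(0.25 * num_genes)


# Testing 1,000 gene sets at a nominal threshold of 0.05
print(bonferroni_threshold(0.05, 1000))

# With 20,000 ranked genes, cutoffs are tested down to rank 5,000
print(default_L(20000))
```

With adjust_pval_thresh=True, the effective threshold shrinks as more gene sets are tested, so passing a focused gene_set_ids list makes the correction less severe.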
-
get_static_enrichment(genes, pval_thresh, adjust_pval_thresh=True, K_min=3, gene_set_ids=None)
Find enriched gene sets in a set of genes.
Parameters:
- genes (set of str) – The set of genes to test for gene set enrichment.
- pval_thresh (float) – The significance level (p-value threshold) to use in the analysis.
- adjust_pval_thresh (bool, optional) – Whether to adjust the p-value threshold using a Bonferroni correction. (Warning: This is a very conservative correction!) [True]
- K_min (int, optional) – The minimum number of gene set genes present in the analysis. [3]
- gene_set_ids (Iterable or None) – A list of gene set IDs to test. If None, all gene sets that meet the K_min criterion are tested.
Returns: A list of all significantly enriched gene sets.
Return type: list of StaticGSEResult
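The results of this function are hypergeometric test results (see StaticGSEResult). As a minimal sketch of such a test, and not the library's implementation, the upper-tail p-value can be computed with the standard library alone. The function name hypergeom_pval and the symbols K (gene set genes present in the analysis) and k (gene set genes among the selected genes) are assumptions introduced for illustration; N and n follow the StaticGSEResult attributes.

```python
from math import comb


def hypergeom_pval(N, K, n, k):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: total number of genes in the analysis
    K: number of gene set genes among those N genes (assumed symbol)
    n: number of genes selected
    k: number of selected genes that belong to the gene set (assumed symbol)
    """
    total = comb(N, n)
    upper = min(K, n)
    # Sum the probabilities of observing k, k+1, ..., min(K, n) gene set genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# 10 genes in total, 4 in the gene set, 3 selected, 2 of them in the gene set
print(hypergeom_pval(10, 4, 3, 2))
```

A gene set would then be reported if this p-value falls at or below the (possibly Bonferroni-adjusted) pval_thresh; whether the library's comparison is strict is not specified here.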
-
class genometools.enrichment.StaticGSEResult(gene_set, N, n, selected_genes, pval)
Result of a hypergeometric test for gene set enrichment.
Parameters:
- gene_set (genometools.basic.GeneSet) – See gene_set.
- N (int) – See N.
- n (int) – See n.
- selected_genes (iterable of ExpGene) – See selected_genes.
- pval (float) – See pval.
-
gene_set
genometools.basic.GeneSet – The gene set.
-
N
int – The total number of genes in the analysis.
-
n
int – The number of genes selected.
-
pval
float – The hypergeometric p-value.
-
fold_enrichment
Returns the fold enrichment of the gene set.
Fold enrichment is defined as the ratio between the observed and the expected number of gene set genes present.
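The ratio can be illustrated with a short sketch. It assumes that the expected count is the hypergeometric mean n * K / N, where K (the number of gene set genes in the analysis) and k (the number observed among the selected genes) are symbols introduced here for illustration; the library's exact computation is not shown in this reference.

```python
def fold_enrichment(N, K, n, k):
    """Observed over expected number of gene set genes among the selected genes.

    Assumption: the expected count under random selection is n * K / N.
    """
    expected = n * K / N
    return k / expected


# 10,000 genes, 100 in the gene set, 200 selected, 10 of them in the gene set:
# 2 gene set genes are expected by chance, so the observed 10 is a 5-fold enrichment.
print(fold_enrichment(10000, 100, 200, 10))
```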
-
get_pretty_format(max_name_length=0)
Returns a nicely formatted string describing the result.
Parameters: max_name_length (int [0]) – The maximum length of the gene set name (in characters). If the gene set name is longer than this number, it will be truncated and “...” will be appended to it, so that the final string exactly meets the length requirement. If 0 (default), no truncation is performed. If not 0, must be at least 3.
Returns: The formatted string.
Return type: str
Raises: ValueError – If an invalid length value is specified.
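The truncation rule for max_name_length can be sketched as follows. The function truncate_name is a hypothetical stand-in written from the description above, not the library's code, and it covers only the name handling rather than the full formatted string.

```python
def truncate_name(name, max_name_length=0):
    """Shorten name to at most max_name_length characters, ending in "...".

    A value of 0 disables truncation; any other value below 3 is rejected,
    because the "..." suffix alone already takes three characters.
    """
    if max_name_length == 0:
        return name
    if max_name_length < 3:
        raise ValueError("max_name_length must be 0 or at least 3")
    if len(name) > max_name_length:
        # Keep room for the three-character suffix so the total length matches.
        return name[:max_name_length - 3] + "..."
    return name


print(truncate_name("regulation of transcription", 10))
print(truncate_name("cell cycle", 0))
```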
-
class genometools.enrichment.RankBasedGSEResult(gene_set, N, indices, ind_genes, X, L, stat, cutoff, pval, pval_thresh=None, escore_pval_thresh=None, escore_tol=None)
Result of an XL-mHG-based test for gene set enrichment.
This class inherits from xlmhg.mHGResult.
Parameters:
- gene_set (genometools.basic.GeneSet) – See gene_set attribute.
- N (int) – The total number of genes in the ranked list. See also xlmhg.mHGResult.N.
- indices (numpy.ndarray of integers) – The indices of the gene set genes in the ranked list.
- ind_genes (list of str) – See ind_genes attribute.
- X (int) – The XL-mHG X parameter.
- L (int) – The XL-mHG L parameter.
- stat (float) – The XL-mHG test statistic.
- cutoff (int) – The cutoff at which the XL-mHG test statistic was attained.
- pval (float) – The XL-mHG p-value.
- pval_thresh (float, optional) – The p-value threshold used in the analysis. [None]
- escore_pval_thresh (float, optional) – The hypergeometric p-value threshold used for calculating the E-score. If not specified, the XL-mHG p-value will be used, resulting in a conservative E-score. [None]
- escore_tol (float, optional) – The tolerance used for calculating the E-score. [None]
-
gene_set
genometools.basic.GeneSet – The gene set.
-
ind_genes
list of str – The names of the genes corresponding to the indices.