DNA microarray experiments typically yield a long list of differentially expressed genes. It is up to the biologist to extract meaningful patterns from these lists. In recent years, several methods have been proposed to analyze microarray data in terms of gene sets rather than single genes (Goeman et al. 2007). A gene set is any collection of genes linked by a known relationship, e.g. participation in the same functional pathway or chromosomal proximity. Given an a-priori defined (i.e. known from previous experiments) list of gene sets, Gene Set Analysis (GSA) methods compute a "global" statistic to test each set as a whole for differential expression. A correction for multiple hypothesis testing is then performed, typically by computing a False Discovery Rate (FDR) q-value for each test.
Our group is developing a Python framework for GSA named pygsa, which aims to provide Python implementations for the most popular analysis methods. As of version 1.1, the only supported method is SAM-GS (Dinu et al. 2007). SAM-GS is also available in the original R implementation by Irina Dinu at http://www.ualberta.ca/~yyasui/homepage.html.
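For orientation, the SAM-GS statistic of Dinu et al. (2007) sums the squared SAM d statistics of the genes in a set. The following is a minimal pure-Python sketch of that idea for two classes, not pygsa's actual code: the function names, the dict-of-lists data layout and the default value of the fudge constant s0 are assumptions made for illustration.

```python
from math import sqrt


def sam_d(values, labels, s0=0.1):
    """SAM-style d statistic for one gene; labels are 0 or 1 per sample.

    s0 is a small positive constant that stabilises genes with low
    variance (its default here is a hypothetical value).
    """
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    # Pooled within-class sum of squares.
    ss = sum((v - m1) ** 2 for v in a) + sum((v - m2) ** 2 for v in b)
    s = sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


def sam_gs(expr, labels, gene_set, s0=0.1):
    """Sum of squared d statistics over the set's genes found in expr.

    expr maps a gene name to its list of expression values, one per sample.
    """
    return sum(sam_d(expr[g], labels, s0) ** 2 for g in gene_set if g in expr)
```

Genes of the set that are missing from the dataset are simply skipped, which mirrors the intersection with dataset genes mentioned for the set size options.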
What's new: in version 1.1, pygsa has been rewritten to be more efficient: it uses less memory and runs faster on medium-sized datasets.
NOTE: version 1.0 does not require installation; just
unpack the archive and run python sam_gs.py --help.
Download the pygsa tarball or zip archive, extract its contents and run
python setup.py install. You can then run sam_gs.py
--help to get usage info. Note for Windows users: if
you cannot run sam_gs.py from the command prompt, check that the "Scripts"
subdirectory of your Python home directory
(e.g. C:\Python25\Scripts) is listed in your PATH environment
variable.
GSA methods require three input files: a gene expression dataset, a class labeling of the samples, and a list of gene sets.
Currently supported file formats are a subset of the ones supported by GSEA: GCT for dataset files, CLS for class labeling and GMT for gene set lists. See the GSEA data formats page for a detailed description.
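To make the two simpler formats concrete, here is a small reading sketch. It assumes the usual GSEA layouts: a GMT line is a set name, a description and then the genes, all tab-separated; a categorical CLS file has a counts line, a line of class names starting with "#", and a line of per-sample labels. The function names are hypothetical and this is not the parser pygsa uses, which supports only a subset of these formats.

```python
def parse_gmt(lines):
    """Map each gene set name to the set of its gene names."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        # fields[1] is the free-text description and is ignored here.
        gene_sets[fields[0]] = {g for g in fields[2:] if g}
    return gene_sets


def parse_cls(lines):
    """Return (class_names, labels) from a categorical CLS file."""
    rows = [line.split() for line in lines if line.strip()]
    n_samples = int(rows[0][0])
    class_names = rows[1][1:]  # drop the leading '#'
    labels = rows[2]
    if len(labels) != n_samples:
        raise ValueError("CLS header and label line disagree on sample count")
    return class_names, labels
```

Both functions accept any iterable of lines, so an open file object can be passed directly, e.g. parse_gmt(open("mydata.gmt")).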
To run SAM-GS, call:
python sam_gs.py [OPTIONS] GCT_FILE CLS_FILE GMT_FILE
where GCT_FILE, CLS_FILE, GMT_FILE are the three GSA required files described in the Input Files section. The most important options are:
-p INT: number of random permutations of the class labels to
perform for calculating the empirical p-values. If the total number of
distinct permutations of the class labels vector is smaller, sam_gs
will automatically use the full set of permutations. The default value
is 1000. A larger number of permutations will yield more accurate
p-values but it will also increase the running time.
-m INT, --ngmin=INT: minimum number of genes allowed
for a gene set. Gene set size is computed after intersection with
dataset genes.
-M INT, --ngmax=INT: maximum number of genes allowed
for a gene set. Gene set size is computed after intersection with
dataset genes.
-l FLOAT: lambda threshold for q-values. If this option is not set, lambda will be automatically estimated.
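The lambda option suggests that q-values follow the approach of Storey and Tibshirani, where lambda sets the p-value cutoff used to estimate the proportion of true null hypotheses. The sketch below works under that assumption and with a fixed lambda only; it does not reproduce the automatic estimation of lambda, and the function name and default are hypothetical.

```python
def q_values(p_values, lam=0.5):
    """Storey-style q-values for a list of p-values, with fixed lambda."""
    m = len(p_values)
    # Estimated proportion of true nulls: p-values above lambda are
    # assumed to come mostly from null hypotheses.
    pi0 = sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))
    pi0 = min(pi0, 1.0)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q
```

A smaller lambda uses more p-values for the estimate (lower variance) at the price of a more conservative one, which is why the threshold is worth exposing as an option.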
Run sam_gs.py --help to get the full option list. The
output file is a tab-separated csv with four columns:
Example 1: run 10000 permutations
python sam_gs.py mydata.gct mydata.cls mydata.gmt -p 10000
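Whether 10000 random permutations are actually drawn depends on how many distinct arrangements the class labels admit. The sketch below illustrates that choice: it counts the distinct permutations with a multinomial coefficient and enumerates them all when the count does not exceed the requested number, otherwise it shuffles at random. The function names and the seed parameter are hypothetical, and sam_gs may implement this differently.

```python
import random
from collections import Counter
from math import factorial


def count_distinct_permutations(labels):
    """n! / (n1! * n2! * ...) for a label vector with repeated values."""
    total = factorial(len(labels))
    for k in Counter(labels).values():
        total //= factorial(k)
    return total


def distinct_permutations(labels):
    """Yield every distinct arrangement of labels exactly once."""
    counts = Counter(labels)
    n = len(labels)

    def extend(prefix):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for value in sorted(counts):
            if counts[value] > 0:
                counts[value] -= 1
                prefix.append(value)
                yield from extend(prefix)
                prefix.pop()
                counts[value] += 1

    return extend([])


def label_permutations(labels, n_perm=1000, seed=None):
    """Full enumeration when it is small enough, random shuffles otherwise."""
    if count_distinct_permutations(labels) <= n_perm:
        return list(distinct_permutations(labels))
    rng = random.Random(seed)
    result = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        result.append(tuple(shuffled))
    return result
```

With 4 samples split 2 versus 2 there are only 6 distinct labelings, so asking for 10000 permutations would yield those 6; with 10 versus 10 there are 184756, so random sampling is used.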
Example 2: set minimum and maximum set size
python sam_gs.py mydata.gct mydata.cls mydata.gmt -m 15 -M 500
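The size limits in Example 2 apply to each gene set after it has been intersected with the genes present in the dataset. A minimal sketch of such a filter follows; the function name, the dict-of-sets layout and the default limits are assumptions for illustration, not pygsa's own code or defaults.

```python
def filter_gene_sets(gene_sets, dataset_genes, ngmin=1, ngmax=None):
    """Keep gene sets whose overlap with the dataset is within the limits.

    gene_sets maps a set name to an iterable of gene names; the returned
    dict maps each kept name to the genes it shares with the dataset.
    """
    dataset_genes = set(dataset_genes)
    kept = {}
    for name, genes in gene_sets.items():
        common = set(genes) & dataset_genes
        if len(common) < ngmin:
            continue  # too few genes measured in this dataset
        if ngmax is not None and len(common) > ngmax:
            continue  # too many genes for a focused test
        kept[name] = common
    return kept
```

Note that a set listed with 20 genes in the GMT file is still dropped by -m 15 if only 10 of its genes appear in the dataset.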
pygsa is developed by Simone Leo at the Distributed Computing Group, CRS4. For any questions, comments or suggestions, contact the author at simone.leo (at) crs4.it.