CRS4 - Distributed Computing

pygsa: Python Gene Set Analysis


Introduction

DNA microarray experiments typically yield a long list of differentially expressed genes, and it is up to the biologist to extract meaningful patterns from these lists. In recent years, several methods have been proposed to analyze microarray data in terms of gene sets rather than single genes (Goeman et al. 2007). A gene set is any collection of genes linked by a known relationship, e.g. participation in the same functional pathway or chromosomal proximity. Given a list of gene sets defined a priori (i.e. known from previous experiments), Gene Set Analysis (GSA) methods compute a "global" statistic to test each set as a whole for differential expression. A correction for multiple hypothesis testing is then performed, typically by computing a False Discovery Rate (FDR) q-value for each test.

pygsa

Our group is developing a Python framework for GSA named pygsa, which aims to provide Python implementations for the most popular analysis methods. As of version 1.1, the only supported method is SAM-GS (Dinu et al. 2007). SAM-GS is also available in the original R implementation by Irina Dinu at http://www.ualberta.ca/~yyasui/homepage.html.

Minimum Requirements

Download

What's new: in version 1.1, pygsa has been rewritten to be more efficient: it uses less memory and runs faster on medium-sized datasets.

Previous Versions

NOTE: version 1.0 does not require installation. Just unpack the archive and run python sam_gs.py --help.

Installation

Download the pygsa tarball or zip archive, extract its contents and run python setup.py install. You can then run sam_gs.py --help to get usage info. Note for Windows users: if you cannot run sam_gs.py from the command prompt, check that the "Scripts" subdirectory of your Python home directory (e.g. C:\Python25\Scripts) is listed in your PATH environment variable.

Input Files

GSA methods require three input files: a dataset file, a class labeling file and a gene set list file.

Currently supported file formats are a subset of the ones supported by GSEA: GCT for dataset files, CLS for class labeling and GMT for gene set lists. See the GSEA data formats page for a detailed description.
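To make the layout of the two simpler formats concrete, here is a minimal sketch of reading a GMT gene set list and a categorical CLS class file using only the standard library. It is not pygsa's own parser: the function names parse_gmt and parse_cls and the sample contents are hypothetical, and the sketch assumes the usual layouts (GMT: one tab-separated line per set with name, description, then genes; CLS: a counts line, a "#" line with class names, then one label per sample).

```python
import io


def parse_gmt(handle):
    """Return {set_name: [gene, ...]} from a GMT stream (hypothetical helper)."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, genes = fields[0], fields[2:]  # fields[1] is the description
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


def parse_cls(handle):
    """Return (class_names, labels) from a categorical CLS stream."""
    lines = [ln.strip() for ln in handle if ln.strip()]
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header does not match its contents")
    return class_names, labels


# Made-up file contents, for illustration only.
gmt_text = "PATHWAY_A\tna\tTP53\tBRCA1\tEGFR\nPATHWAY_B\tna\tMYC\tKRAS\n"
cls_text = "4 2 1\n# control treated\n0 0 1 1\n"

gene_sets = parse_gmt(io.StringIO(gmt_text))
class_names, labels = parse_cls(io.StringIO(cls_text))
```

Real files would be opened with open(path) instead of io.StringIO; the GSEA data formats page remains the authoritative description.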

SAM-GS User Manual

To run SAM-GS, call:

python sam_gs.py [OPTIONS] GCT_FILE CLS_FILE GMT_FILE

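where GCT_FILE, CLS_FILE and GMT_FILE are the three required input files described in the Input Files section. The most important options, as used in the examples below, are -p (number of permutations), -m (minimum gene set size) and -M (maximum gene set size).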

Run sam_gs.py --help to get the full option list. The output is a tab-separated text file with four columns:

  1. gene set name;
  2. SAM-GS statistic value: this is the sum of the squares of SAM's d statistics for all genes in the gene set;
  3. p-value of the test for differential expression. This is computed as the fraction of permutations of the class labels vector which yield a SAM-GS statistic greater than the one computed for the original dataset;
  4. Storey FDR q-value of the test. If the lambda parameter is not given as input, it will be estimated with the bootstrap technique described in (Storey 2002).
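The statistic and p-value in columns 2 and 3 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pygsa implementation: the function names are hypothetical, classes are encoded as 0 and 1, and the SAM constant s0 (which SAM estimates from the data) is replaced by a fixed, arbitrary value.

```python
import math
import random


def sam_d(values, labels, s0=0.1):
    """SAM-style d statistic for one gene; s0 is a fixed illustrative constant."""
    g1 = [v for v, l in zip(values, labels) if l == 0]
    g2 = [v for v, l in zip(values, labels) if l == 1]
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    # Pooled within-class sum of squares.
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m2) ** 2 for v in g2)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


def sam_gs(expr, gene_set, labels, s0=0.1):
    """Sum of squared d statistics over the genes of the set present in expr."""
    return sum(sam_d(expr[g], labels, s0) ** 2 for g in gene_set if g in expr)


def permutation_p_value(expr, gene_set, labels, n_perm=1000, s0=0.1, seed=0):
    """Fraction of label permutations whose statistic exceeds the observed one."""
    rng = random.Random(seed)
    observed = sam_gs(expr, gene_set, labels, s0)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if sam_gs(expr, gene_set, shuffled, s0) > observed:
            hits += 1
    return observed, hits / n_perm


# Made-up expression values: gene -> one value per sample.
expr = {
    "g1": [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "g3": [0.5, 0.7, 0.6, 2.4, 2.6, 2.5],
}
labels = [0, 0, 0, 1, 1, 1]
observed, p_value = permutation_p_value(expr, ["g1", "g3"], labels, n_perm=200)
```

Random shuffling only approximates the full permutation distribution, so the p-value resolution is limited to 1/n_perm; this is why a larger permutation count gives finer p-values.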

Example 1: run 10000 permutations

python sam_gs.py mydata.gct mydata.cls mydata.gmt -p 10000

Example 2: set minimum and maximum set size

python sam_gs.py mydata.gct mydata.cls mydata.gmt -m 15 -M 500
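Once a run has finished, the four-column output described in the manual above can be post-processed with the standard csv module. The sketch below assumes the columns appear in the documented order and skips any row whose numeric fields do not parse (for instance a header line, if the file has one). The helper names, the 0.05 threshold and the sample rows are all invented for illustration.

```python
import csv
import io


def load_results(handle):
    """Read (name, statistic, p_value, q_value) tuples from tab-separated text."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 4:
            continue
        name, stat, p, q = fields
        try:
            rows.append((name, float(stat), float(p), float(q)))
        except ValueError:
            continue  # non-numeric row, e.g. a header line
    return rows


def significant_sets(rows, q_max=0.05):
    """Gene sets with q-value <= q_max, most significant first."""
    return sorted((r for r in rows if r[3] <= q_max), key=lambda r: r[3])


# Invented sample output, not produced by an actual run.
sample = "PATHWAY_A\t41.7\t0.0002\t0.004\nPATHWAY_B\t3.2\t0.41\t0.62\n"
rows = load_results(io.StringIO(sample))
hits = significant_sets(rows)
```

For a real output file, replace io.StringIO(sample) with open(path, newline="").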

Contact

pygsa is developed by Simone Leo at the Distributed Computing Group, CRS4. For any questions, comments or suggestions, contact the author at: simone.leocrs4.it