RNASimulASE Project Home Page

RNASimulASE - Simulator of Allele Specific RNA-seq data

Copyright © 2013 Daniel Edsgärd, Olof Emanuelsson
RNASimulASE is available free to use, under the GNU GPL version 3 license.
This product includes the software hapgen2, developed by Zhan Su et al., which is freely available for academic use only.

About
Download
Installation

Prerequisites
Installing a binary distribution
Building from source

Running RNASimulASE

Quick start
Introduction and output directories
Options
Parallelization
Annotation

Citing RNASimulASE
Contact information

About

RNASimulASE is a tool to simulate allele-specific expression (ASE) data generated by RNA-sequencing. Recently, there has been increased interest in leveraging the diploid nature of genomes in genetic research (Tewhey, et al., NatRevGen, 2011). In RNA-seq this is realized by so-called allele-specific expression, where the transcriptional level of each allele from a pair of homologous chromosomes can be distinguished at heterozygous variants.

To evaluate experimental designs, as well as the robustness and validity of emerging ASE analysis approaches, simulation of allele-specific RNA-seq data is crucial. Given a reference transcriptome, genetic variants, recombination rates, and empirical transcript expression levels, RNASimulASE generates diploid personal transcriptomes in FASTA format and RNA-seq output in FASTQ format. Additional features include base quality sampling and sequencing error simulation based on empirical data from an actual run of a sequencing machine. All input parameters have default values, which makes the program easy to use. See the paper for further information.

Download

RNASimulASE is available via SourceForge: Download SimulASE.

Installation

Prerequisites

Apart from the provided C++ binaries, RNASimulASE also uses Ruby and R scripts, so working installations of both languages are required. Ruby ships with most OS distributions. R can be downloaded from the R project website.

Installing a binary distribution

Binaries are provided for Linux (x86_64) and OS X (Intel).

Download the binaries of RNASimulASE: rnasimulase-X.X.platform.tar.gz.
If simulating human data, annotation is also available for download: annot.tar.gz
Extract the binaries and annotation:
"tar -xvzf rnasimulase-X.X.platform.tar.gz"
"tar -xvzf annot.tar.gz"
Add the binary directory to your shell PATH, by adding the following to your ~/.bashrc (Linux) or ~/.profile (OS X) shell startup file. Similarly, set an environment variable "ANNOT" to the annotation directory, within the same shell startup file:
"export PATH=/path/to/rnasimulase/bin:$PATH"
"export ANNOT=/path/to/rnasimulase/annot"
Note: Users of shells other than bash need to amend the PATH setting command accordingly in their corresponding shell startup file.
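
The result of the steps above can be checked from a new shell. The following is a minimal sketch, not part of RNASimulASE: the function name check_rnasimulase_env is hypothetical, and it assumes the pipeline entry point is called rnasimulase and that R is started with the command R.

```shell
# Hypothetical helper, not shipped with RNASimulASE: report whether the
# installation steps above are visible in the current shell.
check_rnasimulase_env() {
    rc=0
    # The pipeline entry point (assumed name) plus the two scripting prerequisites.
    for tool in rnasimulase ruby R; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool ($(command -v "$tool"))"
        else
            echo "missing: $tool (not on PATH)"
            rc=1
        fi
    done
    # ANNOT should point at the extracted annotation directory.
    if [ -n "${ANNOT:-}" ] && [ -d "$ANNOT" ]; then
        echo "found:   ANNOT=$ANNOT"
    else
        echo "missing: ANNOT is unset or not a directory"
        rc=1
    fi
    return "$rc"
}
```

Calling check_rnasimulase_env prints one line per check and returns a non-zero status if anything is missing, so it can also be used as a guard at the top of a job script.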

Building from source

The source code is distributed along with the binaries. The only additional step compared to installing the binary distribution (see the steps above) is to:

Go to the '/path/to/rnasimulase/src' directory and type:
"make"
Note: The makefile requires a gcc version that includes the ISO 2011 C++ standard library (the binaries for rnasimulase-1.0 were compiled with gcc 4.7).
To compile hapgen2 on a different platform, its developers ask that you contact them. Once compiled, put the hapgen binary into the rnasimulase binary directory ('/path/to/rnasimulase/bin').

Running RNASimulASE

Quick start

To run with default parameters, type: rnasimulase
For help, type: rnasimulase -h

Introduction and output directories

Typing rnasimulase executes the RNASimulASE pipeline, which runs the suite of subprograms in the following order: simdiplotranscriptome, simexpr, simfastqtrain and simfastq.

By default, scripts generated and executed by rnasimulase are put in a directory called "cmds" and data is written to a directory called "data". Within "data", three sub-directories are created:

"diploref", containing personal diploid transcriptomes in FASTA format for each simulated individual
"simexpr", containing simulated expression levels for each haplo-isoform (each of the two haplotypes of a diploid transcript)
"simfastq", in which the simulated reads are found in FASTQ format

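After a run, the simulated reads can be given a quick sanity check by counting the reads per FASTQ file. This is a sketch under assumptions: the file layout inside "simfastq" is not described here, so the helper simply searches for files ending in ".fastq" (an assumed extension), and both function names are hypothetical.

```shell
# Hypothetical helpers, not shipped with RNASimulASE.

# Count reads in one FASTQ file, assuming the standard four-line record
# layout (sequence and quality strings are not wrapped across lines).
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Print "<file><TAB><read count>" for every *.fastq file below a directory.
# The default path follows the output layout described above.
summarize_simfastq() {
    fastq_dir=${1:-data/simfastq}
    find "$fastq_dir" -type f -name '*.fastq' | sort | while IFS= read -r fq; do
        printf '%s\t%s\n' "$fq" "$(count_fastq_reads "$fq")"
    done
}
```

Running summarize_simfastq from the directory in which rnasimulase was started lists each simulated FASTQ file with its read count.
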
Options

All options are described in detail in the help text printed by the -h option of rnasimulase or any of its subprograms. Examples of the input file formats are provided in the annotation (annot.tar.gz). To get detailed help for any of the options, type:

rnasimulase -h
simdiplotranscriptome -h
writediploref -h
simexpr -h
simfastqtrain -h
simfastq -h

Note: For further help on hapgen2 options, visit the hapgen2 website.

Parallelization

To simulate a very large number of individuals, rnasimulase can be executed as separate tasks on separate processors. To facilitate this, the option "-r" is provided, which creates a random directory to which the output is written, so that separate tasks do not write to the same output directory.
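
On a single multi-core machine, separate tasks can be started as background jobs. The following is a minimal sketch: the function run_parallel_sims and the log file names are hypothetical, and passing "-r" without an argument is an assumption that should be checked against rnasimulase -h.

```shell
# Hypothetical helper, not shipped with RNASimulASE: start the same command
# N times in the background and wait for all of them to finish.
run_parallel_sims() {
    n_tasks=$1
    shift
    task=1
    while [ "$task" -le "$n_tasks" ]; do
        # Each task runs in the background and writes to its own log file.
        "$@" > "task_${task}.log" 2>&1 &
        task=$((task + 1))
    done
    # Block until every background task has exited.
    wait
}

# Example (assumes -r takes no argument):
#   run_parallel_sims 4 rnasimulase -r
```

On a cluster, the same idea applies with each task submitted as its own job through the local scheduler instead of a background process.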

Annotation

We provide annotation files for analysis of human data. These include the files needed to run Simdiplotranscriptome (hapgen and writing diploid transcriptomes), Simexpr and Simfastqtrain:

To run hapgen we provide files downloaded from Impute2, which we have subset to the transcriptome (*.leg, *.hap, *.map), as well as a file with the position of a single variant within each chromosome, which is a dummy needed to make hapgen2 run (chr2onepos.list).
To write diploid transcriptomes based on the hapgen output we provide a reference transcriptome from Ensembl (*.fa) and transcript-relative positions of all variants input to hapgen (*.relpos). Since the set of all transcript IDs and the set of chromosomes are fixed for a given transcriptome, we have also pregenerated these lists for speed (all.enst and chr.list).
To simulate expression values with Simexpr we provide a file with empirical expression values from a real sample: enst.counts.tab
To estimate base-quality distributions and read length with Simfastqtrain we provide a file with empirical base qualities for 2 million reads from a real run: ill.hiseq2000.phred64.readlen100.fastq
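
Whether an extracted annotation directory contains the files listed above can be checked with a short script. This is a sketch only: the function name check_annot_files is hypothetical, and because the directory layout inside annot.tar.gz is not described here, the files are searched for by name anywhere below the directory.

```shell
# Hypothetical helper, not shipped with RNASimulASE: verify that the
# annotation files named above exist somewhere below the ANNOT directory.
check_annot_files() {
    annot_dir=${1:-${ANNOT:-}}
    if [ ! -d "$annot_dir" ]; then
        echo "annotation directory not found: $annot_dir"
        return 1
    fi
    rc=0
    for name in '*.leg' '*.hap' '*.map' chr2onepos.list '*.fa' '*.relpos' \
                all.enst chr.list enst.counts.tab \
                ill.hiseq2000.phred64.readlen100.fastq; do
        # A single match is enough; the quoted pattern is expanded by find.
        if [ -n "$(find "$annot_dir" -name "$name" -print | head -n 1)" ]; then
            echo "found:   $name"
        else
            echo "missing: $name"
            rc=1
        fi
    done
    return "$rc"
}
```

With ANNOT set as described under Installation, check_annot_files can be called without arguments; it returns a non-zero status if any file is missing.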

Citing RNASimulASE

Edsgärd D. and Emanuelsson O., RNASimulASE - Simulation of Allele Specific RNA-seq Data, 2013 (Submitted)

Contact information

Please use the Discussion Forum for issues related to running the program. Other comments or questions may be e-mailed to:
Technical: Daniel Edsgärd
PI: Olof Emanuelsson