RNASimulASE - Simulator of Allele Specific RNA-seq data
Copyright © 2013 Daniel Edsgärd, Olof Emanuelsson
RNASimulASE is available free to use, under the GNU GPL version 3 license.
This product includes the software hapgen2, developed by Zhan Su et al., which is freely available for academic use only.
- About
- Download
- Installation
  - Prerequisites
  - Installing a binary distribution
  - Building from source
- Running RNASimulASE
  - Quick start
  - Introduction and output directories
  - Options
  - Annotation
  - Parallelization
- Citing RNASimulASE
- Contact information
RNASimulASE is a tool for simulating allele-specific expression (ASE)
data generated by RNA sequencing. Recently, there has been
increased interest in leveraging the diploid nature of genomes
in genetic research (Tewhey et al., NatRevGen, 2011). In RNA-seq this is
realized by so-called allele-specific expression (ASE), where
the transcriptional level of each allele from a pair of
homologous chromosomes can be distinguished at heterozygous
variants.
To evaluate experimental designs, as well as the
robustness and validity of emerging ASE analysis approaches,
simulation of allele-specific RNA-seq data is crucial. Given a
reference transcriptome, genetic variants, recombination rates, and
empirical transcript expression levels, RNASimulASE generates diploid
personal transcriptomes in FASTA format and RNA-seq output in FASTQ
format. Additional features include sampling of base qualities and
simulation of sequencing errors, both based on empirical data from an
actual run of a sequencing machine. All input parameters have default
values, so the program can be run without any configuration. See the
paper for further information.
RNASimulASE is available via Sourceforge:
Download SimulASE.
Apart from the provided C++ binaries, RNASimulASE also makes use of
Ruby and R scripts. Working installations of both languages are
therefore required. Ruby is shipped as part of most OS distributions. R can be downloaded from the R project website.
Binaries are provided for Linux (x86_64) and OS X (Intel).
1. Download the binaries of RNASimulASE: rnasimulase-X.X.platform.tar.gz.
   If simulating human data, annotation is also available for download: annot.tar.gz
2. Extract the binaries and annotation:
   "tar -xvzf rnasimulase-X.X.platform.tar.gz"
   "tar -xvzf annot.tar.gz"
3. Add the binary directory to your shell PATH by adding the
   following to your ~/.bashrc (Linux) or ~/.profile (OS X)
   shell startup file. Similarly, set the environment variable "ANNOT" to the
   annotation directory within the same shell startup file:
   "export PATH=/path/to/rnasimulase/bin:$PATH"
   "export ANNOT=/path/to/rnasimulase/annot"
Note: Users of shells other than bash need to amend the PATH setting
command accordingly in their corresponding shell startup file.
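After editing the startup file and opening a new shell, the setup described above can be checked with a short script. The following is only a sketch: the function names are made up for illustration, and it assumes that the R scripts are run through the Rscript command, which this page does not state.

```shell
#!/bin/sh
# Sanity check for an RNASimulASE installation (illustrative sketch).

# Succeeds if the named command can be found through PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if ANNOT is set and names an existing directory.
annot_ok() {
    [ -n "${ANNOT:-}" ] && [ -d "$ANNOT" ]
}

# Print one status line per requirement; always returns 0.
report_install() {
    # Assumption: the R scripts are run through Rscript.
    for cmd in rnasimulase ruby Rscript; do
        if have_cmd "$cmd"; then
            echo "ok:      $cmd"
        else
            echo "missing: $cmd (not found on PATH)"
        fi
    done
    if annot_ok; then
        echo "ok:      ANNOT=$ANNOT"
    else
        echo "missing: ANNOT is unset or not a directory"
    fi
}

report_install
```

A "missing" line for rnasimulase or ANNOT usually means the startup file has not been re-read yet; opening a new terminal or sourcing the file should fix that.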
The source code is distributed along with the binaries. The only
additional steps compared with installing the binary
distribution (see 1-3 above) are:
- Go to the '/path/to/rnasimulase/src' directory and type:
"make"
Note: The makefile requires a gcc version that includes the
ISO 2011 C++ standard library (the binaries for rnasimulase-1.0 were
compiled with gcc 4.7).
- To compile hapgen2 on a different platform, its developers ask
that you contact
them. Once compiled, put the hapgen2 binary into the rnasimulase
binary directory ('/path/to/rnasimulase/bin').
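Before running make, it can help to confirm which gcc is on the PATH. The sketch below compares the installed version against 4.7. Treating 4.7 as the minimum is an assumption based on the note above, which only says that the 1.0 binaries were compiled with that version. The helper names are hypothetical.

```shell
#!/bin/sh
# Compare the installed gcc version against 4.7 (illustrative sketch).

# Print the MAJOR component of a version string such as 4.7.2 or 11.
major_of() {
    echo "${1%%.*}"
}

# Print the MINOR component, or 0 if the version has no minor part.
minor_of() {
    case $1 in
        *.*) rest=${1#*.}; echo "${rest%%.*}" ;;
        *)   echo 0 ;;
    esac
}

# Succeeds if version $1 is at least version $2 (MAJOR.MINOR only).
version_ge() {
    have_major=$(major_of "$1"); have_minor=$(minor_of "$1")
    want_major=$(major_of "$2"); want_minor=$(minor_of "$2")
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] &&
          [ "$have_minor" -ge "$want_minor" ]; }
}

if command -v gcc >/dev/null 2>&1; then
    gcc_version=$(gcc -dumpversion)
    if version_ge "$gcc_version" 4.7; then
        echo "gcc $gcc_version: at least 4.7"
    else
        echo "gcc $gcc_version: older than 4.7"
    fi
else
    echo "gcc not found on PATH"
fi
```

On OS X the gcc command may be a front end to a different compiler, in which case the reported version number says little about C++11 library support.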
To run with default parameters, type: rnasimulase
For help, type: rnasimulase -h
Typing rnasimulase runs the RNASimulASE pipeline, which executes its subprograms in the following order: simdiplotranscriptome, simexpr, simfastqtrain and simfastq.
By default, scripts generated and executed by rnasimulase are put in a directory called "cmds" and data is written to a directory called "data". Within "data", three sub-directories are created:
- "diploref", containing personal diploid transcriptomes in FASTA format for each simulated individual
- "simexpr", containing simulated expression levels for each haplo-isoform (each of the two haplotypes of a diploid transcript)
- "simfastq", in which the simulated reads are found in FASTQ format
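A quick way to inspect a finished run is to count the reads in each file under "data/simfastq". The sketch below assumes that every file in that directory is an uncompressed FASTQ file with exactly four lines per record. Neither the file names nor that layout are specified on this page, so adjust as needed. The function names are hypothetical.

```shell
#!/bin/sh
# Count simulated reads per file (illustrative sketch).

# Print the number of reads in a FASTQ file, assuming four lines per record.
count_fastq_reads() {
    lines=$(wc -l < "$1" | tr -d ' ')
    echo $((lines / 4))
}

# Print "file<TAB>reads" for every regular file in a directory.
summarize_fastq_dir() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        printf '%s\t%s\n' "$f" "$(count_fastq_reads "$f")"
    done
}

# Default output location; prints nothing if the directory is absent.
summarize_fastq_dir data/simfastq
```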
All options are described in detail in the help text printed by the -h option of rnasimulase and of each of its subprograms. Examples of the input file formats are provided in the annotation (annot.tar.gz). To get detailed help, type any of:
- rnasimulase -h
- simdiplotranscriptome -h
- writediploref -h
- simexpr -h
- simfastqtrain -h
- simfastq -h
Note: For further help on hapgen2 options, see the hapgen2 website.
To simulate a very large number of individuals, rnasimulase can be executed as separate tasks on separate processors. To facilitate this, the option "-r" is provided, which creates a randomly named directory to which the output is written, so that concurrent tasks do not write to the same location.
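On a single multi-core machine, separate tasks can be started from the shell as background jobs. The helper below is a generic sketch, not part of RNASimulASE: run_parallel is a hypothetical name, and the usage line assumes that "-r" takes no argument, which should be confirmed with rnasimulase -h.

```shell
#!/bin/sh
# Start the same command N times in the background and wait for all copies.
# Usage: run_parallel N command [args...]
run_parallel() {
    n=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" &                 # launch one task in the background
        pids="$pids $!"        # remember its process id
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed of $n tasks failed"
    [ "$failed" -eq 0 ]
}

# Example (assumes -r takes no argument):
#   run_parallel 4 rnasimulase -r
```

On a cluster, the same idea applies with one job submission per task instead of one background process per task.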
We provide annotation files for simulation of human data. These include the files needed to run Simdiplotranscriptome (hapgen2 and writing diploid transcriptomes), Simexpr and Simfastqtrain:
- To run hapgen2 we provide files downloaded from Impute2, which we have subsetted to the transcriptome (*.leg, *.hap, *.map), as well as a file with the position of a single variant within each chromosome, which is a dummy needed to make hapgen2 run (chr2onepos.list).
- To write diploid transcriptomes based on the hapgen2 output we provide a reference transcriptome from Ensembl (*.fa) and transcript-relative positions of all variants input to hapgen2 (*.relpos). Since the set of all transcript ids and chromosomes is the same for a given transcriptome, we have also pregenerated these lists to save time (all.enst and chr.list).
- To simulate expression values with Simexpr we provide a file with empirical expression values from a real sample: enst.counts.tab
- To estimate base-quality distributions and read-length with Simfastqtrain we provide a file with empirical base-qualities for 2 million reads from a real run: ill.hiseq2000.phred64.readlen100.fastq
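Judging from its name, the base-quality file uses Phred+64 quality encoding and 100-base reads. As a rough illustration of what that encoding means, the sketch below converts a single quality character to its Phred score by subtracting 64 from its ASCII code, and reports the length of the first read in a FASTQ file. Both function names are hypothetical, and the second assumes four-line FASTQ records.

```shell
#!/bin/sh
# Helpers for inspecting a Phred+64 FASTQ file (illustrative sketch).

# Print the Phred score encoded by one Phred+64 quality character.
phred64_score() {
    code=$(printf '%d' "'$1")   # ASCII code of the character
    echo $((code - 64))
}

# Print the length of the first read (line 2 of the first record).
first_read_length() {
    seq=$(sed -n '2p' "$1")
    echo "${#seq}"
}

phred64_score h    # prints 40
phred64_score B    # prints 2
```

To check the provided file, call first_read_length with the path to ill.hiseq2000.phred64.readlen100.fastq inside the extracted annotation directory.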
Edsgärd D. and Emanuelsson O., RNASimulASE - Simulation of Allele Specific RNA-seq Data, 2013 (Submitted)
Please use the Discussion Forum for issues related to running the program. Other comments or questions may be e-mailed to:
Technical: Daniel Edsgärd
PI: Olof Emanuelsson