ABSTRACT
Numerous RNA-seq data analysis pipelines are available, and research has shown that the choice of pipeline influences both the detection of differentially expressed genes and the estimation of gene expression. Gene expression estimation is a key step in RNA-seq data analysis because the accuracy of the estimates profoundly affects all subsequent analysis. Gene expression estimation generally involves sequence alignment followed by quantification, so accurate estimation requires accurate alignment. However, the impact of aligners on gene expression estimation remains unclear. We address this gap by constructing nine pipelines, each pairing one of nine spliced aligners with the same quantifier, and we use simulated data to investigate the impact of the aligners on gene expression estimation. To evaluate alignment, we introduce three alignment performance metrics: (1) the percentage of reads aligned, (2) the percentage of reads aligned with zero mismatches (ZeroMismatchPercentage), and (3) the percentage of reads aligned with at most one mismatch (ZeroOneMismatchPercentage). We then evaluate the impact of alignment performance on gene expression estimation using three metrics: (1) gene detection accuracy, (2) the number of genes falsely quantified (FalseExpNum), and (3) the number of genes with falsely estimated fold changes (FalseFcNum). We found that, across the pipelines, FalseExpNum and FalseFcNum are correlated. Moreover, FalseExpNum is linearly correlated with the percentage of reads aligned and with ZeroMismatchPercentage, and FalseFcNum is linearly correlated with ZeroMismatchPercentage. Because of these correlations, the percentage of reads aligned and ZeroMismatchPercentage may be used to assess the performance of gene expression estimation for all RNA-seq datasets.
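The three alignment metrics and the linear-correlation check described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the denominators, so the sketch assumes all three percentages are taken over the total number of reads. The function names, the `AlignedPercentage` key, and the input layout (one mismatch count per read, `None` for an unaligned read) are hypothetical.

```python
from math import sqrt
from typing import Dict, Optional, Sequence


def alignment_metrics(mismatches: Sequence[Optional[int]]) -> Dict[str, float]:
    """Summarise one aligner's output on a set of reads.

    `mismatches` holds one entry per read: the mismatch count of its
    alignment, or None if the read was not aligned.
    Assumption: every percentage uses the total read count as denominator.
    """
    total = len(mismatches)
    if total == 0:
        raise ValueError("at least one read is required")
    aligned = [m for m in mismatches if m is not None]
    zero = sum(1 for m in aligned if m == 0)
    zero_or_one = sum(1 for m in aligned if m <= 1)
    return {
        "AlignedPercentage": 100.0 * len(aligned) / total,  # hypothetical name
        "ZeroMismatchPercentage": 100.0 * zero / total,
        "ZeroOneMismatchPercentage": 100.0 * zero_or_one / total,
    }


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

Under these assumptions, one would compute `alignment_metrics` once per pipeline and then pass, for example, the nine ZeroMismatchPercentage values and the nine FalseExpNum values to `pearson_r` to measure how strongly the two are linearly related.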
Index Terms
- The impact of RNA-seq aligners on gene expression estimation