research-article

The impact of RNA-seq aligners on gene expression estimation

Authors:
Cheng Yang, Emory University and Peking University, Atlanta, GA
Po-Yen Wu, Georgia Institute of Technology, Atlanta, GA
Li Tong, Georgia Institute of Technology and Emory University, Atlanta, GA
John Phan, Georgia Institute of Technology and Emory University, Atlanta, GA
May Wang, Georgia Institute of Technology and Emory University, Atlanta, GA

BCB '15: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, September 2015, Pages 462–471. https://doi.org/10.1145/2808719.2808767

Published: 09 September 2015

ABSTRACT

While numerous RNA-seq data analysis pipelines are available, research has shown that the choice of pipeline influences the results of differentially expressed gene detection and gene expression estimation. Gene expression estimation is a key step in RNA-seq data analysis because the accuracy of the estimates profoundly affects all subsequent analysis. Gene expression estimation generally involves two steps, sequence alignment and quantification, so accurate estimation requires accurate alignment. However, the impact of aligners on gene expression estimation remains unclear. We address this gap by constructing nine pipelines, each pairing one of nine spliced aligners with a single common quantifier. We then use simulated data to investigate the impact of aligners on gene expression estimation. To evaluate alignment, we introduce three alignment performance metrics: (1) the percentage of reads aligned, (2) the percentage of reads aligned with zero mismatches (ZeroMismatchPercentage), and (3) the percentage of reads aligned with at most one mismatch (ZeroOneMismatchPercentage). We then evaluate the impact of alignment performance on gene expression estimation using three metrics: (1) gene detection accuracy, (2) the number of genes falsely quantified (FalseExpNum), and (3) the number of genes with falsely estimated fold changes (FalseFcNum). We find that, across pipelines, FalseExpNum and FalseFcNum are correlated. Moreover, FalseExpNum is linearly correlated with the percentage of reads aligned and with ZeroMismatchPercentage, and FalseFcNum is linearly correlated with ZeroMismatchPercentage. Because of these correlations, the percentage of reads aligned and ZeroMismatchPercentage may be used to assess the performance of gene expression estimation for all RNA-seq datasets.
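The three alignment metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each read is summarized by its mismatch count (for example, taken from an aligner's output), with None marking an unaligned read, and it assumes all three percentages use the total number of reads as the denominator, which the abstract does not specify. The function names alignment_metrics and pearson, the dictionary key AlignedPercentage, and all numbers in the usage section are hypothetical. The pearson helper shows one plausible way to check for a linear correlation between an alignment metric and a metric such as FalseExpNum across pipelines.

```python
from math import sqrt
from typing import Dict, Optional, Sequence


def alignment_metrics(mismatches: Sequence[Optional[int]]) -> Dict[str, float]:
    """Compute the three alignment metrics from per-read mismatch counts.

    mismatches[i] is the mismatch count of read i, or None if the read
    was not aligned. ASSUMPTION: every percentage is taken over all reads.
    """
    total = len(mismatches)
    if total == 0:
        raise ValueError("at least one read is required")
    aligned = [m for m in mismatches if m is not None]
    zero = sum(1 for m in aligned if m == 0)
    zero_or_one = sum(1 for m in aligned if m <= 1)
    return {
        "AlignedPercentage": 100.0 * len(aligned) / total,
        "ZeroMismatchPercentage": 100.0 * zero / total,
        "ZeroOneMismatchPercentage": 100.0 * zero_or_one / total,
    }


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up reads: mismatch counts, with None for unaligned reads.
    reads = [0, 0, 1, 2, None, 0, 1, None, 3, 0]
    print(alignment_metrics(reads))

    # Made-up per-pipeline values, for illustration only.
    zero_mismatch_pct = [62.0, 70.5, 75.1, 81.3]
    false_exp_num = [410.0, 350.0, 300.0, 240.0]
    print(pearson(zero_mismatch_pct, false_exp_num))
```

In practice the per-read mismatch counts would come from the alignment files produced by each pipeline; how they are extracted depends on the aligner and is left out of this sketch.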

Index Terms

1. Applied computing
  1. Life and medical sciences

Published in

BCB '15: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics
September 2015
683 pages
ISBN:9781450338530
DOI:10.1145/2808719

Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
RNA-seq
alignment
gene expression estimation
quantification
Qualifiers
- research-article
Acceptance Rates
BCB '15 Paper Acceptance Rate: 48 of 141 submissions, 34%
Overall Acceptance Rate: 254 of 885 submissions, 29%