research-article

CNVScan: detecting borderline copy number variations in NGS data via scan statistics

Authors:
E. Bergamini, Karlsruhe Institute of Technology, Karlsruhe, Germany
R. D'Aurizio, IIT-CNR and IFC-CNR LISM - Laboratory for Integrative Systems Medicine, Pisa, Italy
M. Leoncini, Università di Modena e Reggio Emilia, Modena, Italy
M. Pellegrini, IIT-CNR and IFC-CNR LISM - Laboratory for Integrative Systems Medicine, Pisa, Italy

BCB '15: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, September 2015, Pages 335–344. https://doi.org/10.1145/2808719.2808754

Published: 09 September 2015

ABSTRACT

Background. Next Generation Sequencing (NGS) data have been extensively exploited in the last decade to analyse genome variations and to understand their role in complex diseases. Copy number variations (CNVs) are genomic structural variants estimated to account for about 1.2% of the total variation in humans. CNVs in coding or regulatory regions may affect gene expression, often also at a functional level, and contribute to diseases such as cancer, autism and cardiovascular disease. Computational methods that detect CNVs from NGS data based on depth of coverage are limited to medium/large events and are heavily influenced by the level of coverage.
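To illustrate why depth-of-coverage methods can struggle with small events at low coverage, the following is a minimal sketch, not taken from the paper. It assumes reads are counted by start position in fixed windows, that the event lies entirely inside one window, and that read counts are roughly Poisson distributed. The function names and the numbers in the usage note (5x coverage, 50 bp reads, 1000 bp windows) are hypothetical and chosen only for illustration.

```python
import math


def expected_reads(coverage, window_bp, read_len):
    """Expected number of reads starting in a window at a given mean coverage."""
    return coverage * window_bp / read_len


def expected_window_ratio(window_bp, event_bp, copies, ploidy=2):
    """Expected sample/reference read-count ratio for a window containing the event.

    Only the fraction of the window covered by the event changes its copy
    number, so a short event is diluted by the unaffected rest of the window.
    """
    frac = min(event_bp, window_bp) / window_bp
    return (1 - frac) + frac * copies / ploidy


def shift_in_poisson_sds(coverage, window_bp, read_len, event_bp, copies):
    """Size of the expected count shift, in units of the Poisson standard deviation."""
    mean = expected_reads(coverage, window_bp, read_len)
    ratio = expected_window_ratio(window_bp, event_bp, copies)
    return abs(mean - mean * ratio) / math.sqrt(mean)
```

Under these assumptions, `shift_in_poisson_sds(5, 1000, 50, 300, 1)` describes a 300 bp heterozygous deletion inside a 1000 bp window: the expected count drops from 100 to 85 reads, a shift of about 1.5 standard deviations, which is hard to separate from sampling noise.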

Result. In this paper we propose CNVScan, a CNV detection method based on scan statistics that overcomes limitations of previous read count (RC) based approaches, mainly by being window-less. Scan statistics have been used before mainly in epidemiology and ecology studies but, to the best of our knowledge, have never been applied to the CNV detection problem. Since we avoid windowing, we do not have to choose an optimal window size, which is a key step in many previous approaches. Extensive simulated experiments with single read data in extreme situations (low coverage, short reads, homo/heterozygosity) show that this approach is very effective for a range of small CNVs (200-500 bp) for which previous state-of-the-art methods are not suitable.
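The abstract does not give the details of the CNVScan algorithm, so the following is only a generic sketch of the idea behind a one-dimensional scan statistic, not the authors' method. It scores every interval of consecutive bins (up to a maximum length) with a Kulldorff-style Poisson log-likelihood ratio instead of committing to one window size. The names `poisson_llr` and `scan_intervals`, the use of small pre-binned read counts, the uniform null model, and the `max_len` cap are all assumptions made for illustration.

```python
import math


def poisson_llr(c, e, total):
    """Poisson log-likelihood ratio for one interval.

    c: reads observed inside the interval
    e: reads expected inside the interval under the null model
    total: reads observed over the whole region
    """
    if e <= 0 or e >= total:
        return 0.0

    def xlogy(x, y):
        # Convention: 0 * log(0) = 0.
        return x * math.log(y) if x > 0 else 0.0

    return xlogy(c, c / e) + xlogy(total - c, (total - c) / (total - e))


def scan_intervals(counts, max_len):
    """Return (llr, start, end) for the best-scoring interval [start, end) of bins.

    The null model (an assumption) spreads reads uniformly over the bins.
    Both depleted and enriched intervals receive a positive score.
    """
    n = len(counts)
    prefix = [0]
    for x in counts:
        prefix.append(prefix[-1] + x)
    total = prefix[-1]

    best = (0.0, 0, 0)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            c = prefix[j] - prefix[i]
            e = total * (j - i) / n
            llr = poisson_llr(c, e, total)
            if llr > best[0]:
                best = (llr, i, j)
    return best


# Toy input: 20 bins at 10 reads each, with bins 8..11 halved to mimic
# a heterozygous deletion.
counts = [10] * 20
for k in range(8, 12):
    counts[k] = 5
best_llr, start, end = scan_intervals(counts, max_len=10)
```

On this toy input the highest-scoring interval is bins 8 to 11, i.e. `start == 8` and `end == 12`. A real analysis would also need a significance estimate for the maximum score, for example by Monte Carlo simulation under the null, and corrections such as GC content and mappability, none of which are shown here.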

Conclusion. The scan statistics technique is applied and adapted in this paper for the first time to the CNV detection problem. Comparison with state-of-the-art methods shows that the approach is quite effective in discovering short CNVs in rather extreme situations in which previous methods fail or have degraded performance. CNVScan thus extends the range of CNV sizes and types that can be detected via read count with single read data.

Index Terms

1. Applied computing
  1. Life and medical sciences

Published in

BCB '15: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics
September 2015
683 pages
ISBN:9781450338530
DOI:10.1145/2808719

Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
computational biology
copy number variation
next generation sequencing
Acceptance Rates
BCB '15 Paper Acceptance Rate: 48 of 141 submissions, 34%. Overall Acceptance Rate: 254 of 885 submissions, 29%.