ABSTRACT
Machine learning (ML) is becoming a commodity. Numerous ML frameworks and services are available to data holders who are not ML experts but want to train predictive models on their data. It is important that ML models trained on sensitive inputs (e.g., personal images or documents) not leak too much information about the training data.
We consider a malicious ML provider who supplies model-training code to the data holder, does not observe the training, but then obtains white- or black-box access to the resulting model. In this setting, we design and implement practical algorithms, some of them very similar to standard ML techniques such as regularization and data augmentation, that "memorize" information about the training dataset in the model, yet the model is as accurate and predictive as a conventionally trained model. We then explain how the adversary can extract memorized information from the model. We evaluate our techniques on standard ML tasks for image classification (CIFAR10), face recognition (LFW and FaceScrub), and text analysis (20 Newsgroups and IMDB). In all cases, we show how our algorithms create models that have high predictive power yet allow accurate extraction of subsets of their training data.
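To make the white-box setting concrete, the sketch below illustrates the simplest flavor of such memorization with least-significant-bit encoding: secret bytes are written into the low-order mantissa bits of a model's float64 parameters, perturbing each weight by roughly one unit in the last place, so predictions are numerically unchanged while an adversary with white-box access reads the secret straight back out of the bit patterns. This is a minimal NumPy illustration of the idea, not the paper's implementation; the parameter vector, the helper names, and the secret string are all hypothetical stand-ins.

```python
import struct
import numpy as np

def f64_to_bits(x):
    """Reinterpret a float64 as its 64-bit integer pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_f64(u):
    """Reinterpret a 64-bit integer pattern as a float64."""
    return struct.unpack("<d", struct.pack("<Q", u))[0]

def embed(params, message):
    """Hide one message bit in the least-significant mantissa bit of
    each float64 parameter (white-box LSB encoding)."""
    bits = [(byte >> j) & 1 for byte in message for j in range(8)]
    assert len(bits) <= params.size, "model too small for this secret"
    out = params.copy()
    for i, b in enumerate(bits):
        out.flat[i] = bits_to_f64((f64_to_bits(out.flat[i]) & ~1) | b)
    return out

def extract(params, n_bytes):
    """Adversary with white-box access reads the secret back."""
    bits = [f64_to_bits(params.flat[i]) & 1 for i in range(8 * n_bytes)]
    return bytes(
        sum(bits[8 * k + j] << j for j in range(8)) for k in range(n_bytes)
    )

rng = np.random.default_rng(0)
weights = rng.normal(size=256)        # stand-in for trained parameters
secret = b"patient #4711"             # hypothetical training-data fragment
stego = embed(weights, secret)

print(extract(stego, len(secret)))      # recovers b'patient #4711'
print(np.max(np.abs(stego - weights)))  # ~1e-15: predictions unchanged
```

Flipping only the lowest mantissa bit changes each parameter by at most a relative error of 2^-52, which is far below the noise floor of training, so the doctored model's accuracy is indistinguishable from the honest one's; the paper's other techniques (e.g., malicious regularization and data augmentation) trade this exactness for survival under black-box access and lower-precision parameters.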
Index Terms
- Machine Learning Models that Remember Too Much