All You Need is "Love": Evading Hate Speech Detection

ABSTRACT
With the spread of social networks and their unfortunate use for hate speech, automatic detection of the latter has become a pressing problem. In this paper, we reproduce seven state-of-the-art hate speech detection models from prior work, and show that they perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries, or add innocuous words to the original hate speech. A combination of these methods is also effective against Google Perspective, a cutting-edge solution from industry. Our experiments demonstrate that adversarial training does not completely mitigate the attacks, and that using character-level features makes the models systematically more attack-resistant than using word-level features.
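The three attack classes described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's actual attack implementation: the function names and the choice of "love" as a benign word (echoing the paper's title) are illustrative only.

```python
import random

def insert_typo(word, rng=random.Random(0)):
    """Simulate a typo by swapping two adjacent characters.

    Word-level models that rely on an exact vocabulary lookup
    typically map the misspelled token to an unknown-word embedding.
    """
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def remove_word_boundaries(text):
    """Join all words together, defeating whitespace tokenization."""
    return text.replace(" ", "")

def add_innocuous_words(text, benign_words=("love",)):
    """Append benign words to dilute the classifier's toxicity score."""
    return text + " " + " ".join(benign_words)

# Example: applying the boundary and benign-word attacks to a sentence.
sample = "some offensive sentence"
print(remove_word_boundaries(sample))   # "someoffensivesentence"
print(add_innocuous_words(sample))      # "some offensive sentence love"
```

Character-level models are harder to evade with these perturbations because the attacked text still shares most character n-grams with the original, which matches the abstract's finding that character-level features are systematically more attack-resistant.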