ABSTRACT
The automatic extraction of information from Cyber Threat Intelligence (CTI) reports is crucial in risk management. The increasing frequency with which these reports are published has led researchers to develop new systems for automatically recovering different types of entities and relations from textual data. Most state-of-the-art models leverage Natural Language Processing (NLP) techniques, which perform well when extracting a few types of entities at a time but cannot detect heterogeneous data or their relations. Furthermore, several paradigms, such as STIX, have become de facto standards in the CTI community and dictate a formal categorization of entities and relations so that organizations can share data consistently.
This paper presents STIXnet, the first solution for the automated extraction of all STIX entities and relationships in CTI reports. Using NLP techniques and an interactive Knowledge Base (KB) of entities, our approach obtains F1 scores comparable to those of state-of-the-art models for entity extraction (0.916) and relation extraction (0.724) while considering significantly more types of entities and relations. Moreover, STIXnet constitutes a modular and extensible framework that manages and coordinates different modules to merge their contributions uniquely and exhaustively. With our approach, researchers and organizations can extend their Information Extraction (IE) capabilities by integrating the efforts of several techniques without needing to develop new tools from scratch.
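To make the idea of coordinating several extraction modules more concrete, the following Python sketch shows one possible way to merge their outputs so that every text span receives at most one label. It is an illustration under stated assumptions, not STIXnet's actual implementation: the `Entity` structure, the three toy modules, the confidence scores, and the overlap rule (higher score first, then longer span) are all hypothetical. Only the type names `intrusion-set`, `malware`, `threat-actor`, and `ipv4-addr` come from the STIX vocabulary.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A labelled text span (hypothetical structure, not from the paper)."""
    start: int       # inclusive character offset
    end: int         # exclusive character offset
    stix_type: str   # STIX object type, e.g. "malware"
    text: str
    score: float     # confidence assigned by the producing module (assumed)


def kb_module(report):
    """Toy knowledge-base module: exact lookup in a tiny, made-up dictionary."""
    kb = {"APT28": "intrusion-set", "X-Agent": "malware"}
    found = []
    for name, stix_type in kb.items():
        pos = report.find(name)
        while pos != -1:
            found.append(Entity(pos, pos + len(name), stix_type, name, 1.0))
            pos = report.find(name, pos + 1)
    return found


IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
APT_NAME = re.compile(r"\bAPT\d+\b")


def regex_module(report):
    """Toy pattern module: IPv4-looking strings (octet ranges are not checked)."""
    return [Entity(m.start(), m.end(), "ipv4-addr", m.group(), 0.9)
            for m in IPV4.finditer(report)]


def heuristic_module(report):
    """Toy low-confidence module that competes with the KB on the same spans."""
    return [Entity(m.start(), m.end(), "threat-actor", m.group(), 0.6)
            for m in APT_NAME.finditer(report)]


def merge(candidates):
    """Keep non-overlapping spans, preferring higher score, then longer span."""
    ranked = sorted(candidates,
                    key=lambda e: (-e.score, -(e.end - e.start), e.start))
    kept = []
    for cand in ranked:
        # Accept a candidate only if it does not overlap an accepted span.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda e: e.start)


def extract(report, modules):
    """Run every module on the report and merge their contributions."""
    candidates = [entity for module in modules for entity in module(report)]
    return merge(candidates)


if __name__ == "__main__":
    sample = "APT28 deployed X-Agent and contacted 10.0.0.5."
    for entity in extract(sample, [kb_module, regex_module, heuristic_module]):
        print(entity.stix_type, repr(entity.text))
```

In this sketch, a new extraction technique is added by appending another function to the module list. The merge step then resolves conflicts, such as the two competing labels for "APT28", without any change to the existing modules.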
Index Terms
- STIXnet: A Novel and Modular Solution for Extracting All STIX Objects in CTI Reports