DexBERT: Effective, Task-Agnostic and Fine-grained Representation Learning of Android Bytecode
by
Tiezhu Sun
2023
Abstract
The automation of a large number of software engineering tasks is becoming
possible thanks to Machine Learning (ML). Central to applying ML to software
artifacts (like source or executable code) is converting them into forms
suitable for learning. Traditionally, researchers have relied on manually
selected features, based on expert knowledge which is sometimes imprecise and
generally incomplete. Representation learning has allowed ML to automatically
choose suitable representations and relevant features. Yet, for Android-related
tasks, existing models either focus on the whole-app level, like apk2vec, or
target a specific task, like smali2vec, which limits their applicability. Our
work is part of a new line of research that investigates effective,
task-agnostic, and fine-grained universal representations of bytecode to
mitigate both limitations. Such representations aim to capture information
relevant to various low-level downstream tasks (e.g., at the class level). We
are inspired
by the field of Natural Language Processing, where the problem of universal
representation was addressed by building Universal Language Models, such as
BERT, whose goal is to capture abstract semantic information about sentences,
in a way that is reusable for a variety of tasks. We propose DexBERT, a
BERT-like Language Model dedicated to representing chunks of DEX bytecode, the
main binary format used in Android applications. We empirically assess whether
DexBERT is able to model the DEX language and evaluate the suitability of our
model in three distinct class-level software engineering tasks: Malicious Code
Localization, Defect Prediction, and Component Type Classification. We also
experiment with strategies for handling apps of vastly different sizes, and we
demonstrate one example of using our technique to investigate what information
is relevant to a given task.
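
To make the idea of representing chunks of bytecode and handling inputs of very different sizes concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of the paper: the abstract does not specify how chunks are formed or combined, so the fixed chunk length, the mean pooling, and the `toy_embed` function (a hypothetical stand-in for a learned encoder such as DexBERT) are all assumptions made for this example.

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence[str], max_len: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors element-wise into one fixed-size vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_embed(chunk: Sequence[str], dim: int = 4) -> List[float]:
    """Hypothetical placeholder encoder (NOT DexBERT): counts tokens by length bucket."""
    vec = [0.0] * dim
    for tok in chunk:
        vec[len(tok) % dim] += 1.0
    return vec


def embed_class(tokens: Sequence[str], max_len: int = 512, dim: int = 4) -> List[float]:
    """Embed each chunk, then pool so every class gets a vector of the same size."""
    chunks = split_into_chunks(tokens, max_len)
    return mean_pool([toy_embed(chunk, dim) for chunk in chunks])


if __name__ == "__main__":
    small_class = ["invoke-virtual", "return-void"]
    large_class = ["const-string", "invoke-static", "move-result"] * 700
    # Both vectors have length 4 regardless of how many tokens the class has.
    print(len(embed_class(small_class)), len(embed_class(large_class)))
```

Whatever the number of tokens, the pooled vector has the same dimension, which is what lets a single downstream classifier consume classes of very different sizes. An empty token list raises a `ValueError` in this sketch, since there is nothing to pool.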
arXiv: 2212.05976v2