Paper Title
A3Ident: A Two-phased Approach to Identify the Leading Authors of Android Apps
Paper Authors
Paper Abstract
Authorship identification is the process of identifying and classifying the authors of a given piece of code. It applies to a wide range of software problems, e.g., code authorship disputes, plagiarism detection, and exposing attackers' identities. Beyond the inherent challenges inherited from legacy software development, framework programming and the crowdsourcing mode in Android make authorship identification significantly harder. More specifically, widespread third-party libraries and inherited components (e.g., classes, methods, and variables) dilute the primary code within an Android app and blur the boundaries between code written by different authors. Prior research has not addressed these challenges well. To this end, we design a two-phased approach that attributes the primary code of an Android app to a specific developer. In the first phase, we put forward three types of strategies to identify the relationships between Java packages in an app: context, semantic, and structural relationships. A package aggregation algorithm then clusters all packages that were, with high probability, written by the same authors. In the second phase, we develop three types of features to capture authors' coding habits and code stylometry. Based on these features, we generate a fingerprint for each author from the Android apps they developed and employ several machine learning algorithms for authorship classification. We evaluate our approach on three datasets that contain 15,666 apps from 257 distinct developers and achieve a 92.5% accuracy rate on average. Additionally, we test it on 2,900 obfuscated apps, which our approach classifies with an accuracy rate of 80.4%.
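The first-phase idea of aggregating related packages can be illustrated with a minimal sketch. The abstract does not specify how the context, semantic, and structural relationships are computed or combined, so everything below is an assumption for illustration: `aggregate_packages`, its `threshold` parameter, and the placeholder `shared_prefix_score` function are hypothetical names, and a simple union-find merge stands in for the paper's package aggregation algorithm.

```python
from itertools import combinations


def aggregate_packages(packages, score, threshold=0.5):
    """Group packages whose pairwise relationship score reaches the threshold.

    `score` is a caller-supplied function (hypothetical); a real system would
    combine context, semantic, and structural evidence here.
    """
    parent = {p: p for p in packages}

    def find(p):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(packages, 2):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for p in packages:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


def shared_prefix_score(a, b):
    """Placeholder score (assumption): fraction of leading name segments in common."""
    sa, sb = a.split("."), b.split(".")
    common = 0
    for x, y in zip(sa, sb):
        if x != y:
            break
        common += 1
    return common / max(len(sa), len(sb))


if __name__ == "__main__":
    pkgs = ["com.example.app.ui", "com.example.app.net",
            "com.google.ads", "org.apache.http"]
    for cluster in aggregate_packages(pkgs, shared_prefix_score):
        print(cluster)
```

In this toy run the two `com.example.app` packages end up in one cluster while the library packages stay separate. The clusters judged to be primary code would then feed the second phase, where stylometric features are extracted and passed to a classifier.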