Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis
Università degli Studi del Sannio
Department of Engineering
Master's Degree Course in Computer Engineering
Master Thesis in Network and Software Systems Security

Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis

Supervisor: Prof. Corrado Aaron Visaggio
Co-Supervisor: Ing. Francesco Mercaldo
Author: Antonio Pirozzi (mat. 399/37)
Academic Year 2013/2014

Dedicated to my Family. Dedicated to my parents Angelo and Orsola, for all that I received from them, for their daily sacrifices, for my entire life. Dedicated to my aunt Loredana, always present in my life, a guardian angel for the whole family; thanks for your support. Dedicated to my uncle Vincenzo, my real older brother, undisputed Master of Jazz. Dedicated to Luciana, a special aunt. Dedicated to Joshua, the joy of the family. Dedicated to my grandparents, Antonio, Maddalena, Italo and Maria: unfortunately they are no longer with us; they would be very proud of me. Dedicated to my grandmother Maria, my other mom; I will carry her forever in my heart, and I know she is always beside me. Dedicated to my love Alessia, the best girl I could ever have found; thanks for making my life better. Dedicated to Steve Jobs and Nikola Tesla, the greatest revolutionaries: they made this world a better place.

Acknowledgements

First and foremost, a special thanks to Prof. Corrado Aaron Visaggio for his help and for the trust he has always placed in me. Thanks also to Ing. Francesco Mercaldo for his continued motivation and collaboration. A special thanks to a special person, Tito. A special thanks to my best friend Marco: he is always there. A special thanks to my friend Angelo; we started this adventure together, all those days spent in "Labis", believing in our ideas. A special thanks to Nicola: he gave me a place in his company, he has believed in me from day one, and he taught me, day by day, the qualities a good engineer must have. A special thanks also to my colleagues Sonia Laudanna, Giovanni Izzo, Fabio Giovine, Angelo Ciampa and Piero Uniti, special people with whom I shared this long university journey. A special thanks to all my university professors, each of whom gave me a great lesson for life.

"Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can praise them, disagree with them, quote them, disbelieve them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They invent. They imagine. They heal. They explore. They create. They inspire. They push the human race forward. Maybe they have to be crazy. . . While some see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do." Steve Jobs

Abstract

As the Android platform becomes the universal front-end in the IoE and IoT, mobile attacks will continue to grow rapidly while new technologies expand the attack surface. Vendors, manufacturers and providers should extend vulnerability-shielding and exploit-prevention technologies, and anti-malware vendors have to enhance their current solutions, because new types of malware carry fileless payloads that run only in memory and adopt increasingly complex obfuscation techniques to circumvent detection.
The idea behind this work arises from the awareness that a more effective and holistic anti-malware approach has first to outline a malware's phylogenesis: to understand its evolution, its sophistication and the semantics of the family it belongs to. This methodology moves in that direction, implementing a clone-detection heuristic that outlines common payload components in order to identify malware variants. Our heuristic is a contribution to the malware analysis phase, not to the detection phase: it aims to better understand Android malware and its evolution and to trace back a possible malware descent. To achieve these goals, we start from the analysis of the opcodes frequency distribution, obtaining by similarity the 10 nearest vectors from the data-set (built from the Android Drebin Project); then an n-grams heuristic on the adjacency lists detects isomorphism features in the call graphs, identifying payload components as common sub-graphs. We are thus able to outline a possible genome for each malware family and to define a possible descent for each malware variant, including multiple descents, proving the effectiveness of this methodology. This work aims to lay the foundations of a new type of methodology based on the study of payload phylogenesis.

Contents

Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
List of Listings
Abbreviations
1 Introduction
1.1 Motivation and Problem
1.2 Goals
1.3 Thesis Overview
2 Android: A Security Overview
2.1 The Android Security Model
2.1.1 Android Security Program
2.1.2 Android Platform Security Architecture
2.1.2.1 Kernel-level Security
2.1.2.2 The Application Sandbox
2.1.2.3 The System Partitions
2.1.2.4 The Secure BootLoader and QFuses
2.1.2.5 Cryptography
2.1.2.6 Security-Enhanced Android and Samsung Knox
2.1.2.7 Secure IPC: Intents and Binder
2.1.3 Android Application Security
2.1.3.1 The Android Permission Model and Protected APIs
2.1.3.2 Application Signing Permission Enforcement
2.2 Malicious Android apps
2.2.1 An Overview: Botnet, Data collectors and Madware
2.2.2 Android Malware Characterization
2.2.2.1 Malware Installation
2.2.2.2 Activation
2.2.2.3 Malicious Payload
2.2.2.4 Permission Used
2.2.3 Evolutions and Challenges
3 Malware Detection Methodologies
3.1 Malware Detection Techniques
3.1.1 Signature-based Detection Techniques
3.1.2 Anomaly-based Detection Techniques
3.1.3 Application Permission Analysis
3.1.4 Cloud-based Detection Analysis
3.2 Malware transformation
3.3 Evaluating Android anti-malware Products
3.4 Malware Analysis using Call Graphs
3.4.1 Similarity Detection using Call Graphs
3.4.2 Software Clones Taxonomy
3.4.3 The Isomorphism Problem
4 Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis
4.1 The Data-Set
4.1.1 The Malicious Data-Set, Malware Variants
4.1.1.1 Malware families
4.1.2 The Trusted Data-set
4.2 The multi-staged Architecture
4.3 The Relative Opcodes Frequency Distribution
4.4 Isomorphism Analysis
4.4.1 The Control Flow Graph
4.4.2 The Function Call Graph
4.4.3 Vector-Space Mapping
4.4.4 N-grams Analysis stage
4.4.4.1 N-grams Analysis for each of the 10 returned vectors
4.4.5 The Method-level similarity analysis stage: Type I and II clone detection at method level
4.5 The Classifier
4.6 Real System implementation
4.6.0.1 DescentDroid.pm
5 Experimental Phase
5.1 Focus of the Experiments
5.2 Focus I: Performance evaluation: A MultiClass Classification Problem
5.2.1 Characterization
5.2.2 Evaluations Detail
5.2.2.1 Classification Accuracy, Precision and Recall for each Malware Class
5.2.3 Conclusions and Considerations
5.3 Focus II: Detection Evaluation of Android Malware
5.3.1 Characterization
5.3.2 Evaluations Detail
5.3.2.1 I classifier: 0.99800 CosSim
5.3.2.2 II classifier: 0.99500 CosSim
5.3.2.3 III classifier: 0.99000 CosSim
5.3.3 Conclusions and Considerations
6 Applications and future works
A Partial code of Descent tool: Opcodes Frequency Distribution Computation
Bibliography
List of Figures

2.1 Android Software Stack [1]
2.2 Android Sandbox Mechanism [2] [3]
2.3 Applications sharing the same UID [2] [3]
2.4 Motorola OMAP Secure Boot Chain [4]
2.5 Samsung Knox efuse warranty bit
2.6 HTC S-OFF NAND
2.7 SEAndroid partial coverage [5]
2.8 Samsung KNOX System Security Overview [6]
2.9 Samsung KNOX Secure Boot [7]
2.10 Samsung KNOX Application Container [6]
2.11 Samsung KNOX Support CAC [6]
2.12 Samsung KNOX Client Certificate Management (CCM) [7]
2.13 Android simple form of IPC [8]
2.14 Android Process IPC [9]
2.15 Binder Communication [10]
2.16 Binder Framework [10]
2.17 Binder Permission Enforcement [10]
2.18 Binder Token Object Reference [10]
2.19 Display of permissions for applications [11]
2.20 Securing Activities with Custom permission [12]
2.21 Securing Services with Custom permission [12]
2.22 Securing BroadcastReceiver with Custom permission [12]
2.23 Securing ContentProvider with Custom permission [12]
2.24 GMail AndroidManifest prior to 2.3.5 without signature-level enforcing
2.25 GMail AndroidManifest after 2.3.5 with signature-level enforcing
2.26 Securing with URI permissions [12]
2.27 Apps collect your Information [13]
2.28 Apps collect your Information [13]
2.29 Which apps are abusing [13]
2.30 AdLibraries' privacy scores [13]
2.31 AdLibraries brought to you by malware [13]
2.32 An Update Attack from BaseBridge [14]
2.33 Comparison of the top 20 permissions requested by malicious and benign Android apps [14]
2.34 An Overview of existing Android Malware (PART I: INSTALLATION AND ACTIVATION), 1 of 2 [14]
2.35 An Overview of existing Android Malware (PART I: INSTALLATION AND ACTIVATION), 2 of 2 [14]
2.36 An Overview of existing Android Malware (PART II: MALICIOUS PAYLOADS), 1 of 2 [14]
2.37 An Overview of existing Android Malware (PART II: MALICIOUS PAYLOADS), 2 of 2 [14]
3.1 Malware detection Technologies Taxonomy [15]
3.2 Dynamic Signature Extraction [15]
3.3 Top 20 requested permissions with the most different request rates in the two datasets. The ordinate is the difference between the request rate in the malware dataset and in the benign dataset [16]
3.4 Difference in the frequencies of 18 selected permissions in malware and benign .apk files [17]
3.5 COMPARATIVE ANALYSIS OF BI-NORMAL SEPARATION AND MUTUAL INFORMATION FEATURE SELECTION METHOD [17]
3.6 TOP 5 PERMISSION COMBINATIONS WHEN K = 5 [18]
3.7 TOP 5 PERMISSION COMBINATIONS WHEN K = 6 [18]
3.8 Cloud-based malware protection techniques: (a) Paranoid Android and (b) Crowdroid [19]
3.9 Before ProGuard
3.10 After ProGuard
3.11 Before ProGuard
3.12 String Encoding
3.13 Before ProGuard
3.14 Code Reordering
3.15 An example of a junk code fragment [20]
3.16 Av-test carried out in 2014: Detection rates in the endurance test [21]
3.17 ANTI-MALWARE PRODUCTS EVALUATED [20]
3.18 MALWARE SAMPLES USED FOR TESTING ANTI-MALWARE TOOLS [20]
3.19 Transformation Keys [20]
3.20 DROIDDREAM TRANSFORMATIONS AND ANTI-MALWARE FAILURE [20]
3.21 FAKEPLAYER TRANSFORMATIONS AND ANTI-MALWARE FAILURE [20]
3.22 EVALUATION SUMMARY [20]
3.23 "Android.Trojan.FakeInst.AS" from the FakeInstaller Malware Family [22]
3.24 Complete function call graph of "Android:RuFraud-C" from the malware family FakeInstaller. Dark shading of nodes indicates malicious structures identified by the SVM [22]
3.25 Two corresponding methods in two app clones from different markets. The first method has one more function call to initialize several ads [23]
4.1 Android Drebin logo
4.2 The Whole System Multi-staged Architecture
4.3 66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa sample, Opcode frequency Distribution
4.4 Opcodes Distribution extraction and Computation
4.5 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 1st Vector (red)
4.6 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 2nd Vector (red)
4.7 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 3rd Vector (red)
4.8 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 4th Vector (red)
4.9 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 5th Vector (red)
4.10 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 6th Vector (red)
4.11 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 7th Vector (red)
4.12 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 8th Vector (red)
4.13 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 9th Vector (red)
4.14 Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 10th Vector (red)
4.15 Gappusin family, Opcodes classes Frequency Distribution
4.16 Android Dragon Quest game CFG in Gephi
4.17 Android Dragon Quest game FCG in Gephi
4.18 Structural data mapped in the Vector-Space
4.19 Adjacency Lists preparation and similarities search
4.20 Architecture. N-gram analysis
4.21 Common Subgraphs with the vector n. 9 detected by the n-grams analysis stage
4.22 Common Subgraphs with the vector n. 10 detected by the n-grams analysis stage
4.23 Architecture. Method-level similarity analysis stage
4.24 method.pl execution phase
4.25 AdjList.pl -freq execution phase
4.26 AdjList.pl -adjFCG execution phase
4.27 AdjList.pl -adjFCG execution phase
4.28 Execution phase
4.29 DescentDroid Execution phase
4.30 DescentDroid Execution phase
4.31 DescentDroid Execution phase, with debug option
5.1 979 malware variants properly classified from 1000 malware samples
5.2 Multiclass Classification, Precision as a function of the size of the dataset
5.3 Multiclass Classification, Precision vs the size of the dataset
5.4 Multiclass Classification, Recall as a function of the size of the dataset
5.5 Multiclass Classification, Recall vs the size of the dataset
5.6 3 classifiers on the ROC space
5.7 3 classifiers - Performance Comparison
5.8 3 classifiers - FP/FN Comparison
List of Tables

2.1 Cryptocurrencies Mining Difficulty Rate
2.2 Android Malware Families [14]
4.1 Malware Drebin Project Families
4.2 Dalvik Bytecode set Categories
4.3 66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa Relative Opcodes Frequency Distribution Vector
4.4 10 most similar vectors by the Cosine Similarity
4.5 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 1st Vector (right)
4.6 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 2nd Vector (right)
4.7 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 3rd Vector (right)
4.8 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 4th Vector (right)
4.9 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 5th Vector (right)
4.10 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 6th Vector (right)
4.11 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 7th Vector (right)
4.12 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 8th Vector (right)
4.13 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 9th Vector (right)
4.14 Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 10th Vector (right)
4.15 Candidate apk CFG Adjacency List extracted and sorted
4.16 Candidate apk FCG Adjacency List extracted
4.17 Candidate apk FCG Adjacency List extracted
4.18 n-grams on the Adjacency Lists of the 1st returned vector
4.19 n-grams on the Adjacency Lists of the 2nd returned vector
4.20 n-grams on the Adjacency Lists of the 3rd returned vector
4.21 n-grams on the Adjacency Lists of the 4th returned vector
4.22 n-grams on the Adjacency Lists of the 5th returned vector
4.23 n-grams on the Adjacency Lists of the 6th returned vector
4.24 n-grams on the Adjacency Lists of the 7th returned vector
4.25 n-grams on the Adjacency Lists of the 8th returned vector
4.26 n-grams on the Adjacency Lists of the 9th returned vector
4.27 n-grams on the Adjacency Lists of the 10th returned vector
4.28 Common Subgraph with the n. 9 vector
4.29 Common Subgraph with the n. 10 vector
4.30 The Classifier Analysis stage 1
5.1 Results detail on 1000 malware samples, Synoptic table
5.2 Precision and Recall of the MultiClass Classification Process
5.3 Different Classifiers of Opcodes Frequency Distribution Similarity
5.4 Confusion Matrix for 0.99800
5.5 Performance measures for 0.99800 CosSim
5.6 Confusion Matrix for 0.99500
5.7 Performance measures for 0.99500 CosSim
5.8 Confusion Matrix for 0.99000
5.9 Performance measures for 0.99000 CosSim
5.10 Accuracy of the three Classifiers
5.11 Recall of the three Classifiers
5.12 Precision of the three Classifiers
5.13 F-Score of the three Classifiers
5.14 FP/FN ratio of the three Classifiers

Listings

2.1 Intent-filter declaration
2.2 Explicit Intent declaration
2.3 Implicit Intent declaration [24]
2.4 BroadcastReceiver registration for incoming SMS
2.5 Intent Filter with priority
2.6 Exported CellValidateService component [25]
2.7 CellValidateService Class [25]
2.8 Exploit [25]
2.9 IBinder.transact()
2.10 IBinder.onTransact()
2.11 Client application acquiring wakelock on PowerManager service [26]
2.12 PowerManager source code [26]
2.13 WhatsApp.SF
2.14 RECEIVE_SMS permission
2.15 ACCESS_FINE_LOCATION permission
2.16 ACCOUNT_MANAGER permission
2.17 Securing Android components
2.18 URI permissions example
3.1 Kirin policy example [27]
3.2 Type I Clone
3.3 Type II Clone
3.4 Type III Clone
3.5 Type IV Clone
4.1 Node properties
4.2 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde sample Adjacency List
A.1 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.2 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.3 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.4 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.5 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.6 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.7 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.8 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.9 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.10 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.11 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.12 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.13 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.14 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.15 Partial code of Descent tool. Opcodes Frequency Distribution Computation
A.16 Partial code of Descent tool. Opcodes Frequency Distribution Computation

Abbreviations

AIDL Android Interface Definition Language
BYOD Bring Your Own Device
CA Certification Authority
DoD Department of Defense
FIPS Federal Information Processing Standard
FOTA Firmware Over The Air
FSA Finite State Automaton
IoT Internet of Things
IoE Internet of Everything
MAC Mandatory Access Control
MDM Mobile Device Management
NSA National Security Agency
ODE On-device Data Encryption
OTA Over The Air
PDA Push Down Automaton
TAN Transaction Authentication Number

Chapter 1
Introduction

1.1 Motivation and Problem

Every year, malware forecasts become more adverse. Mobile attacks will continue to grow rapidly as new technologies expand the attack surface. New vulnerabilities no longer reside only on devices but also on platforms and apps, and installing malicious apps or visiting malicious websites will no longer be the sole mobile infection vectors. Vendors, manufacturers and providers should extend vulnerability-shielding and exploit-prevention technologies, and anti-malware vendors have to enhance their current solutions, because new types of malware carry fileless payloads that run only in memory and adopt increasingly complex obfuscation techniques to circumvent detection. Exploit kits are growing and will attack the Android platform, which is becoming the universal front-end in the IoE and IoT: SmartTVs, Boxee TV devices, biometric systems, home-automation devices, Android cars (Open Automotive Alliance) and digital payments, ever more pervasive in our lives.

1.2 Goals

This Master Thesis does not want to be yet another methodology for detecting malware on Android devices; it intends to represent an inversion of strategy, placing itself before any existing anti-malware solution, to support them, to enhance them, to build signatures for them. The idea behind it arises from the awareness that a more effective and holistic anti-malware approach has first to outline a malware's phylogenesis: to understand its evolution, its sophistication and the semantics of the family it belongs to. This methodology moves in that direction, implementing a clone-detection heuristic that outlines common payload components in order to identify malware variants.
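As the abstract anticipated and Chapter 4 details, the first stage of this heuristic compares relative opcode-frequency vectors with the cosine similarity, keeping the 10 closest data-set vectors. The actual tool described later (DescentDroid) is written in Perl; the following Java fragment is only a minimal, illustrative sketch of that computation, assuming the raw opcode counts have already been extracted from the disassembled Dalvik bytecode, and all class and method names here are hypothetical.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: build a relative opcode-frequency vector and compare two
// vectors with the cosine similarity, as in the first stage of the methodology.
public class OpcodeSimilarity {

    // Normalize raw opcode counts (opcode name -> occurrences) into relative frequencies.
    static Map<String, Double> relativeFrequencies(Map<String, Integer> counts) {
        double total = 0;
        for (int c : counts.values()) total += c;
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            freq.put(e.getKey(), e.getValue() / total);
        return freq;
    }

    // Cosine similarity between two frequency vectors: dot(a,b) / (|a| * |b|).
    static double cosineSimilarity(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String k : keys) {
            double va = a.getOrDefault(k, 0.0);
            double vb = b.getOrDefault(k, 0.0);
            dot += va * vb;
            normA += va * va;
            normB += vb * vb;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

In use, the candidate apk's vector would be compared against every vector in the data-set and the 10 highest-scoring samples retained for the later call-graph isomorphism stages.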
Our heuristic is a contribution to the malware analysis phase, not to the detection phase: it aims to better understand Android malware and its evolution, and to trace back a possible malware descent. To achieve these goals, our methodology starts from the analysis of the opcodes frequency distribution, obtaining by similarity the 10 nearest vectors from the data-set (built from the Android Drebin Project); then an n-grams heuristic on the adjacency lists detects isomorphism features in the call graphs to identify payloads or malicious behaviours. In this way we are able to outline a possible genome for each malware family.

1.3 Thesis Overview

This Master Thesis is organized as follows. CHAPTER 2 (Android: A Security Overview) starts with an overview of the Android security model, describing its internals and focusing on the security mechanisms implemented by Google and by vendors; the second part of the chapter describes the Android malware characterization, malware evolutions and challenges. CHAPTER 3 (Malware Detection Methodologies) focuses on the state of the art of malware detection methodologies and their related unresolved problems; we also evaluate commercial mobile anti-malware products for Android and test how resistant they are against various common obfuscation techniques. In the second part of the chapter we explore detection methodologies based on call-graph isomorphism analysis for malware variant detection. CHAPTER 4 (Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis) provides a detailed analysis of our implemented methodology, based on opcodes distribution analysis and an n-grams heuristic to outline isomorphism features in the call graphs. CHAPTER 5 (Experimental Phase) focuses on the experimentation phase, trying to confirm the value of this methodology. Finally, CHAPTER 6 (Conclusions, Applications and future works) describes possible applications of this methodology and proposals for the future.

Chapter 2
Android: A Security Overview

2.1 The Android Security Model

The project was founded in October 2003 in Palo Alto, CA, by Andy Rubin, working on software for mobile phones. In 2005 Google acquired Android Inc. and in 2007 unveiled it through the Open Handset Alliance, a consortium of device manufacturers and operators led by Google whose intention is to promote open standards for mobile devices. Android is a mobile platform based on a Linux kernel, designed from the very beginning with openness and interoperability in mind, enabling developers to create engaging and innovative applications. Android uses an open architecture which also allows third-party manufacturers to install their own custom NAND flash images to provide customizations for their devices. This open design can undermine the platform itself, and it is necessary to ensure strong protection mechanisms for some fundamental assets: users, developers, devices, HW components and applications. As a result, since the early stages of its development, Android has been subject to a rigorous Security Program. In the following we analyze how this program is composed, referring to the official documentation "Android Security Overview" [11] by Google.
2.1.1 Android Security Program

The "Android Security Program" consists of four main elements:

Design Review. During the design phase of the Android platform, each major feature is reviewed by specialized security engineers and, where possible, security controls are integrated into the architecture.

Penetration Testing and Code Review. During the development of the platform components, each component must pass rigorous security checks conducted by the Android Security Team, by Google's Information Security Engineering team and by independent security experts. The goal of these assessments is to identify weaknesses and vulnerabilities.

Open Source and Community Review. After the release of a brand new version of the Android platform, it is subject to further review by the open-source Android community, both on specific review boards such as the kernel community and on the Google Play forum.

Incident Response. Security issues may also occur after shipping; for this reason, Android has a security response process. A team of security experts constantly monitors Android security-specific communities and, after identifying potential security issues, acts immediately to mitigate vulnerabilities with OTA and FOTA updates.

In the following, referring to the "Android Security Overview" [11] documentation, the Android platform security architecture is analyzed.

2.1.2 Android Platform Security Architecture

In order to protect its fundamental assets, Android provides the following security features:

• Kernel-level Security
• Application Sandbox
• The System Partitions and Secure Bootloaders
• Cryptography
• Security-Enhanced Android and Samsung Knox
• Secure IPC: Intents and Binder
• Application Signing Enforcement

The whole Android platform architecture, with the key components ensuring security and integrity, is presented below:

Figure 2.1: Android Software Stack [1]

An Android application is written in Java and is therefore protected from standard buffer-overflow exploits thanks to implicit bounds checking, but some applications can also access C/C++ libraries via JNI, the Java Native Interface. Developers use JNI for performance reasons, but most C/C++ libraries are mapped to fixed addresses in memory. All application code running on top of the kernel is confined and isolated by the Application Sandbox.

2.1.2.1 Kernel-level Security

Android is based on the Linux kernel security mechanisms and, in addition, it includes some enforced IPC mechanisms (Intents, BroadcastReceiver, Binder). Linux is a world-wide operating system used in security-sensitive and governmental contexts; the kernel provides Android with several security features, including:

• A user-based permission model.
• Process isolation and sandboxing.
• Secure IPC.
• Network security (iptables).

2.1.2.2 The Application Sandbox

The Sandbox is a sealed container which ensures isolation among Android applications running on top of the kernel, and it is auditable. The kernel assigns a unique ID, the UID, to each application that is executed; in this way the kernel ensures that each application has its own data and that no other application can read or modify it if that application is considered untrusted. This isolation mechanism is based on Linux user privileges: normally processes have different process spaces and different UIDs. It is a DAC (Discretionary Access Control) model:

Figure 2.2: Android Sandbox Mechanism [2] [3]
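Because the sandbox ultimately rests on per-application UIDs and ordinary Linux file ownership, a short illustrative sketch may help; it is not taken from the thesis tooling and only uses the standard android.content.Context API, with the file-mode constants discussed in the next subsection.

import android.content.Context;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative sketch: writing a file to the app's internal storage with
// MODE_PRIVATE. The file is owned by the application's UID, so the kernel-level
// DAC checks described above prevent other (non-shared-UID) apps from reading it.
public class PrivateStorageExample {

    // 'context' is any application Context; the file lands under the app's private data dir.
    public static void savePrivateNote(Context context, String text) {
        try (FileOutputStream out =
                     context.openFileOutput("note.txt", Context.MODE_PRIVATE)) {
            out.write(text.getBytes());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The same data written to external storage would not enjoy this UID-based protection, which is why, as noted below, files placed on external storage must be encrypted.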
By default, Android applications run in the Sandbox and do not have any grant to access resources; they can access device resources by declaring specific permissions in their AndroidManifest file. All application components run in the same process space. It is, however, possible to specify the process in which each component should run: the android:process attribute of the <application> element in the AndroidManifest file serves this purpose. It is also possible to set the android:process attribute so that components of different applications run in the same process, but in this case the applications must share their UID and must be signed with the same key. To share the same UID among applications, it is first necessary to define the UID in the AndroidManifest file by specifying the android:sharedUserId [3] attribute, and then to sign the applications with the same developer's private key; the signature on any resource is easily traced back to its author.

Figure 2.3: Applications sharing the same UID [2] [3]

The Android Application Sandbox protects against memory-corruption exploits which could otherwise compromise the entire system.

2.1.2.3 The System Partitions

The Android platform's internal memory is generally a solid-state (flash) memory. Usually the bootloader has its own partition, /boot; Recovery is another partition, /recovery; and the system partition is under /system. A short description of the standard Android partitions follows:

- /misc: miscellaneous.
- /boot: bootloader, filesystem root.
- /recovery: holds the recovery program.
- /system: OS, kernel, OS libraries, application runtime, binary applications.
- /cache: cache data.
- /data: user application data/preferences.

To protect files, applications use the mode parameter:

• MODE_PRIVATE
• MODE_WORLD_READABLE
• MODE_WORLD_WRITEABLE

It is important to note that application files stored on internal storage are secured by the Linux UID mechanisms, while application files stored on external storage, such as an SD card, are not: a third-party application does not need any permission to read them. For this reason, application files stored on external storage must be encrypted.

2.1.2.4 The Secure BootLoader and QFuses

The bootloader is a little piece of code that runs before the OS starts up. This low-level code tells the device how to start up and where to find the kernel image; its authentication process must guarantee the origin and the integrity of the software. It is stored in the device's non-volatile memory and varies from one device to another. Manufacturers sign and secure their bootloaders in order to ensure that their OS image is securely programmed into flash memory at the factory, to permit verification of the image and to avoid tampering. Usually the implementation is OEM-dependent. The Secure Bootloader verifies the signatures of the kernel and of Android Recovery; if the signature check fails it drops into Recovery mode, and the device is then unable to boot. Some Secure Boot technologies follow.

Motorola OMAP Boot Chain

Motorola in its devices, but not only Motorola (Motorola Droid Milestone, Nokia N900 and Motorola Defy, but also Archos 43, Archos 70, Archos 101, LG Optimus Black, LG Optimus Bright, etc.), used the OMAP crypto processor by Texas Instruments. The JTAG port is locked down to prevent emulation and debug attacks.

• SHA-1 hash of the root public key stored in eFuse.
• Boot ROM verification using either SHA-1 or SHA-256, and AES-128.
• NIST 800-22 certified random number generator.
• Signature verification on Kernel or Recovery.
• Boot modules encrypted.

Figure 2.4: Motorola OMAP Secure Boot Chain [4]

The Boot ROM verifies the keys stored in the mbmloader and the mbmloader signature, the mbmloader verifies the signature of the mbm, the mbm verifies the Linux bootloader, and the Linux bootloader verifies the kernel image or the Recovery.

Qualcomm and QFuse

Most Qualcomm devices, such as the HTC family (HTC One, HTC Desire, HTC M8, HTC One X, etc.), use the QFuse driver. The QFuse driver allows the one-time, immutable configuration of device settings, for example to store cryptographic keys. The value of each QFuse is sensed at power-up and stored in a register; blowing a QFuse is done by placing a value in a register and applying current to the fuse. This process is irreversible. For instance, if the FORCE_TRUSTED_BOOT QFuse is blown, each stage of the boot chain is verified, which means that at each boot stage only verified bootloaders can run. The PBL ("Primary Bootloader") verifies the signature of the SBL ("Secondary Bootloader"), the SBL verifies the signature on REX/AMSS (the baseband) and on HBOOT (the application bootloader), and then HBOOT verifies the kernel image and finally boots the OS. The stage-2 bootloader sets up the network, memory, and everything else required to run the kernel; when the kernel has finished loading, it starts the init process.

The QFuse technology is also implemented in Samsung Knox devices as a warranty flag (the Knox warranty bit): once the device is flashed, the Knox warranty bit is set by blowing an eFuse and remains blown forever. This technology gives the OEM some security controls over the device; the OEM may, for example, decide not to release further updates for such devices, as happens for most Samsung Galaxy devices.

Figure 2.5: Samsung Knox efuse warranty bit

HTC S-ON/S-OFF

In HTC devices, the NAND memory is secured ("S-ON", Security On) for write protection; passing from S-ON to S-OFF means disabling the NAND security checks. After switching to S-OFF it is possible to repartition the NAND or to flash the boot.img or the kernel.

Figure 2.6: HTC S-OFF NAND
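All of the boot chains above follow the same basic pattern: a stage holds a trusted public key and refuses to hand control to the next image unless that image's digital signature verifies. The real implementations live in proprietary boot ROM firmware; the following Java fragment is only a conceptual sketch of that check using the standard java.security API, with placeholder names and no relation to any actual device code.

import java.security.PublicKey;
import java.security.Signature;

// Conceptual illustration of a secure-boot stage: verify the next stage's
// signature with a trusted (e.g. eFuse-anchored) public key before running it.
public class BootStageVerifier {

    public static boolean verifyNextStage(byte[] image, byte[] signature,
                                          PublicKey trustedKey) throws Exception {
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(trustedKey);
        verifier.update(image);              // hash the candidate image
        return verifier.verify(signature);   // true only if signed by the trusted key
    }

    public static void boot(byte[] image, byte[] signature, PublicKey trustedKey)
            throws Exception {
        if (!verifyNextStage(image, signature, trustedKey)) {
            // A real device would drop into recovery or refuse to boot here.
            throw new SecurityException("Next boot stage failed signature verification");
        }
        // ... hand control to the verified image ...
    }
}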
From Android 4.4 [28] SEAndroid works in enforcing-mode for the installd, netd, vold and zygote root processes but for all others still works in permissive-mode [29]. In enforcing-mode, each policy violations is logged and then blocked. As reported by the document Validating Security-Enhanced Linux in Android [29]. Policies are defined in the external\sepolicy dir and it is possible to implementing its own policy defining a policy file in the device\manufacturer\device-name\sepolicy dir and after updating the BoardConfig.mk Makefile. At this point after the Android rebuild, SEAndroid will be aware about new user-defined policies, as OEM do with their devices, Google advice to conduct a policy rules validation process [29]. The “Policy Rules” are defined in the form: allow domains types:classes permissions as white List model, where: • Domain A label for a process or a process group. Chapter 2. Android: A Security Overview 14 • Type A label for a object or a set of objects. • Class The Object’s type. • Permission The action to perform (as Read and Write). Those Security Enhancements make the exploiting much more difficult but, we have to take some considerations. For instance, not the whole platform is Secured: Figure 2.7: SEAndroid partialc coverage [5] For example, the Policy Manager does not protect SurfaceFlinger which renders image to the screen, hackers can capture raw content send to SurfaceFlinger. Another effective way for hackers is to disable the Policy Management exploiting Linx Kernel Flaws. A depth study is done on a proprietary technology: Samsung Knox. This technology allows the applications’ sealing and isolation implementing a MDM and BYOD Enterprise-level technology. Samsung Knox is composed by the following components [6]: Chapter 2. Android: A Security Overview 15 Figure 2.8: Samsung KNOX System Security Overview [6] The Secure Boot technology prevents NAND flashing , Custom ROMs through a digital cert verification process. The firmware images are loaded only if they went cryptographically verified by a key stored in the device at factory time. The firmware images that are allowed are that cryptographically signed by trusted keys then verifyed by the HW. That is each stage of the boot process verifies the digital signature of the next before executing it: Figure 2.9: Samsung KNOX Secure Boot [7] Samsung Knox use SE for Android Security Enhancements for Android) to implementing a MAC model. To implementing MAC model, the OS integrity must be ensured, to address this, Samsung Knox use a layer called TIMA (TrustZone-based Integrity Measurement Architecture) which uses the ARM TrustZone technology. The TrustZone technology is a tamper-resistant sector of an ARM processor. During the boot process, TrustZone save fingerprint called “measurements” from all bootloaders and Chapter 2. Android: A Security Overview 16 OS kernel, at system runtime ARM TrustZone apps on KNOX Workspace constantly compares all measurements, in this way Trusted Boot is ensured. Critical security decisions are made based on the compared results [7]. CPU and Memory are partitioned into “containers” of secure and normal type. TIMA runs in secure mode and it can’t be disabled while SEAndroid runs in normal mode. Based on Samsung Knox documentation [6], if TIMA were encounter an integrity violation, it will notify that violation to the enterprise IT stuff via MDM. 
Figure 2.8: Samsung KNOX System Security Overview [6]

The Secure Boot technology prevents NAND flashing and custom ROMs through a digital certificate verification process. Firmware images are loaded only if they are cryptographically verified with a key stored in the device at factory time: the allowed firmware images are those cryptographically signed by trusted keys and then verified by the hardware. That is, each stage of the boot process verifies the digital signature of the next before executing it:

Figure 2.9: Samsung KNOX Secure Boot [7]

Samsung Knox uses SE for Android (Security Enhancements for Android) to implement a MAC model. To implement a MAC model, the OS integrity must be ensured; to address this, Samsung Knox uses a layer called TIMA (TrustZone-based Integrity Measurement Architecture) which relies on the ARM TrustZone technology. TrustZone is a tamper-resistant sector of an ARM processor. During the boot process, TrustZone saves fingerprints called "measurements" of all bootloaders and of the OS kernel; at system runtime, ARM TrustZone apps on the KNOX Workspace constantly compare all measurements, and in this way Trusted Boot is ensured. Critical security decisions are made based on the compared results [7]. CPU and memory are partitioned into "containers" of secure and normal type: TIMA runs in secure mode and cannot be disabled, while SEAndroid runs in normal mode. Based on the Samsung Knox documentation [6], if TIMA encounters an integrity violation, it notifies that violation to the enterprise IT staff via MDM.

Samsung Knox also implements an application security mechanism which provides different "execution environments" on the device, each with its own launcher, applications and configurations, ensuring isolation among these containers, forbidding the standard IPC mechanisms and allowing "inter-container" communication only through configuration policies.

Figure 2.10: Samsung KNOX Application Container [6]

In addition, the crypto processor enables the ODE, by means of which IT administrators can encrypt all data on the device. The ODE uses the FIPS 140-2 certified Advanced Encryption Standard (AES) NIST standard with a 256-bit AES key; the encryption key derives from a user-supplied password through the secure Password-Based Key Derivation Function 2 (PBKDF2). There is also secure VPN support, KNOX VPN, also FIPS 140-2 certified, with which per-application policies can be configured. Samsung Knox provides full support for governmental use of the device through:

• Boot Attestation. Samsung signs the certificates of government agencies by creating a two-level CA, the Knox Government root certificate; at this point government agencies can also build their own firmware image to load on the device.

• Smartcard - CAC Support. In compliance with DoD regulations, Samsung Knox permits storing digital certificates on the device. The browser, email and VPN clients can use credentials on the Common Access Cards (CACs) to log in, if the enterprise IT admin has configured this policy. The CAC can be used for two-factor authentication on the device lock screen.

Figure 2.11: Samsung KNOX Support CAC [6]

• Client Certificate Management (CCM). As reported by the Knox documentation [7], TIMA Client Certificate Management (CCM) enables storage and retrieval of digital certificates, as well as other operations using those certificates such as encryption, decryption, signing, verification, and so on. The certificates and associated keys are encrypted with a device-unique hardware key that can only be decrypted by code running within ARM TrustZone. A default certificate is provided for apps that do not require their own certificate. TIMA Real-time Kernel Protection (RKP) intercepts critical kernel events to stop attempts that could undermine kernel integrity; TIMA also authenticates kernel modules.

Figure 2.12: Samsung KNOX Client Certificate Management (CCM) [7]

• Certification & Validations. Samsung Knox is compliant with the NIST FIPS 140-2 level 1 specifications for DAR (Data at Rest) and DIT (Data in Transit). On May 2, 2013 the Defense Information Systems Agency (DISA) approved the STIG for Samsung KNOX Workspace, drafted for the Mobile Operating System SRG.

2.1.2.7 Secure IPC: INTENTS and BINDER

Android uses the classical IPC mechanisms of Unix-like systems (but no SysV semaphores, shared memory segments, message queues, etc.), while also implementing its own secure and flexible IPC mechanisms together with a component-exporting mechanism. Android supports a form of IPC through Intents and ContentProviders [8]:

- Intents allow an asynchronous communication style among components.
- Point-to-point style, publish-subscribe style.
- Some components act as senders, others as receivers.
- All low-level communication is managed by Binder.
Figure 2.13: Android simple form of IPC [8]

An overview of the main components of the Android platform helps to better understand the communication mechanisms:

• Activities: an Activity is an application component through which the user can interact with the application. Usually a complete application is composed of several Activities logically connected.

• BroadcastReceiver: an Android application component that allows registering for system or application events. For instance, an application can register for the system event ACTION_BOOT_COMPLETED, which is broadcast when the OS has finished booting. A BroadcastReceiver can be registered statically in the AndroidManifest file or dynamically through the Context.registerReceiver() method; the class implementing the receiver must extend the BroadcastReceiver class, and its callback is triggered when the registered event occurs (a minimal dynamic-registration sketch is shown below). For this to work, the developer must declare the relative permission in the AndroidManifest: for instance, to create a BroadcastReceiver for the android.intent.action.BOOT_COMPLETED event it is necessary to declare the android.permission.RECEIVE_BOOT_COMPLETED permission, so that the user is made aware, through the Manifest file, of which events the application will listen to.
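The dynamic-registration path mentioned in the previous item can be sketched as follows; this is only an illustration (the battery-low action is chosen arbitrarily), not code from the thesis or from any cited application.

import android.content.BroadcastReceiver;
import android.content.Context;
import android.content.Intent;
import android.content.IntentFilter;

// Illustrative sketch: the receiver class extends BroadcastReceiver and is
// registered at runtime with Context.registerReceiver().
public class BatteryLowReceiver extends BroadcastReceiver {

    @Override
    public void onReceive(Context context, Intent intent) {
        // Triggered when the registered event occurs.
        if (Intent.ACTION_BATTERY_LOW.equals(intent.getAction())) {
            // react to the event ...
        }
    }

    // Call from an Activity or Service, e.g. in onCreate()/onResume().
    public static BatteryLowReceiver register(Context context) {
        BatteryLowReceiver receiver = new BatteryLowReceiver();
        context.registerReceiver(receiver, new IntentFilter(Intent.ACTION_BATTERY_LOW));
        return receiver; // keep the reference to call unregisterReceiver() later
    }
}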
Android: A Security Overview 21 – Flags: Flags permit some extra caractherization, for example when starting an activity, it is also possible to use FLAG ACTIVITY SINGLE TOP for the verification of the singleton of the activity’s instance. In the official Intent class documentation [30] there is the complete list. With Intents is possible to activate application’s inner components or components exported by other applications. A broadcastIntent method can be used to directly send to a listening BroadcastReceiver [30] or with startActivity method to launch ana activity, it is possible to launch an activity, starting a service or thrown a broadcast message. It is possible to use 2 different form for intents: Explicit Intents or Implicit Intents. Explicit Intents directly calls the component they needs finding it through a name, while Implicit Intents ask System to send the message for them. Implicit Intents acts through a mechanism called Intent Resolution at run-time, when creating an Implicit Intent Android search for an appropriate component looking through Intent Filters declared insiede AndroidManifest of all installed applications. Once finding the right Intent fIlter, the Intent object will be delivered to it, if instead more Intent-filters are compatible, at that point, a dialog-box will be displayed to the user to choose the appropriate application to use for that action. An Intent-filter is used to tell Android, of which Intents listening. Google specifies, as good practice the utilizazion of Explicit Intents and recommend to not declare Intent filters for services [30]. Chapter 2. Android: A Security Overview 22 Listing 2.1: Intent-filter declaration <activity android:name="ShareActivity"> <intent-filter> <action android:name="android.intent.action.SEND"/> <category android:name="android.intent.category. DEFAULT"/> <data android:mimeType="text/plain"/> </intent-filter> </activity> An Activity without declared Intent-filter will only receive Explicit Intents. This Intent-filter receive only ACTION SEND intents. By the official documentation [30], from a Security perspective, would not be sufficient to avoid Intent-filter utilization, because avoiding Intent-filters utilization, the application can’ t reply to Implicit Intents but any other application can still use its component through Explicit Intents knowing the exact component namespace. It is necessary, about this consideration, to set to false the exported attribute: Listing 2.2: Explicit Intent declaration Intent i = new Intent(this, ActivityTwo.class); #Explicit Intent i.putExtra("Value1", "This value one for ActivityTwo "); i.putExtra("Value2", "This value two ActivityTwo"); Chapter 2. Android: A Security Overview 23 In this example, we see that, to invoke a component is necessary to just only know its exact name. Listing 2.3: Implicit Intent declaration [24] Intent helpIntent=new Intent(); helpIntent.setAction(android.intent.action.ACTION_EDIT); helpIntent.putExtras(message,userText); helpIntent.setType(text/plain); startActivity(helpIntent); In this example, an Implicit Intent starts, the recipient component should have text Editing capabilities. The Intent-resolution mechanism analyze the Action, Data and Category fields of the Intent object [24] . Intents Security BroadcastIntent ed Exported Components It is possible to use Broadcast Intent also through sendBroadcast(), sendOrderedBroadcast() methods, which Intents after will be “consumed” by receiver, they could not arrive to all applications. 
Sending a broadcast request is dangerous, above all if it contains sensitive data: any evil application can intercept those data. To avoid this, it is convenient to specify a non-null permission in the request Intent; in this way only the intended target application can receive that Intent. For the same reason, if the Intent contains sensitive data it is convenient to use Explicit Intents, which call the target recipient directly. Broadcast Intents can otherwise be easily hijacked by third-party applications. Let's see how this is possible. If we wanted to create a stealth SMS-hijacking application, we would first define a BroadcastReceiver to intercept incoming SMS:

Listing 2.4: BroadcastReceiver registration for incoming SMS

<receiver android:name=".SMSReceiver">
    <intent-filter>
        <action android:name="android.provider.Telephony.SMS_RECEIVED" />
    </intent-filter>
</receiver>

To prevent the default SMS application from handling an incoming SMS first, we can set the android:priority attribute in the Intent-filter for the android.provider.Telephony.SMS_RECEIVED action and give it a high priority value, for example 100.

Listing 2.5: Intent Filter with priority

<receiver android:name=".SMSReceiver">
    <intent-filter android:priority="100">
        <action android:name="android.provider.Telephony.SMS_RECEIVED" />
    </intent-filter>
</receiver>

At this point our application intercepts the Broadcast Intent first; since the broadcast would otherwise continue to propagate, it can be aborted with this.abortBroadcast(). This evil behaviour is typical of many malware samples: in this way it is possible, for example, to intercept bank or TAN SMS messages related to banking operations. The careless use of Broadcast Intents, however, is not the only way to introduce weaknesses in the system; other weaknesses can be introduced by components exported by default. In fact, a Service that declares an Intent-filter is exported by default, unless the android:exported attribute is explicitly set to "false". It is sufficient to read the AndroidManifest and check whether vulnerable services are exported; if an Intent-filter is also defined inside the service element, we can create an ad-hoc application which exploits that exported component without declaring any particular permission inside its own AndroidManifest. A great example of how to exploit this weakness is reported in [25]; the application in question is the famous GoSMS PRO. The application has the CellValidateService component exported by default, with the following Intent-filter defined:

Listing 2.6: Exported CellValidateService component [25]

<service android:name="com.jb.gosms.im.CellValidateService">
    <intent-filter>
        <action android:name="com.jb.gosms.goim.ACTION_CELL_VALIDATE"/>
    </intent-filter>
</service>

Inside the CellValidateService code there is the following logic:

Listing 2.7: CellValidateService class [25]

public void onStart(Intent paramIntent, int paramInt) {
    .....
    if ("com.jb.gosms.goim.ACTION_CELL_VALIDATE".equals(paramIntent.getAction())) {
        String str1 = paramIntent.getStringExtra("phone");
        ....
    }
    Code(str1, str3);
    .....
}

public void Code(String paramString1, String paramString2) {
    SmsManager.getDefault().sendTextMessage(paramString1, null, paramString2, null, null);
    ........
}

At this point it is possible to exploit the weakness by forging an unauthorized application. That application exploits the default-exported CellValidateService component by forging a
com.jb.gosms.goim.ACTION_CELL_VALIDATE Intent and setting an extra field containing the telephone number:

Listing 2.8: Exploit [25]

Intent intent = new Intent();
intent.setAction("com.xxx.ACTION_CELL_VALIDATE");
intent.putExtra("phone", "3424325");
startService(intent);

• BINDER: In Android each application runs inside an isolated process space, for security and stability reasons, but it is still necessary for these components to communicate with each other. For instance, when an application needs access to a Service, the service is requested through an IPC interface, via a native library which in turn interfaces with a driver in kernel space. In Android this interaction among services, applications and the kernel occurs through BINDER:

Figure 2.14: Android Process IPC [9]

The framework performs an ioctl() call on the /dev/binder file descriptor, transferring data to the kernel (1); the driver finds the server service, copies the data into the server process space and wakes up a thread in the server process to handle the request (2). At this point the server, after unmarshalling the Parcel data, verifies that the client holds the proper permission to execute the request and, if necessary, asks the kernel for the authorization to use a resource (3), waiting for a response (4). Subsequently, a copy of the libbinder library loaded inside the server process space marshals the response data and sends it to the Binder driver (5), which delivers it to the client process (6) [9]. The Android libc "bionic" does not support System V IPC:

- no semaphores
- no message queues
- no shared memory segments

Binder is a secure and reliable IPC mechanism in Android. It derives from an open source project called "OpenBinder" by Palm Inc., born under Linux, and it extends the classical IPC concept. Substantially, it is a mechanism which permits binding data and functions from one execution environment to another. Binder was born from the idea of making IPC communication more efficient: the communication is not done by copying memory between processes but through a kernel memory space shared among all processes. Binder is a driver and its task is to map each memory address referenced by a process into a kernel-memory address. The communication model it uses is client-server based: the client begins the communication waiting for a proxy from the server. The server has a thread pool to serve incoming requests; this client-server communication is made possible by the Binder driver (/dev/binder) in kernel space through ioctl() calls:

Figure 2.15: Binder Communication [10]

Processes cannot directly invoke read/write operations on other processes, only the kernel can; for this reason processes use Binder as an intermediary. Binder offers a C++ framework to create remote services. Developers can build their own remote service using the AIDL language, specifying the interface, which once compiled provides the Java classes for the Proxy and the Stub. Remote services are functionalities that applications can export and share with other processes:

Figure 2.16: Binder Framework [10]

From the application point of view, Binder is defined in the Binder.java class, an implementation of the IBinder interface.
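To illustrate the AIDL workflow just described, the following sketch shows a hypothetical remote-service interface and the Service that exposes it. The interface, its package and the service name are invented for the example; IRemoteCalculator.Stub stands for the class that the AIDL compiler would generate from the interface.

// IRemoteCalculator.aidl (hypothetical; compiled by the AIDL tool into a
// Java interface with inner Stub and Proxy classes):
//
//   package com.example.binderdemo;
//   interface IRemoteCalculator {
//       int add(int a, int b);
//   }

package com.example.binderdemo;

import android.app.Service;
import android.content.Intent;
import android.os.IBinder;
import android.os.RemoteException;

public class CalculatorService extends Service {

    // The Stub extends android.os.Binder and implements the AIDL interface;
    // implementing it here makes the method callable from other processes.
    private final IRemoteCalculator.Stub binder = new IRemoteCalculator.Stub() {
        @Override
        public int add(int a, int b) throws RemoteException {
            return a + b;
        }
    };

    @Override
    public IBinder onBind(Intent intent) {
        // Clients that bind to this Service receive the Binder object;
        // on the client side, calls go through the generated Proxy.
        return binder;
    }
}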
Each time an Android process sends some data to another, a "transaction" is defined, through the IBinder.transact() method:

Listing 2.9: IBinder.transact()

public final boolean transact(int code, Parcel data, Parcel reply, int flags)

This function allows the client to transmit an action to perform (code field) and some data to work on (data field); these data must first be "parcelized", i.e. packed into a Parcel object (custom objects must implement the Parcelable interface) to permit marshalling and unmarshalling. The recipient process receives the call on the IBinder.onTransact() method:

Listing 2.10: IBinder.onTransact()

protected boolean onTransact(int code, Parcel data, Parcel reply, int flags)

Binder Security

Permission Enforcement
At the moment a process builds a Binder transaction, the UID and PID of the calling process are registered by the kernel, and this information can be accessed:

- android.os.Binder.getCallingPid(): the calling process PID.
- android.os.Binder.getCallingUid(): the calling process UID.
- PackageManager.getPackagesForUid(int uid): the package names associated with a UID.
- PackageManager.getPackageInfo(String packageName, int flags) and Context.checkCallingOrSelfPermission(String permission), which can return the PackageManager.PERMISSION_GRANTED or PackageManager.PERMISSION_DENIED flags, implementing Permission Enforcement.

Binder Reference Security
If two processes have to communicate, the client process must send to the kernel (through an ioctl call) the data and a reference to the server process' Binder object.

Figure 2.17: Binder Permission Enforcement [10]

When a Binder reference is sent through a Parcel (written with the Parcel.writeStrongBinder() method and read back by the server), the server can be sure that the client process really holds that particular Binder and has the grants to use it. This mechanism prevents fake client processes from sending ad-hoc references to deceive the server process by presenting Binder references they do not possess. Binder references, called Binder Tokens, are unique and assigned by the driver, thus preventing false client identities; with this mechanism a Binder reference cannot simply be forged or guessed, so possessing one is itself proof of having legitimately received it.

Binder Tokens: each Binder object maintains a unique identity across process boundaries, guaranteed by the Binder kernel driver, which assigns a unique 32-bit integer value to each object in each transaction it sees, maintaining in this way a cross-process identity. A Binder object reference can be either:

- a virtual memory address of a Binder object in the same process, or
- a 32-bit handle to a Binder object in another process.

On each transaction it sees, the Binder driver maps local addresses to remote Binder handles and vice versa; the driver maintains this mapping between processes. The uniqueness of a Binder identity across process boundaries is what allows it to be used as a security access token.
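Before moving to the Binder Token example that follows, a minimal sketch shows how a Service can combine the APIs listed above to enforce a permission on its callers; the permission string and the method are hypothetical.

import android.app.Service;
import android.content.Intent;
import android.content.pm.PackageManager;
import android.os.Binder;
import android.os.IBinder;

// Hedged sketch: a Service method that refuses to act unless the calling
// process holds a given permission. Names are invented for the example.
public class GuardedService extends Service {

    private static final String REQUIRED_PERMISSION =
            "com.example.permission.USE_GUARDED_SERVICE";

    void doPrivilegedWork() {
        // checkCallingOrSelfPermission() consults the identity that the
        // Binder driver recorded for the current transaction.
        if (checkCallingOrSelfPermission(REQUIRED_PERMISSION)
                != PackageManager.PERMISSION_GRANTED) {
            throw new SecurityException("UID " + Binder.getCallingUid()
                    + " lacks " + REQUIRED_PERMISSION);
        }
        // ... privileged work would go here ...
    }

    @Override
    public IBinder onBind(Intent intent) {
        return null; // binding is out of scope for this sketch
    }
}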
Figure 2.18: Binder Token Object Reference [10]

In the following we consider an application that wants to acquire a wake lock from the PowerManager service [26]:

Listing 2.11: Client application acquiring a wake lock on the PowerManager service [26]

public class MyActivity extends Activity {
    private PowerManager.WakeLock wakeLock;

    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        PowerManager pm = (PowerManager) getSystemService(Context.POWER_SERVICE);
        wakeLock = pm.newWakeLock(PowerManager.PARTIAL_WAKE_LOCK, "My Tag");
        wakeLock.acquire();
    }

    @Override
    protected void onDestroy() {
        super.onDestroy();
        wakeLock.release();
    }
}

The following analysis of the PowerManager service source code helps to better understand the Binder Token mechanism [26]:

Listing 2.12: PowerManager source code [26]

public final class PowerManager {
    // Our handle on the global power manager service.
    private final IPowerManager mService;

    public WakeLock newWakeLock(int levelAndFlags, String tag) {
        return new WakeLock(levelAndFlags, tag);
    }

    public final class WakeLock {
        private final IBinder mToken;
        private final int mFlags;
        private final String mTag;

        WakeLock(int flags, String tag) {
            // Create a token that uniquely identifies this wake lock.
            mToken = new Binder();
            mFlags = flags;
            mTag = tag;
        }

        public void acquire() {
            // Send the power manager service a request to acquire a wake
            // lock for the application. Include the token as part of the
            // request so that the power manager service can validate the
            // application's identity when it requests to release the wake
            // lock later on.
            mService.acquireWakeLock(mToken, mFlags, mTag);
        }

        public void release() {
            // Send the power manager service a request to release the
            // wake lock associated with 'mToken'.
            mService.releaseWakeLock(mToken);
        }
    }
}

1. The client requests an instance of the PowerManager class, responsible for managing the device power state, in its onCreate() method.
2. The client application creates and acquires a wake lock; the PowerManager sends the unique Binder Token of the wake lock as part of the acquire() request.
3. The client application releases the wake lock in onDestroy(); the PowerManager again sends the Binder Token of the wake lock as part of the request.

When the PowerManager service receives the release request, it compares the token against all the wake locks it has stored and releases the wake lock only if it finds a match, ensuring that malicious applications cannot trick the PowerManager service into releasing the wake locks of other applications. A special kind of Binder Token is the Window Token, which the WindowManager uses to uniquely identify windows in the system. The WindowManager requires each application to pass its Window Token as part of a request; in this way a malicious application cannot draw on top of another application. If a token does not match, the WindowManager throws a BadTokenException. Binder Tokens are used extensively in the Android platform to achieve secure component communication.

2.1.3 Android Application Security

2.1.3.1 The Android Permission Model and Protected APIs

Android applications are installed from a single apk (Android Application Package) file, an archive file format that is a variant of the jar format. In general, it contains the following resources:

• META-INF/
  – MANIFEST.MF: the Manifest file.
  – CERT.RSA: the application certificate.
  – CERT.SF: the list of resources and the SHA-1 digest of the corresponding Manifest entries:

Listing 2.13: WhatsApp.SF

Name: res/drawable-xhdpi-v4/mic_background_incoming_normal.png
SHA1-Digest: 8wEzOIIOCx4L0R/E4/3Dy6IjXDs=

Name: res/drawable-hdpi-v4/common_signin_btn_icon_pressed_light.9.png
SHA1-Digest: X4WAf4UWRpI4GENKGJqxId5qqTI=

Name: lib/armeabi-v7a/libcurve25519.so
SHA1-Digest: cT2rnF/ZorIjUdANe5vZoJE0SfQ=

Name: res/drawable-hdpi-v4/ic_call_decline_normal.png
SHA1-Digest: ZGoPYtfJXZH3VSpIBcqulYw6zO4=

• lib: this folder contains the compiled native code for the specific ABIs used by the application (armeabi, armeabi-v7a, x86, mips).

• Resource files.

• AndroidManifest.xml: the AndroidManifest.xml tells the system the definitions of all high-level components (activities, services, broadcast receivers and content providers). It is very important for the application security of the system and for how users understand the application, because it specifies which permissions the application requires.

• classes.dex: all the classes compiled for the Dalvik Virtual Machine, in the dex format.

Referring to [11], every application in Android runs in an Application Sandbox in a privilege-separated way and, for this reason, it can only access a limited set of system resources, since the system manages access to the resources. The system provides a fine-grained security mechanism that can restrict the operations an application can perform. Applications must explicitly share their resources and data, and must statically and explicitly declare the permissions they need in the AndroidManifest file. Android has no mechanism to grant permissions at run-time.

2.1.3.2 Application Signing Permission Enforcement

All apks must be signed with a certificate whose private key is owned by the developer. Certificates are not signed by a CA but are self-signed with the developer's private key. This process is called "Application Signing"; it permits granting or denying requests when an application asks for the same identity as another application. When an application is installed, the system gives it a unique UID, valid for the whole application life cycle on that device; on another device the UID could be different for the same application. The system enforces security at the process level: two distinct apps signed with two different private keys cannot run in the same process space with the same UID. To permit this, it is first necessary to define inside the AndroidManifest.xml the android:sharedUserId attribute, which specifies the name of a user ID to be shared with other apps. Setting this value, each application is given the same UID, but for this to work the applications must be signed with the same private key. Normally each application's data is assigned that application's UID, so that the data cannot be accessed (in read/write) by other applications with different UIDs; if an application shares the same UID with others, after being signed with the same key, it also shares data and resources.

Enforcing Permissions

An Android application has no permissions by default: it cannot do anything that impacts the data and protected resources of the device. To be granted a permission, an application must include the corresponding permission declarations inside the AndroidManifest file with the <uses-permission> tag. Some examples of specific permission declarations follow. Allow an application to receive and read SMS:
Listing 2.14: RECEIVE_SMS permission

<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.android.app.myapp" >
    <uses-permission android:name="android.permission.RECEIVE_SMS" />
    ...
</manifest>

Allow an application to access fine-grained location through the GPS, cell towers and WiFi:

Listing 2.15: ACCESS_FINE_LOCATION permission

<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.android.app.myapp" >
    <uses-permission android:name="android.permission.ACCESS_FINE_LOCATION" />
    ...
</manifest>

Allow an application to call into AccountAuthenticators:

Listing 2.16: ACCOUNT_MANAGER permission

<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.android.app.myapp" >
    <uses-permission android:name="android.permission.ACCOUNT_MANAGER" />
    ...
</manifest>

Permissions requested by an application are granted explicitly by the user through the package installer at install time, and no checks are done while the application is running. If the running application requests a particular feature for which it does not have permission, the result is a SecurityException, meaning a permission failure, thrown back to the application and logged. Prior to the installation of any application, the user is shown a clear message about the different permissions the application is requesting; it is also possible to review an application's permissions after installation through the "Settings" menu:

Figure 2.19: Display of permissions for applications [11]

Android sandboxes all applications by default; for this reason applications must explicitly share resources and data, which they do by declaring the permissions they need in the AndroidManifest. When an app is installed it is assigned a unique UID with the associated permissions; Android permissions are typically implemented by mapping them to Linux groups, and all components inside the app have the same permissions [31]. When the app includes third-party components such as in-app advertisement libraries, plug-ins, social network APIs, etc., these components also inherit those permissions.

Protecting Security-sensitive application's IPC with the Signature protection level

Android permissions fall into four levels:

• Normal: these permissions cannot cause real harm to the user (e.g. changing the wallpaper) and, while apps need to request them, they are automatically granted.

• Dangerous: these can cause real harm (e.g. calling numbers, opening Internet connections, etc.) and apps need to request them with user confirmation.

• Signature: these are automatically granted to a requesting app if that app is signed with the same certificate (and thus developed by the same entity) as the one that declared or created the permission. This level is designed to allow apps that are part of a suite, or otherwise related, to share data.

• Signature/System: same as Signature, except that applications in the system image are also granted the permission automatically. This level is designed for use by device manufacturers only.

In Android it is possible to use standard permissions for the whole application or to enforce permissions only on the desired components. It is also possible to create Custom Permissions, used by developers to restrict access to various services/components, and any application
that interacts with another application's component would need to possess the required permission [12]. High-level permissions restricting access to entire components of the system or application can be applied through the AndroidManifest file. All this requires is including an android:permission attribute on the desired component:

• Activity permissions restrict who can start the associated activity:

Figure 2.20: Securing Activities with Custom permission [12]

• Service permissions restrict who can start or bind the associated services. To protect the whole service, permissions are defined in the AndroidManifest; to protect individual methods (a fine-grained approach) we can use the Context::check*Permission() set of APIs: checkCallingPermission() or checkCallingOrSelfPermission():

Figure 2.21: Securing Services with Custom permission [12]

• BroadcastReceiver permissions restrict who can send broadcasts to the associated receiver; a permission failure does not throw an exception, the intent is simply not delivered. One must also consider who can receive one's broadcasts:

Figure 2.22: Securing BroadcastReceiver with Custom permission [12]

• ContentProvider permissions restrict who can access the data in a ContentProvider. It also has an additional security facility available, called URI permissions:

Figure 2.23: Securing ContentProvider with Custom permission [12]

In this way it is possible to secure individual Android components:

Listing 2.17: Securing Android components

<application android:permission="com.permission.XXX" ... >
<activity android:exported="false" ... >
<service android:permission="com.permission.XXX" ... >
...

To protect security-sensitive IPC, for example in the case of ContentProviders, the developer can use the <permission> element. It is possible, for example, as suggested by the official documentation, to consider using the Signature protection level on permissions for IPC between applications by a single developer. Each permission has an associated "protection level" that indicates how the system proceeds when deciding whether to grant or deny the permission. Signature permissions only allow access by applications signed by the same developer: it is first necessary to define the android:protectionLevel attribute of the <permission> element with the "signature" value; the system then grants the permission only if the requesting application is signed with the same certificate as the application that declared the permission. Let us analyze the GMail app case. The GMail app by Google had a weakness until version 2.3.5: it did not require a signature-level permission for reading mail, which means that a third-party (evil) application could read email from the GMail ContentProvider just by declaring the appropriate permission in its Manifest. The enforced permission is com.google.android.gm.permission.READ_GMAIL, which since version 2.3.5 includes the signature-level enforcement:

Figure 2.24: GMail AndroidManifest prior to 2.3.5 without signature-level enforcing

Figure 2.25: GMail AndroidManifest after 2.3.5 with signature-level enforcing

Protecting Security-sensitive application's IPC with URI permissions

To protect a ContentProvider it is possible to define URI permissions. Permissions can apply to a specific content URI or to the whole provider itself.
When starting an activity or returning a result to an activity, the caller can set, for example, Intent.FLAG_GRANT_READ_URI_PERMISSION or Intent.FLAG_GRANT_WRITE_URI_PERMISSION; this grants the receiving activity permission to access the specific data URI in the Intent. This fine-grained, per-URI mechanism is enabled through the android:grantUriPermissions attribute or the <grant-uri-permission> tag. First decide which URIs you want to dynamically grant access to:

- All data: android:grantUriPermissions="true"
- Specific URIs: android:grantUriPermissions="false" plus <grant-uri-permission> elements:

Listing 2.18: URI permissions example

<provider android:name="com.example.test.MailProvider"
    android:authorities="com.example.test.mailprovider"
    android:readPermission="com.example.test.permission.DB_READ"
    android:writePermission="com.example.test.permission.DB_WRITE">
    <grant-uri-permission android:path="/attachments/" />
</provider>

In this way it is possible for a third-party application to read or write only the attachments handled by the mail app's ContentProvider and not the whole email data.

Figure 2.26: Securing with URI permissions [12]

2.2 Malicious Android apps

"There is a massive growth in the volume of malware families and samples..."
from the Symantec Mobile Adware and Malware Analysis report

2.2.1 An Overview: Botnet, Data collectors and Madware

Android has become the leading smartphone operating system in the world. Android continues to dominate the global smartphone market, with over 255 million units shipped and nearly 85% of the market share in the second quarter of 2014, as IDC reported, and for this reason it has become a favorite target for cybercriminals. Cybercriminals have started to use Internet marketing to promote and sell their services on the black market. New schemes are based on making money by capturing SMS messages used for online banking logins, or by sending premium-rate SMS messages without the users' knowledge. For instance, incautious users have downloaded pirated copies of paid Android apps which were infected with trojans, or have been tricked through social engineering techniques, such as fake antivirus software, into paying to get rid of non-existent malware. McAfee, in its "Mobile Consumer Threats Report 2014", finds that privacy-invading apps dominate the landscape, many of them leveraging ad libraries that go hand in hand with malware.

Figure 2.27: Apps collect your Information [13]

Many apps use the unique device identifier (IMEI, MEID, ESN, or IMSI) to track the user (and their bot clients) through the device; watch for an app requesting the READ_PHONE_STATE permission [13]. The McAfee report also shows that the new mobile malware analyzed collects personal data more aggressively than regular apps:

Figure 2.28: Apps collect your Information [13]

Also interesting from the report is which kinds of apps are abusing data collection (relative to the period May–December 2013):

Figure 2.29: Which apps are abusing [13]

The following shows how cybercriminals can make money from mobile malware. For instance, a compromised Android device can participate in dangerous botnet activities such as launching DDoS attacks, click fraud, or sending SMS to premium-rate numbers. A compromised device can also be used to steal data such as account details, call logs and the IMEI; it
can also steal the user's money by stealing transaction authentication numbers (TANs), through extortion via ransomware, or with fake antivirus software. It can also be used to monitor users' activities through the microphone, camera, GPS, etc. Behind all these activities there are billions in data-breach costs for companies and governments and billions in earnings for cybercriminals.

Aggressive Ad Libraries: MADWARE

Based on the Symantec report [32], madware refers to apps that use aggressive ad libraries. There are at least 65 known ad libraries and over 50 percent of them are classified as aggressive. The percentage of madware on Google Play is steadily increasing, reaching over 23 percent in the first half of 2013. The data collected enables retailers to target users with ads and promotions based upon their location; the more precise the targeting, the more personal data it takes. Based on the McAfee report [13], some ad libraries are more aggressive than others; the LeadBolt ad library is shown to be the most aggressive one in invading privacy:

Figure 2.30: Ad libraries' privacy scores [13]

Based on the McAfee analysis [13], however, we can notice that invasive data collection and malware do not correlate perfectly: some ad libraries are not used very often by malware authors, while some others are only brought in by malware:

Figure 2.31: Ad libraries brought to you by malware [13]

Cryptocurrency Mining

According to the Sophos mobile threat report, another trend started in 2012: the use of mobile botnets to "mine" Bitcoins. Mining is a way to earn Bitcoins and, because it is incredibly resource-intensive, the process can severely run down a phone's battery and eat through a data plan by periodically downloading what is known as the block chain. For this purpose, from May 2012 to February 2013, and again for another 3 weeks in April 2013, the mobile clients infected by the ZeroAccess botnet were enrolled with this goal: earning Bitcoins. For an unknown reason, the ZeroAccess botnet then stopped providing this functionality. Lookout's researchers calculated that mining for 24 solid hours on a Samsung Galaxy SIII would only earn 0.00000007 Bitcoin, or $0.00004473. In order to make just one Bitcoin in a day, Lookout says you would need 14,285,714 phones working full-tilt simultaneously. Marc Rogers, Principal Security Researcher at Lookout, explains: "In order to control the rate at which new digital coins are minted, the software that runs the currency sets a difficulty rate which governs just how much processing power you need to expend in order to solve the blockchain and get new coins. The difficulty for Bitcoin is so tough right now that a recent mining experiment using 600 quadcore servers was only able to generate 0.4 bit coins".

Table 2.1: Cryptocurrency Mining Difficulty Rates

Currency     Difficulty
Bitcoin      4,250,217,919.87
Litecoin     5,162.40
Dogecoin     1,151.55
Casinocoin   1.62557361

Because of the difficulty rate of mining, malware such as CoinKrypt currently targets the likes of Litecoin, Dogecoin and Casinocoin, but not Bitcoin. It is almost one million times easier to mine Litecoin than Bitcoin, and over 3.5 million times easier to mine Dogecoin.
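As a rough sanity check of the Lookout figure quoted above (one Bitcoin per day would require about 14 million phones), and assuming that earnings scale linearly with the number of devices:

\[
\frac{1\ \text{BTC/day}}{0.00000007\ \text{BTC/day per phone}} \approx 14{,}285{,}714\ \text{phones}
\]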
2.2.2 Android Malware Characterization

A reference study on Android malware characterization is "Dissecting Android Malware: Characterization and Evolution" [14]. In the following, a malware characterization based on this research work is provided. In that paper the authors focus on the Android platform and aim to systematize and characterize existing Android malware. The research collects more than 1,200 malware samples covering the majority of malware families:

Table 2.2: Android Malware Families [14]

Discovered   Families
2010-08      FakePlayer, GPSSMSSpy, TapSnake
2010-11      SMSReplicator
2010-12      Geinimi
2011-02      ADRD, Pjapps
2011-03      BgServ, DroidDream, Walkinwat, zHash
2011-05      DroidDreamLight, Endofday, Zsone
2011-06      BaseBridge, DroidKungFu1, GGTracker, jSMSHider, Plankton, YZHC
2011-07      Crusewin, DroidKungFu2, GamblerSMS, GoldDream, HippoSMS, Lovetrap, Nickyspy, SndApps, Zitmo
2011-08      CoinPirate, DogWars, DroidKungFu3, GingerMaster, NickyBot, RogueSPPush
2011-09      AnserverBot, Asroot, DroidCoupon, DroidDeluxe, Gone60, Spitmo
2011-10      BeanBot, DroidKungFu4, DroidKungFuSapp, DroidKungFuUpdate, FakeNetflix, Jifake, KMin, RogueLemon

2.2.2.1 Malware Installation

By analyzing the malware samples in that collection, the authors categorize the existing ways Android malware uses to install onto user phones and generalize them into three main social-engineering-based techniques: repackaging, update attack and drive-by download. These techniques are not mutually exclusive.

Repackaging

Repackaging is one of the most common techniques malware authors use to piggyback malicious payloads onto popular applications. Malware authors find and download popular apps, disassemble them, insert the malicious payload, repackage the app, sign it with a new key and then submit the new app to the official or an alternative Android market. The payload is typically initiated by an event such as the application starting or a phone call being received. In this research, among the 1,260 malware samples, 1,083 of them (or 86.0%) are repackaged. Also, possibly in an attempt to hide the piggybacked malicious payload, malware authors tend to use class-file names which look legitimate and benign: for instance, the AnserverBot malware uses the package name com.sec.android.provider.drm for its payload, which looks like a module providing legitimate DRM functionality. One malware family uses a publicly available private key that is distributed in the Android Open Source Project (AOSP); the current Android security model allows apps signed with the same platform key as the phone firmware to request privileged permissions, for example the installation of additional apps without user intervention [14].

Update Attack

Instead of enclosing the payload as a whole, as in the repackaging technique, the app only includes an update component which downloads the malicious payload at runtime. In the research dataset [14] there are four malware families, BaseBridge, DroidKungFuUpdate, AnserverBot and Plankton, that adopt this attack.
For example, a BaseBridge-infected app checks for updates and displays an update dialog box to the user; if the user accepts, an "updated" version containing the malicious payload is then installed. The DroidKungFuUpdate malware notifies the user through a third-party library that provides the notification functionality. These two examples update the whole app, while AnserverBot and Plankton update only some components, without user approval: AnserverBot downloads its payload from the C&C server, while Plankton fetches the component directly from a remote server.

Figure 2.32: An Update Attack from BaseBridge [14]

Drive-by Download

This kind of malware is installed by enticing users to download "rich" or "interesting" apps. The GGTracker malware starts from its in-app advertisement: if a user clicks on this advertisement, he is redirected to a malicious web site which, for example, claims to be analyzing the battery usage of the user's phone and then redirects the user to a fake Android Market to download an app claimed to improve battery efficiency [14]. The downloaded app is in fact malware that sends SMS messages to premium-rate numbers. Spitmo and ZitMo are ports of the PC malware SpyEye and Zeus: when a user is doing online banking on a compromised PC, the user is redirected to download a particular smartphone app, which is claimed to better protect online banking activities. However, the downloaded app is actually malware, which can collect and send mTANs or SMS messages to a remote server. These two malware families rely on compromised desktop browsers to launch the attack; in this way the attacker gains the ability to perform bank transactions.

Others

The research dataset [14] consists of 1,083 repackaged apps, which leaves 177 apps. The authors classify these remaining apps as spyware or as including other malicious functionality. For instance, GPSSMSSpy is an example that listens for SMS-based commands to record and upload the victim's current location. FakeNetflix is not the real Netflix app but deceives the user with the same UI as the real app, stealing Netflix users' credentials. FakePlayer is another example that masquerades as a movie player but instead sends SMS to premium-rate numbers, while RogueSPPush is an astrology app that automatically subscribes the user to premium-rate services by hiding the confirmation SMS messages [14].

2.2.2.2 Activation

Android malware relies on system-wide events to trigger its malicious payload. BOOT_COMPLETED is the event of most interest to existing Android malware: in the dataset, 29 malware families (covering 83.3% of the samples) listen to this event. Geinimi listens to this event to bootstrap the background service com.geinimi.AdService. SMS_RECEIVED comes second, with 21 malware families; for example, zSone listens to this event and intercepts or removes all SMS messages from particular originating numbers such as "10086" and "10010". Some malware samples directly hijack the entry activity of the host app, which is triggered when the user clicks the app icon on the home screen or when an intent with action ACTION_MAIN is received by the app [14].

2.2.2.3 Malicious Payload

Android malware can also be characterized by its payload; there are four different categories: privilege escalation, remote control, financial charges, and personal information stealing.

Privilege Escalation

Android is a complex system built not only on the Linux kernel but also on more than 90 libraries such as WebKit, OpenSSL, SQLite, etc.
This complexity introduces software vulnerabilities that can potentially be exploited to facilitate the execution of the payloads. For instance, DroidKungFu does not embed the root exploit in the clear but first encrypts the exploit as an asset image and then, at runtime, decrypts this file and executes it; in fact, when DroidKungFu was discovered, no existing mobile antivirus was able to detect it.

Remote Control

There are 1,172 samples (93.0%) that turn phones into bots. They receive commands over HTTP traffic from C&C servers, and most of them also encrypt the C&C addresses. DroidKungFu3 employs the standard AES encryption scheme to hide its C&C servers; Geinimi similarly applies the DES encryption scheme to encrypt its communication with the remote C&C server. Often C&C servers are hosted under domains controlled by the attackers, but there are cases where servers are hosted on public clouds: for instance, the C&C servers of the Plankton malware are hosted on the Amazon cloud.

Financial Charge

Another motivation behind malware infections is financial charge. One way is to subscribe to attacker-controlled premium-rate services by silently sending SMS messages to them. For instance, FakePlayer sends the SMS message "798657" to multiple premium-rate numbers in Russia, GGTracker silently subscribes users to a premium-rate service in the US, and the zSone malware sends premium-rate SMS to a service located in China, all without the user's consent. 4.4% of the dataset sends SMS to premium-rate numbers. To successfully subscribe to and activate premium-rate services, users must confirm an SMS sent by the provider; to avoid users being notified, RogueSPPush, for instance, automatically replies "Y" to such incoming messages in the background, while GGTracker replies "YES" to one premium number, 99735, to activate the subscribed service.

Information Collection

Some malware harvests various information from the infected phones, such as SMS messages, contacts and user accounts. In the dataset there are 13 malware families (138 samples) that collect SMS messages, 15 families (563 samples) that gather phone numbers, and 3 families (43 samples) that obtain and upload information about user accounts [14]. For instance, the SndApps malware collects users' email addresses and then sends them to a remote server, while both Zitmo and Spitmo attempt to intercept SMS verification messages and then upload them to a remote server to perform fraudulent actions.

2.2.2.4 Permission Used

Malware without root privileges relies on the Android permission model. It is interesting to compare the top permissions requested by malicious Android apps with the top permissions required by benign ones. For instance, the INTERNET, READ_PHONE_STATE, ACCESS_NETWORK_STATE and WRITE_EXTERNAL_STORAGE permissions are widely requested by both malicious and benign apps, but malicious apps tend to request the SMS-related permissions, such as READ_SMS, WRITE_SMS, RECEIVE_SMS and SEND_SMS, more frequently. Most malware requests the RECEIVE_BOOT_COMPLETED permission; this number is five times that of benign apps, due to the fact that malware often starts background services without the user's intervention. Another interesting observation is that malicious apps tend to request more permissions than benign ones [14].

Figure 2.33: Comparison of the top 20 permissions requested by malicious and benign Android apps [14]

Figure 2.34: An Overview of existing Android Malware (PART I: INSTALLATION AND ACTIVATION) 1 of 2 [14]
Figure 2.35: An Overview of existing Android Malware (PART I: INSTALLATION AND ACTIVATION) 2 of 2 [14]

Figure 2.36: An Overview of existing Android Malware (PART II: MALICIOUS PAYLOADS) 1 of 2 [14]

Figure 2.37: An Overview of existing Android Malware (PART II: MALICIOUS PAYLOADS) 2 of 2 [14]

2.2.3 Evolutions and Challenges

Based on the study "Dissecting Android Malware: Characterization and Evolution" [14], we observe that, within the rapid growth of Android malware, some variants make use of stealth and advanced techniques to avoid detection, demonstrating malware evolution. For instance, let us analyze the DroidKungFu case with all its variants. The first version of this malware was detected in June 2011 and, at that time, was considered one of the most complex Android malware samples. Some months later researchers found other variants of the DroidKungFu malware: in October 2011 they discovered the fourth version, DroidKungFu4, and some months later again they discovered the fifth variant, DroidKungFuSapp. Another variant, called DroidKungFuUpdate, was also discovered. Among these six variants, four contain encrypted root exploits, located under the /assets directory and executed at runtime. Different variants use different keys to encrypt the exploit in order to avoid detection: in DroidKungFu1 the file containing the root exploit is named "ratc", while in DroidKungFu2 and DroidKungFu3 the file is named "myicon". All variants of the DroidKungFu malware receive commands from C&C servers, but they change the way the C&C addresses are stored: in DroidKungFu1 the C&C server addresses are stored in plaintext in a Java class, in DroidKungFu2 the addresses are stored in plaintext in a native class, in DroidKungFu3 the addresses are stored encrypted in a Java class, and in DroidKungFu4 they are encrypted in a native class. If the rooting is successful, a shadow app is installed; DroidKungFu can then access arbitrary files on the phone and has the capability to install or remove any package. One built-in payload of DroidKungFu is the installation of this embedded app: in DroidKungFu1 the embedded app shows a fake Google Search icon, while in DroidKungFu2 the embedded app is encrypted and does not display any icon on the phone. To make detection more difficult, DroidKungFu uses encryption to hide C&C server addresses, constant strings, native payloads and shadow apps, rapidly changes encryption keys, and makes extensive use of JNI interfaces to hinder static analysis; for instance, DroidKungFu2 and DroidKungFu4 use JNI to communicate with remote servers and fetch commands [14]. There are other mechanisms implemented by modern malware which make the analysis very challenging. Another example is the AnserverBot malware: it contains the names of three mobile AV products, com.qihoo360.mobilesafe, com.tencent.qqpimsecure and com.lbe.security, and attempts to match them against the apps installed on the phone; if one of the three AVs is detected, AnserverBot stops itself unexpectedly. New Android malware thwarts software detection using obfuscation through encryption, dynamic loading or modification of code; in this way an application's binary code before and after an execution can be different.
For instance, malware authors can use native code through JNI, executed directly by the processor and not by the DVM, making dynamic code manipulation possible. Android vulnerabilities are used by criminals to elevate the rights of malicious applications, which extends their capabilities and makes it more difficult to remove malicious programs. For example, to bypass the Android integrity check, malware uses the "Master Key vulnerability", embedding unsigned executable files in Android installation packages.

Chapter 3

Malware Detection Methodologies

In this chapter we discuss a number of detection methodologies available in the literature; as we will see, there is a substantial amount of still unsolved problems. One primary line of defense is given by the security architecture of the device, for example by the permission system, which restricts app privileges. The user can grant the privileges to the app or not, sometimes without fully understanding what each permission means. This behaviour can lead to serious risks like data exfiltration and data leakage, such as one's location, the list of contacts or the SMS messages. If a malware sample gets its way into a device, it is not clear how its presence can be detected [33]. In the following we also analyze AV technologies and their limits in detecting malware. Signature-based anti-malware techniques have some limitations: they can only detect malware for which a signature is available and are less effective at detecting polymorphic and metamorphic viruses. Anomaly-based approaches are promising but cannot be efficiently adapted to smart devices [15]. More recent technologies are not just signature-based but use a hybrid approach. We also evaluate commercial mobile anti-malware products for Android and test how resistant they are against various common obfuscation techniques. Some papers go in the direction of exploring call graphs for malware variant analysis. The real challenge of using graphs is to show similarity in polynomial time: two graphs are isomorphic if they share the same structure and the same functional properties, but the isomorphism problem is computationally expensive and cannot, in general, be solved in polynomial time with known algorithms. To address this, there are some interesting researches in the literature that use features of the graph instead of the whole inter- or intra-procedural graph, which greatly simplifies the isomorphism problem, considering first that the malicious functionality of an Android application often concentrates on only a small number of its functions and, second, that the malware landscape is characterized by payload reuse.

3.1 Malware Detection Techniques

In the following we analyze the limits and weaknesses of the techniques available in the literature; we will refer to the paper "Detection of Mobile Malware in the Wild" [19] and to the survey "Smartphone Malware Detection: From a Survey Towards Taxonomy" [15]. A malware detector is a system responsible for determining whether a program has a malicious behaviour. In [15] the authors classify smartphone malware detection techniques according to three well-defined rules:

Reference Behaviour Rule
Malware detectors identify malicious behaviours using a reference behaviour; according to this rule, detection techniques are classified into two main groups: Signature-based and Anomaly-based. Signature-based techniques use a known malware signature or pattern, while Anomaly-based techniques use a "normal" behaviour defined in a "training" phase and use this normal model to identify malicious programs [15].
Analysis Approach Rule
According to this rule, there are two different analysis approaches: static analysis and dynamic analysis. Static analysis studies program behaviour without executing the program; this kind of analysis is characterized by several stages: app unpacking, disassembling, and analysis with feature extraction. Dynamic analysis studies program behaviour by executing the program in an isolated environment such as a VM or an emulator, collecting at runtime information such as events and system calls and monitoring program behaviours.

Malware Behaviour Representation Rule
Malware detection techniques use a variety of program behaviour representations, for example hash values, hexadecimal bytes, sequences of run-time events and structural properties. Malware detection techniques can be classified into Static-signature and Behaviour-signature; behaviour signatures can in turn be classified into Static-behaviour and Dynamic-behaviour [15].

Figure 3.1: Malware detection Technologies Taxonomy [15]

3.1.1 Signature-based Detection Techniques

These detection techniques model the malicious behaviours of malware in the form of signatures, unique values that indicate the presence of malicious code. These signatures are stored in a database and are used in the detection process. To be effective, the signature database must be regularly updated, and this can lead to a "gap" in which malware is not recognized because the signature database has not been updated yet. The signature can be static or behavioural, according to the Malware Behaviour Representation Rule.

Static-Signature based technique
The most commonly used signatures are a series of bytes in the file or a cryptographic hash of the file or of its sections; this is the technique used by commercial AV software. For instance, a very common signature in this technique is the byte signature of the malware, consisting of a sequence of hexadecimal bytes/bytecode characteristic of that malware. Another signature is the hash signature, created by a hash function such as MD5 or SHA-1 over particular instructions; the main drawback is that if a string or an instruction changes by even one bit, the hash signature changes. Static-signature based techniques [15] are very efficient and reliable in the detection of known malware and help malware detectors run faster while consuming few resources. These techniques can be circumvented by using packers, obfuscation techniques or self-modifying code, that is, by recycling malware with a different signature.

Behaviour-Signature based technique
Behaviour signatures use semantic and dynamic interpretation to detect malware and are resilient against self-modifying, encrypted or packed code. The Analysis Approach Rule classifies behaviour signatures into static behaviour signatures and dynamic behaviour signatures. The static behaviour signature technique is based on static code analysis, extracting features and information from a given executable file without executing it. The main advantage is, for example, the ability to detect an entire family of malware variants with one signature by enumerating a priori all the execution paths; the limit is that feature extraction can be a computationally expensive process, because of the disassembling and the complex classification algorithms applied afterwards.
The dynamic behaviour signature technique, on the other hand, is based on dynamic feature extraction, monitoring the "real" behaviour of the malware to model a behaviour signature to be compared in the detection process:

Figure 3.2: Dynamic Signature Extraction [15]

Dai et al. [34] proposed a technique that uses API interception to build an FSA, while a PDA is used to describe the candidate code samples to be analyzed; this technique matches malicious signatures by computing the intersection between the FSA and the PDA and is able to detect malware even after packing. Bose et al. [35] proposed an approach that models application behaviours using temporal logic of causal knowledge (TLCK): a behaviour monitor agent observes the application behaviour to construct behaviour signatures online from API calls. They use an SVM machine-learning classifier to detect malware; the technique, tested on a Symbian phone, reaches more than 96% accuracy. In another work, Kim et al. [36] proposed a signature-based power detection approach: a power monitor agent monitors the mobile device, taking samples of power consumption which are used to build power consumption models; the data analyser then receives the power consumption history from the monitor agent and extracts a unique pattern to derive a power signature. This power signature is then compared with the signatures contained in a database; the approach has a high detection accuracy (up to 95%) and a high true positive rate (99%) in classifying mobile malware. Dixon and Mishra [37] also proposed a power-signature based malicious code detection technique that models individual power consumption profiles and focuses on abnormalities in power consumption introduced by malware.

3.1.2 Anomaly-based Detection Techniques

Anomaly-based detection techniques are based on two stages: the training stage and the classification stage. During the training stage, a profile of a system's normal behaviour is constructed; during the classification stage, these "normal" profiles are compared with the profiles of other applications to detect deviations from them. This technique can also detect 0-day attacks but, on the other side, it is very resource consuming in terms of memory usage, time and CPU computation, and an inadequate profile of the "normal" behaviour can lead to a high false positive rate.

Dynamic Anomaly-based technique
In the dynamic anomaly-based technique, the "normal" behaviour is modeled from the information and traces extracted during program execution. For instance, Buennemeyer et al. [38] propose a technique based on battery constraints to detect malware activity: the sensing system constantly monitors the device power consumption and detects anomalies when the values exceed the system's dynamic threshold value.
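As a minimal illustration of the thresholding idea behind such battery-based detectors, the following sketch flags a power-consumption sample that exceeds a dynamic threshold derived from the recent history; the moving-average model and the threshold factor are simplifying assumptions made for this example, not the actual scheme of [38].

import java.util.ArrayDeque;
import java.util.Deque;

// Toy dynamic anomaly detector: flag a reading when it exceeds a dynamic
// threshold computed from a sliding window of recent samples.
public class PowerAnomalyDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double thresholdFactor;

    PowerAnomalyDetector(int windowSize, double thresholdFactor) {
        this.windowSize = windowSize;
        this.thresholdFactor = thresholdFactor;
    }

    /** Returns true if the new sample (e.g. in mW) looks anomalous. */
    boolean isAnomalous(double sample) {
        double mean = window.stream().mapToDouble(Double::doubleValue)
                .average().orElse(sample);
        boolean anomalous = !window.isEmpty() && sample > mean * thresholdFactor;
        window.addLast(sample);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        return anomalous;
    }

    public static void main(String[] args) {
        PowerAnomalyDetector detector = new PowerAnomalyDetector(5, 1.5);
        double[] samples = {100, 105, 98, 102, 250, 101};
        for (double s : samples) {
            System.out.println(s + " mW -> " + (detector.isAnomalous(s) ? "ANOMALY" : "ok"));
        }
    }
}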
There are other research works in the literature using the dynamic anomaly-based technique; for instance, Schmidt et al. [39] proposed a framework to monitor a smartphone by collecting dynamic features describing the system state, such as CPU usage and the number of running processes, and sending all this information to the Remote Anomaly Detection System (RADS), a system which collects and stores the received features inside a database and runs machine learning algorithms to classify the behaviour as "normal" or "malicious".

Static Anomaly-based technique
In the static anomaly-based technique, static program features such as execution paths and structural properties are extracted and examined to find the "anomalous" behaviour. This technique detects malware before its execution. There are some interesting research works in the literature; for instance, Schmidt et al. [40] extracted static information, such as function calls, from Linux ELF files; the function calls are then compared with malware samples for classification using a Decision Tree Learner (DTL), the Nearest Neighbor (NN) algorithm and a Rule Inducer (RI), with 96% detection accuracy and 10% false positives.

3.1.3 Application Permission Analysis

There are also some works focusing on application permissions which, as we have seen, play an important role in the Android security model. One of them is Kirin [27], an application certification framework for Android: it extracts an application's security configuration and checks it against the security policy rules that it already has; if an application fails to pass all the security policy rules, Kirin can either delete it or alert the user. For instance:

Listing 3.1: Kirin policy example [27]

An application must not have the PHONE_STATE, RECORD_AUDIO, and INTERNET permission labels.

This rule ensures that an application cannot record audio and access the Internet while the user is talking on the phone. All the security rules are encoded with the Kirin Security Language (KSL). Testing Kirin with 311 popular applications revealed that the rules flagged 12 applications with eavesdropper-like characteristics. Another interesting work is that by Barrera et al. [41], which uses self-organizing maps (SOMs) to visualize the relationship between applications and the permissions they request, but without focusing on their implications. Barrera et al. discovered that 93% of the free and 82% of the paid applications had at least one dangerous permission request. In [16] Liu et al. proposed a two-layered permission-based detection scheme for detecting malicious Android applications based on a decision tree. The features they use are permission pairs, and a J48 classifier is used to classify the applications; they conclude that a permission-based malware detection mechanism, thanks to its high detection rate, can be used as a first-step filter to identify malicious applications.

Figure 3.3: Top 20 requested permissions with the most different request rates between the two datasets. The ordinate is the difference between the request rate in the malware dataset and the request rate in the benign dataset [16]

In [17] the authors propose a static analysis of Android malware files by mining prominent permissions. Permissions are extracted from the apks, and the samples are then divided into a training and a test set.
During the data preprocessing phase, from the training samples they determine all the permissions that are used by both benign and malware applications, and filter out the features that have the lowest probability of identifying the target classes. From these synthesized permissions they create four categories of permission sets [17]:

- Malware and benign features (M ∪ B)
- Common to malware and benign features (M ∩ B)
- Discriminant malware features (M \ B)
- Discriminant benign features (B \ M)

Feature selection is then performed to extract a subset of the k best features from a large feature space consisting of n features. They used Bi-Normal Separation (BNS) and Mutual Information (MI) as feature selection methods, with a feature length of 15 features.

Figure 3.4: Difference in the frequencies of 18 selected permissions in malware and benign .apk files [17]

As the evaluations report, this method is also useful for an initial malware classification.

Figure 3.5: Comparative analysis of the Bi-Normal Separation and Mutual Information feature selection methods [17]

In [18], Liang and Du proposed a malware detection scheme based on permission combinations. They developed a tool called K-map which inspects combinations of k (k ≥ 1) permissions instead of single permissions: K-map iteratively reads application package files as input and calculates the permission request frequency in the form of k (k = 1, 2, 3, ...) permission combinations [18]. K-map requires two kinds of datasets to generate the permission maps, i.e., malware samples and benign application samples. The rule-set selection process extracts the permission combinations that are requested more often by malware than by benign apps:

Figure 3.6: Top 5 permission combinations when k = 5 [18]

Figure 3.7: Top 5 permission combinations when k = 6 [18]

Another interesting work in the literature is [42]. Starting from the failures of existing static analysis approaches, Bartel, Klein and Le Traon present an advanced class-hierarchy and field-sensitive set of analyses to extract the mapping between the API methods of the framework and the permissions they require. For end-user applications, their evaluation revealed that 94/742 and 35/679 applications crawled from Android application stores indeed suffer from permission gaps (permissions declared but not used) [42].

3.1.4 Cloud-based Detection Analysis

Because of their limited computational power and energy budget, smartphones do not carry fully featured security mechanisms. Some cloud-based approaches in the literature therefore move the security analysis to the cloud, hosting numerous replicas of the Android emulator to perform behaviour-based analysis. For instance, Paranoid Android (PA) consists of a tracer, located on the smartphone, that records all the information required to replay the mobile application's execution. This execution information is transmitted to a cloud-based replayer, which replays the application execution and performs security checks such as dynamic analysis, memory scanning, system call anomaly detection and similar. PA also includes a proxy that caches information, preventing the phone from having to retransmit the data to the replayer.
PA incurs a significant overhead: it increases the CPU load by 15 percent and consumes 30 percent more energy during heavyweight tasks; for example, tracing a single system call such as read() takes 0.7 milliseconds in user space, but less than 0.1 milliseconds in the kernel [19]. Another example is Crowdroid, which consists of a lightweight mobile application that monitors system calls and sends them to the cloud, where a machine learning clustering technique determines the malicious or benign nature of the app. If the sample size is very small, the false positive rate is high. An important privacy implication of these techniques is that users have to send their applications' behaviour to a third party.

Figure 3.8: Cloud-based malware protection techniques: (a) Paranoid Android and (b) Crowdroid [19]

3.2 Malware transformation

This paragraph addresses the transformations that may be applied to a malware sample to avoid detection while preserving its malicious behaviour. This is an important test by which the effectiveness of Android anti-malware tools can be evaluated. The transformations are classified on the basis of the work "Catch Me If You Can: Evaluating Android Anti-Malware Against Transformation Attacks" [20], whose authors classify malware transformations into three categories:

- trivial: transformations which do not require code-level changes;
- DSA: transformations which result in variants that can still be detected by static analysis;
- NSA: transformations which can render malware undetectable by static analysis.

Let us examine each transformation class [20], starting from the trivial ones.

Repacking

Android applications are signed jar files: they can be unzipped and then reassembled with the tools available in the Android SDK. Once repacked, an Android application must be signed again, but since the original developer's key is not available, it is signed with a custom key. Any detection methodology that matches the developer's key, or matches the checksum of the entire package, is ineffective against this kind of transformation.

Disassembling and Reassembling

The compiled Dalvik bytecode is shipped in the dex format. A dex file can be disassembled and then reassembled, and the various items inside it (classes, methods, strings, and so on) can be re-arranged and re-ordered; for this reason, methods that check the whole dex file can be evaded by this transformation.

Changing Package Name

Each application is identified by a unique package name defined in the Android manifest file. This package name can be changed to anything we want, so methods based on package-name detection can be evaded by this simple transformation.

Transformation Attacks Detectable by Static Analysis (DSA)

Identifier Renaming

Classes, methods and fields in the bytecode can be renamed. For instance, the ProGuard tool provides identifier renaming: it shrinks, optimizes and obfuscates the code, producing a smaller apk file that is harder to reverse engineer, because by default the compiled bytecode still contains a lot of debugging information (source file names, line numbers, field names, method names, argument names, variable names, etc.).

Figure 3.9: Before ProGuard
Figure 3.10: After ProGuard

In this case ProGuard removes the human-readable static member field in the smali assembly code.
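As a toy illustration of what a renaming pass does, the following sketch (invented method and field names; real tools such as ProGuard operate on the compiled bytecode, not on a simple name list) maps human-readable identifiers to short, meaningless ones:

import itertools

def rename_identifiers(identifiers):
    """Map human-readable names to short, meaningless ones (a, b, ..., aa, ...),
    the way an obfuscator's renaming pass would."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    short_names = ("".join(t) for n in itertools.count(1)
                   for t in itertools.product(alphabet, repeat=n))
    return {name: new for name, new in zip(identifiers, short_names)}

# Hypothetical members of a malicious SMS-sending class.
mapping = rename_identifiers(["sendPremiumSms", "targetNumber", "encodePayload"])
print(mapping)  # {'sendPremiumSms': 'a', 'targetNumber': 'b', 'encodePayload': 'c'}

After such a pass, any signature that matches the original class, method or field names no longer fires, while the behaviour of the application is unchanged.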
Data Encoding

The dex file contains all the strings used in the code, and these strings can be used as a "signature" against the malware; for this reason, malware authors can encode them, for instance:

Figure 3.11: Before ProGuard
Figure 3.12: String Encoding

Call Indirections

Call indirection is a more advanced transformation in which the authors manipulate the control flow of the application by inserting dummy methods that call the proper methods; this can be done for all the method calls. In this way malware authors can evade detection methodologies based on call-graph signatures.

Code Reordering

This method consists of changing the order of the instructions by inserting goto statements, while preserving the program execution flow. An example of code reordering from the paper [20] follows:

Figure 3.13: Before ProGuard
Figure 3.14: Code Reordering

Junk Code Insertion

This method consists of inserting code sequences that do not affect the program semantics, and it can evade detection methodologies that analyze particular bytecode sequences. The junk code can be as simple as a sequence of nop instructions.

Figure 3.15: An example of a junk code fragment [20]

Encrypting Payloads and Native Exploits

In Android, native code is accessed, as native libraries, through the JNI bridge. Some malware, for example DroidDream, packs its payload encrypted in a non-standard location in the application package; this payload is decrypted at runtime. Native exploits can also be stored encrypted. As the paper notes [20], payload and exploit encryption are categorized as DSA because static detection is still possible based on the main application's bytecode.

Composite Transformation

Obviously these transformations can be combined together to obtain more sophisticated ones.

Transformation Attacks Non-Detectable by Static Analysis (NSA)

As stated in the paper [20], these transformations can break every kind of static analysis; it is still possible, however, to perform behavioural or emulation-based analysis.

Reflection

Reflection is commonly used by programs which need to examine or modify the runtime behaviour of running applications. It can be used by malware authors, for example, to evade method-based detection techniques by invoking a method through its name. A subsequent encryption of the method name can make it impossible for any static analysis to recover the call.

Bytecode Encryption

Bytecode encryption is used mainly to avoid static analysis. An encrypted dex file is included in the application package; when an application component is created, it first calls a decryption routine that decrypts the dex file and loads it via a user-defined class loader. For instance, in Android the DexClassLoader provides the functionality to load arbitrary dex files. Parts of the encryption keys may even be fetched remotely [20].

3.3 Evaluating Android anti-malware Products

Google's chief security engineer for Android, Adrian Ludwig, told journalists that most users should not bother with anti-virus. Ludwig reportedly said: "I don't think 99% plus users even get a benefit from [antivirus]. There's certainly no reason that they need to install something in addition to [the security we provide]... If I were to be in a line of work where I need that type of protection it would make sense for me to do that.
[But] do I think the average user on Android needs to install [anti-virus]? Absolutely not."

In this paragraph we evaluate some commercial Android anti-malware products in order to measure the effectiveness of those solutions. This evaluation is important to understand their actual limitations and to take inspiration for the development of new solutions. There are various AV products for the Android platform; before evaluating some of them, we have to understand their limiting factors. Unlike other operating systems, where installed software is considered "trusted" a priori, on Android a file-system-based sandbox ensures that each installed app may only access its own data and not that of the user or of other apps, unless explicitly permitted by the user. As a consequence, an Android AV cannot list directory contents, file-system analysis is not possible, and hooking is not allowed by the sandbox; thus, dynamically downloaded code will not be found by the AV. Android AV applications cannot remove malware or place it into quarantine: users have to remove it manually, by forcing an uninstall through the application manager, or it is often necessary to reinstall the device's software image. The latter is required if the malware gains elevated privileges and installs itself in the system partition, which is read-only for all other software.

To evaluate Android AV products it is important to refer to independent test labs such as AV-TEST (www.av-test.org). In one of their latest endurance tests, AV-TEST [21] examined 36 Android security apps, some of them free to use, others requiring users to pay an annual fee for extra premium functions. A total of 8 apps achieved the maximum possible score of 13 points in the endurance test. In the previous year (2013) not even one of the tested apps was able to achieve 100 percent in the Protection category; one year on, the results are much better, with 14 apps reaching a detection rate of 100 percent in the Protection category:

Figure 3.16: AV-TEST endurance test carried out in 2014: detection rates [21]

In [20], published in 2014, there is an interesting study on Android anti-malware products, evaluated on the basis of the different kinds of transformation discussed before. Their results are reported below. The products evaluated are:

Figure 3.17: Anti-malware products evaluated [20]

The malware samples used for testing are:

Figure 3.18: Malware samples used for testing the anti-malware tools [20]

The evaluations are performed after transforming a malware sample with one (or a combination) of the following:

Figure 3.19: Transformation keys [20]

The detection results for a DroidDream sample follow; an "X" represents a detection failure:

Figure 3.20: DroidDream transformations and anti-malware failures [20]

The detection results for a FakePlayer sample follow; again, an "X" represents a detection failure:

Figure 3.21: FakePlayer transformations and anti-malware failures [20]

Finally, the evaluation summary of all the solutions:

Figure 3.22: Evaluation summary [20]

The most important finding of that paper is that all the anti-malware products evaluated are susceptible to common evasion techniques and may be defeated even by trivial transformations not involving code-level changes.
More specifically, their findings include:

• Finding 1: All the anti-malware products evaluated are vulnerable to common transformations. For example, Geinimi variants have encrypted strings and the DroidKungFu malware uses encrypted exploit code; transformations like identifier renaming or data encryption are easily applied with free or commercial tools. The most important consideration is that such signatures do not capture semantic properties of the malware, such as data and control flow.

• Finding 2: At least 43% of the signatures are not based on code-level artifacts. These signatures are based on the file name, the dex checksum, information obtained from the PackageManager API or, as in the case of AVG, only on the content of the manifest file.

• Finding 3: 90% of the signatures do not require static analysis of the bytecode. Only one of the ten anti-malware tools was found to be using static analysis. Names of classes, methods and fields, as well as all the strings and array data, are stored in the classes.dex file as they are and hence can be obtained by content matching [20].

• Finding 4: Anti-malware tools have evolved towards content-based signatures over the past year. In the previous year, 45% of the signatures were evaded by trivial transformations; the present results show a marked decrease of this fraction to 16%. New solutions have moved to content-based matching, such as matching identifiers and strings.

Their results demonstrate that the new signatures still lack resilience against polymorphic malware.

3.4 Malware Analysis using Call Graphs

Christodorescu et al. [43] describe a technique for semantics-based detection, aimed at overcoming the limitations of the pattern-matching approach, which is purely syntactic and ignores the semantics of the instructions. A common technique used by malware to evade detectors is program obfuscation: for example, a virus can morph by encrypting its malicious payload and decrypting it at runtime. Polymorphic viruses can also obfuscate their decryption loop by code transposition, while metamorphic viruses avoid detection by obfuscating the entire malware and, when they replicate, they change their code in different ways. Their algorithm [43] first unifies the nodes of a given program with the nodes of a signature template; the signature template abstracts data flows and control flows, which are semantic properties of a program. It is a data-flow-based technique, it is not vulnerable to any of the transformations and, as the authors report, it has a very low false positive rate, but developing signature templates manually may be challenging. The use of code encryption and reflection can still defeat also this kind of detection, because they hide the edges in the call graph; to address these limitations, the use of real-time dynamic analysis is essential.

Malware classification and detection problems can mean either detecting novel instances of malware or detecting new malware variants. The detection of novel instances of malware can be addressed by statistical machine learning approaches, while the detection of new malware variants is based upon the concept of "similarity", or "clone detection", against a repository of malware samples. Polymorphic malware variants use code modification and obfuscation to hide themselves; this means that, considering more instances of a polymorphic malware, the content at byte level or at instruction level may change.
This leads to an important limitation for static-based approaches; however, it has been observed that the control flow is more invariant in polymorphic malware [44]. The control flow is the execution path a program may take: the call graph (function call graph) represents the inter-procedural control flow, while the intra-procedural control flow is represented by a set of control flow graphs, one graph per procedure. A call graph for a program is a set of nodes and edges such that there is one node for each procedure and, if procedure c may call procedure p, there is an edge from c to p (the place where the procedure is invoked). A control flow graph is defined as a directed graph G = (V, E) in which the vertices u, v ∈ V represent basic blocks and an edge e ∈ E : u → v represents a possible control flow from u to v. A basic block consists of a sequence of instructions with no jump instruction in the middle, while the directed edges between blocks represent jumps in the control flow, such as calls, return instructions or unconditional jumps. The use of reflection hides the edges in the call graph and, if the method names are also encrypted, these edges become opaque to static analysis [44].

3.4.1 Similarity Detection using Call Graphs

Future attempts to define valid and reliable approaches for malware detection should take into account the limitations of static analysis, and some papers go in the direction of exploring call graphs for malware variant analysis. The real challenge of using graphs to detect code clones is that the bytecode streams contain no higher-level semantic knowledge about the code, making this approach vulnerable to code modifications. Structural malware detection starts from two considerations: first, the malicious functionality of an Android application often concentrates in only a small number of its functions; second, similar malicious code is often found throughout the malware landscape, as attackers reuse existing code to infect different applications [22]. Research must investigate methods that make the use of graphs feasible for large-scale malware detection, considering that the ability to cluster similar samples together makes more generic detection techniques possible [45]. In [45] they compute pair-wise graph similarity scores via graph matchings which approximately minimize the graph edit distance; experiments show that it is indeed possible to accurately detect malware families via call graph clustering [45]. In general, graph matching involves discovering structural isomorphism between graphs through one of the following techniques [45]:

• finding graph isomorphisms;
• detecting the maximum common subgraph;
• finding the minimum graph edit distance.

Figure 3.23: "Android.Trojan.FakeInst.AS" from the FakeInstaller malware family [22]

Figure 3.24: Complete function call graph of "Android:RuFraud-C" from the malware family FakeInstaller. Dark shading of nodes indicates malicious structures identified by the SVM [22]

Call graph isomorphism analysis is also used for clone detection, software plagiarism and software vulnerability analysis in general. In the paper [23] the authors implement a method for the detection of so-called "purpose-added" apps; to achieve both goals they use a geometric characteristic of dependency graphs, called the centroid, to measure the similarity between methods (code fragments) in two apps.
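Before looking at the centroid approach of [23] in more detail, the call-graph representation discussed above can be made concrete with a minimal sketch (the method names are invented and the score is not taken from any cited paper): the call graph is stored as an adjacency list and the similarity of two graphs is estimated as the Jaccard overlap of their edge sets, a deliberately rough stand-in for the common-subgraph and edit-distance measures listed above.

def build_call_graph(call_edges):
    """Adjacency list: caller -> set of callees."""
    graph = {}
    for caller, callee in call_edges:
        graph.setdefault(caller, set()).add(callee)
    return graph

def edge_set(graph):
    return {(c, t) for c, targets in graph.items() for t in targets}

def edge_jaccard(g1, g2):
    """Share of call edges the two graphs have in common (0..1)."""
    e1, e2 = edge_set(g1), edge_set(g2)
    return len(e1 & e2) / len(e1 | e2) if (e1 or e2) else 1.0

# Hypothetical call edges of two apk variants sharing a payload component.
a = build_call_graph([("onCreate", "startService"), ("startService", "sendSms"),
                      ("sendSms", "encode")])
b = build_call_graph([("onCreate", "startService"), ("startService", "sendSms"),
                      ("sendSms", "encrypt")])
print(round(edge_jaccard(a, b), 2))  # 0.5 -> two of four distinct edges shared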
The method-level similarities are then synthesized to draw a yes/no conclusion on app (core-functionality) cloning [23]. In [23] they use the so-called "centroid", which can be viewed as the "mass centre" of the CFG, instead of the whole CFG. For performance reasons they only check the centroids, under the assumption that a small change in a method means a small change in its centroid. They define the Centroid Difference Degree (CDD) and the Methods Difference Degree (MDD) metrics for method comparison, and they use the longest common subsequence (LCS) to compute the similarity between methods: for two methods m1 and m2, each method is viewed as a sequence of opcodes. It should also be noted that adding one node to a small CFG (with fewer than 4 nodes) may change the centroid a lot, but for bigger CFGs (with 4 nodes or more) the centroid-based approach is effective [23].

Figure 3.25: Two corresponding methods in two app clones coming from different markets. The first method has one more function call, which initializes several ads [23]

3.4.2 Software Clones Taxonomy

An important clone taxonomy can be found in [46]: as the authors state, two code fragments can be similar based on the similarity of their program text, or they can be similar in their functionality without being textually similar. They consider clone types based on the kind of similarity two code fragments can have.

• Textual Similarity: based on textual similarity we distinguish the following types of clones:
  - Type I: identical code fragments except for variations in whitespace (possibly also variations in layout) and comments.
  - Type II: structurally/syntactically identical fragments except for variations in identifiers, literals, types, layout and comments.
  - Type III: copied fragments with further modifications; statements can be changed, added or removed, in addition to variations in identifiers, literals, types, layout and comments.

• Functional Similarity: if the functionalities of the two code fragments are identical or similar, i.e., they have similar pre- and post-conditions, we call them semantic clones, referred to as Type IV clones.
  - Type IV: two or more code fragments that perform the same computation but are implemented through different syntactic variants.

It is important to consider that each clone type represents a different level of sophistication of the detection process, increasing from Type I to Type IV. The detection of Type IV clones is the hardest, even with a great deal of background knowledge about the program construction and software design.

Type I Clones

In Type I clones, a copied code fragment is the same as the original; there may be some variations in whitespace (blanks, new lines, tabs, etc.), comments and/or layout. Type I clones are widely known as exact clones; they are "textually similar":

Listing 3.2: Type I Clone
if (a >= b) {
    c = d + b; // Comment1
    d = d + 1;
} else
    c = d - a; //Comment2

An exact-copy clone of this original fragment could be as follows:

if (a>=b) {
    // Comment1'
    c=d+b;
    d=d+1;
} else
    // Comment2'
    c=d-a;

Type II Clones

A Type II clone is a code fragment that is the same as the original except for some possible variations in the names of user-defined identifiers (names of variables, constants, classes, methods and so on), types, layout and comments [46].
Listing 3.3: Type II Clone
if (a >= b) {
    c = d + b; // Comment1
    d = d + 1;
} else
    c = d - a; //Comment2

A Type II clone of this fragment can be as follows:

if (m >= n) {
    // Comment1'
    y = x + n;
    x = x + 5; //Comment3
} else
    y = x - m; //Comment2'

Type III Clones

In Type III clones, the copied fragment is further modified, with statements changed, added and/or deleted.

Listing 3.4: Type III Clone
if (a >= b) {
    c = d + b; // Comment1
    d = d + 1;
} else
    c = d - a; //Comment2

If we now extend this code segment by adding a statement e = 1, we get:

if (a >= b) {
    c = d + b; // Comment1
    e = 1; // This statement is added
    d = d + 1;
} else
    c = d - a; //Comment2

Type IV Clones

Type IV clones are the result of semantic similarity between two or more code fragments. Functional similarity reflects the degree to which the components act alike, i.e., it captures similar functional properties, and similarity assessment methods rely on the matching of pre/post-conditions [46].

Listing 3.5: Type IV Clone
Fragment 1:
int i, j=1;
for (i=1; i<=VALUE; i++)
    j=j*i;

Now consider the following code fragment 2, which is actually a recursive function that calculates the factorial of its argument n.

Fragment 2:
int factorial(int n) {
    if (n == 0) return 1;
    else return n * factorial(n-1);
}

These are Type IV (semantic) clones, although one is a simple code fragment and the other is a recursive function, with no lexical, syntactic or structural similarity between the statements of the two fragments.

Although the centroid-based approach [23] is extremely effective in detecting Type I, Type II and Type III clones, it may not be effective in detecting Type IV clones. For Type I, Type II and Type III clones the function call graph remains the same; even changing the identifier names, the control flow remains the same, and even after adding or deleting some instructions, some sequences of opcodes (common opcode sequences) remain the same. Type IV clones are difficult to detect with static analysis tools but, as we will see further on, they may be irrelevant for malware analysis.

3.4.3 The Isomorphism Problem

An exact graph isomorphism between two graphs G and H [47] is a bijective function f that maps the vertices V(G) to V(H) such that, for all i, j ∈ V(G), (i, j) ∈ E(G) if and only if (f(i), f(j)) ∈ E(H). Graph isomorphism captures the notion that some objects have "the same structure": two isomorphic graphs enjoy the same graph-theoretical properties. Unfortunately, the graph isomorphism problem is one of the few standard problems in computational complexity theory belonging to NP but not known to belong to either of its well-known (and, if P ≠ NP, disjoint) subsets P and NP-complete. Detecting the largest common subgraph of a pair of graphs is closely related to graph isomorphism, as it attempts to find the largest induced subgraph of G which is isomorphic to a subgraph of H [45], and it is known to be NP-complete; a measure of minimum graph edit distance can then be used as a similarity measure. For a definition of graph similarity we refer to the one in [45]:

(Graph similarity): The similarity σ(G, H) between two graphs G and H indicates the extent to which graph G resembles graph H and vice versa.
The similarity σ(G, H) is a real value in the interval [0,1], where 0 indicates that the graphs G and H are identical, whereas a value of 1 implies that there are no similarities. It is important to consider what "similarity" means for call graphs, in order to bring out relevant structural properties shared between malware samples. The real challenge of using graphs is to show similarity in polynomial time: two graphs are isomorphic if they share the same structure and the same functional properties, but solving the isomorphism problem is computationally very expensive. To overcome this, there are some interesting research works in the literature that use features of the graph, instead of the whole inter- or intra-procedural graph, which greatly simplify the isomorphism problem. Such features are the Centroid Difference Degree or the Methods Difference Degree [23], the count of common identical substructures in two graphs computed with the neighborhood hash graph kernel function [22], and q-grams and k-subgraphs used to create feature vectors [48].

Chapter 4

Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis

This chapter describes our contribution to Android malware analysis. Our contribution is placed before the actual malware detection methodologies: it is not yet another method for detecting Android malware, but a new methodology to study malware, detect malware variants, identify similarities across malware families and identify similarities at method level in method-reuse scenarios. This methodology is placed before any existing malware detection methodology, in order to enhance current solutions or, for example, to build malware "signs" and track the evolution of malware families. It is not always possible to identify the malware payload, due to the advanced obfuscation techniques applied to the apk in a code-reuse scenario or, more generally, to variations in its structure; it is however possible to identify invariant, characteristic components, and this is our direction. Our methodology is guided by the Opcodes Frequency Distribution analysis: every analysis stage starts from the 10 most similar returned vectors. The methodology is realized along two directions. As a first direction, we compute the relative Opcodes Frequency Distribution as a measure to identify software clones; with this we can implement a more resilient analysis to detect Type I, Type II and Type III code clones. The methodology can also perform the Opcodes Frequency Distribution analysis at method level, in order to conduct code-clone detection at a finer granularity: in this way it is possible to identify Type I and Type II code clones at method level, an approach that seems quite effective considering the code-reuse scenarios in the malware landscape. As a second direction, after extracting the inner Function Call Graph and the inner Control Flow Graph, our methodology identifies isomorphism features between Android malicious applications, i.e., common subgraphs of both the CFG and the FCG, to detect code clones. The call graphs are mapped into a vector space by means of their adjacency lists and these lists are then compared in an n-gram analysis stage to outline structural similarities in the control flow, as illustrated by the sketch below.
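A minimal sketch of that n-gram comparison over adjacency lists follows; the serialization of the lists into token sequences, the n-gram size and the Jaccard score are illustrative assumptions and not the exact implementation of the methodology, which is detailed in the following sections.

def adjacency_to_sequence(adjacency):
    """Serialize an adjacency list (caller -> list of callees) into a token sequence."""
    tokens = []
    for caller in sorted(adjacency):
        tokens.append(caller)
        tokens.extend(sorted(adjacency[caller]))
    return tokens

def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_similarity(adj1, adj2, n=3):
    """Jaccard overlap of the n-gram sets extracted from the two adjacency lists."""
    g1 = ngrams(adjacency_to_sequence(adj1), n)
    g2 = ngrams(adjacency_to_sequence(adj2), n)
    return len(g1 & g2) / len(g1 | g2) if (g1 or g2) else 1.0

# Hypothetical call graphs of a candidate sample and a known variant.
candidate = {"onCreate": ["initAds", "startService"], "startService": ["sendSms"]}
variant   = {"onCreate": ["startService"], "startService": ["sendSms"]}
print(round(ngram_similarity(candidate, variant), 2))  # 0.25 with these toy graphs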
Finally, a "Similarity Score" is computed by the classifier and assessed in the experimental phase to confirm the value of this methodology. When we talk about Type X clones, we refer to code clones at bytecode level because, as is well known, a small change in the high-level code may mean a big change at low level, both for optimization reasons and because of high-level APIs. Also for this reason our methodology starts from the Opcodes Frequency Distribution, which is more robust in detecting changes at low level.

4.1 The Data-Set

4.1.1 The Malicious Data-Set, Malware Variants

Figure 4.1: Android Drebin logo

The system extracts structural properties from the candidate apk and then measures the distance of these properties against a malware database built on top of the Android Drebin data (http://user.informatik.uni-goettingen.de/~darp/drebin/) [49] [50]. The dataset contains 5,560 applications from 179 different malware families; the samples were collected in the period from August 2010 to October 2012. Each sample has its own sha256 sum as unique id, and each sample has a different id: this means that each sample can be considered a malware variant within its family. We retrieved 4,544 apks of 178 different well-known families from the Drebin database and then extracted structural properties such as the inner CFG/FCG, the Opcodes Frequency Distribution and the method-level Opcodes Frequency Distribution. All these data are stored in a local database, which constitutes our starting dataset. Let us first examine the dataset in terms of its composition, and then take a look at how our local database was built and structured.

4.1.1.1 Malware families

The malware samples used to build our local dataset belong to 178 well-known malware families:

Table 4.1: Malware Drebin Project Families
Stealthcell, Koomer, GGtrack, Adsms, YcChar, FaceNiff, Nickspy, Ksapp, SMSZombie, FakeDoc, Loicdos, Stealer, FakeFlash, RediAssi, Anti, Copycat, DroidRooter, Geinimi, SeaWeth, Acnetdoor, Tesbo, SpyBubble, FarMap, AccuTrack, Nandrobox, Kmin, EICAR-Test-File, UpdtKiller, Flexispy, Spy.GoneSixty, Fakengry, Glodream, Moghava, FinSpy, SpyMob, Proreso, Exploit.RageCage, Sakezon, Generic, Zsone, TrojanSMS.Hippo, LifeMon, Yzhc, Saiva, SMSSend, SmForw, SMSBomber, Lypro, RootSmart, Ackposts, Spitmo, Dabom, Fjcon, Updtbot, Rooter, EWalls, Maxit, Pirates, Coogos, Lemon, Fakelogo, Dogowar, Mobinauten, Penetho, Nisev, DroidDream, Ansca, GPSpy, JS Exploit-DynSrc, TheftAware, NickyRCP, Netisend, Placms, Stiniter, Antares, Typstu, Bosm, Mobsquz, FoCobers, CgFinder, JSmsHider, Pirater, RuFraud, Loozfon, BeanBot, Gamex, Sonus, Vdloader, SpyPhone, Nyleaker, FakeInstaller, Aks, Adrd, Whapsni, Zitmo, FakeNefix, CrWind, Qicsom, FakeRun, Iconosys, DroidSheep, Gonca, Gappusin, Xsider, Cawitt, Fidall, Cosha, Mania, Jifake, CellShark, SMSreg, ExploitLinuxLotoor, SuBatt, MobileTx, Bgserv, Fauxcopy, Gasms, Imlog, Kidlogger, Tapsnake, Hispo, MTracker, QPlus, Dougalek, Fatakr, RATC, Plankton, Steek, DroidKungFu, Hamob, Mobilespy, Booster,
Spyset, CellSpy, PJApps, Fsm, Smspacem, Gapev, Arspam, Dialer, Spy.ImLog, TrojanSMS.Boxer.AQ, Fujacks, Trackplus, TigerBot, Ceshark, Anudow, Raden, SheriDroid, Biige, SendPay, TrojanSMS.Stealer, SmsWatcher, BaseBridge, MMarketPay, Maistealer, GinMaster, Sdisp, FakeTimer, Replicator, Kiser, PdaSpy, Ssmsp, TrojanSMS.Denofow, GlodEagl, SafeKidZone, Opfake, SmsSpy, SpyHasb, FakePlayer, Fakeview, Foncy, Boxer, Vidro, Gmuse, Spyoo, SerBG, Luckycat

4.1.2 The Trusted Data-set

Regarding the "trusted set", we selected about 1,000 apks from the Google Play Store in a more or less random way. Through this channel we are assured that each apk is "trusted", or at least it should be, because Google, to keep malicious apps off the official Android app store, submits each application to different detection tools. One of these services is called Bouncer: it quietly and automatically scans apps (both new and previously uploaded ones) and developer accounts in Google Play with its reputation engine and cloud infrastructure. At Summercon 2012, Charlie Miller and Jon Oberheide gave a talk on how it is possible to fingerprint Android's Bouncer: they submitted an Android app which included shell code that allowed them to poke around Bouncer while the submitted app was being analyzed. This code also connected back and reported some interesting findings, like Bouncer's IP range, platform, and so on. Knowing that Bouncer can be easily fingerprinted, it is not difficult to devise malware that can avoid Bouncer's detection; for instance, this can be achieved through two different techniques:

• Delayed attack: the application can include malicious payloads in the submitted app but behave benignly while it is running in Bouncer. Once it gets onto a user's device, it starts to run the malicious code.

• Update attack: no malicious code needs to be included in the initial installer. Once the application passes Bouncer's check and gets installed on a real user's device, it can either download additional malicious code to run, or connect to its remote command and control (C&C) server to upload stolen data or receive further commands.

In any case, we consider as "trusted" the applications provided by the Google Play Store. This could also be interesting for this study, for example to find malware variants among Google Play Store applications.

4.2 The multi-staged Architecture

In this section I describe the whole system architecture with a top-down approach, starting from a general overview of the system and then focusing on the individual elements. It is a multi-staged architecture in which each stage acts as a separate analysis stage; the system is realized in software with a set of Perl scripts. Each analysis stage extracts metrics and stores them in a database, both to provide measures to the next stage and to support any further data mining in a malware analysis environment. As we will see, the system is composed of two fundamental parts: the first is the extraction of the Relative Opcodes Frequency Distribution, by which the system returns the 10 vectors most similar to the candidate vector using the cosine similarity; these 10 returned vectors are then used in the second part.
The second part consists of the n-gram analysis of the CFG and the FCG to find similar or cloned patterns: first the call graphs are extracted, then they are mapped into the vector space to better compute similarities. Finally, based on these measures, a classifier computes the final similarity, taking the computed "distances" into account. The system architecture is shown below:

Figure 4.2: The Whole System Multi-staged Architecture

4.3 The Relative Opcodes Frequency Distribution

Our heuristic, as we will see, starts from the Relative Opcodes Frequency Distribution: we first extracted and stored the opcode frequency distribution of each apk and then divided it by the total number of opcodes in the apk. For the opcode distribution we referred to the official Dalvik bytecode set table available at source.android.com/devices/tech/dalvik/dalvik-bytecode.html. Each kind of operation is assigned to a category: for example, we grouped all the operations on arrays in the category arrayop and all the binary operations in the category binop, and so on. The bytecode set categories we defined are listed in the following table:

Table 4.2: Dalvik Bytecode set Categories

- nop op: nop. Waste cycles.
- moveobject: move-object, move-object/from16, move-object/16. Move the contents of one object-bearing register to another.
- moveexception: move-exception. Save a just-caught exception into the given register. This must be the first instruction of any exception handler whose caught exception is not to be ignored, and this instruction must only ever occur as the first instruction of an exception handler.
- move: move, move/from16, move/16, move-wide, move-wide/from16, move-wide/16, move-result, move-result-wide, move-result-object. The move category includes all move/*, all move-wide/* and all move-result/* opcodes.
- return op: return. Return from a single-width (32-bit) non-object value-returning method.
- returnobject: return-object. Return from an object-returning method.
- returnvoid: return-void. Return from a void method.
- returnwide: return-wide. Return from a double-width (64-bit) value-returning method.
- monitorenter: monitor-enter. Acquire the monitor for the indicated object.
- monitorexit: monitor-exit. Release the monitor for the indicated object.
- goto: goto. Unconditionally jump to the indicated instruction.
- goto16: goto/16. Unconditionally jump to the indicated instruction with a 16-bit offset.
- goto32: goto/32. Unconditionally jump to the indicated instruction with a 32-bit offset.
- invokevirtual: invoke-virtual. Used to invoke a normal virtual method (a method that is not private, static, or final, and is also not a constructor).
- invokesuper: invoke-super. Used to invoke the closest superclass's virtual method.
- invokedirect: invoke-direct. Used to invoke a non-static direct method (that is, an instance method that is by its nature non-overridable, namely either a private instance method or a constructor).
- invokestatic: invoke-static. Used to invoke a static method (which is always considered a direct method).
- invokeinterfaces: invoke-interface. Used to invoke an interface method, that is, a method on an object whose concrete class isn't known, using a method id that refers to an interface.
- packedswitch: packed-switch. Jump to a new instruction based on the value in the given register, using a table of offsets corresponding to each value in a particular integral range, or fall through to the next instruction if there is no match.
- sparseswitch: sparse-switch. Jump to a new instruction based on the value in the given register, using an ordered table of value-offset pairs, or fall through to the next instruction if there is no match.
- arrayop: aget, aget-wide, aget-object, aget-boolean, aget-byte, aget-char, aget-short, aput, aput-wide, aput-object, aput-boolean, aput-byte, aput-char, aput-short. Perform the identified array operation at the identified index of the given array, loading or storing into the value register.
- iftest: if-eq, if-ne, if-lt, if-ge, if-gt, if-le. Branch to the given destination if the given two registers' values compare as specified.
- instanceop: iget, iget-wide, iget-object, iget-boolean, iget-byte, iget-char, iget-short, iput, iput-wide, iput-object, iput-boolean, iput-byte, iput-char, iput-short. Perform the identified object instance field operation with the identified field, loading or storing into the value register.
- staticop: sget, sget-wide, sget-object, sget-boolean, sget-byte, sget-char, sget-short, sput, sput-wide, sput-object, sput-boolean, sput-byte, sput-char, sput-short. Perform the identified object static field operation with the identified static field, loading or storing into the value register.
- iftestz: if-eqz, if-nez, if-ltz, if-gez, if-gtz, if-lez. Branch to the given destination if the given register's value compares with 0 as specified.
- cmpop: cmpl-float (lt bias), cmpg-float (gt bias), cmpl-double (lt bias), cmpg-double (gt bias), cmp-long. Perform the indicated floating point or long comparison.
- unop: neg-int, not-int, neg-long, not-long, neg-float, neg-double, int-to-long, int-to-float, int-to-double, long-to-int, long-to-float, long-to-double, float-to-int, float-to-long, float-to-double, double-to-int, double-to-long, double-to-float, int-to-byte, int-to-char, int-to-short. Perform the identified unary operation on the source register, storing the result in the destination register.
- binop: add-int, sub-int, mul-int, div-int, rem-int, and-int, or-int, xor-int, shl-int, shr-int, ushr-int, add-long, sub-long, mul-long, div-long, rem-long, and-long, or-long, xor-long, shl-long, shr-long, ushr-long, add-float, sub-float, mul-float, div-float, rem-float, add-double, sub-double, mul-double, div-double, rem-double, binop/2addr, binop/lit16, binop/lit8. Perform the identified binary operation on the two source registers, storing the result in the first source register.
- throw: throw. Throw the indicated exception.

After defining these categories, we extracted all the Dalvik opcodes of each apk and created a fixed-size vector of size 29, where each element corresponds to a defined category C = c1, c2, c3, c4, ... and its value is the relative frequency of that category (a density measure). In the end we stored in a local database, for each apk, a characterizing vector representing its Relative Opcodes Frequency Distribution.
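A minimal sketch of this step follows; only a handful of the 29 categories are shown and the opcode stream is an invented example, so it is an illustration of the idea rather than the extraction actually performed by our scripts on the disassembled code.

from collections import Counter

# Partial category map, for illustration only (the full map has 29 categories).
CATEGORY = {
    "nop": "nop op", "goto": "goto", "goto/16": "goto16",
    "if-eqz": "iftestz", "if-nez": "iftestz",
    "invoke-virtual": "invokevirtual", "invoke-static": "invokestatic",
    "return-void": "returnvoid", "throw": "throw",
}

def relative_frequency_vector(opcodes, categories):
    """Count opcodes per category and divide by the total number of counted opcodes."""
    counts = Counter(CATEGORY.get(op) for op in opcodes if op in CATEGORY)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in categories}

# Invented opcode stream extracted from a disassembled method.
stream = ["invoke-virtual", "if-eqz", "goto", "invoke-static",
          "if-nez", "return-void", "return-void"]
vector = relative_frequency_vector(stream, sorted(set(CATEGORY.values())))
print(vector)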
Below we provide an example of a characterizing vector for a malware sample (66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa) of the FakeDoc family:

Figure 4.3: 66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa sample, Opcode frequency Distribution

Table 4.3: 66d4fb0ba082a53eaedf8909f65f4f9d60f0b038e6d5695dbe6d5798853904aa Relative Opcodes Frequency Distribution Vector
nop op 0, move 0.01702, moveexception 0.00868, moveobject 0.05488, return op 0.02362, returnobject 0.05974, returnvoid 0.14345, returnwide 0.00764, monitorenter 0.00069, monitorexit 0, goto 0.14484, goto16 0.03161, goto32 0, invokevirtual 0.03890, invokesuper 0, invokedirect 0.00834, invokestatic 0.03334, invokeinterfaces 0.03091, packedswitch 0.01181, sparseswitch 0.00208, arrayop 0.00243, iftest 0.03091, instanceop 0.00660, staticop 0.01111, iftestz 0.27405, cmpop 0, unop 0, binop 0.00174, throw 0.02431

The Relative Opcodes Frequency Distribution is a metric that characterizes applications and it is the starting point of our architecture. As a first step, the opcode distribution of a candidate apk is extracted and then compared with the distributions previously extracted and stored in our local database, and a vector of the 10 most similar samples is returned by computing, in software, the cosine similarity. Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them: given two vectors of attributes A and B, the cosine similarity cos(θ) is defined as

cos(θ) = (A · B) / (||A|| ||B||)

In the case of information retrieval, the cosine similarity of two documents, or of text in general, ranges from 0 to 1, since the angle between two term-frequency vectors cannot be greater than 90°. The returned vector contains the 10 apks most similar to the candidate in terms of Relative Opcodes Frequency Distribution:

Figure 4.4: Opcodes Distribution extraction and Computation

We also computed the Relative Opcodes Frequency Distribution for each method included in the apk. In this way we are able to identify similarities at method level in a more fine-grained way, identifying clones at method level; this metric is also stored in the local database. The method-level clone detection reported in the architecture is another way to find structural similarities at method level, and it will be discussed in the next paragraph. Let us now show a real case of the similarity detection methodology we implemented. The real case starts from the candidate apk we want to analyze: we select a malicious sample from the Gappusin family.
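Before walking through the example, the ranking step just described can be sketched as follows; the sample ids are shortened and the vectors truncated purely for illustration, while the real comparison works on the full 29-category vectors stored in the local database.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_similar(candidate, database, k=10):
    """Return the k database entries most similar to the candidate vector."""
    ranked = sorted(((cosine(candidate, vec), sample_id)
                     for sample_id, vec in database.items()), reverse=True)
    return ranked[:k]

# Truncated, made-up frequency vectors keyed by (shortened) sha256 id.
db = {
    "37d2af65...": [0.0, 0.02791, 0.0, 0.00465, 0.06047, 0.28372],
    "da073166...": [0.0, 0.03744, 0.00399, 0.01348, 0.02546, 0.25961],
    "aabbccdd...": [0.10, 0.0, 0.0, 0.20, 0.0, 0.05],
}
candidate = [0.0, 0.02791, 0.0, 0.00465, 0.06047, 0.28372]
for score, sid in top_similar(candidate, db, k=3):
    print(f"{sid}  sim={score:.5f}")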
Analyzing the sample, our methodology starts the Opcodes Frequency Distribution analysis and returns the 10 most similar vectors according to the cosine similarity:

Table 4.4: 10 most similar vectors by the Cosine Similarity
1. 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde - Sim: 1 - Family: Gappusin
2. 85f5513112800588fcd3b3cb59a4eeb2b25f53b1b3070a9e3a7d447a5cd45281 - Sim: 1 - Family: Gappusin
3. 4465602bd2c37dff6819798831828ac0cbb70614874afb179a8efbf26e538411 - Sim: 1 - Family: Gappusin
4. 40ec5ac7d23316b2d3ae674efcb9a48073ac08082aeaa8f9d5f4945d2e1ae4d3 - Sim: 1 - Family: Gappusin
5. 8f617e67ea6bdc4daf680875de10b3df79ec141780b1ff7604155a66021fee76 - Sim: 1 - Family: Gappusin
6. 622cdedc202c2876e2e9b4c949100963b9b74b7d3977c31d15a0bb10ab2a7c97 - Sim: 1 - Family: Gappusin
7. 12780c0ed735628dc21a5c0e4783e82c017a6ad999fd74e29420f5c1a598ca6c - Sim: 1 - Family: Gappusin
8. 212aa193b9f5a688a650366970c02987c99d07f6b1e50f04c510bfb74f7062f1 - Sim: 1 - Family: Gappusin
9. da07316645455a02b2f9eda79662c585aee96e9ab06aeb139fb895b9ea89238a - Sim: 0.99890 - Family: Gappusin
10. 2a1582f742edeb2ba7d0a90c0bf4358643612a38544d13a297c2ea5f6b7d86ce - Sim: 0.99728 - Family: Gappusin

We can notice that all 10 returned vectors belong to the Gappusin family, as expected: these vectors are the closest to the candidate according to the Opcodes Frequency Distribution similarity, measured with the cosine similarity metric. This is a whole-apk, coarse-grained analysis which only takes into account the affinity of the Opcodes Frequency Distribution, and this first analysis already tells us that the candidate apk most probably belongs to the Gappusin family. As shown in the previous table, the Sim score is the similarity score of the Opcodes Frequency Distribution measured by the cosine similarity. In the following we compare the Opcodes Frequency Distribution of the candidate apk with each of the 10 returned vectors; for each comparison we first show the actual frequency of each opcode category and then the related histogram.

Table 4.5: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 1st Vector (right). The two vectors are identical; the shared values are: nop op 0, move 0.02791, moveexception 0, moveobject 0.00465, return op 0.06047, returnobject 0.05581, returnvoid 0.16279, returnwide 0.00465, monitorenter 0, monitorexit 0, goto 0.15349, goto16 0.02326, goto32 0, invokevirtual 0.00465, invokesuper 0, invokedirect 0.00465, invokestatic 0.04651, invokeinterfaces 0.00930, packedswitch 0, sparseswitch 0, arrayop 0, iftest 0.06512, instanceop 0, staticop 0, iftestz 0.28372, cmpop 0, unop 0, binop 0, throw 0.00465.

Figure 4.5: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 1st Vector (red)
Table 4.6: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 2nd Vector (right). The values coincide, category by category, with those already reported for the 1st vector in Table 4.5.

Figure 4.6: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 2nd Vector (red)

Table 4.7: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 3rd Vector (right). Again, the values coincide with those reported in Table 4.5.

Figure 4.7: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 3rd Vector (red)
Table 4.8: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 4th Vector (right). The values coincide with those reported in Table 4.5.

Figure 4.8: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 4th Vector (red)

As we can notice, the first four returned vectors are identical to the candidate: they have a similarity score of 1 (as shown in the previous table), which for the cosine similarity means that the distributions are the same. The 5th to 8th returned vectors, shown next, also have a similarity score of 1 and their values coincide with the candidate's. The 9th and 10th vectors, instead, have a high similarity score but not equal to 1: their vectors are not identical, yet they are very similar to the candidate in terms of Opcodes Frequency Distribution. Such a slightly dissimilar vector, still belonging to the Gappusin family, most probably represents a sample of that family sharing about 99% of the opcode distribution, as in a code-reuse scenario.

Table 4.9: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 5th Vector (right). The values coincide with those reported in Table 4.5.

Figure 4.9: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 5th Vector (red)
Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 130 Table 4.10: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 6th Vector (right) nop op 0 nop op 0 move 0.02791 move 0.02791 moveexception 0 moveexception 0 moveobject 0.00465 moveobject 0.00465 return op 0.06047 return op 0.06047 returnobject 0.05581 returnobject 0.05581 returnvoid 0.16279 returnvoid 0.16279 returnwide 0.00465 returnwide 0.00465 monitorenter 0 monitorenter 0 monitorexit 0 monitorexit 0 goto 0.15349 goto 0.15349 goto16 0.02326 goto16 0.02326 goto32 0 goto32 0 invokevirtual 0.00465 invokevirtual 0.00465 invokesuper 0 invokesuper 0 invokedirect 0.00465 invokedirect 0.00465 invokestatic 0.04651 invokestatic 0.04651 invokeinterfaces 0.00930 invokeinterfaces 0.00930 packedswitch 0 packedswitch 0 sparsewitch 0 sparsewitch 0 arrayop 0 arrayop 0 iftest 0.06512 iftest 0.06512 instanceop 0 instanceop 0 staticop 0 staticop 0 iftestz 0.28372 iftestz 0.28372 cmpop 0 cmpop 0 unop 0 unop 0 binop 0 binop 0 throw 0.00465 throw 0.00465 131 Figure 4.10: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 6th Vector (red) Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 132 Table 4.11: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 7th Vector (right) nop op 0 nop op 0 move 0.02791 move 0.02791 moveexception 0 moveexception 0 moveobject 0.00465 moveobject 0.00465 return op 0.06047 return op 0.06047 returnobject 0.05581 returnobject 0.05581 returnvoid 0.16279 returnvoid 0.16279 returnwide 0.00465 returnwide 0.00465 monitorenter 0 monitorenter 0 monitorexit 0 monitorexit 0 goto 0.15349 goto 0.15349 goto16 0.02326 goto16 0.02326 goto32 0 goto32 0 invokevirtual 0.00465 invokevirtual 0.00465 invokesuper 0 invokesuper 0 invokedirect 0.00465 invokedirect 0.00465 invokestatic 0.04651 invokestatic 0.04651 invokeinterfaces 0.00930 invokeinterfaces 0.00930 packedswitch 0 packedswitch 0 sparsewitch 0 sparsewitch 0 arrayop 0 arrayop 0 iftest 0.06512 iftest 0.06512 instanceop 0 instanceop 0 staticop 0 staticop 0 iftestz 0.28372 iftestz 0.28372 cmpop 0 cmpop 0 unop 0 unop 0 binop 0 binop 0 throw 0.00465 throw 0.00465 133 Figure 4.11: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 7th Vector (red) Chapter 4. 
Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 134 Table 4.12: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 8th Vector (right) nop op 0 nop op 0 move 0.02791 move 0.02791 moveexception 0 moveexception 0 moveobject 0.00465 moveobject 0.00465 return op 0.06047 return op 0.06047 returnobject 0.05581 returnobject 0.05581 returnvoid 0.16279 returnvoid 0.16279 returnwide 0.00465 returnwide 0.00465 monitorenter 0 monitorenter 0 monitorexit 0 monitorexit 0 goto 0.15349 goto 0.15349 goto16 0.02326 goto16 0.02326 goto32 0 goto32 0 invokevirtual 0.00465 invokevirtual 0.00465 invokesuper 0 invokesuper 0 invokedirect 0.00465 invokedirect 0.00465 invokestatic 0.04651 invokestatic 0.04651 invokeinterfaces 0.00930 invokeinterfaces 0.00930 packedswitch 0 packedswitch 0 sparsewitch 0 sparsewitch 0 arrayop 0 arrayop 0 iftest 0.06512 iftest 0.06512 instanceop 0 instanceop 0 staticop 0 staticop 0 iftestz 0.28372 iftestz 0.28372 cmpop 0 cmpop 0 unop 0 unop 0 binop 0 binop 0 throw 0.00465 throw 0.00465 135 Figure 4.12: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 8th Vector (red) Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 136 Table 4.13: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 9th Vector (right) nop op 0 nop op 0 move 0.02791 move 0.03744 moveexception 0 moveexception 0.00399 moveobject 0.00465 moveobject 0.01348 return op 0.06047 return op 0.02546 returnobject 0.05581 returnobject 0.09636 returnvoid 0.16279 returnvoid 0.15876 returnwide 0.00465 returnwide 0.00100 monitorenter 0 monitorenter 0.00100 monitorexit 0 monitorexit 0.00150 goto 0.15349 goto 0.15277 goto16 0.02326 goto16 0.03495 goto32 0 goto32 0 invokevirtual 0.00465 invokevirtual 0.04693 invokesuper 0 invokesuper 0 invokedirect 0.00465 invokedirect 0.00899 invokestatic 0.04651 invokestatic 0.01398 invokeinterfaces 0.00930 invokeinterfaces 0.01248 packedswitch 0 packedswitch 0.00599 sparsewitch 0 sparsewitch 0 arrayop 0 arrayop 0.00300 iftest 0.06512 iftest 0.04943 instanceop 0 instanceop 0.02147 staticop 0 staticop 0.01947 iftestz 0.28372 iftestz 0.25961 cmpop 0 cmpop 0 unop 0 unop 0.00050 binop 0 binop 0.0050 throw 0.00465 throw 0.00399 137 Figure 4.13: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 9th Vector (red) Chapter 4. 
Table 4.14: Candidate apk Relative Opcodes Frequency Distribution Vector (left) vs Relative Opcodes Frequency Distribution of the 10th Vector (right)

Opcode class       Candidate   10th Vector
nop op             0           0
move               0.02791     0.03096
moveexception      0           0.00593
moveobject         0.00465     0.00791
return op          0.06047     0.02141
returnobject       0.05581     0.08531
returnvoid         0.16279     0.16634
returnwide         0.00465     0.00132
monitorenter       0           0.00132
monitorexit        0           0.00132
goto               0.15349     0.16271
goto16             0.02326     0.02931
goto32             0           0
invokevirtual      0.00465     0.04447
invokesuper        0           0.00033
invokedirect       0.00465     0.01812
invokestatic       0.04651     0.01581
invokeinterfaces   0.00930     0.01252
packedswitch       0           0.00527
sparsewitch        0           0
arrayop            0           0.00296
iftest             0.06512     0.04282
instanceop         0           0.03195
staticop           0           0.02075
iftestz            0.28372     0.26219
cmpop              0           0
unop               0           0.00066
binop              0           0.00099
throw              0.00465     0.00428

Figure 4.14: Candidate apk Relative Opcodes Frequency Distribution Vector (blue) vs Relative Opcodes Frequency Distribution of the 10th Vector (red)

In an attempt to place this methodology before the Detection process, a further interesting analysis consists of choosing a malware family and identifying the Dalvik instructions that are most representative of that family. This is possible by first examining the distribution of each opcode category. For this purpose we chose to examine the Gappusin malware family. We can observe that opcode classes like "iftestz", "goto" and "return*" are present in practically every method of each known application, while certain types of instructions rarely occur. It is the rare opcode classes that can be most representative of each malware family:

Figure 4.15: Gappusin family, Opcodes classes Frequency Distribution

4.4 Isomorphism Analysis
As already mentioned, we started by retrieving the malicious apk's from the Android Drebin dataset; we then extracted some metrics from each apk and stored them in a local database. Below we show the relevant information we decided to extract and store. For detecting code clones, we implemented some heuristics for the so-called Isomorphism Analysis, in order to measure the isomorphism ratio between malware samples.

4.4.1 The Control Flow Graph
As the first step of this heuristic, we extracted the Control Flow Graph of each application in our data-set without considering external imports. This choice was made with the intention of considering only internal structural properties, because our study focuses on similarities among malware variants. Considering only the internal structural properties of an application, and not external or third-party libraries, means less entropy to analyze, under the assumption that external libraries vary from sample to sample. For extracting the CFG we used the AndroGuard suite, exporting the CFG in gexf format. Afterwards, all the extracted CFG's are stored in a local filesystem for further analysis.
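Returning for a moment to the Opcodes Frequency comparison shown in the tables above, the following is a minimal, self-contained sketch of a Cosine Similarity computation over two relative opcode distributions. It is illustrative only (not the code of the tool described later); the few example values are taken from Table 4.14, everything else is an assumption of this sketch.

use strict;
use warnings;

# Cosine similarity between two relative opcode frequency distributions,
# each represented as a hash: opcode class => relative frequency.
sub cosine_similarity {
    my ($p, $q) = @_;
    my ($dot, $np, $nq) = (0, 0, 0);
    my %all = map { $_ => 1 } (keys %$p, keys %$q);
    for my $op (keys %all) {
        my $x = $p->{$op} // 0;
        my $y = $q->{$op} // 0;
        $dot += $x * $y;
        $np  += $x * $x;
        $nq  += $y * $y;
    }
    return 0 unless $np && $nq;
    return $dot / (sqrt($np) * sqrt($nq));
}

# A few entries of the candidate vector and of the 10th returned vector (Table 4.14).
my %candidate = ('move' => 0.02791, 'returnvoid' => 0.16279, 'goto' => 0.15349, 'iftestz' => 0.28372);
my %vector10  = ('move' => 0.03096, 'returnvoid' => 0.16634, 'goto' => 0.16271, 'iftestz' => 0.26219);
printf "Similarity Score: %.5f\n", cosine_similarity(\%candidate, \%vector10);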
We used the Gephi software to visualize and interact with the CFG:

Figure 4.16: Android Dragon Quest game CFG in Gephi

For each node we have some useful information, such as the class, the method name and the specific opcodes it implements:

Listing 4.1: Node properties
<node id="15" label="Ljp/waraerudouga/HttpPostTask;-access$3-BB@0x0-(Ljp/waraerudouga/HttpPostTask; Ljava/lang/String;)V">
<att type="string" name="classname" value="Ljp/waraerudouga/HttpPostTask;"/>
<att type="string" name="name" value="access$3"/>
<att type="string" name="descriptor" value="(Ljp/waraerudouga/HttpPostTask; Ljava/lang/String;)V"/>
<att type="integer" name="offset" value="0"/>
<att type="string" name="node.label" value="access$3\nreturn-void"/>
</node>

After extracting the CFG and the FCG for the data-set, we map these graphs into the vector space using Adjacency Lists, so that machine learning techniques can be applied to detect Android malware. Once the CFG's are extracted, we are able to identify "common patterns" in the control flow: for the identification we first map the CFG into the vector space and then apply an n-grams analysis. In this way we are able to identify code clones or, in any case, "common patterns" as a proof of code reuse. The more n-grams are found, the more the Similarity Score increases.

4.4.2 The Function Call Graph
What we have already seen for the CFG also holds for the Function Call Graph. We extracted the Function Call Graph of each application in the data-set, again without considering external imports, using the AndroGuard framework. Each FCG is then stored in a local db for further analysis. It is also possible to analyze it manually with the Gephi tool. Below we can observe a detail of the Call Graph of the Dragon Quest game for Android:

Figure 4.17: Android Dragon Quest game FCG in Gephi

Our research also uses the Call Graph structural properties to find similarities among malware variants: as for the CFG, in a code-reuse scenario it is possible that some methods, or some behaviors, are cloned between malware families. We then apply an n-gram analysis to identify "common patterns" in the Call Graph; the more n-gram patterns are found, the more the similarity score increases.

4.4.3 Vector-Space Mapping
To compute the "similarity" between Call Graphs more efficiently, we introduce a vector transformation, just as we introduced a transformation into the vector space for the Opcodes Distribution. We mapped all the CFG's and FCG's into the vector space with the use of the so-called Adjacency Lists. In graph theory and computer science, an adjacency list representation of a graph is a collection of unordered lists, one for each vertex in the graph; each list describes the set of neighbors of its vertex. We use an Adjacency List instead of an Adjacency Matrix because, for a sparse graph, an adjacency list is significantly more space-efficient: the space usage of an adjacency list is proportional to the number of edges and vertices in the graph, while for an adjacency matrix the space is proportional to the square of the number of vertices.
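As an illustration only (assuming a graph given as a list of directed edges between node ids; the data below is made up), an adjacency list can be kept as a hash of arrays, whose size grows with the number of vertices and edges rather than with the square of the number of vertices:

use strict;
use warnings;

# Directed edges (from node id, to node id) of a small example graph.
my @edges = ([15, 16], [15, 42], [16, 42], [42, 15]);

# Adjacency list: node id => reference to the array of its neighbors.
my %adj;
for my $e (@edges) {
    my ($from, $to) = @$e;
    push @{ $adj{$from} }, $to;
    $adj{$to} ||= [];          # make sure sink nodes also appear
}

for my $node (sort { $a <=> $b } keys %adj) {
    print "$node => ", join(',', @{ $adj{$node} }), "\n";
}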
With regard to the Architecture, we can notice that, before each similarity computation, the structural data (CFG, FCG, Opcodes Frequency Distribution) are mapped into the vector space:

Figure 4.18: Structural data mapped in the Vector-Space

In regard to the Architecture, first of all the system returns the 10 vectors most similar to the candidate apk; from the Flow Graphs of these, it builds the corresponding 10 Adjacency Lists and provides them to the next stage, the n-gram analysis. The building of each Adjacency List is done in central memory using a hash map data structure; only when it is complete is it stored on the local storage. Below we show the construction of our Adjacency List for the malware sample 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde. As you can notice, it is slightly different from a "standard" Adjacency List: on the left we have the node id while, on the right, we have the node value (Dalvik opcode) followed by the values of the nodes connected to it.

Listing 4.2: 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde sample Adjacency List
'127' => 'goto,return',
'1801' => 'goto/16,goto',
'1648' => 'if-eqz,invoke-virtual,goto',
'882' => 'goto,goto',
'1050' => 'if-eqz,goto,move',
'1178' => 'if-eqz,if-eqz,if-nez',
'898' => 'if-eqz,const/4,goto',
'798' => 'goto,return-object',
'1387' => 'if-lez,if-lez,return-object',
'1031' => 'invoke-virtual,return-void',
'512' => 'move,return',
'463' => 'if-nez,move-result-object,if-eqz',
'517' => 'goto,return',
'458' => 'goto,return-object',
'451' => 'move-object,return-object',
'1785' => 'goto,return-void',
'454' => 'goto,return-object',
'1364' => 'if-eqz,if-eqz,return-void',
'1408' => 'if-nez,iput-object,if-eqz',
'1250' => 'sput-boolean,if-eqz',
'1363' => 'iput-boolean,if-eqz',
'722' => 'if-eqz,if-eqz,goto',
'634' => 'iput-object,invoke-direct/range',
'578' => 'goto,return-void',
'2067' => 'goto,return-object',
'1630' => 'if-eqz,iput-object,if-eqz',
'695' => 'goto,return-object',
'1244' => 'if-eqz,if-eqz,goto',
'1638' => 'if-eqz,iput-object,return-void',
'2144' => 'invoke-virtual,return-void',
'742' => 'monitor-exit,return-void',
'378' => 'goto,if-eqz',
'889' => 'goto,if-lt',
'350' => 'invoke-interface,goto/16',
'1969' => 'if-le,invoke-virtual/range,goto/16',
'1233' => 'if-eqz,if-lez,goto/16',
'540' => 'if-eqz,if-ne,goto',
'1615' => 'if-eqz,if-nez,if-eqz',
'1162' => 'if-nez,sput-object,if-nez',
...

We have to make a consideration: we do not need to reconstruct the whole graph, since our study only takes into consideration "identical" structural connections. In fact, the Adjacency List of the candidate apk is created and then the Adjacency List of each of the 10 similar vectors returned by the system is extracted. At this point the Adjacency Lists are sorted, and after that the system starts the similarity search to identify "common" shared structures. This analysis is conducted for both the CFG and the FCG.

Figure 4.19: Adjacency Lists preparation and similarities search
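A minimal sketch of this preparation step follows, under the assumption that the node opcodes and the edges have already been extracted from the gexf files; the node ids and opcodes are only illustrative, and this is not the tool's actual code.

use strict;
use warnings;

# Node id => Dalvik opcode of the node (illustrative values).
my %opcode = (127 => 'goto', 517 => 'return', 882 => 'goto', 883 => 'goto', 1801 => 'goto/16');
# Node id => ids of the nodes it is connected to (illustrative values).
my %edges  = (127 => [517], 882 => [883], 1801 => [882]);

# Build the value-based adjacency entries: the node opcode followed by
# the opcodes of its neighbors, as in Listing 4.2 (values, not ids).
my @entries;
for my $id (keys %edges) {
    my @neighbour_values = map { $opcode{$_} } @{ $edges{$id} };
    push @entries, join(',', $opcode{$id}, @neighbour_values);
}

# The entries are then sorted before the similarity search stage.
my @sorted = sort @entries;
print "$_\n" for @sorted;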
Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 148 Following we present the Candidate apk CFG Adjacency List extracted and sorted, ready for the n-gram Analisys stage: Table 4.15: Candidate apk CFG Adjacency List extracted and sorted return-void,goto check-cast,if-ne, const-string,if-eqz, const-string,if-nez, const-string,return-object, const/,check-cast, const/,goto, const/,if-eq, const/,if-lt, const/,if-nez, const/,move-object, const/,return, const/,return-object, goto,const/, goto,goto, goto,if-eq, goto,if-eqz, goto,if-gtz, goto,if-lt, goto,if-nez, goto,move, goto,return, goto,return-object, goto,return-void, goto/,if-lt, goto/,return, goto/,return-object, goto/,return-void, if-eq,goto,if-eq, if-eq,if-eqz,return-void, if-eq,if-lt,return, if-eqz,const/,goto, if-eqz,const/,return-void, if-eqz,goto,goto, if-eqz,goto,if-eqz, if-eqz,goto/,goto/, if-eqz,goto/,return-void, Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 149 if-eqz,if-eq,return-void, if-eqz,if-eqz,if-eqz, if-eqz,if-eqz,return-void, if-eqz,if-ge,goto, if-eqz,if-le,if-eqz, if-eqz,if-ne,if-eqz, if-eqz,if-ne,invoke-static, if-eqz,if-ne,return, if-eqz,if-nez,goto/, if-eqz,if-nez,if-eqz, if-eqz,if-nez,invoke-static, if-eqz,if-nez,move, if-eqz,if-nez,return, if-eqz,if-nez,return-object, if-eqz,if-nez,return-void, if-eqz,invoke-direct,if-eqz, if-eqz,invoke-static,goto, if-eqz,invoke-static,if-eqz, if-eqz,invoke-virtual,if-eqz, if-eqz,invoke-virtual/range,if-eqz, if-eqz,move,goto, if-eqz,return-void,if-nez, if-ge,goto,const/, if-gtz,return-object,goto, if-le,goto,if-eqz, if-lez,move-result-object,return-object, if-lt,goto,goto, if-lt,if-nez,goto/, if-lt,return,if-eq, if-ne,if-eqz,move, if-ne,if-nez,if-eqz, if-ne,invoke-static,goto, if-ne,invoke-static,if-eqz, if-ne,move,goto, if-ne,throw,return, if-nez,const-string,goto, if-nez,const-string,if-eqz, if-nez,const-string,if-nez, Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 150 if-nez,const-string,return-object, if-nez,const/,goto, if-nez,const/,if-eqz, if-nez,goto/,return-object, if-nez,if-eqz,goto, if-nez,if-eqz,if-eqz, if-nez,if-eqz,if-ne, if-nez,if-eqz,if-nez, if-nez,if-eqz,invoke-virtual/range, if-nez,if-eqz,return-void, if-nez,if-ne,return, if-nez,if-nez,goto, if-nez,invoke-interface,return, if-nez,invoke-interface,return-object, if-nez,invoke-static,goto, if-nez,invoke-static,move, invoke-direct,if-eqz, invoke-interface,return, invoke-interface,return-object, invoke-static,const/, invoke-static,if-eqz, invoke-static,move, invoke-static,return, invoke-static,return-void, invoke-virtual,return, invoke-virtual/range,if-eqz, move,goto, move,if-eqz, move,if-gtz, move,if-nez, move,return, move-object,const/, move-result-object,return-object, throw,goto, Chapter 4. 
Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 151 Following we present the Candidate apk FCG Adjacency List ready for the n-grams Analisys stage: Table 4.16: Candidate apk FCG Adjacency List extracted sendNotify awardPointsHelper closeOfDialog showPushDialog onDestroy isValid FinishApplicationRunnable d Initialize drawGif showOffers forTab updateOnlineConfig handleConnectResponse bindNewApp g access$300 getPushAdDateFromServer getDisplayAdDataFromServer e onError access$100 access$500 getAnimationList2 ¡init¿ onSurfaceDestroyed drawFrame onPostExecute flush q b ¡init¿,a ¡init¿ submit initNotityData,¡init¿,¡init¿,getInstanceNoConnect, notify receiver access$0 getConfigParams, isEn, edit, get ¡init¿, a j, h, i, k, n, p, l, showUpdateDialog, ¡init¿, ¡init¿ getWapsInfo, ¡init¿, edit access$2, access$2 showOffers forTab a, f buildDocument, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue getNewAppInfo n, h, a getDownload ¡init¿ ¡init¿ A, f, r, i, g, h, a, a, h, k, f, f, h, g a, ¡init¿ updateResultsInUi buildResponse getAnimation ¡init¿, ¡init¿, ¡init¿, ¡init¿, updateOnlineConfig, update, setUpdateOnlyWifi, getConfigParams, a, ¡init¿, onSharedPreferenceChanged, ¡init¿, ¡init¿, a, ¡init¿, ¡init¿, ¡init¿, ¡init¿, a, ¡init¿, a, ¡init¿, ¡init¿, ¡init¿, getWapsInfo, ¡init¿, ¡init¿, ¡init¿, get, edit, ¡init¿, ¡init¿, ¡init¿, getInstance, getInstance, getPoints, getParams, getParams, getParams, ¡init¿ access$0 access$0, drawGif, drawTouchPoint, drawGif a e ¡init¿, a a, c, a, a, a, h, p, k, n, a, b, d, b, c, c, b, h, ¡init¿, ¡init¿, b, a, d, f, d, a, a, a, a, a, ¡init¿, ¡init¿ Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 152 Table 4.17: Candidate apk FCG Adjacency List extracted DisplayAd onVisibilityChanged showFeedback pushInfo access$2100 access$800 getNetConfig showADS handleMessage getAnimation getParams getPushAd showNewApp updateResultsInUi getWAPS ID doInBackground startRotation onCreate getView access$2700 getAnimationList handleGetPointsResponse onReceivedError handleSpendPointsResponse onResume access$2600 access$2400 getPointsHelper ExitDialog onPause startAnimation getDownload update ¡init¿, DisplayAd, showADS access$0, drawFrame showFeedback forTab ¡init¿, get, getInstance, setPushAudio, getInstance, getPushAd handleGetPointsResponse UpdateDialog getConfigParams, getConfigParams, get, get, edit getInstance, getDisplayAd a, b, c, a ¡init¿ initMetaData, getInstance, getParams getPushAdDateFromServer, getPushAd, getInstance, getPushAd ¡init¿, ¡init¿, ¡init¿, bindNewApp, ¡init¿ ¡init¿, startAnimation, ¡init¿ getInstance, getWapsId a, a, a, a, a, a, a, a, a ¡init¿ a, a, b, a, a, a, a, ¡init¿, c, f, g, ¡init¿, ¡init¿, a, i, ¡init¿, initMetaData, initNotityData, getAlreadyInstalledPackages, showNewApp, Initialize, ¡init¿, ¡init¿, ¡init¿, getInstance, getInstanceNoConnect, notify receiver, getInstanceNoConnect, notify receiver, ¡init¿ access$1500, access$1500, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB, getARGB sendNotify getAnimation buildDocument, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue access$502 buildDocument, getNodeTrimValue, getNodeTrimValue onResume, a, b, onResume, ¡init¿, onResume, getInstance, getPoints handleAwardPointsResponse handleSpendPointsResponse ¡init¿ ¡init¿, ¡init¿ onPause, ¡init¿, onPause getAnimationList2, 
getAnimationList ¡init¿, ¡init¿, ¡init¿, a, b a, p, a, a, b Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 153 access$400 access$3000 shouldOverrideUrlLoading getDisplayAdResponseFailed onPageFinished a initFeedback getValid drawCube access$2000 buildResponse getPoints toMD5 onPreferenceChange enterPage showMessage onPageStarted t handleConnectResponse buildResponse i, access$202, access$200, getInstanceNoConnect, package receiver w access$600 b, b, a, a, a, a, A, o, c, b, j, a, r, b, c, b, l, a, a, a, l, a, b, d, b, a, a, c, o, s, k, p, a, a, a, a, ¡init¿, a, b, a, e, a, a, a, a, a, a, a, a, ¡init¿, a, a, a, b, a, a, a, a, a, a, a, ¡init¿, ¡init¿, access$300, access$100, access$200, access$400, access$500, ¡init¿, ¡init¿, access$100, access$2500, access$300, access$2600, access$2200, access$300, access$100, access$2100, access$2200, access$1600, access$1000, access$1000, access$1000, access$300, access$1700, access$100, access$1800, access$1600, access$1900, access$2000, access$300, access$400, access$100, access$2300, access$300, access$2400, access$2200, access$100, access$200, access$200, access$300, access$400, access$500, access$600, access$600, access$700, access$100, access$800, access$900, access$1000, access$1000, access$1100, access$1202, access$1200, access$1400, access$1500, access$100, access$902, access$1500, access$100, access$902, access$2900, access$300, access$3000, getInstanceNoConnect, notify receiver, getInstanceNoConnect, notify receiver a, a get drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine, drawLine toMD5 ¡init¿, a, ¡init¿, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, decodeBase64, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getView, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, decodeBase64 getPointsHelper toHexString access$0, isValid onEvent ¡init¿ access$600 r Chapter 4. 
Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 154 getInstance showNotifyList s c showMore onCreateEngine notifyReceiverHelper onDownloadStart getConfigParams onReceive openFeedbackActivity onSurfaceChanged getWapsInfo onItemClick submit access$2800 packageReceiverHelper onClick onFeedbackReturned showOffers spendPointsHelper notify receiver run getDisplayAd ¡init¿, a, ¡init¿, a, ¡init¿, a, ¡init¿, a, ¡init¿, ¡init¿, ¡init¿, ¡init¿, ¡init¿, ¡init¿, ¡init¿, ¡init¿ bindNewApp r b, b, a, a, d, e, c, ¡init¿, ¡init¿, ¡init¿ showMore ¡init¿ ¡init¿ ¡init¿, access$100, access$200, access$300 r a, getWapsInfo, pushInfo, getWapsInfo, pushInfo, getInstanceNoConnect, package receiver, getInstanceNoConnect, package receiver a, a drawFrame ¡init¿ access$1100 ¡init¿ getPushAd ¡init¿ a, b, c, a, a, a, d, ¡init¿, access$0, access$0, access$0, access$0, access$0, l, m, a, a, a, a, i, a, ¡init¿, c, ¡init¿, access$700, access$700, access$800, access$900, access$1000, access$1000, access$900, access$1000, access$1000, access$1202, access$1202, access$1300, access$1300, access$1400, access$1400, access$900, access$400, access$400, access$200, access$400, access$400, access$400, access$400, access$4, access$5, access$8, access$2, access$7, access$5, access$2, access$7, access$7, access$600, access$600, access$600, access$700, access$700, access$600, access$700, access$800, access$800, access$900, access$600, access$600, access$900, access$600, access$000, getInstance, spendPoints, getInstance, showOffers, access$1400, access$3100, access$1400, access$502, access$502, getInstanceNoConnect, package receiver a showOffers ¡init¿ notifyReceiverHelper ¡init¿, access$0, a, a, a, a, a, a, b, c, ¡init¿, a, drawFrame, d, e, b, access$002, access$000, showADS, access$100, access$0, ¡init¿, access$0, access$1, showMessage, access$2, showMessage, access$2, showMessage, access$2, showMessage, access$4, access$5, showMessage, access$3, showMessage, access$4, access$5, access$6, access$6, access$5, access$7, access$5, edit, access$1400, access$1400, access$2700, access$2800 getDisplayAdDataFromServer Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 155 h setOpenGLContext onEvent f getInstanceNoConnect i setDefaultReportPolicy showMore forTab handleAwardPointsResponse UpdateDialog ¡clinit¿ spendPoints awardPoints showUpdateDialog initMetaData onOffsetsChanged access$1100 package receiver finalize m, n, a, a, e a onEvent, onEvent, onEvent, a, ¡init¿ b, a, b, s, t, n, ¡init¿, ¡init¿ ¡init¿, a, ¡init¿, ¡init¿ b, a, a, j, a, b, p, q a showMore forTab buildDocument, getNodeTrimValue, getNodeTrimValue ¡init¿, ¡init¿ ¡init¿ spendPointsHelper awardPointsHelper g, b, d getWapsId, getWapsId, toMD5 access$1, drawFrame loadApps packageReceiverHelper ¡init¿ Chapter 4. Detecting Android Malware Variants using Opcodes Frequency Distribution and Call Graphs Isomorphism Analysis 156 4.4.4 N-grams Analysis stage After the preparation of the Adjacency Lists, starts the real Isomorphism Analysis on Call Graphs. Until this point we have extracted some structural properties of the apk, mapped them in the Vector-space to better and efficiently represent them, now we need to identify and quantify the common shared sub-structures in the CFG and in the FCG, after that phase a final similarities score is computed. Lets to analyze this phase. 
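As an orientation for what follows, the core of this stage can be sketched as below. This is a simplified illustration under the assumption that each sorted adjacency entry is one token; it is not the actual implementation. Contiguous runs of n entries form the n-grams, and the n-grams shared with each returned vector are counted.

use strict;
use warnings;

# Contiguous n-grams over a sorted list of adjacency entries (one token per entry).
sub ngrams {
    my ($n, $tokens) = @_;
    my %grams;
    for my $i (0 .. scalar(@$tokens) - $n) {
        $grams{ join '|', @{$tokens}[$i .. $i + $n - 1] }++;
    }
    return \%grams;
}

# Number of n-grams shared between the candidate and a returned vector.
sub common_ngrams {
    my ($n, $list_a, $list_b) = @_;
    my $g1 = ngrams($n, $list_a);
    my $g2 = ngrams($n, $list_b);
    return scalar grep { exists $g2->{$_} } keys %$g1;
}

my @candidate = ('goto,return', 'goto,return-object', 'if-eqz,goto,move', 'move,return');
my @returned  = ('goto,return', 'goto,return-object', 'if-eqz,goto,move', 'throw,goto');
print 'Common 2-grams: ', common_ngrams(2, \@candidate, \@returned), "\n";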
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. With the sorted Adjacency Lists, it is now possible to detect the contiguous sequences of the CFG, or of the Call Graph, that are in common. In this way we are able to identify common subgraphs: if the candidate apk has a valuable "portion" of its graph identical to one of the 10 most similar vectors returned (the 10 most similar apks), we can consider this portion as malicious, and it can probably be regarded as a payload portion. We consider a 7-gram analysis on the CFG Adjacency Lists and a 4-gram analysis on the FCG Adjacency Lists. The choice of these n values is purely empirical: we noticed that a small gram sequence leads to a much higher number of false positives, so we decided to choose a larger n in order to consider only "relevant" portions of the graph. Below is represented the system architecture up to the n-gram analysis phase, omitting for now a second methodology for the identification of similarities, the method-level clone detection:

Figure 4.20: Architecture. N-gram analysis

Below are shown the results of the n-gram analysis phase for both the CFG and the FCG Adjacency Lists of each returned vector. This is the result of the n-gram analysis for the CFG Adjacency List:

4.4.4.1 N-grams Analysis for each of the 10 returned vectors

Table 4.18: n-grams on the Adjacency Lists of the 1st returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
40ec5ac7d23316b2d3ae674efcb9a48073ac08082aeaa8f9d5f4945d2e1ae4d3
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 63   FCG 4-Grams Total: 25

Table 4.21: n-grams on the Adjacency Lists of the 4th returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
4465602bd2c37dff6819798831828ac0cbb70614874afb179a8efbf26e538411
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 63   FCG 4-Grams Total: 25

Table 4.20: n-grams on the Adjacency Lists of the 3rd returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
85f5513112800588fcd3b3cb59a4eeb2b25f53b1b3070a9e3a7d447a5cd45281
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 64   FCG 4-Grams Total: 26

Table 4.19: n-grams on the Adjacency Lists of the 2nd returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 64   FCG 4-Grams Total: 28

37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
212aa193b9f5a688a650366970c02987c99d07f6b1e50f04c510bfb74f7062f1
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 54   FCG 4-Grams Total: 22

Table 4.25: n-grams on the Adjacency Lists of the 8th returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
12780c0ed735628dc21a5c0e4783e82c017a6ad999fd74e29420f5c1a598ca6c
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 63   FCG 4-Grams Total: 23

Table 4.24: n-grams on the Adjacency Lists of the 7th returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
85f5513112800588fcd3b3cb59a4eeb2b25f53b1b3070a9e3a7d447a5cd45281
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 63   FCG 4-Grams Total: 26

Table 4.23: n-grams on the Adjacency Lists of the 6th returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
8f617e67ea6bdc4daf680875de10b3df79ec141780b1ff7604155a66021fee76
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 63   FCG 4-Grams Total: 25

Table 4.22: n-grams on the Adjacency Lists of the 5th returned vector

37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
2a1582f742edeb2ba7d0a90c0bf4358643612a38544d13a297c2ea5f6b7d86ce
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 12   FCG 4-Grams Total: 5

Table 4.27: n-grams on the Adjacency Lists of the 10th returned vector
37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde
da07316645455a02b2f9eda79662c585aee96e9ab06aeb139fb895b9ea89238a
Sim: 1   Family: Gappusin   CFG 7-Grams Total: 18   FCG 4-Grams Total: 3

Table 4.26: n-grams on the Adjacency Lists of the 9th returned vector

The n-grams analysis has detected the "common (maximum) subgraph" of the candidate with known instances of some malware families. This behavior means only one thing: the candidate adopted or, more precisely, cloned one or more malware behaviours. As stated by the previous tables, the first 8 returned vectors have a Similarity Score of "1" for the Opcodes Frequency Distribution, which means that the first 8 vectors are "almost the same". For these, the n-gram analysis returns the whole adjacency list both for the CFG and the FCG; vector #1 is the same sample, with the name 37d2af6588343d813bac06d8db964491e04cd1fb933dfcdd7e940c0c47975bde. In fact, all of these belong to the Gappusin family. Vector #9 has a slightly lower Similarity Score than the first 8 (0.99890), but it still remains a high score. This result shows that the #9 vector is quite similar to the candidate: the n-gram analysis stage returns 18 entries out of 64 for the CFG adjacency list and 3 entries out of 28 for the FCG adjacency list, which means that vector #9 has some behaviour in common with the candidate or, in other words, that there is a portion of vector #9 cloned by the candidate. The same holds for vector #10. The candidate has almost the whole CFG/FCG in common with vectors from 1 to 8; with vector #9 it shares a common subgraph for the FCG:

Table 4.28: Common Subgraph with the n. 9 vector
access$2100: handleGetPointsResponse
access$2400: handleSpendPointsResponse
access$2600: handleAwardPointsResponse
access$2700: sendNotify
access$2800: getPushAd
access$3000: buildResponse
access$300: getDownload
access$400: handleConnectResponse
handleAwardPointsResponse: buildDocument, getNodeTrimValue, getNodeTrimValue
handleConnectResponse: buildDocument, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue
handleGetPointsResponse: buildDocument, getNodeTrimValue, getNodeTrimValue, getNodeTrimValue
handleMessage: a,b,c,a

Below are shown the "Common Subgraphs" detected by the n-grams Analysis stage.

Figure 4.21: Common Subgraphs with the vector n. 9 detected by n-grams analysis stage

With vector #10 the candidate shares the following common subgraph for the FCG:

Table 4.29: Common Subgraph with the n. 10 vector
access$2100: handleGetPointsResponse
access$2400: handleSpendPointsResponse
access$2600: handleAwardPointsResponse
access$2700: sendNotify
access$2800: getPushAd
access$3000: buildResponse
access$300: getDownload
access$400: handleConnectResponse
getAnimationList2: getAnimation
getAnimationList: getAnimation
getConfigParams: r
getDisplayAd: getDisplayAdDataFromServer
onError: a,<init>
onEvent: onEvent,onEvent,onEvent,a,<init>
onFeedbackReturned: a
onItemClick: access$1100
startAnimation: getAnimationList2,getAnimationList
startRotation: <init>
submit: <init>
t: r

Figure 4.22: Common Subgraphs with the vector n. 10 detected by n-grams analysis stage

4.4.5 The Method-level similarity analysis stage. Type I and II clones detection at method-level
We also implemented a more fine-grained analysis stage called the Method-level similarity analysis stage. The idea behind it is to identify clones at method level rather than at class level. With this analysis we can identify reused behaviours coming from a well-known malware family. This analysis stage runs in parallel with the n-gram analysis stage, as we can see from a slice of the architecture:

Figure 4.23: Architecture. Method-level similarity analysis stage

After the vectors are returned, two distinct fundamental analysis stages are applied: the n-gram analysis stage, analyzed previously, for the identification of common subgraphs, and the Method-level similarity analysis stage for the identification of code clones at method level. Both these analysis stages provide a contribution to the next stage, the classifier, for the computation of the final similarity scores. Let us examine this stage: first, for each method identified in the candidate, an Opcode Frequency Distribution for that method is computed and then compared with those of each returned vector to identify method clones. The choice of comparing an opcode distribution instead of the method names is dictated by the fact that, in a code-reuse scenario, method names and, more generally, "identifiers" can be changed or encrypted, while an identical opcodes distribution still remains indicative of a code clone. This analysis is resilient at least to Type I and Type II clones, but not to Type III and Type IV: Type I and II clones still have the same opcodes distribution, which is not the case for Type III and IV. It could be made more resilient to Type III and Type IV code clones with a small improvement, namely computing a similarity or distance score between the opcodes distributions; at the moment the opcodes distributions are strictly compared in their entirety, so if (in the case of a Type III or IV clone) a small variation is present, the result is "no match" for that distribution.
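A simplified sketch of this method-level comparison follows (illustrative only, not the actual implementation; the exact-match criterion reflects the Type I/II behaviour described above, and the example opcodes are made up):

use strict;
use warnings;

# Relative opcode frequency distribution of a single method,
# computed from its list of opcode classes.
sub method_distribution {
    my @opcodes = @_;
    my %count;
    $count{$_}++ for @opcodes;
    my $total = scalar @opcodes;
    return { map { $_ => $count{$_} / $total } keys %count };
}

# A canonical string key for a distribution, so that identical
# distributions (Type I/II clones) compare equal.
sub distribution_key {
    my ($dist) = @_;
    return join ';', map { sprintf '%s=%.5f', $_, $dist->{$_} } sort keys %$dist;
}

my $candidate_method = method_distribution(qw(if-eqz goto return-void goto));
my $known_method     = method_distribution(qw(goto if-eqz goto return-void));

print "method clone detected\n"
    if distribution_key($candidate_method) eq distribution_key($known_method);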
4.5 The Classifier
At this point each analysis stage produces a score: a similarity score indicating the "isomorphism ratio" between samples, i.e. the ratio of the portion cloned by the candidate. Based on this portion we can attribute our candidate to one of the malware families present in the data-set, or to more than one family. This task is under the responsibility of the Classifier module. As we can see from the Architecture, the last stage is the Classifier, which is responsible for acquiring the metrics of each stage and providing a final Similarity Score representing the membership to one or more malware families. The detection of the malware variants is a threshold analysis based on the isomorphism ratio between malware samples. Let us analyze our implementation of the Classifier. At this phase we have 4 different normalized metrics for each of the most similar vectors returned:

Table 4.30: The Classifier Analysis stage
1 Relative Opcodes Frequency Distribution Sim. Score
2 # of 7-grams CFG
3 # of 4-grams FCG
4 # of Methods Similarity Ratio

Our Classifiers consist of formulas, embedded in software, that take into account these 4 scores; the classifier can also be calibrated on the basis of the experimental phase, and this flexible architecture allows us to have other Classifiers as well. We make some considerations about this. Our methodology is, however, guided by the Opcodes Frequency Distribution analysis, because each of the following analysis stages takes into consideration only the 10 returned vectors (the most similar vectors); starting from these, the n-gram analysis is useful to detect code clones and the Method-level similarity analysis stage shows us the # of methods cloned. The formula also has to take into account the # of n-grams detected in the Call Graphs: if there is not even one n-gram token, either for the CFG or for the FCG, then the Similarity Score is 0, which means there is no similarity between the candidate and that malware sample of the dataset. Our Classifier consists of the 2 similarity scores defined below.

The I Similarity score is (the values are first normalized):

I Sim. Score = max(# of 7-grams CFG, # of 4-grams FCG) * methodSimRatio

In this formula we require that at least one n-gram exists for the CFG or the FCG; the maximum of the two n-gram values is then multiplied by the methodSimRatio, which lies in [0,1]. In this formula we do not take into account the Relative Opcodes Frequency Distribution Sim. Score because the 10 returned vectors are already sorted by that score.

The II Similarity score is (the values are first normalized):

II Sim. Score = OpCodeSimScore * (# of 7-grams CFG) * (# of 4-grams FCG) * methodSimRatio

In this case we multiply all the scores; this Similarity Score is close to 1 when all the scores are close to 1 (and equal to 1 when the app is exactly a clone). The goodness of these scores will be evaluated in the experimental chapter.
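As a minimal illustration of the two scores (a sketch, not the code of DescentDroid.pm: the numeric values are made up, and normalizing the n-gram counts by the size of the candidate's adjacency lists is an assumption of this sketch):

use strict;
use warnings;
use List::Util qw(max);

# Normalized inputs for one returned vector (illustrative values).
# NOTE: dividing the n-gram counts by the size of the candidate's
# adjacency lists is an assumption of this sketch.
my $opcode_sim   = 0.99890;
my $cfg_7grams   = 18 / 64;
my $fcg_4grams   = 3 / 28;
my $method_ratio = 0.5;

my ($sim1, $sim2) = (0, 0);
if ($cfg_7grams > 0 || $fcg_4grams > 0) {   # no shared n-gram at all => similarity 0
    $sim1 = max($cfg_7grams, $fcg_4grams) * $method_ratio;
    $sim2 = $opcode_sim * $cfg_7grams * $fcg_4grams * $method_ratio;
}
printf "I Sim. Score: %.5f   II Sim. Score: %.5f\n", $sim1, $sim2;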
4.6 Real System implementation
The system consists of a few Perl scripts:
• method.pl, which is responsible for extracting and storing the Opcodes Frequency Distribution at method level.
• AdjList.pl, which is responsible for extracting and storing the Adjacency Lists for both the CFG and the Call Graph, and also for the extraction of the Opcodes Frequency Distribution.
• DescentDroid.pm, which is the core framework. It takes the candidate apk as input and performs all the analyses based on the Opcodes Frequency Distribution and on the isomorphism between Call Graphs.
Let us analyze the usage of these tools.

method.pl
It is necessary to execute the script as perl method.pl, with no arguments. It takes all the malicious samples in the dataset, computes the method-level Opcodes Frequency Distribution and stores it in the method level sim table. This is a screenshot of the script execution:

Figure 4.24: method.pl execution phase

As we can see from the screenshot, the part underlined in yellow is the sample id (the sample under analysis at that moment), the part underlined in red is the family membership and the part underlined in blue is the method on which the tool computes the Opcode Frequency Distribution. Below each entry are the opcodes of that method.

AdjList.pl
The script takes one argument, with which we specify whether we want to extract the Opcodes Frequency Distribution or the Adjacency Lists: perl AdjList.pl -freq for the frequency distribution computation, or perl AdjList.pl -adjxxx for the adjacency lists extraction.

Figure 4.25: AdjList.pl -freq execution phase

With the flag -adjCFG or -adjFCG it is possible to extract the related Adjacency List:

Figure 4.26: AdjList.pl -adjFCG execution phase

The part underlined in yellow is the sample id under analysis, while the red part is the extracted Adjacency List for the Call Graph.

Figure 4.27: AdjList.pl -adjFCG execution phase

The part underlined in yellow is the sample id, while the red part is the CFG Adjacency List. On the left, underlined in blue, is the node id.

4.6.0.1 DescentDroid.pm
The previous scripts are responsible for the extraction of the metrics from the dataset and for their storage. In this way we are able to build a local database of signatures, a kind of gold standard for our further analysis. DescentDroid.pm is the module where we implemented our methodology: it takes a candidate apk as input and then classifies it. Below are shown some screenshots captured during the execution of the system:

Figure 4.28: Execution phase

It is also possible to enable a debug option to show the shared portions of the Adjacency Lists:

Figure 4.29: DescentDroid Execution phase
Figure 4.30: DescentDroid Execution phase
Figure 4.31: DescentDroid Execution phase, with debug option

Chapter 5
Experimental Phase

5.1 Focus of the Experiments
Before defining a strategy for the experimental phase, we have to take into account that the idea behind this methodology is to detect a probable malware variant in the candidate apk. We apply this methodology to a set of malware samples chosen randomly before the training of our dataset. The main and first focus is to evaluate the Accuracy and the Precision & Recall on 1000 malware samples, evaluating the Precision & Recall of the Classification as a MultiClass Classification Problem and also evaluating the quality of the information produced. This "information" can also be considered as "payload presence signs". A secondary focus is to evaluate the effectiveness in detecting Android malware, in terms of total Precision & Recall. We use the Weka tool.
Weka is a collection of machine learning algorithms for data mining tasks; it contains tools for data preprocessing, classification, regression, clustering, association rules and visualization.

5.2 Focus I: Performance evaluation: A MultiClass Classification Problem
Our main focus is to evaluate the Accuracy, as the overall correctness of the malware Classification process, and the Precision & Recall of the Classifier for each class. We set up the problem as a MultiClass Classification Problem, where for a given input instance there is exactly one correct output class, selected out of several classes. Our evaluation will also be qualitative, in the sense that we evaluate the type and the quality of the information produced, in the perspective that this methodology can also be used to build "signs" that other tools can use. In this evaluation we try to answer the following Research Questions:
• RQ 1) What is the Classification Accuracy of the proposed methodology?
• RQ 2) What is the Classification Effectiveness of the proposed methodology?
• RQ 3) Is there a relationship between the effectiveness of Classification and the size of the trained dataset?

5.2.1 Characterization
We consider 1000 malicious samples taken randomly from the Drebin project before the training step. In this way we are sure that these 1000 samples are malware variants of well-known malware families. Then we submitted, with a batch script, all these 1000 samples to the DescentDroid tool; after collecting and storing all the extracted metrics in a local db we are able to start classifying malware variants. At this point we evaluate the Malware Classifier Accuracy and the Precision & Recall of the Classifier for each class. This is the characterization of the dataset used:

5.2.2 Evaluations Detail
Starting from 1000 malware variant samples, DescentDroid is able to properly classify 979 malware samples:

Figure 5.1: 979 malware variants properly classified out of 1000 malware samples

Below is shown a portion of the table reporting these results; the table is composed of the following fields:
• id: The id of the malicious sample being analyzed.
• similar: The most similar apk in the dataset.
• family: The family membership.
• OpcodeSim: The Relative Opcodes Frequency Distribution Similarity.
• CFG 7gramsC: The total count of 7-grams identified on the CFG.
• FCG 4gramsC: The total count of 4-grams identified on the FCG.
• methodLSimRatio: The Similarity Score at method level (indicative of the # of methods cloned).
176 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ID similar 8171962cbfe3dd358215be31644bcec8126995adbd17460f6280e080c3693af3 89653e3db8ed264d6cb8c0456410ff15a26e2f3452b1e0101294e845069bb331 552557183355e34353ab93e7de7fe281132cbc422eefdefcdadcb3af2235c631 b62d69ed54c99ed9a3ad5c9109ec384f3757273f9762afeb151dfcfb860dec24 9631d73c6d30ac906849b2ed9c0e42e1526f6a502690fa7f93888a70320c307d 756826a907cb774d9c71bdd63f8a3c3b30e38820f8811d6b38542d20c9d6ba70 1418642f008b0b623ae0cc627bf3c5214b82a376de21db9fd69cbe18cc211e5d 53bc8bb25e4ec65c6109ae763e0fc22812d56b4f5338f490405fb72e68515552 a88bdbd2170e2e9d1e20d1276fbe8c5cea75746c651e9f77170ed5f5480be5a7 efc6a641cb2a7da75daa60c8c8f2bf4fa4fef41ce640c294c837683374c6d14d fd08cb6e2c865af562e7ea171ede7a9d1165a189bb305f9f7102d58e7dd678d6 c6a25935de07fbfb3424d429b1c942e296b61849b839530f0eede3c52c066290 82aca2743ad212d945af1c56f52a01f1fd9e603baee3841f4d78202bd5732d4c cc25605a7362de48f70b288f851cb12f1f361f66ff5203f09ddc1cd8468c36d0 484804147b90b5ad68e86f35fffe156638a6aae2a783ff2f4c49f2137de9026a b387fee6de2400fb045b100437f82bcaf10dfa605eedebe02d854b59e51bc122 11e78b279cf47f46ec5290e7b79ed72b8992d73184a0fd10c4bdc39960613546 4269e9f03f43d82df992b417e961554caecb80d06ca3b0c1b847a09fd257901f 90e0253334786c6a99be569a7e62ca82fc50fce56045320e2473dd840d67096a 6f3908828424cbbe2e4fc8f1d1d67e538f8813a22942d9a7eb9ee5f74173c553 family FakeInstaller Nyleaker DroidKungFu Jifake Plankton Nickspy FakeInstaller Adrd Jifake Opfake Opfake FakeInstaller FakeInstaller ExploitLinuxLotoor Yzhc Gappusin Yzhc DroidKungFu FakeTimer Gonca Table 5.1: Results detail on 1000 malware samples, Synoptic table 0.99997 0.99582 1.00012 0.96858 0.99999 1.00002 1.00001 1.00012 0.99999 1.00505 1.00000 1.00001 0.99999 0.99820 1.00011 0.99995 1.00001 1.00013 1.00002 0.99921 OpcodeSim 9 5 45 1 3 24 8 66 4 12 23 18 18 14 88 43 2 26 3 12 CFG 7gramsC 1 4 13 0 82 3 2 15 0 0 4 12 5 2 33 18 36 29 0 1 FCG 4gramsC 0.10000 0.10000 1.00000 0.10000 0.54559 1.00000 1.00000 0.24419 1.00000 1.00000 1.00000 1.00000 1.00000 0.10000 1.00000 0.49315 1.00000 1.00000 1.00000 0.10000 methodLSimRatio 177 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ID similar 59d7632c2b19adcd2bab55a37b13fe895ae3d98c335ea3d8f0c359a92c7e2ef4 d3c97661814502dd8d192a020e6bfbc6e9842eb6d39f0a9dc738053825d72274 aa49610e023b1026c924ea1a523ef5c63790c4e5c236437e14ff8a4f988f8d08 76003cc3dfb3de1635d1459e3bebfedd60bb05abaa410c76675bcfb187ee5015 39545b4d7e832ee5f11f9ea141b510f82eaa7fb777808b9be5b4e05fa459e9e2 8e662a20ad2a049b9b11c770d47c2db0c555ae2a863c7eef57a394dae1d40f6b ec06c6f995ae0702802a14336bf563c6c1c196f06e7d8488ef16d2bc94d42c8b a3615dd98e1301f9bf35304cb3253855c455982423a58732ad27beed99b2bda1 fd08cb6e2c865af562e7ea171ede7a9d1165a189bb305f9f7102d58e7dd678d6 7fda6a527f1e9a2dbb918d902a856fc8cb9bf66fb75cd07e51feebd87642db2c 053a5a6b5de01ac16784db09422c3c462bf208f7cbd4cde4b5244393f162e357 9943b22719f7b9543c7fa8f1e03e791d99257689c4d9dc44b01268aad883ed56 5208c091b315c25975719f7bc0a7cc01a62287843eb6299a454605f9bdfd3a91 ba48a05e03073cdbd8230ba9194eb1866d9f790ab0f10d949295e5479825b927 e4eb02b2d64d33e4c0536406bfc9a6d8fcc6b5237642d92333ee3e089bd82723 fd08cb6e2c865af562e7ea171ede7a9d1165a189bb305f9f7102d58e7dd678d6 fd08cb6e2c865af562e7ea171ede7a9d1165a189bb305f9f7102d58e7dd678d6 15e800e9abb449b50372b9b9807b585e546f8efd84c3182e670107bca8c74dde 563e5a2db3a05703c9a1006d726fa8ed2ccabed66a431d180409fb8b5f19406d 4fb6bcac98872b0ee94d3ca258069b6b02cafcb0980c5a44482a950b2ab62b45 Table 5.2 family FakeFlash Opfake Opfake Sakezon Opfake DroidKungFu Opfake 
Plankton Opfake Stealer Opfake FakeInstaller FakeInstaller FakeInstaller DroidKungFu Opfake Opfake Vidro Opfake ExploitLinuxLotoor 1.00001 1.00013 1.00003 1.00002 0.98758 0.99856 1.00015 0.99441 1.00000 1.00004 1.00063 0.99999 1.00000 0.99999 1.00016 0.99996 1.00000 0.99999 0.99997 1.00004 OpcodeSim 25 23 24 14 1 9 24 0 23 22 37 14 19 19 25 20 23 10 20 54 CFG 7gramsC 3 2 2 2 0 7 2 0 4 2 1 8 2 12 39 2 4 7 1 16 FCG 4gramsC 1.00000 1.00000 1.00000 1.00000 0.10000 0.10000 1.00000 0.10000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 0.10000 1.00000 1.00000 1.00000 1.00000 methodLSimRatio

From these results we are able to state that each analyzed sample has been traced back to a most similar malware in our dataset (built from the Drebin dataset) and therefore to its own family, as the direct consequence of the Classification process. This Classification process is guided by some metrics that we can also extract as a direct consequence of the presence of the payload. In the following we evaluate the Accuracy and the Precision & Recall of this methodology in properly classifying malware.

5.2.2.1 Classification Accuracy, Precision & Recall for each Malware Class
In this section we provide an evaluation of the Accuracy and of the Precision & Recall in properly identifying the real malware family memberships. For this we first calculate the Accuracy of the Classification.

Accuracy
Accuracy is the overall correctness of the model and is calculated as the number of correct classifications divided by the total number of classifications:

Accuracy = (tp + tn) / (tp + fp + tn + fn)

and then:

Accuracy = 979 / 1000 = 97.9%

Then we measure the Precision of the model for each class (for each malware family):

Precision
Precision is a measure of the accuracy provided that a specific class has been predicted. It is defined by:

Precision = tp / (tp + fp)

For instance, for the class A:

Precision = TP_A / Total Predicted as A

Recall
Recall is a measure of the ability of a prediction model to select instances of a certain class from a data set; it measures to which extent classes are scattered across clusters. It is defined by the formula:

Recall = Sensitivity = tp / (tp + fn)

For instance, for the class A:

Recall = Sensitivity = TP_A / Total Gold for A

Now we try to address RQ 3, relating the size of the family sample for each class in our trained dataset to its Precision & Recall. This evaluation tries to highlight a possible relationship between the size of the trained dataset and the precision or sensitivity of the Classification process; if such a relationship exists, we have a good degree of freedom in enhancing this process.
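As a small illustration of how the per-class values in the next table can be derived (a sketch over hypothetical gold/predicted labels, not the actual evaluation code):

use strict;
use warnings;

# (gold family, predicted family) pairs for a few hypothetical samples.
my @results = (
    ['Gappusin', 'Gappusin'],
    ['Gappusin', 'Gappusin'],
    ['Opfake',   'Gappusin'],
    ['Opfake',   'Opfake'],
);

my (%tp, %gold_total, %predicted_total);
for my $pair (@results) {
    my ($gold, $predicted) = @$pair;
    $gold_total{$gold}++;
    $predicted_total{$predicted}++;
    $tp{$gold}++ if $gold eq $predicted;
}

for my $class (sort keys %gold_total) {
    my $true_pos  = $tp{$class} // 0;
    my $precision = $predicted_total{$class} ? $true_pos / $predicted_total{$class} : 0;
    my $recall    = $true_pos / $gold_total{$class};
    printf "%-10s Precision: %.5f   Recall: %.5f\n", $class, $precision, $recall;
}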
Experimental Phase 180 Table 5.3: Precision and Recall of the MultiClass Classification Process Class AccuTrack Adrd Adsms Aks Anti BaseBridge BeanBot Biige Boxer Ceshark Coogos Copycat Cosha Dabom Dialer Dougalek DroidDream DroidKungFu DroidRooter DroidSheep ExploitLinuxLotoor FaceNiff FakeDoc FakeFlash FakeInstaller FakePlayer FakeRun FakeTimer Fakelogo Fakengry FarMap Fatakr Fauxcopy FinSpy Fjcon Flexispy FoCobers Fsm GPSpy Gamex Gapev Gappusin Geinimi Generic GinMaster Glodream Gmuse Gonca Hamob Hispo TotSamples 2 3 1 2 1 3 2 2 3 2 2 3 3 1 1 3 13 180 2 4 16 1 3 1 191 3 18 3 4 2 1 3 1 1 2 1 3 1 1 1 1 8 25 1 23 6 1 2 3 1 Precision 1.00000 0.28571 1.00000 1.00000 1.00000 0.23077 1.00000 1.00000 0.50000 1.00000 0.00000 1.00000 1.00000 1.00000 1.00000 1.00000 0.90909 0.95954 0.95954 1.00000 0.78571 1.00000 0.60000 1.00000 0.98413 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 0.75000 1.00000 1.00000 1.00000 1.00000 0.75000 1.00000 1.00000 0.53333 0.83333 1.00000 1.00000 1.00000 1.00000 Recall 1.00000 0.66667 1.00000 1.00000 0.00000 1.00000 1.00000 1.00000 0.33333 1.00000 0.00000 0.66667 1.00000 1.00000 1.00000 1.00000 0.76923 0.92222 0.00000 1.00000 0.68750 1.00000 1.00000 1.00000 0.97382 0.33333 1.00000 1.00000 0.75000 0.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 0.00000 0.75000 0.72000 1.00000 0.69565 0.83333 1.00000 1.00000 1.00000 0.00000 Chapter 5. Experimental Phase Class Iconosys Imlog JSmsHider Jifake Kidlogger Kiser Kmin Koomer Ksapp Lemon LifeMon Luckycat Mania MobileTx Mobilespy Mobinauten Moghava Nandrobox Nickspy NickyRCP Nisev Nyleaker Opfake PdaSpy Penetho Placms Plankton Proreso QPlus Raden RediAssi RootSmart Rooter SMSZombie SMSreg Sakezon SeaWeth SendPay SerBG SheriDroid SmsWatcher Spitmo SpyBubble SpyHasb SpyMob 181 TotSamples 13 6 1 4 2 2 19 1 1 2 1 1 1 19 3 1 1 3 2 1 1 3 188 1 3 3 44 1 1 2 1 2 1 2 12 2 2 8 3 1 1 3 1 3 1 Precision 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 0.95000 0.00000 0.00000 0.00000 1.00000 1.00000 1.00000 0.95000 0.66667 0.66667 0.66667 0.75000 1.00000 1.00000 0.33333 1.00000 0.98947 1.00000 1.00000 1.00000 0.82353 1.00000 1.00000 1.00000 1.00000 0.66667 1.00000 1.00000 0.83333 0.66667 1.00000 0.88889 0.50000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 Recall 1.00000 1.00000 0.00000 1.00000 0.50000 1.00000 1.00000 0.00000 0.00000 0.00000 1.00000 1.00000 1.00000 1.00000 0.66667 0.00000 0.00000 1.00000 0.50000 1.00000 1.00000 0.33333 1.00000 1.00000 1.00000 1.00000 0.95455 1.00000 1.00000 0.50000 1.00000 1.00000 1.00000 1.00000 0.83333 1.00000 0.50000 1.00000 1.00000 1.00000 1.00000 1.00000 0.00000 1.00000 1.00000 Chapter 5. 
Class SpyPhone Spyoo Spyset Stealer Stealthcell Steek Stiniter Tapsnake TigerBot Trackplus TrojanSMS.Denofow TrojanSMS.Hippo Typstu Vdloader Vidro Xsider Yzhc Zitmo Zsone TotSamples 2 1 2 2 2 3 2 1 1 2 1 2 2 3 2 2 10 2 2 Precision 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 0.00000 1.00000 0.33333 1.00000 0.66667 1.00000 1.00000 0.50000 Recall 0.00000 1.00000 1.00000 1.00000 0.50000 1.00000 0.50000 1.00000 1.00000 0.50000 1.00000 0.00000 1.00000 0.33333 1.00000 1.00000 0.90000 1.00000 1.00000

Figure 5.2: Multiclass Classification, Precision in function of the size of the dataset
Figure 5.3: Multiclass Classification, Precision vs the size of the dataset
Figure 5.4: Multiclass Classification, Recall in function of the size of the dataset
Figure 5.5: Multiclass Classification, Recall vs the size of the dataset

5.2.3 Conclusions and Considerations
The Focus I experimentation highlights an important result: this methodology is effective and reliable in properly classifying Android malware variants; reliable because the final precision of the Classification process does not depend on the number of samples in the training set. Now we can give an answer to each Research Question:
• RQ 1) What is the Classification Accuracy of the proposed methodology? As we have seen, the total Accuracy of the Classification process is about 97.9%.
• RQ 2) What is the Classification Effectiveness of the proposed methodology? We reported above the results table with the Precision & Recall for each class; in most cases they tend to 1. For the Classification, this methodology seems to be very accurate and sensitive.
• RQ 3) Is there a relationship between the effectiveness of Classification and the size of the trained dataset? There is no direct relationship between the Precision or Recall and the number of samples of a family in the trained dataset. We can state that the Precision and the Recall are 1 even in cases where a family has few samples. It will be important in the future to investigate other kinds of possible relationships to improve this methodology. This is a synonym of reliability, because it means that we can cluster a malware sample even if we have few samples of that family in our training dataset.

5.3 Focus II: Detection Evaluation of Android Malware
Focus II concerns the evaluation of the detection of malware in terms of Accuracy and Precision & Recall. First and foremost, we conduct these experiments with 3 different classifiers for the Opcodes Frequency Distribution Similarity. Then we try to answer the following Research Questions for each classifier:
• RQ 1) What is the Detection Accuracy of the proposed methodology?
• RQ 2) What is the Detection Effectiveness of the proposed methodology?
• RQ 3) How satisfactory are the extracted data in terms of quality (F-Score, FP/FN rate)?

5.3.1 Characterization
We now consider a 2-class problem, defining the following values:

Table 5.4: Different Classifiers of Opcodes Frequency Distribution Similarity
I classifier     0.99800 CosSim
II classifier    0.99500 CosSim
III classifier   0.99000 CosSim

For each value, after building its confusion matrix, it is possible to estimate:
• TPR: True Positive Rate. Measures the proportion of actual positives which are correctly identified. Also defined as Recall.
• FPR: False Positive Rate. The probability of falsely rejecting the null hypothesis.
• PPV: Positive Predictive Value. The probability that a sample with a positive test truly is malware. Also defined as Precision.
• ACC: Accuracy. A measure of the systematic error, i.e. the proximity of the measurement results to the true value.
• F1-score: A measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score.
• FP/FN: The ratio between False Positives and False Negatives, useful to establish the performance limits of the classifier.
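For reference, a minimal sketch of how these measures are derived from the four cells of a confusion matrix (the counts below are hypothetical, not the values of the tables that follow):

use strict;
use warnings;

# Hypothetical confusion-matrix counts.
my ($tp, $fp, $tn, $fn) = (43, 3, 47, 7);

my $tpr   = $tp / ($tp + $fn);                      # True Positive Rate (Recall)
my $fpr   = $fp / ($fp + $tn);                      # False Positive Rate
my $ppv   = $tp / ($tp + $fp);                      # Positive Predictive Value (Precision)
my $acc   = ($tp + $tn) / ($tp + $tn + $fp + $fn);  # Accuracy
my $f1    = 2 * $ppv * $tpr / ($ppv + $tpr);        # F1-score
my $ratio = $fp / $fn;                              # FP/FN ratio

printf "TPR %.3f  FPR %.3f  PPV %.3f  ACC %.3f  F1 %.3f  FP/FN %.3f\n",
       $tpr, $fpr, $ppv, $acc, $f1, $ratio;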
• PPV: Positive Predictive Value. The probability that a sample with a positive test truly is a malware. Also defined as Precision.
• ACC: Accuracy. A measure of the systematic error, i.e. the proximity of the measurement results to the true value.
• F1-score: A measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score.
• FP/FN: The ratio between False Positives and False Negatives, useful to establish the performance limits of the classifier.

5.3.2 Evaluations Detail

5.3.2.1 I classifier: 0.99800 CosSim

For the first classifier we obtained the following confusion matrix:

Table 5.5: Confusion Matrix for 0.99800
                       Negative Cases    Positive Cases
Predicted Negative     TN: 46.5%         FN: 7.15%
Predicted Positive     FP: 3.5%          TP: 42.85%

Table 5.6: Performance measures for 0.99800 CosSim
Measure      Value
TPR          85.7%
FPR          7.0%
PPV          92.4%
ACC          89.4%
F1-score     0.889
FP/FN        0.489

5.3.2.2 II classifier: 0.99500 CosSim

For the second classifier we obtained the following confusion matrix:

Table 5.7: Confusion Matrix for 0.99500
                       Negative Cases    Positive Cases
Predicted Negative     TN: 45%           FN: 5.85%
Predicted Positive     FP: 5%            TP: 44.15%

Table 5.8: Performance measures for 0.99500 CosSim
Measure      Value
TPR          88.3%
FPR          10%
PPV          89.8%
ACC          89.1%
F1-score     0.890
FP/FN        0.854

5.3.2.3 III classifier: 0.99000 CosSim

For the third classifier we obtained the following confusion matrix:

Table 5.9: Confusion Matrix for 0.99000
                       Negative Cases    Positive Cases
Predicted Negative     TN: 41.8%         FN: 5.4%
Predicted Positive     FP: 8.2%          TP: 44.6%

Table 5.10: Performance measures for 0.99000 CosSim
Measure      Value
TPR          89.2%
FPR          16.4%
PPV          84.4%
ACC          86.4%
F1-score     0.867
FP/FN        1.515

5.3.3 Conclusions and Considerations

• RQ 1) What is the Detection Accuracy of the proposed methodology?

Table 5.11: Accuracy of the three Classifiers
Classifier        Value
0.99800 CosSim    89%
0.99500 CosSim    89%
0.99000 CosSim    86%

• RQ 2) What is the Detection Effectiveness of the proposed methodology?

Table 5.12: Recall of the three Classifiers
Classifier        Value
0.99800 CosSim    85%
0.99500 CosSim    88%
0.99000 CosSim    89%

The lower the threshold on the Opcodes Frequency Distribution similarity (e.g. the 0.99000 classifier), the more samples we are able to identify as malware.

Table 5.13: Precision of the three Classifiers
Classifier        Value
0.99800 CosSim    92%
0.99500 CosSim    90%
0.99000 CosSim    84%

The higher the threshold (e.g. the 0.99800 classifier), the more the extracted data correspond to the exact data.

• RQ 3) How satisfactory are the extracted data in terms of quality (F-Score, FP/FN rate)?

Table 5.14: F-Score of the three Classifiers
Classifier        Value
0.99800 CosSim    88%
0.99500 CosSim    89%
0.99000 CosSim    87%

We obtained F1-Score values between 87% (0.99000) and 89% (0.99500), so we believe that our defined classifiers are still satisfactory.

Table 5.15: FP/FN ratio of the three Classifiers
Classifier        Value
0.99800 CosSim    49%
0.99500 CosSim    85%
0.99000 CosSim    151%

About the FP/FN ratio, we can state that the II classifier is the best choice because its ratio remains below 1; it is still a good trade-off.

We then evaluate the three classifiers in the ROC space:

Figure 5.6: 3 classifiers on the ROC space

A performance comparison of the three classifiers:

Figure 5.7: 3 classifiers - Performance Comparison
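The measures reported in Tables 5.6, 5.8 and 5.10 and the ROC coordinates of Figure 5.6 follow directly from the confusion matrices and from the definitions of Section 5.3.1. The following minimal Perl sketch, which is not part of the Descent tool, reproduces them from the entries of Tables 5.5, 5.7 and 5.9:

use strict;
use warnings;

# Confusion matrices of the three CosSim classifiers (values in %, from Tables 5.5, 5.7 and 5.9).
my %cm = (
    '0.99800' => { tp => 42.85, fp => 3.5, fn => 7.15, tn => 46.5 },
    '0.99500' => { tp => 44.15, fp => 5.0, fn => 5.85, tn => 45.0 },
    '0.99000' => { tp => 44.6,  fp => 8.2, fn => 5.4,  tn => 41.8 },
);

foreach my $name ('0.99800', '0.99500', '0.99000') {
    my ($tp, $fp, $fn, $tn) = @{ $cm{$name} }{qw(tp fp fn tn)};
    my $tpr  = $tp / ($tp + $fn);                     # Recall / True Positive Rate
    my $fpr  = $fp / ($fp + $tn);                     # False Positive Rate (x-axis of the ROC space)
    my $ppv  = $tp / ($tp + $fp);                     # Precision / Positive Predictive Value
    my $acc  = ($tp + $tn) / ($tp + $tn + $fp + $fn); # Accuracy
    my $f1   = 2 * $ppv * $tpr / ($ppv + $tpr);       # F1-score
    my $fpfn = $fp / $fn;                             # FP/FN ratio
    printf "%s CosSim: TPR=%.3f FPR=%.3f PPV=%.3f ACC=%.3f F1=%.3f FP/FN=%.3f\n",
        $name, $tpr, $fpr, $ppv, $acc, $f1, $fpfn;
}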
The FP/FN ratio comparison:

Figure 5.8: 3 classifiers - FP/FN Comparison

Only for the 0.99000 CosSim classifier do we have a ratio greater than 1, which means that, in that case, the number of False Positives is greater than the number of False Negatives. This result means that both the 0.99800 and the 0.99500 classifiers are good choices.

Chapter 6

Applications and future works

This Master Thesis work is placed in the context of a Malware Analysis research project. Our methodology, based on a code-clone heuristic, represents a new approach to Malware Analysis and an inversion of the usual strategy, since it is placed before any anti-malware solution. Our approach uses a heuristic based on the extraction of the Opcodes Frequency Distribution of each sample, so that a similarity function can measure its distance from our trained dataset; we then extract the common subgraphs of both the CFG and the FCG to measure the isomorphism between malware. Finally, a classifier is able to categorize our candidate apk into the right class. This method is able to outline, for each candidate malware, a possible genome and possible descents based on the inherited portions of payload. Above all, it is resilient to common Android obfuscation techniques.

About the possible applications, it is possible to place this methodology before any existing anti-malware detection solution, for example to build signatures, to pre-filter malware, or to correlate malware information. It is also possible to put it after an anti-malware detection solution, to properly classify malware and study its phylogenesis. In conclusion, we can assert that we realized a good methodology, with a good predictive power in detecting Android malware and a high predictive power in classifying Android malware.

A big disadvantage of this methodology is that it is not scalable at the moment: the analysis of an apk bigger than 100 MB requires a large computation time. For now, however, we want to focus on the soundness of the method, favouring Accuracy over computational time.

To carry on this research project, we need, for example, to:

• Work on the scalability.
• Update and expand our trained dataset, currently based on the Drebin Project.
• Compute the Opcodes Frequency Distribution distance with similarity scores other than the Cosine Similarity (see the sketch after this list).
• Outline common payload portions between malware families.
• Study the effectiveness against common opcode transformations.
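As a starting point for the third item above, the snippet below (not part of the Descent tool) compares two opcode frequency vectors with the Cosine Similarity used in this work and with a Euclidean-distance-based similarity as one possible alternative score; the two 29-component vectors are purely illustrative, made-up values.

use strict;
use warnings;
use List::Util qw(sum);

# Two illustrative 29-component opcode frequency vectors (values are invented for the example).
my @a = (0.02, 0.15, 0.01, 0.03, 0.05, 0.02, 0.10, 0.01, 0.00, 0.00,
         0.04, 0.01, 0.00, 0.20, 0.01, 0.08, 0.06, 0.02, 0.01, 0.01,
         0.03, 0.04, 0.05, 0.02, 0.02, 0.00, 0.00, 0.01, 0.00);
my @b = (0.01, 0.14, 0.01, 0.04, 0.05, 0.02, 0.11, 0.01, 0.00, 0.00,
         0.05, 0.01, 0.00, 0.19, 0.01, 0.07, 0.07, 0.02, 0.01, 0.01,
         0.03, 0.04, 0.05, 0.02, 0.02, 0.00, 0.00, 0.01, 0.00);

# Cosine similarity: dot product divided by the product of the Euclidean norms.
sub cosine_similarity {
    my ($x, $y) = @_;
    my $dot    = sum map { $x->[$_] * $y->[$_] } 0 .. $#$x;
    my $norm_x = sqrt(sum map { $_ ** 2 } @$x);
    my $norm_y = sqrt(sum map { $_ ** 2 } @$y);
    return $dot / ($norm_x * $norm_y);
}

# A possible alternative score: similarity derived from the Euclidean distance.
sub euclidean_similarity {
    my ($x, $y) = @_;
    my $dist = sqrt(sum map { ($x->[$_] - $y->[$_]) ** 2 } 0 .. $#$x);
    return 1 / (1 + $dist);
}

printf "CosSim = %.5f  EuclSim = %.5f\n",
    cosine_similarity(\@a, \@b), euclidean_similarity(\@a, \@b);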
Appendix A

Partial code of Descent tool. Opcodes Frequency Distribution Computation

Listing A.1: Partial code of Descent tool. Opcodes Frequency Distribution Computation

# Relative-frequency counters over the 29 opcode categories: one hash for the candidate
# sample and one for the sample read from the training dataset.
%relFreqOpcodes = ("nop_op", 0, "move", 0, "move_exception", 0, "move_object", 0, "return_op", 0, "return_object", 0, "return_void", 0, "return_wide", 0, "monitor_enter", 0, "monitor_exit", 0, "goto_op", 0, "goto16", 0, "goto32", 0, "invoke_virtual", 0, "invoke_super", 0, "invoke_direct", 0, "invoke_static", 0, "invoke_interface", 0, "packed_switch", 0, "sparse_switch", 0, "arrayop", 0, "iftest", 0, "instanceop", 0, "static_op", 0, "iftestz", 0, "cmpop", 0, "unop", 0, "binop", 0, "throw", 0);
%relFreqOpcodes2 = ("nop_op", 0, "move", 0, "move_exception", 0, "move_object", 0, "return_op", 0, "return_object", 0, "return_void", 0, "return_wide", 0, "monitor_enter", 0, "monitor_exit", 0, "goto_op", 0, "goto16", 0, "goto32", 0, "invoke_virtual", 0, "invoke_super", 0, "invoke_direct", 0, "invoke_static", 0, "invoke_interface", 0, "packed_switch", 0, "sparse_switch", 0, "arrayop", 0, "iftest", 0, "instanceop", 0, "static_op", 0, "iftestz", 0, "cmpop", 0, "unop", 0, "binop", 0, "throw", 0);

Listing A.2: Partial code of Descent tool. Opcodes Frequency Distribution Computation

# Working data structures for the Function Call Graph (FCG) and the Control Flow Graph (CFG);
# every hash is emptied before processing a new sample.
my %L_FCG = ();
my %topTen = ();
my %opCFG;
my %opFCG;
my %nodeOP = ();
my @edges = $dom->getElementsByTagName("edge");
my @nodeId;
my %method = ();
for (keys %L_FCG) { delete $L_FCG{$_}; }
for (keys %topTen) { delete $topTen{$_}; }
for (keys %nodeOP) { delete $nodeOP{$_}; }
for (keys %method) { delete $method{$_}; }
my %CFGcounter = ();
for (keys %CFGcounter) { delete $CFGcounter{$_}; }
my %FCGcounter = ();
for (keys %FCGcounter) { delete $FCGcounter{$_}; }
my %totalFCG = ();
for (keys %totalFCG) { delete $totalFCG{$_}; }
my %totalCFG = ();
for (keys %totalCFG) { delete $totalCFG{$_}; }
%methodSimRatio = ();
for (keys %methodSimRatio) { delete $methodSimRatio{$_}; }
%methodSimRatio2 = ();
for (keys %methodSimRatio2) { delete $methodSimRatio2{$_}; }
my %methodSim = ();
for (keys %methodSim) { delete $methodSim{$_}; }
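To make the role of these counters concrete, the following toy snippet (not part of the Descent tool; the opcode stream and the reduced set of categories are invented for the example) counts a short sequence of Dalvik opcodes into a few of the categories above and normalizes the counts into relative frequencies, which is what Listings A.3-A.5 and A.11-A.12 do at the scale of a whole apk:

use strict;
use warnings;

# Made-up opcode stream of a tiny method, used only to illustrate the counting step.
my @opcodes = qw(invoke-virtual iget-object if-eqz const/4 invoke-direct return-void);

my %freq  = ("invoke_virtual", 0, "invoke_direct", 0, "instanceop", 0, "iftestz", 0, "return_void", 0, "other", 0);
my $total = scalar @opcodes;

foreach my $op (@opcodes) {
    if    ($op =~ m/^invoke-virtual$/)          { $freq{invoke_virtual}++ }
    elsif ($op =~ m/^invoke-direct$/)           { $freq{invoke_direct}++ }
    elsif ($op =~ m/^iget/ or $op =~ m/^iput/)  { $freq{instanceop}++ }
    elsif ($op =~ m/^if-..z$/)                  { $freq{iftestz}++ }
    elsif ($op =~ m/^return-void$/)             { $freq{return_void}++ }
    else                                        { $freq{other}++ }
}

# Relative frequency of each category, as used to build a feature vector.
printf "%-15s %.5f\n", $_, $freq{$_} / $total for sort keys %freq;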
Listing A.3: Partial code of Descent tool. Opcodes Frequency Distribution Computation

# Each graph node is visited; the opcode read from the node label is mapped to one of
# the 29 categories and the per-method opcode list is collected in %method.
foreach my $a (@nodes) {
    my @att;
    #print $a->getAttribute('id')."\n";
    push(@nodeId, $a->getAttribute('id'));
    @att = $a->getChildnodes();
    foreach (@att) {
        if (($_->nodeName) eq 'att') {
            if ($_->getAttribute('name') eq 'name') {
                $tempName = $_->getAttribute('value');
            } elsif (($_->getAttribute('name') eq 'node.label') and ($_->getAttribute('value') !~ /^L/)) {
                $L_FCG{$a->getAttribute('id')} = $tempName;
                $totalOpcodes++;
                my $opcode = $_->getAttribute('value');
                $opcode =~ s/^.*\\n//;
                $method{$tempName} .= ",".$opcode;
                if ($opcode =~ m/nop/gi) { $relFreqOpcodes{nop_op}++ }
                elsif (($opcode =~ m/^move-exception$/)) { $relFreqOpcodes{'move_exception'}++ }
                elsif (($opcode =~ m/^move-object/)) { $relFreqOpcodes{'move_object'}++ }
                elsif ((($opcode !~ m/^move-exception$/) or ($opcode !~ m/^move-object$/)) and ($opcode =~ m/^move/)) { $relFreqOpcodes{'move'}++ }
                elsif ($opcode =~ m/^return$/gi) { $relFreqOpcodes{'return_op'}++ }
                elsif ($opcode =~ m/^return-object$/gi) { $relFreqOpcodes{'return_object'}++ }
                elsif ($opcode =~ m/^return-void$/gi) { $relFreqOpcodes{'return_void'}++ }
                elsif ($opcode =~ m/^return-wide$/gi) { $relFreqOpcodes{'return_wide'}++ }
                elsif ($opcode =~ m/^monitor-enter$/gi) { $relFreqOpcodes{'monitor_enter'}++ }
                elsif ($opcode =~ m/^monitor-exit$/gi) { $relFreqOpcodes{'monitor_exit'}++ }
                elsif ($opcode =~ m/^goto$/gi) { $relFreqOpcodes{'goto_op'}++ }
                elsif ($opcode =~ m/^goto\/16$/gi) { $relFreqOpcodes{'goto16'}++ }
                elsif ($opcode =~ m/^goto\/32$/gi) { $relFreqOpcodes{'goto32'}++ }

Listing A.4: Partial code of Descent tool. Opcodes Frequency Distribution Computation

                elsif ($opcode =~ m/^invoke-virtual$/gi) { $relFreqOpcodes{'invoke_virtual'}++ }
                elsif ($opcode =~ m/^invoke-super$/gi) { $relFreqOpcodes{'invoke_super'}++ }
                elsif ($opcode =~ m/^invoke-direct$/gi) { $relFreqOpcodes{'invoke_direct'}++ }
                elsif ($opcode =~ m/^invoke-static$/gi) { $relFreqOpcodes{'invoke_static'}++ }
                elsif ($opcode =~ m/^invoke-interface$/gi) { $relFreqOpcodes{'invoke_interface'}++ }
                elsif ($opcode =~ m/^packed-switch$/gi) { $relFreqOpcodes{'packed_switch'}++ }
                elsif ($opcode =~ m/^sparse-switch$/gi) { $relFreqOpcodes{'sparse_switch'}++ }
                elsif (($opcode =~ m/^aget$/) or ($opcode =~ m/^aget-wide$/) or ($opcode =~ m/^aget-object$/) or ($opcode =~ m/^aget-boolean$/) or ($opcode =~ m/^aget-byte$/) or ($opcode =~ m/^aget-char$/) or ($opcode =~ m/^aget-short$/) or ($opcode =~ m/^aput$/) or ($opcode =~ m/^aput-wide$/) or ($opcode =~ m/^aput-object$/) or ($opcode =~ m/^aput-boolean/) or ($opcode =~ m/^aput-byte$/) or ($opcode =~ m/^aput-char$/) or ($opcode =~ m/^aput-short$/)) { $relFreqOpcodes{'arrayop'}++ }
                elsif (($opcode =~ m/^if-eq$/) or ($opcode =~ m/^if-ne$/) or ($opcode =~ m/^if-lt$/) or ($opcode =~ m/^if-ge$/) or ($opcode =~ m/^if-gt$/) or ($opcode =~ m/^if-le$/)) { $relFreqOpcodes{'iftest'}++ }
                elsif (($opcode =~ m/^iget$/) or ($opcode =~ m/^iget-wide$/) or ($opcode =~ m/^iget-object$/) or ($opcode =~ m/^iget-boolean$/) or ($opcode =~ m/^iget-byte$/) or ($opcode =~ m/^iget-char$/) or ($opcode =~ m/^iget-short$/) or ($opcode =~ m/^iput$/) or ($opcode =~ m/^iput-wide$/) or ($opcode =~ m/^iput-object$/) or ($opcode =~ m/^iput-boolean$/) or ($opcode =~ m/^iput-byte$/) or ($opcode =~ m/^iput-short$/) or ($opcode =~ m/^iput-char$/)) { $relFreqOpcodes{'instanceop'}++ }
                elsif (($opcode =~ m/^cmpl-float/g) or ($opcode =~ m/^cmpg-float/g) or ($opcode =~ m/^cmpl-double/g) or ($opcode =~ m/^cmpg-double/g) or ($opcode =~ m/^cmp-long/g)) { $relFreqOpcodes{'cmpop'}++ }
Listing A.5: Partial code of Descent tool. Opcodes Frequency Distribution Computation

                elsif (($opcode =~ m/^sget$/) or ($opcode =~ m/^sget-wide$/) or ($opcode =~ m/^sget-object$/) or ($opcode =~ m/^sget-boolean$/) or ($opcode =~ m/^sget-byte$/) or ($opcode =~ m/^sget-char$/) or ($opcode =~ m/^sget-short$/) or ($opcode =~ m/^sput$/) or ($opcode =~ m/^sput-wide$/) or ($opcode =~ m/^sput-object$/) or ($opcode =~ m/^sput-boolean/) or ($opcode =~ m/^sput-byte$/) or ($opcode =~ m/^sput-char$/) or ($opcode =~ m/^sput-short$/)) { $relFreqOpcodes{'static_op'}++ }
                elsif (($opcode =~ m/^if-eqz$/) or ($opcode =~ m/^if-nez$/) or ($opcode =~ m/^if-ltz$/) or ($opcode =~ m/^if-gez$/) or ($opcode =~ m/^if-gtz$/) or ($opcode =~ m/^if-lez$/)) { $relFreqOpcodes{'iftestz'}++ }
                elsif (($opcode =~ m/^neg-int$/) or ($opcode =~ m/^not-int$/) or ($opcode =~ m/^neg-long$/) or ($opcode =~ m/^not-long$/) or ($opcode =~ m/^neg-float$/) or ($opcode =~ m/^neg-double$/) or ($opcode =~ m/^int-to-long$/) or ($opcode =~ m/^int-to-float$/) or ($opcode =~ m/^int-to-double$/) or ($opcode =~ m/^long-to-int/) or ($opcode =~ m/^long-to-float$/) or ($opcode =~ m/^long-to-double$/) or ($opcode =~ m/^float-to-int$/) or ($opcode =~ m/^float-to-long$/) or ($opcode =~ m/^float-to-double$/) or ($opcode =~ m/^double-to-int$/) or ($opcode =~ m/^double-to-long/) or ($opcode =~ m/^double-to-float$/) or ($opcode =~ m/^int-to-byte$/) or ($opcode =~ m/^int-to-char$/) or ($opcode =~ m/^int-to-short$/)) { $relFreqOpcodes{'unop'}++ }
                elsif (($opcode =~ m/^add-int/) or ($opcode =~ m/^sub-int/) or ($opcode =~ m/^mul-int/) or ($opcode =~ m/^div-int/) or ($opcode =~ m/^rem-int/) or ($opcode =~ m/^and-int/) or ($opcode =~ m/^or-int/) or ($opcode =~ m/^xor-int/) or ($opcode =~ m/^shl-int/) or ($opcode =~ m/^shr-int/) or ($opcode =~ m/^ushr-int/) or ($opcode =~ m/^add-long$/) or ($opcode =~ m/^sub-long$/) or ($opcode =~ m/^mul-long/) or ($opcode =~ m/^div-long$/) or ($opcode =~ m/^rem-long$/) or ($opcode =~ m/^and-long$/) or ($opcode =~ m/^or-long/) or ($opcode =~ m/^xor-long/) or ($opcode =~ m/^shl-long/) or ($opcode =~ m/^shr-long/) or ($opcode =~ m/^ushr-long/) or ($opcode =~ m/^add-float/) or ($opcode =~ m/^sub-float/) or ($opcode =~ m/^mul-float/) or ($opcode =~ m/^div-float/) or ($opcode =~ m/^rem-float/) or ($opcode =~ m/^add-double/) or ($opcode =~ m/^sub-double/) or ($opcode =~ m/^mul-double/) or ($opcode =~ m/^div-double/) or ($opcode =~ m/^rem-double/)) { $relFreqOpcodes{'binop'}++ }

Listing A.6: Partial code of Descent tool. Opcodes Frequency Distribution Computation

##################### Extracting Adjacency Lists for CFG and FCG #####################
print "\n\t > Extracting Adjacency Lists for CFG and FCG ...\n";
foreach my $e (@edges) {
    my ($label) = $e->getAttribute('label');
    if ($label =~ m/\(cfg\)/g) {
        # Control Flow Graph edge: append the target to the source's adjacency list.
        my $S = $e->getAttribute('source');
        $edgeCFG{$S} .= ",".$e->getAttribute('target');
        if (!exists $opCFG{$S}) { $opCFG{$S} = $nodeOP{$e->getAttribute('source')} }
        $opCFG{$S} .= ",".$nodeOP{$e->getAttribute('target')}
    } elsif ($label =~ m/\(fcg\)/g) {
        # Function Call Graph edge: adjacency list keyed by method name.
        my $S = $e->getAttribute('source');
        $edgeFCG{$L_FCG{$S}} .= ",".$L_FCG{$e->getAttribute('target')};
        #$opFCG{$S} .= ",".$nodeOP{$e->getAttribute('target')}
    }
}
my $json_CFG = encode_json(\%opCFG);
my $json_FCG = encode_json(\%edgeFCG);
$json_CFG =~ s/(\d+)//g;
my @opcfg = split(':', $json_CFG);
my @tempopcfg = sort @opcfg;
@opcfg = "";
@opcfg = uniq(@tempopcfg);
my %hash = ();
for (keys %hash) { delete $hash{$_}; }
my %relFreqOpcodes01 = ();
for (keys %relFreqOpcodes01) { delete $relFreqOpcodes01{$_}; }

Listing A.7: Partial code of Descent tool. Opcodes Frequency Distribution Computation

##############################################################
# Per-method opcode frequency vectors: the opcodes collected for each method are
# classified into the same 29 categories.
my $methodsTotal = 0;
foreach my $key (keys %method) {
    $methodsTotal++;
    my %relFreqOpcodes00 = ();
    for (keys %relFreqOpcodes00) { delete $relFreqOpcodes00{$_}; }
    %relFreqOpcodes00 = ("nop_op", 0, "move_object", 0, "move_exception", 0, "move", 0, "return_op", 0, "return_object", 0, "return_void", 0, "return_wide", 0, "monitor_enter", 0, "monitor_exit", 0, "goto_op", 0, "goto16", 0, "goto32", 0, "invoke_virtual", 0, "invoke_super", 0, "invoke_direct", 0, "invoke_static", 0, "invoke_interface", 0, "packed_switch", 0, "sparse_switch", 0, "arrayop", 0, "iftest", 0, "instanceop", 0, "static_op", 0, "iftestz", 0, "cmpop", 0, "unop", 0, "binop", 0, "throw", 0);
    my @List = "";
    @List = split(',', $method{$key});
    shift @List;
    foreach (@List) {
        #print "XXXXX ".$_."\n";
        if ($_ =~ m/nop/gi) { $relFreqOpcodes00{'nop_op'}++ }
        elsif (($_ =~ m/^move-exception$/)) { $relFreqOpcodes00{'move_exception'}++ }
        elsif (($_ =~ m/^move-object/)) { $relFreqOpcodes00{'move_object'}++ }
        elsif ((($_ !~ m/^move-exception$/) or ($_ !~ m/^move-object$/)) and ($_ =~ m/^move/)) { $relFreqOpcodes00{'move'}++ }
        elsif ($_ =~ m/^return$/gi) { $relFreqOpcodes00{'return_op'}++ }
        elsif ($_ =~ m/^return-object$/gi) { $relFreqOpcodes00{'return_object'}++ }
        elsif ($_ =~ m/^return-void$/gi) { $relFreqOpcodes00{'return_void'}++ }
        elsif ($_ =~ m/^return-wide$/gi) { $relFreqOpcodes00{'return_wide'}++ }
        elsif ($_ =~ m/^monitor-enter$/gi) { $relFreqOpcodes00{'monitor_enter'}++ }
        elsif ($_ =~ m/^monitor-exit$/gi) { $relFreqOpcodes00{'monitor_exit'}++ }
        elsif ($_ =~ m/^goto$/gi) { $relFreqOpcodes00{'goto_op'}++ }

Listing A.8: Partial code of Descent tool. Opcodes Frequency Distribution Computation

        elsif ($_ =~ m/^goto\/16$/gi) { $relFreqOpcodes00{'goto16'}++ }
        elsif ($_ =~ m/^goto\/32$/gi) { $relFreqOpcodes00{'goto32'}++ }
        elsif ($_ =~ m/^invoke-virtual$/gi) { $relFreqOpcodes00{'invoke_virtual'}++ }
        elsif ($_ =~ m/^invoke-super$/gi) { $relFreqOpcodes00{'invoke_super'}++ }
        elsif ($_ =~ m/^invoke-direct$/gi) { $relFreqOpcodes00{'invoke_direct'}++ }
        elsif ($_ =~ m/^invoke-static$/gi) { $relFreqOpcodes00{'invoke_static'}++ }
        elsif ($_ =~ m/^invoke-interface$/gi) { $relFreqOpcodes00{'invoke_interface'}++ }
        elsif ($_ =~ m/^packed-switch$/gi) { $relFreqOpcodes00{'packed_switch'}++ }
        elsif ($_ =~ m/^sparse-switch$/gi) { $relFreqOpcodes00{'sparse_switch'}++ }

Listing A.9: Partial code of Descent tool. Opcodes Frequency Distribution Computation
        elsif (($_ =~ m/^aget$/) or ($_ =~ m/^aget-wide$/) or ($_ =~ m/^aget-object$/) or ($_ =~ m/^aget-boolean$/) or ($_ =~ m/^aget-byte$/) or ($_ =~ m/^aget-char$/) or ($_ =~ m/^aget-short$/) or ($_ =~ m/^aput$/) or ($_ =~ m/^aput-wide$/) or ($_ =~ m/^aput-object$/) or ($_ =~ m/^aput-boolean/) or ($_ =~ m/^aput-byte$/) or ($_ =~ m/^aput-char$/) or ($_ =~ m/^aput-short$/)) { $relFreqOpcodes00{'arrayop'}++ }
        elsif (($_ =~ m/^if-eq$/) or ($_ =~ m/^if-ne$/) or ($_ =~ m/^if-lt$/) or ($_ =~ m/^if-ge$/) or ($_ =~ m/^if-gt$/) or ($_ =~ m/^if-le$/)) { $relFreqOpcodes00{'iftest'}++ }
        elsif (($_ =~ m/^iget$/) or ($_ =~ m/^iget-wide$/) or ($_ =~ m/^iget-object$/) or ($_ =~ m/^iget-boolean$/) or ($_ =~ m/^iget-byte$/) or ($_ =~ m/^iget-char$/) or ($_ =~ m/^iget-short$/) or ($_ =~ m/^iput$/) or ($_ =~ m/^iput-wide$/) or ($_ =~ m/^iput-object$/) or ($_ =~ m/^iput-boolean$/) or ($_ =~ m/^iput-byte$/) or ($_ =~ m/^iput-short$/) or ($_ =~ m/^iput-char$/)) { $relFreqOpcodes00{'instanceop'}++ }
        elsif (($_ =~ m/^cmpl-float/g) or ($_ =~ m/^cmpg-float/g) or ($_ =~ m/^cmpl-double/g) or ($_ =~ m/^cmpg-double/g) or ($_ =~ m/^cmp-long/g)) { $relFreqOpcodes00{'cmpop'}++ }
        elsif (($_ =~ m/^sget$/) or ($_ =~ m/^sget-wide$/) or ($_ =~ m/^sget-object$/) or ($_ =~ m/^sget-boolean$/) or ($_ =~ m/^sget-byte$/) or ($_ =~ m/^sget-char$/) or ($_ =~ m/^sget-short$/) or ($_ =~ m/^sput$/) or ($_ =~ m/^sput-wide$/) or ($_ =~ m/^sput-object$/) or ($_ =~ m/^sput-boolean/) or ($_ =~ m/^sput-byte$/) or ($_ =~ m/^sput-char$/) or ($_ =~ m/^sput-short$/)) { $relFreqOpcodes00{'static_op'}++ }
        elsif (($_ =~ m/^if-eqz$/) or ($_ =~ m/^if-nez$/) or ($_ =~ m/^if-ltz$/) or ($_ =~ m/^if-gez$/) or ($_ =~ m/^if-gtz$/) or ($_ =~ m/^if-lez$/)) { $relFreqOpcodes00{'iftestz'}++ }
        elsif (($_ =~ m/^neg-int$/) or ($_ =~ m/^not-int$/) or ($_ =~ m/^neg-long$/) or ($_ =~ m/^not-long$/) or ($_ =~ m/^neg-float$/) or ($_ =~ m/^neg-double$/) or ($_ =~ m/^int-to-long$/) or ($_ =~ m/^int-to-float$/) or ($_ =~ m/^int-to-double$/) or ($_ =~ m/^long-to-int/) or ($_ =~ m/^long-to-float$/) or ($_ =~ m/^long-to-double$/) or ($_ =~ m/^float-to-int$/) or ($_ =~ m/^float-to-long$/) or ($_ =~ m/^float-to-double$/) or ($_ =~ m/^double-to-int$/) or ($_ =~ m/^double-to-long/) or ($_ =~ m/^double-to-float$/) or ($_ =~ m/^int-to-byte$/) or ($_ =~ m/^int-to-char$/) or ($_ =~ m/^int-to-short$/)) { $relFreqOpcodes00{'unop'}++ }

Listing A.10: Partial code of Descent tool. Opcodes Frequency Distribution Computation
        elsif (($_ =~ m/^add-int/) or ($_ =~ m/^sub-int/) or ($_ =~ m/^mul-int/) or ($_ =~ m/^div-int/) or ($_ =~ m/^rem-int/) or ($_ =~ m/^and-int/) or ($_ =~ m/^or-int/) or ($_ =~ m/^xor-int/) or ($_ =~ m/^shl-int/) or ($_ =~ m/^shr-int/) or ($_ =~ m/^ushr-int/) or ($_ =~ m/^add-long$/) or ($_ =~ m/^sub-long$/) or ($_ =~ m/^mul-long/) or ($_ =~ m/^div-long$/) or ($_ =~ m/^rem-long$/) or ($_ =~ m/^and-long$/) or ($_ =~ m/^or-long/) or ($_ =~ m/^xor-long/) or ($_ =~ m/^shl-long/) or ($_ =~ m/^shr-long/) or ($_ =~ m/^ushr-long/) or ($_ =~ m/^add-float/) or ($_ =~ m/^sub-float/) or ($_ =~ m/^mul-float/) or ($_ =~ m/^div-float/) or ($_ =~ m/^rem-float/) or ($_ =~ m/^add-double/) or ($_ =~ m/^sub-double/) or ($_ =~ m/^mul-double/) or ($_ =~ m/^div-double/) or ($_ =~ m/^rem-double/)) { $relFreqOpcodes00{'binop'}++ }

##########################################
# The per-method counters are turned into relative frequencies (normalized by the total
# number of opcodes) and concatenated into a comma-separated feature string per method.
my $sp1 = sprintf("%.5f", ($relFreqOpcodes00{'nop_op'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp1; $relFreqOpcodes01{$key} .= ", ";
my $sp4 = sprintf("%.5f", ($relFreqOpcodes00{'move'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp4; $relFreqOpcodes01{$key} .= ", ";
my $sp3 = sprintf("%.5f", ($relFreqOpcodes00{'move_exception'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp3; $relFreqOpcodes01{$key} .= ", ";
my $sp2 = sprintf("%.5f", ($relFreqOpcodes00{'move_object'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp2; $relFreqOpcodes01{$key} .= ", ";
my $sp5 = sprintf("%.5f", ($relFreqOpcodes00{'return_op'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp5; $relFreqOpcodes01{$key} .= ", ";
my $sp6 = sprintf("%.5f", ($relFreqOpcodes00{'return_object'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp6; $relFreqOpcodes01{$key} .= ", ";

Listing A.11: Partial code of Descent tool. Opcodes Frequency Distribution Computation
my $sp7 = sprintf("%.5f", ($relFreqOpcodes00{'return_void'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp7; $relFreqOpcodes01{$key} .= ", ";
my $sp8 = sprintf("%.5f", ($relFreqOpcodes00{'return_wide'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp8; $relFreqOpcodes01{$key} .= ", ";
my $sp9 = sprintf("%.5f", ($relFreqOpcodes00{'monitor_enter'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp9; $relFreqOpcodes01{$key} .= ", ";
my $sp10 = sprintf("%.5f", ($relFreqOpcodes00{'monitor_exit'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp10; $relFreqOpcodes01{$key} .= ", ";
my $sp11 = sprintf("%.5f", ($relFreqOpcodes00{'goto_op'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp11; $relFreqOpcodes01{$key} .= ", ";
my $sp12 = sprintf("%.5f", ($relFreqOpcodes00{'goto16'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp12; $relFreqOpcodes01{$key} .= ", ";
my $sp13 = sprintf("%.5f", ($relFreqOpcodes00{'goto32'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp13; $relFreqOpcodes01{$key} .= ", ";
my $sp14 = sprintf("%.5f", ($relFreqOpcodes00{'invoke_virtual'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp14; $relFreqOpcodes01{$key} .= ", ";
my $sp15 = sprintf("%.5f", ($relFreqOpcodes00{'invoke_super'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp15; $relFreqOpcodes01{$key} .= ", ";
my $sp16 = sprintf("%.5f", ($relFreqOpcodes00{'invoke_direct'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp16; $relFreqOpcodes01{$key} .= ", ";
my $sp17 = sprintf("%.5f", ($relFreqOpcodes00{'invoke_static'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp17; $relFreqOpcodes01{$key} .= ", ";
my $sp18 = sprintf("%.5f", ($relFreqOpcodes00{'invoke_interface'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp18; $relFreqOpcodes01{$key} .= ", ";
my $sp19 = sprintf("%.5f", ($relFreqOpcodes00{'packed_switch'} / $totalOpcodes));

Listing A.12: Partial code of Descent tool. Opcodes Frequency Distribution Computation

$relFreqOpcodes01{$key} .= $sp19; $relFreqOpcodes01{$key} .= ", ";
my $sp20 = sprintf("%.5f", ($relFreqOpcodes00{'sparse_switch'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp20; $relFreqOpcodes01{$key} .= ", ";
my $sp21 = sprintf("%.5f", ($relFreqOpcodes00{'arrayop'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp21; $relFreqOpcodes01{$key} .= ", ";
my $sp22 = sprintf("%.5f", ($relFreqOpcodes00{'iftest'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp22; $relFreqOpcodes01{$key} .= ", ";
my $sp23 = sprintf("%.5f", ($relFreqOpcodes00{'instanceop'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp23; $relFreqOpcodes01{$key} .= ", ";
my $sp24 = sprintf("%.5f", ($relFreqOpcodes00{'static_op'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp24; $relFreqOpcodes01{$key} .= ", ";
my $sp25 = sprintf("%.5f", ($relFreqOpcodes00{'iftestz'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp25; $relFreqOpcodes01{$key} .= ", ";
my $sp26 = sprintf("%.5f", ($relFreqOpcodes00{'cmpop'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp26; $relFreqOpcodes01{$key} .= ", ";
my $sp27 = sprintf("%.5f", ($relFreqOpcodes00{'unop'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp27; $relFreqOpcodes01{$key} .= ", ";
my $sp28 = sprintf("%.5f", ($relFreqOpcodes00{'binop'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp28; $relFreqOpcodes01{$key} .= ", ";
my $sp29 = sprintf("%.5f", ($relFreqOpcodes00{'throw'} / $totalOpcodes));
$relFreqOpcodes01{$key} .= $sp29; $relFreqOpcodes01{$key} .= ", ";
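Listings A.13 and A.14 below assume an already-open database handle $dbh with AutoCommit disabled; that set-up is not shown in these excerpts. A hypothetical version of it (the driver, database name and connection attributes are assumptions, not the thesis configuration) could look like this:

use strict;
use warnings;
use DBI;

# Hypothetical connection assumed by the partial code below; DSN and credentials are placeholders.
my $dbh = DBI->connect("dbi:SQLite:dbname=descent.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 })
    or die $DBI::errstr;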
Listing A.13: Partial code of Descent tool. Opcodes Frequency Distribution Computation

# The opcode frequency vectors of the training dataset are read from the
# opcodes_freq_distribution table and rebuilt as comma-separated strings keyed by sample id.
my $sth = $dbh->prepare("SELECT * FROM opcodes_freq_distribution");
$sth->execute;
while (@data = $sth->fetchrow_array()) {
    my $id = $data[0];
    my $nop_op = $data[2];
    my $move = $data[3];
    my $moveexception = $data[4];
    my $moveobject = $data[5];
    my $return_op = $data[6];
    my $returnobject = $data[7];
    my $returnvoid = $data[8];
    my $returnwide = $data[9];
    my $monitorenter = $data[10];
    my $monitorexit = $data[11];
    my $goto = $data[12];
    my $goto16 = $data[13];
    my $goto32 = $data[14];
    my $invokevirtual = $data[15];
    my $invokesuper = $data[16];
    my $invokedirect = $data[17];
    my $invokestatic = $data[18];
    my $invokeinterfaces = $data[19];
    my $packedswitch = $data[20];
    my $sparseswitch = $data[21];
    my $arrayop = $data[22];
    my $iftest = $data[23];
    my $instanceop = $data[24];
    my $staticop = $data[25];
    my $iftestz = $data[26];
    my $cmpop = $data[27];
    my $unop = $data[28];
    my $binop = $data[29];
    my $throw = $data[30];
    $hash{$id} = $nop_op.",".$move.",".$moveexception.",".$moveobject.",".$return_op.",".$returnobject.",".$returnvoid.",".$returnwide.",".$monitorenter.",".$monitorexit.",".$goto.",".$goto16.",".$goto32.",".$invokevirtual.",".$invokesuper.",".$invokedirect.",".$invokestatic.",".$invokeinterfaces.",".$packedswitch.",".$sparseswitch.",".$arrayop.",".$iftest.",".$instanceop.",".$staticop.",".$iftestz.",".$cmpop.",".$unop.",".$binop.",".$throw;

Listing A.14: Partial code of Descent tool. Opcodes Frequency Distribution Computation

$sth->finish();
$dbh->commit or die $DBI::errstr;
# Each stored vector is unpacked back into %relFreqOpcodes2, one dataset sample at a time.
foreach my $key (keys %hash) {
    my @opc = split(',', $hash{$key});
    $relFreqOpcodes2{'nop_op'} = $opc[0];
    $relFreqOpcodes2{'move'} = $opc[1];
    $relFreqOpcodes2{'move_exception'} = $opc[2];
    $relFreqOpcodes2{'move_object'} = $opc[3];
    $relFreqOpcodes2{'return_op'} = $opc[4];
    $relFreqOpcodes2{'return_object'} = $opc[5];
    $relFreqOpcodes2{'return_void'} = $opc[6];
    $relFreqOpcodes2{'return_wide'} = $opc[7];
    $relFreqOpcodes2{'monitor_enter'} = $opc[8];
    $relFreqOpcodes2{'monitor_exit'} = $opc[9];
    $relFreqOpcodes2{'goto_op'} = $opc[10];
    $relFreqOpcodes2{'goto16'} = $opc[11];
    $relFreqOpcodes2{'goto32'} = $opc[12];
    $relFreqOpcodes2{'invoke_virtual'} = $opc[13];
    $relFreqOpcodes2{'invoke_super'} = $opc[14];
    $relFreqOpcodes2{'invoke_direct'} = $opc[15];
    $relFreqOpcodes2{'invoke_static'} = $opc[16];
    $relFreqOpcodes2{'invoke_interface'} = $opc[17];
    $relFreqOpcodes2{'packed_switch'} = $opc[18];
    $relFreqOpcodes2{'sparse_switch'} = $opc[19];
    $relFreqOpcodes2{'arrayop'} = $opc[20];
    $relFreqOpcodes2{'iftest'} = $opc[21];
    $relFreqOpcodes2{'instanceop'} = $opc[22];
    $relFreqOpcodes2{'static_op'} = $opc[23];
    $relFreqOpcodes2{'iftestz'} = $opc[24];
    $relFreqOpcodes2{'cmpop'} = $opc[25];
    $relFreqOpcodes2{'unop'} = $opc[26];
    $relFreqOpcodes2{'binop'} = $opc[27];
    $relFreqOpcodes2{'throw'} = $opc[28];
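The repetitive blocks of Listings A.11/A.12 and A.14 could also be driven by a single ordered list of the 29 opcode categories. The following hypothetical rewrite is not part of the Descent tool; it assumes the %relFreqOpcodes00, %relFreqOpcodes01, %relFreqOpcodes2, %hash, $key and $totalOpcodes variables defined in the listings above, and preserves their output order and format:

# One ordered list of the 29 opcode categories, matching the column order used above.
my @categories = qw(nop_op move move_exception move_object return_op return_object
                    return_void return_wide monitor_enter monitor_exit goto_op goto16
                    goto32 invoke_virtual invoke_super invoke_direct invoke_static
                    invoke_interface packed_switch sparse_switch arrayop iftest
                    instanceop static_op iftestz cmpop unop binop throw);

# Equivalent of Listings A.11/A.12: build the per-method feature string.
foreach my $cat (@categories) {
    $relFreqOpcodes01{$key} .= sprintf("%.5f", $relFreqOpcodes00{$cat} / $totalOpcodes) . ", ";
}

# Equivalent of Listing A.14: unpack a stored dataset vector back into %relFreqOpcodes2.
my @opc = split(',', $hash{$key});
$relFreqOpcodes2{ $categories[$_] } = $opc[$_] for 0 .. $#categories;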
Listing A.15: Partial code of Descent tool. Opcodes Frequency Distribution Computation

# Relative frequencies of the candidate apk, computed from the global counters.
my $s1 = sprintf("%.5f", ($relFreqOpcodes{'nop_op'} / $totalOpcodes));
my $s2 = sprintf("%.5f", ($relFreqOpcodes{'move'} / $totalOpcodes));
my $s3 = sprintf("%.5f", ($relFreqOpcodes{'move_exception'} / $totalOpcodes));
my $s4 = sprintf("%.5f", ($relFreqOpcodes{'move_object'} / $totalOpcodes));
my $s5 = sprintf("%.5f", ($relFreqOpcodes{'return_op'} / $totalOpcodes));
my $s6 = sprintf("%.5f", ($relFreqOpcodes{'return_object'} / $totalOpcodes));
my $s7 = sprintf("%.5f", ($relFreqOpcodes{'return_void'} / $totalOpcodes));
my $s8 = sprintf("%.5f", ($relFreqOpcodes{'return_wide'} / $totalOpcodes));
my $s9 = sprintf("%.5f", ($relFreqOpcodes{'monitor_enter'} / $totalOpcodes));
my $s10 = sprintf("%.5f", ($relFreqOpcodes{'monitor_exit'} / $totalOpcodes));
my $s11 = sprintf("%.5f", ($relFreqOpcodes{'goto_op'} / $totalOpcodes));
my $s12 = sprintf("%.5f", ($relFreqOpcodes{'goto16'} / $totalOpcodes));
my $s13 = sprintf("%.5f", ($relFreqOpcodes{'goto32'} / $totalOpcodes));
my $s14 = sprintf("%.5f", ($relFreqOpcodes{'invoke_virtual'} / $totalOpcodes));
my $s15 = sprintf("%.5f", ($relFreqOpcodes{'invoke_super'} / $totalOpcodes));
my $s16 = sprintf("%.5f", ($relFreqOpcodes{'invoke_direct'} / $totalOpcodes));
my $s17 = sprintf("%.5f", ($relFreqOpcodes{'invoke_static'} / $totalOpcodes));
my $s18 = sprintf("%.5f", ($relFreqOpcodes{'invoke_interface'} / $totalOpcodes));
my $s19 = sprintf("%.5f", ($relFreqOpcodes{'packed_switch'} / $totalOpcodes));
my $s20 = sprintf("%.5f", ($relFreqOpcodes{'sparse_switch'} / $totalOpcodes));

Listing A.16: Partial code of Descent tool. Opcodes Frequency Distribution Computation

my $s21 = sprintf("%.5f", ($relFreqOpcodes{'arrayop'} / $totalOpcodes));
my $s29 = sprintf("%.5f", ($relFreqOpcodes{'throw'} / $totalOpcodes));
# Accumulation of the component-wise products between the candidate vector ($s1..$s29)
# and the dataset vector (%relFreqOpcodes2), i.e. the dot product used by the similarity.
$sum += $s1 * $relFreqOpcodes2{'nop_op'};
$sum += $s2 * $relFreqOpcodes2{'move'};
$sum += $s3 * $relFreqOpcodes2{'move_exception'};
$sum += $s4 * $relFreqOpcodes2{'move_object'};
$sum += $s5 * $relFreqOpcodes2{'return_op'};
$sum += $s6 * $relFreqOpcodes2{'return_object'};
$sum += $s7 * $relFreqOpcodes2{'return_void'};
$sum += $s8 * $relFreqOpcodes2{'return_wide'};
$sum += $s9 * $relFreqOpcodes2{'monitor_enter'};
$sum += $s10 * $relFreqOpcodes2{'monitor_exit'};
$sum += $s11 * $relFreqOpcodes2{'goto_op'};
$sum += $s12 * $relFreqOpcodes2{'goto16'};
$sum += $s13 * $relFreqOpcodes2{'goto32'};
$sum += $s14 * $relFreqOpcodes2{'invoke_virtual'};
$sum += $s15 * $relFreqOpcodes2{'invoke_super'};
$sum += $s16 * $relFreqOpcodes2{'invoke_direct'};
$sum += $s17 * $relFreqOpcodes2{'invoke_static'};
$sum += $s18 * $relFreqOpcodes2{'invoke_interface'};
$sum += $s19 * $relFreqOpcodes2{'packed_switch'};
$sum += $s20 * $relFreqOpcodes2{'sparse_switch'};
$sum += $s21 * $relFreqOpcodes2{'arrayop'};
$sum += $s22 * $relFreqOpcodes2{'iftest'};
$sum += $s23 * $relFreqOpcodes2{'instanceop'};
$sum += $s24 * $relFreqOpcodes2{'static_op'};
$sum += $s25 * $relFreqOpcodes2{'iftestz'};
$sum += $s26 * $relFreqOpcodes2{'cmpop'};
$sum += $s27 * $relFreqOpcodes2{'unop'};
$sum += $s28 * $relFreqOpcodes2{'binop'};
$sum += $s29 * $relFreqOpcodes2{'throw'};
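Listing A.16 accumulates only the dot product of the two frequency vectors; for the CosSim values used in Chapter 5, that product must still be divided by the norms of both vectors. The excerpts do not show this step, so the following completion is only a sketch of how it could look, assuming the $sum, $s1..$s29 and %relFreqOpcodes2 variables from the listings above:

use List::Util qw(sum0);

# Hypothetical completion of the cosine similarity (not shown in the thesis excerpts):
# divide the accumulated dot product by the Euclidean norms of the two vectors.
my @candidate = ($s1, $s2, $s3, $s4, $s5, $s6, $s7, $s8, $s9, $s10,
                 $s11, $s12, $s13, $s14, $s15, $s16, $s17, $s18, $s19, $s20,
                 $s21, $s22, $s23, $s24, $s25, $s26, $s27, $s28, $s29);
my $norm_candidate = sqrt(sum0 map { $_ ** 2 } @candidate);
my $norm_dataset   = sqrt(sum0 map { $_ ** 2 } values %relFreqOpcodes2);
my $cos_sim = ($norm_candidate && $norm_dataset)
            ? $sum / ($norm_candidate * $norm_dataset)
            : 0;
printf "CosSim towards this dataset sample: %.5f\n", $cos_sim;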
40. A.-D. Schmidt, R. Bye, H.-G. Schmidt, J. Clausen, O. Kiraz, K.A. Yuksel, S.A. Camtepe, and S. Albayrak. Static analysis of executables for collaborative malware detection on android. In Communications, 2009. ICC ’09. IEEE International Conference on, pages 1–5, June 2009. 41. David Barrera, H. Güneş Kayacik, Paul C. van Oorschot, and Anil Somayaji. A methodology for empirical analysis of permission-based security models and its application to android. In Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS ’10, pages 73–84, New York, NY, USA, 2010. ACM. 42. A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Static analysis for extracting permission checks of a large scale framework: The challenges and solutions for analyzing android. Software Engineering, IEEE Transactions on, 40(6):617–632, June 2014. 43. M. Christodorescu, S. Jha, S.A. Seshia, D. Song, and R.E. Bryant. Semantics-aware malware detection. In Security and Privacy, 2005 IEEE Symposium on, pages 32–46, May 2005. 44. Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, and Giovanni Vigna. Polymorphic worm detection using structural information of executables. In Proceedings of the 8th International Conference on Recent Advances in Intrusion Detection, RAID’05, pages 207–226, Berlin, Heidelberg, 2006. Springer-Verlag. 45. Joris Kinable and Orestis Kostakis. Malware classification based on call graph clustering. J. Comput. Virol., 7(4):233–245, November 2011. 46. Chanchal Kumar Roy and James R. Cordy. A survey on software clone detection research. SCHOOL OF COMPUTING TR 2007-541, QUEEN’S UNIVERSITY, 115, 2007. 47. D. B. West, editor. Introduction to Graph Theory (2nd Ed.). Prentice Hall, 2000. Bibliography 219 48. S. Cesare and Yang Xiang. Malware variant detection using similarity search over sets of control flow graphs. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2011 IEEE 10th International Conference on, pages 181– 189, Nov 2011. 49. Daniel Arp, Michael Spreitzenbarth, Malte Huebner, Hugo Gascon, and Konrad Rieck. Drebin: Efficient and explainable detection of android malware in your pocket. 21th Annual Network and Distributed System Security Symposium (NDSS), 2014. 50. Michael Spreitzenbarth, Florian Echtler, Thomas Schreck, Felix C. Freling, and Johannes Hoffmann. Mobilesandbox: Looking deeper into android applications. 28th International ACM Symposium on Applied Computing (SAC), 2013.