Catch them alive: A malware detection approach through memory forensics, manifold learning and computer vision
Catch them alive: A malware detection method through memory forensics, manifold learning and machine vision - 2021
The ever-increasing use of information systems and online services has triggered the birth of new types of malware that are more dangerous and harder to detect. In particular, according to recent reports, new fileless malware infects victims' devices without leaving a persistent trace (i.e. a file) on hard drives. Moreover, existing static malware detection methods in the literature often fail to detect sophisticated malware that uses various obfuscation and encryption techniques. Our contribution in this study is twofold. First, we present a novel approach to recognizing malware by capturing the memory dump of suspicious processes, which can be represented as an RGB image. In contrast to the conventional static and dynamic methods in the literature, we aimed to obtain and use memory data to reveal visual patterns that can be classified by employing computer vision and machine learning methods in a multi-class open-set recognition regime. Second, we applied a state-of-the-art manifold learning scheme named UMAP to improve the detection of unknown malware files through binary classification. Throughout the study, we employed our novel dataset covering 4294 samples in total, including 10 malware families along with benign executables. We obtained their memory dumps and converted them to RGB images by applying 3 different rendering schemes. To generate their signatures (i.e. feature vectors), we utilized GIST and HOG (Histogram of Oriented Gradients) descriptors as well as their combination. The obtained signatures were then classified via the machine learning algorithms J48, RBF kernel-based SMO, Random Forest, XGBoost and linear SVM. According to the results of the first phase, we achieved prediction accuracy of up to 96.39% by employing the SMO algorithm on the combined GIST+HOG feature vectors.
In addition, the UMAP-based manifold learning strategy improved the accuracy of the unknown malware recognition models by up to 12.93%, 21.83% and 20.78% on average for the Random Forest, linear SVM and XGBoost algorithms, respectively. Moreover, on a commercially available standard desktop computer, the suggested approach takes only 3.56 s for analysis on average. The results show that our vision-based scheme provides an effective protection mechanism against malicious applications.
Keywords: Memory forensics | Memory dump | Machine learning | Computer vision | Malware detection | Manifold learning
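The abstract above says memory dumps are converted to RGB images but does not describe its 3 rendering schemes. As a minimal sketch only, assuming the simplest possible mapping (each consecutive group of three bytes becomes one pixel, with a zero-padded tail), the idea can be illustrated as follows. The function name `bytes_to_rgb_rows` and the `width` parameter are hypothetical and not taken from the paper.

```python
import math


def bytes_to_rgb_rows(data: bytes, width: int = 4):
    """Pack raw bytes into rows of (R, G, B) tuples.

    Assumption: consecutive bytes map to the R, G and B channels of
    consecutive pixels. This is an illustrative scheme, not one of the
    rendering schemes used by the authors.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    bytes_per_row = 3 * width
    # Always produce at least one row, even for empty input.
    rows_needed = max(1, math.ceil(len(data) / bytes_per_row))
    # Zero-pad so the last row is complete.
    padded = data.ljust(rows_needed * bytes_per_row, b"\x00")
    rows = []
    for r in range(rows_needed):
        row_bytes = padded[r * bytes_per_row:(r + 1) * bytes_per_row]
        rows.append(
            [tuple(row_bytes[i:i + 3]) for i in range(0, bytes_per_row, 3)]
        )
    return rows


# Usage: a 7-byte "dump" rendered as a 2-pixel-wide image.
image = bytes_to_rgb_rows(bytes(range(7)), width=2)
```

The resulting pixel grid could then be passed to any image descriptor; the choice of width affects which visual patterns become apparent, so it would need tuning in practice.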
A novel method for malware detection on ML-based visualization technique
A new method for malware detection using an ML-based visualization technique - 2020
Malware detection is one of the challenging tasks in network security. With the proliferation of network technologies and mobile devices, the threat from malware, such as metamorphic malware, zero-day attacks and code obfuscation, has become increasingly significant. Many machine learning (ML)-based malware detection methods have been proposed to address this problem. However, considering attacks using adversarial examples (AEs) and the exponential increase in malware variants, malware detection is still an active field of research. To overcome the current limitations, we propose a novel method that uses data visualization and adversarial training on ML-based detectors to efficiently detect different types of malware and their variants. Experimental results on the MS BIG malware database and the Ember database demonstrate that the proposed method is able to prevent zero-day attacks and achieves up to 97.73% accuracy, with 96.25% on average across all the malware tested.
Keywords: Malware detection | Adversarial training | Adversarial examples | Image texture | Data visualization
Static malware detection and attribution in android byte-code through an end-to-end deep system
Static malware detection and attribution in Android bytecode through an end-to-end deep system - 2020
Android reflects a revolution in handhelds and mobile devices. It is a virtual machine based, open source mobile platform that powers millions of smartphones and devices and an even larger number of applications in its ecosystem. Surprisingly, in a short lifespan, Android has also seen a colossal expansion in application malware, with 99% of the total malware for smartphones being found in the Android ecosystem. Subsequently, quite a few techniques have been proposed in the literature for the analysis and detection of these malicious applications on the Android platform. The increasing and diversified nature of Android malware has immensely attenuated the usefulness of prevailing malware detectors, which leaves Android users susceptible to novel malware. In this paper, as a remedy to this problem, we propose an anti-malware system that uses customized learning models, which are sufficiently deep and are end-to-end deep learning architectures that detect and attribute Android malware via opcodes extracted from application bytecode. Our results show that bidirectional long short-term memory (BiLSTM) neural networks can be used to detect the static behavior of Android malware, beating the state-of-the-art models without using handcrafted features. For the experiments in our system, we also chose to work with distinct and independent deep learning models, leveraging sequence specialists such as recurrent neural networks, long short-term memory networks and their bidirectional variant, as well as more usual neural architectures such as fully connected networks, deep convnets, Diabolo networks (autoencoders) and generative graphical models such as deep belief networks, for static malware analysis on Android. To test our system, we also augmented a bytecode dataset from three open and independently maintained state-of-the-art datasets. Our bytecode dataset, which is an order of magnitude larger, essentially suffices for our experiments.
Our results suggest that our proposed system can lead to better design of malware detectors, as we report an accuracy of 0.999 and an F1-score of 0.996 on a large dataset of more than 1.8 million Android applications.
Keywords: End-to-end architecture | Malware analysis | Deep neural networks | Android and big data
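Sequence models such as the BiLSTMs mentioned above typically consume fixed-length integer sequences rather than raw opcode strings. The following is a minimal sketch of that preprocessing step only, not of the authors' pipeline. The helper names, the reserved ids (0 for padding, 1 for unknown opcodes) and the sample opcode names are all assumptions made for illustration.

```python
PAD_ID = 0      # assumed id reserved for padding
UNKNOWN_ID = 1  # assumed id reserved for opcodes unseen during indexing


def build_opcode_index(sequences):
    """Assign a stable integer id to every opcode seen in the training set."""
    index = {}
    for seq in sequences:
        for opcode in seq:
            if opcode not in index:
                index[opcode] = len(index) + 2  # ids 0 and 1 are reserved
    return index


def encode(seq, index, max_len):
    """Map opcodes to ids, truncating or right-padding to max_len."""
    ids = [index.get(opcode, UNKNOWN_ID) for opcode in seq[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Usage with made-up opcode names.
training = [["const", "invoke", "return"], ["const", "goto"]]
index = build_opcode_index(training)
encoded = encode(["const", "goto", "nop"], index, max_len=5)
```

Reserving separate ids for padding and unknown opcodes lets a downstream model mask the padding and still process applications containing opcodes that were not present when the index was built.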
MalDy: Portable, data-driven malware detection using natural language processing and machine learning techniques on behavioral analysis reports
MalDy: Portable, data-driven malware detection using natural language processing and machine learning techniques on behavioral analysis reports - 2019
In response to the volume and sophistication of malicious software, or malware, security investigators rely on dynamic analysis for malware detection to thwart obfuscation and packing issues. Dynamic analysis is the process of executing binary samples to produce reports that summarise their runtime behaviors. The investigator uses these reports to detect malware and attribute threat types by leveraging manually chosen features. However, the diversity of malware and execution environments makes manual approaches unscalable, because the investigator needs to manually engineer fingerprinting features for new environments. In this paper, we propose MalDy (mal die), a portable (plug and play) malware detection and family threat attribution framework using supervised machine learning techniques. The key idea behind MalDy's portability is the modeling of the behavioral reports as a sequence of words, along with advanced natural language processing (NLP) and machine learning (ML) techniques for automatic engineering of relevant security features to detect and attribute malware without investigator intervention. More precisely, we propose to use a bag-of-words (BoW) NLP model to formulate the behavioral reports. Afterward, we build ML ensembles on top of the BoW features. We extensively evaluate MalDy on various datasets from different platforms (Android and Win32) and execution environments. The evaluation shows the effectiveness and the portability of MalDy across the spectrum of the analyses and settings.
Keywords: Malware | Android | Win32 | Behavioral analysis | Machine learning | NLP
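The bag-of-words modeling described in the MalDy abstract can be illustrated with a minimal sketch. It assumes a behavioral report is plain text and that tokens are lowercase alphanumeric runs; the tokenization rule, function names and sample reports are hypothetical and are not MalDy's actual implementation.

```python
import re
from collections import Counter


def tokenize(report: str):
    """Split a report into lowercase word tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9_]+", report.lower())


def build_vocabulary(reports):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for report in reports for tok in tokenize(report)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(report, vocabulary):
    """Count token occurrences; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(report)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


# Usage with two made-up report snippets.
reports = ["open file write file", "connect socket send"]
vocabulary = build_vocabulary(reports)
features = [vectorize(r, vocabulary) for r in reports]
```

Because the features depend only on the words in a report, the same code applies to reports from different sandboxes or platforms, which is the portability argument the abstract makes. Any classifier or ensemble could then be trained on these count vectors.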
A hybrid deep learning image-based analysis for effective malware detection
A hybrid deep learning image-based analysis for effective malware detection - 2019
The explosive growth of the Internet and the recent increasing trends in automation using intelligent applications have provided a veritable playground for malicious software (malware) attackers. With a variety of devices connected seamlessly via the Internet and large amounts of data collected, the escalating malware attacks and security risks are a big concern. While a number of malware detection methods are available, new methods are required to match the scale and complexity of such a data-intensive environment. We propose a novel and unified hybrid deep learning and visualization approach for effective detection of malware. The aim of the paper is two-fold: (1) to present the use of image-based techniques for detecting suspicious behavior of systems, and (2) to propose and investigate the application of hybrid image-based approaches with deep learning architectures for effective malware classification. The performance is measured by employing various similarity measures of malware behavior patterns as well as cost-sensitive deep learning architectures. The scalability is benchmarked by testing our proposed hybrid approach with both public and privately collected large malware datasets, which show the high accuracy of our malware classifiers.
Keywords: Malware detection | Similarity mining | Image analysis | Evaluation metrics | Machine learning | Deep learning architectures
A cost analysis of machine learning using dynamic runtime opcodes for malware detection
A cost analysis of machine learning using dynamic runtime opcodes for malware detection - 2019
The ongoing battle between malware distributors and those seeking to prevent the onslaught of malicious code has, so far, favored the former. Anti-virus methods are faltering with the rapid evolution and distribution of new malware, with obfuscation and detection evasion techniques exacerbating the issue. Recent research has monitored low-level opcodes to detect malware. Such dynamic analysis reveals the code at runtime, allowing the true behaviour to be examined. While previous research uses machine learning techniques to accurately detect malware using dynamic runtime opcodes, the underpinning datasets have been poorly sampled and inadequate in size. Further, the datasets are always of fixed size and no attempt, to our knowledge, has been made to examine the cost of retraining malware classification models on datasets that grow continually. In the literature, researchers discuss the explosion of malware, yet opcode analyses have used fixed-size datasets, with no regard for how the model will cope with retraining on escalating datasets. The research presented here examines this problem and makes several novel contributions to the current body of knowledge. First, the performance of 23 machine learning algorithms is investigated with respect to the largest run trace dataset in the literature. Second, following an extensive hyperparameter selection process, the performance of each classifier is compared on both accuracy and computational cost (CPU time). Lastly, the cost of retraining and testing updatable and non-updatable classifiers, both parallelized and non-parallelized, is examined with simulated escalating datasets. This provides insight into how implemented malware classifiers would perform, given simulated dataset escalation. We find that parallelized RandomForest, using 4 cores, provides the optimal performance, with high accuracy and low training and testing times.
Keywords: Malicious code | Network security | Machine learning | Computer security | Malware
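The difference between retraining updatable and non-updatable classifiers on an escalating dataset can be made concrete with simple arithmetic. The sketch below counts training samples processed as a rough stand-in for cost; the paper itself measures CPU time, so this is an illustration under that simplifying assumption, and the function name and batch sizes are hypothetical.

```python
def samples_processed(batch_sizes, updatable):
    """Total training samples touched as new batches of data arrive.

    Assumption: a non-updatable classifier is retrained from scratch on
    all data seen so far after every batch, while an updatable classifier
    only processes the newly arrived batch.
    """
    seen_so_far = 0
    processed = 0
    for size in batch_sizes:
        seen_so_far += size
        processed += size if updatable else seen_so_far
    return processed


# Usage: three equally sized batches arriving over time.
full_retrain = samples_processed([100, 100, 100], updatable=False)
incremental = samples_processed([100, 100, 100], updatable=True)
```

With equal batches, full retraining grows quadratically in the number of batches while incremental updating grows linearly, which is why the cost of retraining matters as datasets escalate.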
A multi-level deep learning system for malware detection
A multi-level deep learning system for malware detection - 2019
To defend against an increasing number of sophisticated malware attacks, deep-learning based Malware Detection Systems (MDSs) have become a vital component of our economic and national security. Traditionally, researchers build a single deep learning model using the entire dataset. However, a single deep learning model may not handle the increasingly complex malware data distributions effectively, since different sample subspaces representing a group of similar malware may have unique data distributions. In order to further improve the performance of deep learning based MDSs, we propose a Multi-Level Deep Learning System (MLDLS) that organizes multiple deep learning models using a tree structure. Each model in the tree structure of MLDLS is not built on the whole dataset. Instead, each deep learning model focuses on learning a specific data distribution for a particular group of malware, and all deep learning models in the tree work together to make a final decision. Consequently, the learning effectiveness of each deep learning model built for one cluster can be improved. Experimental results show that our proposed system performs better than the traditional approach.
Keywords: Malware detection | Deep learning | Multi-level clustering algorithm | Convolutional neural network | Recurrent neural network | Model construction time
From big data to knowledge: A spatio temporal approach to malware detection
From big data to knowledge: A spatio-temporal approach to malware detection - 2018
The deployment of endpoint protection has gradually migrated from individual clients to remote cloud servers, which is termed cloud based security service. The new paradigm of security defense produces a large amount of data and log files, and motivates data driven techniques for detecting malicious software. This paper conducts an empirical study on the log of a real cloud based security service to characterize the occurrence of executable files in end hosts, which concerns 124,782 benign and 113,305 malicious executable files that occurred in 165,549,417 end hosts. The end hosts and the timestamps at which an executable file occurs provide insights into the distribution of software in the wild from spatial and temporal perspectives, respectively. Meanwhile, we investigate the strategies behind the characterizations, and observe the preferential attachment process and the periodicity of file occurrence in end hosts. The different occurrence patterns observed for benign and malicious files in end hosts inspire a new scalable approach to malware detection. We learn from the characterizations that associated files sharing more spatial and temporal information are more likely to have the same label, either benign or malicious. Thus, we devise a graph based semi-supervised learning algorithm for real-time malware detection by taking into account the spatio-temporal information of the distribution of executable files. Experimental results demonstrate that our approach increases the performance on malware detection by 14.7% over previous techniques on average.
Keywords: Malware detection | Data-driven security analysis | File co-occurrence | Graph based semi-supervised learning | Content-agnostic
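The intuition that files sharing more hosts and timestamps tend to share a label can be sketched with a generic label propagation routine on a weighted co-occurrence graph. This is a textbook-style illustration and not the algorithm devised in the paper. The edge weights, node names and the scoring convention (+1 malicious, -1 benign) are assumptions.

```python
def propagate_labels(edges, seeds, iterations=50):
    """Spread seed labels over an undirected weighted graph.

    edges: iterable of (node_a, node_b, positive_weight) tuples, where the
           weight is assumed to reflect shared spatio-temporal information.
    seeds: dict mapping known files to +1.0 (malicious) or -1.0 (benign).
    Returns a score per node; the sign suggests the predicted label.
    """
    neighbors = {}
    for a, b, weight in edges:
        neighbors.setdefault(a, []).append((b, weight))
        neighbors.setdefault(b, []).append((a, weight))

    scores = {node: float(seeds.get(node, 0.0)) for node in neighbors}
    for _ in range(iterations):
        updated = dict(scores)
        for node, linked in neighbors.items():
            if node in seeds:
                continue  # known labels stay clamped
            total_weight = sum(weight for _, weight in linked)
            updated[node] = (
                sum(scores[other] * weight for other, weight in linked)
                / total_weight
            )
        scores = updated
    return scores


# Usage: two unknown files linked to one malicious and one benign file.
edges = [("mal1", "u1", 3.0), ("ben1", "u1", 1.0), ("ben1", "u2", 1.0)]
scores = propagate_labels(edges, {"mal1": 1.0, "ben1": -1.0})
```

In this toy graph, the unknown file `u1` co-occurs more strongly with the malicious seed and ends up with a positive score, while `u2` is linked only to the benign seed and inherits its label.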
Big Data Based Security Analytics for Protecting Virtualized Infrastructures in Cloud Computing
Big data based security analytics for protecting virtualized infrastructures in cloud computing - 2018
Virtualized infrastructure in cloud computing has become an attractive target for cyberattackers to launch advanced attacks. This paper proposes a novel big data based security analytics approach to detecting advanced attacks in virtualized infrastructures. Network logs as well as user application logs collected periodically from the guest virtual machines (VMs) are stored in the Hadoop Distributed File System (HDFS). Then, extraction of attack features is performed through graph-based event correlation and MapReduce parser based identification of potential attack paths. Next, determination of attack presence is performed through two-step machine learning: logistic regression is applied to calculate the attack's conditional probabilities with respect to the attributes, and belief propagation is applied to calculate the belief in the existence of an attack based on them. Experiments are conducted to evaluate the proposed approach using well-known malware, as well as in comparison with existing security techniques for virtualized infrastructure. The results show that our proposed approach is effective in detecting attacks with minimal performance overhead.
Index Terms: Virtualized infrastructure, virtualization security, cloud security, malware detection, rootkit detection, security analytics, event correlation, logistic regression, belief propagation
CloudIntell: An intelligent malware detection system
CloudIntell: An intelligent malware detection system - 2017
Enterprises and individual users rely heavily on the abilities of antiviruses and other security mechanisms. However, the methodologies used by such software are not enough to detect and prevent most malicious activities, and they also consume a huge amount of the host machine's resources for their regular operations. In this paper, we propose a combination of machine learning techniques applied to a rich set of features extracted from a large dataset of benign and malicious files through a bespoke feature extraction tool. We extracted a rich set of features from each file and applied support vector machine, decision tree, and boosting on decision tree to get the highest possible detection rate. We also introduce a cloud-based scalable architecture hosted on Amazon Web Services to cater to the needs of the detection methodology. We tested our methodology against different scenarios and obtained strong results with the lowest energy consumption on the host machine.
Keywords: Malware analysis | Machine learning | Cloud | Decision tree | Boosting | SVM | Security | Malware detection | Portable executable | AWS