Thesis

Thesis LaTeX Document

Code Repository

Automatic Malware Signature Generation Code

How to view the document

To view the document quickly, open the .pdf version.
To view the source, clone this repository to your machine and open the .tex file in TeXstudio or another compatible LaTeX editor.

Description - Summary

In recent years the proliferation of malicious software (malware) has increased massively. According to AV Atlas, in 2019, 2020 and up to mid-2021 (the time of writing) the number of newly generated malware samples surged compared with previous years: approximately 5.1 new Microsoft Windows malware and PUA (Potentially Unwanted Application) samples are currently generated per second, which is roughly 18,500 per hour and 440,000 per day. The AV Atlas Dashboard now reports more than 830 million unique malware and PUA variants in total. Moreover, modern malware commonly uses obfuscation and other sophisticated techniques, such as polymorphism and metamorphism, to alter its structure and evade detection.

For these reasons, signature-based detection techniques (such as manually written YARA rules), which most commercial anti-virus solutions rely on, are becoming ineffective. Analysts can no longer manually analyse every malware variant found in the wild. Furthermore, even when a new malware family is identified and enough of its samples are analysed, the resulting signature may fail to detect new variants, or may be rendered useless by obfuscation or polymorphic mechanisms. There is therefore a need for automated malware analysis solutions that generate (implicit or explicit) signatures able to distinguish malicious from benign code while being less susceptible to code modifications and obfuscation attempts.

This thesis presents research aimed at meeting this need for automated malware detection. In particular, it presents a novel model for PE (Microsoft Windows Portable Executable) files, built upon previous work on ML-based (Machine Learning based) automatic malware detection and description. It also introduces a new evaluation procedure that tests whether the implicit representation (signature) of malware samples learned by the model is applicable to the malware family prediction and ranking tasks. These tasks are of particular interest to malware analysts because they make it possible to quickly assign malware samples to specific sets (families) that share behavioural and structural characteristics.

The life cycle of the proposed framework can be conceptually divided into four phases: model architecture definition, model training and validation, model evaluation, and model deployment.

In the first phase, the proposed FNN (Feedforward Neural Network) architecture, called Multi Task Joint Embedding (MTJE), is defined and implemented. It takes inspiration from previous work, namely the ALOHA and Joint Embedding models presented by Rudd et al. and Ducau et al. in their respective papers.

In the second phase, the MTJE model is trained and validated on a large-scale open source dataset of malware and benignware samples (Sorel20M by Harang et al.). The aim is to create high-quality implicit signatures capable of detecting (and describing via SMART tags) unseen malware samples, including obfuscated malware and new variants, with high True Positive Rate (TPR) and recall at low False Positive Rates (FPRs). The first two phases are repeated iteratively until a model with satisfactory training and validation loss trends is obtained.

In the third phase, the final model architecture is tested on the malware detection and description tasks, and the corresponding prediction scores are computed and plotted. In this phase the representation of PE files learned by the model is also tested on the malware family prediction and ranking tasks using a novel dataset created specifically for that purpose. This dataset, referred to as the 'Fresh Dataset', contains 10,000 samples belonging to 10 of the most widespread malware families in Italy at the time of writing. In both datasets each sample is represented by numerical features extracted statically from specific fields of its Windows Portable Executable (PE) file header. The MTJE model thus relies exclusively on static analysis features, which are generally simpler and less computationally intensive to extract, and therefore faster to obtain, than dynamic analysis features (behavioural characteristics of executables).

In the last phase, the final model architecture is deployed in the wild. It can be used as an automatic malware detection tool that also provides description tags useful for remediation. If the corresponding evaluation results allow it, it could also indicate which malware family, among the set of families of interest, each analysed sample most probably belongs to.
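The detection metric named above, TPR at a fixed low FPR, can be illustrated with a few lines of standard-library Python. This is only a sketch of the general metric and not the evaluation code of the thesis: the function name `tpr_at_fpr`, the example scores and the convention that a sample is flagged when its score is greater than or equal to the threshold are all assumptions made for illustration.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Return (threshold, tpr) for the loosest threshold whose FPR <= max_fpr.

    scores: model outputs, higher means "more likely malware" (assumption).
    labels: 1 = malware, 0 = benign.
    A sample is flagged as malware when score >= threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one malicious and one benign sample")

    best = (float("inf"), 0.0)  # flag nothing: FPR = 0, TPR = 0
    # Candidate thresholds are the observed scores, strictest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives > max_fpr:
            break  # FPR can only grow as the threshold is lowered
        best = (threshold, tp / positives)
    return best


# Hypothetical scores for 5 malicious and 5 benign samples.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
threshold, tpr = tpr_at_fpr(scores, labels, max_fpr=0.2)
print(threshold, tpr)  # 0.6 0.8
```

Reporting the TPR at a chosen FPR budget, rather than plain accuracy, reflects the operational constraint that a detector may only raise a small number of false alarms on benign files.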

This thesis focuses on the first three phases. In particular, it concentrates on defining, training and evaluating the best possible model architecture for the tasks at hand. However, because the code ran slowly on the instance used for the experiments, the model could only be trained on the first half of the Sorel20M samples in a reasonable time, with some approximations in the dispersion of the samples when randomly sampling them from the dataset. This resulted in slightly worse performance than might be expected from the current architecture trained on the entire dataset. Deploying the proposed MTJE model in a real-world scenario is nevertheless theoretically possible with the current final architecture, although the model should first be trained on the whole Sorel20M dataset, on a more powerful instance, to reveal its true potential.

The framework was later extended with a Malware Family Classifier head defined on top of the MTJE base topology, in order to improve the relatively poor results of the MTJE model in the malware family prediction/classification task. This new model was trained and tested specifically for that purpose on the training and test subsets of the relatively small Fresh Dataset, which contain the malware family label of each sample. Training the new architecture from scratch on such a small dataset would risk overfitting, so Transfer Learning was used instead: before training, the knowledge (the learned model parameters) from a previous MTJE training run on the large Sorel20M dataset was transferred onto the base topology of the new model (the part shared with the MTJE architecture). During training, some of the imported parameters were then fine-tuned, while the parameters of the newly added Family Classifier head were learned from scratch.
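The parameter transfer described above can be sketched independently of any deep learning framework by treating a model as a dictionary of named parameter lists. Everything in this sketch is hypothetical: the function `transfer_base_parameters`, the parameter names (`base.layer1.weight`, `family_head.weight`, ...), the choice of which layers to freeze and the random initialisation range are illustrative assumptions, not the names or settings used in the thesis code.

```python
import random


def transfer_base_parameters(pretrained, head_sizes, frozen_prefixes, seed=0):
    """Build the parameters of a new model that reuses a pretrained base.

    pretrained:      dict name -> list of floats from the earlier training run.
    head_sizes:      dict name -> number of parameters of each new head tensor.
    frozen_prefixes: imported parameters with these name prefixes stay fixed;
                     the other imported parameters are fine-tuned.
    Returns (params, trainable), where trainable maps name -> bool.
    """
    rng = random.Random(seed)
    params, trainable = {}, {}
    for name, values in pretrained.items():
        if not name.startswith("base."):
            continue  # heads of the original tasks are not reused
        params[name] = list(values)  # copy, so the source run stays untouched
        trainable[name] = not name.startswith(tuple(frozen_prefixes))
    for name, size in head_sizes.items():
        # The new head has no pretrained counterpart: learn it from scratch.
        params[name] = [rng.uniform(-0.1, 0.1) for _ in range(size)]
        trainable[name] = True
    return params, trainable


# Hypothetical parameters of a previous training run.
pretrained = {
    "base.layer1.weight": [0.12, -0.40, 0.33],
    "base.layer2.weight": [0.05, 0.27],
    "malware_head.weight": [0.9, -0.1],
}
params, trainable = transfer_base_parameters(
    pretrained,
    head_sizes={"family_head.weight": 4},
    frozen_prefixes=["base.layer1."],
)
print(trainable)
```

An optimiser would then update only the parameters marked as trainable, which is what limits the number of values that must be learned from the small dataset.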

However, the Family Classifier model cannot produce family rankings, nor can it retrieve samples by their similarity to some anchor sample. Both are very useful in Information Security, since they let an analyst quickly obtain samples similar to the one under analysis and thus facilitate its study. The model is also limited to a fixed number of predefined families. To overcome these limitations a new model, referred to as the Contrastive Model, was introduced. It consists of a Siamese Network that refines, in a contrastive learning setting, the implicit representation of PE files (PE Embeddings) learned by a previous MTJE training run on the Sorel20M dataset (imported via Transfer Learning), using samples from the training subset of the Fresh Dataset and the Online Triplet Loss function. The resulting PE Embeddings can be used for family prediction/classification (by applying the distance-weighted k-NN, k Nearest Neighbours, algorithm in the embedding space), for family ranking, and for querying samples by their similarity in the embedding space.
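The two ingredients named above, an online triplet loss and distance-weighted k-NN over the embedding space, can be sketched in plain Python as follows. This is a sketch under assumptions, not the thesis implementation: the source does not say which triplet mining strategy is used, so the batch-hard variant shown here is an assumption, as are the Euclidean distance, the 1/d vote weighting, the function names and the toy two-dimensional embeddings.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, families, margin=1.0):
    """Online (batch-hard) triplet loss over one batch.

    For each anchor, take its farthest same-family sample and its closest
    other-family sample, and penalise max(0, d_pos - d_neg + margin).
    """
    losses = []
    for i, anchor in enumerate(embeddings):
        pos = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if j != i and families[j] == families[i]]
        neg = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if families[j] != families[i]]
        if pos and neg:  # anchors without a valid triplet are skipped
            losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


def rank_families(query, embeddings, families, k=3, eps=1e-9):
    """Distance-weighted k-NN: each of the k nearest samples votes with 1/d.

    Returns the families ordered from most to least likely.
    """
    nearest = sorted(zip((euclidean(query, e) for e in embeddings), families))[:k]
    votes = defaultdict(float)
    for dist, family in nearest:
        votes[family] += 1.0 / (dist + eps)
    return sorted(votes, key=votes.get, reverse=True)


# Toy 2-D embeddings for two hypothetical families "a" and "b".
embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
families = ["a", "a", "b", "b"]
loss = batch_hard_triplet_loss(embeddings, families, margin=1.0)
ranking = rank_families([0.2, 0.1], embeddings, families, k=3)
print(loss, ranking)  # 0.0 ['a', 'b']
```

Because prediction only compares distances to labelled samples, adding a new family requires new labelled embeddings rather than a new output layer, which is the flexibility the fixed-size classifier head lacks.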

The current implementation of the MTJE model gave very good results in the malware detection task and in malware description via SMART tags, considering the number of samples it was trained on. The Family Classifier and the Contrastive Model performed relatively well on the malware family classification task and, where applicable, on the ranking task, considering the small, low-quality dataset (the Fresh Dataset) they were trained on. These models also have limitations: the family classification and ranking results could be much better if a larger, higher-quality dataset were used for training, and the resulting implicit signatures lack interpretability. Future work that overcomes these shortcomings could be extremely helpful to malware analysts, antivirus software developers and system administrators, and could even enable the generation of explicit (and thus more interpretable) signatures derived from the learned implicit ones.

Copyright and License

Produced as a thesis project at the TORSEC research group of the Polytechnic of Turin (Italy) under the supervision of professor Antonio Lioy and engineer Andrea Atzeni and with the support of engineer Andrea Marcelli.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
