recontech404/AnomalyML

AnomalyML - A lightweight ML system that learns a computer's behavior around (but not limited to) processes, network activity, and user system calls, and provides feedback when an event is a system anomaly.

This started with an idea: was there an easier and more efficient way to detect system anomalies (processes, network traffic, etc.) in Linux environments with very low resource utilization, minimal maintenance, and high scalability for cloud workloads or on-prem? The thinking was that if I have a Linux system with rare human interaction, such as a database or web app, it should have a fairly small set of repeating processes and network/DNS requests, so could I get an alert when it does something abnormal, such as spinning up an unknown process or connecting to an unknown domain?

Let's take the example of a Kubernetes cluster running a web app. You have the main app process plus a few background system processes, and the app also does a network check every 15 seconds out to google.com and a few other domains. Because there is no human interaction with the system itself, your process/network event list should be fairly finite and repetitive, but your total event log size would grow continuously. In my experience, a setup similar to the one above had around ~30-40 unique process/network events (this was an Alpine Linux container running a Go app).

Normally, if you wanted to detect a process/network anomaly, you would run a database to collect all of the events and then, after a set learning period, check the DB whenever a new event arrives: if it is not found, create an alert and also insert the new event. However, this has large scalability issues; I have seen the same Go app peak at 20+ events/second. So what happens if you have ten thousand or a hundred thousand devices to monitor? Implementing some type of event cache has its own issues, such as data consistency across cache nodes, especially if you need to scale up or down based on demand or add/remove a benign/malicious event that slipped into the learning period.

This is where I had the odd but interesting idea of using a very small ML model as a read-only database. (I should preface this by saying that my background is not in machine learning; I only started learning about and running LLM models locally a few months ago for an automated ELF file classifier for malware. So there was a definite learning curve and mistakes made (probably still a few), but intriguing and challenging ideas are what keep me interested.) I did not want to create a massive dataset of known benign and malicious events: that would not be very lightweight, it would be impossible to collect all known events, and it could never be 100% accurate. My model would need to be 100% accurate, and I also wanted to remove the human-mistake factor when reviewing logs.

Take these two example events. They might look identical, but #1 is a benign event and #2 is malicious:

#1: Name: process1 — UserID: usr01 — Path: /var/dev/custom/process1

#2: Name: process1 — UserID: usr01 — Path: /var/dev/custom/processl

The difference is the lowercase L instead of the number one (how visible that is heavily depends on the viewing typeface; you can use Punycode in domains to similar effect).
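
To make the homoglyph concrete, here is a quick check of the code points involved (just an illustration, not code from the repo):

```python
# The two paths differ only in the final character: digit '1' vs letter 'l'.
benign = "/var/dev/custom/process1"
lookalike = "/var/dev/custom/processl"

print(ord("1"), ord("l"))  # 49 vs 108: visually similar, numerically distinct
diffs = [i for i, (b, m) in enumerate(zip(benign, lookalike)) if b != m]
print(diffs)  # a single differing position - only the last character
```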

My first challenge was what type of encoding to use for the ML model: one-hot encoding, dummy encoding, label encoding? I tried a lot of different types of encoding, but all seemed to have a similar problem: how can you ensure that the data you are testing on is not encoded to the same values you trained the model on? That was the case with label encoding and the benign (source data) and malicious (test data) processes I showed above; both were being encoded to the same values because neither encoder knew the other string existed. After several hours of testing and reading documentation on the different encoders, I decided the safest and easiest route would be to convert each text character into its Unicode code point. This way I can ensure that a 1 is never mistaken for an l. With the encoding working, the next question was what type of ML model to use.
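
A minimal sketch of that encoding step; the function name, padding value, and fixed length here are my own choices, not necessarily what the repo uses:

```python
import numpy as np

def encode_event(text: str, max_len: int) -> np.ndarray:
    """Map each character to its Unicode code point, zero-padded to a fixed length."""
    codes = [ord(ch) for ch in text[:max_len]]
    codes += [0] * (max_len - len(codes))   # pad so every event has the same shape
    return np.array(codes, dtype=np.int32)

print(encode_event("process1|usr01|/var/dev/custom/process1", max_len=48))
```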

I will skip the couple of hours I spent learning and testing different types of models (decision trees, random forests, etc.) before finally settling on k-nearest neighbors (knn). The easiest way I can describe how this works is that for every event in your training set, you create a new dot on a graph. Because of how my encoding works, two process or network events only land on the same dot if they have exactly the same Unicode array representation (remember, Unicode does not share code points across characters). A newly encoded dot will either match perfectly with an existing dot in the model or it will be slightly off, and if it is slightly off, we know it is an anomaly.

Now for the fun part: seeing it in action and how it performs. I put a bit of my Python code up on my GitHub in case people want to test it with some generated test data. Basically, we read in the known events (data.csv), convert each character into its Unicode value, and then build a 3D numpy array. The reason I said above that this was meant for Linux is that Linux has a maximum path length of 4096 characters, so worst case our array depth would be 4096. But that is the worst case; the vast majority of paths (usually the longest part of an event) are much shorter, typically 40-80 characters depending on the environment, so the array can be made shorter, which decreases model size. If an incoming event is longer than the training data length, it is obviously an anomaly. That said, there is no technical limitation preventing this from being used for platforms/use cases other than Linux process and network events; I am just familiar with process/network events on Linux, so that is what I chose to test.
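
A rough sketch of that loading step, assuming data.csv has one event string per row in a column named event (the repo's real CSV format may differ, and I flatten to a 2D events-by-characters array here rather than the 3D layout mentioned above, purely for brevity):

```python
import numpy as np
import pandas as pd

MAX_LEN = 80  # well under the 4096-character Linux path worst case

# Assumed layout: one event string per row in a column named "event".
events = pd.read_csv("data.csv")["event"].astype(str).tolist()

def encode(text: str) -> list:
    """Same per-character code-point encoding as sketched above, padded to MAX_LEN."""
    codes = [ord(ch) for ch in text[:MAX_LEN]]
    return codes + [0] * (MAX_LEN - len(codes))

X_train = np.array([encode(e) for e in events], dtype=np.int32)
print(X_train.shape)  # (number_of_events, MAX_LEN)

# Anything longer than the training length can be flagged as an anomaly immediately.
def too_long(text: str) -> bool:
    return len(text) > MAX_LEN
```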

After encoding the values, the key step in defining the model is to tell the knn model to look for only one neighbor at a time instead of the default five; this way every data point we have is treated as unique. After a fair amount of testing, I am happy to see that I am still getting 100% model accuracy in detecting anomalies, but what about the model size and performance? This is where I got a bit excited: for a dataset of 40 events with a length of 40 characters, the model size was 79KB. Not bad, provided it performs.
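
Here is my approximation of that one-neighbor setup, using scikit-learn's NearestNeighbors and an exact-distance check; the repo's actual code (which uses a decision_function call, see below) may be structured differently:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# X_train: code-point-encoded known events (see the loading sketch above);
# two tiny hard-coded events here just to keep the example self-contained.
X_train = np.array([[ord(c) for c in "proc1".ljust(8, "\0")],
                    [ord(c) for c in "proc2".ljust(8, "\0")]], dtype=np.int32)

# One neighbor only: an event counts as "known" iff it lands exactly on a training point.
model = NearestNeighbors(n_neighbors=1).fit(X_train)

def is_anomaly(event: str) -> bool:
    x = np.array([[ord(c) for c in event.ljust(8, "\0")[:8]]], dtype=np.int32)
    dist, _ = model.kneighbors(x)
    return dist[0, 0] != 0.0  # any nonzero distance means no exact match

print(is_anomaly("proc1"))  # False - seen during training
print(is_anomaly("procl"))  # True  - homoglyph, different code points
```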

[Screenshot: model size for 40 events x 40 characters]

So how about lookup performance? At least on my system (not the fastest single thread), I am seeing an individual lookup time of around ~0.00023 seconds. If I test continuous lookups, I get almost ~10k/second, which is pretty good considering this is single threaded. (In my code I use the decision_function call to check whether the model has a nearest neighbor at the test point; if the result is anything but 0, it is an anomaly. The Anomaly and Normal text in the output below comes from me changing the validation data between runs, and it reflects whether an anomaly was found.)
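
If you want to reproduce a rough lookups-per-second number with the NearestNeighbors sketch above, a simple single-threaded timing loop looks like this (numbers will differ per machine, and this is not the repo's benchmark code):

```python
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.integers(32, 127, size=(40, 40))   # 40 events x 40 code points
model = NearestNeighbors(n_neighbors=1).fit(X_train)

query = X_train[0:1]
n = 10_000
start = time.perf_counter()
for _ in range(n):
    model.kneighbors(query)
elapsed = time.perf_counter() - start
print(f"{elapsed / n:.6f} s per lookup, {n / elapsed:.0f} lookups/second")
```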

[Screenshot: single lookup timing]

[Screenshot: ~10k continuous lookups per second]

As mentioned above, if we look at the worst case of 4096 characters, the prediction time goes up a fair amount (~0.00057 seconds), as does the model size (7.6MB), but that is still nothing crazy, and as I said, 4096-character events are not very common.

[Screenshot: worst-case 4096-character lookup time and model size]

Instead of increasing the character count, what if we increase the number of events from 40 to 100?

[Screenshot: lookup time and model size for 100 events]

Again we see an expected increase in file size (195KB), but the more interesting part is the marginal increase in lookup time: it is almost identical at ~0.00025 seconds.

So why do all of this, and what interests me about this approach? As discussed above with the web-app example, if I collect everything for a set learning period into a DB, train a model on those unique events, and then put the model into the event/log pipeline, it completely removes the need for an extra DB lookup for every new event; you only need to insert the new event (or not, depending on whether you need to log it), which greatly reduces DB costs, and those get expensive very fast.

So how does the model compare against some type of caching? For one, once you are done training (which usually takes less than a second) you are left with a binary file. You can archive that file depending on whether the monitored system is online or not. You can spin up multiple instances of the model for high availability without having to worry about data inconsistency. What about updating the data if a new benign/malicious event is found? Pretty simple: just re-pull the original data from the DB and add or omit your new data. Training takes less than a second, and you can then transmit an update command to your pipeline with the new model, plus a checksum to make sure everything matches if needed. Scalability is also partially solved because each model is very small and requires very little memory; you could feasibly run several thousand or more models on a single system without the crazy amounts of memory some database setups I have seen require.

A couple of other interesting uses: run the model on the monitored device itself, since network egress costs add up (especially in the cloud), and only send back the events that are anomalies; or, if your learning data is very accurate, use the model to remediate anomalous processes that might have been missed by a protection system. You could also theoretically use a single model to monitor several of the same machines, provided they were set up the same; if a new machine is creating anomalies that the others are not, it might be an indicator that something is wrong.
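
A sketch of that retrain-and-ship step, assuming joblib for the binary file and SHA-256 for the checksum; the repo may persist and verify the model differently:

```python
import hashlib
import joblib
import numpy as np
from sklearn.neighbors import NearestNeighbors

def retrain_and_package(X_train: np.ndarray, path: str = "anomaly_model.joblib") -> str:
    """Retrain on the refreshed event set, write the binary, and return its checksum."""
    model = NearestNeighbors(n_neighbors=1).fit(X_train)
    joblib.dump(model, path)
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# After adding/removing events in the source DB, rebuild X_train and ship the new file
# along with its checksum so every pipeline node can verify it received the same model.
X_train = np.random.default_rng(0).integers(32, 127, size=(40, 40))  # placeholder data
print(retrain_and_package(X_train))
```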
