Machine Learning in Infosecurity – Current Challenges and Future Applications

Daniel Faggella

Daniel Faggella is Head of Research at Emerj. Called upon by the United Nations, World Bank, INTERPOL, and leading enterprises, Daniel is a globally sought-after expert on the competitive strategy implications of AI for business and government leaders.


Episode Summary: Uday Veeramachaneni is taking a new approach to machine learning in infosecurity (infosec). Traditionally, infosec has approached predicting attacks in two ways: (1) through a system of hand-designed rules, and (2) through anomaly detection, a technique that detects statistical outliers in the data. The problem with these approaches, Veeramachaneni says, is that the signal-to-noise ratio is too low. In this episode, Veeramachaneni discusses how his company, PatternEx, is using machine learning to provide more accurate attack prediction. He also discusses the cooperative roles of man and machine in building robust automated cyberdefense systems and walks us through a common security attack scenario.

 

Expertise: Cybersecurity; product management; software defined networking

Brief Recognition: Prior to co-founding PatternEx in 2013, Uday Veeramachaneni was Head of Product Management at Riverbed Technology, a Principal Product Manager in the Cloud Networking Group at Citrix, a Staff Software Engineer at Juniper Networks, and a Senior Engineer at Motorola. Veeramachaneni holds an MS in Computer Science from the University of Texas at Arlington and an MS in Economics from the Birla Institute of Technology and Science.

Current Affiliations: Co-founder and CEO of PatternEx

Interview Highlights:

The following is a condensed version of the full audio interview, which is available through the links above on Emerj’s SoundCloud and iTunes channels.

(2:06) Give us a brief rundown on where you see gaps in anomaly detection when it’s used alone

Uday Veeramachaneni: Generally, the way infosec has worked is through rules, and these rules generate alerts. Human analysts then chase those alerts down and figure out if it’s an attack or a false positive. The industry noticed that those alerts produce a lot of noise, so they looked at anomaly detection as a way of improving the ratio of alerts generated for real attacks versus those generated for false positives. Anomaly detection catches statistical outliers.

The problem is that while attacks could be statistical outliers, not all statistical outliers are attacks. That’s the challenge with anomaly detection: it’s flagging outliers which may end up being false positives. The next evolution in fighting attacks is using a team of human analysts, who identify which events are actual attacks. What a machine should do is go back through the data, and for each human-identified attack it should look for patterns to see how it can identify that attack if it happens again. Once the machine has figured out the pattern it can use that knowledge to predict what a human would identify as an attack. That’s what needs to be done in infosec to address false positive issues.
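To make that concrete, here is a minimal sketch of statistical outlier detection of the kind he describes, assuming a single hypothetical feature (bytes sent per host per time window) and a median-based outlier score; the data, threshold, and feature choice are illustrative, not PatternEx’s method:

```python
import numpy as np

# Hypothetical per-event feature: bytes sent by a host in a time window.
bytes_out = np.array([1200, 950, 1100, 880, 1050, 97000, 1020, 64500])

# Robust outlier score based on the median absolute deviation (MAD).
median = np.median(bytes_out)
mad = np.median(np.abs(bytes_out - median))
robust_z = 0.6745 * (bytes_out - median) / mad

# Flag events far from typical behavior (3.5 is a common cutoff).
outliers = np.where(np.abs(robust_z) > 3.5)[0]
print("Outlier event indices:", outliers)  # -> [5 7]

# The catch Veeramachaneni describes: both flagged events are statistical
# outliers, but one could be C2 exfiltration and the other a legitimate backup.
```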

(4:21) There’s this whole issue of context that the machine may not pick up on. For example, there may be a particular type of attack that happens primarily during the holidays, or in recent months has happened only in a particular industry. A human would immediately pick up on that, while a machine may have trouble with that broader conceptual understanding.

UV: For a machine to work, it needs some examples. Once a human gives it examples, it can go back and identify the patterns to predict that attack.

(5:13) You mentioned that anomaly detection is an unsupervised learning task, which means the machine is looking for patterns in data that has no labels, as is often the case in infosec. When a human does label the data, indicating an attack, that transforms it into a supervised learning task. You had used the term “active learning”; how does that differ conceptually from reinforcement learning?

UV: The challenge in infosec is that there aren’t many examples to train the machine. That’s why we’ve used anomaly detection, because you don’t need examples for that; you just need to detect statistical outliers. So we’ve started using the term “active learning,” where the machine asks a human analyst what he thinks of a certain event, and as the human gives feedback, the machine goes back and figures out how to construct a predictive model based on what he’s saying. You’re constructing a model on the fly based on the feedback the analyst is giving.
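A hedged sketch of such an active-learning loop, assuming synthetic feature vectors and an off-the-shelf scikit-learn classifier; the uncertainty-sampling rule for choosing which events to show the analyst is a common textbook choice, not necessarily what PatternEx does:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative feature vectors for unlabeled events (e.g., per-connection stats).
unlabeled = rng.normal(size=(500, 4))

# Seed labels from a handful of analyst-reviewed events.
X = rng.normal(size=(10, 4))
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 0 = benign, 1 = attack

model = LogisticRegression().fit(X, y)

# Uncertainty sampling: ask the analyst about events nearest the 0.5 boundary.
probs = model.predict_proba(unlabeled)[:, 1]
query_idx = np.argsort(np.abs(probs - 0.5))[:5]
print("Events to show the analyst:", query_idx)

# The analyst's answers (hypothetical here) are folded back into training,
# so the model is rebuilt on the fly from the feedback.
analyst_labels = np.array([1, 0, 0, 1, 0])
X = np.vstack([X, unlabeled[query_idx]])
y = np.concatenate([y, analyst_labels])
model = LogisticRegression().fit(X, y)
```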

(7:03) What is the role of man and what is the role of machine in this case? What is the initial human job to “plug in” this kind of technology? What human effort is needed to get this system up and running?

UV: Any company could have 100+ sources of data. We need to adapt to those sources of data and ensure we consume that data in a real-time streaming mode. The real AI piece starts after that. Day 1, there are no training examples. We start with the output from an anomaly detection system, or if you have a rules-based system then perhaps 50 alerts from that rules-based system. The human reviews them, and says, “These 48 are normal, and these two are attacks.”

The machine crunches the data, reviewing it for patterns, and the next day presents events similar to the human-identified attacks from the day before. The human gives feedback on whether the machine is correct or not; the analyst thus reinforces the machine, and the machine learns. And this happens continuously, because human attackers will evolve, and so the machine needs to evolve as well.
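A rough sketch of that day-over-day loop under stated assumptions: synthetic events, a stand-in analyst function, and a generic classifier in place of PatternEx’s actual models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def rule_based_alerts():
    """Stand-in for day-1 output of a rules or anomaly-detection system."""
    return rng.normal(size=(50, 6))

def analyst_review(events):
    """Stand-in for the human: the first two reviewed events are attacks."""
    labels = np.zeros(len(events), dtype=int)
    labels[:2] = 1
    return labels

# Day 1: no training examples; bootstrap from ~50 rule/anomaly alerts,
# which the analyst splits into 48 normal events and 2 attacks.
X = rule_based_alerts()
y = analyst_review(X)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Each following day: score new events, surface the most attack-like
# ones for review, and retrain on the analyst's feedback.
for day in range(2, 5):
    new_events = rng.normal(size=(1000, 6))
    scores = model.predict_proba(new_events)[:, 1]
    top = np.argsort(scores)[::-1][:20]         # most similar to past attacks
    feedback = analyst_review(new_events[top])  # human confirms or corrects
    X = np.vstack([X, new_events[top]])
    y = np.concatenate([y, feedback])
    model = RandomForestClassifier(random_state=0).fit(X, y)
```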

(10:12) What are some examples of signals in a cyberattack? What are the signals that would identify a run-of-the-mill attack?

UV: In infosec, human analysts have been trying to create these rules to figure out what an attack looks like. We’re flipping this around: humans tell the machine what an attack is, and the machine tells the humans what those attacks look like.

As an example, a very standard sort of attack is called command and control (C2) communication. At a very simple level, it’s when your computer is infected with a virus and controlled by the hacker’s remote server. The communication is very systematic; you may see it sending information every two hours, or every 30 minutes. You could create a rule saying, “If a computer communicates every 10 minutes, flag that as C2.” But you’ll get clobbered by false positives.
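A minimal sketch of such a hand-written beaconing rule, with hypothetical hosts, timestamps, and threshold, showing why it fires on anything that polls on a regular timer:

```python
from collections import defaultdict

import numpy as np

# Hypothetical connection log: (source_host, unix_timestamp) pairs.
events = [("hostA", t) for t in range(0, 3600, 600)] + \
         [("hostB", t) for t in (12, 950, 1400, 3100)]

# Naive rule: flag a host as C2 if its connection intervals are nearly constant.
by_host = defaultdict(list)
for host, ts in events:
    by_host[host].append(ts)

for host, times in by_host.items():
    gaps = np.diff(sorted(times))
    if len(gaps) >= 3 and gaps.std() < 5:  # beacons every ~10 minutes
        print(f"{host}: flagged as possible C2 beaconing")

# hostA fires the rule, but so would any software that polls on a timer
# (update checkers, monitoring agents) -- hence the false-positive flood.
```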

But if you flag many C2s and feed that to the machine, it could look for other parameters. It could be the standard deviation of the duration of the connections; it could be the number of bytes, or the number of packets. It’s going to identify those patterns across hundreds or thousands of parameters.
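One way to picture those candidate parameters, assuming hypothetical flow records and using pandas for the per-host aggregation:

```python
import pandas as pd

# Hypothetical flow records for one day.
flows = pd.DataFrame({
    "src":      ["hostA"] * 4 + ["hostB"] * 3,
    "duration": [1.9, 2.1, 2.0, 2.05, 0.3, 45.0, 7.2],
    "bytes":    [512, 530, 520, 525, 80, 90000, 60],
    "packets":  [8, 9, 8, 8, 2, 700, 5],
})

# Per-host candidate parameters of the kind the model can search over:
# means, standard deviations, and counts of durations, bytes, and packets.
features = flows.groupby("src").agg(["mean", "std", "count"])
print(features)

# A learned model, not a hand-written rule, decides which of these
# (out of hundreds or thousands) actually separates C2 from normal traffic.
```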

(13:27) What kind of information is sent back over a C2 attack? What’s often being taken?

UV: C2 is a stage in the attack cycle. Your machine has been infected with malware, and C2 is when the machine is communicating with the hacker. The hacker tells the machine to look for other things, which could be intellectual property, customer data, credit card information, and so on. The malware then exports this information to the hacker’s server.

(15:07) What are some other signals a machine might use to distinguish an attack?

UV: Humans generally look at sample statistics for bytes in, bytes out, packets in, packets out. Machines can look at more complex factors. For instance, this machine initiated 40 connections. What was the duration of each of those connections? What was the average of that duration? What was the standard deviation of that duration? And how does that data from previous attacks compare to the current data? Machines can look through factors that humans can’t find.
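A small sketch of that comparison, assuming hypothetical aggregate features and a simple standardized distance as a stand-in for whatever a production system actually computes:

```python
import numpy as np

# Hypothetical aggregate features: [n_connections, mean_duration, std_duration].
past_attacks = np.array([[42, 1.8, 0.10],
                         [38, 2.2, 0.20],
                         [45, 2.0, 0.15]])
current_host = np.array([40, 1.9, 0.12])

# How close is today's behavior to the profile of previously labeled attacks?
mu, sigma = past_attacks.mean(axis=0), past_attacks.std(axis=0)
distance = np.abs((current_host - mu) / sigma).mean()
print(f"standardized distance to past-attack profile: {distance:.2f}")

# A small distance suggests behavior resembling earlier confirmed attacks.
```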

(17:08) So there is a myriad of mathematical permutations of factors that a human wouldn’t be able to sift through, but the machine can find some of those underlying patterns that a human may not even be able to think of.

UV: That’s exactly what it is. Once the human says it’s an attack, the machine goes through millions of combinations to figure out the correct combination to predict that attack.
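To make the “millions of combinations” idea concrete, here is a hedged sketch in which a generic ensemble model (an illustrative stand-in) recovers which two of fifty synthetic parameters jointly predict the analyst’s labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# 1,000 hypothetical events described by 50 candidate parameters.
X = rng.normal(size=(1000, 50))
# Make the "attack" label depend on a combination of parameters 3 and 17.
y = ((X[:, 3] > 0.5) & (X[:, 17] < -0.2)).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The ensemble has effectively searched combinations of parameters;
# its importances reveal which ones jointly predict the labeled attacks.
top = np.argsort(model.feature_importances_)[::-1][:5]
print("Most predictive parameters:", top)  # expect 3 and 17 near the top
```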

(18:19) You talked about the “holy grail” of machine learning and AI in infosec, where the “good guys” could share and consolidate knowledge around what attacks look like and be able to protect themselves from attackers. What is that holy grail as you see it?

UV: The holy grail of AI in infosec is having machines that can find complex patterns that are good predictors. To do that, humans have to train the machine, and they’re doing that at every company. Can we share a complex pattern detected at one company with another company? That’s the holy grail. If you’re able to share a complex pattern, it’s very difficult for an attacker to adapt to it; they’d have to change the underlying tools they use. Exchanging an IP address or email address blacklist, by contrast, would be much easier for an attacker to adapt to.

(20:19) There’d have to be some sort of underlying body that would provide a way to source those common attack patterns across companies.

UV: There’s a lot of automation there; it’s not a manual thing. As AI matures, it should be able to automatically take data in from the outside and learn from it. There’s no exchange of data, per se. It’s more that if someone participates in an AI network, he should be able to train his AI much faster and across a much broader set of attack vectors than if he were having his analysts train it by themselves.
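A speculative sketch of sharing patterns rather than data: one company serializes its fitted detector and another loads it; every name, file, and dataset here is hypothetical:

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Company A trains a detector on its own analysts' labels
# (synthetic stand-in data here)...
X_a = rng.normal(size=(200, 6))
y_a = rng.integers(0, 2, size=200)
detector = RandomForestClassifier(random_state=0).fit(X_a, y_a)

# ...and shares only the fitted model -- never its raw event logs.
joblib.dump(detector, "shared_attack_patterns.joblib")

# Company B loads the shared detector and scores its own traffic with it,
# while its own analysts keep adding local labels to evolve their copy.
shared = joblib.load("shared_attack_patterns.joblib")
scores = shared.predict_proba(rng.normal(size=(10, 6)))[:, 1]
```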

(21:15) Hopefully, there would be some facilitation of that process, where we would see more and more businesses being able to detect and block an attack as it occurs. Although with malicious actors on the other end, they’ll be working hard too.

UV: The key is the machine-analyst combination needs to evolve faster than the attackers are evolving. That’s the crux of solving this problem.

Big Ideas:

1 – In a realm of unlabeled data, infosec analysts can empower machine learning systems through active learning, where machines are continuously given human feedback on unlabeled events to improve pattern recognition.

2 – The “holy grail” of infosec is an AI network that automatically shares attack patterns with all other companies in the network, so that an attack on one can quickly be defended against by all. This is just one of many possible AI applications in data security.
