Introduction

We all need to agree that information security is a Big Data problem. Humans generate huge amounts of data in the form of blog posts, social media, business data, e-mails, instant messaging, videos, Darknet traffic, machine-generated data and other sources. This explosion of data is fueled by the unprecedented growth of internet usage and smartphones. This ever-cheaper handheld technology enables us to create, capture, store, share, and manage information with unprecedented convenience and efficiency. Woven into this heap of noise are both our most guarded secrets and the shadows of threats that seek to uncover them.

Big Data

To give you an idea of how much data we are producing, look at the research conducted by IDC and sponsored by EMC, shown in Picture-01.

 

Picture-01

Every two days we create as much information as we did from the dawn of civilization up until 2003, per Schmidt. To quantify his point: prior to 2005, we had produced roughly 130 Exabytes. This value skyrocketed by 2010, by which time we had produced around 1200 Exabytes of data. By 2015, we had produced 7900 Exabytes, and the projection for 2020 is 41000 Exabytes.

To help us comprehend the amount of data in an Exabyte: the Amazon rain forest contains roughly 1.4 billion acres of trees, and every acre has about 500 big trees, which makes about 700 billion trees in the Amazon rain forest. If we chopped and pulped all those trees into paper and then filled every page with letters, the text would form one to two Petabytes of data. An Exabyte is 1000 Petabytes. This is the magnitude of the data that humanity is producing. Whether we are aware of it or not, we are in the age of Data Exhaust.
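The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch using the numbers quoted above; the function names are made up for illustration, and treating "prior to 2005" as the year 2005 is a simplifying assumption.

```python
# Figures quoted in the text above.
AMAZON_ACRES = 1_400_000_000      # acres of rain forest
TREES_PER_ACRE = 500              # big trees per acre
PETABYTES_PER_EXABYTE = 1000      # 1 Exabyte = 1000 Petabytes

# Exabytes produced by each year, as quoted from the IDC/EMC study.
# Assumption: "prior to 2005" is keyed as 2005 for simplicity.
EXABYTES_BY_YEAR = {2005: 130, 2010: 1200, 2015: 7900, 2020: 41000}


def amazon_trees():
    """Total number of big trees in the Amazon rain forest."""
    return AMAZON_ACRES * TREES_PER_ACRE


def growth_factor(start_year, end_year):
    """How many times larger the data volume is at end_year than at start_year."""
    return EXABYTES_BY_YEAR[end_year] / EXABYTES_BY_YEAR[start_year]


def amazons_per_exabyte(petabytes_per_amazon):
    """How many pulped Amazon rain forests of text fit into one Exabyte."""
    return PETABYTES_PER_EXABYTE / petabytes_per_amazon


print(amazon_trees())                      # number of trees
print(round(growth_factor(2005, 2020)))    # growth between 2005 and 2020
print(amazons_per_exabyte(2), amazons_per_exabyte(1))
```

With one to two Petabytes of text per rain forest, a single Exabyte corresponds to somewhere between 500 and 1000 Amazon rain forests turned into paper.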

5V’s

To better understand big data, we can describe the concept in five dimensions, all starting with the letter V:

  • Volume – Refers to the vast amount of data generated every minute.

  • Variety – Refers to the different types of data we consume. In the past, we focused on structured data. In the field of security specifically, we deal with a tremendous variety of unstructured data to identify threats and adversaries. This data includes network packets, social media conversations, videos, images, darknet traffic, logs, IoT data and other kinds of data.

  • Velocity – Refers to the speed at which new data is generated and the speed at which this data moves around.

  • Veracity – Refers to the trustworthiness of the data. In information security this is very important. If the data is not accurate, the tools which consume it may produce many false positives, which erodes confidence in those tools.

  • Value – Refers to the value we glean from this data. This could be qualitative or quantitative. If the data can save an organization from a breach, that brings huge value to the organization, and at the same time it can save the company from reputational damage.

It would be humanly impossible to process and analyze this much data. Businesses become vulnerable to security breaches if they don’t properly analyze the data. To identify attacks and breaches, the security industry has added some tools to its arsenal.

Current State

To transform the 5V’s into insight, the generated data needs to be analyzed by security tools to identify potential attacks and breaches. Traditionally we have used SIEM (Security Information and Event Management) tools, which connect disparate, isolated systems and bring their logs and events together to paint a bigger picture. SIEM tools analyze these logs, correlate them based on signatures, rules and behavior, and produce actionable alerts for a potential attack or breach. SIEM tools eventually evolved to index and search big data using keywords and relationships. These tools provide good visualization to see behaviors and trends and to predict attacks and breaches.
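To make the idea of rule-based correlation concrete, here is a minimal sketch in Python. It is not how any particular SIEM product works. The event format, the threshold, the time window and the function name correlate_failed_logins are all assumptions chosen for illustration. The rule brings together failed logins reported by different systems and raises an alert when one source address fails too often in a short time.

```python
from collections import defaultdict

# Hypothetical rule parameters: alert when one source produces
# THRESHOLD failed logins within WINDOW_SECONDS.
THRESHOLD = 3
WINDOW_SECONDS = 60


def correlate_failed_logins(events, threshold=THRESHOLD, window=WINDOW_SECONDS):
    """Return the sorted list of source IPs that trip the brute-force rule.

    `events` is an iterable of (timestamp_seconds, system, src_ip, outcome)
    tuples collected from different, otherwise isolated log sources.
    """
    failures = defaultdict(list)
    for timestamp, _system, src_ip, outcome in events:
        if outcome == "fail":
            failures[src_ip].append(timestamp)

    alerts = set()
    for src_ip, times in failures.items():
        times.sort()
        # Compare each failure with the one `threshold - 1` positions later;
        # if both fall inside the window, the rule fires.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                alerts.add(src_ip)
                break
    return sorted(alerts)


# Made-up events from three hypothetical systems.
events = [
    (0, "vpn", "10.0.0.5", "fail"),
    (20, "webmail", "10.0.0.5", "fail"),
    (45, "ssh", "10.0.0.5", "fail"),
    (10, "vpn", "10.0.0.9", "fail"),
    (300, "ssh", "10.0.0.9", "fail"),
    (400, "webmail", "10.0.0.9", "fail"),
    (50, "vpn", "10.0.0.7", "ok"),
]
print(correlate_failed_logins(events))
```

No single system in this example sees more than one failure from 10.0.0.5, so the pattern only becomes visible once the logs are brought together.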

Future State

Human data scientists can process around 5000 Exabytes of the world’s data. At present, machines are analyzing approximately 12000 Exabytes. The data that machines are not processing due to capacity limitations can be processed with machine learning techniques. With the ability to process big data, ML (Machine Learning) and DL (Deep Learning) become beacons of hope for cyber security. Machine learning holds great promise for the security industry’s ability to detect advanced and unknown attacks, particularly those leading to data breaches.

Machine learning techniques have been applied in many areas due to their scalability, adaptability, and potential to rapidly adjust to new data sets and unknown challenges. Information security is a fast-paced field that demands a great deal of attention because of remarkable progress in social networks, cloud, IoT, web technologies, online banking, mobile environments, etc. Different machine learning methods have been adopted and deployed in such environments to address different security and non-security problems. We should leverage ML to defend against the bad guys.

To give you a glimpse of the myriad things machine learning applications can do, take the task of online shopping. Almost every large online storefront will recommend items you may want to purchase. These recommendations are based on a few data points, for example your previous shopping history, your recent searches, or even who your friends are. Other common applications of machine learning in today’s technology include face recognition, voice recognition, email spam filtering, fraud detection, NLU (Natural Language Understanding), NLP (Natural Language Processing), video analysis, etc.

What is ML (Machine Learning)?

Now that we’ve established the need and potential benefits of machine learning, we must understand the concept behind the hype. According to Wikipedia –

Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed." Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through building a model from sample inputs.

First, let us consider ML from a ten-thousand-foot view. Machine learning can be divided into the following categories:

Supervised learning

The main goal of supervised learning is to develop a model from labeled training data. This model will allow us to predict future results. The term supervised refers to a set of samples where the desired output labels are already known. Supervised learning is further divided into two categories:

  • Classification - Classification is a subcategory of supervised learning focused on predicting the categorical class labels of new instances based on past observations. Those class labels are discrete, unordered values that can be understood as the group memberships of the instances. Consider the example of e-mail spam filtering - we can train a model using a supervised machine learning algorithm on a corpus of labeled e-mail, categorized by users as spam or ham (not-spam). Once the machine learns to predict spam or ham, it can filter the spam out of the user’s inbox. A supervised learning task with discrete class labels, such as e-mail spam filtering, is a good classification example.

 

  • Regression - A second type of supervised learning is the prediction of continuous outcomes, which is also called regression analysis. In regression analysis, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome), and we try to find a relationship between those variables that allows us to predict an outcome. For example, let's assume that we are interested in predicting the Math SAT scores of our students. If there is a relationship between the time spent studying for the test and the final scores, we could use it as training data to teach a model to use study time to predict the test scores of future students who are planning to take this test.
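Both kinds of supervised learning can be sketched in plain Python. The article names no algorithm, so the choices here are assumptions: a tiny naive Bayes classifier for the spam example and an ordinary least-squares line for the study-time example. The training data and the function names are made up for illustration.

```python
import math
from collections import Counter


# --- Classification: a tiny naive Bayes spam filter ---
def train_spam_filter(labeled_mail):
    """Count word occurrences per label from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    mail_counts = Counter()
    for text, label in labeled_mail:
        mail_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, mail_counts


def classify(text, model):
    """Return the label with the highest log-probability for `text`."""
    word_counts, mail_counts = model
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    log_probs = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Prior: the share of training mail carrying this label.
        log_prob = math.log(mail_counts[label] / sum(mail_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero the score.
            count = word_counts[label][word] + 1
            log_prob += math.log(count / (total_words + len(vocabulary)))
        log_probs[label] = log_prob
    return max(log_probs, key=log_probs.get)


# --- Regression: least-squares line through (hours, score) pairs ---
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared prediction error."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Made-up labeled e-mail, for illustration only.
mail = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_spam_filter(mail)
print(classify("claim your free money", model))
print(classify("agenda for lunch", model))

# Made-up study hours and Math SAT scores, for illustration only.
hours = [2, 4, 6, 8]
sat_scores = [500, 560, 620, 680]
slope, intercept = fit_line(hours, sat_scores)
print(slope * 10 + intercept)  # predicted score for 10 hours of study
```

The classifier returns one of two discrete labels, while the fitted line returns a continuous number. That is the practical difference between the two categories.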

 

Unsupervised Learning

Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labeled responses. Its best-known subcategory is clustering.

  • Clustering - Clustering is an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of their group memberships. Each cluster that may arise during the analysis defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters. For example, it allows marketers to discover customer groups based on their interests in order to develop distinct marketing programs.
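As a rough illustration of clustering, the sketch below uses k-means. That choice is an assumption, since the article does not name an algorithm. It groups made-up customers by two hypothetical attributes. No labels are supplied; the groups emerge from the distances between points alone.

```python
def k_means(points, k, iterations=10):
    """Cluster 2-D points with k-means.

    The first k points seed the centroids, which keeps this sketch
    deterministic; real implementations usually seed at random.
    """
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return assignments, centroids


# Hypothetical customers: (visits per month, average basket in dollars).
customers = [(1, 10), (12, 95), (2, 12), (11, 100), (1, 14), (13, 105)]
labels, centers = k_means(customers, k=2)
print(labels)
print(centers)
```

Customers that share a label belong to the same cluster. A marketer could then look at each cluster's center to decide what kind of program suits that group.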

 

Reinforcement Learning

Another type of machine learning is reinforcement learning. In reinforcement learning, the goal is to develop a system (agent) that improves its performance based on interactions with the environment. Since the information about the current state of the environment typically also includes a so-called reward signal, we can think of reinforcement learning as a field related to supervised learning. However, in reinforcement learning this feedback is not the correct ground truth label or value, but a measure of how well the action was measured by a reward function. Through the interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximizes this reward via an exploratory trial-and-error approach or deliberative planning. A popular example of reinforcement learning is a chess engine. Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as win or lose at the end of the game.
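A chess engine is far too large for a short example, so the sketch below swaps in a much smaller, hypothetical environment: a corridor of five cells in which the agent is rewarded only for reaching the last cell. It uses tabular Q-learning, one of several reinforcement learning methods. The environment, the parameter values and the function names are all assumptions made for illustration.

```python
import random

N_STATES = 5                 # corridor cells 0..4; reaching cell 4 ends an episode
ACTIONS = ("left", "right")


def step(state, action):
    """Environment: move one cell; reward 1 only on reaching the last cell."""
    if action == "right":
        next_state = state + 1
    else:
        next_state = max(0, state - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values by trial and error, exploring at random."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)            # pure exploration
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state = next_state
    return q


q = q_learning()
# Greedy policy: in each non-terminal cell, pick the action with the higher value.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told which move is correct. It only receives the reward signal at the end of the corridor, and the learned values come to favor moving right in every cell.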

 

Now that we know the different categories of machine learning, their subcategories, their definitions and some non-security use cases, let us try to identify where we can use ML in information security without going into algorithm details.

ML Use cases

With the rapid evolution of web, mobile, cloud and IoT technologies, attack techniques are also becoming more sophisticated at penetrating systems and evading generic signature-based approaches. Machine learning techniques offer potential solutions for such challenging and complex situations due to their ability to adapt quickly to new and unknown circumstances. Diverse machine learning methods have already been deployed successfully to address wide-ranging problems in computer and information security. The following are some of the applications of machine learning in information security:

  • Malware detection

  • User behavior analysis

  • Mitigating Denial of Service Attacks

  • Web Application Firewall (WAF)

  • Detecting Malicious URLs

  • SPAM Filtering

  • Reputation in Cyber Space

  • User Identification

  • Detecting Identity Theft

  • Information Leakage Detection and Prevention

  • Social Network Security

  • Detecting Advanced Persistent Threats

  • Detecting Hidden Channels

  • Writing malware

  • and More ...

 

References

Raschka, Sebastian. Python Machine Learning. S.l.: Packt Limited, 2015. Print.

Wikipedia contributors. "Machine learning." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 4 May. 2017. Web.

THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East – IDC and EMC study.

Profile

Raghu started his career as a systems programmer building CAD systems, mathematical packages, and compilers / interpreters and designing IDS systems. He received the achievement award from Bell Labs for his work on configuration management tools. Raghu has had opportunities to work in senior roles in many aspects of Information Technology including Application Development, Information Management, and Enterprise Architecture. Over the past 13 years, he has been involved in Pen testing, malware analysis, malware creation, Security Operations/Architecture, Machine Learning, Blockchains and Security Governance. After holding the Head of Information Security position at a California bank, Raghu is currently working as CTO of Cryptyk, a Blockchain-based Distributed Enterprise Storage Company.

Education: Bachelor of Engineering in Computer Science, Masters in Computer Science and Information Management.

Certifications: CISSP, ISSAP, ITIL V3, CCSK, Paloalto ICM & ATS, SANS App Security, CMNA, FireEye SE, McAfee Operations Solutions Certification