Data Science for Cyber-Security
eBook - ePub

Nick Heard, Niall Adams, Patrick Rubin-Delanchy, Melissa Turcotte

  1. 304 pages
  2. English

About This Book

Cyber-security is a matter of rapidly growing importance in industry and government. This book provides insight into a range of data science techniques for addressing these pressing concerns.

The application of statistical and broader data science techniques provides an exciting growth area in the design of cyber defences. Networks of connected devices, such as enterprise computer networks or the wider so-called Internet of Things, are all vulnerable to misuse and attack, and data science methods offer the promise to detect such behaviours from the vast collections of cyber traffic data sources that can be obtained. In many cases, this is achieved through anomaly detection of unusual behaviour against understood statistical models of normality.

This volume presents contributed papers from an international conference of the same name held at Imperial College. Experts from the field present their latest discoveries and review state-of-the-art technologies.


Contents:

  • Unified Host and Network Data Set (Melissa J M Turcotte, Alexander D Kent and Curtis Hash)
  • Computational Statistics and Mathematics for Cyber-Security (David J Marchette)
  • Bayesian Activity Modelling for Network Flow Data (Henry Clausen, Mark Briers and Niall M Adams)
  • Towards Generalisable Network Threat Detection (Blake Anderson, Martin Vejman, David McGrew and Subharthi Paul)
  • Feature Trade-Off Analysis for Reconnaissance Detection (Harsha Kumara Kalutarage and Siraj Ahmed Shaikh)
  • Anomaly Detection on User-Agent Strings (Eirini Spyropoulou, Jordan Noble and Christoforos Anagnostopoulos)
  • Discovery of the Twitter Bursty Botnet (Juan Echeverria, Christoph Besel and Shi Zhou)
  • Stochastic Block Models as an Unsupervised Approach to Detect Botnet-Infected Clusters in Networked Data (Mark Patrick Roeling and Geoff Nicholls)
  • Classification of Red Team Authentication Events in an Enterprise Network (John M Conroy)
  • Weakly Supervised Learning: How to Engineer Labels for Machine Learning in Cyber-Security (Christoforos Anagnostopoulos)
  • Large-scale Analogue Measurements and Analysis for Cyber-Security (George Cybenko and Gil M Raz)
  • Fraud Detection by Stacking Cost-Sensitive Decision Trees (Alejandro Correa Bahnsen, Sergio Villegas, Djamila Aouada and Björn Ottersten)
  • Data-Driven Decision Making for Cyber-Security (Mike Fisk)


Readership: Researchers at all levels in cyber-security and data science.
Key Features:

  • A collection of papers introducing novel methodology for cyber-data analysis

Information

Publisher: WSPC (EUROPE)
Year: 2018
ISBN: 9781786345653

Chapter 1

Unified Host and Network Data Set

Melissa J. M. Turcotte*,‡, Alexander D. Kent* and Curtis Hash†
*Los Alamos National Laboratory,
Los Alamos, NM 87545, USA

†Ernst & Young, New Mexico, USA
‡[email protected]
The lack of data sets derived from operational enterprise networks continues to be a critical deficiency in the cyber-security research community. Unfortunately, releasing viable data sets to the larger community is challenging for a number of reasons, primarily the difficulty of balancing security and privacy concerns against the fidelity and utility of the data. This chapter discusses the importance of cyber-security research data sets and introduces a large data set derived from the operational network environment at Los Alamos National Laboratory (LANL). The hope is that this data set and associated discussion will act both as a catalyst for new research in cyber-security and as motivation for other organisations to release similar data sets to the community.

1. Introduction

The lack of diverse and useful data sets for cyber-security research continues to play a profound and limiting role within the relevant research communities and their resulting published research. Organisations are reluctant to release data for security and privacy reasons. In addition, the data sets that are released are encumbered in a variety of ways, from being stripped of so much information that they no longer provide rich research and analytical opportunities, to being so constrained by access restrictions that key details are lacking and independent validation is difficult. In many cases, organisations do not collect relevant data in sufficient volumes or with high enough fidelity to provide cyber-research value. Unfortunately, there is generally little motivation for organisations to overcome these obstacles.
In an attempt to help stimulate a larger research effort focused on operational cyber-data as well as to motivate other organisations to release useful data sets, Los Alamos National Laboratory (LANL) has released two data sets for public use (Kent, 2014, 2016). A third, entitled the Unified Host and Network Data Set, is introduced in this chapter.
The Unified Host and Network Data Set is a subset of network flow and computer events collected from the LANL enterprise network over the course of approximately 90 days. The host (computer) event logs originated from the majority of LANL's computers that run the Microsoft Windows operating system. The network flow data originated from many of the internal core routers within the LANL enterprise network and are derived from router netflow records. The two data sets include many of the same computers but are not fully inclusive; the network data set includes many non-Windows computers and other network devices.
Identifying values within the data sets have been de-identified (anonymised) to protect the security of LANL's operational IT environment and the privacy of individual users. The de-identified values match across both the host and network data, allowing the two data elements to be used together for analysis and research. In some cases, the values were not de-identified, including well-known network ports, system-level usernames (not associated with people) and core enterprise hosts. In addition, a small set of hosts, users and processes were combined where they represented well-known, redundant entities. This consolidation was done for both normalisation and security purposes.
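The chapter does not say how the de-identified values were generated. As a minimal sketch of the property described above (the same raw value always maps to the same token, so host and network records can still be joined), one common approach is a keyed hash with an allow-list of values left untouched. The key, the allow-list contents, the token prefixes and the function name below are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret and allow-list; not taken from the data set.
SECRET_KEY = b"example-key-not-for-real-use"
KEEP_AS_IS = {"SYSTEM", "80", "443"}  # e.g. system usernames, well-known ports


def de_identify(value: str, prefix: str) -> str:
    """Map a raw identifier to a stable pseudonym, unless it is allow-listed."""
    if value in KEEP_AS_IS:
        return value
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate for readability; a real scheme would need to consider collisions.
    return f"{prefix}{digest[:8]}"


# The same host name seen in a host log and in a flow record gets one token.
token_from_host_log = de_identify("host-17.example", "Comp")
token_from_flow_data = de_identify("host-17.example", "Comp")
print(token_from_host_log == token_from_flow_data)
```

Because the mapping depends only on the raw value and the key, it can be applied independently to each data source while keeping the two joinable.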
In order to transform the data into a format that is useful for researchers who are not domain experts, a significant effort was made to normalise the data while minimising the artefacts that such normalisation might introduce.

1.1. Related public data sets

A number of public, cyber-security relevant data sets are currently referenced in the literature (Glasser and Lindauer, 2013; Ma et al., 2009) or are available online. Some of these represent data collected from operational environments, while others capture specific, pseudo real-world events (for example, cyber-security training exercises). Many data sets are synthetic and created using models intended to represent specific phenomena of relevance; for example, the Carnegie Mellon Software Engineering Institute provides several insider threat data sets that are entirely synthetic (Glasser and Lindauer, 2013). In addition, many of the data sets commonly seen within the research community are egregiously dated. The DARPA cyber-security data sets (Cyber-Systems and Technology Group, 1998) published in the 1990s are still regularly used, even though the systems, networks and attacks they represent have almost no relevance to modern computing environments.
Another issue is that many of the available data sets have restrictive access and constraints on how they may be used. For example, the U.S. Department of Homeland Security provides the Information Marketplace for Policy and Analysis of Cyber-risk and Trust (IMPACT), which is intended to facilitate information sharing. However, the use of any of the data hosted by IMPACT requires registration and vetting prior to access. In addition, data owners may (and often do) place limitations on how and where the data may be used.
Finally, many of the existing data sets are not adequately characterised for potential researchers. It is important that researchers have a thorough understanding of the context, normalisation processes, idiosyncrasies and other aspects of the data. Ideally, researchers should have sufficiently detailed information to avoid making false assumptions and to reproduce similar data. The need for such detailed discussion around published data sets is a primary purpose of this chapter.
The remainder of this chapter is organised as follows: a description of the Network Flow Data is given in Section 2 followed by the Windows Host Log Data in Section 3. Finally, a discussion of potential research directions is given in Section 4.

2. Network Flow Data

The network flow data set included in this release comprises records describing communication events between devices connected to the LANL enterprise network. Each flow is an aggregate summary of a (possibly) bi-directional network communication between two network devices. The data are derived from Cisco NetFlow Version 9 (Claise, 2004) flow records exported by the core routers. As such, the records lack the payload-level data upon which most commercial intrusion detection systems are based. However, research has shown that flow-based techniques have a number of advantages and are successful at detecting a variety of malicious network behaviours (Sperotto et al., 2010). Furthermore, these techniques tend to be more robust against the vagaries of attackers, because they are not searching for specific signatures (for example, byte patterns) and they are encryption-agnostic. Finally, in comparison to full-packet data, collection, analysis and archival storage of flow data at enterprise scales is straightforward and requires minimal infrastructure.

2.1. Collection and transformation

As mentioned previously, the raw data consisted of NetFlow V9 records that were exported from the core network routers to a centralised collection server. While V9 records can contain many different fields, only the following are considered: StartTime, EndTime, SrcIP, DstIP, Protocol, SrcPort, DstPort, Packets and Bytes. The specifics of the hardware and flow export protocol are largely irrelevant, as these fields are common to all network flow formats of which the authors are aware.
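For readers who want to work with these fields programmatically, the nine fields can be held in a small record type. The sketch below is illustrative only: the Python names, the use of epoch seconds for the timestamps and the use of an integer protocol number are assumptions, not details given in the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One flow record carrying the nine fields listed above."""
    start_time: int   # StartTime (assumed: epoch seconds)
    end_time: int     # EndTime (assumed: epoch seconds)
    src_ip: str       # SrcIP
    dst_ip: str       # DstIP
    protocol: int     # Protocol (assumed: IP protocol number, 6 = TCP)
    src_port: int     # SrcPort
    dst_port: int     # DstPort
    num_packets: int  # Packets
    num_bytes: int    # Bytes

    @property
    def duration(self) -> int:
        """Elapsed time between the first and last packet of the flow."""
        return self.end_time - self.start_time


# Hypothetical example values, not taken from the data set.
record = FlowRecord(100, 105, "A", "B", 6, 51234, 443, 10, 4200)
print(record.duration)
```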
This data can be quite challenging to model without a thorough understanding of its various idiosyncrasies. The following paragraphs discuss two of the most relevant issues with respect to modelling. For a comprehensive overview of these issues, among others, readers can refer to Hofstede et al. (2014).
Firstly, note that these flow records are uni-directional (uniflows): each record describes a stream of packets sent from one network device (SrcIP) to another (DstIP). Hence, an established TCP connection (bi-directional by definition) between two network devices, A and B, results in two flow records: one from A to B and another from B to A. It follows that there is no relationship between the direction of a flow and the initiator of a bi-directional connection (i.e., it is not known whether A or B connected first). This is the case for most netflow implementations as bi-directional flow (biflow) protocols such as Trammell and Boschi (2008) have yet to gain widespread adoption. Clearly, this presents a challenge for detection of attack behaviours, such as lateral movement, where directionality is of primary concern.
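To make the uniflow issue concrete, the following sketch shows one TCP connection between A and B appearing as two records, and a direction-independent key that groups them. The tuple layout, the example values and the name connection_key are hypothetical; note that the key identifies the pair but says nothing about which side initiated the connection.

```python
from collections import defaultdict

# Hypothetical uniflows: (src_ip, dst_ip, protocol, src_port, dst_port)
uniflows = [
    ("A", "B", 6, 51234, 443),  # packets sent from A to B
    ("B", "A", 6, 443, 51234),  # packets sent from B to A (same connection)
    ("C", "B", 6, 40000, 22),   # unrelated flow with no reverse record
]


def connection_key(flow):
    """Return a key that is identical for both directions of a connection."""
    src_ip, dst_ip, protocol, src_port, dst_port = flow
    endpoints = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (protocol, endpoints[0], endpoints[1])


groups = defaultdict(list)
for flow in uniflows:
    groups[connection_key(flow)].append(flow)

# Connections for which both directions were observed.
paired = [flows for flows in groups.values() if len(flows) == 2]
print(paired)
```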
Secondly, significant duplication can occur due to flows encountering multiple netflow sensors in transit to their destination. Routers can be configured to track flows on ingress and egress, and, in more complex network topologies, a single flow can traverse multiple routers. More recently, the introduction of netflow-enabled switches and dedicated netflow appliances has exacerbated the issue. Ultimately, a single flow can result in many distinct flow records. To add further complexity, the flow records are not necessarily exact duplicates and their arrival times can vary considerably; these inconsistencies occur for many reasons, the particulars of which are too complex to discuss in this context.
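One simple way to picture the duplication problem is to collapse records that share a 5-tuple and start within a short time of each other. This is only a sketch under stated assumptions: the dict layout, the tolerance parameter and the choice to keep the largest counts (rather than sum them) are illustrative and are not the procedure used for the data set.

```python
def merge_duplicates(records, tolerance=5):
    """Collapse records with the same 5-tuple whose start times are close.

    Each record is a dict with keys: key (the 5-tuple), start, end,
    packets, bytes. `tolerance` is a hypothetical window in seconds.
    """
    merged = []
    last_by_key = {}  # 5-tuple -> most recent merged record for that tuple
    for rec in sorted(records, key=lambda r: r["start"]):
        prev = last_by_key.get(rec["key"])
        if prev is not None and rec["start"] - prev["start"] <= tolerance:
            # Assume another sensor saw the same flow: widen the time span
            # and keep the largest counts instead of double counting.
            prev["end"] = max(prev["end"], rec["end"])
            prev["packets"] = max(prev["packets"], rec["packets"])
            prev["bytes"] = max(prev["bytes"], rec["bytes"])
        else:
            current = dict(rec)
            merged.append(current)
            last_by_key[rec["key"]] = current
    return merged


five_tuple = ("A", "B", 6, 51234, 443)
records = [
    {"key": five_tuple, "start": 100, "end": 110, "packets": 10, "bytes": 4200},
    {"key": five_tuple, "start": 101, "end": 111, "packets": 9, "bytes": 4100},
    {"key": five_tuple, "start": 500, "end": 505, "packets": 3, "bytes": 180},
]
merged = merge_duplicates(records)
print(len(merged))
```

The first two records are treated as two views of one flow, while the third starts much later and is kept as a separate flow.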
In order to simplify the data for modelling, a transformation process known as biflowing or stitching was employed. This is a process intended to aggregate duplicates and marry the opposing uniflows of bi-directional connections into a single, directed biflow record (Table 1). Many approaches to this problem can be found in the literature (Barbosa, 2014; Berthier et al., 2010; Minarik et al., 2009; Nguyen et al., 2017), all of them imperfect. A straightforward approach was used that relies on simple port heuristics to decide direction. These heuristics are based on the assumption that SrcPorts are generally ephemeral (i.e., they are selected from a predefined, high range by the operating system), while DstPorts tend to have lower numbers ...
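As a rough illustration of a port-based direction heuristic of the kind described above, the sketch below stitches two opposing uniflows into one directed record by assuming that the endpoint using the lower port number is the destination. The rule as written, the dict layout and the output field names are assumptions made for illustration; they are not the authors' exact procedure or the layout of Table 1.

```python
def stitch(forward, reverse):
    """Combine two opposing uniflows into one directed biflow.

    Each uniflow is a dict with keys src_ip, dst_ip, src_port, dst_port,
    packets, bytes. The direction rule is an assumed simplification.
    """
    # Assumption: the lower port belongs to the service being contacted, so
    # the uniflow whose DstPort is lower is the initiator-to-service half.
    if forward["dst_port"] <= forward["src_port"]:
        client_side, server_side = forward, reverse
    else:
        client_side, server_side = reverse, forward
    return {
        "src_ip": client_side["src_ip"],
        "dst_ip": client_side["dst_ip"],
        "src_port": client_side["src_port"],
        "dst_port": client_side["dst_port"],
        "src_packets": client_side["packets"],
        "dst_packets": server_side["packets"],
        "src_bytes": client_side["bytes"],
        "dst_bytes": server_side["bytes"],
    }


# Hypothetical pair of uniflows for one connection.
a_to_b = {"src_ip": "A", "dst_ip": "B", "src_port": 51234, "dst_port": 443,
          "packets": 10, "bytes": 900}
b_to_a = {"src_ip": "B", "dst_ip": "A", "src_port": 443, "dst_port": 51234,
          "packets": 8, "bytes": 15000}

biflow = stitch(b_to_a, a_to_b)
print(biflow["src_ip"], biflow["dst_ip"], biflow["dst_port"])
```

The result does not depend on the order in which the two uniflows are supplied, which matters because their arrival order carries no information about who initiated the connection.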
