Data Science for Cyber-Security
eBook - ePub


Nick Heard, Niall Adams, Patrick Rubin-Delanchy, Melissa Turcotte

  1. 304 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS and Android

About This Book

Cyber-security is a matter of rapidly growing importance in industry and government. This book provides insight into a range of data science techniques for addressing these pressing concerns.

The application of statistical and broader data science techniques provides an exciting growth area in the design of cyber defences. Networks of connected devices, such as enterprise computer networks or the wider so-called Internet of Things, are all vulnerable to misuse and attack, and data science methods offer the promise of detecting such behaviours from the vast collections of cyber traffic data sources that can be obtained. In many cases, this is achieved through anomaly detection of unusual behaviour against understood statistical models of normality.

This volume presents contributed papers from an international conference of the same name held at Imperial College. Experts from the field present their latest findings and review state-of-the-art technologies.


Contents:

  • Unified Host and Network Data Set (Melissa J M Turcotte, Alexander D Kent and Curtis Hash)
  • Computational Statistics and Mathematics for Cyber-Security (David J Marchette)
  • Bayesian Activity Modelling for Network Flow Data (Henry Clausen, Mark Briers and Niall M Adams)
  • Towards Generalisable Network Threat Detection (Blake Anderson, Martin Vejman, David McGrew and Subharthi Paul)
  • Feature Trade-Off Analysis for Reconnaissance Detection (Harsha Kumara Kalutarage and Siraj Ahmed Shaikh)
  • Anomaly Detection on User-Agent Strings (Eirini Spyropoulou, Jordan Noble and Christoforos Anagnostopoulos)
  • Discovery of the Twitter Bursty Botnet (Juan Echeverria, Christoph Besel and Shi Zhou)
  • Stochastic Block Models as an Unsupervised Approach to Detect Botnet-Infected Clusters in Networked Data (Mark Patrick Roeling and Geoff Nicholls)
  • Classification of Red Team Authentication Events in an Enterprise Network (John M Conroy)
  • Weakly Supervised Learning: How to Engineer Labels for Machine Learning in Cyber-Security (Christoforos Anagnostopoulos)
  • Large-scale Analogue Measurements and Analysis for Cyber-Security (George Cybenko and Gil M Raz)
  • Fraud Detection by Stacking Cost-Sensitive Decision Trees (Alejandro Correa Bahnsen, Sergio Villegas, Djamila Aouada and Björn Ottersten)
  • Data-Driven Decision Making for Cyber-Security (Mike Fisk)


Readership: Researchers at all levels in cyber-security and data science.
Key Features:

  • A collection of papers introducing novel methodology for cyber-data analysis


Information

Year
2018
ISBN
9781786345653

Chapter 1

Unified Host and Network Data Set

Melissa J. M. Turcotte*, Alexander D. Kent* and Curtis Hash
*Los Alamos National Laboratory,
Los Alamos, NM 87545, USA

Ernst & Young, New Mexico, USA
[email protected]
The lack of data sets derived from operational enterprise networks continues to be a critical deficiency in the cyber-security research community. Unfortunately, releasing viable data sets to the larger community is challenging for a number of reasons, primarily the difficulty of balancing security and privacy concerns against the fidelity and utility of the data. This chapter discusses the importance of cyber-security research data sets and introduces a large data set derived from the operational network environment at Los Alamos National Laboratory (LANL). The hope is that this data set and the associated discussion will act as a catalyst for new research in cyber-security, and will motivate other organisations to release similar data sets to the community.

1. Introduction

The lack of diverse and useful data sets for cyber-security research continues to play a profound and limiting role within the relevant research communities and their resulting published research. Organisations are reluctant to release data for security and privacy reasons. In addition, the data sets that are released are encumbered in a variety of ways, from being stripped of so much information that they no longer provide rich research and analytical opportunities, to being so constrained by access restrictions that key details are lacking and independent validation is difficult. In many cases, organisations do not collect relevant data in sufficient volumes or with high enough fidelity to provide cyber-research value. Unfortunately, there is generally little motivation for organisations to overcome these obstacles.
In an attempt to help stimulate a larger research effort focused on operational cyber-data as well as to motivate other organisations to release useful data sets, Los Alamos National Laboratory (LANL) has released two data sets for public use (Kent, 2014, 2016). A third, entitled the Unified Host and Network Data Set, is introduced in this chapter.
The Unified Host and Network Data Set is a subset of network flow and computer events collected from the LANL enterprise network over the course of approximately 90 days.a The host (computer) event logs originated from the majority of LANL’s computers that run the Microsoft Windows operating system. The network flow data originated from many of the internal core routers within the LANL enterprise network and are derived from router netflow records. The two data sets cover many of the same computers but do not overlap completely; the network data set includes many non-Windows computers and other network devices.
Identifying values within the data sets have been de-identified (anonymised) to protect the security of LANL’s operational IT environment and the privacy of individual users. The de-identified values match across both the host and network data, allowing the two data sets to be used together for analysis and research. In some cases, the values were not de-identified, including well-known network ports, system-level usernames (not associated with people) and core enterprise hosts. In addition, a small set of hosts, users and processes were combined where they represented well-known, redundant entities. This consolidation was done for both normalisation and security purposes.
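To make the cross-data-set linkage concrete, the sketch below shows one common way of producing consistent pseudonyms: keyed (HMAC-based) hashing with a small whitelist of values left in the clear. The chapter does not describe LANL's actual de-identification mechanism, so the key, the whitelist entries and the "Comp"-style pseudonym format here are illustrative assumptions only.

```python
# Illustrative sketch only: LANL's actual de-identification procedure is not
# described in the chapter. Keyed (HMAC) pseudonymisation is one common way to
# obtain pseudonyms that are stable across data sets, while a whitelist of
# well-known values is left in the clear.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-private-key"                       # hypothetical; never released
KEEP_IN_CLEAR = {"SYSTEM", "LOCAL SERVICE", "NETWORK SERVICE"}   # assumed whitelist

def deidentify(value: str, prefix: str) -> str:
    """Map an identifier to a stable pseudonym such as 'Comp123456'."""
    if value in KEEP_IN_CLEAR:
        return value
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation keeps pseudonyms readable; a real pipeline would need to
    # check for (and resolve) collisions.
    return f"{prefix}{int(digest[:12], 16) % 1_000_000:06d}"

# The same input always yields the same pseudonym, so host-log and network-flow
# records that refer to one computer stay linkable after anonymisation.
assert deidentify("host42.example.gov", "Comp") == deidentify("host42.example.gov", "Comp")
```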
In order to transform the data into a format that is useful for researchers who are not domain experts, a significant effort was made to normalise the data while minimising the artefacts that such normalisation might introduce.

1.1. Related public data sets

A number of public, cyber-security-relevant data sets are currently referenced in the literature (Glasser and Lindauer, 2013; Ma et al., 2009) or are available online.b Some of these represent data collected from operational environments, while others capture specific, pseudo real-world events (for example, cyber-security training exercises). Many data sets are synthetic and created using models intended to represent specific phenomena of relevance; for example, the Carnegie Mellon Software Engineering Institute provides several insider threat data sets that are entirely synthetic (Glasser and Lindauer, 2013). In addition, many of the data sets commonly seen within the research community are egregiously dated. The DARPA cyber-security data sets (Cyber-Systems and Technology Group, 1998) published in the 1990s are still regularly used, even though the systems, networks and attacks they represent have almost no relevance to modern computing environments.
Another issue is that many of the available data sets have restrictive access and constraints on how they may be used. For example, the U.S. Department of Homeland Security provides the Information Marketplace for Policy and Analysis of Cyber-risk and Trust (IMPACT),c which is intended to facilitate information sharing. However, the use of any of the data hosted by IMPACT requires registration and vetting prior to access. In addition, data owners may (and often do) place limitations on how and where the data may be used.
Finally, many of the existing data sets are not adequately characterised for potential researchers. It is important that researchers have a thorough understanding of the context, normalisation processes, idiosyncrasies and other aspects of the data. Ideally, researchers should have sufficiently detailed information to avoid making false assumptions and to reproduce similar data. The need for such detailed discussion around published data sets is a primary purpose of this chapter.
The remainder of this chapter is organised as follows: a description of the Network Flow Data is given in Section 2 followed by the Windows Host Log Data in Section 3. Finally, a discussion of potential research directions is given in Section 4.

2. Network Flow Data

The network flow data set included in this release comprises records describing communication events between devices connected to the LANL enterprise network. Each flow is an aggregate summary of a (possibly) bi-directional network communication between two network devices. The data are derived from Cisco NetFlow Version 9 (Claise, 2004) flow records exported by the core routers. As such, the records lack the payload-level data upon which most commercial intrusion detection systems are based. However, research has shown that flow-based techniques have a number of advantages and are successful at detecting a variety of malicious network behaviours (Sperotto et al., 2010). Furthermore, these techniques tend to be more robust against the vagaries of attackers, because they do not search for specific signatures (for example, byte patterns) and they are encryption-agnostic. Finally, in comparison to full-packet data, the collection, analysis and archival storage of flow data at enterprise scales are straightforward and require minimal infrastructure.

2.1. Collection and transformation

As mentioned previously, the raw data consisted of NetFlow V9 records that were exported from the core network routers to a centralised collection server. While V9 records can contain many different fields, only the following are considered: StartTime, EndTime, SrcIP, DstIP, Protocol, SrcPort, DstPort, Packets and Bytes. The specifics of the hardware and flow export protocol are largely irrelevant, as these fields are common to all network flow formats of which the authors are aware.
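As a concrete illustration of the nine fields retained from the V9 records, the following sketch defines a minimal record type and parses one hypothetical comma-separated line. The field order, CSV layout and example values are assumptions for illustration; the released data set's documentation should be consulted for the actual file format.

```python
# Minimal container for the nine NetFlow fields listed above. The CSV layout
# and the example values are hypothetical and do not reproduce the released
# data set's exact file format.
from typing import NamedTuple

class Flow(NamedTuple):
    start_time: int   # flow start (epoch seconds or an offset into the collection window)
    end_time: int     # flow end
    src_ip: str       # de-identified device labels in the released data
    dst_ip: str
    protocol: int     # IANA protocol number, e.g. 6 = TCP, 17 = UDP
    src_port: int
    dst_port: int
    packets: int
    bytes: int

def parse_flow(line: str) -> Flow:
    f = line.strip().split(",")
    return Flow(int(f[0]), int(f[1]), f[2], f[3], int(f[4]),
                int(f[5]), int(f[6]), int(f[7]), int(f[8]))

record = parse_flow("118781,118810,Comp348305,Comp370444,6,51234,443,12,5340")
```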
These data can be quite challenging to model without a thorough understanding of their various idiosyncrasies. The following paragraphs discuss two of the most relevant issues with respect to modelling. For a comprehensive overview of these and other issues, readers can refer to Hofstede et al. (2014).
Firstly, note that these flow records are uni-directional (uniflows): each record describes a stream of packets sent from one network device (SrcIP) to another (DstIP). Hence, an established TCP connection (bi-directional by definition) between two network devices, A and B, results in two flow records: one from A to B and another from B to A. It follows that there is no relationship between the direction of a flow and the initiator of a bi-directional connection (i.e., it is not known whether A or B connected first). This is the case for most netflow implementations, as bi-directional flow (biflow) protocols such as that of Trammell and Boschi (2008) have yet to gain widespread adoption. Clearly, this presents a challenge for detection of attack behaviours, such as lateral movement, where directionality is of primary concern.
Secondly, significant duplication can occur due to flows encountering multiple netflow sensors in transit to their destination. Routers can be configured to track flows on ingress and egress, and, in more complex network topologies, a single flow can traverse multiple routers. More recently, the introduction of netflow-enabled switches and dedicated netflow appliances has exacerbated the issue. Ultimately, a single flow can result in many distinct flow records. To add further complexity, the flow records are not necessarily exact duplicates and their arrival times can vary considerably; these inconsistencies occur for many reasons, the particulars of which are too complex to discuss in this context.
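As a minimal sketch of how such duplicates might be collapsed, the function below groups records sharing the same 5-tuple and merges those that overlap (or nearly overlap) in time, keeping the widest time span and the largest counters. The grouping key, the time slack and the max-based merging are assumptions for illustration; the chapter does not spell out the rules used for the released data.

```python
# Hedged sketch of duplicate aggregation: records are dicts keyed by the nine
# NetFlow field names used above. Records that share the 5-tuple and overlap
# (within a small slack) in time are merged into one.
from collections import defaultdict

def aggregate_duplicates(records, slack=5):
    groups = defaultdict(list)
    for r in records:
        key = (r["SrcIP"], r["DstIP"], r["Protocol"], r["SrcPort"], r["DstPort"])
        groups[key].append(r)
    merged = []
    for recs in groups.values():
        recs.sort(key=lambda r: r["StartTime"])
        current = dict(recs[0])
        for r in recs[1:]:
            if r["StartTime"] <= current["EndTime"] + slack:   # likely the same flow
                current["EndTime"] = max(current["EndTime"], r["EndTime"])
                current["Packets"] = max(current["Packets"], r["Packets"])
                current["Bytes"] = max(current["Bytes"], r["Bytes"])
            else:
                merged.append(current)
                current = dict(r)
        merged.append(current)
    return merged
```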
In order to simplify the data for modelling, a transformation process known as biflowing or stitching was employed. This is a process intended to aggregate duplicates and marry the opposing uniflows of bi-directional connections into a single, directed biflow record (Table 1). Many approaches to this problem can be found in the literature (Barbosa, 2014; Berthier et al., 2010; Minarik et al., 2009; Nguyen et al., 2017), all of them imperfect. A straightforward approach was used that relies on simple port heuristics to decide direction. These heuristics are based on the assumption that SrcPorts are generally ephemeral (i.e., they are selected from a predefined, high range by the operating system), while DstPorts tend to have lower numbers ...

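The sketch below illustrates the port heuristic for one pair of opposing uniflows: the side whose source port looks ephemeral and whose destination port looks like a service port is taken to be the client, and the two records are merged into a single directed biflow. The 1024 cut-off and the output field names (with separate forward and reverse counters) are assumptions for illustration; they are not the exact rules or schema used for the released data.

```python
# Minimal sketch of stitching two opposing uniflow records (dicts with the nine
# NetFlow field names) into one directed biflow using a simple port heuristic.
EPHEMERAL_MIN = 1024   # assumed boundary between service ports and ephemeral ports

def stitch(a_to_b: dict, b_to_a: dict) -> dict:
    # The record with an ephemeral source port and a low destination port is
    # treated as the client-to-server direction.
    if a_to_b["SrcPort"] >= EPHEMERAL_MIN and a_to_b["DstPort"] < EPHEMERAL_MIN:
        fwd, rev = a_to_b, b_to_a
    elif b_to_a["SrcPort"] >= EPHEMERAL_MIN and b_to_a["DstPort"] < EPHEMERAL_MIN:
        fwd, rev = b_to_a, a_to_b
    else:
        fwd, rev = a_to_b, b_to_a    # ambiguous ports: fall back to arrival order
    return {
        "StartTime": min(fwd["StartTime"], rev["StartTime"]),
        "EndTime":   max(fwd["EndTime"], rev["EndTime"]),
        "SrcIP": fwd["SrcIP"], "DstIP": fwd["DstIP"],
        "Protocol": fwd["Protocol"],
        "SrcPort": fwd["SrcPort"], "DstPort": fwd["DstPort"],
        # Volumes are kept per direction, as a biflow record typically records both.
        "Packets": fwd["Packets"], "PacketsRev": rev["Packets"],
        "Bytes": fwd["Bytes"], "BytesRev": rev["Bytes"],
    }
```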