Data Science for Cyber-Security
eBook - ePub

Nick Heard, Niall Adams, Patrick Rubin-Delanchy, Melissa Turcotte

  1. 304 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS and Android

About this book

Cyber-security is a matter of rapidly growing importance in industry and government. This book provides insight into a range of data science techniques for addressing these pressing concerns.

The application of statistical and broader data science techniques provides an exciting growth area in the design of cyber defences. Networks of connected devices, such as enterprise computer networks or the wider so-called Internet of Things, are all vulnerable to misuse and attack, and data science methods offer the promise to detect such behaviours from the vast collections of cyber traffic data sources that can be obtained. In many cases, this is achieved through anomaly detection of unusual behaviour against understood statistical models of normality.

This volume presents contributed papers from an international conference of the same name held at Imperial College. Experts from the field present their latest discoveries and review state-of-the-art technologies.


  • Unified Host and Network Data Set (Melissa J M Turcotte, Alexander D Kent and Curtis Hash)
  • Computational Statistics and Mathematics for Cyber-Security (David J Marchette)
  • Bayesian Activity Modelling for Network Flow Data (Henry Clausen, Mark Briers and Niall M Adams)
  • Towards Generalisable Network Threat Detection (Blake Anderson, Martin Vejman, David McGrew and Subharthi Paul)
  • Feature Trade-Off Analysis for Reconnaissance Detection (Harsha Kumara Kalutarage and Siraj Ahmed Shaikh)
  • Anomaly Detection on User-Agent Strings (Eirini Spyropoulou, Jordan Noble and Christoforos Anagnostopoulos)
  • Discovery of the Twitter Bursty Botnet (Juan Echeverria, Christoph Besel and Shi Zhou)
  • Stochastic Block Models as an Unsupervised Approach to Detect Botnet-Infected Clusters in Networked Data (Mark Patrick Roeling and Geoff Nicholls)
  • Classification of Red Team Authentication Events in an Enterprise Network (John M Conroy)
  • Weakly Supervised Learning: How to Engineer Labels for Machine Learning in Cyber-Security (Christoforos Anagnostopoulos)
  • Large-scale Analogue Measurements and Analysis for Cyber-Security (George Cybenko and Gil M Raz)
  • Fraud Detection by Stacking Cost-Sensitive Decision Trees (Alejandro Correa Bahnsen, Sergio Villegas, Djamila Aouada and Björn Ottersten)
  • Data-Driven Decision Making for Cyber-Security (Mike Fisk)

Readership: Researchers at all levels in cyber-security and data science.
Key Features:

  • A collection of papers introducing novel methodology for cyber-data analysis




Chapter 1

Unified Host and Network Data Set

Melissa J. M. Turcotte*,‡, Alexander D. Kent* and Curtis Hash†
*Los Alamos National Laboratory,
Los Alamos, NM 87545, USA

†Ernst & Young, New Mexico, USA
‡[email protected]
The lack of data sets derived from operational enterprise networks continues to be a critical deficiency in the cyber-security research community. Unfortunately, releasing viable data sets to the larger community is challenging for a number of reasons, primarily the difficulty of balancing security and privacy concerns against the fidelity and utility of the data. This chapter discusses the importance of cyber-security research data sets and introduces a large data set derived from the operational network environment at Los Alamos National Laboratory (LANL). The hope is that this data set and associated discussion will act as a catalyst for both new research in cyber-security as well as motivation for other organisations to release similar data sets to the community.


The lack of diverse and useful data sets for cyber-security research continues to play a profound and limiting role within the relevant research communities and their resulting published research. Organisations are reluctant to release data for security and privacy reasons. In addition, the data sets that are released are encumbered in a variety of ways, from being stripped of so much information that they no longer provide rich research and analytical opportunities, to being so constrained by access restrictions that key details are lacking and independent validation is difficult. In many cases, organisations do not collect relevant data in sufficient volumes or with high enough fidelity to provide cyber-research value. Unfortunately, there is generally little motivation for organisations to overcome these obstacles.
In an attempt to help stimulate a larger research effort focused on operational cyber-data as well as to motivate other organisations to release useful data sets, Los Alamos National Laboratory (LANL) has released two data sets for public use (Kent, 2014, 2016). A third, entitled the Unified Host and Network Data Set, is introduced in this chapter.
The Unified Host and Network Data Set is a subset of network flow and computer events collected from the LANL enterprise network over the course of approximately 90 days. The host (computer) event logs originated from the majority of LANL’s computers that run the Microsoft Windows operating system. The network flow data originated from many of the internal core routers within the LANL enterprise network and are derived from router netflow records. The two data sets include many of the same computers but are not fully inclusive; the network data set includes many non-Windows computers and other network devices.
Identifying values within the data sets have been de-identified (anonymised) to protect the security of LANL’s operational IT environment and the privacy of individual users. The de-identified values match across both the host and network data allowing the two data elements to be used together for analysis and research. In some cases, the values were not de-identified, including well-known network ports, system-level usernames (not associated to people) and core enterprise hosts. In addition, a small set of hosts, users and processes were combined where they represented well-known, redundant entities. This consolidation was done for both normalisation and security purposes.
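The chapter does not say how the values were de-identified. As a rough illustration of how a mapping can stay consistent across the host and network data, the sketch below uses a keyed hash; the key, the `KEEP_AS_IS` allowlist, the prefix convention and the function name `deidentify` are all assumptions made for this example, not LANL's procedure.

```python
import hashlib
import hmac

# Hypothetical allowlist of values left readable, loosely mirroring the
# chapter's examples (well-known ports, system-level usernames).
KEEP_AS_IS = {"80", "443", "SYSTEM"}


def deidentify(value: str, key: bytes, prefix: str) -> str:
    """Map a value to a stable pseudonym: the same input always gives the same output."""
    if value in KEEP_AS_IS:
        return value
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:12]}"


key = b"example-secret-key"  # assumption: one secret shared by both pipelines

# The same computer seen in a host log and in a flow record gets one pseudonym,
# so the two data sets can still be joined after de-identification.
host_log_name = deidentify("workstation-17", key, "Comp")
netflow_name = deidentify("workstation-17", key, "Comp")
```

Because the pseudonym depends only on the key and the input, records from either source can be matched without revealing the original identifier.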
In order to transform the data into a format that is useful for researchers who are not domain experts, a significant effort was made to normalise the data while minimising the artefacts that such normalisation might introduce.

1.1. Related public data sets

A number of public data sets relevant to cyber-security are currently referenced in the literature (Glasser and Lindauer, 2013; Ma et al., 2009) or are available online. Some of these represent data collected from operational environments, while others capture specific, pseudo real-world events (for example, cyber-security training exercises). Many data sets are synthetic and created using models intended to represent specific phenomena of relevance; for example, the Carnegie Mellon Software Engineering Institute provides several insider threat data sets that are entirely synthetic (Glasser and Lindauer, 2013). In addition, many of the data sets commonly seen within the research community are egregiously dated. The DARPA cyber-security data sets (Cyber-Systems and Technology Group, 1998) published in the 1990s are still regularly used, even though the systems, networks and attacks they represent have almost no relevance to modern computing environments.
Another issue is that many of the available data sets have restrictive access and constraints on how they may be used. For example, the U.S. Department of Homeland Security provides the Information Marketplace for Policy and Analysis of Cyber-risk and Trust (IMPACT), which is intended to facilitate information sharing. However, the use of any of the data hosted by IMPACT requires registration and vetting prior to access. In addition, data owners may (and often do) place limitations on how and where the data may be used.
Finally, many of the existing data sets are not adequately characterised for potential researchers. It is important that researchers have a thorough understanding of the context, normalisation processes, idiosyncrasies and other aspects of the data. Ideally, researchers should have sufficiently detailed information to avoid making false assumptions and to reproduce similar data. The need for such detailed discussion around published data sets is a primary purpose of this chapter.
The remainder of this chapter is organised as follows: a description of the Network Flow Data is given in Section 2 followed by the Windows Host Log Data in Section 3. Finally, a discussion of potential research directions is given in Section 4.

2. Network Flow Data

The network flow data set included in this release comprises records describing communication events between devices connected to the LANL enterprise network. Each flow is an aggregate summary of a (possibly) bi-directional network communication between two network devices. The data are derived from Cisco NetFlow Version 9 (Claise, 2004) flow records exported by the core routers. As such, the records lack the payload-level data upon which most commercial intrusion detection systems are based. However, research has shown that flow-based techniques have a number of advantages and are successful at detecting a variety of malicious network behaviours (Sperotto et al., 2010). Furthermore, these techniques tend to be more robust against the vagaries of attackers, because they are not searching for specific signatures (for example, byte patterns) and they are encryption-agnostic. Finally, in comparison to full-packet data, collection, analysis and archival storage of flow data at enterprise scales is straightforward and requires minimal infrastructure.

2.1. Collection and transformation

As mentioned previously, the raw data consisted of NetFlow V9 records that were exported from the core network routers to a centralised collection server. While V9 records can contain many different fields, only the following are considered: StartTime, EndTime, SrcIP, DstIP, Protocol, SrcPort, DstPort, Packets and Bytes. The specifics of the hardware and flow export protocol are largely irrelevant, as these fields are common to all network flow formats of which the authors are aware.
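The nine fields listed above map naturally onto a small record type. The sketch below is only illustrative: the comma-separated layout, the column order, the use of integer timestamps and the names `FlowRecord` and `parse_flow` are assumptions, not the format of the released files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    start_time: int    # StartTime (assumed to be integer seconds)
    end_time: int      # EndTime
    src_ip: str        # SrcIP
    dst_ip: str        # DstIP
    protocol: int      # Protocol (IP protocol number, e.g. 6 for TCP)
    src_port: int      # SrcPort
    dst_port: int      # DstPort
    packet_count: int  # Packets
    byte_count: int    # Bytes


def parse_flow(line: str) -> FlowRecord:
    """Parse one comma-separated line in the assumed column order."""
    start, end, src_ip, dst_ip, proto, sport, dport, pkts, nbytes = line.strip().split(",")
    return FlowRecord(int(start), int(end), src_ip, dst_ip,
                      int(proto), int(sport), int(dport), int(pkts), int(nbytes))


# Made-up example values, not taken from the data set.
record = parse_flow("1000,1012,10.0.0.5,10.0.0.9,6,51514,443,14,9200")
duration = record.end_time - record.start_time
```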
This data can be quite challenging to model without a thorough understanding of its various idiosyncrasies. The following paragraphs discuss two of the most relevant issues with respect to modelling. For a comprehensive overview of these issues, among others, readers can refer to Hofstede et al. (2014).
Firstly, note that these flow records are uni-directional (uniflows): each record describes a stream of packets sent from one network device (SrcIP) to another (DstIP). Hence, an established TCP connection — bi-directional by definition — between two network devices, A and B, results in two flow records: one from A to B and another from B to A. It follows that there is no relationship between the direction of a flow and the initiator of a bi-directional connection (i.e., it is not known whether A or B connected first). This is the case for most netflow implementations as bi-directional flow (biflow) protocols such as Trammell and Boschi (2008) have yet to gain widespread adoption. Clearly, this presents a challenge for detection of attack behaviours, such as lateral movement, where directionality is of primary concern.
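To make the uniflow point concrete, the following sketch shows two records produced by one TCP connection and a direction-independent key that groups them. The `Uniflow` type, the `connection_key` function and the example addresses are hypothetical; note that the key pairs the records but says nothing about which side initiated the connection.

```python
from collections import defaultdict
from typing import NamedTuple


class Uniflow(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def connection_key(flow: Uniflow) -> tuple:
    """Direction-independent key: both uniflows of a connection map to it."""
    a = (flow.src_ip, flow.src_port)
    b = (flow.dst_ip, flow.dst_port)
    return (flow.protocol, min(a, b), max(a, b))


# One established TCP connection between A and B yields two uniflow records.
a_to_b = Uniflow("10.0.0.5", "10.0.0.9", 6, 51514, 443)
b_to_a = Uniflow("10.0.0.9", "10.0.0.5", 6, 443, 51514)

groups = defaultdict(list)
for flow in (a_to_b, b_to_a):
    groups[connection_key(flow)].append(flow)
```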
Secondly, significant duplication can occur due to flows encountering multiple netflow sensors in transit to their destination. Routers can be configured to track flows on ingress and egress, and, in more complex network topologies, a single flow can traverse multiple routers. More recently, the introduction of netflow-enabled switches and dedicated netflow appliances has exacerbated the issue. Ultimately, a single flow can result in many distinct flow records. To add further complexity, the flow records are not necessarily exact duplicates and their arrival times can vary considerably; these inconsistencies occur for many reasons, the particulars of which are too complex to discuss in this context.
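One simple way to picture the duplication problem is to merge records that share a 5-tuple and start within a few seconds of each other. The sketch below is an assumption-laden illustration, not the chapter's method: the `Obs` type, the `merge_duplicates` function, the 5-second tolerance and the choice to keep the maximum packet and byte counts (rather than sum them) are all hypothetical.

```python
from typing import NamedTuple


class Obs(NamedTuple):
    five_tuple: tuple  # (src_ip, dst_ip, protocol, src_port, dst_port)
    start_time: int
    end_time: int
    packets: int
    byte_count: int


def merge_duplicates(observations, tolerance=5):
    """Collapse records of the same 5-tuple whose start times are close together."""
    merged = []
    last_by_key = {}  # five_tuple -> index of its most recent entry in `merged`
    for obs in sorted(observations, key=lambda o: (o.five_tuple, o.start_time)):
        idx = last_by_key.get(obs.five_tuple)
        if idx is not None and obs.start_time - merged[idx].start_time <= tolerance:
            prev = merged[idx]
            # Treat the records as sightings of one flow: widen the time span
            # and keep the largest counts instead of adding them up.
            merged[idx] = Obs(prev.five_tuple, prev.start_time,
                              max(prev.end_time, obs.end_time),
                              max(prev.packets, obs.packets),
                              max(prev.byte_count, obs.byte_count))
        else:
            last_by_key[obs.five_tuple] = len(merged)
            merged.append(obs)
    return merged


key = ("10.0.0.5", "10.0.0.9", 6, 51514, 443)
seen = [Obs(key, 100, 110, 14, 9200),  # first sensor
        Obs(key, 102, 111, 14, 9180),  # second sensor, slightly different
        Obs(key, 500, 505, 3, 180)]    # a later, separate flow
deduplicated = merge_duplicates(seen)
```

A fixed tolerance is a blunt instrument: too small and duplicates survive, too large and distinct flows that reuse the same ports are merged.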
In order to simplify the data for modelling, a transformation process known as biflowing or stitching was employed. This is a process intended to aggregate duplicates and marry the opposing uniflows of bi-directional connections into a single, directed biflow record (Table 1). Many approaches to this problem can be found in the literature (Barbosa, 2014; Berthier et al., 2010; Minarik et al., 2009; Nguyen et al., 2017), all of them imperfect. A straightforward approach was used that relies on simple port heuristics to decide direction. These heuristics are based on the assumption that SrcPorts are generally ephemeral (i.e., they are selected from a predefined, high range by the operating system), while DstPorts tend to have lower numbers ...
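A port heuristic of the general kind described can be sketched as follows. This is a minimal illustration under a single assumption, namely that the endpoint with the lower port number is the service side; the function name `orient` and the tie-breaking rule are hypothetical, and the heuristics actually used for the data set are not reproduced here.

```python
def orient(ip_a, port_a, ip_b, port_b):
    """Return (client, server) endpoints, guessing that the server has the lower port."""
    if port_a < port_b:
        # A holds the lower port, so treat A as the server and B as the client.
        return (ip_b, port_b), (ip_a, port_a)
    # B holds the lower port, or the ports tie and the heuristic cannot decide;
    # in both cases keep the observed order.
    return (ip_a, port_a), (ip_b, port_b)


# Both uniflows of one connection orient to the same directed pair.
forward = orient("10.0.0.5", 51514, "10.0.0.9", 443)
reverse = orient("10.0.0.9", 443, "10.0.0.5", 51514)
```

Such a rule fails whenever both ports are high or both are low, which is one reason every approach to this problem is imperfect.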

Table of contents

  1. Cover
  2. Halftitle
  3. Title
  4. Copyright
  5. Preface
  6. Contents
  7. 1. Unified Host and Network Data Set
  8. 2. Computational Statistics and Mathematics for Cyber-Security
  9. 3. Bayesian Activity Modelling for Network Flow Data
  10. 4. Towards Generalisable Network Threat Detection
  11. 5. Feature Trade-Off Analysis for Reconnaissance Detection
  12. 6. Anomaly Detection on User-Agent Strings
  13. 7. Discovery of the Twitter Bursty Botnet
  14. 8. Stochastic Block Models as an Unsupervised Approach to Detect Botnet-Infected Clusters in Networked Data
  15. 9. Classification of Red Team Authentication Events
  16. 10. Weakly Supervised Learning: How to Engineer Labels for Machine Learning in Cyber-Security
  17. 11. Large-scale Analogue Measurements and Analysis for Cyber-Security
  18. 12. Fraud Detection by Stacking Cost-Sensitive Decision Trees
  19. 13. Data-Driven Decision Making for Cyber-Security
  20. Index