Part 1. Graph visualization basics
In part one of this book, we’ll take a high-level view of graphs. First, I’ll introduce you to what graphs are and how they can be used across a variety of domains, with some detailed case studies. Then, we’ll dive a little deeper into graph models of data, how they differ from standard relational models, and how you can create graph data models from your own data. I’ll also introduce the two tools that we’ll use throughout the book: Gephi and KeyLines. Later chapters use both to illustrate how you can create graph visualizations of your own: with Gephi for your own use, or with KeyLines as part of a visualization application.
Chapter 1. Getting to know graph visualization
This chapter covers
- Getting to know graphs as data models
- Why graphs are a useful way to think about data
- When to visualize graphs, and the node-link drawing concept
- Other visualizations of graph data and when they’re useful
In December 2001, the Enron Corporation filed for what was at the time the largest ever corporate bankruptcy. Its stock had fallen from a high of $90 per share the previous year to $0.61, decimating its employees’ pensions and shareholders’ investments in it. The FBI’s investigation into this collapse became the largest white-collar criminal investigation in history as they seized over 3,000 boxes of documents and 4 terabytes of data. Among the information seized were about 600,000 emails between key executives at the organization. Although the FBI took pains to read every email individually, the investigators recognized that they were unlikely to find a smoking gun—people committing complex financial fraud seldom disclose their actions in written form. And in 2001, emails were only starting to become the primary means of internal communications; lots of information was still exchanged via phone calls.
In addition to looking at the text of individual emails, the FBI also wanted to uncover patterns in the communications, perhaps in an attempt to better understand who the decision makers were within Enron or who had access to a lot of the information internal to the company. To do this, they modeled the Enron emails as a graph.
A graph is a model of data that consists of nodes, which are discrete data elements (such as people), and edges, which are relationships between nodes. The graph model brings to the forefront relationships that may be hidden in tabular views of the same data and illustrates what is most important. By making those relationships between the data elements a core part of the data structure, you can identify patterns in the data that wouldn’t otherwise be apparent. But building graph data structures is only half the solution to pattern recognition. This book will teach you how to visualize graphs using interactive node-link visualization diagrams, and by the end, you’ll be able to create your own dynamic, interactive visualizations using a variety of tools available today.
In this chapter, I’ll go a little deeper into the concept of a graph and graph history and uses, and talk about various techniques used to visualize graph data. Subsequent chapters build on this framework by introducing concrete examples of graph visualizations and the data they’re based on and discuss various techniques for creating useful visualizations.
1.1. Getting to know graphs
Graphs are everywhere. As long as you’re interested in how items can be related to each other, there’s a graph somewhere in your data. In this section, I’ll walk you through what a graph is and what can be gained from visualizing graphs.
1.1.1. What is a graph?
As described previously, a graph—also called a network—is a set of interconnected data elements that’s expressed as a series of nodes and edges.
In the common definition of a graph, edges have exactly two endpoints, no more. In some cases, those two endpoints can be the same node if a node links to itself. An edge (also known as a link) can take one of two forms:
- Directed— The relationship has a direction. Stella owns the car, but it doesn’t make sense to say the car owns Stella.
- Undirected— The two items are linked without the concept of direction; the relationship inherently goes both ways. If Stella is linked to Roger because they committed a crime together, it means the same thing to say Stella was arrested with Roger as it does to say Roger was arrested with Stella.
In figure 1.1, you see an example of a directed link with properties.
Figure 1.1. A property graph of a single email between Enron executives. The two nodes are the sender and recipient of the email, and the directed edge is the email.
Both nodes and edges can have properties, which are key-value pairs—lists of properties and values, describing either the data element itself or the relationship. Figure 1.2 is a simple property graph showing that Stella bought a 2008 Volkswagen Jetta in September 2007 and sold it in October 2013. Modeling it as a graph highlights that Stella had a relationship with this car, albeit temporarily.
Figure 1.2. A simple property graph with two nodes and an edge. Stella (the first node) bought a 2008 Volkswagen Jetta (the second node) in September 2007 and sold it in October 2013. Modeling it as a graph highlights that Stella had a relationship with this car (the edge).
An email is a relationship, too, between the sender and the recipient. The properties of the nodes are things like email address, name, and title, and the properties of the relationship are the date/time it was sent, its subject line, and the text of the email.
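To make the idea of properties concrete, here is one way the graph from figure 1.2 might be written down as data. This is only a sketch: the property keys (`make`, `model`, `year`, `bought`, `sold`), the edge type `owned`, and the `describe` helper are assumptions made for illustration, not a format required by any particular tool.

```python
# Each node is a dictionary of key-value properties, stored under an id.
nodes = {
    "stella": {"type": "person", "name": "Stella"},
    "jetta": {"type": "car", "make": "Volkswagen", "model": "Jetta", "year": 2008},
}

# Each edge names its two endpoints and carries its own properties.
edges = [
    {
        "source": "stella",
        "target": "jetta",
        "type": "owned",
        "bought": "2007-09",  # September 2007
        "sold": "2013-10",    # October 2013
    },
]


def describe(edge):
    """Build a one-line description from node and edge properties."""
    src = nodes[edge["source"]]
    tgt = nodes[edge["target"]]
    return (
        f'{src["name"]} {edge["type"]} a {tgt["year"]} {tgt["make"]} '
        f'{tgt["model"]} from {edge["bought"]} to {edge["sold"]}'
    )


print(describe(edges[0]))
# Stella owned a 2008 Volkswagen Jetta from 2007-09 to 2013-10
```

An email could be modeled the same way, with the sender and recipient as nodes and the date, subject line, and text as properties of the edge.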
To prove conspiracy, the FBI was interested in all the emails sent among the Enron executives, not just a single one, so let’s add some more nodes to represent a larger number of emails sent during a specified period of time, as shown in figure 1.3.
Figure 1.3. A graph of some of the Enron executives’ email communications. You can easily see that Timothy Belden is a hub of communication in this segment of Enron, sending email to and receiving email from many other executives.
Figure 1.3 is a directed graph because it matters whether Kevin Presto sent an email to Timothy Belden or received one; there’s a big difference between sending and receiving information when you’re investigating who knew what when. The arrowheads on the edges show that directionality: Kevin Presto sent an email to Timothy Belden, but Timothy Belden didn’t reply, indicating they may not have been close associates or they may have spoken offline. As we add more data to the graph, its value becomes clear: patterns emerge. In this example, we can easily see that Timothy Belden is a hub of communication in this segment of Enron, sending email to and receiving email from many other executives.
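What the eye picks out as a hub can also be found by counting edges per node. The sketch below uses a small made-up edge list: only the email from Kevin Presto to Timothy Belden comes from the discussion above, and the names Executive A, B, and C are hypothetical placeholders, not people from the Enron data.

```python
from collections import Counter

# Each directed edge is (sender, recipient).
# Only the first edge is taken from the text; the rest are invented.
emails = [
    ("Kevin Presto", "Timothy Belden"),
    ("Executive A", "Timothy Belden"),
    ("Timothy Belden", "Executive A"),
    ("Timothy Belden", "Executive B"),
    ("Executive B", "Executive C"),
]

sent = Counter(sender for sender, _ in emails)           # outgoing edges
received = Counter(recipient for _, recipient in emails) # incoming edges
total = sent + received                                  # all edges per node

# The node touching the most edges is a candidate hub.
hub, degree = total.most_common(1)[0]
print(hub, degree)  # Timothy Belden 4

# Direction matters: did Belden ever reply to Presto?
replied = ("Timothy Belden", "Kevin Presto") in emails
print(replied)  # False
```

Counting sent and received emails separately keeps the directionality that makes this graph useful to an investigator.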
1.1.2. A bit of theory
Graph theory began early in the eighteenth century with the Seven Bridges of Königsberg problem. In Königsberg, Prussia (now Kaliningrad, Russia), it was a common parlor game to try to find a route that crossed each of the seven bridges over the Pregel River exactly once. (Go ahead and give it a shot using the map of the city, shown in figure 1.4, and see if you can prove three centuries of mathematicians wrong.)
Figure 1.4. The Seven Bridges of Königsberg problem. Using this map of the bridges of Königsberg, Prussia, try to draw a route that crosses every bridge exactly once.
Leonhard Euler proved this problem unsolvable by abstracting the regions of the city into individual points and the bridges into paths between those points, as you can see in figure 1.5.
Figure 1.5. Seven bridges and four land areas of Königsberg as a graph. In this graph, nodes denote the land masses bordering the Pregel River and the two islands in its middle. Edges represent the bridges connecting the two islands and two shorelines.
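The graph in figure 1.5 is small enough to check by counting. The sketch below relies on the standard result that a connected graph has a walk using every edge exactly once only if it has zero or two nodes with an odd number of edges. The node labels are my own: `island_1` is the island with five bridges and `island_2` the one with three.

```python
from collections import Counter

# The seven bridges as undirected edges between four land areas.
# Two land areas can be joined by more than one bridge.
bridges = [
    ("north_bank", "island_1"), ("north_bank", "island_1"),
    ("south_bank", "island_1"), ("south_bank", "island_1"),
    ("north_bank", "island_2"),
    ("south_bank", "island_2"),
    ("island_1", "island_2"),
]

# Count how many bridge ends touch each land area (its degree).
degree = Counter()
for a, b in bridges:
    degree[a] += 1
    degree[b] += 1

odd_nodes = sorted(node for node, d in degree.items() if d % 2 == 1)

# This graph is connected, so only the parity condition needs checking.
walk_possible = len(odd_nodes) in (0, 2)

print(dict(degree))
print(odd_nodes)
print(walk_possible)  # False
```

All four land areas have an odd number of bridges, so no route can cross every bridge exactly once.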
E...