Data is being collected everywhere. Every transaction, every web page visit, every paymentâand much moreâis filling databases, relational and otherwise, with raw data. Computing power and storage have grown to be cost effective, a trend where todayâs smart phones are more powerful than supercomputers of yesteryear. Databases are no longer merely platforms for storing data; they are powerful engines for transforming data into useful information about customers and products and business practices.
The focus on data mining has historically been on complex algorithms developed by statisticians and machine-learning specialists. Once upon a time, data mining required downloading source code from a research lab or university, compiling the code to get it to run, and sometimes even debugging it. By the time the data and software were ready, the business problem had lost urgency.
This book takes a different approach because it starts with the data. The billions of transactions that occur every dayâcredit cards swipes, web page visits, telephone calls, and so onâare now often stored in relational databases. Relational database engines count among the most powerful and sophisticated software products in the business world, so they are well suited for the task of extracting useful information. And the lingua franca of relational databases is SQL.
The focus of this book is more on data and what to do with data and less on theory. Instead of trying to squeeze every last iota of information from a small sampleâthe goal of much statistical analysisâthe goal is instead to find something useful in the gigabytes and terabytes of data stored by the business. Instead of asking programmers to learn data analysis, the goal is to give data analystsâand othersâa solid foundation for using SQL to learn from data.
This book strives to assist anyone facing the problem of analyzing data stored in large databases, by describing the power of data analysis using SQL and Excel. SQL, which stands for Structured Query Language, is a language for extracting information from data. Excel is a popular and useful spreadsheet for analyzing smaller amounts of data and presenting results.
The various chapters of this book build skill in and enthusiasm for SQL queries and the graphical presentation of results. Throughout the book, the SQL queries are used for more and more sophisticated types of analyses, starting with basic summaries of tables, and moving to data exploration. The chapters continue with methods for understanding time-to-event problems, such as when customers stop, and market basket analysis for understanding what customers are purchasing. Data analysis is often about building models, andâperhaps surprisingly to most readersâsome models can be built directly in SQL, as described in Chapter 11, âData Mining in SQL.â An important part of any analysis, though, is constructing the data in a format suitable for modelingâcustomer signatures.
The final chapter takes a step back from analysis to discuss performance. This chapter is an overview of a topic, concentrating on good performance practices that work across different databases.
This chapter introduces SQL for data analysis and data mining. Admittedly, this introduction is heavily biased because the purpose is for querying databases rather than building and managing them. SQL is presented from three different perspectives, some of which may resonate more strongly with different groups of readers. The first perspective is the structure of the data, with a particular emphasis on entity-relationship diagrams. The second is the processing of data using dataflows, which happen to be what is âunder the hoodâ of most relational database engines. The third, and strongest thread through subsequent chapters, is the syntax of SQL itself. Although data is well described by entities and relationships, and processing by dataflows, the ultimate goal is to express the transformations in SQL and present the results often through Excel.
Databases, SQL, and Big Data
Collecting and analyzing data is a major activity, so many tools are available for this purpose. Some of these focus on âbig dataâ (whatever that might mean). Some focus on consistently storing the data quickly. Some on deep analysis. Some have pretty visual interfaces; others are programming languages.
SQL and relational databases are a powerful combination that is useful in any arsenal of tools for analysis, particularly ad hoc analyses:
- A mature and standardized language for accessing data
- Multiple vendors, including open source
- Scalability over a very broad range of hardware
- A non-programming interface for data manipulations
Before continuing with SQL, it is worth looking at SQL in the context of other tools.
What Is Big Data?
Big data is one of those concepts whose definition changes over time. In the 1800s, when statistics was first being invented, researchers worked with dozens or hundreds of rows of data. That might not seem like a lot, but if you have to add everything up with a pencil and paper, and do long division by hand or using a slide rule, then it certainly seems like a lot of data.
The concept of big data has always been relative, at least since data processing was invented. The difference is that now data is measured in gigabytes and terabytesâenough bytes to fit the text in all the books in the Library of Congressâand we can readily carry...