PART I
Data Science, Analytics, and Business Analytics
CHAPTER 1
Data Science and Its Scope
Chapter Highlights
• Introduction
• Objective and Overview of Chapters
• What Is Data Science?
• Another Look at Data Science
• Data Science and Statistics
• Role of Statistics in Data Science
• Data Science: A Brief History
• Difference between Data Science and Data Analytics
• Knowledge and Skills for Data Science Professionals
• Some Technologies used in Data Science
• Career Path for Data Science Professional and Data Scientist
• Future Outlook
• Summary
Introduction
Data science is about extracting knowledge and insights from data. The tools and techniques of data science are used to drive business and process decisions, so it can be seen as a major approach to data-driven decision making. Data science is a multidisciplinary field: it begins with the ability to understand, process, and visualize data, and then applies statistics, modeling, mathematics, and technology to solve analytically complex problems using structured and unstructured data. At the core of data science is data. The field is about using this data in creative and effective ways to help businesses make data-driven decisions.
Knowledge of statistics is as important in data science as the application of computer science. Companies now collect massive amounts of data, ranging from exabytes to zettabytes, in both structured and unstructured form. Advances in technology, computing capability, and storage have made it possible to store, process, and analyze data at this scale.
Data science is applied to extract information from both structured and unstructured data.1,2
Unstructured data is not organized in a predefined manner. It is usually text heavy, but it may also contain dates, categories, numbers, and other forms of measurement. Compared to structured data, unstructured data contains irregularities and ambiguities that make it difficult to apply the traditional tools of statistics and data analysis. Structured data, by contrast, is usually stored in clearly defined fields in databases, and software applications and programs are designed to process it. In recent years, a number of new tools and software programs have emerged that are capable of analyzing big and unstructured data. One of the earliest applications of unstructured data is the analysis of text data using text mining and other methods.
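The contrast between the two kinds of data can be illustrated with a short Python sketch. It is only an illustration: the order records, the customer comment, and all variable names are hypothetical, and word counting stands in for the much richer methods used in real text mining.

```python
import re
from collections import Counter

# Structured data: every record has the same clearly defined fields.
# (Hypothetical records, for illustration only.)
orders = [
    {"order_id": 1, "region": "East", "amount": 120.0},
    {"order_id": 2, "region": "West", "amount": 80.5},
    {"order_id": 3, "region": "East", "amount": 42.0},
]

# Because the fields are known in advance, a question maps directly to a field.
east_total = sum(o["amount"] for o in orders if o["region"] == "East")

# Unstructured data: free text with no predefined fields.
# (Hypothetical customer comment, for illustration only.)
comment = (
    "Ordered on 3 March, arrived late. The package was damaged and "
    "support was slow. Late delivery again, very disappointed."
)

# A first text-mining step: lowercase the text, split it into word tokens,
# and count how often each word occurs.
tokens = re.findall(r"[a-z]+", comment.lower())
word_counts = Counter(tokens)

print(east_total)           # total sales amount for the East region
print(word_counts["late"])  # how often the word "late" appears
```

The structured question is answered by reading a known field, while the text has to be broken into tokens before anything can be counted, which is one reason unstructured data calls for different tools.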
Unstructured data is becoming more prevalent. In 1998, Merrill Lynch said, “unstructured data comprises the vast majority of data found in an organization, some estimates run as high as 80%.”1 Here are some other predictions. As of 2012, IDC (International Data Corporation)3 and Dell EMC4 projected that data would grow to 40 zettabytes by 2020, a 50-fold growth from the beginning of 2010.4 More recently, IDC and Seagate predicted that the global datasphere will grow to 163 zettabytes by 2025,5 and that the majority of that will be unstructured. Computerworld magazine7 states that unstructured information might account for more than 70 to 80 percent of all data in organizations (https://en.wikipedia.org/wiki/Unstructured_data).8
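The arithmetic behind the 2020 projection can be checked with a few lines of Python. This is a rough sketch: it uses only the two figures quoted above (40 zettabytes and 50-fold growth), and it assumes, for illustration, that growth compounds evenly over a 10-year span, which the cited projection does not claim.

```python
# Figures quoted in the text.
projected_2020_zb = 40.0  # zettabytes projected for 2020
growth_factor = 50.0      # "50-fold growth from the beginning of 2010"

# Implied data volume at the beginning of 2010.
baseline_2010_zb = projected_2020_zb / growth_factor

# Assumption (not from the source): growth compounds evenly over 10 years.
years = 10
implied_annual_growth = growth_factor ** (1 / years) - 1

print(f"Implied 2010 volume: {baseline_2010_zb:.1f} ZB")
print(f"Implied average annual growth: {implied_annual_growth:.1%}")
```

Under these assumptions, the projection implies a starting volume of about 0.8 zettabytes and an average growth of roughly 48 percent per year.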
Objective and Overview of Chapters
The objective of this book is to provide an introductory overview of data science, explain what data science is, and show why it is such an important field. We will also explore and outline the role of data scientists and other data science professionals and what they do.
The initial chapters of the book introduce data science and closely related areas. The terms data science, data analytics, business analytics, and business intelligence are often used interchangeably, even by professionals in these fields. Therefore, Chapter 1, which provides an overview of data science, is followed by two chapters that explain the relationship between data science, analytics, and business intelligence. Analytics itself is a wide area, and companies use different forms of analytics, including descriptive, predictive, and prescriptive analytics, to drive major business decisions. Chapters 2 and 3 outline the differences and similarities between data science, analytics, and business intelligence. Chapter 2 also outlines the tools of descriptive, predictive, and prescriptive analytics along with the most recent and emerging technologies of machine learning and artificial intelligence. Since the field of data science is about data, a chapter is devoted to data and data types. Chapter 4 provides definitions of data, the different forms and types of data, and some tools and techniques for working with data. One of the major objectives of data science is to make sense of the massive amounts of data companies collect. One way of making sense of data is to apply the data visualization or graphical techniques used in data analysis, so a chapter is devoted to data visualization. Understanding other tools and techniques for working with data is also important.
Data science is a vast area. Besides visualization techniques and statistical analysis, it uses statistical programming languages such as R and requires knowledge of SQL and of database management systems such as MySQL.
One major application of data science is in the area of machine learning (ML) and artificial intelligence (AI). The book provides a detailed overview of data science by defining and outlining its tools and techniques. As mentioned earlier, the book also explains the differences and similarities between data science and data analytics. The other concepts related to data science, including analytics, business analytics, and business intelligence (BI), are discussed in detail. The field of data science is about processing, cleaning, and analyzing data. These concepts and topics are important for understanding the field and are discussed in this book. Data science is an emerging field in data analysis and decision making.
What Is Data Science?
Data science may be thought of as a data-driven decision-making approach that draws on several different areas, methods, algorithms, models, and disciplines with the purpose of extracting insights and knowledge from structured and unstructured data. These insights are helpful in applying algorithms and models to make decisions. The models in data science are used in predictive analytics to predict future outcomes.
Data science, as a field, has a much broader scope than analytics, business analytics, or business intelligence. It brings together several disciplines and areas, including statistics, data analysis,9 statistical modeling, data mining,10,11,12,13,14 big data,15 machine learning,16 artificial intelligence (AI), management science, optimization techniques, and related methods, in order to “understand and analyze actual phenomena” from data.17
Data science employs techniques and methods from many other fields, such as mathematics, statistics, computer science, and information science. Besides the methods and theories drawn from these fields, data science also uses data visualization techniques implemented in specially designed software, such as Tableau and other big data software. Relational databases (queried with languages such as SQL), the R statistical software, and the Python programming language are all used in different applications to analyze data, extract information, and draw conclusions. These are the tools of data science. Together, these tools, techniques, and programming languages provide a unifying approach to exploring, analyzing, drawing conclusions, and making decisions from the massive amounts of data companies collect.
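As a small illustration of how these tools work together, the following Python sketch stores a few rows in a relational table, summarizes them with SQL, and then continues the analysis in Python. It is a minimal sketch, not a recommended workflow: the table name, columns, and values are hypothetical, and SQLite (through Python's standard sqlite3 module) stands in for whatever database system an organization actually uses.

```python
import sqlite3
import statistics

# In-memory relational database; table and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("East", 120.0), ("West", 80.0), ("East", 40.0), ("West", 100.0)],
)

# SQL extracts and aggregates the data inside the database.
totals = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
)

# Python takes over for further analysis of the retrieved values.
amounts = [row[0] for row in conn.execute("SELECT amount FROM sales")]
mean_amount = statistics.mean(amounts)
conn.close()

print(totals)       # total sales per region
print(mean_amount)  # average sale amount across all rows
```

The division of labor shown here is typical: the database handles storage and aggregation, and a general-purpose language handles statistics, modeling, and visualization.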
Data science employs the tools of information technology and management science (mathematical modeling and simulation), along with data mining and fact-based data, to measure past performance and to guide an organization in planning and predicting future outcomes, in support of effective decision making.
Turing Award18 winner Jim Gray viewed data science as a “fourth paradigm” of science (empirical, theoretical, computational, and now data-driven) and asserted that “everything about science is changing because of the impact of information technology” and the data deluge. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.
Another Look at Data Science
Data science can be viewed as a multidisciplinary field focused on finding actionable insights from large sets of raw, structured...