Big Data Management
eBook - ePub

Big Data Management

Data Governance Principles for Big Data Analytics

Peter Ghavami

  1. 174 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android
eBook - ePub

Big Data Management

Data Governance Principles for Big Data Analytics

Peter Ghavami

Book details
Book preview
Table of contents
Citations

About This Book

Data analytics is core to business and decision making. The rapid increase in data volume, velocity and variety offers both opportunities and challenges. While open source solutions to store big data, like Hadoop, offer platforms for exploring value and insight from big data, they were not originally developed with data security and governance in mind. Big Data Management discusses numerous policies, strategies and recipes for managing big data. It addresses data security, privacy, controls and life cycle management offering modern principles and open source architectures for successful governance of big data.

The author has collected best practices from the world's leading organizations that have successfully implemented big data platforms. The topics discussed cover the entire data management life cycle, data quality, data stewardship, regulatory considerations, data council, architectural and operational models are presented for successful management of big data. The book is a must-read for data scientists, data engineers and corporate leaders who are implementing big data platforms in their organizations.

Frequently asked questions

How do I cancel my subscription?
Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.
Can/how do I download books?
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
What is the difference between the pricing plans?
Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.
What is Perlego?
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Do you support text-to-speech?
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Is Big Data Management an online PDF/ePUB?
Yes, you can access Big Data Management by Peter Ghavami in PDF and/or ePUB format, as well as other popular books in Business & Government & Business. We have over one million books available in our catalogue for you to explore.

Information

Publisher
De Gruyter
Year
2020
ISBN
9783110664324
Edition
1

Part 1: Big Data Overview

Chapter 1 Introduction to Big Data

Data is the new gold. And analytics is the machinery that mines, molds, and mints it. Big data analytics is a set of computer-enabled analytics methods, processes, and discipline of extracting and transforming raw data into meaningful insight, new discovery, and knowledge that helps make more effective decision making. Another definition describes big data analytics as the discipline of extracting and analyzing data to deliver new insight about the past performance, current operations, and prediction of future events.
Before there was big data analytics, the study of large data sets was called data mining. But big data analytics has come a long way in a decade and is now gaining popularity thanks to the eruption of five new technologies: big data analytics, cloud computing, mobility, social networking, and smaller sensors. Each of these technologies is significant in its unique way to how business decisions and performance can be improved and how vast amounts of data can be generated.
Big data is known by its three key attributes known as the three Vs: volume, velocity, and variety. The world’s storage volume is increasing at a rapid pace, estimated to double every year. The velocity at which this data is generated is rising, fueled by the advent of mobile devices and social networking. In medicine and healthcare, the cost and size of sensors has shrunk, making continuous patient monitoring and data acquisition from a multitude of human physiological systems an accepted practice.
With the advent of smaller, inexpensive sensors and the volume of data collected from people, internet, and machines we’re challenged with making increasingly analytical decisions quickly from the large sets of data that are being collected. This trend is only increasing giving rise to what’s known in the industry as the “big data problem”: the rate of data accumulation is rising faster than people’s cognitive capacity to analyze increasingly large data sets to make decisions. The big data problem offers an opportunity for improved predictive analytics and fact-based decisions.
Big data is now regarded as one of the leading strategic business imperatives. Research points to an increasing number of executives who believe that without big data they will face extinction.1 Every day more than 4 petabytes of data is created on Facebook, 500 million tweets are sent, and more than 5 billion searches are made online. Several companies now offer big data storage solutions for on-premise (on-prem) and cloud-based solutions. Open source platforms like Hadoop data vaults and warehousing products including Cloudera, MapR, and HortonWorks have been implemented by most major corporations. The dominant cloud platforms such as Amazon AWS and Microsoft Azure offer some form of Hadoop, elastic storage, and other brands of parallel data storage solutions. Companies like Data Bricks, Snowflake, and Denodo are growing in the midst of the appetite to store and manage ever-expanding data sets.
The NoSQL (Not only SQL) and non-relational database movement has evolved beyond Hadoop to include Snowflake, Data Bricks, Elastic Search, and Spark. These unconventional solutions offer new data management tools that allow storage and management of large structured and non-structured data sets. But the biggest challenges of big data governance remain mostly uncharted territory at this time.
The variety of data is also increasing. For example, medical data was confined to paper for too long. As governments around the world, such as the United States, pushed medical institutions to transform their practice into electronic and digital format, patient data became digital and took on diverse forms. It’s now common to think of electronic medical records (EMR) as including diverse forms of data such as audio recordings; MRI, ultrasound, computed tomography (CT), and other diagnostic images; videos captured during surgery or directly from patients; color images of burns and wounds; digital images of dental x-rays; waveforms of brain scans; electrocardiograms (EKG); and the list goes on.
How will we manage and govern this vast and complex sea of data? What will be the costs of poor or no data governance? The goal of this book is to show the components and tips on establishing both effective and lean data governance. This book will contemplate the costs of doing nothing but presents a framework to bring data governance to big data.
New types of data include structured and unstructured text. It will include server logs and other machine generated data. It will include data from sensors, web sites, machines, mobile and wearable devices. It will include streaming data and customer sentiment data about you. It includes social media data including blogs, LinkedIn, Twitter, Instagram, Facebook, and local RSS feeds about your organization, your people, and products. All these varieties of data types and many more can be harnessed to provide a more complete picture of what is happening in delivering value to your customers.
The traditional data warehouse strategies based on relational databases suffer from a latency of up to 24 hours. These data warehouses can’t scale quickly with large data growth; and because they impose relational and data normalization constraints, their use is limited. In addition, they provide retrospective insight and not real-time or predictive analytics. Big data analytics will become a more real-time and “in-the-moment” decision support tool than the traditional business intelligence process of generating batch-oriented reports.
Semantics are critical to data analytics. As much as 60–80% of all data is unstructured data in the form of narrative text, emails, or audio recordings. Correctly extracting the pertinent terms from such data is a challenge. Tools such as natural language processing (NLP) methods combined with subject-specific libraries and ontologies are used to extract useful data from the vast amount of data stored in a Hadoop Data Lake. Hadoop Data Lake is a common term that describes the vast storage of data in the Hadoop file system. Data lake is a term increasingly used to refer to the new generation of big data warehouses where all data is going to be stored using open source and Hadoop technology. However, understanding the differences among sentence structure, context, and relationships between business terms is critical to detecting the semantics and meaning of data.
Given the rising volume of data and the demand for high-speed data access that can handle analytics, IT leaders are contemplating investing in a variety of tools and architectures. For about a decade, most solutions explored non-SQL solutions. SQL was a fundamental feature of data warehouses. It lost favor when NoSQL technologies like Hadoop and MongoDB emerged. NoSQL databases are ideal for storage and retrieval of unstructured data. But a new breed of data storage systems emerged that combine the best of both SQL and non-SQL data storage models.
Amazon offers RDS that comes with PostgreSQL by default, but gives you the option to choose Oracle or other databases. Microsoft Azure Datawarehouse has evolved to perform quite well on a variety of data schema and structures. Today SQL interfaces on top of Hadoop, Kafka, Spark, and many other database systems.
Transactional systems and reporting platforms were not designed to handle high-speed access to big data for analytics and thus are inadequate. As a result, specialized data analytics platforms are needed to handle high-volume data storage and high-speed access required for analytics.
Big data analytics requires very fast data load and data read functionality. In response to the faster database performance needs, dedicated analytics platforms like Hadoop, Snowflake, Denodo, Data Bricks, and other NoSQL databases and open source tools for data lakes have been adopted. Handling streaming data requires special database tools that can read and store data in real time. Some of the common open source tools include Kafka and NiFi. Other solutions include Amazon’s AWS Kinesis.
While big data analytics promises phenomenal improvements in every industry, as with any technology acquisition, we want to take prudent steps toward adoption. We want to define criteria for success, gauge return on investment (ROI), and its data-centric metric that I define as return on data (ROD). A successful project must demonstrate palpable benefits and value derived from new insights. Implementing big data analytics is a necessary and standard procedure for most organizations as they strive to identify any remaining opportunities in improving efficiency, strategic and marketing/sales advantage, and cutting costs. Managing data and data governance are critical to the success of big data analytics initiatives as the volume of data increases.
In addition to implementing a robust data governance structure as a key to big data analytics success, as I mentioned in my earlier book Lean, Agile and Six Sigma IT Management, successful implementation of big data analytics also requires a team effort combined with lean and agile practices. A team effort is required because no single vendor, individual, or solution satisfies all analytics needs of the organization, and data governance is no different. Collaboration and partnerships among all user communities in the organization as well as among vendors is needed for successful implementation. A lean approach is required in order to avoid duplication of efforts, process waste and discordant systems.
The future of big data analytics is bright and will be so for many years to come. We’re finally able to shed light on the data that have been locked up in the darkness of our electronic systems for years. When you consider other applications of analytics yet to be discovered, there are endless opportunities to improve business performance using data analytics.

The Three Dimensions of Analytics

Data analytics efforts may focus on any of the three temporal dimensions of analysis: retrospective analysis, real-time (current time) analysis, and predictive analysis. Retrospective analytics can explain and provide knowledge about the events of the past, show trends, and help find root causes for those events. Real-time analysis shows what is happening right now. It works to present situational awareness, send alarms when data reaches a certain threshold, or send reminders when a certain rule is satisfied. Prospective analysis presents a view into the future. It attempts to predict what will happen and determine the future values of certain variables. Figure 1.1 shows the taxonomy of the three analytics dimensions.
Figure 1.1: The Three Temporal Dimensions of Data Analytics.

The Distinction Between BI and Analytics

The purpose of business intelligence (BI) is to transform raw data into information, insight, and meaning for business purposes. Analytics is for discovery, knowledge creating, assertion and communication of patterns, associations, classifications, and learning from data. While both approaches crunch data and use computers and sof...

Table of contents

Citation styles for Big Data Management

APA 6 Citation

Ghavami, P. (2020). Big Data Management (1st ed.). De Gruyter. Retrieved from https://www.perlego.com/book/2108136/big-data-management-data-governance-principles-for-big-data-analytics-pdf (Original work published 2020)

Chicago Citation

Ghavami, Peter. (2020) 2020. Big Data Management. 1st ed. De Gruyter. https://www.perlego.com/book/2108136/big-data-management-data-governance-principles-for-big-data-analytics-pdf.

Harvard Citation

Ghavami, P. (2020) Big Data Management. 1st edn. De Gruyter. Available at: https://www.perlego.com/book/2108136/big-data-management-data-governance-principles-for-big-data-analytics-pdf (Accessed: 15 October 2022).

MLA 7 Citation

Ghavami, Peter. Big Data Management. 1st ed. De Gruyter, 2020. Web. 15 Oct. 2022.