Practical Big Data Analytics
eBook - ePub

Nataraj Dasgupta, Giancarlo Zaccone, Patrick Hannah

  1. 412 pages
  2. English

About This Book

Get command of your organizational Big Data using the power of data science and analytics

Key Features

  • A perfect companion to boost your Big Data storage, processing, and analysis skills and help you make informed business decisions
  • Work with the best tools such as Apache Hadoop, R, Python, and Spark for NoSQL platforms to perform massive online analyses
  • Get expert tips on statistical inference, machine learning, mathematical modeling, and data visualization for Big Data

Book Description

Big Data analytics relates to the strategies used by organizations to collect, organize and analyze large amounts of data to uncover valuable business insights that otherwise cannot be analyzed through traditional systems. Crafting an enterprise-scale cost-efficient Big Data and machine learning solution to uncover insights and value from your organization's data is a challenge. Today, with hundreds of new Big Data systems, machine learning packages and BI Tools, selecting the right combination of technologies is an even greater challenge. This book will help you do that.

With the help of this guide, you will be able to bridge the gap between the theoretical world of technology and the practical reality of building corporate Big Data and data science platforms. You will get hands-on exposure to Hadoop and Spark, build machine learning dashboards using R and R Shiny, create web-based apps using NoSQL databases such as MongoDB, and even learn how to write R code for neural networks.

By the end of the book, you will have a very clear and concrete understanding of what Big Data analytics means, how it drives revenues for organizations, and how you can develop your own Big Data analytics solution using different tools and methods articulated in this book.

What you will learn

  • Get a 360-degree view into the world of Big Data, data science and machine learning
  • Explore a broad range of technical and business Big Data analytics topics that cater to the interests of technical experts as well as corporate IT executives
  • Get hands-on experience with industry-standard Big Data and machine learning tools such as Hadoop, Spark, MongoDB, KDB+ and R
  • Create production-grade machine learning BI Dashboards using R and R Shiny with step-by-step instructions
  • Learn how to combine open-source Big Data, machine learning and BI Tools to create low-cost business analytics applications
  • Understand corporate strategies for successful Big Data and data science projects
  • Go beyond general-purpose analytics to develop cutting-edge Big Data applications using emerging technologies

Who this book is for

The book is intended for existing and aspiring Big Data professionals who wish to become the go-to person in their organization when it comes to Big Data architecture, analytics, and governance. While no prior knowledge of Big Data or related technologies is assumed, it will be helpful to have some programming experience.


Information

Year: 2018
ISBN: 9781783554409
Edition: 1

Big Data Mining with NoSQL

The term NoSQL was first used by Carlo Strozzi, who, in 1998, released the Strozzi NoSQL open-source relational database. In the late 2000s, new paradigms in database architecture emerged, many of which did not adhere to the strict constraints required of relational database systems. Because they did not conform to standard database conventions such as ACID compliance, these databases were soon grouped under a broad category known as NoSQL.
Each NoSQL database claims to be optimal for certain use cases. Although few of them would fit the requirements to be a general-purpose database management system, they all leverage a few common themes across the spectrum of NoSQL systems.
In this chapter, we will visit some of the broad categories of NoSQL database management systems. We will discuss the primary drivers that initiated the migration to NoSQL database systems and how such databases solved specific business needs that led to their widespread adoption, and conclude with a few hands-on NoSQL exercises.
The topics covered in this chapter include:
  • Why NoSQL?
  • NoSQL databases
  • In-memory databases
  • Columnar databases
  • Document-oriented databases
  • Key-value databases
  • Graph databases
  • Other NoSQL types and summary
  • Hands-on exercise on NoSQL systems

Why NoSQL?

The term NoSQL is generally taken to mean Not Only SQL: that is, the underlying database has properties that differ from those of common, traditional database systems. There is no clear criterion that qualifies a database as NoSQL, other than that such databases typically do not provide full ACID compliance. It is therefore helpful to understand the ACID properties that have been the mainstay of database systems for many decades and to discuss, in brief, the significance of BASE and CAP, two other terms central to databases today.

The ACID, BASE, and CAP properties

Let's first proceed with ACID and SQL.

ACID and SQL

ACID stands for atomicity, consistency, isolation, and durability:
  • Atomicity: This indicates that a database transaction either executes in full or does not execute at all. In other words, either all the operations in a transaction are committed, that is, persisted in their entirety, or none of them are. There is no scope for a partial execution of a transaction.
  • Consistency: The constraints on the data, that is, the rules that determine what counts as valid data within a database, are enforced uniformly throughout the database. Every transaction must leave the data in a state that satisfies those rules, and no instance of the database abides by rules different from those of any other instance.
  • Isolation: This property defines the rules of how concurrent operations (transactions) will read and write data. For example, if a certain record is being updated while another process reads the same record, the isolation level of the database system will determine which version of the data is returned to the user.
  • Durability: The durability of a database system generally indicates that committed transactions will remain persistent even in the event of a system failure. This is generally managed by the use of transaction logs that databases can refer to during recovery.
The reader may observe that all the properties defined here relate primarily to database transactions. A transaction is a unit of operation that abides by the aforementioned rules and makes a change to the database. For example, a typical cash withdrawal from an ATM may have the following logical pathway:
  1. The user requests a cash withdrawal at an ATM
  2. The bank checks the current balance of the user
  3. The database system deducts the corresponding amount from the user's account
  4. The database system updates the amount in the user's account to reflect the change
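The withdrawal steps above can be sketched as a single transaction. This is a minimal illustration using Python's built-in sqlite3 module, not the way a real banking system is built; the table names (accounts, withdrawals) and the helper functions open_demo_bank, withdraw, and balance_of are hypothetical and chosen only for this example. The sketch records the withdrawal first and deducts the balance second, so that a failed deduction shows atomicity at work: the already-written record is rolled back along with everything else.

```python
import sqlite3


def open_demo_bank():
    """Create an in-memory database with one account holding 100 units (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # The CHECK constraint is the consistency rule: a balance may never go negative.
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute(
        "CREATE TABLE withdrawals ("
        "account_id INTEGER NOT NULL, amount INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
    conn.commit()
    return conn


def balance_of(conn, account_id):
    """Return the current balance of the given account."""
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


def withdraw(conn, account_id, amount):
    """Record and apply a withdrawal as one transaction; return True if it committed."""
    try:
        # As a context manager, the connection commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "INSERT INTO withdrawals (account_id, amount) VALUES (?, ?)",
                (account_id, amount),
            )
            # Raises sqlite3.IntegrityError if the balance would become negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_demo_bank()
print(withdraw(conn, 1, 30))   # True: both statements are committed together
print(withdraw(conn, 1, 500))  # False: the withdrawal record is rolled back too
print(balance_of(conn, 1))     # 70
```

After the failed second call, the withdrawals table still holds only one row, which is the point of atomicity: no partial result of the rejected transaction survives.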
Accordingly, most databases in popular use prior to the mid-1990s, such as Oracle, Sybase, DB2, and others, were optimized for recording and managing transactional data. The rapid growth of the internet from the mid-1990s onward led to new types of data that did not necessarily require strict ACID compliance. Videos on YouTube, music on Pandora, and corporate email records are all examples of use cases where a transactional database does not add value beyond simply functioning as a technology layer for storing data.

The BASE property of NoSQL

By the late 2000s, data volume had surged and it was apparent that an alternative model was required in order to manage the data. This model, called BASE, became a foundational concept and, for many large-scale distributed systems, came to be preferred over ACID as the model for database management.
BASE stands for Basically Available, Soft state, Eventual consistency. Basically available implies that the database is available for use most of the time; that is, there can be periods during which the services are unavailable (and hence additional redundancy measures should be implemented). Soft state means that the state of the system cannot be guaranteed: different instances of the same data might have different content, as one instance may not yet have captured recent updates made in another part of the cluster. Finally, eventual consistency implies that although the database might not be in the same state at all times, it will eventually get to the same state; that is, become consistent.
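To make soft state and eventual consistency concrete, here is a toy in-memory sketch in Python. It is not modeled on any particular NoSQL product: the Replica and Cluster classes, the single shared counter used as a version clock, and the explicit sync step are all simplifying assumptions made for illustration. Real systems propagate updates in the background and use more careful versioning schemes.

```python
class Replica:
    """One copy of the data; each key maps to a (version, value) pair."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def apply(self, key, version, value):
        # Last-write-wins: keep an update only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)


class Cluster:
    """Hypothetical cluster that replicates writes lazily instead of immediately."""

    def __init__(self, replica_names):
        self.replicas = {name: Replica(name) for name in replica_names}
        self.pending = []  # updates not yet delivered to the other replicas
        self.clock = 0     # simplistic global version counter (an assumption)

    def write(self, replica_name, key, value):
        # The write is acknowledged as soon as one replica has it.
        self.clock += 1
        self.replicas[replica_name].apply(key, self.clock, value)
        for name in self.replicas:
            if name != replica_name:
                self.pending.append((name, key, self.clock, value))

    def read(self, replica_name, key):
        return self.replicas[replica_name].read(key)

    def sync(self):
        # Deliver all queued updates; afterwards every replica agrees.
        for name, key, version, value in self.pending:
            self.replicas[name].apply(key, version, value)
        self.pending.clear()


cluster = Cluster(["a", "b"])
cluster.write("a", "status", "shipped")
print(cluster.read("a", "status"))  # shipped
print(cluster.read("b", "status"))  # None: replica b has not seen the update yet
cluster.sync()
print(cluster.read("b", "status"))  # shipped: the replicas have converged
```

The window between the write and the sync is the soft state described above: a reader may see different answers depending on which replica it reaches, yet once updates stop and propagation completes, all replicas return the same value.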

The CAP theorem

First introduced in the late 199...
