What makes a Big Data platform modern is not immediately obvious. A modern Big Data platform must satisfy several requirements, and to reason about them correctly we should first set expectations with regard to the data itself. Once a base of expectations from data is established, we can then reason about a modern platform that can serve it.
1.2.1 Expectations from Data
In a modern Big Data platform, data may be structured, semi‐structured, or unstructured and may arrive from various sources at different frequencies and volumes. The platform should accept each data source in its current format and process it according to a set of rules. After processing, the prepared data should meet the following expectations.
1.2.1.1 Ease of Access
How prepared data is accessed depends on the internal customer groups. Platform users can have a very diverse set of technical abilities: some are engineers who want to work deeply and technically with the platform, while others may be less technically savvy. The Big Data platform should ideally serve both ends of this customer spectrum.
Engineers expect an application programming interface (API) to communicate with the platform at its various integration points, since some of their tasks require coding or automation. Data analysts, meanwhile, expect to access the data through standard tooling such as SQL or to write an extract, transform, load (ETL) job to extract or analyze information. Lastly, the platform should offer a graphical user interface to those who simply want to see a performance metric or a business insight without needing a technical background.
1.2.1.2 Security
Data is an invaluable asset for organizations, and securing it has become a crucial aspect of a modern Big Data platform. Safeguarding against a possible data breach is a major concern because a leak would result in financial losses, reduced customer trust, and damage to the company's overall reputation.
Security risks should be minimized while users can still leverage the platform easily. Achieving both user‐friendliness and data protection requires a combination of security measures such as authentication, access control, and encryption.
Organizations should identify who can access the platform. At the same time, access to a particular class of data should be restricted to certain users or user groups. Furthermore, some data might contain critical information such as personally identifiable information (PII), which should be encrypted.
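The two measures above, access control on data classes and protection of PII fields, can be illustrated with a minimal sketch. All names here (the policy table, the roles, the field names) are hypothetical, and SHA‐256 pseudonymization stands in for the proper encryption a real platform would use:

```python
import hashlib

# Hypothetical mapping of user roles to the data classes they may read.
ACCESS_POLICY = {
    "analyst": {"public", "internal"},
    "engineer": {"public", "internal", "restricted"},
}

# Illustrative set of fields considered PII.
PII_FIELDS = {"email", "ssn"}

def can_read(role: str, data_class: str) -> bool:
    """Access control: a role may only read data classes granted to it."""
    return data_class in ACCESS_POLICY.get(role, set())

def mask_pii(record: dict) -> dict:
    """Pseudonymize PII fields; a real platform would encrypt them instead."""
    return {
        k: hashlib.sha256(v.encode()).hexdigest()[:12] if k in PII_FIELDS else v
        for k, v in record.items()
    }

record = {"user_id": "42", "email": "a@example.com", "country": "DE"}
masked = mask_pii(record)  # "email" is replaced by an opaque digest
```

In this sketch, `can_read("analyst", "restricted")` returns `False` while the same call for `"engineer"` returns `True`, so a policy check can gate every read path with one function.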
1.2.1.3 Quality
High‐quality data enables businesses to make healthier decisions, opens up new opportunities, and provides a competitive advantage. Data quality depends on factors such as accuracy, consistency, reliability, and visibility. A modern Big Data platform should provide ways to keep data accurate and consistent across data sources, to make data definitions visible, and to produce reliable processed data. The business domain is the driving factor for data quality: the resources allocated to it vary by domain, since some domains are quite flexible while others are bound by strict rules or regulations.
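Two of these quality factors, accuracy and consistency, lend themselves to automated checks. The following sketch uses invented record shapes and an invented accuracy rule (non‐negative amounts) purely for illustration; real rules would come from the domain:

```python
def check_accuracy(rows):
    """Accuracy rule (illustrative): every amount must be non-negative."""
    return [r for r in rows if r["amount"] < 0]

def check_consistency(source_a, source_b, key="order_id"):
    """Consistency: the same key set should exist in both sources.
    Returns keys present in A but not B, and in B but not A."""
    a = {r[key] for r in source_a}
    b = {r[key] for r in source_b}
    return a - b, b - a

orders = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -5.0}]
shipped = [{"order_id": 1}]

bad_rows = check_accuracy(orders)                      # violates accuracy
missing_in_b, missing_in_a = check_consistency(orders, shipped)
```

Checks like these, run as part of each processing job, are what turns "accuracy" and "consistency" from aspirations into measurable properties of a data set.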
1.2.1.4 Extensibility
Iterative development is an essential part of software engineering, so it is no surprise that it is also part of Big Data processing. A modern Big Data platform should make reprocessing easy. Once data is produced, the platform should provide the infrastructure to extend it easily. This is important because there are many ways things can go wrong when dealing with data, and one or more iterations may be necessary.
Moreover, previously obtained results should be reproducible: when the given parameters are the same, reprocessing the data should achieve the same results. The platform should also offer mechanisms to detect deviations from the expected result.
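One simple way to detect such deviations is to fingerprint each run's output and compare fingerprints across reruns. The toy job below is invented for the sketch; the point is that deterministic processing plus deterministic serialization yields an identical hash on every rerun with the same parameters:

```python
import hashlib
import json

def fingerprint(result) -> str:
    """Stable fingerprint of a processed result; sorting keys makes the
    serialization deterministic, so identical results hash identically."""
    return hashlib.sha256(
        json.dumps(result, sort_keys=True).encode()
    ).hexdigest()

def process(records, threshold):
    # Deterministic toy job: same inputs and parameters -> same output.
    return sorted(r["id"] for r in records if r["value"] >= threshold)

data = [{"id": 3, "value": 7}, {"id": 1, "value": 2}, {"id": 2, "value": 9}]
run1 = fingerprint(process(data, threshold=5))
run2 = fingerprint(process(data, threshold=5))
# A mismatch between run1 and run2 would signal a reproducibility problem.
```

Storing the fingerprint alongside each run gives the platform a cheap mechanism to flag a rerun whose output silently drifted from the expected result.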
1.2.2 Expectations from Platform
After establishing expectations regarding the data, we should discuss how the platform can meet them. Before starting, the importance of the human factor should be noted: ideal tooling can be built, but it is useful only in a collaborative environment, and critical business information and processing emerge only with good communication and methods. This section presents an overview of the features of our ideal Big Data platform; we will not explain each feature in detail here, since later chapters are dedicated to them.
1.2.2.1 Storage Layer
Ideally, the storage layer should scale in capacity, handle an increasing number of reads and writes, accept different data types, and enforce access permissions. Typical Big Data storage systems handle the capacity problem by scaling horizontally: new nodes can be introduced transparently to the applications backed by the system. With the advent of cloud providers, one can also employ cloud storage to meet growing storage needs, and a hybrid solution in which the platform uses both on‐site and cloud storage is another option. While providing scalability in terms of volume and velocity, the platform should also provide solutions for backup, disaster recovery, and cleanup.
Backups are one of the hard problems of Big Data, as the vast amount of storage involved is overwhelming for backups. One option is magnetic tapes, which are resilient to failures and require no power when not in use. A practical option is durable, low‐cost cloud storage. An expensive yet very fast alternative is a secondary system that holds part or all of the data. With one of these solutions in place, the platform can perform periodic backups.
Since retrieving backup data takes quite some time, disaster recovery benefits from keeping separate sets of data sorted by priority. Having different data sets also makes it possible to spin up multiple clusters to process critical data in parallel, on separate hardware or again via a cloud provider. The key is to be able to define which data sets are business‐critical; categorizing and assigning a priority to each data set makes the recovery execution process‐driven.
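Such a priority‐driven recovery can be sketched as grouping a data‐set catalog into restore waves, where each wave can be restored in parallel on its own cluster. The catalog entries and tier numbers below are invented for illustration:

```python
# Hypothetical catalog with business-criticality tiers (1 = most critical).
datasets = [
    {"name": "clickstream", "tier": 3},
    {"name": "billing", "tier": 1},
    {"name": "inventory", "tier": 2},
    {"name": "orders", "tier": 1},
]

def recovery_plan(catalog):
    """Group data sets by tier; each wave can be restored in parallel,
    and waves are executed in ascending tier order."""
    waves = {}
    for ds in sorted(catalog, key=lambda d: d["tier"]):
        waves.setdefault(ds["tier"], []).append(ds["name"])
    return waves

plan = recovery_plan(datasets)
# Tier 1 (billing, orders) is restored first, then tier 2, then tier 3.
```

The plan itself is trivial to compute; the hard organizational work is agreeing on the tier assignments before the disaster happens.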
The storage layer can lose space when data is replicated in many different ways and no process exists to clean it up. There are two ways to deal with data cleanup. The first is a retention policy: if all data sets have one, a process can flush expired data whenever it executes. The second is proactively reclaiming unused data space: to understand which data is no longer accessed, a process can inspect the access logs, warn the owners of the unused data, and, once the owners approve, reclaim the space.
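Both cleanup mechanisms reduce to simple predicates over a data‐set catalog. In this sketch the catalog entries, field names, and the 90‐day idle threshold are all assumptions; `last_access` stands in for information a real platform would derive from access logs:

```python
from datetime import datetime, timedelta

NOW = datetime(2024, 1, 1)  # fixed "current time" for the example

# Hypothetical catalog of data sets with retention and access metadata.
catalog = [
    {"name": "raw_events", "created": NOW - timedelta(days=400),
     "retention_days": 365, "last_access": NOW - timedelta(days=10)},
    {"name": "tmp_export", "created": NOW - timedelta(days=150),
     "retention_days": 365, "last_access": NOW - timedelta(days=120)},
]

def expired(ds):
    """Retention policy: data older than its retention window is flushed."""
    return NOW - ds["created"] > timedelta(days=ds["retention_days"])

def unused(ds, idle_days=90):
    """Proactive reclaim: flag data idle longer than the threshold; owners
    are warned and must approve before the space is reclaimed."""
    return NOW - ds["last_access"] > timedelta(days=idle_days)

to_flush = [ds["name"] for ds in catalog if expired(ds)]
to_review = [ds["name"] for ds in catalog if unused(ds) and not expired(ds)]
```

Separating the two lists mirrors the two processes in the text: expired data can be flushed automatically, while merely idle data goes through an owner‐approval step first.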
1.2.2.2 Resource Management
Workload management consists of managing resources across multiple requests, prioritizing tasks, meeting service‐level agreements (SLAs), and assessing cost. The platform should let important tasks finish on time, respond to ad hoc requests promptly, use available resources judiciously to complete tasks quickly, and measure the cost. To accomplish this, the platform should provide an approach to resource sharing, visibility across the entire platform, monitoring of individual tasks, and a cost‐reporting structure.
Resource‐sharing strategies affect both the performance of the platform and fairness toward individual jobs. On one hand, when no other task is running, the platform should use as many resources as possible to perform a given task. On the other hand, such a job then slows down every request that starts after it. Therefore, most Big Data systems provide a queuing mechanism to separate resources, which enables sharing across different business units. The trade‐off is less dramatic when the platform uses cloud‐based technologies: a cloud solution can run tasks on short‐lived clusters that automatically scale to meet demand, so the platform can employ as many nodes as needed to perform tasks faster.
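The essence of such a queuing mechanism can be shown with a toy weighted fair‐share scheduler. The queue names and weights are invented; real systems (e.g. YARN's fair scheduler) are far more elaborate, but the core idea — serve the queue furthest below its fair share — is the same:

```python
from collections import deque

class FairScheduler:
    """Toy fair-share scheduler: each business unit owns a weighted queue;
    the next task is taken from the non-empty queue with the lowest
    served/weight ratio."""

    def __init__(self, weights):
        self.weights = weights                       # queue name -> weight
        self.queues = {q: deque() for q in weights}
        self.served = {q: 0 for q in weights}        # tasks run so far

    def submit(self, queue, task):
        self.queues[queue].append(task)

    def next_task(self):
        ready = [q for q in self.queues if self.queues[q]]
        if not ready:
            return None
        q = min(ready, key=lambda name: self.served[name] / self.weights[name])
        self.served[q] += 1
        return self.queues[q].popleft()

# The "etl" queue gets twice the share of "adhoc".
sched = FairScheduler({"etl": 2, "adhoc": 1})
for i in range(3):
    sched.submit("etl", f"etl-{i}")
    sched.submit("adhoc", f"adhoc-{i}")
order = [sched.next_task() for _ in range(6)]
```

While both queues have work, roughly two "etl" tasks run for every "adhoc" task, which is exactly the resource separation the queuing mechanism is meant to provide.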
Oftentimes, visibility into platform usage is not treated as a priority, yet making good judgments is difficult without easily accessible performance information. Furthermore, the platform can consist of different sets of clusters, which makes it even harder to visualize the platform's activity at a given point in time. For each technology used under the hood, the platform should be able to collect performance metrics, or calculate them itself, and report them in graphical dashboards.
A large number of tasks performed on the platform can slow down a cluster or even bring it down. It is therefore important to set SLAs for each task and monitor individual tasks' runtime and resource allocation. When there is an oddity in an executing task, the platform should notify the task's owner or abort the task entirely. If the platform uses cloud computing technologies, it is extremely important to abort tasks, or not start them at all, based on their estimated cost.
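A first cut of such SLA monitoring is a periodic sweep that flags running tasks past their SLA so the platform can notify the owner or abort them. The task records and field names below are assumptions made for the sketch:

```python
# Hypothetical snapshot of running tasks with per-task runtime SLAs.
tasks = [
    {"name": "daily_report", "runtime_min": 42, "sla_min": 60, "owner": "bi"},
    {"name": "backfill", "runtime_min": 180, "sla_min": 120, "owner": "etl"},
]

def sla_violations(running_tasks):
    """Return (task, owner) pairs for tasks that exceeded their SLA."""
    return [(t["name"], t["owner"])
            for t in running_tasks if t["runtime_min"] > t["sla_min"]]

for name, owner in sla_violations(tasks):
    # A real platform would page the owner or abort the task here.
    print(f"ALERT: task {name!r} exceeded its SLA; notifying {owner}")
```

The same sweep can be extended with an estimated‐cost field so that, on cloud infrastructure, a task projected to exceed its budget is rejected before it ever starts.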
I believe the cost should be an integral part of the platform. It is extremely important to be transparent for the customers. If the p...