What makes a Big Data platform modern is not immediately obvious. A modern Big Data platform must satisfy several requirements, and to reason about them correctly we should first set expectations with regard to the data itself. Once a base of expectations from data is established, we can then reason about a modern platform that can serve it.
1.2.1 Expectations from Data
In a modern Big Data platform, data may be structured, semi‐structured, or unstructured and may arrive from various sources at different frequencies and volumes. The platform should accept each data source in its current format and process it according to a set of rules. After processing, the prepared data should meet the following expectations.
1.2.1.1 Ease of Access
How prepared data is accessed depends on the internal customer groups. Platform users can have a very diverse set of technical abilities: some are engineers who want to work deeply and technically with the platform, while others may be less technically savvy. The Big Data platform should ideally serve both ends of this customer spectrum.
Engineers expect an application programming interface (API) to communicate with the platform at its various integration points, since some of their tasks require coding or automation. Data analysts, meanwhile, expect to access the data through standard tooling such as SQL or to write an extract, transform, load (ETL) job to extract or analyze information. Lastly, the platform should offer a graphical user interface to those who simply want to see a performance metric or a business insight without needing a technical background.
1.2.1.2 Security
Data is an invaluable asset for organizations, and securing it has become a crucial aspect of a modern Big Data platform. Safeguarding against a possible data breach is a major concern because a leak would result in financial losses, reduced customer trust, and damage to the company's overall reputation.
Security risks should be minimized while users can still leverage the platform easily. Achieving both user‐friendliness and data protection requires a combination of security measures such as authentication, access control, and encryption.
Organizations should identify who can access the platform. At the same time, access to a particular class of data should be restricted to certain users or user groups. Furthermore, some data might contain critical information such as personally identifiable information (PII), which should be encrypted.
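The two measures above, access control on data classes and protection of PII fields, can be illustrated with a minimal sketch. All names here (the policy table, the roles, the field names) are hypothetical, and SHA‐256 pseudonymization stands in for the proper encryption a real platform would use:

```python
import hashlib

# Hypothetical mapping of user roles to the data classes they may read.
ACCESS_POLICY = {
    "analyst": {"public", "internal"},
    "engineer": {"public", "internal", "restricted"},
}

# Illustrative set of fields considered PII.
PII_FIELDS = {"email", "ssn"}

def can_read(role: str, data_class: str) -> bool:
    """Access control: a role may only read data classes granted to it."""
    return data_class in ACCESS_POLICY.get(role, set())

def mask_pii(record: dict) -> dict:
    """Pseudonymize PII fields; a real platform would encrypt them instead."""
    return {
        k: hashlib.sha256(v.encode()).hexdigest()[:12] if k in PII_FIELDS else v
        for k, v in record.items()
    }

record = {"user_id": "42", "email": "a@example.com", "country": "DE"}
masked = mask_pii(record)  # "email" is replaced by an opaque digest
```

In this sketch, `can_read("analyst", "restricted")` returns `False` while the same call for `"engineer"` returns `True`, so a policy check can gate every read path with one function.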
1.2.1.3 Quality
High‐quality data enables businesses to make healthier decisions, opens up new opportunities, and provides a competitive advantage. Data quality depends on factors such as accuracy, consistency, reliability, and visibility. A modern Big Data platform should provide ways to keep data accurate and consistent across data sources, to make data definitions visible, and to produce reliable processed data. The business domain is the driving factor for data quality: the resources allocated to it vary by domain, since some domains are quite flexible while others are bound by strict rules or regulations.
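Two of these quality factors, accuracy and consistency, lend themselves to automated checks. The following sketch uses invented record shapes and an invented accuracy rule (non‐negative amounts) purely for illustration; real rules would come from the domain:

```python
def check_accuracy(rows):
    """Accuracy rule (illustrative): every amount must be non-negative."""
    return [r for r in rows if r["amount"] < 0]

def check_consistency(source_a, source_b, key="order_id"):
    """Consistency: the same key set should exist in both sources.
    Returns keys present in A but not B, and in B but not A."""
    a = {r[key] for r in source_a}
    b = {r[key] for r in source_b}
    return a - b, b - a

orders = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -5.0}]
shipped = [{"order_id": 1}]

bad_rows = check_accuracy(orders)                      # violates accuracy
missing_in_b, missing_in_a = check_consistency(orders, shipped)
```

Checks like these, run as part of each processing job, are what turns "accuracy" and "consistency" from aspirations into measurable properties of a data set.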
1.2.1.4 Extensibility
Iterative development is an essential part of software engineering, so it is no surprise that it is also part of Big Data processing. A modern Big Data platform should make reprocessing easy. Once data is produced, the platform should provide the infrastructure to extend it easily. This is important because there are many ways things can go wrong when dealing with data, and one or more iterations may be necessary.
Moreover, previously obtained results should be reproducible: when the given parameters are the same, reprocessing the data should achieve the same results. The platform should also offer mechanisms to detect deviations from the expected result.
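One simple way to detect such deviations is to fingerprint each run's output and compare fingerprints across reruns. The toy job below is invented for the sketch; the point is that deterministic processing plus deterministic serialization yields an identical hash on every rerun with the same parameters:

```python
import hashlib
import json

def fingerprint(result) -> str:
    """Stable fingerprint of a processed result; sorting keys makes the
    serialization deterministic, so identical results hash identically."""
    return hashlib.sha256(
        json.dumps(result, sort_keys=True).encode()
    ).hexdigest()

def process(records, threshold):
    # Deterministic toy job: same inputs and parameters -> same output.
    return sorted(r["id"] for r in records if r["value"] >= threshold)

data = [{"id": 3, "value": 7}, {"id": 1, "value": 2}, {"id": 2, "value": 9}]
run1 = fingerprint(process(data, threshold=5))
run2 = fingerprint(process(data, threshold=5))
# A mismatch between run1 and run2 would signal a reproducibility problem.
```

Storing the fingerprint alongside each run gives the platform a cheap mechanism to flag a rerun whose output silently drifted from the expected result.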
1.2.2 Expectations from Platform
After establishing expectations regarding the data, we should discuss how the platform can meet them. Before starting, the importance of the human factor should be noted: ideal tooling can be built, but it is useful only in a collaborative environment, and critical business information and processing emerge only with good communication and methods. This section presents an overview of the features of our ideal Big Data platform; we will not explain each feature in detail here, since later chapters are dedicated to them.
1.2.2.1 Storage Layer
Ideally, the storage layer should scale in capacity, handle an increasing number of reads and writes, accept different data types, and enforce access permissions. Typical Big Data storage systems handle the capacity problem by scaling horizontally: new nodes can be introduced transparently to the applications backed by the system. With the advent of cloud providers, one can also employ cloud storage to meet growing storage needs, and a hybrid solution in which the platform uses both on‐site and cloud storage is another option. While providing scalability in terms of volume and velocity, the platform should also provide solutions for backup, disaster recovery, and cleanup.
Backups are one of the hard problems of Big Data, as the vast amount of storage involved is overwhelming for backups. One option is magnetic tapes, which are resilient to failures and require no power when not in use. A practical option is durable, low‐cost cloud storage. An expensive yet very fast alternative is a secondary system that holds part or all of the data. With one of these solutions in place, the platform can perform periodic backups.
Since retrieving backup data takes quite some time, disaster recovery benefits from keeping separate sets of data sorted by priority. Having different data sets also makes it possible to spin up multiple clusters to process critical data in parallel, on separate hardware or again via a cloud provider. The key is to be able to define which data sets are business‐critical; categorizing and assigning a priority to each data set makes the recovery execution process‐driven.
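Such a priority‐driven recovery can be sketched as grouping a data‐set catalog into restore waves, where each wave can be restored in parallel on its own cluster. The catalog entries and tier numbers below are invented for illustration:

```python
# Hypothetical catalog with business-criticality tiers (1 = most critical).
datasets = [
    {"name": "clickstream", "tier": 3},
    {"name": "billing", "tier": 1},
    {"name": "inventory", "tier": 2},
    {"name": "orders", "tier": 1},
]

def recovery_plan(catalog):
    """Group data sets by tier; each wave can be restored in parallel,
    and waves are executed in ascending tier order."""
    waves = {}
    for ds in sorted(catalog, key=lambda d: d["tier"]):
        waves.setdefault(ds["tier"], []).append(ds["name"])
    return waves

plan = recovery_plan(datasets)
# Tier 1 (billing, orders) is restored first, then tier 2, then tier 3.
```

The plan itself is trivial to compute; the hard organizational work is agreeing on the tier assignments before the disaster happens.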
The storage layer can lose space when data is replicated in many different ways and no process exists to clean it up. There are two ways to deal with data cleanup. The first is a retention policy: if all data sets have one, a process can flush expired data whenever it executes. The second is proactively reclaiming unused data space: to understand which data is no longer accessed, a process can inspect the access logs, warn the owners of the unused data, and, once the owners approve, reclaim the space.
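Both cleanup mechanisms reduce to simple predicates over a data‐set catalog. In this sketch the catalog entries, field names, and the 90‐day idle threshold are all assumptions; `last_access` stands in for information a real platform would derive from access logs:

```python
from datetime import datetime, timedelta

NOW = datetime(2024, 1, 1)  # fixed "current time" for the example

# Hypothetical catalog of data sets with retention and access metadata.
catalog = [
    {"name": "raw_events", "created": NOW - timedelta(days=400),
     "retention_days": 365, "last_access": NOW - timedelta(days=10)},
    {"name": "tmp_export", "created": NOW - timedelta(days=150),
     "retention_days": 365, "last_access": NOW - timedelta(days=120)},
]

def expired(ds):
    """Retention policy: data older than its retention window is flushed."""
    return NOW - ds["created"] > timedelta(days=ds["retention_days"])

def unused(ds, idle_days=90):
    """Proactive reclaim: flag data idle longer than the threshold; owners
    are warned and must approve before the space is reclaimed."""
    return NOW - ds["last_access"] > timedelta(days=idle_days)

to_flush = [ds["name"] for ds in catalog if expired(ds)]
to_review = [ds["name"] for ds in catalog if unused(ds) and not expired(ds)]
```

Separating the two lists mirrors the two processes in the text: expired data can be flushed automatically, while merely idle data goes through an owner‐approval step first.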
1.2.2.2 Resource Management
Workload management consists of managing resources across multiple requests, prioritizing tasks, meeting service‐level agreements (SLAs), and assessing cost. The platform should let important tasks finish on time, respond to ad hoc requests promptly, use available resources judiciously to complete tasks quickly, and measure the cost. To accomplish this, the platform should provide an approach to resource sharing, visibility across the entire platform, monitoring of individual tasks, and a cost‐reporting structure.
Resource‐sharing strategies affect both the performance of the platform and fairness toward individual jobs. On one hand, when no other task is running, the platform should use as many resources as possible to perform a given task. On the other hand, such a job then slows down every request that starts after it. Therefore, most Big Data systems provide a queuing mechanism to separate resources, which enables sharing across different business units. The trade‐off is less dramatic when the platform uses cloud‐based technologies: a cloud solution can run tasks on short‐lived clusters that automatically scale to meet demand, so the platform can employ as many nodes as needed to perform tasks faster.
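The essence of such a queuing mechanism can be shown with a toy weighted fair‐share scheduler. The queue names and weights are invented; real systems (e.g. YARN's fair scheduler) are far more elaborate, but the core idea — serve the queue furthest below its fair share — is the same:

```python
from collections import deque

class FairScheduler:
    """Toy fair-share scheduler: each business unit owns a weighted queue;
    the next task is taken from the non-empty queue with the lowest
    served/weight ratio."""

    def __init__(self, weights):
        self.weights = weights                       # queue name -> weight
        self.queues = {q: deque() for q in weights}
        self.served = {q: 0 for q in weights}        # tasks run so far

    def submit(self, queue, task):
        self.queues[queue].append(task)

    def next_task(self):
        ready = [q for q in self.queues if self.queues[q]]
        if not ready:
            return None
        q = min(ready, key=lambda name: self.served[name] / self.weights[name])
        self.served[q] += 1
        return self.queues[q].popleft()

# The "etl" queue gets twice the share of "adhoc".
sched = FairScheduler({"etl": 2, "adhoc": 1})
for i in range(3):
    sched.submit("etl", f"etl-{i}")
    sched.submit("adhoc", f"adhoc-{i}")
order = [sched.next_task() for _ in range(6)]
```

While both queues have work, roughly two "etl" tasks run for every "adhoc" task, which is exactly the resource separation the queuing mechanism is meant to provide.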
Oftentimes, visibility into platform usage is not treated as a priority, yet making good judgments is difficult without easily accessible performance information. Furthermore, the platform can consist of different sets of clusters, which makes it even harder to visualize the platform's activity at a given point in time. For each technology used under the hood, the platform should be able to collect performance metrics, or calculate them itself, and report them in graphical dashboards.
A large number of tasks performed on the platform can slow down a cluster or even bring it down. It is therefore important to set SLAs for each task and monitor individual tasks' runtime and resource allocation. When there is an oddity in an executing task, the platform should notify the task's owner or abort the task entirely. If the platform uses cloud computing technologies, it is extremely important to abort tasks, or not start them at all, based on their estimated cost.
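A first cut of such SLA monitoring is a periodic sweep that flags running tasks past their SLA so the platform can notify the owner or abort them. The task records and field names below are assumptions made for the sketch:

```python
# Hypothetical snapshot of running tasks with per-task runtime SLAs.
tasks = [
    {"name": "daily_report", "runtime_min": 42, "sla_min": 60, "owner": "bi"},
    {"name": "backfill", "runtime_min": 180, "sla_min": 120, "owner": "etl"},
]

def sla_violations(running_tasks):
    """Return (task, owner) pairs for tasks that exceeded their SLA."""
    return [(t["name"], t["owner"])
            for t in running_tasks if t["runtime_min"] > t["sla_min"]]

for name, owner in sla_violations(tasks):
    # A real platform would page the owner or abort the task here.
    print(f"ALERT: task {name!r} exceeded its SLA; notifying {owner}")
```

The same sweep can be extended with an estimated‐cost field so that, on cloud infrastructure, a task projected to exceed its budget is rejected before it ever starts.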
I believe the cost should be an integral part of the platform. It is extremely important to be transparent for the customers. If the p...