1.1 Introduction
The massive increase in demand for the most powerful computing resources is driven by the steep growth of data available for processing and analysis. Many scientific and industrial advances have been driven by the capability to gain insights and make precise decisions based on the outputs of data analysis. The International Data Corporation (IDC) predicts that the total volume of data available globally will exceed 175×10²¹ bytes, five times the estimated 33×10²¹ bytes in 2018 [1]. The ability to process these data and efficiently infer valuable information from them will be a key driving factor for sustainable economic growth in the future.
In this chapter, we provide an overview of the current state of the art in the high-performance computing (HPC), big data, and cloud domains, and of how different fields like agriculture, healthcare, or mechanical engineering can benefit from their convergence. The overview is given from the point of view of three European projects funded by the Horizon 2020 Programme: CYBELE [2,3], EVOLVE [4], and LEXIS [5].
1.1.1 History of Cloud Computing
A common trait of cloud computing is easy access to data center-class computing resources without the requirement to buy any physical hardware [6]. In the cloud, a user can allocate a machine with a large number of CPU cores and a large amount of RAM by using a web-based user interface or a REST API. The first commercial offering of cloud machines came from Amazon (Elastic Compute Cloud, EC2) in 2006 [7], followed by Google (Compute Engine) [8] and Microsoft (Azure) [9].
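The sketch below illustrates what such an allocation typically looks like through a cloud API, assuming the AWS SDK for Python (boto3) and preconfigured credentials; the machine image ID, instance type, and region are illustrative placeholders, not values taken from the projects discussed here.

```python
# Minimal sketch: allocating a virtual machine through a cloud REST API,
# here via the AWS SDK for Python (boto3). Image ID, instance type, and
# region are placeholders.
import boto3

# The region determines where the underlying physical hardware is located,
# e.g. "us-west-2" places the machine on the US West Coast.
ec2 = boto3.client("ec2", region_name="us-west-2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder machine image
    InstanceType="t3.large",          # determines vCPU count and RAM
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Requested instance {instance_id}")
```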
The main expectations of such a machine are that it will be reliable, fast, and secure, and that the service is procured in an on-demand, self-service manner [6]. These properties are usually provided by hardware and network virtualization, which decouples the machine from the physical hardware while also isolating the individual users sharing the same server or network infrastructure. Storage is usually encrypted, and the application programming interface (API) is protected with Transport Layer Security (TLS).
Defining the speed of the machine is trickier. Since the cloud is usually used for fast scaling of microservices, users of such systems prefer a fast response of the API as well as the possibility to express a preference for a geographical location. For example, allocating a machine in the US-west zone means that the machine will be placed on physical hardware located on the West Coast of the United States, bringing it closer to users from that area without specifying exactly where the data are located. This property enables a truly global and cost-effective deployment of an application that must deal with a highly dynamic load. To make the cloud a highly available and easy-to-use service, it needs to hide the details of the physical infrastructure used to host the user machines. This fact drives some users with highly sensitive workloads, who still want the benefits of the cloud, to deploy an on-premises cloud, often based on technologies like OpenStack [10] or VMware [11].
1.1.2 History of HPC
A supercomputer is a very powerful scientific computing system whose computing power could place it in (or near) the TOP500 [12] list, which ranks systems according to their performance in floating-point operations per second (FLOP/s). The origins of supercomputing can be traced to the 1940s, when the first machines were used to crack encryption codes or to estimate the yield of an atomic explosion [13,14].
Later, companies like DEC (with its PDP line), IBM, and above all Cray built large monolithic machines with powerful vector CPUs. These machines were quite expensive, and only large companies or nationally funded laboratories were able to obtain them. As a result, access to such machines was usually restricted to a privileged group of scientists and engineers. These users were typically highly trained, had intimate knowledge of the system architecture, and were accustomed to UNIX-like operating systems, terminals, and command-line tools [15].
In the early 1990s, manufacturers started to build supercomputers from more common hardware, as it became more powerful and much cheaper than highly proprietary, specialized, and monolithic machines. At that point, supercomputers became clusters of powerful servers interconnected by a network with a very dense topology. Current systems remain very similar to this architecture; however, highly specialized chips and accelerators like GPUs, TPUs, or FPGAs are increasingly taking their place next to general-purpose CPUs [16,17].
Despite incredible advances in the available computing power, the interface of common HPC clusters remains mostly unchanged. Jobs must be submitted to a batch scheduler, which runs the individual applications and distributes the workload among the compute nodes. Users typically use SSH to log in to a frontend node of an HPC cluster, where they prepare the data and compile their applications through a command-line interface (CLI). As clusters today almost exclusively run Linux operating systems, users are required to have extensive knowledge of shell scripting, batch scheduler interfaces, and even compiler toolchains and library linking to utilize such systems efficiently. Eliminating this obstacle by creating user-friendly interfaces can open the door to HPC for a wide range of domain experts who would not have to acquire specialized knowledge about HPC systems.
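As a concrete illustration of this workflow, the sketch below prepares and submits a batch job from Python, assuming a cluster frontend where the Slurm scheduler's sbatch command is available; the project account, partition name, and solver binary are hypothetical.

```python
# Minimal sketch: submitting a job to a Slurm batch scheduler from Python.
# Assumes a frontend node where `sbatch` is available; the account,
# partition, and application names are hypothetical.
import subprocess
from pathlib import Path

job_script = """#!/bin/bash
#SBATCH --job-name=fem_solver
#SBATCH --account=MY_PROJECT          # hypothetical project allocation
#SBATCH --partition=compute           # hypothetical partition name
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=02:00:00

# Run an MPI application across the allocated nodes.
srun ./my_solver input.dat
"""

script_path = Path("job.sbatch")
script_path.write_text(job_script)

# sbatch prints e.g. "Submitted batch job 123456" on success.
result = subprocess.run(["sbatch", str(script_path)],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```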
In comparison with the cloud, access to the largest HPC systems is usually granted only to selected use cases (projects), based on an excellence and performance review, to avoid wasting precious natural resources, as the most powerful systems can consume several megawatts (MW) of power when fully utilized [12]. Automating and standardizing this formal review process can open the door to a wider range of applications while keeping the requirement for efficient usage.
Such systems are traditionally used to run solvers based on numerical linear algebra, such as those underlying the finite element method (FEM) [18], which is often used to simulate physical processes like deformation, heat transfer, or fluid flow. The solvers usually work with matrices whose size significantly exceeds the memory and computing power available on high-end scientific desktop workstations. Nowadays, the typical application domains utilizing HPC power can be identified by looking at the centers of excellence [34] supported by the European Union (EU) to strengthen Europe's leadership in HPC applications: not only engineering and industrial applications, but also other domains like simulations of weather and climate, urgent computing to support crisis management, biomolecular research, personalized medicine and drug discovery, and new materials design.
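To give a flavour of the linear-algebra kernels such solvers rely on, the toy sketch below assembles a small sparse system of the kind a 1D FEM-style discretization produces and solves it with SciPy on a single machine; it is an illustrative assumption of the problem structure, not any specific production solver, and real HPC workloads use matrices orders of magnitude larger, distributed across nodes.

```python
# Illustrative toy: a sparse linear system A x = b of the kind produced by
# FEM-style discretizations, solved with a sparse direct solver.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1_000  # number of unknowns (tiny compared to real HPC workloads)

# Tridiagonal stiffness-like matrix from a 1D Poisson discretization.
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")

b = np.ones(n)          # right-hand side (e.g. a uniform load)
x = spla.spsolve(A, b)  # sparse direct solve

print("residual norm:", np.linalg.norm(A @ x - b))
```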
1.1.3 Evolution of Big Data
Making educated decisions by interpreting available data is one of the founding principles of modern civilization. Stone tablets were used for tracking crop yields thousands of years ago. The rapid evolution of mathematics and statistics in the 16th century created the foundations for modern data science. With the onset of electronic computers in the 20th century, large amounts of data suddenly became available [19].
At the beginning, the data were stored on magnetic tape and processing was done using large mainframe computers controlled by punch cards. As these machines were expensive, they were used almost exclusively by large companies or financial institutions, mainly to automate tax and payroll processing [19].
Later, when personal computers appeared, data storage and processing became increasingly available to the general population, with Excel spreadsheets becoming the most widely available tool for basic data analysis. At the same time, industrial automation started to gain traction, and it became possible to implement complex enterprise resource planning (ERP) systems to automate and optimize the logistics of very complex manufacturing processes, thus creating the field of business intelligence (BI) [19].
However, the biggest explosion of data occurred with the onset of the internet and the social network revolution of the 21st century. Currently, the world's biggest companies are those that specialize in data processing, mainly for advertising and marketing purposes. This revolution not only allowed instant and free access to vast human knowledge using small handheld devices but also spurred the development of novel and advanced data analysis and machine learning (ML) techniques, followed by a renewed interest in neural networks.
Development related to artificial intelligence (AI) techniques and the Internet of Things (IoT) is now the cutting edge of data processing. It allows the ability to communicate with services on the internet to be integrated into many appliances. Such advances also make it possible to move part of the data processing out of the data center, closer to the places where data are generated. This approach is generally called edge computing and contributes to the wide range of computing resources currently available [20].
1.1.4 Evolution of Big Data Storage and Tools
It would be futile to try to formalize the term big data, as it is commonly used in multiple ways. It was first used by Roger Mougalas in 2005 to describe data "so big that it is almost impossible to manage and process using traditional business intelligence tools." Another definition describes big data as data so large that it is faster to physically move the hard drives containing them than to transfer them over the network. Both descriptions point to the fact that powerful computing resources are needed to make use of such data [21].
Data lake is another informal term often used in this context. It relates to the storage and processing of massive amounts of unstructured and heterogeneous data linked only by their context [22]. Object storage has become the de facto standard way to store such data, as it uses a more practical flat structure (buckets of objects addressed by unique identifiers) than the tree structures of traditional file systems. Cloud providers started to offer HTTP-based APIs for accessing such object stores, following established principles. Technologies like Ceph [24] or GlusterFS [25] allow such object storage to be deployed on common hardware on premises, with performance and resilience comparable to proprietary storage appliances.
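The sketch below shows how such HTTP-based object APIs are typically used, assuming an S3-compatible endpoint (for example, one exposed by a Ceph RADOS Gateway or a public cloud provider) and the boto3 client; the endpoint URL, credentials, bucket name, and object key are placeholders.

```python
# Minimal sketch: storing and retrieving an object through an S3-compatible
# HTTP API. Endpoint URL, credentials, bucket, and key are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.org",  # on-premises or cloud endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

bucket = "sensor-data"
s3.create_bucket(Bucket=bucket)

# Objects live in a flat namespace inside the bucket, addressed by a unique key.
s3.put_object(Bucket=bucket, Key="2018/06/device42.json",
              Body=b'{"temperature": 21.5}')

obj = s3.get_object(Bucket=bucket, Key="2018/06/device42.json")
print(obj["Body"].read())
```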
Tools used to analyze the data have also undergone rapid evolution. In the 1990s, data analysis was done on a single powerful local workstation usin...