Spark

Big Data Cluster Computing in Production

About this book

Production-targeted Spark guidance with real-world use cases

Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance toward using lightning-fast big-data clustering in production. Written by an expert team well-known in the big data community, this book walks you through the challenges in moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide deep insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, MLlib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, database connectors, streaming, security, and much more.

Spark has become the tool of choice for many Big Data problems, with more active contributors than any other Apache Software Foundation project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and invaluable foresight make this guide an incredibly useful resource for real production settings.

  • Review Spark hardware requirements and estimate cluster size
  • Gain insight from real-world production use cases
  • Tighten security, schedule resources, and fine-tune performance
  • Overcome common problems encountered using Spark in production

Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.

Information

Authors: Ilya Ganelin, Ema Orhian, Kai Sasaki, Brennon York
Publisher: Wiley
Year: 2016
Print ISBN: 9781119254010
eBook ISBN: 9781119254058
Edition: 1

CHAPTER 1
Finishing Your Spark Job

When you scale out a Spark application for the first time, one of the more common problems you will encounter is that the application simply fails to finish its job. The Apache Spark framework’s ability to scale is tremendous, but it does not come with those properties out of the box. Spark was created, first and foremost, to be a framework that is easy to get started with and easy to use. Once you have developed an initial application, however, you will need to gain a deeper knowledge of Spark’s internals and configurations to take the job to the next stage.
In this chapter we lay the groundwork for getting a Spark application to succeed. We will focus primarily on the hardware and system-level design choices you need to set up and consider before you can work through the various Spark-specific issues to move an application into production.
We will begin by discussing the various ways you can install a production-grade cluster for Apache Spark, including the scaling efficiencies you will need for a given workload, the various installation methods, and the common setups. Next, we will look at the historical origins of Spark in order to better understand its design and to help you judge when it is the right tool for your jobs. Following that, we will examine resource management: how memory, CPU, and disk usage come into play when creating and executing Spark applications. Next, we will cover the storage capabilities within Spark and its external subsystems. Finally, we will conclude with a discussion of how to instrument and monitor a Spark application.

Installation of the Necessary Components

Before you can begin to migrate an application written in Apache Spark, you will need an actual cluster to test it on. You can download, compile, and install Spark in a number of different ways (some easier than others), and we’ll cover the primary methods in this chapter.
Let’s begin by explaining how to configure a native installation, meaning one where only Apache Spark is installed, then we’ll move into the various Hadoop distributions (Cloudera and Hortonworks), and conclude by providing a brief explanation on how to deploy Spark on Amazon Web Services (AWS).
Before diving too far into the various ways you can install Spark, the obvious question that arises is, “What type of hardware should I leverage for a Spark cluster?” We can offer various possible answers to this question, but we’d like to focus on a few resounding truths of the Spark framework rather than prescribing a particular layout.
It’s important to know that Apache Spark is an in-memory compute grid. Therefore, for maximum efficiency, it is highly recommended that the system, as a whole, maintain enough memory within the framework for the largest workload (or dataset) that will be conceivably consumed. We are not saying that you cannot scale a cluster later, but it is always better to plan ahead, especially if you work inside a larger organization where purchase orders might take weeks or months.
On the concept of memory, it is necessary to understand that the amount of memory you need does not map to the size of your data in a one-to-one fashion. That is to say, for a given 1TB dataset, you will need more than 1TB of memory. This is because when you create objects in Java from a dataset, each object is typically much larger than the original data element. Multiply that expansion by the number of objects created for a given dataset and you will have a much more accurate picture of the amount of memory a system will require to perform a given task.
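As a rough illustration of this expansion, Spark ships with the org.apache.spark.util.SizeEstimator utility, which reports the approximate heap footprint of an object. The following is a minimal sketch you could run in the spark-shell; the exact numbers will vary by JVM, Spark version, and data shape.

    import org.apache.spark.util.SizeEstimator

    // A 10-character ASCII string occupies roughly 10 bytes on disk, but as a
    // Java String it also carries an object header, a reference to an internal
    // character array, and alignment padding.
    val record = "0123456789"
    println(s"One record on the heap: ${SizeEstimator.estimate(record)} bytes")

    // Collections add further per-element overhead (object references, array
    // headers, and so on), so a full dataset expands even more.
    val records = (1 to 1000).map(i => s"record-$i").toArray
    println(s"1,000 records on the heap: ${SizeEstimator.estimate(records)} bytes")

Scaling the measured per-record footprint by the number of records gives a far more realistic memory target than the raw size of the data on disk.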
To better attack this problem, the Spark community is, at the time of this writing, working on what Apache has called Project Tungsten, which will greatly reduce the memory overhead of objects by leveraging off-heap memory. You don’t need to know more about Tungsten as you continue reading this book, but it is worth keeping in mind for future Spark releases, because Tungsten is poised to become the de facto memory management system.
The second major component we want to highlight in this chapter is the number of CPU cores you will need per physical machine when you are determining hardware for Apache Spark. The answer here is less clear-cut: once the data has been loaded into memory, the application is typically either network-bound or CPU-bound. That said, the easiest approach is to test your Spark application on a smaller dataset, measure its bounding case (network or CPU), and then plan accordingly from there.
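As a minimal sketch of how such a measurement typically feeds back into configuration, assume a hypothetical application that turned out to be CPU-bound. The property names below are standard Spark settings; the specific values are purely illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only: the right numbers come from profiling your own
    // application on a representative sample of the data.
    val conf = new SparkConf()
      .setAppName("SizedForProduction")
      .set("spark.executor.memory", "8g")        // headroom above the in-memory working set
      .set("spark.executor.cores", "4")          // more cores per executor for CPU-bound jobs
      .set("spark.default.parallelism", "200")   // roughly 2-3 tasks per core across the cluster

    val sc = new SparkContext(conf)

The same settings can just as well be passed at submit time; the point is that the hardware and configuration decisions should be driven by the measured bottleneck rather than guessed.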

Native Installation Using a Spark Standalone Cluster

The simplest way to install Spark is to deploy a Spark Standalone cluster. In this mode, you deploy a Spark binary to each node in a cluster, update a small set of configuration files, and then start the appropriate processes on the master and slave nodes. In Chapter 2, we discuss this process in detail and present a simple scenario covering installation, deployment, and execution of a basic Spark job.
Because Spark is not tied to the Hadoop ecosystem, this mode does not have any dependencies aside from the Java JDK. Spark currently recommends the Java 1.7 JDK. If you wish to run alongside an existing Hadoop deployment, you can launch the Spark processes on the same machines as the Hadoop installation and configure the Spark environment variables to include the Hadoop configuration.
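As a sketch of what the application side looks like once the standalone master and slave processes are running, the snippet below connects to a master at a placeholder host name. Port 7077 is the standalone master’s default, and when co-locating with Hadoop, the HADOOP_CONF_DIR environment variable (typically set in conf/spark-env.sh) should point at the existing Hadoop configuration directory.

    import org.apache.spark.{SparkConf, SparkContext}

    // "master-host" is a placeholder for your standalone master's address.
    val conf = new SparkConf()
      .setAppName("StandaloneSmokeTest")
      .setMaster("spark://master-host:7077")

    val sc = new SparkContext(conf)

    // A trivial job to confirm that executors registered and tasks can run.
    val total = sc.parallelize(1 to 1000).sum()
    println(s"Sum computed on the cluster: $total")
    sc.stop()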
NOTE For more on a Cloudera installation of Spark try http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_spark_installation.html. For more on the Hortonworks installation try http://hortonworks.com/hadoop/spark/#section_6. And for more on an Amazon Web Services installation of Spark try http://aws.amazon.com/articles/4926593393724923.

The History of Distributed Computing That Led to Spark

We have introduced Spark as a distributed compute framework; however, we haven’t really discussed what this means. Until recently, most computer systems available to both individuals and enterprises were based around single machines. These single machines came in many shapes and sizes and differed dramatically in terms of their performance, as they do today.
We’re all familiar with the modern ecosystem of personal machines. At the low end, we have tablets and mobile phones. We can think of these as relatively weak, un-networked computers. At the next level we have laptops and desktop computers. These are more powerful machines, with more storage and computational ability, and potentially with one or more graphics cards (GPUs) that support certain types of massively parallel computations. Next are the machines that some people have networked in their homes, although generally these machines were not networked to share their computational ability, but rather to provide shared storage—for example, to share movies or music across a home network.
Within most enterprises, the picture today is still much the same. Although the machines used may be more powerful, most of the software they run, and most of the work they do, is still executed on a single machine. This fact limits the scale and the potential impact of the work they can do. Given this limitation, a few select organizations have driven the evolution of modern parallel computing to allow networked systems of computers to do more than just share data, and to collaboratively utilize their resources to tackle enormous problems.
In the public domain, you may have heard of the SETI@home program from Berkeley or the Folding@home program from Stanford. Both of these programs were early initiatives that let individuals dedicate their machines to solving parts of a massive distributed task. In the former case, SETI@home has been looking for unusual signals coming from outer space, collected via radio telescope. In the latter, the Stanford program runs a piece of a program computing permutations of proteins—essentially building molecules—for medical research.
Because of the size of the data being processed, no single machine, not even the massive supercomputers available in certain universities or government agencies, has had the capacity to solve these problems within the scope of a project or even a lifetime. By distributing the workload to multiple machines, the problem became potentially tractable—solvable in the allotted time.
As these systems became more mature, and the computer science behind these systems was further developed, many organizations created clusters of machines—coordinated systems that could distribute the workload of a particular problem ...

Table of contents

  1. Cover
  2. Title Page
  3. Introduction
  4. Chapter 1: Finishing Your Spark Job
  5. Chapter 2: Cluster Management
  6. Chapter 3: Performance Tuning
  7. Chapter 4: Security
  8. Chapter 5: Fault Tolerance or Job Execution
  9. Chapter 6: Beyond Spark
  10. Copyright
  11. Credits
  12. Acknowledgments
  13. About the Authors
  14. About the Technical Editors
  15. EULA