Mastering Spark for Data Science

Andrew Morgan, Antoine Amend, David George, Matthew Hallett

560 pages | English | ePUB (mobile friendly)

About This Book

Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products.

• Develop and apply advanced analytical techniques with Spark
• Learn how to tell a compelling story with data science using Spark's ecosystem
• Explore data at scale and work with cutting-edge data science methods

Who This Book Is For

This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting-edge techniques. It assumes working knowledge of data science, common machine learning methods, and popular data science tools, and it assumes you have previously run proof-of-concept studies and built prototypes.

What You Will Learn

• Learn the design patterns that integrate Spark into industrialized data science pipelines
• See how commercial data scientists design scalable, reusable code for data science services
• Explore cutting-edge data science methods so that you can study trends and causality
• Discover advanced programming techniques using the RDD, DataFrame, and Dataset APIs (a brief sketch follows this description)
• Find out how Spark can be used as a universal ingestion engine and as a web scraper
• Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
• Get to know the best practices for Extended Exploratory Data Analysis, commonly used in commercial data science teams
• Study advanced Spark concepts, solution design patterns, and integration architectures
• Demonstrate powerful data science pipelines

In Detail

Data science seeks to transform the world using data, and this is typically achieved by disrupting and changing real processes in real industries. To operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs.

This book dives deep into using Spark to deliver production-grade data science solutions. The process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights.

You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more.

You will be introduced to advanced techniques and methods that will help you construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Style and approach

This is an advanced guide for those with beginner-level familiarity with the Spark architecture and data science applications. Mastering Spark for Data Science is a practical tutorial that uses core Spark APIs and takes a deep dive into advanced libraries including Spark SQL, Spark Streaming, and MLlib. This book builds on titles such as Machine Learning with Spark and Learning Spark, and is the next step for those comfortable with Spark and looking to improve their skills.
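To give a concrete feel for the programming models listed above, here is a minimal, illustrative sketch, not taken from the book, of the same word count written once with the low-level RDD API and once with the typed Dataset API. The input file name (news.txt) and the application name are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession; works the same in spark-shell or via spark-submit
val spark = SparkSession.builder()
  .appName("api-comparison")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// RDD API: low-level functional transformations on raw records
val rddCounts = spark.sparkContext
  .textFile("news.txt")                  // hypothetical input file
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

// Dataset API: the same logic, typed and planned by the Catalyst optimiser
val dsCounts = spark.read.textFile("news.txt")
  .flatMap(_.split("\\s+"))
  .groupByKey(identity)
  .count()

rddCounts.take(5).foreach(println)
dsCounts.show(5)

The two versions compute the same result; the Dataset form additionally exposes its structure to Spark's Catalyst optimiser, which is one reason the DataFrame and Dataset APIs matter for scalable pipelines.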

Information

Year: 2017
ISBN: 9781785888281
Edition: 1

Mastering Spark for Data Science

Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: March 2017
Production reference: 1240317
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-214-2
www.packtpub.com

Credits

Authors: Andrew Morgan, Antoine Amend, David George, Matthew Hallett
Copy Editor: Safis Editing
Reviewer: Sumit Pal
Project Coordinator: Shweta H Birwatkar
Commissioning Editor: Akram Hussain
Proofreader: Safis Editing
Acquisition Editor: Vinay Argekar
Indexer: Pratik Shirodkar
Content Development Editor: Amrita Noronha
Graphics: Tania Dutta
Technical Editor: Sneha Hanchate
Production Coordinator: Arvindkumar Gupta

Foreword

The impact of Spark on the world of data science has been startling. It is less than 3 years since Spark 1.0 was released, and yet Spark is already accepted as the omni-competent kernel of any big data architecture. We adopted Spark as our core technology at Barclays around that time, and it was considered a bold (read 'rash') move. Now it is taken as a given that Spark is your starting point for any big data science project.
As data science has developed both as an activity and as an accepted term, there has been much talk about the unicorn data scientist. This is the unlikely character who can do both the maths and the coding. They are apparently hard to find, and harder to keep. My team likes to think more in terms of three data science competencies: pattern recognition, distributed computation, and automation. If data science is about exploiting insights from data in production, then you need to be able to develop applications with these three competencies in mind from the start. There is no point using a machine learning methodology that won’t scale with your data, or building an analytical kernel that needs to be re-coded to be production quality. And so you need either a unicorn or a unicorn-team (my preference) to do the work.
Spark is your unicorn technology. No other technology expresses analytical concepts as elegantly, moves as effortlessly from the small scale to big data, and facilitates production-ready code as naturally as Spark (with the Scala API). We chose Spark because we could compose a model in a few lines, run the same code on the cluster as we had tried out on the laptop, and build robust unit-tested JVM applications that we could be confident would run in business-critical use cases. The combination of functional programming in Scala with the Spark abstractions is uniquely powerful, and choosing it has been a significant cause of the success of the team over the last 3 years.
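As a purely hypothetical illustration of that point, and not an example from this book, the following few lines of Scala compose a small MLlib pipeline; the identical code runs on a laptop or a cluster, with only the --master setting supplied at submit time differing. The input path (training.csv) and the column names (f1, f2, f3, label) are invented for the sketch.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("laptop-or-cluster")          // same code everywhere; only --master changes
  .getOrCreate()

// Load a (hypothetical) training set with a header row
val training = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("training.csv")

// Assemble the (hypothetical) feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

// A simple classifier reading the assembled features and a "label" column
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

// Compose and fit the model as one pipeline
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

The point is not the particular model but the workflow: the same unit-testable JVM code moves from laptop exploration to cluster production without being rewritten.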
So here's the conundrum. Why are there no books that present Spark in this way, recognizing that one of the best reasons to work in Spark is its application to production data science? If you scan the bookshelves (or look at tutorials online), all you will find are toy models and reviews of the Spark APIs and libraries. You will find little or nothing about how Spark fits into the wider architecture, or about how to manage data ETL in a sustainable way.
I think you will find that the practical approach taken by the authors in this book is different. Each chapter takes on a new challenge, and each reads as a voyage of discovery where the outcome was not necessarily known in advance of the exploration. And the value of doing data science properly is set out clearly from the start. This is one of the first books on Spark for grown-ups who want to do real data science that will make an impact on their organisation. I hope you enjoy it.
Harry Powell
Head of Advanced Analytics, Barclays

About the Authors

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has designed systems for some of its most prestigious players and their global clients, often on large, complex, and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist and the inventor of the TrendCalculus algorithm, which he developed as part of his ongoing research into long-range predictions based on applying machine learning to the patterns found in drifting cultural, geopolitical, and economic trends. He sits on the Hadoop Summit EU data science selection committee, has spoken at many conferences on a variety of data topics, and enjoys participating in the data science and big data communities where he lives in London.
This book is dedicated to my wife Steffy, to my children Alice and Adele, and to all my friends and colleagues who have been endlessly supportive. It is also dedicated to the memory of one of my earliest mentors, Professor Ferenc Csillag, under whom I studied at the University of Toronto. Back in 1994, Ferko inspired me with visions of a future where we could use planet-wide data collection and sophisticated algorithms to monitor and optimize the world around us. It was an idea that changed my life, and his dream of a world saved by big data science is one I'm still chasing.
Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights comes mainly from his background in theoretical physics. Graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data in the early days of Hadoop. He has embraced big data technologies ever since, and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (out of more than 2,000 competitors), and additionally won the Innovation Award using the methodologies and technologies explained in this book.
I would like to thank my wife for standing beside me; she has been my motivation for continuing to improve my knowledge and move my career forward. I thank my wonderful kids for always teaching me to step back whenever it is necessary to clear my mind and get fresh new ideas.
I would like to extend my thanks to my co-workers, especially Dr. Samuel Assefa, Dr. Eirini Spyropoulou, and Will Hardman, for their patience in listening to my crazy theories, and to everyone else I have had the pleasure to work with over the past few years. Finally, I want to give special thanks to all my previous managers and mentors who helped me shape my career in data and analytics; thanks to Manu, Toby, Gary, and Harry.
David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity.
Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.
For Ellie, Shannon, Pauline and Pumpkin – here’s to the sequel!
Matthew Hallett is a Software Engineer and Computer Scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mi...

Table of contents