eBook - ePub

Apache Spark for Data Science Cookbook

Name: Apache Spark for Data Science Cookbook
ISBN: 9781785880100

Padma Priya Chitturi

392 pages
English

About this book

Over 90 insightful recipes to get lightning-fast analytics with Apache Spark

About This Book

Use Apache Spark for data processing with these hands-on recipes
Implement end-to-end, large-scale data analysis better than ever before
Work with powerful libraries such as MLlib, SciPy, NumPy, and Pandas to gain insights from your data

Who This Book Is For

This book is for novice and intermediate level data science professionals and data analysts who want to solve data science problems with a distributed computing framework. Basic experience with data science implementation tasks is expected. Data science professionals looking to skill up and gain an edge in the field will find this book helpful.

What You Will Learn

Explore the topics of data mining, text mining, Natural Language Processing, information retrieval, and machine learning.
Solve real-world analytical problems with large data sets.
Address data science challenges with analytical tools on a distributed system like Spark (apt for iterative algorithms), which offers in-memory processing and more flexibility for data analysis at scale.
Get hands-on experience with algorithms like classification, regression, and recommendation on real datasets using the Spark MLlib package.
Learn about numerical and scientific computing using NumPy and SciPy on Spark.
Use Predictive Model Markup Language (PMML) in Spark for statistical data mining models.

In Detail

Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. Spark's selling point is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. It lets you tackle the complexities that come with raw unstructured data sets with ease.

This guide will get you comfortable and confident performing data science tasks with Spark. You will learn about implementations including distributed deep learning, numerical computing, and scalable machine learning. You will be shown effective solutions to problematic concepts in data science using Spark's data science libraries such as MLlib, Pandas, NumPy, SciPy, and more. These simple and efficient recipes will show you how to implement algorithms and optimize your work.

Style and approach

This book contains a comprehensive range of recipes designed to help you learn the fundamentals and tackle the difficulties of data science. This book outlines practical steps to produce powerful insights into Big Data through a recipe-based approach.

Publisher: Packt Publishing
Year: 2016
eBook ISBN: 9781785880100
Topic: Computer Science
Subtopic: Data Modelling & Design
Index: Computer Science

Apache Spark for Data Science Cookbook

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2016

Production reference: 1161216

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-78588-010-0

www.packtpub.com

Credits

Author: Padma Priya Chitturi
Reviewer: Roberto Corizzo
Commissioning Editor: Akram Hussain
Acquisition Editors: Vinay Argekar, Manish Nainani
Content Development Editor: Sumeet Sawant
Technical Editor: Deepti Tuscano
Copy Editor: Safis Editing
Project Coordinator: Shweta H Birwatkar
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Disha Haria
Production Coordinator: Arvindkumar Gupta

About the Author

Padma Priya Chitturi is an Analytics Lead at Fractal Analytics Pvt Ltd and has over five years of experience in Big Data processing. Currently, she is part of capability development at Fractal and is responsible for developing solutions to large-scale analytical problems across multiple business domains. Prior to this, she worked for an airline product on a real-time processing platform serving one million user requests/sec at Amadeus Software Labs. She has worked on realizing large-scale deep networks (Jeffrey Dean's work at Google Brain) for image classification on the big data platform Spark. She works closely with Big Data technologies such as Spark, Storm, Cassandra, and Hadoop. She was an open source contributor to Apache Storm.

First, I would like to thank the Packt Publishing team for giving me a great opportunity to take part in this exciting journey. I would also like to express my special thanks and gratitude to my family, friends, and colleagues, who have been very supportive and helped me finish this project on time.

About the Reviewer

Roberto Corizzo is a PhD student at the Department of Computer Science, University of Bari, Italy. His research interests include Big Data analytics, data mining, and predictive modeling techniques for sensor networks. He has served as a technical reviewer for Packt's Learning Hadoop 2 and Learning Python Web Penetration Testing video courses.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thank you for purchasing this Packt book. We take our commitment to improving our content and products to meet your needs seriously – that's why your feedback is so valuable. Whatever your feelings about your purchase, please consider leaving a review on this book's Amazon page. Not only will this help us, more importantly it will also help others in the community to make an informed decision about the resources that they invest in to learn. You can also review for us on a regular basis by joining our reviewers club. If you're interested in joining, or would like to learn more about the benefits we offer, please contact us: [email protected].

Preface

In recent years, the volume of data being collected, stored, and analyzed has exploded, in particular in relation to the activity on the Web and mobile devices, as well as data from the physical world collected via sensor networks. While previously large-scale data storage, processing, analysis, and modeling was the domain of the largest institutions such as Google, Yahoo!, Facebook, and Twitter, increasingly, many organizations are being faced with the challenge of how to handle a massive amount of data.

With the advent of big data, extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but also the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.

The objective of this book is to give readers a flavor of the challenges in data science and to address them with a variety of analytical tools on a distributed system such as Spark (apt for iterative algorithms), which offers in-memory processing and more flexibility for data analysis at scale. This book introduces readers to the fundamentals of Spark and helps them learn the concepts with code examples. It also briefly covers data mining, text mining, NLP, machine learning, and so on. Readers learn how to solve real-world analytical problems with large datasets, using a practical approach and code for analytical tools that leverage the features of Spark.

What this book covers

Chapter 1, Big Data Analytics with Spark, introduces how Scala, Python, and R can be used for data analysis. It also details the Spark programming model and API, and shows how to install Spark, set up a development environment for the Spark framework, and run jobs in distributed mode. It also shows how to work with DataFrames and Streaming computation models.

Chapter 2, Tricky Statistics with Spark, shows how to apply various statistical techniques, such as generating sample data, constructing frequency tables, and computing summary and descriptive statistics, on large datasets using Spark and Pandas.

Chapter 3, Data Analysis with Spark, details how to apply common data exploration and preparation techniques using Spark, such as univariate analysis, bivariate analysis, missing value treatment, outlier identification, and variable transformation.

Chapter 4, Clustering, Classification and Regression, deals with creating models for regression, classification, and clustering, and shows how to apply standard performance-evaluation methodologies to the machine learning models built.

Chapter 5, Working with Spark MLlib, provides an overview of Spark MLlib and ML pipelines and presents examples for implementing Naive Bayes classification, ...
