Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2016
Production reference: 1270916
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-565-5
www.packtpub.com
Apache Spark is one of the most popular projects in the Hadoop ecosystem and possibly the most actively developed open source project in big data. Its simplicity, performance, and flexibility have made it popular not only among data scientists but also among engineers, developers, and everybody else interested in big data.
With its rising popularity, Duvvuri and Bikram have produced a book that is the need of the hour, Spark for Data Science, but with a difference. They have not only covered the Spark computing platform but have also included aspects of data science and machine learning. To put it in one word: comprehensive.
The book contains numerous code snippets that one can use to learn and to get a jump start in implementing projects. Using these examples, readers also gain good insights and learn the key steps in implementing a data science project: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Venkatraman Laxmikanth
Managing Director
Broadridge Financial Solutions India (Pvt) Ltd
Srinivas Duvvuri is currently Senior Vice President, Development, heading the development teams for the Fixed Income suite of products at Broadridge Financial Solutions (India) Pvt Ltd. He also leads the Big Data and Data Science COE and is the principal member of the Broadridge India Technology Council. He is a self-taught data scientist. Over the past 3 years, the Big Data and Data Science COE has successfully completed multiple POCs, and some of the use cases are moving toward production deployment. He has over 25 years of experience in software product development, spanning multiple domains: financial services, infrastructure management, OLAP, telecom billing and customer care, and CAD/CAM. Prior to Broadridge, he held leadership positions at a startup and at leading IT majors such as CA, Hyperion (Oracle), and Globalstar. He holds a patent in relational OLAP.
Srinivas loves to teach and mentor budding engineers. He has established strong academic connections and interacts with a host of educational institutions. He is an active speaker at various conferences, summits, and meetups on topics such as big data and data science.
Srinivas holds a B.Tech in Aeronautical Engineering and an M.Tech in Computer Science from IIT Madras.
At the outset, I would like to thank VLK, our MD, and Broadridge India for supporting me in this endeavor. I would like to thank my parents, teachers, colleagues, and extended family, who have mentored and motivated me. My thanks to Bikram, who agreed to be the co-author when the proposal to author the book came up. My special thanks to my wife Ratna and my sons Girish and Aravind, who have supported me in completing this book.
I would also like to sincerely thank the editorial team at Packt (Arshriya, Rashmi, and Deepti) and all those, though not mentioned here, who have contributed to this project. Last but not least, I thank our publisher, Packt.
Bikramaditya Singhal is a data scientist with about 7 years of industry experience. He is an expert in statistical analysis, predictive analytics, machine learning, Bitcoin, Blockchain, and programming in C, R, and Python. He has extensive experience in building scalable data analytics solutions in many industry sectors. He also has an active interest in industrial IoT, machine-to-machine communication, decentralized computation through Blockchain, and artificial intelligence.
Bikram currently leads the data science team of the "Digital Enterprise Solutions" group at Tech Mahindra Ltd. He has also worked at companies such as Microsoft India, Broadridge, and Chelsio Communications, and cofounded a company named "Mund Consulting", which focused on big data analytics.
Bikram is an active speaker at various conferences, summits, and meetups on topics such as big data, data science, IIoT, and Blockchain.
I would like to thank my father and my brothers Manoj Agrawal and Sumit Mund for their mentorship. Without learning from them, there is not a chance I could be doing what I do today, and it is because of them and others that I feel compelled to pass my knowledge on to those willing to learn. Special thanks to my mentor and coauthor Srinivas Duvvuri and my friend Priyansu Panda; without their efforts, this book quite possibly would not have happened.
My deepest gratitude to His Holiness Sri Sri Ravi Shankar for making me what I am today. Many thanks and gratitude to my parents and my wife Yashoda for their unconditional love and support.
I would also like to sincerely thank all those, though not mentioned here, who have contributed to this project directly or indirectly.
Daniel Frimer has worked across a wide range of industries, including healthcare, web analytics, and transportation. Across these industries, Daniel has developed ways to optimize the speed of data workflow, storage, and processing with the aim of building a highly efficient department. Daniel is currently a Master's candidate at the University of Washington in Information Sciences, pursuing a specialization in Data Science and Business Intelligence. Daniel also worked on Python Data Science Essentials.
I'd like to thank my grandmother Mary, who has always believed in my potential and everyone else's, and who respects those whose passions make the world a better place.
Priyansu Panda is a research engineer at Underwriters Laboratories, Bangalore, India. He previously worked as a senior system engineer at Infosys Limited and as a software engineer at Tech Mahindra.
His areas of expertise include machine-learning, natural language processing, computer vision, pattern recognition, and heterogeneous distributed data integration. His current research is on applied machine learning for product safety analysis. His major research interests are machine-learning and data-mining applications, artificial intelligence on internet of things, cognitive systems, and clustering research.
Yogesh Tayal is a Technology Consultant at Mu Sigma Business Solutions Pvt. Ltd. and has been with Mu Sigma for more than 3 years. He has worked with the Mu Sigma Business Analytics team and is currently an integral part of the product development team. Mu Sigma is one of the leading decision sciences companies in India, with a large client base of leading corporations across an array of industry verticals such as technology, retail, pharmaceuticals, BFSI, e-commerce, and healthcare.