Hands-On Data Science with R

Techniques to perform data manipulation and mining to build smart analytical models using R

Vitor Bianchi Lanzetta, Nataraj Dasgupta, Ricardo Anjoleto Farias

  1. 420 pages
  2. English

About This Book

A hands-on guide for professionals to perform various data science tasks in R

Key Features

  • Explore the popular R packages for data science
  • Use R for efficient data mining, text analytics and feature engineering
  • Become a thorough data science professional with the help of hands-on examples and use-cases in R

Book Description

R is one of the most widely used programming languages for statistical computing, and when applied to data science it helps tackle the complexities of unstructured, real-world datasets. This book covers the entire data science ecosystem for aspiring data scientists, taking you from zero to a level where you are confident enough to get hands-on with real-world data science problems.

The book starts with an introduction to data science and introduces readers to popular R libraries for executing data science routine tasks. This book covers all the important processes in data science such as data gathering, cleaning data, and then uncovering patterns from it. You will explore algorithms such as machine learning algorithms, predictive analytical models, and finally deep learning algorithms. You will learn to run the most powerful visualization packages available in R so as to ensure that you can easily derive insights from your data.

Towards the end, you will also learn how to integrate R with Spark and Hadoop and perform large-scale data analytics without much complexity.

What you will learn

  • Understand the R programming language and its ecosystem of packages for data science
  • Obtain and clean your data before processing
  • Master essential exploratory techniques for summarizing data
  • Examine various machine learning prediction models
  • Explore the H2O analytics platform in R for deep learning
  • Apply data mining techniques to available datasets
  • Work with interactive visualization packages in R
  • Integrate R with Spark and Hadoop for large-scale data analytics

Who this book is for

If you are a budding data scientist keen to learn about R's popular data science packages, or a developer looking to step into the world of data analysis, this book is the ideal resource you need to get started. Some programming experience will be helpful to get the most out of this book.


Machine Learning with R

"What we want is a machine that can learn from experience."
– Alan Turing
Machine learning is an interdisciplinary field that involves computer science, neurocomputing, statistics, and more. The idea of machines actually learning can be dated back to Alan Turing and the beginning of artificial intelligence (AI). Although the foundations of machine learning can be found earlier in Turing's writings, it was not until 1959 that the term machine learning was coined by the computer scientist Arthur Samuel.
Although such ideas were circulating before the 21st century, machine learning only became popular in that century's first decades; since then, its reputation has skyrocketed. There are many reasons for this (machine learning is extremely useful, after all), but I would mostly point to two.
First, there is data volume. Huge volumes of data are being produced every day, everywhere. Processing all this information called for new and much more efficient approaches, and machine learning methods aim to solve this problem. Some of these methods are data-hungry, and practically all of them can handle both linear and non-linear relations.
The second reason is feasibility. Algorithms and computing power have improved rapidly, allowing machines to learn from large datasets in a reasonable time. This chapter is designed to introduce readers to the world of machine learning while establishing some parallels with traditional statistics. The chapter also demonstrates how to fit several machine learning models in practice using R.
The reader may feel that too much attention is given to unsupervised learning rather than supervised learning. This was intentional, given that later chapters discuss supervised learning methods more carefully.
Here is what can be found in this chapter:
  • Which big companies are using machine learning
  • Linear regression with base R
  • Building decision trees with tree and rpart
  • Random forest, bagging, and boosting methods
  • Training support vector machines (SVM) with caret
  • Building feedforward neural networks using h2o
There are several machine learning models already available for R users. In this chapter, quite a few of them will be discussed in a practical manner. But what is machine learning? There are many definitions. The next section defines machine learning and briefly discusses its uses.

What is machine learning?

What do we mean by machine learning? It's an interdisciplinary subject concerned with the development, comprehension, and application of computational methods meant to learn and generalize from datasets; it's usually related, but not limited, to big data. Machine learning encompasses an ever-growing family of methods, suitable for overcoming a wide range of problems.
I deeply appreciate how it has been used to fight junk email. The way it suggests replies to emails (the ones that are hardly ever spam) has proved to be of enormous help too.
Such a great ability to solve problems has certainly attracted big companies and tech geeks all over the world.

Machine learning everywhere

Netflix uses machine learning to give you personal recommendations of content to watch; Amazon uses machine learning to recommend products to buy based on what you've already bought. These are the so-called recommenders. They are usually (but not only) built using clustering techniques.
Machine learning techniques have also been used to diagnose illnesses. Aside from the application of clustering in cancer diagnosis already mentioned in Chapter 4, KDD, Data Mining, and Text Mining, neural networks can be trained to read various medical exams and even predict how likely a patient is to develop certain kinds of diseases. This field is called predictive medicine, and it benefits greatly from machine learning advancements.
Saving endangered species is yet another wonderful use of machine learning. Researchers from the University of Southern California Center for AI in Society have trained a neural network to detect illegal hunters who set foot in national parks in Zimbabwe and Malawi. This system is designed to distinguish hunters from animals using heat signatures and was named Systematic POacher deTector (SPOT).
There are also unconventional uses of machine learning models. Some folks are using them to compose songs, write poems, and draw figures.
Tech workers, such as Zach Lubarsky and Ethan Phelps-Goodman, are actively engaging in data-driven campaigns to solve social issues. Lubarsky and Phelps-Goodman belong to the Seattle Tech 4 Housing organization, a community dedicated to improving Seattle's housing affordability.
A quick web search will tell you that there are as many real-world applications of machine learning as there are stars in the sky. Talking about stars, how do you think the galaxy-sized datasets generated by astronomers are being processed? That's right, machine learning.
This collection of methods can be separated into two classes: unsupervised (unlabeled) and supervised (labeled) learning. For the former, there is no target value to fit the models against; hierarchical clustering is a good example. The objective of unsupervised learning is usually, but not always, to extract features from data rather than to produce actual forecasts.
Next, we will look at how traditional statistics connects to machine learning. There are many clear connections linking the two streams. To mention one, regressions from traditional statistics can also be seen in machine learning applications. Ronald Fisher, a renowned statistician, is recognized by some people as being among the first individuals to use machine learning.
Supervised learning models are trained to target one or more variables; hence you need labeled data. Recurrent neural networks (RNNs) can be cited as a supervised learning technique. Although practical examples for both classes are provided in this chapter, more attention is given to unsupervised learning, since supervised learning is the focus of later chapters such as Chapter 8, Neural Networks and Deep Learning.
Although many concepts adopted in the machine learning field are essentially the same as the ones that arose from traditional statisticians and forecasters, machine learning has a vocabulary of its own. The differences may have originated from the main proponents of the field being closer to computing than to statistics.
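The contrast between the two classes can be sketched with a few lines of base R. This is only an illustration under stated assumptions: the built-in iris dataset, the choice of three clusters, and the use of logistic regression (glm) as the supervised model are picked here for convenience and are not taken from the chapter.

```r
# Unsupervised vs. supervised learning on the built-in iris data (base R only).
# Dataset, k = 3, and the glm model are illustrative assumptions.
data(iris)
features <- iris[, 1:4]

# Unsupervised: hierarchical clustering never sees the Species labels.
tree     <- hclust(dist(scale(features)), method = "complete")
clusters <- cutree(tree, k = 3)
print(table(cluster = clusters))

# Supervised: logistic regression is fitted against a labeled target.
two_class <- droplevels(subset(iris, Species != "setosa"))
model <- glm(Species ~ Petal.Length + Petal.Width,
             data = two_class, family = binomial)

# With a two-level factor response, glm models the second level ("virginica").
prob      <- predict(model, type = "response")
predicted <- ifelse(prob > 0.5, "virginica", "versicolor")
print(table(actual = two_class$Species, predicted = predicted))
```

The clustering step only groups similar rows, whereas the glm step needs the Species column to learn from, which is exactly the labeled/unlabeled distinction described above.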
There is no downside to learning this vocabulary. A great way to do so is to relate machine learning terms to statistical ones. Moving on to the next section, we can see how many core ideas from machine learning can be somehow translated into statistical concepts.

Machine learning vocabulary

At the end of the last section, we hypothesized why machine learning managed to diverge in vocabulary from statistics. Let me begin this section by discussing why the core ideas converge in essence. Many statistical methods aim to praevidere (from the Latin prae, before, and videre, to see), that is, to see something before it actually happens, or simply, to predict.
Prediction tasks, like other pattern recognition duties, often require a very sharp ability to comprehend data and generalize well to yet unseen information. This shared goal drove the distinct efforts of traditional statistics and machine learning to many common places. Also, the ability of statistics to frame all sorts of events in a probabilistic way makes it very useful to machine learning, which could be another source of shared ground across the different fields, not to mention the interdisciplinary nature of machine learning.
Whatever the reason, machine learning vocabulary can be adapted and understood through statistics. This translation makes it especially easy for lovers of statistics to master machine learning and vice versa. The paper Neural Networks and Statistical Models, written by Warren S. Sarle and published in 1994, showed how machine learning jargon could be related to statistical jargon. Here are some of the correspondences:
Statistical jargon    | Machine learning correspondent
Model estimation      | Model training or learning
Estimation criteria   | Cost function
Variables             | Features
Independent variables | Inputs
Predicted values      | Outputs
Dependent variables   | Training or target values
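To make these correspondences concrete, here is a minimal sketch in base R. It assumes the built-in mtcars dataset and an arbitrary formula (mpg ~ wt + hp); the helper mse() is a hypothetical name introduced only for this illustration, with mean squared error standing in for the cost function.

```r
# Statistical jargon mapped onto machine learning jargon, using base R only.
# mtcars ships with R; the formula mpg ~ wt + hp is purely illustrative.
data(mtcars)

inputs  <- mtcars[, c("wt", "hp")]  # independent variables -> inputs (features)
targets <- mtcars$mpg               # dependent variable    -> target values

# Model estimation -> model training (ordinary least squares via lm)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Predicted values -> outputs
outputs <- predict(fit, newdata = inputs)

# Estimation criterion -> cost function: here, mean squared error
# (mse is a hypothetical helper defined for this sketch)
mse <- function(actual, predicted) mean((actual - predicted)^2)
training_cost <- mse(targets, outputs)

print(coef(fit))
print(training_cost)
```

Each object name mirrors the right-hand column of the table, while the comments give the statistical term it corresponds to.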
Now that we acknowledge the existence of a link between statistics and machine learning, it is time to take a practical tour through the traditional methods of linear regression given by statistics using our beloved R, but not before examining the general tasks that machine learning is up to.

Generic problems solved by machine learning

Whether a problem can be solved through machine learning is only a matter of how much data, creativity, and computational power one has. Machine learning can be used to aid diagnosis, draw recommendations, classify stellar objects, protect animal life, and tackle social issues.
It can likewise be used to detect frauds, such as fraudulent credit card t...

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. About Packt
  4. Contributors
  5. Preface
  6. Getting Started with Data Science and R
  7. Descriptive and Inferential Statistics
  8. Data Wrangling with R
  9. KDD, Data Mining, and Text Mining
  10. Data Analysis with R
  11. Machine Learning with R
  12. Forecasting and ML App with R
  13. Neural Networks and Deep Learning
  14. Markovian in R
  15. Visualizing Data
  16. Going to Production with R
  17. Large Scale Data Analytics with Hadoop
  18. R on Cloud
  19. The Road Ahead
  20. Other Books You May Enjoy