eBook - ePub

Clojure for Data Science

Name: Clojure for Data Science
ISBN: 9781784397180

Henry Garner

608 pages
English
ePUB (mobile friendly)

About this book

Statistics, big data, and machine learning for Clojure programmers

About This Book

Write code using Clojure to harness the power of your data
Discover the libraries and frameworks that will help you succeed
A practical guide to understanding how the Clojure programming language can be used to derive insights from data

Who This Book Is For

This book is aimed at developers who are already productive in Clojure but who are overwhelmed by the breadth and depth of understanding required to be effective in the field of data science. Whether you're tasked with delivering a specific analytics project or simply suspect that you could be deriving more value from your data, this book will inspire you with the opportunities, and inform you of the risks, that exist in data of all shapes and sizes.

What You Will Learn

Perform hypothesis testing and understand feature selection and statistical significance to interpret your results with confidence
Implement the core machine learning techniques of regression, classification, clustering and recommendation
Understand the value of simple statistics and distributions in exploratory data analysis
Scale algorithms to web-sized datasets efficiently using distributed programming models on Hadoop and Spark
Apply suitable analytic approaches for text, graph, and time series data
Interpret the terminology that you will encounter in technical papers
Import libraries from other JVM languages such as Java and Scala
Communicate your findings clearly and convincingly to nontechnical colleagues

In Detail

The term "data science" has been widely used to describe a new profession that is expected to interpret vast datasets and translate them into improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. With its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist's diverse needs.

Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you'll see how to make use of Clojure's Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don't yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language's flexibility!

You'll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You'll also use Incanter for local data processing and ClojureScript to present interactive visualizations. Along the way, you'll understand how distributed platforms such as Hadoop and Spark's MapReduce and GraphX's BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models.

Above all, by following the explanations in this book, you'll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work, so that you can continue to be productive as the field evolves.

Style and approach

This is a practical guide to data science that teaches theory by example through the libraries and frameworks accessible from the Clojure programming language.

Information

Publisher: Packt Publishing
Year: 2015
eBook ISBN: 9781784397180
Topic: Business
Subtopic: Business Intelligence
Index: Business

Clojure for Data Science

Credits

About the Author

Acknowledgments

About the Reviewer

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Statistics

Downloading the sample code

Running the examples

Downloading the data

Inspecting the data

Data scrubbing

Descriptive statistics

The mean

Interpreting mathematical notation

The median

Variance

Quantiles

Binning data

Histograms

The normal distribution

The central limit theorem

Poincaré's baker

Generating distributions

Skewness

Quantile-quantile plots

Comparative visualizations

Box plots

Cumulative distribution functions

The importance of visualizations

Visualizing electorate data

Adding columns

Adding derived columns

Comparative visualizations of electorate data

Visualizing the Russian election data

Comparative visualizations

Probability mass functions

Scatter plots

Scatter transparency

Summary

2. Inference

Introducing AcmeContent

Download the sample code

Load and inspect the data

Visualizing the dwell times

The exponential distribution

The distribution of daily means

The central limit theorem

Standard error

Samples and populations

Confidence intervals

Sample comparisons

Bias

Visualizing different populations

Hypothesis testing

Significance

Testing a new site design

Performing a z-test

Student's t-distribution

Degrees of freedom

The t-statistic

Performing the t-test

Two-tailed tests

One-sample t-test

Resampling

Testing multiple designs

Calculating sample means

Multiple comparisons

Introducing the simulation

Compile the simulation

The browser simulation

jStat

Scalable Vector Graphics

Plotting probability densities

State and Reagent

Updating state

Binding the interface

Simulating multiple tests

The Bonferroni correction

Analysis of variance

The F-distribution

The F-statistic

The F-test

Effect size

Cohen's d

Summary

3. Correlation

About the data

Inspecting the data

Visualizing the data

The log-normal distribution

Visualizing correlation

Jittering

Covariance

Pearson's correlation

Sample r and population rho

Hypothesis testing

Confidence intervals

Regression

Linear equations

Residuals

Ordinary least squares

Slope and intercept

Interpretation

Visualization

Assumptions

Goodness-of-fit and R-square

Multiple linear regression

Matrices

Dimensions

Vectors

Construction

Addition and scalar multiplication

Matrix-vector multiplication

Matrix-matrix multiplication

Transposition

The identity matrix

Inversion

The normal equation

More features

Multiple R-squared

Adjusted R-squared

Incanter's linear model

The F-test of model significance

Categorical and dummy variables

Relative power

Collinearity

Multicollinearity

Prediction

The confidence interval of a prediction

Model scope

The final model

Summary

4. Classification

About the data

Inspecting the data

Comparisons with relative risk and odds

The standard error of a proportion

Estimation using bootstrapping

The binomial distribution

The standard error of a proportion formula

Significance testing proportions

Adjusting standard errors for large samples

Chi-squared multiple significance testing

Visualizing the categories

The chi-squared test

The chi-squared statistic

The chi-squared test

Classification with logistic regression

The sigmoid function

The logistic regression cost function

Parameter optimization with gradient descent

Gradient descent with Incanter

Convexity

Implementing logistic regression with Incanter

Creating a feature matrix

Evaluating the logistic regression classifier

The confusion matrix

The kappa statistic

Probability

Bayes theorem

Bayes theorem with multiple predictors

Naive Bayes classification

Implementing a naive Bayes classifier

Evaluating the naive Bayes classifier

Comparing the logistic regression and naive Bayes approaches

Decision trees

Information

Entropy

Information gain

Using information gain to identify the best predictor

Recursively building a decision tree

Using the decision tree for classification

Evaluating the decision tree classifier

Classification with clj-ml

Loading data with clj-ml

Building a decision tree in clj-ml

Bias and variance

Overfitting

Cross-validation

Addressing high bias

Ensemble learning and random forests

Bagging and boosting

Saving the classifier to a file

Summary

5. Big Data

Downloading the code and data

Inspecting the data

Counting the records

The reducers library

Parallel folds with reducers

Loading large files with iota

Creating a reducers processing pipeline

Curried reductions with reducers

Statistical folds with reducers

Associativity

Calculating the mean using fold

Calculating the variance using fold

Mathematical folds with Tesser

Calculating covariance with Tesser

Commutativity

Simple linear regression with Tesser

Calculating a correlation matrix

Multiple regression with gradient descent

The gradient descent update rule

The gradient descent learning rate

Feature scaling

Feature extraction

Creating a custom Tesser fold

Creating a matrix-sum fold

Calculating the total model error

Creating a matrix-mean fold

Applying a single step of gradient descent

Running iterative gradient descent

Scaling gradient descent with Hadoop

Gradient descent on Hadoop with Tesser and Parkour

Parkour distributed sources and sinks

Running a feature scale fold with Hadoop

Running gradient descent with Hadoop

Preparing our code for a Hadoop cluster

Building an uberjar

Submitting the uberjar to Hadoop

Stochastic gradient descent

Stochastic gradient descent with Parkour

Defining a mapper

Parkour shaping functions

Defining a reducer

Specifying Hadoop jobs with Parkour graph

Chaining mappers and reducers with Parkour graph

Summary

...
