eBook - ePub

Mastering Text Mining with R

Name: Mastering Text Mining with R
Author: Ashish Kumar, Avinash Paul

258 pages
English
ePUB (mobile-friendly)
Available on iOS and Android

About this book

Master text-taming techniques and build effective text-processing applications with R

About This Book

Develop all the relevant skills for building text-mining apps in R with this easy-to-follow guide
Gain in-depth understanding of the text mining process with lucid implementation in the R language
Example-rich guide that lets you gain high-quality information from text data

Who This Book Is For

If you are an R programmer, analyst, or data scientist who wants to gain experience in performing text data mining and analytics with R, then this book is for you. Exposure to working with statistical methods and language processing would be helpful.

What You Will Learn

Get acquainted with some of the highly efficient R packages such as OpenNLP and RWeka to perform various steps in the text mining process
Access and manipulate data from different sources such as JSON and HTTP
Process text using regular expressions
Get to know the different approaches of tagging texts, such as POS tagging, to get started with text analysis
Explore different dimensionality reduction techniques, such as Principal Component Analysis (PCA), and understand their implementation in R
Discover the underlying themes or topics that are present in an unstructured collection of documents, using common topic models such as Latent Dirichlet Allocation (LDA)
Build a baseline sentence completing application
Perform entity extraction and named entity recognition using R

In Detail

Text mining (also called text data mining or text analytics) is the process of extracting useful, high-quality information from text by identifying patterns and trends. R provides an extensive ecosystem for mining text through its many frameworks and packages.

Starting with basic information about the statistics concepts used in text mining, this book will teach you how to access, cleanse, and process text using the R language and will equip you with the tools and the associated knowledge about different tagging, chunking, and entailment approaches and their usage in natural language processing. Moving on, this book will teach you different dimensionality reduction techniques and their implementation in R. Next, we will cover pattern recognition in text data utilizing classification mechanisms, perform entity recognition, and develop an ontology learning framework.

By the end of the book, you will develop a practical application from the concepts learned, and will understand how text mining can be leveraged to analyze the massively available data on social media.

Style and approach

This book takes a hands-on, example-driven approach to the text mining process with lucid implementation in R.

Information

Publisher: Packt Publishing
Year: 2016
ISBN: 9781783551811
Topic: Computer Science
Topic: Data mining

Mastering Text Mining with R

Credits

About the Authors

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Statistical Linguistics with R

Probability theory and basic statistics

Probability space and event

Theorem of compound probabilities

Conditional probability

Bayes' formula for conditional probability

Independent events

Random variables

Discrete random variables

Continuous random variables

Probability frequency function

Probability distributions using R

Cumulative distribution function

Joint distribution

Binomial distribution

Poisson distribution

Counting occurrences

Zipf's law

Heaps' law

Lexical richness

Lexical variation

Lexical density

Lexical originality

Lexical sophistication

Language models

N-gram models

Markov assumption

Hidden Markov models

Quantitative methods in linguistics

Document term matrix

Inverse document frequency

Words similarity and edit-distance functions

Euclidean distance

Cosine similarity

Levenshtein distance

Damerau-Levenshtein distance

Hamming distance

Jaro-Winkler distance

Measuring readability of a text

Gunning fog index

R packages for text mining

OpenNLP

RWeka

RcmdrPlugin.temis

languageR

koRpus

RKEA

maxent

lsa

Summary

2. Processing Text

Accessing text from diverse sources

File system

PDF documents

Microsoft Word documents

HTML

XML

JSON

HTTP

Databases

Processing text using regular expressions

Tokenization and segmentation

Word tokenization

Operations on a document-term matrix

Sentence segmentation

Normalizing texts

Lemmatization and stemming

Stemming

Lemmatization

Synonyms

Lexical diversity

Analyse lexical diversity

Calculate lexical diversity

Readability

Automated readability index

Language detection

Summary

3. Categorizing and Tagging Text

Parts of speech tagging

POS tagging with R packages

Hidden Markov Models for POS tagging

Basic definitions and notations

Implementing HMMs

Viterbi underflow

Forward algorithm underflow

OpenNLP chunking

Chunk tags

Collocation and contingency tables

Extracting co-occurrences

Surface co-occurrence

Textual co-occurrence

Syntactic co-occurrence

Co-occurrence in a document

Quantifying the relation between words

Contingency tables

Detailed analysis on textual collocations

Feature extraction

Synonymy and similarity

Multiwords, negation, and antonymy

Concept similarity

Path length

Resnik similarity

Lin similarity

Jiang-Conrath distance

Summary

4. Dimensionality Reduction

The curse of dimensionality

Distance concentration and computational infeasibility

Dimensionality reduction

Principal component analysis

Using R for PCA

Understanding the FactoMineR package

Amap package

Proportion of variance

Scree plot

Reconstruction error

Correspondence analysis

Canonical correspondence analysis

Pearson's Chi-squared test

Multiple correspondence analysis

Implementation of SVD using R

Summary

5. Text Summarization and Clustering

Topic modeling

Latent Dirichlet Allocation

Correlated topic model

Model selection

R package for topic modeling

Fitting the LDA model with the VEM algorithm

Latent semantic analysis

R package for latent semantic analysis

Illustrative example of LSA

Text clustering

Document clustering

Feature selection for text clustering

Mutual information

Chi-square statistic feature selection

Frequency-based feature selection

Sentence completion

Summary

6. Text Classification

Text classification

Document representation

Feature hashing

Classifiers – inductive learning

Tree-based learning

Bayesian classifiers: Naive Bayes classification

K-Nearest neighbors

Kernel methods

Support vector machines

Kernel trick

How to apply SVM on a real world example?

Number of instances is significantly larger than the number of dimensions

Maximum entropy classifier

Maxent implementation in R

RTextTools: a text classification framework

Model evaluation

Confusion matrix

ROC curve

Precision-recall

Bias–variance trade-off and learning curve

Bias-variance decomposition

Learning curve

Dealing with reducible error components

Cross validation

Leave-one-out

k-Fold

Bootstrap

Stratified

Summary

7. Entity Recognition

Entity extraction

The rule-based approach

Machine learning

Sentence boundary detection

Word token annotator

Named entity recognition

Training a model with new features

Summary

Index
