Mastering Text Mining with R
eBook - ePub

Mastering Text Mining with R

Ashish Kumar, Avinash Paul

  1. 258 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

Master text-taming techniques and build effective text-processing applications with R

  • Develop all the relevant skills for building text-mining apps with R with this easy-to-follow guide
  • Gain in-depth understanding of the text mining process with lucid implementation in the R language
  • Example-rich guide that lets you gain high-quality information from text data

Who This Book Is For

If you are an R programmer, analyst, or data scientist who wants to gain experience in performing text data mining and analytics with R, then this book is for you. Exposure to working with statistical methods and language processing would be helpful.

What You Will Learn

  • Get acquainted with some of the highly efficient R packages such as OpenNLP and RWeka to perform various steps in the text mining process
  • Access and manipulate data from different sources such as JSON and HTTP
  • Process text using regular expressions
  • Get to know the different approaches of tagging texts, such as POS tagging, to get started with text analysis
  • Explore different dimensionality reduction techniques, such as Principal Component Analysis (PCA), and understand their implementation in R
  • Discover the underlying themes or topics that are present in an unstructured collection of documents, using common topic models such as Latent Dirichlet Allocation (LDA)
  • Build a baseline sentence-completion application
  • Perform entity extraction and named entity recognition using R

In Detail

Text mining (also called text data mining or text analytics) is the process of extracting useful, high-quality information from text by identifying patterns and trends. R provides an extensive ecosystem for mining text through its many frameworks and packages.

Starting with basic information about the statistics concepts used in text mining, this book will teach you how to access, cleanse, and process text using the R language and will equip you with the tools and the associated knowledge about different tagging, chunking, and entailment approaches and their usage in natural language processing. Moving on, this book will teach you different dimensionality reduction techniques and their implementation in R. Next, we will cover pattern recognition in text data utilizing classification mechanisms, perform entity recognition, and develop an ontology learning framework.

By the end of the book, you will develop a practical application from the concepts learned, and will understand how text mining can be leveraged to analyze the vast amount of data available on social media.

Style and approach

This book takes a hands-on, example-driven approach to the text mining process with lucid implementation in R.

Information

Year
2016
ISBN
9781783551811
Edition
1
Subtopic
Data mining

Table of Contents

Mastering Text Mining with R
Credits
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Statistical Linguistics with R
Probability theory and basic statistics
Probability space and event
Theorem of compound probabilities
Conditional probability
Bayes' formula for conditional probability
Independent events
Random variables
Discrete random variables
Continuous random variables
Probability frequency function
Probability distributions using R
Cumulative distribution function
Joint distribution
Binomial distribution
Poisson distribution
Counting occurrences
Zipf's law
Heaps' law
Lexical richness
Lexical variation
Lexical density
Lexical originality
Lexical sophistication
Language models
N-gram models
Markov assumption
Hidden Markov models
Quantitative methods in linguistics
Document term matrix
Inverse document frequency
Words similarity and edit-distance functions
Euclidean distance
Cosine similarity
Levenshtein distance
Damerau-Levenshtein distance
Hamming distance
Jaro-Winkler distance
Measuring readability of a text
Gunning fog index
R packages for text mining
OpenNLP
RWeka
RcmdrPlugin.temis
tm
languageR
koRpus
RKEA
maxent
lsa
Summary
2. Processing Text
Accessing text from diverse sources
File system
PDF documents
Microsoft Word documents
HTML
XML
JSON
HTTP
Databases
Processing text using regular expressions
Tokenization and segmentation
Word tokenization
Operations on a document-term matrix
Sentence segmentation
Normalizing texts
Lemmatization and stemming
Stemming
Lemmatization
Synonyms
Lexical diversity
Analyse lexical diversity
Calculate lexical diversity
Readability
Automated readability index
Language detection
Summary
3. Categorizing and Tagging Text
Parts of speech tagging
POS tagging with R packages
Hidden Markov Models for POS tagging
Basic definitions and notations
Implementing HMMs
Viterbi underflow
Forward algorithm underflow
OpenNLP chunking
Chunk tags
Collocation and contingency tables
Extracting co-occurrences
Surface Co-occurrence
Textual co-occurrence
Syntactic co-occurrence
Co-occurrence in a document
Quantifying the relation between words
Contingency tables
Detailed analysis on textual collocations
Feature extraction
Synonymy and similarity
Multiwords, negation, and antonymy
Concept similarity
Path length
Resnik similarity
Lin similarity
Jiang–Conrath distance
Summary
4. Dimensionality Reduction
The curse of dimensionality
Distance concentration and computational infeasibility
Dimensionality reduction
Principal component analysis
Using R for PCA
Understanding the FactoMineR package
Amap package
Proportion of variance
Scree plot
Reconstruction error
Correspondence analysis
Canonical correspondence analysis
Pearson's Chi-squared test
Multiple correspondence analysis
Implementation of SVD using R
Summary
5. Text Summarization and Clustering
Topic modeling
Latent Dirichlet Allocation
Correlated topic model
Model selection
R Package for topic modeling
Fitting the LDA model with the VEM algorithm
Latent semantic analysis
R Package for latent semantic analysis
Illustrative example of LSA
Text clustering
Document clustering
Feature selection for text clustering
Mutual information
Statistic Chi Square feature selection
Frequency-based feature selection
Sentence completion
Summary
6. Text Classification
Text classification
Document representation
Feature hashing
Classifiers – inductive learning
Tree-based learning
Bayesian classifiers: Naive Bayes classification
K-Nearest neighbors
Kernel methods
Support vector machines
Kernel Trick
How to apply SVM on a real world example?
Number of instances is significantly larger than the number of dimensions
Maximum entropy classifier
Maxent implementation in R
RTextTools: a text classification framework
Model evaluation
Confusion matrix
ROC curve
Precision-recall
Bias–variance trade-off and learning curve
Bias-variance decomposition
Learning curve
Dealing with reducible error components
Cross validation
Leave-one-out
k-Fold
Bootstrap
Stratified
Summary
7. Entity Recognition
Entity extraction
The rule-based approach
Machine learning
Sentence boundary detection
Word token annotator
Named entity recognition
Training a model with new features
Summary
Index

Mastering Text Mining with R

Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any fo...
