"What then is, generally speaking, the truth of history? A fable agreed upon. As it has been very ingeniously remarked"
The world is awash with textual data. If you Google, Bing, or Yahoo! how much of that data is unstructured, that is, in a textual format, the estimates range from 80 to 90 percent. The exact number doesn't matter. What matters is that a large proportion of the data is in text format. The implication is that anyone seeking to find insights in that data must develop the capability to process and analyze text.
When I first started out as a market researcher, I used to manually pore through page after page of moderator-led focus group and interview transcripts with the hope of capturing some qualitative insight, an aha moment if you will, and then haggle with fellow team members over whether they had the same insight or not. Then, you would always have that one individual in a project who would swoop in and listen to two interviews—out of the 30 or 40 on the schedule—and, alas, they had their mind made up on what was really happening in the world. Contrast that with the techniques being used now, where an analyst can quickly distill data into meaningful quantitative results, support qualitative understanding, and maybe even sway the swooper.
Over the last few years, I've applied the techniques discussed here to mine physician-patient interactions, understand FDA fears on prescription drug advertising, capture patient concerns about rare cancer, and capture customer maintenance problems, to name just a few. Using R and the methods in this chapter, you too can extract the powerful information in textual data.
There are many different methods to use in text mining. The goal here is to provide a basic framework to apply to such an endeavor. This framework is not inclusive of all the possible methods, but will cover those that are probably the most important for the vast majority of projects that you will work on. Additionally, I will discuss the modeling methods in as succinct and clear a manner as possible, because they can get quite complicated. Gathering and compiling text data is a topic that could take up several chapters. One of the things I prefer and will put forward here is the use of the tidy framework. It will allow us to use tibbles and data frames for most of the steps, and the tidytext functions allow an easy transition to other types of text mining structures, such as a corpus.
The first task is to put the text files into a data frame. With that created, the data preparation can begin with the text transformation.
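As a minimal sketch of that first task, assuming each document is a plain-text file sitting in a single folder, base R can read the files into a data frame with one row per document. The folder, the file contents, and the column names doc_id and text are hypothetical and chosen only for illustration.

```r
# Create two small example files in a temporary folder (hypothetical data).
doc_dir <- file.path(tempdir(), "docs_example")
dir.create(doc_dir, showWarnings = FALSE)
writeLines(c("Hockey is fast.", "I like hockey."), file.path(doc_dir, "doc1.txt"))
writeLines("The turbocharger failed again.", file.path(doc_dir, "doc2.txt"))

# List every .txt file, then collapse each file's lines into one string.
files <- list.files(doc_dir, pattern = "\\.txt$", full.names = TRUE)
docs <- data.frame(
  doc_id = basename(files),
  text = vapply(
    files,
    function(f) paste(readLines(f), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)

print(docs)
```

Keeping one row per document makes it straightforward to apply the transformations that follow column-wise.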
The following list is composed of probably some of the most common and useful transformations for text files:
- Change capital letters to lowercase
- Remove numbers
- Remove punctuation
- Remove stop words
- Remove excess whitespace characters
- Word stemming
- Word replacement
With these transformations, you are creating a more compact dataset and simplifying its structure in order to expose the relationships between the words, thereby leading to increased understanding. However, keep in mind that not all of these transformations are necessary all the time. Judgment must be applied, or you can iterate to find the transformations that make the most sense.
By changing words to lowercase, you can prevent the improper counting of words. Say that you have a count for hockey three times and Hockey once, where it is the first word in a sentence. R will not give you a count of hockey=4, but hockey=3 and Hockey=1.
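The hockey example can be reproduced in a few lines of base R. This is only a sketch; the words vector is made-up data standing in for tokens pulled from a document.

```r
# Hypothetical tokens: "Hockey" appears once, at the start of a sentence.
words <- c("Hockey", "hockey", "hockey", "hockey")

# Case-sensitive counting keeps the two spellings apart.
raw_counts <- table(words)
print(raw_counts)

# Lowercasing first merges them into a single term.
lower_counts <- table(tolower(words))
print(lower_counts)
```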
Removing punctuation also achieves the same purpose, but in some cases, punctuation is important, especially if you want to tokenize your documents by sentences.
In removing stop words, you are getting rid of the common words that have no value; in fact, they are detrimental to the analysis, as their frequency masks important words. Examples of stop words are and, is, the, not, and to.
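A minimal sketch of stop word removal in base R follows. The my_stop_words vector holds only the five examples named above and the tokens are made up; in practice you would likely rely on a fuller list, such as the stop_words dataset that ships with tidytext.

```r
# A tiny, hypothetical stop word list built from the examples in the text.
my_stop_words <- c("and", "is", "the", "not", "to")

# Hypothetical tokens from a single, already lowercased sentence.
tokens <- c("the", "pump", "is", "not", "working", "and",
            "the", "valve", "is", "stuck")

# Keep only the tokens that are not in the stop word list.
kept <- tokens[!tokens %in% my_stop_words]
print(kept)
```

Notice that dropping not changes what the sentence says about the pump, which is one reason to review a stop word list before applying it.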
Removing whitespace makes data more compact by getting rid of things such as tabs, paragraph breaks, double-spacing, and so on.
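The number, punctuation, and whitespace transformations can each be sketched with gsub() and a POSIX character class. The input string is made up, and the order of the steps is just one reasonable choice.

```r
# Hypothetical raw text with a number, punctuation, a tab, and a blank line.
text <- "The  pump failed 3 times,\tthen\n\nrestarted!"

# Remove digits, then punctuation.
no_numbers <- gsub("[[:digit:]]+", "", text)
no_punct <- gsub("[[:punct:]]+", "", no_numbers)

# Collapse any run of whitespace to one space and trim the ends.
clean <- trimws(gsub("[[:space:]]+", " ", no_punct))
print(clean)
```

If you plan to tokenize by sentence, run that step before stripping the punctuation.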
The stemming of words can get tricky and might add to your confusion because it deletes word suffixes, creating the base word, or what is known as the radical. I personally am not a big fan of stemming and the analysts I've worked with agree with that sentiment. Still, consider a document that contains both family and families. Recall that R would count these as two separate words. By running a stemming algorithm, the stemmed word for the two instances would become famili. This would prevent the incorrect count, but in some cases the stem can be odd to interpret and is not very visually appealing in a word cloud for presentation purposes. In some cases, it may make sense to run your analysis with both stemmed and unstemmed words in order to see which one facilitates understanding.
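To see the famili stem for yourself, here is a short sketch that assumes the SnowballC package is installed; its wordStem() function applies the Porter stemmer. The input words family and families are illustrative, and the code is skipped if the package is not available.

```r
# Two hypothetical tokens that share a stem.
words <- c("family", "families")

# Only run the stemmer if SnowballC is installed (an assumption of this sketch).
if (requireNamespace("SnowballC", quietly = TRUE)) {
  stems <- SnowballC::wordStem(words, language = "porter")
  print(stems)
}
```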
Probably the most optional of the transformations is to replace the words. The goal of replacement is to combine words with a similar meaning, for example, management and leadership. You can also use it in lieu of stemming. I once examined the outcome of stemmed and unstemmed words and concluded that I could achieve a more meaningful result by replacing about a dozen words instead of stemming. Replacement can also be important when you have manual data entry and different operators input data differently. For example, tech support person one types turbocharger into the system, while tech support person two types turbo charger half the time and turbo-charger the other half. All three versions refer to the same part, so applying a replacement function such as gsub() will solve the problem. Note that grepl() only reports whether a pattern matches; it is useful for finding the variants, but it does not change the text.
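The turbocharger case can be sketched as follows. The entries vector and the regular expression are illustrative assumptions; gsub() performs the substitution, while grepl() is shown only to flag which entries mention the part.

```r
# Hypothetical free-text entries from different operators.
entries <- c("replaced turbocharger",
             "turbo charger leaking",
             "turbo-charger noise")

# Match "turbo" and "charger" joined by nothing, a space, or a hyphen.
pattern <- "turbo[ -]?charger"

# gsub() rewrites every variant to a single spelling.
standardized <- gsub(pattern, "turbocharger", entries)
print(standardized)

# grepl() returns TRUE/FALSE per entry; it does not modify the text.
flagged <- grepl(pattern, entries)
print(flagged)
```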
With transformations completed, one structure to create for topic modeling or classification is either a document-term matrix (DTM) or a term-document matrix (TDM). Either matrix holds the count of each word in each individual document in the corpus. A DTM has the documents as rows and the words as columns, while in a TDM, the reverse is true. We will be using a DTM for our example.
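To make the two layouts concrete, here is a small base R sketch that builds a dense DTM with table() and transposes it into a TDM. The two documents and their tokens are made up. Real projects typically use a sparse representation instead, for instance one produced by tidytext's cast_dtm().

```r
# Hypothetical tokenized documents.
docs <- list(
  doc1 = c("hockey", "game", "hockey"),
  doc2 = c("game", "score")
)

# One row per (document, term) occurrence.
long <- data.frame(
  document = rep(names(docs), lengths(docs)),
  term = unlist(docs, use.names = FALSE)
)

# DTM: documents as rows, terms as columns, cells are word counts.
dtm <- table(document = long$document, term = long$term)
print(dtm)

# TDM: the transpose, with terms as rows and documents as columns.
tdm <- t(dtm)
print(tdm)
```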
Topic models are a powerful method to group documents by their main topics. Topic models allow probabilistic modeling of term frequency occurrence in documents. The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords using an additional layer of latent variables, which are referred to as topics (Grün and Hornik, 2011). In essence, a document is assigned to a topic based on the distribution of the words in that document, and the other documents assigned to that topic will have roughly the same frequency of words.
The algorithm that we will focus on is Latent Dirichlet Allocation (LDA) with Gibbs sampling, which is probably the most commonly used sampling algorithm. In building topic models, the number of topics (k) must be determined before running the algorithm. If no a priori reason for the number of topics exists, then you can build several models and apply judgment and knowledge to the final selection. LDA with Gibbs sampling is quite complicated mathematically, but my intent is to provide an introduction so that you are at least able to describe, in layperson terms, how the algorithm learns to assign a document to a topic. If you are interested in mastering the math associated with the method, block out a couple of hours on your calendar and have a go at it. Excellent background material is av...