fastText Quick Start Guide
Get started with Facebook's library for text representation and classification
Joydeep Bhattacharjee
About This Book
Perform efficient text representation and classification with Facebook's fastText library
Key Features
- Introduction to Facebook's fastText library for NLP
- Perform efficient word representation, sentence classification, and vector representation
- Build better, more scalable solutions for text representation and classification
Book Description
Facebook's fastText library handles text representation and classification for Natural Language Processing (NLP). Most organizations deal with enormous amounts of text data on a daily basis, and gaining efficient insights from that data requires powerful NLP tools such as fastText.
This book is your ideal introduction to fastText. You will learn how to create fastText models from the command line, without the need for complicated code. You will explore the algorithms that fastText is built on and how to use them for word representation and text classification.
Next, you will use fastText in conjunction with other popular libraries and frameworks such as Keras, TensorFlow, and PyTorch.
Finally, you will deploy fastText models to mobile devices. By the end of this book, you will have all the required knowledge to use fastText in your own applications at work or in projects.
What you will learn
- Create models using the default command line options in fastText
- Understand the algorithms used in fastText to create word vectors
- Combine command line text transformation capabilities and the fastText library to implement a training, validation, and prediction pipeline
- Explore word representation and sentence classification using fastText
- Use Gensim and spaCy to load the vectors, transform, lemmatize, and perform other NLP tasks efficiently
- Develop a fastText NLP classifier using popular frameworks, such as Keras, TensorFlow, and PyTorch
Who this book is for
This book is for data analysts, data scientists, and machine learning developers who want to perform efficient word representation and sentence classification using Facebook's fastText library. Basic knowledge of Python programming is required.
Creating Models Using FastText Command Line
- Commands such as cat, grep, sed, and awk are quite old and their behavior is well-documented on the internet. Chances are high that, for any use case that you might have, you will easily get snippets on Stack Overflow/Google (or your colleague next door will know it).
- Since they are generally implemented in the C language, they are very fast.
- The commands are very crisp and concise, which means there is not a lot of code to write and maintain.
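To make this concrete, here is a minimal sketch of how one of these tools can prepare training data for fastText. The `data.tsv` filename and the labels are made up for illustration; `awk` alone converts a tab-separated label/text file into the `__label__` format that fastText's supervised mode expects:

```shell
# Create a tiny tab-separated "label<TAB>text" file (illustrative data).
printf 'positive\tGreat product\nnegative\tBroke in a day\n' > data.tsv

# Prepend fastText's __label__ prefix to each label and lowercase the text.
awk -F '\t' '{print "__label__" $1, tolower($2)}' data.tsv > train.txt

cat train.txt
# A supervised model would then be trained with:
#   fasttext supervised -input train.txt -output model
```

The same shape of pipeline scales to millions of lines, since `awk` streams the file rather than loading it into memory.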
- Text classification using fastText
- FastText word vectors
- Creating word vectors
- Facebook word vectors
- Using pretrained word vectors
Text classification using fastText
- Text classification
- Text representation
- First, you need the data: for text classification, this means a collection of texts or documents, each with a label. You convert these into a series of text-label pairs.
- The next step is called tokenization. Tokenization is the process of dividing the text into individual pieces or tokens. Tokenization is primarily done by understanding the word boundaries in the given text. Many languages in the world are space delimited. Examples of these are English and French. In some other cases, the word boundaries may not be clear, such as in the case of Mandarin, Tamil, and Urdu.
- Once tokenization is done, you may end up with a "bag of words": a vector for each document/sentence recording whether each word occurs, and how many times. Stacking these vectors gives a matrix whose columns are the set of all words present, called the dictionary, and whose rows are the counts of those words in each document. This is called the bag-of-words approach.
- Convert the bag of words into a TF-IDF matrix to reduce the weight of common terms. TF-IDF is used so that terms that are common across documents do not have too much impact on the resultant matrix.
- Now that you have the matrix, you can pass it as input to a classification algorithm, which will train a model on it. Algorithms that are quite popular at this stage include logistic regression, as well as XGBoost, random forests, and so on.
- Removal of stop words.
- Stemming, or the heuristic removal of word endings. This process works mostly in English and related languages due to the prevalence of derivational affixes.
- Addition of n-grams to the model.
- Synonymous sets.
- Part of speech tagging.
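The tokenization, bag-of-words, and TF-IDF steps above can be sketched in a few lines of pure Python. This is a minimal illustration of the classical pipeline, not fastText's internals; the documents, the helper names, and the log-based IDF formula are assumptions chosen for clarity:

```python
import math
from collections import Counter

def tokenize(text):
    # Tokenization: split a space-delimited text into lowercase tokens.
    return text.lower().split()

def bag_of_words(docs, vocab):
    # Bag of words: one row per document, one column per dictionary word,
    # each cell holding the count of that word in the document.
    return [[Counter(tokenize(d))[w] for w in vocab] for d in docs]

def tfidf(matrix):
    # TF-IDF: down-weight terms that appear in many documents.
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]

docs = ["the cat sat", "the dog barked"]
vocab = sorted({w for d in docs for w in tokenize(d)})  # the dictionary
counts = bag_of_words(docs, vocab)
weights = tfidf(counts)
# "the" appears in every document, so its TF-IDF weight is 0 in both rows.
```

The resulting `weights` matrix is what would then be fed to a classifier such as logistic regression in the final step.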
Text preprocessing
- Tokenize the text.
- Convert the text into lowercase. This is only required for lang...