fastText Quick Start Guide

Get started with Facebook's library for text representation and classification

  1. 194 pages
  2. English

About this book

Perform fast, efficient text representation and classification with Facebook's fastText library

Key Features

  • Introduction to Facebook's fastText library for NLP
  • Perform efficient word representation, sentence classification, and vector representation
  • Build better, more scalable solutions for text representation and classification

Book Description

Facebook's fastText library handles text representation and classification, two core tasks in Natural Language Processing (NLP). Most organizations have to deal with enormous amounts of text data on a daily basis, and gaining efficient data insights requires powerful NLP tools such as fastText.

This book is your ideal introduction to fastText. You will learn how to create fastText models from the command line, without the need for complicated code. You will explore the algorithms that fastText is built on and how to use them for word representation and text classification.

Next, you will use fastText in conjunction with other popular libraries and frameworks such as Keras, TensorFlow, and PyTorch.

Finally, you will deploy fastText models to mobile devices. By the end of this book, you will have all the required knowledge to use fastText in your own applications at work or in projects.

What you will learn

  • Create models using the default command line options in fastText
  • Understand the algorithms used in fastText to create word vectors
  • Combine command line text transformation capabilities and the fastText library to implement a training, validation, and prediction pipeline
  • Explore word representation and sentence classification using fastText
  • Use Gensim and spaCy to load the vectors, transform, lemmatize, and perform other NLP tasks efficiently
  • Develop a fastText NLP classifier using popular frameworks, such as Keras, TensorFlow, and PyTorch

Who this book is for

This book is for data analysts, data scientists, and machine learning developers who want to perform efficient word representation and sentence classification using Facebook's fastText library. Basic knowledge of Python programming is required.

fastText Quick Start Guide is written by Joydeep Bhattacharjee.

Creating Models Using FastText Command Line

FastText has a powerful command line. In fact, you can call fastText a command-line-first library. Many developers and researchers are not comfortable with the command line; if you are one of them, I would ask you to go through the examples in this chapter with extra attention. My hope is that by the end of this chapter, you will have some confidence in command-line file manipulation. The advantages of using the command line are as follows:
  • Commands such as cat, grep, sed, and awk are quite old and their behavior is well-documented on the internet. Chances are high that, for any use case that you might have, you will easily get snippets on Stack Overflow/Google (or your colleague next door will know it).
  • Since they are generally implemented in the C language, they are very fast.
  • The commands are very crisp and concise, which means there is not a lot of code to write and maintain.
We will take a look at how classification and word vector generation work in fastText. In this chapter, we will explore how to implement them using the command line, covering the following topics:
  • Text classification using fastText
  • FastText word vectors
  • Creating word vectors
  • Facebook word vectors
  • Using pretrained word vectors

Text classification using fastText

To access the command line, open the Terminal on your Linux or macOS machine, or the Command Prompt on a Windows machine (press Windows + R, type cmd, and hit Enter), and then type fastText. You should see some output. If you see nothing, or get an error saying that the command was not found, please take a look at the previous chapter on how to install fastText on your computer. If you do see output, it is a basic description of all the options. A description of the command line options for fastText can be found in the Appendix of this book.
All the methods and command line statements mentioned in this chapter will work on Linux and Mac machines. If you are a Windows user, focus on the description and follow the logic of the steps being performed. A helpful guide to the command line differences between Windows and Linux is provided in the Appendix.
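The installation check described above can also be scripted. The following is a minimal sketch, assuming the executable is installed on your PATH under the lowercase name fasttext (the name produced by the project's default build); the helper name find_executable is hypothetical and chosen only for illustration.

```python
import shutil


def find_executable(name="fasttext"):
    """Return the full path of an executable found on PATH, or None if absent."""
    # Assumption: the binary is called "fasttext"; pass another name if yours differs.
    return shutil.which(name)


path = find_executable()
if path is None:
    print("fasttext was not found on PATH; revisit the installation steps.")
else:
    print(f"fasttext found at {path}")
```

If the check reports that nothing was found, the binary may still exist in the build directory without being on your PATH.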
In fastText, there are two primary use cases for the command line. These are the following:
  • Text classification
  • Text representation
One of the core areas of focus for fastText is text classification. Text classification is a technique in which we learn which of a set of categories an input text belongs to. This is a supervised machine learning problem, so first and foremost, you will need a dataset that contains text and the corresponding labels.
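As a rough illustration of what such a dataset looks like, the sketch below writes text-label pairs in the one-example-per-line form that fastText's supervised mode reads by default, where each label carries the __label__ prefix. The example pairs and the helper name to_fasttext_line are invented for illustration and are not taken from the book.

```python
# Hypothetical example data: (label, text) pairs invented for illustration.
examples = [
    ("positive", "I loved this movie"),
    ("negative", "The plot was dull and slow"),
]


def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <text>' on a single line."""
    # A label must not contain whitespace, so inner spaces become hyphens.
    clean_label = "-".join(label.split())
    # Each example must stay on one line, so newlines and extra spaces are collapsed.
    clean_text = " ".join(text.split())
    return f"{prefix}{clean_label} {clean_text}"


lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```

Writing these lines to a text file gives a training file of the kind the command line tools in this chapter expect.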
Roughly speaking, machine learning algorithms solve some kind of optimization problem over a set of matrices and vectors. They do not really understand "raw text," which means that you will need to set up a pipeline to convert the raw text into numbers. Here are the steps that can be followed to do that:
  • First, you need the data. For text classification, this means a series of labeled texts or documents, which you convert into a series of text-label pairs.
  • The next step is called tokenization. Tokenization is the process of dividing the text into individual pieces, or tokens, primarily by identifying the word boundaries in the given text. Many languages, such as English and French, are space delimited. In other languages, such as Mandarin, Tamil, and Urdu, the word boundaries may not be as clear.
  • Once tokenization is done, depending on the process, you may end up with a "bag of words," which is essentially a vector for each document/sentence telling you whether a specific word is present and how many times it occurs. Stacking these vectors gives a matrix: the columns correspond to the set of all words present, which is called the dictionary, each row corresponds to a document, and each entry is the count of a particular word in that document. This is called the bag-of-words approach.
  • Convert the bag of words into a TF-IDF matrix to reduce the weight of the common terms. TF-IDF is used so that terms that occur in many documents do not have too much impact on the resultant matrix.
  • Now that you have the matrix, you can pass it as input to a classification algorithm, which will train a model on it. Popular algorithms at this stage include logistic regression, XGBoost, random forest, and so on.
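The tokenization, bag-of-words, and TF-IDF steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy corpus is invented, tokenization is a plain whitespace split, and the weighting is one common TF-IDF variant (term frequency times log(N / document frequency)); libraries often use smoothed variants instead.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]


def tokenize(text):
    """Whitespace tokenization; adequate only for space-delimited languages."""
    return text.split()


tokenized = [tokenize(doc) for doc in documents]

# The dictionary: every distinct word in the corpus, in a fixed column order.
dictionary = sorted({word for tokens in tokenized for word in tokens})


def bag_of_words(tokens):
    """Return one row of counts, with one column per dictionary word."""
    counts = Counter(tokens)
    return [counts[word] for word in dictionary]


# Rows are documents, columns are dictionary words, entries are counts.
bow_matrix = [bag_of_words(tokens) for tokens in tokenized]


def tf_idf(matrix):
    """Weight each count by tf * log(N / df), one common TF-IDF variant."""
    n_docs = len(matrix)
    n_words = len(matrix[0])
    # Document frequency: in how many documents each word appears.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for row in matrix:
        total = sum(row)  # number of tokens in this document
        weighted.append([(count / total) * w for count, w in zip(row, idf)])
    return weighted


tfidf_matrix = tf_idf(bow_matrix)
the_col = dictionary.index("the")
mat_col = dictionary.index("mat")
print("count of 'the' in document 0:", bow_matrix[0][the_col])
print("TF-IDF of 'the' vs 'mat' in document 0:",
      round(tfidf_matrix[0][the_col], 3), round(tfidf_matrix[0][mat_col], 3))
```

In this toy corpus, "the" occurs twice in the first document but also appears in another document, so its TF-IDF weight ends up lower than that of "mat", which occurs once and only in that document. The resulting matrix is what would be handed to a classifier such as logistic regression.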
Some of the additional steps that may need to be taken are the following:
  • Removal of stop words.
  • Stemming, or the heuristic removal of word endings. This process works mostly in English and related languages due to the prevalence of derivational affixes.
  • Addition of n-grams to the model.
  • Synonym sets.
  • Part-of-speech tagging.
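Three of these additional steps (stop-word removal, stemming, and n-grams) can be sketched as follows. The stop-word list and the suffix rules are tiny, invented stand-ins for illustration only; a real project would use an established stop-word list and a proper stemmer.

```python
# Tiny illustrative stop-word list; an assumption, not a standard list.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}
# Crude suffix rules; a heuristic stand-in, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def add_ngrams(tokens, n=2):
    """Append word n-grams (joined with '_') to the list of single tokens."""
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ngrams


tokens = "the dogs chased a barking cat".split()
filtered = remove_stop_words(tokens)
stems = [crude_stem(t) for t in filtered]
features = add_ngrams(stems, n=2)
print(features)
```

Note how the crude rules turn "chased" into "chas": heuristic stemming trades linguistic accuracy for simplicity, which is why it works best where endings are regular.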

Text preprocessing

Depending on the dataset, you may need to do some or all of these steps:
  • Tokenize the text.
  • Convert the text into lowercase. This is only required for lang...
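The two steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own preprocessing: the function name preprocess and the chosen set of punctuation characters are assumptions made for the example.

```python
import re


def preprocess(text):
    """Lowercase the text, then split it into word and punctuation tokens."""
    text = text.lower()
    # Put spaces around common punctuation so each mark becomes its own token.
    text = re.sub(r"([.!?,'/()\":;])", r" \1 ", text)
    return text.split()


tokens = preprocess("Hello, World! (fastText)")
print(tokens)
```

Separating punctuation from words keeps "world!" and "world" from being treated as two different dictionary entries.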

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. Packt Upsell
  5. Contributors
  6. Preface
  7. First Steps
  8. Introducing FastText
  9. Creating Models Using FastText Command Line
  10. The FastText Model
  11. Word Representations in FastText
  12. Sentence Classification in FastText
  13. Using FastText in Your Own Models
  14. FastText in Python
  15. Machine Learning and Deep Learning Models
  16. Deploying Models to Web and Mobile
  17. Notes for the Readers
  18. References
  19. Other Books You May Enjoy