fastText Quick Start Guide
Get started with Facebook's library for text representation and classification
Joydeep Bhattacharjee
About This Book
Perform efficient text representation and classification with Facebook's fastText library
Key Features
- Introduction to Facebook's fastText library for NLP
- Perform efficient word representation, sentence classification, and vector representation
- Build better, more scalable solutions for text representation and classification
Book Description
Facebook's fastText library handles text representation and classification for Natural Language Processing (NLP). Most organizations deal with enormous amounts of text data on a daily basis, and gaining efficient insights from that data requires powerful NLP tools such as fastText.
This book is your ideal introduction to fastText. You will learn how to create fastText models from the command line, without the need for complicated code. You will explore the algorithms that fastText is built on and how to use them for word representation and text classification.
Next, you will use fastText in conjunction with other popular libraries and frameworks such as Keras, TensorFlow, and PyTorch.
Finally, you will deploy fastText models to mobile devices. By the end of this book, you will have all the required knowledge to use fastText in your own applications at work or in projects.
What you will learn
- Create models using the default command line options in fastText
- Understand the algorithms used in fastText to create word vectors
- Combine command line text transformation capabilities and the fastText library to implement a training, validation, and prediction pipeline
- Explore word representation and sentence classification using fastText
- Use Gensim and spaCy to load the vectors, transform, lemmatize, and perform other NLP tasks efficiently
- Develop a fastText NLP classifier using popular frameworks, such as Keras, TensorFlow, and PyTorch
Who this book is for
This book is for data analysts, data scientists, and machine learning developers who want to perform efficient word representation and sentence classification using Facebook's fastText library. Basic knowledge of Python programming is required.
Creating Models Using FastText Command Line
- Commands such as cat, grep, sed, and awk are quite old and their behavior is well-documented on the internet. Chances are high that, for any use case that you might have, you will easily get snippets on Stack Overflow/Google (or your colleague next door will know it).
- Since they are generally implemented in the C language, they are very fast.
- The commands are very crisp and concise, which means there is not a lot of code to write and maintain.
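To make this concrete, here is a minimal sketch of how one of these tools can prepare training data for fastText. The `data.tsv` filename and the labels are made up for illustration; `awk` alone converts a tab-separated label/text file into the `__label__` format that fastText's supervised mode expects:

```shell
# Create a tiny tab-separated "label<TAB>text" file (illustrative data).
printf 'positive\tGreat product\nnegative\tBroke in a day\n' > data.tsv

# Prepend fastText's __label__ prefix to each label and lowercase the text.
awk -F '\t' '{print "__label__" $1, tolower($2)}' data.tsv > train.txt

cat train.txt
# A supervised model would then be trained with:
#   fasttext supervised -input train.txt -output model
```

The same shape of pipeline scales to millions of lines, since `awk` streams the file rather than loading it into memory.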
- Text classification using fastText
- FastText word vectors
- Creating word vectors
- Facebook word vectors
- Using pretrained word vectors
Text classification using fastText
- Text classification
- Text representation
- First, you need the data: for text classification, this means a collection of texts or documents, each with a label. You convert these into a series of text-label pairs.
- The next step is called tokenization. Tokenization is the process of dividing the text into individual pieces or tokens. Tokenization is primarily done by understanding the word boundaries in the given text. Many languages in the world are space delimited. Examples of these are English and French. In some other cases, the word boundaries may not be clear, such as in the case of Mandarin, Tamil, and Urdu.
- Once tokenization is done, you may end up with a "bag of words": a vector for each document/sentence recording whether each word occurs, and how many times. Stacking these vectors gives a matrix whose columns are the set of all words present, called the dictionary, and whose rows are the counts of those words in each document. This is called the bag-of-words approach.
- Convert the bag of words into a TF-IDF matrix to reduce the weight of common terms. TF-IDF is used so that terms that are common across documents do not have too much impact on the resultant matrix.
- Now that you have the matrix, you can pass it as input to a classification algorithm, which will train a model on it. Algorithms that are quite popular at this stage include logistic regression, as well as XGBoost, random forests, and so on.
- Removal of stop words.
- Stemming, or the heuristic removal of word endings. This process works mostly in English and related languages due to the prevalence of derivational affixes.
- Addition of n-grams to the model.
- Synonymous sets.
- Part of speech tagging.
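The tokenization, bag-of-words, and TF-IDF steps above can be sketched in a few lines of pure Python. This is a minimal illustration of the classical pipeline, not fastText's internals; the documents, the helper names, and the log-based IDF formula are assumptions chosen for clarity:

```python
import math
from collections import Counter

def tokenize(text):
    # Tokenization: split a space-delimited text into lowercase tokens.
    return text.lower().split()

def bag_of_words(docs, vocab):
    # Bag of words: one row per document, one column per dictionary word,
    # each cell holding the count of that word in the document.
    return [[Counter(tokenize(d))[w] for w in vocab] for d in docs]

def tfidf(matrix):
    # TF-IDF: down-weight terms that appear in many documents.
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]

docs = ["the cat sat", "the dog barked"]
vocab = sorted({w for d in docs for w in tokenize(d)})  # the dictionary
counts = bag_of_words(docs, vocab)
weights = tfidf(counts)
# "the" appears in every document, so its TF-IDF weight is 0 in both rows.
```

The resulting `weights` matrix is what would then be fed to a classifier such as logistic regression in the final step.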
Text preprocessing
- Tokenize the text.
- Convert the text into lowercase. This is only required for lang...