Data Processing with Optimus
  1. 300 pages
  2. English

About this book

Written by the core Optimus team, this comprehensive guide will help you to understand how Optimus improves the whole data processing landscape.

Key Features

  • Load, merge, and save small and big data efficiently with Optimus
  • Learn Optimus functions for data analytics, feature engineering, machine learning, cross-validation, and NLP
  • Discover how Optimus improves other data frame technologies and helps you speed up your data processing tasks

Book Description

Optimus is a Python library that works as a unified API for cleaning, processing, and merging data. It can be used for handling small and big data on your local laptop or on remote clusters using CPUs or GPUs. The book begins by covering the internals of Optimus and how it works in tandem with the existing technologies to serve your data processing needs. You'll then learn how to use Optimus for loading and saving data from text data formats such as CSV and JSON files, exploring binary files such as Excel, and for columnar data processing with Parquet, Avro, and ORC. Next, you'll get to grips with the profiler and its data types, a unique feature of the Optimus DataFrame that assists with data quality. You'll see how to use the plots available in Optimus, such as histograms, frequency charts, and scatter and box plots, and understand how Optimus lets you connect to libraries such as Plotly and Altair. You'll also delve into advanced applications such as feature engineering, machine learning, cross-validation, and natural language processing functions, and explore the advancements in Optimus. Finally, you'll learn how to create data cleaning and transformation functions and add a hypothetical new data processing engine with Optimus.
By the end of this book, you'll be able to improve your data science workflow with Optimus easily.

What you will learn

  • Use over 100 data processing functions over columns and other string-like values
  • Reshape and pivot data to get the output in the required format
  • Find out how to plot histograms, frequency charts, scatter plots, box plots, and more
  • Connect Optimus with popular Python visualization libraries such as Plotly and Altair
  • Apply string clustering techniques to normalize strings
  • Discover functions to explore, fix, and remove poor quality data
  • Use advanced techniques to remove outliers from your data
  • Add engines and custom functions to clean, process, and merge data

Who this book is for

This book is for Python developers who want to explore, transform, and prepare big data for machine learning, analytics, and reporting using Optimus, a unified API to work with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and Spark. Although not necessary, beginner-level knowledge of Python will be helpful. Basic knowledge of the CLI is required to install Optimus and its requirements. For using GPU technologies, you'll need an NVIDIA graphics card compatible with NVIDIA's RAPIDS library, which is compatible with Windows 10 and Linux.

Section 1: Getting Started with Optimus

By the end of this section, you will have a good understanding of what Optimus brings to the table in terms of improving the entire data processing landscape.
This section comprises the following chapters:
  • Chapter 1, Hi Optimus!
  • Chapter 2, Data Loading, Saving, and File Formats

Chapter 1: Hi Optimus!

Optimus is a Python library that loads, transforms, and saves data, and also focuses on wrangling tabular data. It provides functions that were designed specially to make this job easier for you; it can use multiple engines as backends, such as pandas, cuDF, Spark, and Dask, so that you can process both small and big data efficiently.
Optimus is not a DataFrame technology: it is not a new way to organize data in memory, such as Arrow, or a way to handle data on GPUs, such as cuDF. Instead, Optimus relies on these technologies to load, process, explore, and save data.
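To make the idea of one API sitting on top of several backends concrete, here is a minimal sketch in plain Python. It is not Optimus code and does not use the Optimus API: `RowEngine`, `ColumnEngine`, `ENGINES`, and `Wrangler` are hypothetical names invented for this illustration. It assumes only that each backend stores data its own way while exposing the same small set of operations.

```python
from typing import Dict, List

Rows = List[Dict[str, str]]


class RowEngine:
    """Hypothetical backend that keeps data as a list of row dicts."""

    def load(self, rows: Rows):
        return [dict(row) for row in rows]

    def upper(self, data, column: str):
        return [{**row, column: row[column].upper()} for row in data]

    def to_rows(self, data) -> Rows:
        return data


class ColumnEngine:
    """Hypothetical backend that keeps data as a dict of column lists."""

    def load(self, rows: Rows):
        # Assumes every row has the same keys as the first one.
        columns = {name: [] for name in rows[0]} if rows else {}
        for row in rows:
            for name, value in row.items():
                columns[name].append(value)
        return columns

    def upper(self, data, column: str):
        return {**data, column: [value.upper() for value in data[column]]}

    def to_rows(self, data) -> Rows:
        names = list(data)
        return [dict(zip(names, values)) for values in zip(*data.values())]


# Registry of available toy engines, selected by name.
ENGINES = {"rows": RowEngine, "columns": ColumnEngine}


class Wrangler:
    """Facade: the caller picks an engine by name and uses one API."""

    def __init__(self, engine: str, rows: Rows):
        self._engine = ENGINES[engine]()
        self._data = self._engine.load(rows)

    def upper(self, column: str) -> "Wrangler":
        self._data = self._engine.upper(self._data, column)
        return self

    def to_rows(self) -> Rows:
        return self._engine.to_rows(self._data)


sample = [
    {"name": "optimus", "kind": "leader"},
    {"name": "bumblebee", "kind": "scout"},
]

# The same call chain runs unchanged on either backend.
results = {name: Wrangler(name, sample).upper("name").to_rows() for name in ENGINES}
```

The calling code never touches the storage layout, so swapping the engine name is the only change needed to move between backends. That is the property the paragraph above describes, shown here at toy scale.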
This book is for anyone, and especially for data and machine learning engineers, who wants to simplify the code they write for data processing tasks. Whether you process small or big data, on your laptop or in a remote cluster, and whether you load data from a database or from remote storage, Optimus provides the tools you need to make your data processing task easier.
In this chapter, we will learn how Optimus was born and which DataFrame technologies you can use as backends to process data. Then, we will look at the features that set Optimus apart from those DataFrame technologies. After that, we will install Optimus and JupyterLab so that we are prepared to code in Chapter 2, Data Loading, Saving, and File Formats.
Finally, we will analyze some of Optimus's internal functions to understand how it works and how you can take advantage of some of the more advanced features.
A key point: this book will not try to explain how every DataFrame technology works. There are plenty of resources on the internet that explain the internals and the day-to-day use of these technologies. Optimus is the result of an attempt to create an expressive and easy-to-use data API that gives the user most of the tools they need to complete the data preparation process in the easiest way possible.
The topics we will be covering in this chapter are as follows:
  • Introducing Optimus
  • Installing everything you need to run Optimus
  • Using Optimus
  • Discovering Optimus internals

Technical requirements

To take full advantage of this chapter, please make sure you meet the requirements specified in this section.
Optimus can work with multiple backend technologies to process data, including GPUs. For GPUs, Optimus uses RAPIDS, which needs an NVIDIA card. For more information about the requirements, please go to the GPU configuration section.
To use RAPIDS on Windows 10, you will need the following:
  • Windows 10 version 2004 (OS build 202001.1000 or later)
  • NVIDIA driver version 455.41 with CUDA SDK v11.1
You can find all the code for this chapter in https://github.com/PacktPublishing/Data-Processing-with-Optimus.

Introducing Optimus

Development of Optimus began with work being conducted for another project. In 2016, Alberto Bonsanto, Hugo Reyes, and I had an ongoing big data project for a national retail business in Venezuela. We learned how to use PySpark and Trifacta to prepare and clean data and to find buying patterns.
But problems soon arose with both technologies. The data had different category/product names over the years, a 10-level categorization tree, and came from different sources, including CSV files, Excel files, and databases, which added an extra process to our workflow and made the data hard to wrangle. Trifacta, on the other hand, required us to learn its unique syntax. It also lacked some features we needed, such as the ability to remove a single character from every column in the dataset. In addition to that, the tool was closed source.
We thought we could do better. We wanted to write an open source, user-friendly library in Python that would let inexperienced users apply functions to clean, prepare, and plot big data using PySpark.
From this, Optimus was born.
After that, we integrated other technologies. The first one we wanted to include was cuDF, which can process data up to 20x faster; soon after that, we also integrated Dask, Dask-cuDF, and Ibis. You may be wondering, why so many DataFrame technologies? To answer that, we need to understand a little bit more about how each one works.

Exploring the DataFrame technologies

There are many different well-known DataFrame technologies available today. Optimus can process data using one or many of those available technologies, including pandas, Dask, cuDF, Dask-cuDF, Spark, Vaex, or Ibis.
Let's look at some of the ones that work with Optimus:
  • pandas is, without a doubt, one of the most popular DataFrame technologies. If you work with data in Python, you probably use pandas a lot, but it has an important caveat: pandas does not handle multi-core processing. On its own, it cannot use all the power that modern CPUs offer, so you need to find a hacky way to spread the work across all the cores. Also, you cannot process data volumes greater than the memory available in RAM, so you need to write code to process your data in chunks.
  • Dask came out to help parallelize Python data processing. In Dask, we have the...
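The chunk-by-chunk approach mentioned in the pandas bullet can be sketched with the standard library alone. This is an illustration of the technique and not code from the book: `read_in_chunks` is a hypothetical helper, and the sample data is made up. With pandas itself, the `chunksize` parameter of `read_csv` gives you a similar iterator of partial DataFrames.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def read_in_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most chunk_size parsed CSV rows."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical sample data; an open file object could be passed the same way.
raw = io.StringIO("product,units\napple,3\npear,5\napple,2\nfig,7\npear,1\n")

totals: Dict[str, int] = {}
chunk_count = 0
for chunk in read_in_chunks(raw, chunk_size=2):
    chunk_count += 1
    # Only the current chunk and the running totals are held in memory.
    for row in chunk:
        totals[row["product"]] = totals.get(row["product"], 0) + int(row["units"])
```

The aggregation works because a sum can be built up incrementally. Operations that need the whole dataset at once, such as a global sort, require more care when the data is split this way.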

Table of contents

  1. Data Processing with Optimus
  2. Contributors
  3. Preface
  4. Section 1: Getting Started with Optimus
  5. Chapter 1: Hi Optimus!
  6. Chapter 2: Data Loading, Saving, and File Formats
  7. Section 2: Optimus – Transform and Rollout
  8. Chapter 3: Data Wrangling
  9. Chapter 4: Combining, Reshaping, and Aggregating Data
  10. Chapter 5: Data Visualization and Profiling
  11. Chapter 6: String Clustering
  12. Chapter 7: Feature Engineering
  13. Section 3: Advanced Features of Optimus
  14. Chapter 8: Machine Learning
  15. Chapter 9: Natural Language Processing
  16. Chapter 10: Hacking Optimus
  17. Chapter 11: Optimus as a Web Service
  18. Other Books You May Enjoy

Data Processing with Optimus, by Dr. Argenis Leon and Luis Aguirre, is available in PDF and/or ePUB format. It is listed under Computer Science & Artificial Intelligence (AI) & Semantics.