Pentaho Data Integration Quick Start Guide

Create ETL processes using Pentaho

María Carina Roldán

About This Book

Get productive quickly with Pentaho Data Integration

Key Features

  • Take away the pain of starting with a complex and powerful system
  • Simplify your data transformation and integration work
  • Explore, transform, and validate your data with Pentaho Data Integration

Book Description

Pentaho Data Integration (PDI) is an intuitive, graphical environment packed with drag-and-drop design and powerful Extract-Transform-Load (ETL) capabilities. Given its power and flexibility, initial attempts to use the tool can be difficult or confusing. This book is the ideal solution.

This book reduces your learning curve with PDI. It provides the guidance needed to make you productive, covering the main features of Pentaho Data Integration. It demonstrates the interactive features of the graphical designer, and takes you through the main ETL capabilities that the tool offers.

By the end of the book, you will be able to use PDI for extracting, transforming, and loading the types of data you encounter on a daily basis.

What you will learn

  • Design, preview and run transformations in Spoon
  • Run transformations using the Pan utility
  • Understand how to obtain data from different types of files
  • Connect to a database and explore it using the database explorer
  • Understand how to transform data in a variety of ways
  • Understand how to insert data into database tables
  • Design and run jobs for sequencing tasks and sending emails
  • Combine the execution of jobs and transformations

Who this book is for

This book is for software developers, business intelligence analysts, and others involved or interested in developing ETL solutions, or more generally, doing any kind of data manipulation.


Information

Year
2018
ISBN
9781789342796

Transforming Data

Transforming data is about manipulating the data that flows from step to step in a PDI transformation. There are many ways in which this can be done: we can modify incoming data, change its datatype, add new fields, fix erroneous data, sort and group rows, filter out unwanted information, aggregate data in several ways, and more. In this chapter, we will explain some of these possibilities.
The following is the list of topics that we will cover:
  • Transforming data in different ways
  • Sorting and aggregating data
  • Filtering rows
  • Looking up data

Transforming data in different ways

So far, we have seen how to create a PDI dataset, mainly using data coming from files or databases. Once you have the data, there are many things you can do with it, depending on your particular needs. One very common requirement is to create new fields whose values are based on the values of existing fields.
The set of operations covered in this section is not a full list of the available options, but includes the most common ones, and will inspire you when you come to implement others.
The files that we will use in this section were built with data downloaded from www.numbeo.com, a site containing information about living conditions in cities and countries worldwide.
To learn the topics in this chapter, you are free to create your own data. However, if you want to reproduce the exercises exactly as they are explained, you will need the aforementioned files from www.numbeo.com.
Before continuing, make sure you download the set of data that comes with the code bundle for the book.

Extracting data from existing fields

First, we will learn how to extract data from fields that exist in our dataset in order to generate new fields. For the first exercise, we will read a file containing data about the cost of living in Europe. The content of the file looks like this:
Rank City Cost of Living Index Rent Index Cost of Living Plus Rent Index Groceries Index Restaurant Price Index Local Purchasing Power Index
1 Zurich, Switzerland 141.25 66.14 105.03 149.86 135.76 142.70
2 Geneva, Switzerland 134.83 71.70 104.38 138.98 129.74 130.96
3 Basel, Switzerland 130.68 49.68 91.61 127.54 127.22 139.01
4 Bern, Switzerland 128.03 43.57 87.30 132.70 119.48 112.71
5 Lausanne, Switzerland 127.50 52.32 91.24 126.59 132.12 127.95
6 Reykjavik, Iceland 123.78 57.25 91.70 118.15 133.19 88.95
...
As you can see, the city field also contains the country name. The purpose of this exercise is to extract the country name from this field. In order to do this, we will go through the following steps:
  1. Create a new transformation and use a Text file input step to read the cost_of_living_europe.txt file.
  2. Drag a Split Fields step from the Transform category and create a hop from the Text file input toward the Split Fields step.
  3. Double-click the step and configure it, as shown in the following screenshot:
Configuring a Split Fields step
  4. Close the window and run a preview. You will see the following:
Previewing a transformation
As you can see, the Split Fields step can be used to split the value of a field into two or more new fields. It is perfect for obtaining the country name here because the values were easy to parse: a value, then a comma, then another value. This is not always the case, but PDI has other steps for doing similar tasks. A quick sketch of the comma-splitting idea follows, and after that we will look at another method for extracting pieces from a field.
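The following is a minimal sketch in plain Python (not the PDI engine, and not how the Split Fields step is implemented) of the same split, using the first value from the preview above:

# A hand-rolled illustration of what the Split Fields step did for us:
# split a "City, Country" value on the comma into two new values.
value = "Zurich, Switzerland"   # sample value from the preview
city, country = [part.strip() for part in value.split(",", 1)]
print(city)     # Zurich
print(country)  # Switzerland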
This time, we will read a file containing common daily food items and their prices. The file has two fields—food and price—and looks as follows:
Food Price
Milk (regular), (0.25 liter) 0.19 €
Loaf of Fresh White Bread (125.00 g) 0.24 €
Rice (white), (0.10 kg) 0.09 €
Eggs (regular) (2.40) 0.33 €
Local Cheese (0.10 kg) 0.89 €
Chicken Breasts (Boneless, Skinless), (0.15 kg) 0.86 €
...
Suppose that we want to split the Food field into three fields holding the name, the quantity, and the unit of measure, respectively. Taking the value in the first row, Milk (regular), (0.25 liter), as an example, the name would be Milk (regular), the quantity would be 0.25, and the unit would be liter. We cannot solve this as we did before, but we can use regular expressions instead. In this case, the expression to use will be (.+)\(([0-9.]+)( liter| g| kg| head|)\).*.
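Before building the transformation, you can try the expression outside Spoon. Here is a minimal sketch using Python's re module (plain Python, not PDI; the sample value is the first row of the file):

import re

# The same expression proposed above, tried against one sample value.
pattern = re.compile(r"(.+)\(([0-9.]+)( liter| g| kg| head|)\).*")

match = pattern.match("Milk (regular), (0.25 liter)")
print(match.groups())
# ('Milk (regular), ', '0.25', ' liter')
# The first group keeps the trailing comma and space; trimming it is a
# small extra cleanup once the three new fields exist.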
Let's try it using the following steps:
  1. Create a new transformation and use a Text file input step to read the recommended_food.txt file.
In order to define the Price as a number, use the format #.00 €.
  2. Drag a Regex Evaluation step from the Scripting category and create a hop from the Text file input toward this new step.
  3. Double-click the step and configure it as shown in the following screenshot. Don't forget to check the Create fields for capture groups option:
Configuring a Regex Evaluation step
  4. Close the window and run a preview. You will see the following:
Previewing a transformation
The Regex Evaluation step can be used just to evaluate whether or not a field matches a regular expression, or to generate new fields, as in this case. By capturing groups, we were able to create a new field for each group captured from the original field. You will also notice a field named result, which in our example has a Y as its value. This Y means that the value of the field matched the regular expression.
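That Y/N logic is easy to picture outside the tool as well. A minimal sketch in plain Python follows; Bottled Water is a made-up value with no parenthesized quantity, used here only to show a non-matching row:

import re

pattern = re.compile(r"(.+)\(([0-9.]+)( liter| g| kg| head|)\).*")

# Rows that match would carry Y in the result field; rows that do not, N.
for food in ["Local Cheese (0.10 kg)", "Bottled Water"]:
    result = "Y" if pattern.match(food) else "N"
    print(food, "->", result)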
