
Pentaho Kettle Solutions
Building Open Source ETL Solutions with Pentaho Data Integration
About this book
A complete guide to Pentaho Kettle, the Pentaho Data Integration toolset for ETL
This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions—before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.
- Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
- Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
- Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
- Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"
Get the most out of Pentaho Kettle and your data warehousing with this detailed guide—from simple single table data migration to complex multisystem clustered data integration tasks.
Part I
Getting Started
Chapter 1
ETL Primer
The introduction of this book described the need for data integration. This chapter provides a starting point for the wonderful world of data integration and explains the differences and similarities among its three main forms: ETL, ELT, and EII. To fully understand the reasoning behind using a data warehouse and an ETL solution to load and update data, we start by explaining the differences between a transactional database and an analytical database.
OLTP versus Data Warehousing
The first question one might ask is how source data systems differ from business intelligence (BI) systems (sometimes still called decision support systems or DSS). An individual transaction system, often denoted by the acronym OLTP (short for OnLine Transaction Processing), needs to be able to very quickly retrieve a single record of information. When multiple records are needed they are usually tied to a single key that has been retrieved before. Think of an order with the accompanying order lines in an order entry system or a personnel record with all salary and bonus information in an HR system. What’s more: this data often needs to be updated as well, usually just one record at a time.
The biggest difference between an OLTP database and a BI database (the data warehouse, or DWH) is the amount of data analyzed in a single transaction. Whereas an OLTP system handles many concurrent users and queries touching only a single record or limited groups of records at a time, a data warehouse must be able to operate on millions of records to answer a single query. Table 1-1 shows an overview of the major differences between an OLTP system and a data warehouse.
Table 1-1: OLTP versus Data Warehouse
| Characteristic | OLTP | Data Warehouse |
| --- | --- | --- |
| System scope/view | Single business process | Multiple business subjects |
| Data sources | One | Many |
| Data model | Static | Dynamic |
| Dominant query type | Insert/update | Read |
| Data volume per transaction | Small | Big |
| Data volume | Small/medium | Large |
| Data currency | Current timestamp | Seconds to days old |
| Bulk load/insert/update | No | Yes |
| Full history available | No | Yes |
| Response times | < 1 second | < 10 seconds |
| System availability | 24/7 | 8/5 |
| Typical user | Front office | Staff |
| Number of users | Large | Small/medium |
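To make the contrast in dominant query types concrete, here is a small, hypothetical sketch (plain Python with SQLite rather than Kettle; the orders table and its columns are invented for illustration): an OLTP-style lookup and update of a single record by key, followed by a warehouse-style query that scans and aggregates every row.

```python
import sqlite3

# Hypothetical data set: 100,000 order rows in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER, amount REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, i % 1000, float(i % 50), "2010-01-01") for i in range(1, 100_001)],
)

# OLTP-style access: fetch one record by its key, then update that one record.
order = conn.execute("SELECT * FROM orders WHERE order_id = ?", (42,)).fetchone()
conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (99.95, 42))

# Warehouse-style access: a single query that scans and aggregates the whole table.
totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
).fetchall()

print(order, len(totals))
```

The first pattern touches one row through an index; the second must read the entire table, which is why the two workloads place such different demands on a database.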
Of course, it’s not as black and white as this table might indicate. The distinctions listed are a rather classic way of looking at the two types of systems. More and more often, business intelligence systems are being used as part of the primary business process. A call center agent might have a screen in front of her with not only customer details such as name and address, but also information about order and payment history retrieved from an operational data store (ODS) or a data warehouse. Many CRM systems are already capable of showing a credit or customer score on-the-fly, items that have been pre-calculated in the data warehouse and are available on demand for front office workers. This means that the more the data warehouse is used for operational purposes, the more the same requirements apply as for OLTP systems, especially regarding system availability and data currency.
Probably the most discussed characteristic of the data warehouse is the required response time. Ten years ago, it wasn’t a problem when a report query took one or two minutes to retrieve and display its data. Nowadays users expect response times similar to what they’re accustomed to when using a search engine. More than ten seconds and users get impatient, start clicking refresh buttons (which will sometimes re-issue the query, making the problem even worse), and eventually avoid using the data warehouse because it’s so slow. On the other hand, when the data warehouse is used for data mining purposes, analysts find a response time of several hours totally acceptable, as long as the result to their inquiry is valuable.
What Is ETL?
You know of course that ETL is short for extract, transform, and load; no secrets here. But what exactly do we mean by ETL? A simple definition could be “the set of processes for getting data from OLTP systems into a data warehouse.” Looking at the roots of ETL, that is probably a viable definition, but for modern ETL solutions it grossly oversimplifies the term. Data comes not only from OLTP systems but also from websites, flat files, e-mail databases, spreadsheets, and personal databases such as Access. ETL is not only used to load a single data warehouse but can serve many other use cases, such as loading data marts, generating spreadsheets, scoring customers using data mining models, or even loading forecasts back into OLTP systems. The main ETL steps, however, can still be grouped into three sections (a minimal code sketch follows the list):
- 1. Extract: All processing required to connect to various data sources, extract the data from these data sources, and make the data available to the subsequent processing steps. This may sound trivial but can in fact be one of the main obstacles in getting an ETL solution off the ground.
- 2. Transform: Any function applied to the extracted data between the extraction from sources and loading into targets. These functions can contain (but are not limited to) the following operations:
- Movement of data
- Validation of data against data quality rules
- Modification of the content or structure of the data
- In...
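As a rough, language-agnostic sketch of the three sections above (plain Python rather than Kettle; the source file, column names, and target table are invented for illustration), an ETL run might look like this:

```python
import csv
import sqlite3

# Extract: connect to a source and pull the raw records.
# "customers.csv" is a hypothetical source file used only for illustration.
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: validate against simple data quality rules and modify content/structure.
def transform(rows):
    clean = []
    for row in rows:
        if not row.get("email") or "@" not in row["email"]:
            continue  # validation: drop rows that fail a quality rule
        clean.append({
            "name": row["name"].strip().title(),  # modification of content
            "email": row["email"].lower(),
        })
    return clean

# Load: write the transformed records into the target (here an in-memory SQLite table).
def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS dim_customer (name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO dim_customer (name, email) VALUES (:name, :email)", rows
    )
    conn.commit()

if __name__ == "__main__":
    target = sqlite3.connect(":memory:")
    load(transform(extract("customers.csv")), target)
```

In Kettle, each of these functions corresponds to one or more steps in a transformation, designed graphically in Spoon rather than coded by hand.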
Table of contents
- Cover
- Table of Contents
- Title
- Copyright
- Dedication
- About the Authors
- Credits
- Acknowledgments
- Introduction
- The Origins of Kettle
- About This Book
- How This Book Is Organized
- Prerequisites
- On the Website
- Further Resources
- Part I: Getting Started
- Part II: ETL
- Part III: Management and Deployment
- Part IV: Performance and Scalability
- Part V: Advanced Topics
- Appendix A: The Kettle Ecosystem
- Appendix B: Kettle Enterprise Edition Features
- Appendix C: Built-in Variables and Properties Reference
- Index
- End User License Agreement