Pentaho Kettle Solutions
eBook - ePub

Pentaho Kettle Solutions

Building Open Source ETL Solutions with Pentaho Data Integration

  1. English
  2. ePUB (mobile friendly)
  3. Available on iOS & Android
eBook - ePub

Pentaho Kettle Solutions

Building Open Source ETL Solutions with Pentaho Data Integration

About this book

A complete guide to Pentaho Kettle, the Pentaho Data lntegration toolset for ETL

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions—before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.

  • Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
  • Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
  • Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
  • Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed "cloud"

Get the most out of Pentaho Kettle and your data warehousing with this detailed guide—from simple single table data migration to complex multisystem clustered data integration tasks.

Frequently asked questions

Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription.
At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.
Perlego offers two plans: Essential and Complete
  • Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
  • Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.
Both plans are available with monthly, semester, or annual billing cycles.
We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.
Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.
Yes! You can use the Perlego app on both iOS or Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app.
Yes, you can access Pentaho Kettle Solutions by Matt Casters,Roland Bouman,Jos van Dongen in PDF and/or ePUB format, as well as other popular books in Computer Science & Data Warehousing. We have over one million books available in our catalogue for you to explore.

Information

Publisher
Wiley
Year
2010
Print ISBN
9780470635179
eBook ISBN
9780470947524

Part I
Getting Started

Chapter 1
ETL Primer

The introduction of this book described the need for data integration. This chapter provides a starting point to the wonderful world of data integration and explains the differences and similarities among the three main forms of data integration: ETL, ELT, and EII. To fully understand the reasoning behind using a data warehouse and an ETL solution to load and update data, we start by explaining the differences between a transaction and an analysis database.

OLTP versus Data Warehousing

The first question one might ask is how source data systems differ from business intelligence (BI) systems (sometimes still called decision support systems or DSS). An individual transaction system, often denoted by the acronym OLTP (short for OnLine Transaction Processing), needs to be able to very quickly retrieve a single record of information. When multiple records are needed they are usually tied to a single key that has been retrieved before. Think of an order with the accompanying order lines in an order entry system or a personnel record with all salary and bonus information in an HR system. What’s more: this data often needs to be updated as well, usually just one record at a time.
The biggest difference between an OLTP and a BI database (the data warehouse, or DWH) is the amount of data analyzed in a single transaction. Whereas an OLTP handles many concurrent users and queries touching only a single record or limited groups of records at a time, a data warehouse must have the capability to operate on millions of records to answer a single query. Table 1-1 shows an overview of the major differences between an OLTP and a data warehouse.
Table 1-1: OLTP versus Data Warehouse
Characteristic OLTP Data Warehouse
System scope/view Single business process Multiple business subjects
Data sources One Many
Data model Static Dynamic
Dominant query type Insert/update Read
Data volume per transaction Small Big
Data volume Small/medium Large
Data currency Current timestamp Seconds to days old
Bulk load/insert/update No Yes
Full history available No Yes
Response times < 1 second < 10 seconds
System availability 24/7 8/5
Typical user Front office Staff
Number of users Large Small/medium
Of course, it’s not as black and white as this table might indicate. The distinctions listed are a rather classic way of looking at the two types of systems. More and more often, business intelligence systems are being used as part of the primary business process. A call center agent might have a screen in front of her with not only customer details such as name and address, but also information about order and payment history retrieved from an operational data store (ODS) or a data warehouse. Many CRM systems are already capable of showing a credit or customer score on-the-fly, items that have been pre-calculated in the data warehouse and are available on demand for front office workers. This means that the more the data warehouse is used for operational purposes, the more the same requirements apply as for OLTP systems, especially regarding system availability and data currency.
Probably the most discussed characteristic of the data warehouse is the required response time. Ten years ago, it wasn’t a problem when a report query took one or two minutes to retrieve and display its data. Nowadays users expect response times similar to what they’re accustomed to when using a search engine. More than ten seconds and users get impatient, start clicking refresh buttons (which will sometimes re-issue the query, making the problem even worse), and eventually avoid using the data warehouse because it’s so slow. On the other hand, when the data warehouse is used for data mining purposes, analysts find a response time of several hours totally acceptable, as long as the result to their inquiry is valuable.

What Is ETL?

You know of course that ETL is short for extract, transform, and load; no secrets here. But what exactly do we mean by ETL? A simple definition could be “the set of processes for getting data from OLTP systems into a data warehouse.” When we look at the roots of ETL it’s probably a viable definition, but for modern ETL solutions it grossly over-simplifies the term. Data is not only coming from OLTP systems but from websites, flat files, e-mail databases, spreadsheets, and personal databases such as Access as well. ETL is not only used to load a single data warehouse but can have many other use cases, like loading data marts, generating spreadsheets, scoring customers using data mining models, or even loading forecasts back into OLTP systems. The main ETL steps, however, can still be grouped into three sections:
  1. 1. Extract: All processing required to connect to various data sources, extract the data from these data sources, and make the data available to the subsequent processing steps. This may sound trivial but can in fact be one of the main obstacles in getting an ETL solution off the ground.
  2. 2. Transform: Any function applied to the extracted data between the extraction from sources and loading into targets. These functions can contain (but are not limited to) the following operations:
    • Movement of data
    • Validation of data against data quality rules
    • Modification of the content or structure of the data
    • In...

Table of contents

  1. Cover
  2. Table of Contents
  3. Title
  4. Copyright
  5. Dedication
  6. About the Authors
  7. Credits
  8. Acknowledgments
  9. Introduction
  10. The Origins of Kettle
  11. About This Book
  12. How This Book Is Organized
  13. Prerequisites
  14. On the Website
  15. Further Resources
  16. Part I: Getting Started
  17. Part II: ETL
  18. Part III: Management and Deployment
  19. Part IV: Performance and Scalability
  20. Part V: Advanced Topics
  21. Appendix A: The Kettle Ecosystem
  22. Appendix B: Kettle Enterprise Edition Features
  23. Appendix C: Built-in Variables and Properties Reference
  24. Index
  25. End User License Agreement