Temporal Data Mining via Unsupervised Ensemble Learning
eBook - ePub

  1. 172 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Temporal Data Mining via Unsupervised Ensemble Learning provides the principal knowledge of temporal data mining in association with unsupervised ensemble learning, and examines the fundamental problems of temporal data clustering from different perspectives. By presenting three proposed ensemble approaches to temporal data clustering, this book offers a practical focus on fundamental knowledge and techniques, along with a rich blend of theory and practice. Furthermore, the book includes illustrations of the proposed approaches based on data and simulation experiments to demonstrate all methodologies, and serves as a guide to their proper usage. As no universal method can solve all problems, it is important to understand the characteristics of both the clustering algorithms and the target temporal data so that the correct approach can be selected for a given clustering problem. Scientists, researchers, and data analysts working with machine learning and data mining will benefit from this innovative book, as will undergraduate and graduate students following courses in computer science, engineering, and statistics.

  • Includes fundamental concepts and knowledge, covering all key tasks and techniques of temporal data mining, i.e., temporal data representations, similarity measures, and mining tasks
  • Concentrates on temporal data clustering tasks from different perspectives, including major clustering algorithms and ensemble learning approaches
  • Presents a rich blend of theory and practice, addressing seminal research ideas and looking at the technology from a practical point of view

Temporal Data Mining via Unsupervised Ensemble Learning by Yun Yang is available in PDF and ePUB formats, in the category Computer Science & Artificial Intelligence (AI) & Semantics.
Chapter 1

Introduction

Abstract

Machine learning, data mining, temporal data clustering, and ensemble learning are very popular research topics in computer science and related subjects. The knowledge and information addressed in this book are not only essential for graduate students but also useful for professionals who want to enter this field. In this chapter, we give an overall picture of the book by introducing the background knowledge, the problem statement, and the objectives and overview of the book.

Keywords

Classification; Clustering; Machine learning; Supervised learning; Temporal data mining; Unsupervised learning

1.1. Background

Unsupervised classification, or clustering, provides an effective way of condensing and summarizing the information conveyed in data, which is demanded by a number of application areas for organizing or discovering structure in data. The objective of clustering analysis is to partition a set of unlabeled objects into groups, or clusters, such that all the objects grouped in the same cluster are coherent or homogeneous. There are two core problems in clustering analysis: model selection and proper grouping. The former seeks a solution that estimates the intrinsic number of clusters underlying a data set, while the latter demands a rule that groups coherent objects together to form a cluster. From the perspective of machine learning, clustering analysis is an extremely difficult unsupervised learning task, since it is inherently an ill-posed problem whose solution often violates some common assumptions (Kleinberg, 2003). There has been a great deal of research on clustering analysis (Jain et al., 1999), which has led to various clustering algorithms categorized as partitioning, hierarchical, density-based, and model-based clustering algorithms.
Temporal data are a collection of observations associated with information such as the time at which the data were captured and the time interval during which a data value is valid. Temporal data take two main forms: a sequence of nominal symbols from an alphabet, known as a temporal sequence, and a sequence of continuous real-valued elements, known as a time series. The use of temporal data has become widespread in recent years, and temporal data mining continues to be a rapidly evolving area of interrelated disciplines, including statistics, temporal pattern recognition, temporal databases, optimization, visualization, high-performance computing, and parallel computing.
However, recent empirical studies in temporal data analysis reveal that most of the existing clustering algorithms do not work well for temporal data because of its special structure and data dependency (Keogh and Kasetty, 2003). This presents a big challenge in clustering temporal data of high and varying dimensionality, large volume, very high feature correlation, and a substantial amount of noise.
Recently, several studies have attempted to improve clustering by combining multiple clustering solutions into a single consolidated clustering ensemble, achieving better average performance than the given clustering solutions. This has led to many real-world applications, including gene classification, image segmentation (Hong et al., 2008), video retrieval, and so on (Jain et al., 1999; Fischer and Buhmann, 2003; Azimi et al., 2006). Clustering ensembles usually involve two stages. First, multiple partitions are obtained through several runs of initial clustering analysis. Subsequently, a specific consensus function is used to find a final consensus partition from the multiple input partitions. This book concentrates on ensemble learning techniques and their application to temporal data clustering tasks based on three methodologies: the model-based approach, the proximity-based approach, and the feature-based approach.
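To make the two-stage procedure concrete, the sketch below implements a simple evidence-accumulation consensus function: a co-association matrix records how often each pair of objects is grouped together across the input partitions, and pairs whose co-association exceeds a threshold are linked into final clusters. This is a minimal stdlib-only illustration, not one of the book's proposed approaches, and the 0.5 threshold is an arbitrary choice for the example.

```python
from itertools import combinations

def co_association(partitions):
    """Entry (i, j) is the fraction of base partitions that place
    objects i and j in the same cluster."""
    n = len(partitions[0])
    m = len(partitions)
    co = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                co[i][j] += 1.0 / m
                co[j][i] += 1.0 / m
    return co

def consensus_clusters(partitions, threshold=0.5):
    """Consensus function: link objects whose co-association exceeds
    the threshold and return connected components as final clusters."""
    co = co_association(partitions)
    n = len(co)
    labels = [-1] * n
    cluster = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = cluster
        while stack:                      # depth-first component search
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

# Three base partitions of five objects; they mostly agree that
# objects {0, 1, 2} and {3, 4} belong together.
partitions = [
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],   # same grouping, different label names
    [0, 0, 1, 1, 1],
]
print(consensus_clusters(partitions))  # → [0, 0, 0, 1, 1]
```

Note that the second base partition uses different label names but the same grouping; the co-association matrix makes the consensus invariant to such label permutations.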
The model-based approach aims to construct statistical models to describe the characteristics of each group of temporal data, providing more intuitive ways to capture dynamic behaviors and a more flexible means of dealing with the variable lengths of temporal data. In general, the entire temporal data set is modeled by a mixture of these statistical models, while an individual statistical model, such as a Gaussian distribution, a Poisson distribution, or a Hidden Markov Model (HMM), is used to model a specific cluster of temporal data. Model-based approaches for temporal data clustering include HMMs (Panuccio et al., 2009), the Gaussian mixture model (Fraley and Raftery, 2002), mixtures of first-order Markov chains (Smyth, 1999), dynamic Bayesian networks (Murphy, 2002), and the autoregressive moving average model (Xiong and Yeung, 2002). Usually, these are combined with an expectation-maximization algorithm (Bilmes, 1998) for parameter estimation.
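As a minimal illustration of the mixture-plus-EM idea, the sketch below fits a two-component one-dimensional Gaussian mixture by expectation-maximization and assigns each observation to the most responsible component. Real temporal data would typically call for sequence models such as HMMs; the deterministic min/max initialization, fixed iteration count, and variance floor are simplifications made for this example.

```python
import math

def em_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture by EM; return the
    component means and a hard cluster label for each point."""
    k = 2
    mu = [min(xs), max(xs)]          # deterministic initialization
    var = [1.0] * k
    w = [0.5] * k                    # mixing weights
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2
                         for r, x in zip(resp, xs)) / nj + 1e-6
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return mu, labels

# Two well-separated groups of observations
xs = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.2]
mu, labels = em_gmm_1d(xs)
print(sorted(mu))   # component means settle near 1.05 and 5.0
print(labels)
```

The E-step computes each component's posterior responsibility for every point; the M-step re-estimates the parameters as responsibility-weighted averages, which is exactly the alternation the EM references above describe.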
The proximity-based approach is mainly based on measuring the similarity or distance between each pair of temporal data. The most common methods are agglomerative and divisive clustering (Jain et al., 1999), which partition the unlabeled objects into different groups so that, according to the similarity metric, members of the same group are more alike than members of different groups. For proximity-based clustering, either the Euclidean distance or a more advanced measure such as the Mahalanobis distance (Bar-Hillel et al., 2006) is commonly used as the basis for comparing the similarity of two sets of temporal data.
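The agglomerative strategy can be sketched in a few lines: start from singleton clusters and repeatedly merge the two clusters with the smallest single-linkage Euclidean distance until the desired number of clusters remains. This is an illustrative sketch on short, equal-length vectors; a real temporal application would substitute a sequence-appropriate distance.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, k):
    """Agglomerative clustering: start with singletons and repeatedly
    merge the two closest clusters (single linkage) until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = (float("inf"), 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.2)]
print(single_linkage(points, 2))  # → [[0, 1, 4], [2, 3]]
```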
The feature-based approach is an indirect approach to temporal data clustering: it begins with the extraction of a set of features from the raw temporal data, so that all temporal data can be transformed into a static feature space, and classical vector-based clustering algorithms can then be applied within that feature space. Obviously, feature extraction is the essential factor that determines the performance of clustering. Generally, feature-based clustering reduces the computational complexity for high-dimensional temporal data.
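A sketch of this pipeline: map time series of different lengths to fixed-length feature vectors, then cluster the vectors with plain k-means. The feature set here (mean, standard deviation, average slope) and the deterministic initialization are assumptions made purely for illustration.

```python
import math

def extract_features(series):
    """Map a time series of any length to a fixed-length feature
    vector: (mean, standard deviation, average slope)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    trend = (series[-1] - series[0]) / (n - 1)
    return (mean, std, trend)

def kmeans(vectors, k, iters=20):
    """Plain k-means in the static feature space."""
    centers = list(vectors[:k])        # deterministic initialization
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:              # assign to the nearest center
            j = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centers[c])))
            groups[j].append(v)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g
                   else centers[j] for j, g in enumerate(groups)]
    return [min(range(k), key=lambda c: sum(
        (a - b) ** 2 for a, b in zip(v, centers[c]))) for v in vectors]

# Series of different lengths: two rising trends, two flat ones
series = [
    [0, 1, 2, 3, 4, 5],
    [0, 2, 4, 6],
    [3, 3, 3, 3, 3],
    [2.9, 3.1, 3.0, 2.9, 3.1, 3.0],
]
feats = [extract_features(s) for s in series]
print(kmeans(feats, 2))
```

The two rising series end up in one cluster and the two flat series in the other, even though no pair of inputs has the same length; in the raw series space a vector-based algorithm could not compare them at all.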

1.2. Problem Statement

Although clustering algorithms have been developed intensively over the last decades, the natural complexity of temporal data means that we still face many challenges in temporal data clustering tasks.
How to select the intrinsic number of clusters is still a critical model selection problem in many clustering algorithms. In a statistical framework, model selection is the task of selecting a mixture of the appropriate number of mathematical models, with the appropriate parameter setup, that fits the target data set by optimizing some criterion; in other words, the model selection problem is solved by optimizing a predefined criterion. Among the common model selection criteria, the Akaike information criterion, AIC (Akaike, 1974), balances the goodness of fit of a statistical model, based on the maximum log-likelihood, against model complexity, based on the number of model parameters. The optimal number of clusters is the one with the minimum value of AIC. Based on Bayesian model selection principles, the Bayesian information criterion, BIC (Schwarz, 1978), takes a similar approach to AIC; however, whereas AIC requires only the number of parameters and the maximum log-likelihood, the computation of BIC additionally requires the number of observations. Monte-Carlo cross validation (Smyth, 1996) repeatedly divides a data set into training and test sets at random. In each run, the training set is used to estimate the best-fitting parameters while the test set is used to compute the model's error; the optimal number of clusters is selected by posterior probabilities or a criterion function. Recently, the Bayesian Ying-Yang machine (Xu, 1996) has been applied to model selection in clustering analysis (Xu, 1997). It treats the unsupervised learning problem as one of estimating the joint distribution between the observable pattern in the observable space and its representation pattern in the representation space. In theory, the optimal number of clusters is given by the minimum value of a cost function.
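For illustration, AIC and BIC can be computed directly from their definitions, AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of free parameters, n the number of observations, and ln L the maximum log-likelihood. The log-likelihood values below are hypothetical, and for simplicity each mixture component is counted as three free parameters.

```python
import math

def aic(log_l, n_params):
    # AIC = 2k - 2 ln L: penalizes the number of free parameters k
    return 2 * n_params - 2 * log_l

def bic(log_l, n_params, n_obs):
    # BIC = k ln n - 2 ln L: the penalty also grows with sample size n
    return n_params * math.log(n_obs) - 2 * log_l

# Hypothetical maximum log-likelihoods from fitting mixtures with
# 1..4 components to 200 observations; for simplicity each 1-D
# Gaussian component is counted as 3 free parameters
# (weight, mean, variance).
n_obs = 200
fits = {1: -520.0, 2: -455.0, 3: -453.0, 4: -452.5}

for k, log_l in fits.items():
    print(k, round(aic(log_l, 3 * k), 1), round(bic(log_l, 3 * k, n_obs), 1))
# Both criteria are minimized at k = 2: the small gain in fit beyond
# two components does not justify the extra parameters.
```

Because the BIC penalty scales with ln n, it prefers smaller models than AIC on large data sets, which is one source of the over- and underestimation effects discussed below.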
In addition, other model selection criteria include minimum message length (Grunwald et al., 1998), minimum description length (Grunwald et al., 1998), and the covariance inflation criterion (Tibshirani and Knight, 1999). However, recent empirical studies of model selection (Zucchini, 2000; Hu and Xu, 2003) reveal that most of the existing criteria have different limitations and often overestimate or underestimate the number of clusters. The performance of these criteria depends on the structure of the target data set, and no single criterion emerges as outstanding when measured against the others. Moreover, a major problem associated with these model selection criteria remains: the computational procedures involved are extremely complex and time consuming.
How to significantly reduce the computational cost is another important issue for temporal data clustering, because temporal data are often collected in data sets of large volume, high and varying dimensionality, and complex clustered structure. From the perspective of model-based temporal data clustering, Zhong and Ghosh (2003) proposed a model-based hybrid partitioning-hierarchical clustering and variants such as HMM-based hierarchical meta-clustering. The first approach is an improved version of model-based agglomerative clustering that keeps some hierarchical structure; by combining it with HMM-based K-models clustering, the complexity of the input data to the agglomerative clustering is relatively reduced, so this approach requires less computational cost. Moreover, HMM-based hierarchical meta-clustering further reduces the computational cost because merged component models are not re-estimated as a composite model. However, both of them are still quite time consuming in comparison with most proximity-based and representation-based approaches, and the aforementioned model selection problem is still unavoidable. From the perspective of proximity-based temporal data clustering, the K-means algorithm is effective in clustering large-scale data sets, and efforts have been made to overcome its disadvantages (Huang, 1998; Ordonez and Omiecinski, 2004), which potentially provides a clustering solution for large-volume temporal data. Sampling-based approaches such as Clustering LARge Applications (CLARA) (Kaufman and Rousseeuw, 1990) and Clustering Using REpresentatives (CURE) (Guha et al., 1998) reduce the computational cost by applying an appropriate sampling technique to the entire large-volume data set.
Condensation-based approaches such as Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) (Zhang et al., 1996) construct compact summaries of the original data in a Cluster Feature (CF) tree, which captures the clustering information and significantly reduces the computational burden. Density-based approaches such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Ester et al., 1996) and Ordering Points To Identify the Clustering Structure (OPTICS) (Ankerst et al., 1999) are able to determine a complex clustered structure automatically by finding the dense areas of a data set. Although each of these algorithms performs well in clustering large-volume data sets, most of them have difficulty dealing with temporal data of varying length. From the perspective of representation-based temporal data clustering, the computational cost can be significantly reduced by projecting temporal data of varying length and high dimensionality into a uniform, lower-dimensional representation space, where most of the existing clustering algorithms can be applied. However, our previous study (Yang and Chen, 2011a) has shown that no single representation technique can perfectly represent all the different temporal data sets; each captures only a limited amount of the characteristics of a temporal data set.
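One widely used projection of this kind is the piecewise aggregate approximation (PAA) (Keogh et al., 2001), which splits a series into equal-width segments and keeps each segment's mean, so that series of different lengths land in the same low-dimensional space. A minimal sketch:

```python
def paa(series, segments):
    """Piecewise Aggregate Approximation: split a series into
    equal-width segments and keep each segment's mean, yielding a
    fixed-length, lower-dimensional representation."""
    n = len(series)
    rep = []
    for s in range(segments):
        # integer arithmetic spreads any remainder across segments
        start = s * n // segments
        end = (s + 1) * n // segments
        chunk = series[start:end]
        rep.append(sum(chunk) / len(chunk))
    return rep

# Two series of different lengths are mapped to the same 4-D space,
# where any vector-based clustering algorithm can compare them.
a = [1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9, 5, 5, 5, 5]
b = [1, 1, 5, 5, 9, 9, 5, 5]
print(paa(a, 4))  # → [1.0, 5.0, 9.0, 5.0]
print(paa(b, 4))  # → [1.0, 5.0, 9.0, 5.0]
```

The averaging also smooths high-frequency noise, but, consistent with the finding quoted above, it discards any structure finer than the segment width, so no single representation suits every data set.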
How to thoroughly extract the important features from the original temporal data is the concern of the representation methods. Various representations have been proposed for temporal data (Faloutsos et al., 1994; Dimitrova and Golshani, 1995; Chen and Chang, 2000; Keogh et al., 2001; Chakrabarti et al., 2002; Bashir, 2005; Cheong et al., 2005; Bagnall et al., 2006; Gionis et al., 2007; Ding et al., 2008; Ye and Keogh, 2009), and variants such as the multiple-scaled representation (Lin et al., 2004) continue to be proposed to improve temporal data clustering performance. Nev...

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. List of Figures
  6. List of Tables
  7. Acknowledgments
  8. Chapter 1. Introduction
  9. Chapter 2. Temporal Data Mining
  10. Chapter 3. Temporal Data Clustering
  11. Chapter 4. Ensemble Learning
  12. Chapter 5. HMM-Based Hybrid Meta-Clustering in Association With Ensemble Technique
  13. Chapter 6. Unsupervised Learning via an Iteratively Constructed Clustering Ensemble
  14. Chapter 7. Temporal Data Clustering via a Weighted Clustering Ensemble With Different Representations
  15. Chapter 8. Conclusions, Future Work
  16. Appendix
  17. References
  18. Index