RDF Database Systems

Triples Storage and SPARQL Query Processing

Olivier Curé and Guillaume Blin

About this book

RDF Database Systems is a cutting-edge guide that distills everything you need to know to effectively use or design an RDF database. The book starts with the basics of linked open data and covers the most recent research, practice, and technologies to help you leverage semantic technology. Combining technical detail with theoretical background, it shows how to design and develop Semantic Web applications, data models, and indexing and query processing solutions.
  • Understand the Semantic Web, RDF, RDFS, SPARQL, and OWL within the context of relational database management and NoSQL systems
  • Learn about the prevailing RDF triple storage solutions for both relational and non-relational databases, including NoSQL column family, document, and graph stores
  • Implement systems using RDF data with helpful guidelines and various storage solutions for RDF
  • Process SPARQL queries with detailed explanations of query optimization, query plans, caching, and more
  • Evaluate which approaches and systems to use when developing Semantic Web applications, with a helpful description of commercial and open-source systems


Chapter One

Introduction

Abstract

This chapter motivates the importance of RDF data management through the Big Data and Web of Data/Semantic Web phenomena. It also provides some insights into existing RDF stores and presents the dimensions used in this book to compare these systems.

Keywords

RDF
Web of Data
Semantic Web
Big Data
Data management

If you have this book in your hands, we guess you are interested in database management systems in general, and more precisely in those handling the Resource Description Framework (RDF) as a data representation model. We believe it is the right time to study such systems: they are receiving growing attention in industry, where communities of Web developers and information technology (IT) experts are designing innovative applications; in universities and engineering schools, through introductory and advanced courses; and in both academic and industrial research, where novel approaches to manage large RDF data sets are being designed and implemented. We can identify several reasons motivating this enthusiasm.
In this introductory chapter, we will concentrate on two particularly important aspects. The first, and most obvious, is the role played by RDF in the emergence of the Web of Data and the Semantic Web, both extensions of the original World Wide Web. The second is the impact of this data model on the Big Data phenomenon. Building on these motivations, we will introduce the main characteristics of an RDF data management system and present some of its most interesting features, which support the comparison of existing systems.

1.1. Big data

Big Data is much more than a buzzword. It can be considered a concrete phenomenon that attracts all kinds of companies facing strategic and decisional issues, as well as scientific fields interested in understanding complex observations that may lead to important discoveries. The National Institute of Standards and Technology (NIST) has proposed a widely adopted definition referring to the data deluge: “a digital data volume, velocity and/or variety that enable novel approaches to frontier questions previously inaccessible or impractical using current or conventional methods; and/or exceed the capacity or capability of current or conventional methods and systems” (NIST Big Data, 2013). Most Big Data definitions integrate this aspect of the three V’s: volume, velocity, and variety (sometimes extended with a fourth V, for veracity).
Volume implies that the amounts of data being produced cannot be stored and/or processed on a single machine, but require distribution over a cluster of machines. The challenge is, for example, to enable the loading and processing of exabytes (i.e., 10³ petabytes or 10⁶ terabytes) of data, while we are currently used to data loads in the range of at most terabytes.
Velocity implies that data may be produced at a throughput that cannot be handled by current methods. Solutions, such as relaxing transaction properties, storing incoming data on several servers, or using novel storage approaches to prevent input/output latencies, are being proposed and can even be combined to address this issue. Nevertheless, they generally come with limitations and drawbacks, which are detailed in Chapter 2.
Variety concerns the format (e.g., Microsoft’s Excel [XLS], eXtensible Markup Language [XML], comma-separated values [CSV], or RDF) and structure conditions of the data. Three main conditions exist: structured, semi-structured, and unstructured. Structured data implies a strict representation where data is organized in entities; similar entities are grouped together and described with the same set of attributes, such as an identifier, price, brand, or color. This information is stored in an associated schema that assigns a type to each attribute; for example, a price is a numerical value. The data organization of a relational database management system (RDBMS) is a typical example of this approach. The notion of semi-structured (i.e., self-describing) data also adopts an entity-centered organization but introduces some flexibility: entities of a given group may not have the same set of attributes, attribute order is generally not important in the description of an entity, and an attribute may have different types in different entity groups. Common and popular examples are XML, JavaScript Object Notation (JSON), and RDF. Finally, unstructured data provides very little information on the type of data it contains and the set of formatting rules it follows. Intuitively, text, image, sound, and video documents belong to this category.
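To make the semi-structured case concrete, here is a minimal sketch using the Python rdflib library (the http://example.org/ namespace and the property names are illustrative, not taken from this book). Two entities of the same group carry different attribute sets, something a fixed relational schema would not allow without NULLs or schema changes:
```python
# A minimal sketch of semi-structured RDF data, assuming rdflib is installed
# (pip install rdflib). All example.org names are invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

# Two entities of the same "product" group with different attribute sets:
# item1 has a color, item2 does not, and their prices use different types.
g.add((EX.item1, RDF.type, EX.Product))
g.add((EX.item1, EX.price, Literal("19.90", datatype=XSD.decimal)))
g.add((EX.item1, EX.color, Literal("red")))

g.add((EX.item2, RDF.type, EX.Product))
g.add((EX.item2, EX.price, Literal(5)))  # plain integer literal, no color at all

print(g.serialize(format="turtle"))  # rdflib 6+ returns a string
```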
Veracity concerns the accuracy and noise of the data being captured, processed, and stored. For example, considering acquired data, one can ask whether it is relevant to a particular domain of interest, whether it is accurate enough to support decision making, or whether the noise associated with it can be efficiently removed. Therefore, data quality methods may be required to identify and clean “dirty” data, a task that, in the context of the other three V’s, may be considered one of the greatest challenges of the data deluge. Surprisingly, this dimension is the one that has attracted the least attention from Big Data actors.
The emergence of Big Data is tightly related to the increasing adoption of the Internet and the Web. In fact, the Internet provides an infrastructure to capture and transport large volumes of data, while the Web, and more precisely its 2.0 version, has made it easy for the general public to produce information. This happens through interactions with personal blogs (e.g., supported by WordPress), wikis (e.g., Wikipedia), online social networks (e.g., Facebook), and microblogging (e.g., Twitter), and also through activity logs of the most frequently used search engines (e.g., Google, Bing, or Yahoo!), where an important amount of data is produced every day. Among the most stunning recent figures, Facebook announced that, by the beginning of 2014, it was recording 600 terabytes of data each day in its 300-petabyte data warehouse, and Twitter stores an average of around 6,000 tweets per second, with a record of 143,199 tweets per second on August 3, 2013.
The Internet of Things (IoT) is another contributor to the Big Data ecosystem; it is still in its infancy but will certainly become a major data provider. This branch of the Internet is mainly concerned with machine-to-machine (M2M) communications that evolve in a Web environment using Web standards such as Uniform Resource Identifiers (URIs), the HyperText Transfer Protocol (HTTP), and representational state transfer (REST). It focuses on the devices and sensors present in our daily lives, which can belong to either the industrial sector or the consumer market. These active devices include, but are not limited to, smartphones, radio-frequency identification (RFID) tagged objects, wireless sensor networks, and ambient devices.
IoT enables the collection of temporospatial information, that is, information combining temporal as well as spatial aspects. In 2009, considered an early year of IoT, Jeff Jonas was already announcing in his blog (Jonas, 2009) that 600 billion geospatially tagged transactions were generated per day in North America. This ability to produce enormous volumes of data at a high throughput is already a data management challenge, and it will expand in the coming years. Regarding its evolution, a survey conducted by Cisco (Cisco, 2011) estimated that the number of connected devices per person will grow from 1.84 in 2010 to 6.58 in 2020, amounting to approximately 50 billion devices. Almost all of these devices will produce massive amounts of data on a daily basis.
As a market phenomenon, Big Data is not supervised by any consortium or standards body. Therefore, there is total freedom regarding the format of generated, stored, queried, and manipulated data. Nevertheless, the best practices of the major industrial and open-source actors bring forward some popular formats, such as XLS, CSV, XML, JSON, and RDF. The main advantages of JSON are its simplicity, its flexibility (it is schemaless), and its native processing support in most Web applications due to a tight integration with the JavaScript programming language. But RDF is not without assets: as a semi-structured data model, RDF data sets can be described with expressive schema languages, such as RDF Schema (RDFS) or the Web Ontology Language (OWL), and can be linked to other documents present on the Web, feeding the Linked Data movement.
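As an illustrative comparison (a sketch, not an excerpt from the book), the following Python snippet expresses the same record first as schemaless JSON and then as RDF triples whose properties come from a shared vocabulary, here schema.org, so that any consumer can interpret and link them:
```python
# A hedged sketch contrasting JSON and RDF; the book URI is invented
# for illustration, while the schema.org property names are real.
import json
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# JSON: simple and flexible, but the keys carry no globally shared meaning.
record = {"name": "RDF Database Systems", "numberOfPages": 256}
print(json.dumps(record))

# RDF: the same facts, with properties drawn from the schema.org vocabulary.
SCHEMA = Namespace("http://schema.org/")
book = URIRef("http://example.org/book/rdf-database-systems")  # illustrative URI
g = Graph()
g.bind("schema", SCHEMA)
g.add((book, RDF.type, SCHEMA.Book))
g.add((book, SCHEMA.name, Literal("RDF Database Systems")))
g.add((book, SCHEMA.numberOfPages, Literal(256)))
print(g.serialize(format="turtle"))
```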
With the emergence of Linked Data, a pattern for hyperlinking machine-readable data sets that extensively uses RDF, URIs, and HTTP, we can expect more and more data to be directly produced in, or transformed into, RDF. As of 2013, the linked open data (LOD) cloud, a set of RDF data produced from open data sources, was considered to contain over 50 billion triples on domains as diverse as medicine, culture, and science, to name just a few. Two other major sources of RDF data are building up with the RDF in attributes (RDFa) standard, where attributes are to be understood in an (X)HTML context, and the Schema.org initiative, which is supported by Google, Yahoo!, Bing, and Yandex (the largest search engine in Russia). The incentive of being well referenced by these search engines already motivates all kinds of Web contributors (i.e., companies, organizations, etc.) to annotate their web page content with descriptions that can be transformed into RDF data. In the next section, we will present some original functionalities that can be developed with Linked Data, such as querying and reasoning at the scale of the Web.
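To give a flavor of RDFa (a sketch with invented product data; the triple extraction an RDFa processor would perform is emulated by hand here), consider the following annotated markup and the RDF triples it denotes:
```python
# A sketch of how RDFa attributes (vocab/typeof/property) embedded in HTML
# map to RDF triples. Instead of running a real RDFa processor, the
# corresponding triples are built manually with rdflib for illustration.
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF

rdfa_fragment = """
<div vocab="http://schema.org/" typeof="Product">
  <span property="name">Acme Anvil</span>
  <span property="price">49.99</span>
</div>
"""
print(rdfa_fragment)  # the annotated markup an RDFa processor would consume

SCHEMA = Namespace("http://schema.org/")
g = Graph()
product = BNode()  # the <div> introduces an anonymous subject
g.add((product, RDF.type, SCHEMA.Product))
g.add((product, SCHEMA.name, Literal("Acme Anvil")))
g.add((product, SCHEMA.price, Literal("49.99")))
print(g.serialize(format="turtle"))
```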
As a conclusion on Big Data, the direct impact of the original three V’s is a call for new types of database management systems, specifically ones able to handle rapidly arriving, heterogeneous, and very large data sets. Among other things, a major advantage of these systems will be to support novel, more efficient data integration mechanisms. In terms of expected features, Franklin and colleagues (2005) were the first to propose a new paradigm. Their DataSpace Support Platforms (DSSP) are characterized by a pay-as-you-go approach to the integration and querying of data. Basically, this paradigm addresses most of the issues of Big Data. Later, Dong and Halevy (2007) presented more details on possible indexing methods for data spaces. Although they did not mention the term RDF, the authors presented a data model based on triples that matches the kind of systems this book focuses on, which are considered in Part 2 of this book.

1.2. Web of data and the semantic web

The Web, as a global information space, is evolving from linking documents only, to linking both documents and data. This is a major (r)evolution that is already supporting the design of innovative applications. This extension of the Web is referred to as the Web of Data and enables the access to and sharing of information in ways that are much more efficient and open than previous solutions. This efficiency is due to the exploitation of the infrastructure of the Web by allowing links between distributed data sources. Three major technologies form the cornerstone of this emerging Web: URIs, HTTP, and RDF. The first two have been central to the development of the Web since its inception and respectively address the issues of identifying Web resources (e.g., web pages) and supporting data communication for the Web. The latter provides a data representation model, and the management of such data is the main topic of this book.
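Since the RDF data representation model is the main topic of this book, a minimal sketch may help fix ideas (Python with the rdflib library; the resources under http://example.org/ are illustrative). Every RDF statement is a subject-predicate-object triple built on URIs, and SPARQL queries match triple patterns against a graph of such statements:
```python
# A minimal sketch of the RDF triple model and a SPARQL query over it,
# assuming rdflib is installed; example.org resources are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF  # the standard friend-of-a-friend vocabulary

EX = Namespace("http://example.org/")
g = Graph()

# Each statement is a (subject, predicate, object) triple.
g.add((EX.alice, FOAF.knows, EX.bob))
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.bob, FOAF.name, Literal("Bob")))

# A SPARQL query matches triple patterns against the graph.
results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE {
        <http://example.org/alice> foaf:knows ?friend .
        ?friend foaf:name ?name .
    }
""")
for row in results:
    print(row.name)  # prints: Bob
```
How such triples are stored and indexed, and how such queries are processed at scale, are the subjects of Chapters Five and Six.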
The term semantics is getting more and more attention in the IT industry as well as in the open-source ecosystem. It basically amounts to providing solutions for computers to automatically interpret the information present in documents. The interpretation mechanism is usually supported by annotating this information with vocabularies whose elements are given a well-defined meaning in a logical formalism. The logical approach enables dedicated reasoners to perform inferences. Of course, the Web, and in particular the Web of Data, is an important document provider; in that context, we talk about a Semantic Web consisting of a set of technologies that support this whole process. In Berners-Lee et al. (2001), the Semantic Web is defined as “an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation” (p. 1). This emphasizes that there is no rupture between a previous non-semantic Web and a semantic one. Both rely on concepts such as HTTP, URIs, and the stack of representational standards such as HyperText Markup Language (HTML) and Cascading Style Sheets (CSS), along with all accompanying programming technologies like JavaScript, Ruby, and PHP (Hypertext Preprocessor).
The well-defined meaning aspect is related to RDF annotations and to vocabularies expressed in the RDFS and OWL standards. These languages enable the description of schemata associated with RDF data. Together with such vocabularies, reasoning procedures enable the deduction of new information or knowledge from data available in the Web of Data. In fact, one of the sweet spots of RDF and other Semantic Web technologies is data integration, that is, the ability to efficiently integrate new information into a repository. This leverages the Linked Data movement, which provides very large volumes of RDF triples, dereferenceable URIs (i.e., URIs through which a resource can be retrieved using Internet protocols such as HTTP), and a flexible data model. This will support the development and maintenance of novel mashup-based applications (i.e., mixing distinct and previously unrelated information resources) tha...
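As a hedged illustration of such deductions (a naive sketch with invented example.org terms, not a production reasoner), the following snippet applies the RDFS subclass entailment rule by forward chaining until a fixpoint is reached:
```python
# A naive forward-chaining sketch of RDFS subclass reasoning (rule rdfs9):
# if ?x rdf:type ?C and ?C rdfs:subClassOf ?D, then ?x rdf:type ?D.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Novel, RDFS.subClassOf, EX.Book))  # vocabulary (schema-level) triple
g.add((EX.mobyDick, RDF.type, EX.Novel))     # instance (data-level) triple

# Iterate until no new triple is derived; list(...) snapshots the iterators
# so the graph can be modified safely inside the loops.
changed = True
while changed:
    changed = False
    for x, _, c in list(g.triples((None, RDF.type, None))):
        for d in list(g.objects(c, RDFS.subClassOf)):
            if (x, RDF.type, d) not in g:
                g.add((x, RDF.type, d))
                changed = True

print((EX.mobyDick, RDF.type, EX.Book) in g)  # True: deduced, not asserted
```
Real RDF stores implement the full RDFS and OWL rule sets with far more efficient strategies, as discussed in Chapter Eight.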

Table of contents

  1. Cover
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Preface
  6. Chapter One: Introduction
  7. Chapter Two: Database Management Systems
  8. Chapter Three: RDF and the Semantic Web Stack
  9. Chapter Four: RDF Dictionaries: String Encoding
  10. Chapter Five: Storage and Indexing of RDF Data
  11. Chapter Six: Query Processing
  12. Chapter Seven: Distribution and Query Federation
  13. Chapter Eight: Reasoning
  14. Chapter Nine: Conclusion
  15. References
  16. Index