WHAT ARE METADATA?*
During the 1990s, the concept of data warehouse** swept the information technology industry. After many years of trying, it appears finally to be possible for a company to store all of its data in one place for purposes of reporting and analysis. The technology for doing this is still new, and the first attempts have had mixed results, but the effort has been quite serious.
One of the problems that arose from this effort was the realization that if a senior executive is going to ask a giant database a question, it is necessary to know just what is in the database and what types of questions to ask. In addition to the data themselves, therefore, it is necessary to keep data about the data. The term coined for “data about data” during the 1990s was metadata.
Since then, numerous books and magazine articles have been published on this subject, but most have focused on why metadata are important and on technologies and techniques for managing them. What these publications have left out is a clear description of exactly what the stuff is. After a decade, there is still no simple, clear description of metadata in a form that is both comprehensive enough to cover our industry and comprehensible enough that it can be used by people. This book is an attempt to produce such a description.
As with all buzzwords, once invented the term metadata has taken on a life of its own. It is variously described as:
• Any data about the organization’s data resource [Brackett 2000, p. 149].
• All physical data and knowledge from inside and outside an organization, including information about the physical data, technical and business processes, rules and constraints of the data, and structures of the data used by a corporation [Marco 2000, p. 5].
• The detailed description of instance data. The format and characteristics of populated instance data: instances and values, dependent on the role of the metadata recipient [Tannenbaum 2002, p. 93].
Several significant points come out of these definitions. First, as Mr. Marco pointed out, there is a difference between business metadata and technical metadata. The business user of metadata is interested in definitions and in the structure of the language used to name the types of information to be retrieved. The technician is concerned with the physical technologies used to store and manage data. Both of these points of view are important, and both must be addressed.
Second, the subject is concerned with more than just data. It is, as Mr. Brackett said, “any data about an organization’s data resource.” Once you have started looking at the structure of an organization’s data, you have to also account for its activities, people and organizations, locations, timing and events, and motivation.
Third, as Ms. Tannenbaum pointed out, the “meta” aspect of the question is a matter of point of view. There is metadata relative to the data collected by the business. There is also meta-metadata, which is used to understand and manage the metadata.*
This last point is illustrated in Figure 1-1. Here, the bottom row shows examples of things in the world that are often described in information systems. “Julia Roberts” is a real human being. The “Wall Street branch” of a bank is a physical place where business is performed. Checking account “09743569” is a particular account held in that bank by a particular customer (Julia Roberts, for example). The customer of that account may then perform an actual “ATM Withdrawal” at a specific time.
Fig. 1-1 Data and metadata.
The next row up shows, in the first three columns, the data that might describe those three things: (1) A Customer has the name “Julia Roberts” and the “Birthdate” of “10/28/67”. (2) A Branch has the address “111 Wall Street” and a manager, “Sam Sneed”. (3) The checking account has an account number “09743569” and a monthly charge, “$4.50”. In the fourth column, this second row from the bottom shows that a particular program, called here “Java code”, is responsible for a “Withdrawal Transaction”. These are the things that would concern a person managing data for a banking business. Note that each of the terms was described as to what it was: customer name, branch manager, account number, and so forth.
The third row from the bottom collects those descriptors and labels them in turn. This is to create what we in the data administration world call the metadata. There are two components to these labels. First are the names of the things of significance being described by the business data, such as the entity classes “Customer” and “Branch”. Second, each of these is in turn described by attributes, such as “Name”, “Address”, and “Birthdate”. We also discover, in the case of the bank branch, that there is really an additional entity class, “Manager”, and that it is related to “Branch”. (“Each Branch must be managed by exactly one Employee.”)
In the checking account column, we see that a checking account is actually the subject of a table in a database. The table is called “CHECKING_ACCOUNT” and has columns “Account_number” and “Monthly_charge”. The ATM program described in the second row simply as “Java code” is actually a program module with the name “ATM Controller” written in the language “Java”. As we can see, the metadata row itself encompasses several different types of objects (“Entity class”, “Attribute”, “Table”, “Column”, “Program module”, and “Language”). The assignment of this book, represented by the top row, is to show how these objects relate to one another.
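The layering in the checking account column can be made concrete with a small sketch. It is only an illustration under stated assumptions, not a model taken from this book: the Python classes `Table` and `Column` and the helpers `describes` and `object_types` are hypothetical names chosen for the example, while the table name, column names, and values come from Figure 1-1.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Business data (second row of Figure 1-1): one checking account record.
checking_account_row: Dict[str, str] = {
    "Account_number": "09743569",
    "Monthly_charge": "$4.50",
}


# Metadata (third row): objects that describe the structure of that record.
# Table and Column are hypothetical classes used only for this illustration.
@dataclass
class Column:
    name: str


@dataclass
class Table:
    name: str
    columns: List[Column] = field(default_factory=list)


checking_account = Table(
    "CHECKING_ACCOUNT",
    [Column("Account_number"), Column("Monthly_charge")],
)


def describes(table: Table, row: Dict[str, str]) -> bool:
    """Return True when every field in the row is a column the table declares."""
    declared = {column.name for column in table.columns}
    return set(row) <= declared


def object_types(table: Table) -> List[str]:
    """List the kinds of metadata objects in use (the concern of the top row)."""
    kinds = {type(table).__name__}
    kinds.update(type(column).__name__ for column in table.columns)
    return sorted(kinds)
```

Here `describes(checking_account, checking_account_row)` checks the business data against its metadata, and `object_types(checking_account)` steps up one more level to name the kinds of metadata objects involved.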
Metadata don’t just describe data. They describe how the organization understands not only its data, but also its activities, people and organizations, geography, timing, and motivation. Yes, metadata describe the entity classes and attributes of an entity-relationship model, and the tables and columns by which these are implemented in a computer system. They also provide, however, structure for describing the activities of the organization and the computerized processes that implement these activities. They describe who has access to data, and why. They describe the types of events and responses that are the nature of an organization’s activities. They describe where the data and processes are, and they describe the motivation and business rules that drive the entire thing. So, from all of this comes the following definition of metadata.
Metadata are the data that describe the structure and workings of an organization’s use of information, and which describe the systems it uses to manage that information.
One anomaly has revealed itself in the line between business data and metadata. The information about what constitutes a legal value for a product category or an account type in the business model is often captured in separate reference tables. To reflect these validation structures, a typical data model often has many “type” entity classes (ACCOUNT TYPE, STATUS, DAY OF THE WEEK, and so on) describing legal values for attribu...