PART I
Introducing XML
- CHAPTER 1: What Is XML?
- CHAPTER 2: Well-Formed XML
- CHAPTER 3: XML Namespaces
Chapter 1
What Is XML?
WHAT YOU'LL LEARN IN THIS CHAPTER:
- The story before XML
- How XML arrived
- The basic format of an XML document
- Areas where XML is useful
- A brief introduction to the technologies surrounding, and associated with, XML
XML stands for Extensible Markup Language (presumably the original authors thought that sounded more exciting than EML) and its development and usage have followed a common path in the software and IT world. It started out more than ten years ago and was originally used by very few; later it caught the public eye and began to pervade the world of data exchange. Subsequently, the tools available to process and manage XML became more sophisticated, to such an extent that many people began to use it without being really aware of its existence. Lately there has been a bit of a backlash in certain quarters over its perceived failings and weak points, which has led to various proposed alternatives and improvements. Nevertheless, XML now has a permanent place in IT systems and it's hard to imagine any non-trivial application that doesn't use XML for either its configuration or data to some degree. For this reason it's essential that modern software developers have a thorough understanding of its principles, what it is capable of, and how to use it to their best advantage. This book can give the reader all those things.
NOTE Although this chapter presents some short examples of XML, you aren't expected to understand all that's going on just yet. The idea is simply to introduce the important concepts behind the language so that throughout the book you can see not only how to use XML, but also why it works the way it does.
STEPS LEADING UP TO XML: DATA REPRESENTATION AND MARKUPS
XML has two main uses. The first is to represent low-level data, such as configuration files. The second is to add metadata to documents; for example, you may want to stress a particular sentence in a report by putting it in italics or bold.
The first use of XML is meant as a replacement for the more traditional ways of representing such data, usually lists of name/value pairs as seen in Windows INI files or Java Properties files. The second use is similar to how HTML files work: the document text is held in an overall container, the <body> element, with individual phrases surrounded by <i> or <b> tags. For both of these scenarios a multiplicity of techniques has been devised over the years. The problem with these disparate approaches has become more apparent than ever with the increased use of the Internet and the spread of distributed applications, particularly those that rely on components designed and managed by different parties. That problem is one of intercommunication. It's certainly possible to design a distributed system that has two components, one outputting data as a Windows INI file and the other turning it into the Java Properties format. Unfortunately, it means a lot of development on both sides, which shouldn't really be necessary and diverts resources from the main objective: developing new functionality that delivers business value.
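To make the intercommunication cost concrete, here is a minimal sketch in Python of the kind of glue code such a system needs. The sample INI content, the `ini_to_properties` name, and the `section.key=value` flattening convention are all assumptions made for illustration; Properties files have no sections, so some such convention has to be invented and agreed on by both parties.

```python
import configparser

# Hypothetical INI content, as one component might emit it.
INI_TEXT = """\
[database]
host = db.example.com
port = 5432
"""

def ini_to_properties(ini_text: str) -> str:
    """Flatten INI sections into 'section.key=value' lines.

    The dotted-key convention is an assumption for this sketch, not a
    rule of either format.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    lines = []
    for section in parser.sections():
        for key, value in parser.items(section):
            lines.append(f"{section}.{key}={value}")
    return "\n".join(lines)

properties_text = ini_to_properties(INI_TEXT)
print(properties_text)
```

Even this small converter embeds decisions (how to flatten sections, how to treat key case) that every pair of communicating components would have to negotiate separately.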
XML was conceived as a solution to this kind of problem; it is meant to make passing data between different components much easier and to remove the need to worry continually about different input and output formats, freeing developers to concentrate on the more important aspects of coding such as the business logic. XML is also seen as a solution to the question of whether files should be easily readable by software or by humans; XML's aim is to be both. You'll be examining the distinction between data-oriented and document-centric XML later in the book, but for now let's look a bit more deeply at what the choices were before XML when there was a need to store or communicate data in an electronic format.
This section takes a mid-level look at data representation, without taking too much time to explain low-level details such as memory addresses and the like. For our purposes, you can store data in files in two ways: as binary or as text.
Binary Files
A binary file, at its simplest, is just a stream of bits (1s and 0s). It's up to the application that created the binary file to understand what all of the bits mean. That's why binary files can only be read and produced by certain computer programs, which have been specifically written to understand them.
For example, when saving a document in Microsoft Word, using a version before 2003, the file created (which has a .doc extension) is in a binary format. If you open the file in a text editor such as Notepad, you won't see anything resembling the original Word document; the best you'll get is the occasional line of text surrounded by gibberish, rather than prose formatted in styles such as bold or italic. The characters in the document other than the actual text are metadata, literally information about information. Mixing data and metadata is both common and straightforward in a binary file. Metadata can specify things such as which words should be shown in bold, what text is to be displayed in a table, and so on. To interpret this file you need the help of the application that created it. Without the help of a converter that has in-depth knowledge of the underlying binary format, you won't be able to open a document created in Word with another similar application such as WordPerfect. The main advantage of binary formats is that they are concise and can be expressed in a relatively small space. This means that more files can be stored (on a hard drive, for example) but, more importantly nowadays, less bandwidth is used when transporting these files across networks.
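The two properties described above, conciseness and dependence on the creating application, can be illustrated with a small Python sketch. The record layout here is invented purely for illustration and has nothing to do with Word's actual file format; the field names and the text form used for comparison are likewise assumptions.

```python
import struct

# Hypothetical record layout, invented for this sketch:
# a 2-byte unsigned ID, a 4-byte unsigned quantity, and a 1-byte bold flag,
# packed little-endian with no padding.
RECORD_FORMAT = "<HIB"

def encode_record(item_id: int, quantity: int, bold: bool) -> bytes:
    """Pack the three fields into their binary form."""
    return struct.pack(RECORD_FORMAT, item_id, quantity, int(bold))

def decode_record(data: bytes) -> tuple:
    """Unpack a record; this only works if you already know RECORD_FORMAT."""
    item_id, quantity, bold = struct.unpack(RECORD_FORMAT, data)
    return item_id, quantity, bool(bold)

binary_record = encode_record(1027, 250000, True)

# One possible text rendering of the same data, for a size comparison.
text_record = "id=1027;quantity=250000;bold=true"

print(len(binary_record), "bytes as binary")
print(len(text_record.encode("ascii")), "bytes as text")
print(decode_record(binary_record))
```

A program that does not know `RECORD_FORMAT` sees only seven opaque bytes, whereas the longer text form can be read by a person without any special tooling.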
Text Files
The main difference between text and binary files is that text files are both human- and machine-readable. Instead of a proprietary format that needs a specific application to decipher it, the data is such that each group of bits represents a character from a known set. This means that many different applications can read text files. On a standard Windows machine you have a choice of Notepad, WordPad, and others, including command-line-based utilities such as Edit. Non-Windows machines have a similarly wide range of programs available, such as Emacs and Vim.
NOTE The way that characters are represented by the underlying data stream is referred to as a file's encoding. The specific encoding used is often present as the first few bytes in the file; an application checks these bytes upon opening the file and then knows how to display and manipulate the data. There is also a default encoding if these first few bytes are not present. XML also has other ways of specifying how a file was encoded, and you'll see these later on.
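As a rough illustration of the check described in the note, the following Python sketch inspects the first few bytes of some data for a byte order mark and falls back to a default when none is found. The function names and the choice of UTF-8 as the fallback are assumptions for this sketch; real applications may use other heuristics as well.

```python
import codecs

# Byte order marks, checked longest-first because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_encoding(data: bytes, default: str = "utf-8") -> tuple:
    """Return (encoding name, number of mark bytes to skip)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No mark present: fall back to the assumed default encoding.
    return default, 0

def decode_text(data: bytes) -> str:
    """Decode the data using the detected (or default) encoding."""
    encoding, skip = detect_encoding(data)
    return data[skip:].decode(encoding)

sample = codecs.BOM_UTF16_LE + "XML".encode("utf-16-le")
print(detect_encoding(sample))
print(decode_text(sample))
```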
The ability to be read and understood by both humans and machines is not the only advantage of text files; they are also comparatively easy to parse. The main disadvantage, however, is their size. In order for text files to contain metadata (for example, a stretch of text to be marked as important), the relevant words are usually surrounded by characters denoting this extra information, which are somehow differentiated from the actual text itself. The most common examples of this can be found in HTML, where angle brackets are special symbols used to convey that anything within them refers to how the text should be treated rather than to the actual data. For example, if I want to mark a phrase as important I can wrap it like so:
<strong>returns must include the item order number</strong>
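To show how software can tell such markup apart from the text it surrounds, here is a minimal Python sketch using the standard library's `html.parser` module. The `MarkupSplitter` class name and its attributes are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser

class MarkupSplitter(HTMLParser):
    """Collect tag names (metadata) separately from character data."""

    def __init__(self):
        super().__init__()
        self.tags = []        # names of the opening tags encountered
        self.text_parts = []  # the actual text found between tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)

splitter = MarkupSplitter()
splitter.feed("<strong>returns must include the item order number</strong>")
splitter.close()

plain_text = "".join(splitter.text_parts)
print(splitter.tags)
print(plain_text)
```

The angle brackets are what let the parser decide which characters describe the text and which characters are the text.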
Another disadvantage of text files is their lack of support for metadata. If you open a Word document that contains text in an array of fonts with different styles and save it as a text file, you'll just get a plain rendition; all of the metadata has been lost. What people were looking for was some way to have the best of both worlds: a human-readable file that could also be read by a wide range of applications, and could carry metadata along with its content. This brings us to the subject of markup.
A Brief History of Markup
The advantages of text files mad...