
What is Unicode

Unicode is a character encoding standard that assigns a unique number, called a code point, to characters from virtually every modern and historic writing system. It allows computers to represent and manipulate text in any language, including those with non-Latin scripts. Unicode is widely used in software development, web development, and digital communication.

Written by Perlego with AI-assistance

10 Key excerpts on "What is Unicode"

  • Global Knowledge Dynamics and Social Technology
    The Unicode Consortium states that its standard ‘supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646.’ The Unicode Standard includes more than 120,000 characters that represent more than 120 modern and historic scripts, as well as multiple symbol sets. It provides a computing industry standard for uniform encoding, representation, and implementation of the world’s writing cultures and sign systems. Currently, the Unicode standard is used and supported by software companies, government bodies, and independent software developers. While most encodings can represent only a few languages (as described earlier in the ASCII example), Unicode represents most written languages: from Arabic to Zulu. Unicode’s success is based on the idea that one character must not code to one byte. Instead, in Unicode one character can code to one to six bytes. Such a variety is important to represent the many characters used in various languages around the world. In this way, Unicode also has the advantage of being backward-compatible with ASCII, the most widely used encoding standard until the mid-2000s. Unicode emerged as the dominant encoding standard on the WWW, overtaking ASCII after 2007. The Unicode Consortium states that it supports ‘the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. In addition, it supports classical and historical texts of many written languages’ (Unicode Consortium, 2009). Unicode covers most writing systems in use today, providing people from Armenia to Yúnnán with their scripts and characters (i.e., Armenian ligatures to Yi Radicals) implemented in digital products and services by the computing and software industries.
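    The claims above about variable-width encoding and ASCII compatibility are easy to check. The following short Python sketch (ours, not the book's) prints how many bytes UTF-8 uses for characters from several scripts; note that today's UTF-8 tops out at four bytes per character, although the original design allowed up to six.

    # Sketch: UTF-8 is variable-width and leaves 7-bit ASCII bytes unchanged.
    samples = ["A", "é", "ы", "雪", "🙂"]   # Latin, accented Latin, Cyrillic, CJK, emoji

    for ch in samples:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

    # ASCII text is byte-for-byte identical in UTF-8, which is what makes UTF-8
    # backward-compatible with ASCII.
    assert "Zulu".encode("utf-8") == "Zulu".encode("ascii")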
  • Tiny C Projects
    • Dan Gookin (Author)
    • 2023 (Publication Date)
    • Manning (Publisher)
    The tasks of swapping code pages and exploring extended ASCII character sets are no longer required to generate fancy text. With the advent of Unicode in the 1990s, all the text encoding inconsistencies since the early telegraph days are finally resolved.

    8.1.3 Diving into Unicode

    Back in the 1980s, those computer scientists who sat around thinking of wonderful new things to do hit the jackpot. They considered the possibilities of creating a consistent way to encode text—not just ASCII or Latin alphabet characters, but every scribble, symbol, and gewgaw known on this planet, both forward and backward in time. The result, unveiled in the 1990s, is Unicode.
    The original intention of Unicode was to widen character width from 8 bits to 16 bits. This change doesn’t double the number of characters—it increases possible character encodings from 256 to over 65,000. But even this huge quantity wasn’t enough.
    Today, the Unicode standard encompasses a code space of more than a million code points, including hieroglyphics and emojis, a sampling of which is shown in figure 8.3. New characters are added almost every year; in 2021, for example, 838 new characters were added.
    Figure 8.3 Various Unicode characters
    The current code space for Unicode (as of 2022) consists of 1,114,112 code points. Code space is the entire spectrum of Unicode. You can think of code points as characters. Not every code point has a character assigned, however: many chunks of the code space are empty. Some code points are designated as combining marks, such as overlays or macrons, that attach to other characters. Of the plethora, the first 128 code points align with the ASCII standard.
    Unicode characters are referenced in the format U+nnnn, where nnnn is the hexadecimal value for the code point. The code space is organized into code planes representing various languages or scripts. Most web pages that reference Unicode characters, such as unicode-table.com, use these code planes or blocks when you browse the collection of characters.
    To translate from a Unicode code point—say, U+2665—into a character in C, you must adhere to an encoding format. The most beloved of these encoding formats is the Unicode Transformation Format, UTF. Several flavors of UTF exist:
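    The excerpt breaks off before listing those UTF flavors. As a rough illustration (a Python sketch of ours rather than the book's C code), the same code point U+2665 comes out as different byte sequences under UTF-8, UTF-16, and UTF-32:

    # Sketch: one code point, U+2665 (BLACK HEART SUIT), in the common UTF encodings.
    heart = "\u2665"

    print("UTF-8: ", heart.encode("utf-8").hex(" "))      # 3 bytes
    print("UTF-16:", heart.encode("utf-16-be").hex(" "))  # 2 bytes (big-endian, no BOM)
    print("UTF-32:", heart.encode("utf-32-be").hex(" "))  # 4 bytes (big-endian, no BOM)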
  • Natural Language Processing for Historical Texts
    • Michael Piotrowski (Author)
    • 2022 (Publication Date)
    • Springer (Publisher)
    Unicode has become the predominant character encoding standard and has made significant progress towards replacing legacy encodings (e.g., ASCII, the ISO 8859 series, KOI-8, or Shift-JIS) for the encoding of new text. Most legacy encodings are now defined in terms of Unicode, i.e., as proper subsets of Unicode. All modern operating systems and programming languages support Unicode. Unicode is the first character encoding that draws a clear distinction between characters and glyphs; apart from so-called compatibility characters (encoded for interoperability with legacy encodings), Unicode aims to encode only characters, not glyphs. By making this distinction, Unicode has a clear definition of the entities it encodes. Unicode also considers diacritics as characters, which simply have the property of combining graphically in some way with a base character. An accented character such as ǵ is thus seen as a combination of the base character g and a combining character, here an acute accent. In Unicode, combining characters always follow the base character to which they apply. In principle, any number of combining characters can follow a base character, and all combining characters can be used with any script (see The Unicode Standard, p. 41), but obviously not all combinations can necessarily be displayed in a reasonable way. Due to compatibility considerations, numerous combinations of base characters with diacritics are also available as precomposed characters; for example, Ä is available as a precomposed character for compatibility with ISO 8859-1. Since these characters can thus be encoded in different ways, Unicode also defines equivalence relations (such as Ä ≡ A + combining diaeresis) and normalization forms, i.e., rules for maximally composing or decomposing characters. The policy of the Unicode Consortium is not to encode any new characters that could be encoded using combining characters.
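    The equivalence and normalization rules described here can be observed directly with Python's standard unicodedata module; this is a minimal sketch of ours, not taken from the book.

    # Sketch: precomposed vs. combining-character spellings and normalization forms.
    import unicodedata

    precomposed = "\u00C4"      # Ä as a single precomposed character
    decomposed = "A\u0308"      # A followed by COMBINING DIAERESIS

    print(precomposed == decomposed)                                 # False: different code point sequences
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True after composing
    print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True after decomposing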
  • A Companion to Digital Literary Studies
    • Ray Siemens, Susan Schreibman (Authors)
    • 2013 (Publication Date)
    • Wiley-Blackwell (Publisher)
    ISO 2022 allowed the combined usage of different national character standards in use in East Asia. However, this was rarely fully implemented and, more importantly, did not address the problem of combining European and Asian languages in one document.

    Unicode

    Software vendors and the ISO independently worked toward a solution to this problem that would allow the emerging global stream of information to flow without impediments. For many years, work continued in two independent groups. One of these was the Unicode Consortium, which was founded by some major software companies interested in capitalizing on the global market; the other was the character-encoding working groups within the ISO, working toward ISO 10646. The latter were developing a 32-bit character code that would have a potential code space to accommodate 4.3 billion characters, intended as an extension of the existing national and regional character codes. This would be similar to having a union catalog for libraries that simply allocates some specific areas to hold the cards of the participating libraries, without actually combining them into one catalog. Patrons would then have to cycle through these sections and look at every catalog separately, instead of having one consolidated catalog in which to look things up.
    Unicode, on the other hand, was planning one universal encoding that would be a truly unified repertoire of characters in the sense that union catalogs are usually understood: Every character would occur just once, no matter how many scripts and languages made use of it.
    Fortunately, after the publication of the first version of Unicode in the early 1990s, an agreement was reached between these two camps to synchronize development. While there are, to this day, still two different organizations maintaining a universal international character set, they did agree to assign new characters in the same way to the same code points with the same name, so for most practical purposes the two can be regarded as equivalent. Since ISO standards are sold by the ISO and not freely available online, whereas all information related to the Unicode standard is available from the website of the Unicode Consortium (www.unicode.org), I will limit the discussion below to Unicode, but it should be understood that it also applies to ISO 10646.
  • Formalizing Natural Languages
    • Max Silberztein (Author)
    • 2016 (Publication Date)
    • Wiley-ISTE (Publisher)
    Figure 2.8 mentions more than 100 possible encodings for text files: there are, for example, four competing extended ASCII encodings to represent Turkish characters: DOS-IBM857, ISO-8859-9, Mac, and Windows-1254. The two ASCII encodings most frequently used for English are:
    1. ISO 8859-1 coding (also called “ISO-LATIN-1”), widely used on the Internet and chosen by Linux;
    2. Windows-1252 coding, used by the Microsoft Windows operating system for Western European languages.
    Having different extended ASCII encodings does not help communication between systems, and it is common to see texts in which certain characters are displayed incorrectly.
    Figure 2.7. Character encoding is still problematic as of late 2015
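    The kind of mismatch described in this excerpt can be reproduced in a few lines. In this Python sketch (ours; the codec names are Python's and are assumed to correspond to the encodings listed above), the same two bytes decode to different characters depending on which extended ASCII table a program assumes.

    # Sketch: identical bytes, different characters under competing Turkish encodings.
    data = bytes([0xFD, 0xDD])   # two values in the "extended" 128-255 range

    for codec in ("cp857", "iso8859_9", "mac_turkish", "cp1254", "latin_1"):
        print(f"{codec:12} -> {data.decode(codec, errors='replace')}")

    Under ISO 8859-9 these bytes are the Turkish letters ı and İ, while Latin-1 renders them as ý and Ý, which is exactly the sort of incorrect display the excerpt mentions.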

    2.4.4. Unicode

    The Unicode Consortium was created in 1991 to design a single encoding system capable of representing texts in all writing systems. In its version 5.1, Unicode contains more than 100,000 characters, organized in code charts, and is compatible with most computer systems, including Microsoft Windows, Mac OS X, Unix and Linux. All the code charts are displayed at www.unicode.org/charts. Selecting a particular script system on this page (for example “Basic Latin”) will bring up tables listing all of the characters in this script, along with their corresponding codes (written in hexadecimal notation), an example of which was shown in Figure 2.3.
    2.4.4.1. Implementations
    Unicode encoding only sets correspondences between characters and natural numbers. There are several ways of representing natural numbers in the form of sequences of bits in a computer, therefore there are several Unicode implementations. The most natural implementation is called UTF32, in which each code is represented by a binary number written with 32 bits (4 bytes), following the exact method discussed in the previous chapter: just write the code in binary, and add 0s to the left of the bit sequence so that the sequence is 32 bits long. With 32 bits it is possible to represent more than 4 billion codes, easily covering all the world’s languages!
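    The UTF32 scheme described here (write the code point in binary and pad with leading zeros to 32 bits) can be seen in a short Python sketch of ours:

    # Sketch: UTF-32 stores every code point as a fixed 4-byte (32-bit) number.
    for ch in ("A", "é", "雪", "🙂"):
        cp = ord(ch)
        print(f"U+{cp:04X} -> {cp:032b} -> {ch.encode('utf-32-be').hex(' ')}")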
  • Handbook of Technical Communication
    • Alexander Mehler, Laurent Romary (Authors)
    • 2012 (Publication Date)
    The Unicode standard is published by the Unicode Consortium (Unicode Consortium 2011). Both share the same character repertoire and encoding form. In addition, the Unicode Standard adds information (see below) related e.g. to normalization, handling of bidirectional text and various other implementation information. Unicode encodes widely used scripts and unifies regional and historic differences. For some purposes this unification is too broad, or some minor scripts are not (yet) part of Unicode. As one solution to the first problem, Unicode provides “variation selectors” which follow a character to identify a (glyph-related) restriction on the character. For the second problem, there is no real solution other than to get your character into the Unicode character repertoire. Since Unicode is produced by an industrial consortium, minority scripts and historical scripts have a small lobby. The Script Encoding Initiative (Anderson 2003) is participating in the Unicode consortium to give rarely used scripts a voice. Another solution is proposed by the Text Encoding Initiative (see also Rahtz in this volume): the use of markup to express glyph variants.
    Figure 2. Variations of the ideographic character for “snow”. Figure 3. Font variants of the character for “snow”.

    4.3. Character encoding

    A character encoding (see Section 4.1 of Dürst et al. 2005) encompasses the information necessary for representing and processing textual data. For example, consider Table 1, which shows how the “snow” character is encoded in Unicode. (Lunde 2009) provides further, detailed information about character encoding of Chinese, Japanese and Korean.
    Table 1. Encoding of the “snow” character in Unicode.
    In a character encoding, a character receives a unique numeric identifier, a code point. In Unicode, the “snow” character has the hexadecimal number 96EA (represented in Unicode with the prefix “U+”).
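    As a small companion to Table 1 (a Python sketch of ours, not part of the handbook), the “snow” character can be looked up and encoded from its code point U+96EA:

    # Sketch: from code point U+96EA to character, name, and UTF-8 bytes.
    import unicodedata

    snow = "\u96EA"                        # 雪
    print(snow)
    print(f"U+{ord(snow):04X}")            # U+96EA
    print(unicodedata.name(snow))          # CJK UNIFIED IDEOGRAPH-96EA
    print(snow.encode("utf-8").hex(" "))   # its UTF-8 byte sequence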
  • How to Build a Digital Library
    • Ian H. Witten, David Bainbridge (Authors)
    • 2002 (Publication Date)
    • Morgan Kaufmann (Publisher)
    Complications include composite Unicode characters and the direction in which the character sequence is displayed. By working through a modern Web browser, digital libraries can avoid having to deal with these issues explicitly.

    4.2 Representing documents

    Unicode provides an all-encompassing form for representing characters, including manipulation, searching, storage, and transmission. Now we turn our attention to document representation. The lowest common denominator for documents on computers has traditionally been plain, simple, raw ASCII text. Although there is no formal standard for this, certain conventions have grown up.
    Figure 4.6 Page produced by a digital library in Devanagari script.

    Plain text

    A text document comprises a sequence of character values interpreted in ordinary reading order: left to right, top to bottom. There is no header to denote the character set used. While 7-bit ASCII is the baseline, the 8-bit ISO ASCII extensions are often used, particularly for non-English text. This works well when text is processed by just one application program on a single computer, but when transferring between different applications—perhaps through e-mail, news, http, or FTP—the various programs involved may make different assumptions. These alphabet mismatches often mean that character values in the range 128–255 are displayed incorrectly. Formatting within such a document is rudimentary. Explicit line breaks are usually included. Paragraphs are separated by two consecutive line breaks, or the first line is indented. Tabs are frequently used for indentation and alignment. A fixed-width font is assumed; tab stops usually occur at every eighth character position. Common typing conventions are adopted to represent characters such as dashes (two hyphens in a row). Headings are underlined manually using rows of hyphens, or equal signs for double underlining.
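    The “alphabet mismatch” problem for byte values 128–255 described in this excerpt is easy to demonstrate. In this Python sketch (ours, with arbitrarily chosen example encodings), the 7-bit ASCII letters survive every reinterpretation while the accented characters do not.

    # Sketch: bytes 0-127 are stable across 8-bit "extended ASCII" sets; bytes 128-255 are not.
    raw = "naïve café".encode("latin_1")

    for assumed in ("latin_1", "cp437", "mac_roman", "cp1252"):
        print(f"{assumed:10} -> {raw.decode(assumed)}")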
  • The Architecture of Computer Hardware, Systems Software, and Networking
    • Irv Englander, Wilson Wong (Authors)
    • 2021 (Publication Date)
    • Wiley (Publisher)
    Some application programs, particularly word processors and some text markup languages, add their own special character sequences for formatting the text. In Unicode, each standard UTF-16 alphanumeric character can be stored in 2 bytes; thus, half the number of bytes in a pure text file (one without images or Asian ideographs) is a good approximation of the number of characters in the text. Similarly, the number of available bytes also defines the capacity of a device to store textual and numerical data. Only a small percentage of the storage space is needed to keep track of information about the various files; almost all the space is thus available for the text itself.
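    The “half the number of bytes” estimate can be checked with a quick Python sketch (ours, not from the book); it holds for text in the Basic Multilingual Plane and drifts once 4-byte UTF-16 characters such as emoji appear.

    # Sketch: estimating character count from UTF-16 byte count.
    for s in ("Hello, world", "Hello 🙂"):
        encoded = s.encode("utf-16-le")    # little-endian, no byte-order mark
        print(f"{len(s)} characters -> {len(encoded)} bytes "
              f"(estimate: {len(encoded) // 2} characters)")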
  • The Routledge Handbook of Chinese Applied Linguistics
    • Chu-Ren Huang, Zhuo Jing-Schmidt, Barbara Meisterernst (Authors)
    • 2019 (Publication Date)
    • Routledge (Publisher)
    Part III

    Language, computers and new media


    30 Computers and Chinese writing systems

    Qin Lu

    Introduction to computer encoding of writing systems

    A writing system is “a script used to represent a particular language” (Sproat 2000: 25). The writing system of a language is the tool one uses to both put spoken words to text and to overcome the communication barriers of time and space. As a script, the writing system of each language consists of a set of symbols which are considered non-divisible when being used. In English, the set of alphabet letters from A to Z is such a set. A script of a language includes concepts beyond typical words that require encoding as well, such as punctuations (punctuation marks) and numerical representations (numeric characters). These concepts should be part of the symbol set used to represent the language. The set of all symbols used in a script is called a character set.
    Computer processing of a language often starts with the processing of its writing systems, which in turn requires defining the symbol set used in the system for that language first. Here, the term process generally refers to the recognition (taking a computer code as a symbol by definition) for storage (input), display (output) and handling of the text in computer systems. The definition process, also referred to as the encoding process, involves 1) the proper selection of a character set, followed by 2) the assignment of a unique binary code value to each symbol, referred to as a code point, with consideration of the script size, nature and efficiency for computer processing, among other things. The assignment results in a coded character set, or codeset for short. A codeset is a code table mapping all the characters to their respective unique code points.
    The American Standard Code for Information Interchange (ASCII) (American National Standards Institute 1986) can be used as an example to see how symbols from a writing system are defined as a codeset. The ASCII Table is the first commonly used codeset defined for computer use. It includes both the symbols used for writing English text and other symbols necessary for preparing text in computer systems. ASCII encodes characters using the so-called fixed length encoding method, where the code point for each character is of the same binary length. This means that when you are dealing with a binary code sequence, you can read one character at a time using a fixed number of bits. For convenience, we use decimal numbers to refer to the assigned code points. The corresponding hexadecimal (HEX for short) numbers are short forms for the binary code points used in computer systems.
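    As a hedged illustration of a fixed-length codeset (a Python sketch of ours, not from the handbook), each ASCII character occupies exactly one code point, shown below in decimal, hexadecimal, and binary, and decoding simply reads one byte per character.

    # Sketch: ASCII code points in decimal, hex, and 7-bit binary.
    for ch in ("A", "Z", "a", "0", " "):
        cp = ord(ch)
        print(f"{ch!r}: decimal {cp:3}, hex 0x{cp:02X}, binary {cp:07b}")

    # Fixed-length decoding: one byte per character.
    print(bytes([72, 101, 108, 108, 111]).decode("ascii"))   # Hello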
  • Spoken Language Reference Materials
    • Dafydd Gibbon, Roger Moore, Richard Winski (Authors)
    • 2020 (Publication Date)
    Character codes and computer readable alphabets

    A.1 Introduction

    This appendix discusses the relationship between character sets (or alphabets) and their encoding on computers. It is based on the terminology as used in the Unicode standard (Unicode Standard vol. 1.0). Three levels of representation can be discerned:
    • character
    • glyph
    • code
    A character is the basic unit of an alphabet. Within the alphabet it has a name, a position, and a content meaning. For example, the character named a is the first letter of the standard Latin alphabet. Its content meaning (in the European languages) is loosely related to its pronunciation, i.e. a vocalic sound with the following IPA description: front, low, unrounded. Characters have no visible graphic representation; this representation is produced through rendering the character on a suitable medium, e.g. paper, computer screens, etc. A glyph is the essential shape of a character; it is the result of the rendering process. Glyphs can be modified through the application of case, font, style, and size operations. For example, the essential shape of the first letter of the Latin alphabet in lower case is the a. In different fonts, this glyph may be rendered in a monospaced form, slanted (as in an italic face), or emboldened, etc. A code is a mapping of characters to a set of symbols or signs, e.g. numbers, other characters, etc. This mapping is in general an arbitrary one, and it must be known in order to encode a character or decode a code. For example, in 7-bit ASCII, the character a is encoded as the 7-bit integer number 97. A script consists of an alphabet and a set of rules that determines the direction of writing (left to right, right to left, up to down, etc.), and the composition of characters (placement of accents, combination of glyphs, etc.).
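    The character/glyph/code distinction can be made concrete with a short Python sketch (ours, not from the appendix): the abstract character a corresponds to code 97 in 7-bit ASCII and to code point U+0061 in Unicode, while its glyph is whatever shape a particular font renders.

    # Sketch: the character 'a' and its code, independent of any glyph.
    import unicodedata

    print(ord("a"))                 # 97, its 7-bit ASCII / Unicode code
    print(chr(97))                  # 'a', back from code to character
    print(f"U+{ord('a'):04X}")      # U+0061
    print(unicodedata.name("a"))    # LATIN SMALL LETTER A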
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.