UTF-8

UTF-8 is a way of encoding the characters you see on a screen; it is one of the ways of implementing the Unicode standard, which can represent essentially all the world's languages. It is becoming the standard for web pages and email, as recommended by the Internet Mail Consortium.

With the exponential increase of Internet communication over the last two decades, more and more people are using their own native language rather than having to learn English. With the change from built-in hardware character generators to characters displayed in graphics, programs such as web browsers that implement the UTF-8 system are able to 'render' or display any known character or symbol.

If you want your website to display special characters or symbols, UTF-8 makes it possible. This is especially true if you anticipate your website will ever be displayed in a language other than standard English.

UTF-8 and TNG

TNG version: 11.0.0
TNG v11 and later defaults to using charset UTF-8 on a full install.


TNG version: 10.1.3
TNG version: 8.0

TNG V8 defaults to the older ISO-8859-1 character encoding scheme, but makes provision for those who wish to switch to UTF-8; a number of TNG users have written wiki pages explaining the process. A search for UTF-8 on the TNG Wiki site will reveal many pages to guide you, such as TNG Charset and Changing to UTF-8; see also 'Related Links' at the bottom of this page.

A Brief History of Character Codes

Morse Code

Developed by Alfred Vail, it standardized the transmission of numbers, letters, and special codes on Samuel Morse's electrical telegraph system. Friedrich Gerke streamlined the code in 1848; this formed the basis of the standard International Morse Code agreed upon at the Paris Telegraph Congress of 1865.

While Morse code was one of the earliest binary codes – its shorter 'dots' and longer 'dashes' correspond to 0s and 1s – it did not use these binary digits in a simple machine-coded fashion, since different characters used different numbers of binary digits, or bits.

Baudot Code

Developed by Émile Baudot in the early 1870s to control a telegraph-based printing machine, the code used 5 bits to encode the Roman alphabet, punctuation, and control signals. It became the basis of the ITA1 (International Telegraph Alphabet No. 1) code.

A 5-bit code allows a maximum of 2⁵ = 2 × 2 × 2 × 2 × 2 = 32 different characters to be encoded, so teletypes used a clever mechanical trick to coax almost 64 characters from the code: a carriage that could be automatically shifted up and down. The 'lower' carriage contained the capital letters, and the 'upper' carriage contained numerals and punctuation.

Baudot's code was modified in 1901 by Donald Murray, becoming the ITA2 Code used on Teletypewriters everywhere. Some of the teletype control code names are still with us today, for example NULL, CR (Carriage Return), and LF (Line Feed).

Hollerith Code

Herman Hollerith invented the tabulating machines used in the 1890 US Census. These machines used data encoded on punched cards with 'no hole' (0) or 'hole' (1) – a binary code. His cards had space for 12 'holes' across each row, but he only looked for 69 possible 'elements' related to the census – it wasn't suitable for universal use.

IBM BCDIC Code

Until 1963 IBM – a growing force in the digital world – used a proprietary Binary Coded Decimal Interchange Code, or BCDIC, a way of representing decimal values in a binary machine.

ASCII Code

By 1963 the Baudot telegraphic code was no longer adequate for the burgeoning electronics industry. The American Standards Association extended the Baudot ITA2 5-bit shifted code into a 7-bit code, naming it the "American Standard Code for Information Interchange" - ASCII. With 2⁷ = 128 possible characters (and 256 if an eighth bit is added), it became the standard teleprinter code.

With the introduction in the 1960s of minicomputers based on integrated circuits, and in the 1970s of microcomputers based on microprocessors, ASCII became the code of choice. Even today almost all computers are able to understand the ASCII character set.

IBM EBCDIC Code

When IBM brought out its revolutionary IBM 360 mainframe computer in 1964, it introduced an enhanced version of its proprietary BCDIC code (rather than the ASCII code which it had helped formulate): Extended Binary Coded Decimal Interchange Code, or EBCDIC. It used the now-standard 8 bits, which allow 2⁸ = 256 different characters. IBM continued using this code in its mainframes of the 1970s, such as the IBM 370, and created 57 varieties of EBCDIC, a different set for each language.

ISO-8859-1

By 1987 the International Organization for Standardization (ISO) had refined the ASCII character encoding scheme with a view to increasing the reliability of information interchange. Its 8859 standard used 8 bits of information rather than 7, permitting 256 characters. Like EBCDIC, the ISO 8859 standard has multiple versions to cover languages using characters other than those of standard English. Part 1, or ISO-8859-1, covers the Latin alphabet as used by Western European languages; Part 2 handles Central European languages such as Polish; in all, 15 parts have been defined for languages with small character sets.

As home computing and Internet use grew in the 1990s, ISO-8859-1 became the default character code for documents created by programs using MS-DOS or MS Windows; HTML and HTTP defaulted to it as well. Today the 8859 standards are a 'dead letter' - ISO no longer maintains them.
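
As a rough illustration (a small Python sketch, used here purely for demonstration and not part of TNG or of the standards themselves), the snippet below shows how an 8-bit code page such as ISO-8859-1 squeezes an accented letter into a single byte but simply cannot represent a character outside its 256-entry table, whereas UTF-8 can:

 # Contrast ISO-8859-1 with UTF-8 for the same text.
 text = "café"
 print(text.encode("iso-8859-1"))   # b'caf\xe9'     - 'é' is the single byte 0xE9
 print(text.encode("utf-8"))        # b'caf\xc3\xa9' - 'é' needs two bytes in UTF-8
 # A character outside the Latin-1 repertoire cannot be encoded at all:
 try:
     "中".encode("iso-8859-1")
 except UnicodeEncodeError as err:
     print(err)                     # the 256-character ceiling of an 8-bit code page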

Unicode

ASCII was fundamentally an encoding scheme for English letters, punctuation, numerals, and a few control codes. It soon needed modification to allow for other characters, such as Russian Cyrillic characters, and diacritical marks, such as French accents.

By the 1980s it was clear that neither the ASCII limit of 2⁷ = 128 characters nor the ISO limit of 2⁸ = 256 characters would be able to handle the languages coming online, such as Japanese and Chinese. In 1988 Joe Becker of Xerox proposed a 16-bit encoding scheme named Unicode, which should be enough (65,536 combinations) for all the world's modern languages.

Since that time the Unicode standard has been refined to allow for even more characters, such as Egyptian Hieroglyphics; it allows for over a million combinations.

Unicode is simply a worldwide agreement on which integer numbers (called code points) are assigned to characters, numerals, punctuation, diacritical marks, control codes, and special characters. Portions of Unicode are compatible with various versions of ASCII – in fact, the first 128 Unicode code points are the original 128 ASCII codes!
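
For example (another illustrative Python sketch, assuming a Python 3 interpreter), the built-in ord() and chr() functions expose these agreed-upon numbers directly:

 # Code points are just integers; ord() and chr() convert between the two.
 for ch in ["A", "z", "é", "中"]:
     print(ch, hex(ord(ch)))        # A 0x41, z 0x7a, é 0xe9, 中 0x4e2d
 # The first 128 code points match the original ASCII table:
 print(ord("A") == 0x41)            # True - 'A' is 65 in ASCII and in Unicode
 print(chr(65), chr(0x4E2D))        # A 中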

UTF-8

There have been various schemes for implementing the standardized Unicode code points in the bewilderingly complex digital world. MS Windows internally uses a version known as UTF-16, which grew out of the earlier fixed-width 16-bit UCS-2 encoding when 16 bits proved not to be enough. UTF-8, first presented in 1993, was designed to make it easy to implement the Unicode standard in diverse digital environments such as home computers using Windows, web servers running UNIX, and the staggering number of electronic devices that communicate today.

UTF-8 uses from one to four 8-bit bytes per character, allowing it to be efficient when working with web pages and English, the lingua franca of the electronic world, yet able to manage the most complex character-based languages such as Chinese. In principle its four-byte sequences leave room for over 2 million values, more than enough for the million-plus code points that Unicode allows.
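
A short Python sketch (again purely illustrative) makes the variable-length scheme visible: a plain English letter takes one byte, while an accented letter, a Chinese character, and an emoji take two, three, and four bytes respectively:

 # Each character occupies between one and four bytes in UTF-8.
 for ch in ["A", "é", "中", "😀"]:
     encoded = ch.encode("utf-8")
     print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded}")
 # U+0041 'A':  1 byte(s) -> b'A'
 # U+00E9 'é':  2 byte(s) -> b'\xc3\xa9'
 # U+4E2D '中': 3 byte(s) -> b'\xe4\xb8\xad'
 # U+1F600 '😀': 4 byte(s) -> b'\xf0\x9f\x98\x80'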

In 2008, UTF-8 became the most used character encoding system on the Internet, and is increasingly becoming the standard.

Related Links