ICT Skills 1

Do you know what coded character sets (ASCII, ANSI, Unicode and UTF-8) are? (8/31)

A coded character set links each character to a numerical value, called a code point. The number of available code points depends on the number of bits used: n bits give 2^n code points.

An 8-bit coded character set can encode 256 characters. This is typically sufficient to encode the characters used in alphabetic writing systems, such as those of Arabic, English and Greek. A 16-bit coded character set can encode 65,536 characters. This might provide a workable minimum for an ideographic writing system, such as that used in Chinese.
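
The arithmetic above can be checked with a short Python sketch. The helper name code_point_count is made up for this illustration; ord() and chr() are Python built-ins that convert between a character and its Unicode code point.

```python
# Number of code points available for a given number of bits: 2 ** bits.
def code_point_count(bits: int) -> int:
    return 2 ** bits

for bits in (7, 8, 16):
    print(f"{bits} bits -> {code_point_count(bits):,} code points")

# ord() gives the code point of a character, chr() goes the other way.
latin_a = ord("A")        # 65
greek_alpha = ord("α")    # 945
print(latin_a, greek_alpha, chr(65))
```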

Rather than working with individual bits, or sequences of bits of indeterminate length, computers tend to work with defined groups of bits. These are known as bytes. Generally, bytes have 8 bits. Once the relationship between characters and code points has been established, a computer needs to be able to relate each code point to a byte, or sequence of bytes. The rules governing this relationship make up a character encoding scheme.
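
As a rough illustration of the difference between a code point and its bytes, the following Python sketch takes one character and encodes it with two different schemes. The choice of character and the variable names are assumptions made for this example.

```python
char = "\u0159"                      # ř, Latin small letter r with caron
code_point = ord(char)               # the number assigned to the character
print(f"U+{code_point:04X}")         # U+0159

# The same code point becomes different byte sequences
# under different character encoding schemes.
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")
print(utf8_bytes.hex())              # c599
print(utf16_bytes.hex())             # 0159
```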

ASCII (binary/dual code)

The American Standard Code for Information Interchange (ASCII) encodes text as binary numbers. It was originally a 7-bit coded character set for information interchange, with 128 code points. It was first published in 1963 by the American Standards Association, the body that later became the American National Standards Institute (ANSI), and finalised in 1968. Many computers of the time handled data in eight-bit groups (bytes), the smallest unit they could address, so each 7-bit ASCII character was normally stored in one byte.

Later on, the ASCII code had to be extended because the number of written symbols used in common natural languages exceeds its range. Extended ASCII, or high ASCII, refers to eight-bit or larger character encodings that keep the standard seven-bit ASCII characters in positions 0 to 127 and add others; an eight-bit extension adds 128 codes, in positions 128 to 255. In this way, many languages that are not easily representable in ASCII could be covered.

However, the extended ASCII code is still not enough to cover all languages, so even the eight-bit extensions had to have local variants. The term ASCII is also used loosely to mean plain text: an "ASCII file" is a plain, unformatted text file. ASCII was developed as a telecommunication standard and is commonly used for transmitting data.
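
The limited range of ASCII can be seen in a minimal Python sketch. The helper to_ascii_bytes is a hypothetical name used only here; it relies on the standard "ascii" codec, which rejects any character outside the 128 ASCII code points.

```python
from typing import Optional

def to_ascii_bytes(text: str) -> Optional[bytes]:
    """Return the ASCII bytes for text, or None if a character is outside ASCII."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None

plain = to_ascii_bytes("Dvorak")      # works: only basic Latin letters
accented = to_ascii_bytes("Dvořák")   # None: ř and á are not in ASCII
print(plain, accented)

# Every ASCII byte value is below 128, so it fits in 7 bits.
print(max(plain) < 128)
```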

ANSI

This is a set of 256 characters based on ASCII: the first 128 positions are identical to ASCII and the upper 128 hold special and country-specific characters. It is particularly important for software that runs under Microsoft Windows, including the early versions that ran on top of MS-DOS. ANSI is also the Microsoft collective name for all Windows code pages.
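
The effect of local code pages can be sketched in Python by decoding one and the same byte with two Windows code pages. The byte value 0xE8 and the two code pages are picked purely as an example; cp1252 and cp1250 are the names of Python's built-in codecs for the Western European and Central European code pages.

```python
raw = bytes([0xE8])                        # a single byte from the upper 128 positions

western = raw.decode("cp1252")             # Windows Western European code page
central_european = raw.decode("cp1250")    # Windows Central European code page

# The byte is the same, but the character depends on the code page.
print(western, central_european)           # è č
```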

Unicode

Because encodings in different countries were incompatible, converting data between standards was complex, and working with more than one language in the same text caused problems, an attempt was made in the 1980s to create a unified character set. The result was Unicode (aligned with the international standard ISO/IEC 10646), which aims to assign a code point to each graphic character or element of all known alphabets and writing systems. It can be defined as the universal character encoding standard that provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols.

UTF-8

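UTF-8 (Unicode Transformation Format, 8-bit encoding form) encodes each code point as a sequence of one to four bytes. The ASCII characters keep their single-byte ASCII values, which makes Unicode compatible with environments that were designed entirely around ASCII, such as Unix, Linux and similar systems.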
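
A short Python sketch can make the ASCII compatibility of UTF-8 visible. The sample characters and the dictionary labels are chosen only for illustration.

```python
samples = {
    "Latin A": "A",
    "r with caron": "\u0159",       # ř, as in Dvořák
    "Devanagari bha": "\u092d",     # भ, as in भारत
    "emoji": "\U0001F600",          # a code point beyond the 16-bit range
}

# Number of bytes each character needs in UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)

# Pure ASCII text has exactly the same bytes in ASCII and in UTF-8.
same_bytes = "Gonpo Dorji".encode("ascii") == "Gonpo Dorji".encode("utf-8")
print(same_bytes)
```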

UTF-16

UTF-16 (Unicode Transformation Format, 16-bit encoding form) uses two 8-bit bytes for each code point on the Basic Multilingual Plane (BMP), regardless of position, and four bytes for code points outside the BMP. This makes it more compact than UTF-8 for most Chinese, Japanese and Korean (CJK) characters, which need three bytes in UTF-8, but less compact for characters, such as the basic Latin letters, that fit into a single byte in UTF-8. In Microsoft applications UTF-16 is known as Unicode while UTF-8 is known as Unicode (UTF-8).
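
The size comparison can be tried out with the following Python sketch. The function name encoded_sizes and the sample strings are assumptions for this example; the "utf-16-le" codec is used so that no byte order mark is added to the count.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

cjk_sizes = encoded_sizes("中文")            # two CJK characters on the BMP
latin_sizes = encoded_sizes("Latin")         # five basic Latin letters
emoji_sizes = encoded_sizes("\U0001F600")    # one code point outside the BMP

print("CJK  :", cjk_sizes)      # (6, 4): UTF-16 is smaller
print("Latin:", latin_sizes)    # (5, 10): UTF-8 is smaller
print("Emoji:", emoji_sizes)    # (4, 4): four bytes in both
```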

The following table shows some examples of Unicode data, giving each place and name in plain Latin letters and in its native script:

Origin | Name | Origin (native script) | Name (native script)
Bhutan | Gonpo Dorji (film actor) | འབྲུག་ཡུལ། | མགོན་པོ་རྡོ་རྗེ།
Czech Republic | Antonin Dvorak (composer) | Česko (Česká republika) | Antonín Dvořák
India (Hindi) | Madhuri Dixit (movie star) | भारत | माधुरी दीक्षित

Why is this information important for translators and translation teachers?
Different standard character sets are used to encode text, depending on the writing system of the language. When a translation client sends a file encoded according to a particular national character set and the translator opens it on a computer that assumes a different character set, characters can be displayed wrongly or lost. Translators and translation teachers should therefore know what coded character sets are and how to deal with them. Coded character sets matter not only for displaying data on the computer, but also for exchanging and sorting data (for example, simple words in most Asian languages) and for using translation memories.
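
What goes wrong in such a case can be reproduced in a few lines of Python. This is only a sketch of one possible scenario, assuming the client saved the file as UTF-8 and the translator's program read it as Windows code page 1252; the variable names are made up for the example.

```python
original = "Dvo\u0159\u00e1k"              # Dvořák, as typed by the client

sent_bytes = original.encode("utf-8")      # the client saves the file as UTF-8
garbled = sent_bytes.decode("cp1252")      # the translator's program assumes cp1252
print(garbled)                             # DvoÅ™Ã¡k

# If the wrong guess is known, the step can be reversed to recover the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

Reversing the step works here because every byte involved happens to be defined in code page 1252. In general it is safer to ask the client which encoding was used than to guess.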
