Encoding Character Data

Character Codes for Modern Computers

This lecture covers the standard ways in which characters are stored in
modern computers. There are five main classes of characters.

1. Alphabetic characters: upper case and lower case.

2. Decimal digits.

3. Punctuation.

4. Control characters, which are not usually printed.

5. All other characters.

There are three standard methods for representing characters.

1. EBCDIC: Extended Binary Coded Decimal Interchange Code

2. ASCII: American Standard Code for Information Interchange

3. Unicode: A modern extension of ASCII.

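EBCDIC encodes a character in eight bits, represented as two hexadecimal digits.
ASCII defines seven–bit codes, which are normally stored in an eight–bit byte.
Unicode needs more than eight bits for most of its characters.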

EBCDIC: Origins and Rationale

The EBCDIC (usually pronounced “EB–see–dick”) coding system was developed by
IBM as an extension of its BCD (Binary Coded Decimal) system.

EBCDIC uses 8 bits to encode each character, for 256 distinct characters.

The BCD system used 6 bits to encode a character, which allows only 64 distinct characters.

Some of the characters represented in BCD were:

1. The 26 upper case alphabetic characters “A” – “Z”.

2. The ten digits “0” – “9”.

3. The space character “ ”.

4. The symbols used in arithmetic “+”, “–”, “*”, “/”, “=”, “&”

5. Punctuation marks “,”, “.”, “(”, “)”, “:”

Note that there are no lower case letters. I have listed 48 of the BCD
characters. There is room for only 16 more.
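
The counts above follow from simple arithmetic, which can be checked with a short sketch. The names code_count and bcd_listed are illustrative only and do not come from the lecture.

```python
def code_count(bits: int) -> int:
    """Number of distinct codes available with the given number of bits."""
    return 2 ** bits

# Characters listed above: letters, digits, space, arithmetic symbols, punctuation.
bcd_listed = 26 + 10 + 1 + 6 + 5

print(code_count(6))               # codes available in 6-bit BCD
print(code_count(8))               # codes available in 8-bit EBCDIC
print(code_count(6) - bcd_listed)  # BCD codes left for other characters
```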

EBCDIC: Origins and Rationale (Part 2)

The International Business Machines Corporation, called “IBM” by everybody,
developed the EBCDIC standard at the same time that the ASCII standard
was being developed.

The EBCDIC standard was developed for use in the IBM System/360, a
revolutionary computing system introduced in 1964.

IBM supported the ASCII standard strongly. This leads to a simple question:
“Why did IBM not use ASCII?”

Here is a little–known fact. While the computers in the IBM System/360 line
were designed to use the EBCDIC standard, each one had an “ASCII switch” that
would cause it to use ASCII.

Few system administrators knew of this “ASCII switch” and fewer still used it.
When the System/360 evolved to the System/370, the switch was dropped.

IBM used EBCDIC because it was compatible with the existing card codes.

Punched Cards

When the IBM 360 was first designed, most data input was from 80–column
punched cards. IBM experimented with other formats, but they never caught on.

Here is a picture of a typical 80–column punched card.
It has 12 rows: rows 12 and 11 are at the top, followed by ten rows labeled 0 – 9.

[Figure: an 80–column punched card]

The IBM 029 Key Punch

Here is a picture of the device used to produce punched data cards.

[Figure: the IBM 029 card punch]

The card feed was at the right.

The card moved right–to–left as it was punched.

The punched cards were stored in a tray at the top left.

IBM 029 Punch Card Codes

Here is a card punched with each of the 64 characters available under this
format. Note the lack of lower case letters; they were not used in programming
languages of the time.

[Figure: a card punched with the 64 characters of the IBM 029 code]

More on the Punch Card Codes

Digits were encoded by a single punch in the appropriate row.

A single punch in row 2 encoded a “2”, etc.

Letters were encoded by two punches in a column: a zone punch in
row 12, 11, or 0, plus a digit punch.

The letter “A” was encoded as 12–1; a punch in row 12 (the top row),
and a punch in row 1.

The letter “J” was encoded as 11–1; a punch in row 11 (next to the top
row), and a punch in row 1.

The letter “S” was encoded as 0–2; a punch in row 0 and a punch in row 2.

Back to EBCDIC

Consider the IBM 029 punch codes and compare them to the EBCDIC codes.

Character	EBCDIC	Punch Card Codes
0 through 9	F0 through F9	0 through 9
A through I	C1 through C9	12–1 through 12–9
J through R	D1 through D9	11–1 through 11–9
S through Z	E2 through E9	0–2 through 0–9

This table explains the design of the EBCDIC system.

1. IBM chose this design for ease in processing input
from existing devices, such as the IBM 029 key punch.

2. The design also explains the gaps in the EBCDIC system: no letter or
digit has a non–decimal digit (A through F) as the second hexadecimal
digit of its code.

Cards did not have rows marked A, B, C, D, E, or F.
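
The correspondence in the table can be sketched as a small function: the zone punch selects the first hexadecimal digit and the digit punch supplies the second. This is only an illustration covering letters and digits. The names ZONE_TO_HIGH_DIGIT, VALID_DIGITS, and punch_to_ebcdic are hypothetical, and the check uses Python's cp037 codec, which is one EBCDIC variant and is assumed here to stand in for the code shown in the lecture.

```python
# Zone punch (row 12, 11, 0, or no zone punch) -> first hex digit of the EBCDIC code.
ZONE_TO_HIGH_DIGIT = {12: 0xC, 11: 0xD, 0: 0xE, None: 0xF}

# Digit rows that form a letter or digit with each zone, following the table above.
VALID_DIGITS = {12: range(1, 10), 11: range(1, 10), 0: range(2, 10), None: range(0, 10)}

def punch_to_ebcdic(zone, digit):
    """Return the EBCDIC code for a letter or digit punched as zone-digit.

    zone is 12, 11, 0, or None (no zone punch, as used for the digits).
    """
    if zone not in VALID_DIGITS or digit not in VALID_DIGITS[zone]:
        raise ValueError("not a letter or digit punch code")
    return (ZONE_TO_HIGH_DIGIT[zone] << 4) | digit

for zone, digit in [(12, 1), (11, 1), (0, 2), (None, 7)]:
    code = punch_to_ebcdic(zone, digit)
    # Decode with cp037 only to confirm which character the code stands for.
    print(zone, digit, f"0x{code:02X}", bytes([code]).decode("cp037"))
```

Because the second hexadecimal digit is copied from a card row, it can never be A through F for these characters.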

Control Characters

In any character set, some codes represent characters and some codes represent
control information used to indicate how the data are to be processed.

In EBCDIC, the first 64 codes (with hexadecimal values 0x00 – 0x3F) represent
control characters. Here are a few of the codes used for control characters.

Value	Name	Meaning
0x01	SOH	Start of heading section of a message
0x02	STX	Start of text section of a message
0x03	ETX	End of text section of a message
0x05	HT	Horizontal tab (standard tab on a keyboard)
0x0B	VT	Vertical tab
0x0C	FF	Form feed (commonly moves to another page)
0x0D	CR	Carriage return (moves back to column 0 of the display)
0x25	LF	Line feed (moves directly down to the next line)
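
Some of these control characters have different code values in ASCII. The following sketch compares the two, assuming that Python's cp037 codec (one EBCDIC variant) matches the values listed above. The names EBCDIC_CONTROLS and ebcdic_to_ascii_code are illustrative only.

```python
# (name, EBCDIC code) pairs taken from the list of control characters above.
EBCDIC_CONTROLS = [("HT", 0x05), ("VT", 0x0B), ("FF", 0x0C), ("CR", 0x0D), ("LF", 0x25)]

def ebcdic_to_ascii_code(code: int) -> int:
    """Translate one EBCDIC (cp037) code to the code of the same character in ASCII."""
    return ord(bytes([code]).decode("cp037"))

for name, code in EBCDIC_CONTROLS:
    print(f"{name}: EBCDIC 0x{code:02X} -> ASCII 0x{ebcdic_to_ascii_code(code):02X}")
```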

Printable EBCDIC Characters

Here are some of the character codes for printable EBCDIC characters.
The row label gives the first hexadecimal digit of the code; the column label gives the second.

Code	0	1	2	3	4	5	6	7	8	9
8		a	b	c	d	e	f	g	h	i
9		j	k	l	m	n	o	p	q	r
A		~	s	t	u	v	w	x	y	z
B
C	{	A	B	C	D	E	F	G	H	I
D	}	J	K	L	M	N	O	P	Q	R
E	\		S	T	U	V	W	X	Y	Z
F	0	1	2	3	4	5	6	7	8	9

Here, we note that 0xF0 is the code for the digit ‘0’.

Note that there are many gaps in the code. For example, there is no
printable character with the code 0xCA.
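
One consequence of the gaps is that the EBCDIC alphabet is not contiguous. The sketch below looks up a few codes using Python's cp037 codec, which is one EBCDIC variant and is assumed here to agree with the table for letters and digits. The helper name ebcdic_code is illustrative only.

```python
def ebcdic_code(ch: str) -> int:
    """Return the EBCDIC (cp037) code of a single character."""
    return ch.encode("cp037")[0]

# The first and last character of each run of letters in the table, plus two digits.
for ch in "aijrszAIJRSZ09":
    print(ch, f"0x{ebcdic_code(ch):02X}")

# In EBCDIC the code for 'j' does not directly follow the code for 'i'.
print(ebcdic_code("j") - ebcdic_code("i"))   # step from 'i' to 'j' in EBCDIC
print(ord("j") - ord("i"))                   # step from 'i' to 'j' in ASCII
```

A loop that adds 1 to a character code to reach the next letter works in ASCII but not in EBCDIC.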

The ASCII Printable Character Set

ASCII has its own set of control characters, with meanings similar to those
used in EBCDIC. Here are the ASCII codes for printable characters.

There are 128 code values in ASCII, ranging from 0x00 through 0x7F.
The value 0x20 is the ASCII code for the space character: “ ”.
The value 0x7F is the ASCII code for the delete character, called “DEL”.

[Table: ASCII codes for the printable characters]
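
The printable part of the ASCII code can be regenerated with a few lines of code. This is a minimal sketch that uses the same layout as the EBCDIC table (row for the first hexadecimal digit, column for the second). The function name ascii_table_rows and the labels SP and DEL are choices made for this example.

```python
def ascii_table_rows():
    """Yield (first hex digit, list of 16 entries) for the codes 0x20 through 0x7F."""
    for high in range(0x2, 0x8):
        row = []
        for low in range(0x10):
            code = (high << 4) | low
            if code == 0x20:
                row.append("SP")     # the space character
            elif code == 0x7F:
                row.append("DEL")    # delete is a control character, not printable
            else:
                row.append(chr(code))
        yield high, row

print("Code\t" + "\t".join(f"{low:X}" for low in range(0x10)))
for high, row in ascii_table_rows():
    print(f"{high:X}\t" + "\t".join(row))
```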

Properties of ASCII

ASCII has a number of interesting features that make it appealing to a programmer. Suppose we are examining a value stored in a variable.

If the value falls in the range 0x41 – 0x5A, the value represents
an upper case character.

If the value falls in the range 0x61 – 0x7A, the value represents
a lower case character.

For each alphabetic character, the code for the upper case and the code for the
lower case are strongly related. They differ in exactly one bit: the bit with
value 0x20 is 0 in the upper case code and 1 in the lower case code.

Look at the codes for the letter A. We give these in binary.

A 0100 0001
a 0110 0001

We shall later develop a formula to convert between upper case and lower case.
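
One possible form of that conversion, given here only as a sketch, sets or clears the single bit that differs between the two cases. The names CASE_BIT, is_upper, is_lower, to_lower, and to_upper are illustrative, and the functions work on ASCII code values rather than on strings.

```python
CASE_BIT = 0x20   # 0010 0000: the one bit that differs between 'A' and 'a'

def is_upper(code: int) -> bool:
    return 0x41 <= code <= 0x5A

def is_lower(code: int) -> bool:
    return 0x61 <= code <= 0x7A

def to_lower(code: int) -> int:
    """Set the case bit of an upper case letter; leave any other code unchanged."""
    return code | CASE_BIT if is_upper(code) else code

def to_upper(code: int) -> int:
    """Clear the case bit of a lower case letter; leave any other code unchanged."""
    return code & ~CASE_BIT if is_lower(code) else code

for ch in "Az5[":
    code = ord(ch)
    print(ch, f"0x{code:02X}", f"0x{to_lower(code):02X}", f"0x{to_upper(code):02X}")
```

The range tests matter: applying the bit operation to a code that is not a letter, such as 0x5B for “[”, would produce an unrelated character.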

Unicode as an Extension of ASCII

The ASCII code set and the EBCDIC code set are each sufficient for
expressing any idea, as long as it can be expressed in standard Latin
characters (the character set used to write in English).

This is not an issue when writing programs, as all programming languages
can be expressed in something that looks like English.

Suppose your company wants to market an application in a country
(such as Korea, Japan, China, Egypt, or Saudi Arabia) in which English is
not the main language. How do you design your GUI (Graphical User
Interface) for the screen displays?

One option is to require that everybody learn English, which is almost a
de facto requirement anyway.

Suppose that you want to market an application to be used in a small shop,
such as a corner market or cobbler shop. Should grandpa learn English?

A better way is to develop a method to represent non–Latin characters.

Code Pages and Unicode

An early modification was to develop what were called “code pages”.

This works for alphabetic languages, such as Arabic and Greek, in which a
relatively small alphabet is used. One just replaces the Latin alphabet.

ASCII could be modified for Arabic just by redefining each of the code values
0x41 – 0x5A and 0x61 – 0x7A to stand for an Arabic character.
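
The effect of a code page can be seen by decoding the same byte value under several of them. The sketch below uses the ISO 8859 family as shipped with Python. Note that these particular code pages keep ASCII unchanged in 0x00 – 0x7F and redefine the upper half of the 8–bit range, which differs from the illustration above of redefining the letter codes themselves. The helper name decode_byte is illustrative only.

```python
import unicodedata

def decode_byte(value: int, codec: str) -> str:
    """Interpret one byte value under the given code page."""
    return bytes([value]).decode(codec)

# The same byte, 0xE1, stands for a different letter in each code page.
for codec in ("latin_1", "iso8859_7", "iso8859_5", "iso8859_6"):
    ch = decode_byte(0xE1, codec)
    print(f"{codec}: 0xE1 -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

A byte by itself does not identify a character; the reader must also know which code page was in use.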

The main problem with each of ASCII and EBCDIC is the small number
of distinct characters that can be represented.

Standard ASCII can represent only 128 distinct characters.

Extended ASCII can represent only 256 distinct characters.

EBCDIC can represent only 256 distinct characters.

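Unicode, seen as a 16–bit encoding method, can support 65,536 distinct
characters. The full standard is larger: its code values run from 0x0000
through 0x10FFFF, and they can be stored using 8–bit, 16–bit, or 32–bit
units (the encoding forms called UTF-8, UTF-16, and UTF-32).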
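
The difference between the 16–bit view and the larger encodings can be checked directly. The following sketch reports how many bytes one character needs in each Unicode encoding form. The name encoded_sizes and the three sample characters are choices made for this example.

```python
import sys

def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),   # "-be" avoids the 2-byte byte order mark
        "UTF-32": len(ch.encode("utf-32-be")),
    }

samples = {
    "Latin capital A": "\u0041",
    "Greek small alpha": "\u03b1",
    "Cuneiform sign A": "\U00012000",   # beyond 0xFFFF, so 16 bits are not enough
}

print(f"largest code value: 0x{sys.maxunicode:X}")
for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} {encoded_sizes(ch)}")
```

A character above 0xFFFF takes two 16–bit units in UTF-16, so the 16–bit form is a variable–length code.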

Some Unicode Examples

Here are some examples of character sets supported by the Unicode standard.
These are taken from the web site http://www.unicode.org/charts/.

The Latin alphabet (used in English)

Greek

Cyrillic (used in the Russian language)

Egyptian hieroglyphs

Arabic

Hebrew

Cuneiform (ancient Mesopotamian) and Runic (Norse characters)

Lycian and Lydian (kingdoms in Anatolia during the 4th century BC)

Cherokee (an alphabet developed in the early 19th century)

Phoenician, Parthian, Etruscan, and Old Turkic

Unicode Representation of Some Greek Characters
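
As a small illustration, the code values and official names of the first few Greek letters can be listed with Python's standard unicodedata module. The function name describe and the choice of ranges are assumptions made for this sketch.

```python
import unicodedata

def describe(first: int, last: int):
    """Return (code value, Unicode name) pairs for the code values first..last."""
    return [(f"U+{code:04X}", unicodedata.name(chr(code)))
            for code in range(first, last + 1)]

# Capital alpha through epsilon, then small alpha through epsilon.
for code_value, name in describe(0x0391, 0x0395) + describe(0x03B1, 0x03B5):
    print(code_value, name)
```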

How About Cuneiform?

A Problem with Unicode

The global Internet will use Unicode to represent the URL (Uniform Resource
Locator). The URL for Columbus State University is http://www.columbusstate.edu/

Here is an example taken from a security textbook. The question is as follows:
Which of these two URLs references the PayPal service?

www.paypal.com

www.pаypal.com

Here is the answer. We look at the word “paypal” in each URL and examine the
16–bit Unicode representation of each letter.

The first is the correct link. Its encoding is:
0x0070 0x0061 0x0079 0x0070 0x0061 0x006C

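The second encoding is:
0x0070 0x0430 0x0079 0x0070 0x0061 0x006C

In the second URL, the second letter is the Cyrillic lower case “а” (0x0430),
which looks the same on the screen as the Latin lower case “a” (0x0061).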
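
The comparison can be reproduced in a few lines. This is a minimal sketch and not a complete defense against look–alike names. The function names code_points and non_ascii_letters are illustrative, and the spoofed string is written with an explicit escape so that the Cyrillic letter is visible in the source.

```python
import unicodedata

def code_points(text: str):
    """Return the 16-bit code values of the characters, written as in the lecture."""
    return [f"0x{ord(ch):04X}" for ch in text]

def non_ascii_letters(text: str):
    """Return (position, Unicode name) for every character outside the ASCII range."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(text) if ord(ch) > 0x7F]

genuine = "paypal"
spoofed = "p\u0430ypal"   # second letter is Cyrillic, not Latin

print(" ".join(code_points(genuine)))
print(" ".join(code_points(spoofed)))
print(genuine == spoofed)
print(non_ascii_letters(spoofed))
```

The two strings compare as unequal even though they may be drawn identically, so a program can detect the difference that a reader cannot see.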