Chapter 10: Handling Character Data

Processing Character Data

We now discuss the definitions and uses of character data in an IBM Mainframe computer. By extension, we shall also be discussing zoned decimal data. Character data and zoned decimal data are stored as eightbit bytes. These eightbit bytes are seen by IBM as being organized into two parts. This division is shown in the following table.

Portion

Zone

Numeric

Bit

0

1

2

3

4

5

6

7

There are two things to note about this table. The first is the bit numbering scheme used by IBM, in which the leftmost bit in an item is always bit 0. IBM seems to be unique in this bit numbering scheme; almost all others label the rightmost bit as bit 0.

One might wonder about the nomenclature zone and numeric. In order to understand why these names are given, we must recall the format of an IBM 026 punch card. The point here is that the EBCDIC (Extended Binary Coded Decimal Interchange Code) encoding was designed for compatibility with the IBM 029 punch card codes which evolved from the IBM 026 punch card code illustrated below.

Note the structure of the column punches for the alphabetic character set. Each letter is represented by a punch in either column 11 or 12 (the zone punch) and a punch in one of the columns numbered 0 through 9 (the numeric punch). While the digits are represented by a single punch, the requirement to have a fullbyte representation in the character code has lead to their being assigned a zone code as well.

As noted in a previous chapter, the EBCDIC coding scheme was designed with the specific goal of easy translation from IBM 029 punched card codes, with the names zone and numeric being retained from those days. Why not keep a bit of history?

 


The EBCDIC Character Set

Here is the set of important EBCDIC codes.

Character

Punch Code

EBCDIC

0

0

F0

1

1

F1

9

9

F9

A

12 1

C1

B

12 2

C2

I

12 9

C9

J

11 1

D1

K

11 2

D2

R

11 9

D9

S

0 2

E2

T

0 8

E3

Z

0 9

E9

Note that the EBCDIC codes for the digits 0 through 9 are exactly the zoned decimal representation of those digits. (But see below).

The DS declarative is used to reserve storage for character data, while the DC declarative is used to reserve initialized storage for character data. There are constraints on character declarations, which apply to both the DS and DC declaratives.

1. Their length may be defined from 1 to 256 characters.
As a practical matter, long character constants should be avoided.

2. They may contain any character. Characters not available in the standard
set may be introduced by hexadecimal definitions.

3. The length may be defined either explicitly or implicitly.
It is usually a good idea not to do both, as this can lead to mistakes.

Consider the case in which a DC declarative is used to define a character constant. If the length attribute is specified, it overrides the length implied by the constant itself. Remember that the length is really a byte count, which is the same as a character count. The following examples will illustrate the issues of both explicit and implicit length definitions.

MONTH1 DC CL6SEPTEMBER STORED AS SEPTEM

MONTH2 DC CL6MAY STORED AS MAY

MONTH3 DC CL6AUGUST STORED AS AUGUST

In the first case, the explicit length is less than the actual length of the constant, so that the value stored is truncated after the explicit length is stored. The rightmost characters are lost.

In the second case, the explicit length is greater than the actual length of the constant. The value stored is padded with blanks out to the specified explicit length; here 3 are added.

It should be obvious that nothing special happens when the explicit length is exactly the same as the length of the constant. There may be reasons to do this, possibly for documentation.


Defining Character Strings

While the term string is not exactly appropriate in this context, we need some way to speak of a sequence of characters such as defined above. In the IBM parlance, the sequence defined by the declarative DC CL6AUGUST is viewed as character data. Strictly speaking, this is a sequence of six characters.

We shall speak of general string handling in a later chapter. The issue at this point is how the assembler determines the length of the string when executing an instruction such as MVC. The answer is that each such instruction specifically encodes the length of the string to be processed. Again, it is the instruction that really defines the length and not the declaration.

Examination of the object code for these character instructions will show that the length is stored in modified form as an 8bit unsigned integer. Actually, the length is decremented by one before it is stored. The range of an 8bit unsigned integer is 0 through 255 inclusive, so that the length that can be stored ranges from 1 through 256. There seems to be no provision for zero length sequences of characters. Zero length strings will be discussed in a later chapter in which the entire idea of a string will be fully developed.

First, lets recall one major difference between the DS and DC declaratives. The DS may appear to initialize storage, but it does not. Only the DC initializes storage. The difference is illustrated by considering the following two declarations.

V1 DS CL40000 Define four bytes of uninitialized
storage. The 0000 is just a comment.
The four bytes allocated will have some
value, but that is unpredictable.

V2 DC CL40000 Define four bytes of storage, initialized
to the four bytes F0 F0 F0 F0, which
represent the four characters.

One should use the DS declaration only for fields that will be initialized by some other means, such as the MVC instruction that is discussed below. It is always possible to move values into an area of memory initialized with a DC declarative. In the above example, it is possible to move the character constant 2222 to V2, which would then contain that value.

The student should also note that it is very easy to write the above declarations in a form that might cause assembly errors. Consider the following two declarations.

V3 DS CL4 0000 Define four bytes of uninitialized
storage. Note the blank after CL4.
Since everything after the CL4 is a
comment, this does not cause a problem.

V4 DC CL4 0000 This causes an assembly error. The DC
declarative exists to initialize the
storage area, but the blank after the
CL4 introduces a comment. The 0000
is not recognized as a value.

Note that no declaration above actually defines a number, but just a sequence of characters that happen to be digits.

Explicit Base Addressing for Character Instructions

We now discuss a number of ways in which the operand addresses for character instructions may be presented in the source code. One should note that each of these source code representations will give rise to object code that appears almost identical. These examples are taken from Peter Abel [R_02, pages 271 273].

Assume that generalpurpose register 4 is being used as the base register, as assigned at
the beginning of the CSECT. Assume also that the following statements hold.

1. General purpose register 4 contains the value X8002.

2. The label PRINT represents an address represented in base/offset form as 401A; that
is it is at offset X01A from the value stored in the base register, which is R4.
The address then is X8002 + X01A = X801C.

3. Given that the decimal number 60 is represented in hexadecimal as X3C,
the address PRINT+60 must then be at offset X01A + X3C = X56 from
the address in the base register. XA + XC, in decimal, is 10 + 12 = 16 + 6.

Note that this gives the address of PRINT+60 as X8002 + X056 = X8058,
which is the same as X801C + X03C. The sum XC + XC, in decimal, is
represented as 12 + 12 = 24 = 16 + 8.

4. The label ASTERS is associated with an offset of X09F from the value in the
base register; thus it is located at address X80A1. This label references a storage
of two asterisks. As a decimal value, the offset is 159.

5. That only two characters are to be moved by the MVC instruction examples to be
discussed. Since the length of the move destination is greater than 2, and since the
length of the destination is the default for the number of characters to be moved, this
implies that the number of characters to be moved must be stated explicitly.

The first example to be considered has the simplest appearance. It is as follows:

MVC PRINT+60(2),ASTERS

The operands here are of the form Destination(Length),Source.
The destination is the address PRINT+60. The length (number of characters
to move) is 2. This will be encoded in the length byte as X01, as the length
byte stores one less than the length. The source is the address ASTERS.

As the MVC instruction is encoded with opcode XD2, the object code here is as follows:

Type

Bytes

Operands

1

2

3

4

5

6

SS(1)

6

D1(L,B1),D2(B2)

OP

L

B1 D1

D1D1

B2 D2

D2D2

 

 

 

D2

01

40

56

40

9F

The next few examples are given to remind the reader of other ways to encode
what is essentially the same instruction.


These examples are based on the true nature of the source code for a MVC instruction, which is MVC D1(L,B1),D2(B2). In this format, we have the following.

1. The destination address is given by displacement D1 from the address stored in
the base register indicated by B1.

2. The number of characters to move is denoted by L.

3. The source address is given by displacement D2 from the address stored in
the base register indicated by B2.

The second example uses an explicit base and displacement representation of the destination address, with generalpurpose register 8 serving as the explicit base register.

LA R8,PRINT+60 GET ADDRESS PRINT+60 INTO R8

MVC 0(2,8),ASTERS MOVE THE CHARACTERS

Note the structure in the destination part of the source code, which is 0(2,8).

The displacement is 0 from the address X8058, which is stored in R8. The object code is:

Type

Bytes

Operands

1

2

3

4

5

6

SS(1)

6

D1(L,B1),D2(B2)

OP

L

B1 D1

D1D1

B2 D2

D2D2

 

 

 

D2

01

80

00

40

9F

The instruction could have been written as MVC 0(2,8),159(4), as the label
ASTERS is found at offset 159 (decimal) from the address in register 4.

The third example uses an explicit base and displacement representation of the destination address, with generalpurpose register 8 serving as the explicit base register.

LA R8,PRINT GET ADDRESS PRINT INTO R8

MVC 60(2,8),ASTERS SPECIFY A DISPLACEMENT

Note the structure in the destination part of the source code, which is 60(2,8).

The displacement is 60 from the address X801C, stored in R8. The object code is:

Type

Bytes

Operands

1

2

3

4

5

6

SS(1)

6

D1(L,B1),D2(B2)

OP

L

B1 D1

D1D1

B2 D2

D2D2

 

 

 

D2

01

80

3C

40

9F

The instruction could have been written as MVC 60(2,8),159(4), as the label
ASTERS is found at offset 159 (decimal) from the address in register 4.


Sample Declarations

We now give a few examples of declarations of character constants. These examples will appear in the form of an assembler listing. Each line will have four parts: a location, the object code (EBCDIC characters) that would be generated, the declaration itself, and then some comments in the field that the assembler would reserve for comments.

LOC Obj. Code Source Code Comments

005200 40404040 B1 DC CL4 FOUR BLANKS

005204 40404040 B2 DC 4CL1 FOUR SINGLE BLANKS. NOTE
THE IDENTICAL OBJECT CODE.

005208 F0F0F0F0 Z1 DC C0000 FOUR DIGITS

00520C F2F2F2F2 N2 DC 4CL12 FOUR MORE DIGITS

The MVC Instruction

The MVC (Move Character) instruction is designed to move character data, but it can be used to move data in any format, one byte at a time. As we shall see later, the MVC can be used to move packed decimal data, but this is not advised as strange errors can occur.

The MCV instruction is a storagetostorage (type SS) instruction. The opcode is XD2.

The instruction may be written as MVC DESTINATION,SOURCE

An example of the instruction is MVC F1,F2

The format of the instruction is MVC D1(L,B1),D2(B2). This format reflects the fact that each of the source and destination addresses is specified by a base register (often the default base register) and a displacement. Here is the format of the object code.

Type

Bytes

Form

1

2

3

4

5

6

SS(1)

6

D1(L,B1),D2(B2)

XD2

L

B1 D1

D1D1

B2 D2

D2D2

Here are a few comments on MVC.

1. It may move from 1 to 256 bytes, determined by the use of an 8bit number
as a length field in the machine language instruction.

The destination length is first decremented by 1 and then stored in the length byte,
which can store an unsigned integer representing values between 0 and 255.
This disallows a length of 0, and allows 8 bits to store the value 256.

2. Data beginning in the byte specified by the source operand are moved one
byte at a time to the field beginning with the byte in the destination operand.

One of the reasons for complexity of the implementation is that the source
and destination regions may overlap.

3. The length of the destination field determines the number of bytes moved.


Example of the MVC Instruction

Consider the example assembly language statement, which moves the string of
characters at label CONAME to the location associated with the label TITLE.

MVC TITLE,CONAME

Suppose that: 1. There are fourteen bytes associated with TITLE, say that it was
declared as TITLE DS CL14. Decimal 14 is hexadecimal E.

2. The label TITLE is referenced by displacement X40A
from the value stored in register R3, used as a base register.

3. The label CONAME is referenced by displacement X42C
from the value stored in register R3, used as a base register.

Given that the operation code for MVC is XD2, the instruction assembles as

D2 0D 34 0A 34 2C Length is 14 or X0E; L 1 is X0D

To be totally obvious with this example, let us disassemble the object code that we have just created by manual assembly. The only assumption at the start is that the byte with value
XD2 contains the opcode for the instruction. Here again is the object code format.

Type

Bytes

Form

1

2

3

4

5

6

SS(1)

6

D1(L,B1),D2(B2)

XD2

L

B1 D1

D1D1

B2 D2

D2D2

The opcode XD2 is that for the MVC instruction (surprise!). This is a type SS instruction which has a total of six bytes: the opcode byte and five bytes following.

The second byte contains the length field. Its value is X0D, representing the decimal value 13. This is one less than the length of the destination field, which must have length 14.

Bytes 3 and 4 represents an address, expressed in base/displacement format, as do bytes
5 and 6. The value in bytes 3 and 4 is a 16bit number, in hexadecimal it is X340A.
This indicates that general purpose register 3 is being used as the base for this address and that the offset is given by X40A. Suppose that register 3 contains the value X1700. The address represented would then be X1700 + X40A = X1B0A.

MVC: Explicit Register Usage

The instruction may be written explicitly in the form MVC D1(L,B1),D2(B2)

Consider the following example: MVC 32(5,7),NAME. In this example, suppose that generalpurpose register 7 has the value X22400. We note that the label NAME represents an address that will be converted to the form D2(B2); that is, a displacement from a base register. This base register might be register 7 or any of the ten registers (R3 R12) available for general use.

We examine the specification of the first argument, which is the destination address.
It is of the form D1(L,B1). The length is L = 5. This indicates that five characters are to be moved. The displacement is decimal 32, or X20.

The address of the first character in the destination is given by adding this displacement
to the contents of the base register: X22400 + X20 = X22420. Five characters are moved to the destination. The fifth character is moved to a location that is four bytes displaced from the first character; its address is X22424.

Suppose that the label NAME corresponds to an address given by offset X250(592 in
decimal) from generalpurpose register 10 (denoted in object code by XA).

When the instruction is written in the form MVC D1(L,B1),D2(B2), we see that it has the form MVC 32(5,7),592(10). ALL NUMBERS ARE DECIMAL.

In the object code format, the value stored for the length attribute is one less than
the actual length. The length is 5, so the stored value is 4, or X04.

The object code format is D2 04 70 20 A2 50.

Again, recall the object code format for this instruction.

Op Code

Length

Base

Displacement

Base

Displacement

D

2

0

4

7

0

2

0

A

2

5

0

MVC: Example of Length Mismatch

The number of bytes (characters) to move may be explicitly stated in the source statement. However if it is not explicitly stated, the number is taken as the length (in bytes or characters) of the destination field. Consider the following program fragment.

MVC F1,F2

F1 DC CL4JUNE

F2 DC CL5APRIL

What happens is shown in the next figure.

The assembler recognizes F1 as a fourbyte field from its declaration by the DC
statement. This implicitly sets the number of characters to be moved. The character L is not moved, as it is the fifth character in F2. It is at address F2+4.

MVC: Another Example of Length Mismatch

The number of bytes (characters) to move may be explicitly stated in the source code. While the explicit length may exceed that of the destination field, your instructor (but not many textbook authors) considers that bad programming practice.


Consider the following program fragment, in which an explicit length of 3 is set. Recall the form of the instruction: MVC D1(L,B1),D2(B2).

MVC F1(3),F2 The (3) says move three characters

F1 DC CL4JUNE

F2 DC CL5APRIL

What happens is shown in the next figure.

Note that only APR is moved. The last character of F1, which is an E, is not changed. This last character is at address F1+3.

MVC: Example 3

We may use relative addressing as well as an explicit length declaration. Consider the following program fragment.

MVC F1+1(2),F2+2

F1 DC CL4JUNE

F2 DC CL5APRIL

This calls for moving two characters from address F2+2 to address F1+1. The two characters at address F2+2 are RI. The two characters at the destination address F1+1 are UN. What happens is shown in the next figure.

The other two characters in F1, at addresses F1 and F1+3, are not changed.


MVC: Example 4

We now consider the explicit use of base registers.

Recall the form of the instruction: MVC D1(L,B1),D2(B2).

In the following three examples, we suppose that PRINT is a label associated with
an output field of length 80 bytes. In reality, it only must be big enough.

FRAG01 MVC PRINT+60(2),=C**

FRAG02 LA R8,PRINT+60 LOAD THE ADDRESS.

MVC 0(2,8),=C** DEST ADDRESS IS PRINT+60

FRAG03 LA R8,PRINT LOAD THE ADDRESS.

MVC 60(2,8),=C** NOTE OFFSET IS 60

Suppose that the address of PRINT is given by base register 12 and displacement X200. Suppose register 12 contains a value of X1000. The label PRINT references address X1200. The value of PRINT+60 is then X1200 + X60 = X1260.

As an aside, note that it appears more natural to write the first instruction in the form.

FRAG01 MVC PRINT+60(2), =C**

Note that there is a space following the comma. This space turns whatever fallows it into a comment, thus rendering the instruction incomplete and erroneous.

Describing Input Fields

Consider the following block that declares area for an 80column input (corresponding to an 80column punch card) that is divided into fields.

Here is a declaration of an 80byte input area that will be divided into fields.

CARDIN DS 0CL80 The record has 80 bytes.

NAME DS CL30 The first field has the name.

YEAR DS CL10 The second field.

DOB DS CL8 The third field.

GPA DS CL3 The fourth field.

DS CL29 The last 29 chars are not used.

The address corresponding to the label NAME is the same as that for the label CARDIN. The field NAME corresponds to addresses NAME through NAME+29, inclusive.

The address corresponding to the label YEAR is the same as the address CARDIN+30. The field YEAR corresponds to addresses YEAR through YEAR+9, inclusive. Equivalently, the field corresponds to addresses CARDIN+30 through CARDIN+39, inclusive.

Relative addressing will often be used to extract fields from an input record or place fields into an output record.


Character Comparison: CLC

The CLC (Compare Logical Character) instruction is one of the two used to compare
character fields, one byte at a time, left to right.

Comparison is based on the binary contents (EBCDIC code) contents of the bytes.
The sort order is from X00 through XFF.

The instruction may be written as CLC Operand1,Operand2

The format of the instruction is CLC D1(L,B1),D2(B2)

An example of the instruction is CLC NAME1,NAME2

This instruction sets the condition code that is used by the conditional branch instructions. The condition code is set as follows:

If Operand1 is equal Operand2 Condition Code = 0

If Operand1 is lower than Operand2 Condition Code = 1

If Operand1 is higher than Operand2 Condition Code = 2

The operation moves, byte by byte, from left to right and terminates as soon as an unequal comparison is found or one of the operands runs out.

Using the Condition Codes

The character comparison operators, CLC and CLI, set the condition codes. These codes are used by the branching instructions in their nonnumeric form. Here are the standard comparisons.

BE Branch Equal Condition Code = 0

BNE Branch Not Equal Condition Code 0

BL Branch Low Condition Code = 1

BNL Branch Not Low Condition Code 1

BH Branch High Condition Code = 2

BNH Branch Not High Condition Code 2.

Here are two equivalent examples.

CLC X,Y

BL J20LOEQ X sorts less than Y

BE J20LOEQ Y is equal to Y

CLC X,Y

BNH J20LOEQ X does not sort higher than Y

 


CLC: An Example

Consider the following code fragment. Note that the comparison value is given
as the seven EBCDIC characters 0200000.

Presumably, this would be converted into seven Packed Decimal digits and held to
represent the fixed point number 2000.00, presumably $2,000.00.

C20 CLC SALPR,=C0200000 COMPARE TO 2,000.00

BNH C30 NOT ABOVE 2,000.00

BL C40 LESS THAN 2,000.00

* EQUAL TO 2,000.00

Again, this is presented as representing Packed Decimal data, which it probably
does represent. The comparison, however, is an EBCDIC character comparison.

Here is another example, built around the first one. It represents an important
special case that we shall consider when discussing Packed Decimal format.

C20 CLC SALPR,=C IS THE FIELD BLANK?

BNE NOTBLNK

MVC SALPR,=C0000000 CONVERT BLANKS TO 0S

NOTBLANK PACK SALNUM,SALPR

MVI and CLI

These two operations are similar to their more general cousins, except
that the second operand is a onebyte immediate constant.

The immediate constant may be of any of the following formats:

B binary

C character

X hexadecimal

The format of these instructions are: MVI Operand1,ImmediateOperand

CLI Operand1,ImmediateOperand

Examples of these instructions are: MVI CONTROL,C$ Character $

CLI CODE,C5 Character 5

Character Literals vs. Immediate Operands

The main characteristic of an immediate operation is that the operand, called theimmediate operand is contained within the instruction. The main characteristic of a literal operand is that it is stored separately from the operand, in a literal pool generated by the assembler.

Here are two equivalent instructions to set the currency sign.

Use of a literal: MVC DOLLAR,=C$

Use of immediate operand MVI DOLLAR,C$

Note the = in front of the literal. It is not present in the immediate operand.


Insert Character (IC) and Store Character (STC)

The IC instruction moves a single byte (8 bits) from storage into a register and the STC moves a byte from the register to storage. Each access only the rightmost 8 bits of the general purpose register, denoted as bits 24 through 31.

Each of the instructions is a type RX instruction of the form OP REG,MEMORY. Note that:
1. The first operand denotes a general purpose register, of which only the rightmost
8 bits (24 31) will be used.
2. The second operand references one byte in storage, as each EBCDIC
character is stored in a single byte. As this is a byte address, there are no
restrictions on its value; it can be an even or odd number.

The opcode for IC is X43, while that for STC is X42. The object code is of the form OP R1,D2(X2,B2).

Type

Bytes

Operands

1

2

3

4

RX

4

R1,D2(X2,B2)

OP

R1 X2

B2 D2

D2D2

The first byte contains the 8bit instruction code, either X42 or X43.

The second byte contains two 4bit fields, each of which encodes a register number. The field R1 denotes the general purpose register that is either the source or destination of the transfer. The field X2 denotes the optional index register to be used in address calculation.

The third and fourth bytes hold the standard base/displacement address.

The IC instruction does not change the three leftmost bytes (bits 0 23) of the register being loaded. The STC instruction does not use these three bytes.

Case Conversion

We now present an interesting use for these two instructions. This is the conversion of alphabetical characters from upper case to lower case and back again. In order to do this, we need a few instructions that have yet to be discussed.

The three instructions are here given in their immediate format, though there are other forms that will be discussed later. These are logical AND, logical OR, and logical XOR. Each of these operations is a bitwise operation, defined as follows.

AND 00 = 0 OR 0+0 = 0 XOR 00 = 0
01 = 0 0+1 = 1 01 = 1
10 = 0 1+0 = 1 10 = 1
11 = 1 1+1 = 1 11 = 0

The three instructions, as implemented in the S/370 architecture, are as follows:

NI Logical AND Immediate Opcode X92

OI Logical OR Immediate Opcode X96

XI Logical XOR Immediate Opcode X97

Each instruction is type SI, and is written as source code in the form OP TARGET,MASK.
The indicated operation is applied to the TARGET and the result stored in the TARGET.


Another Look at Part of the EBCDIC Table

In order to investigate the difference between upper case and lower case letters, we here present a slightly different version of the EBCDIC table.

 

Zone

8

C

9

D

A

E

Numeric

 

 

 

 

 

 

 

1

 

a

A

j

J

 

 

2

 

b

B

k

K

s

S

3

 

c

C

l

L

t

T

4

 

d

D

m

M

u

U

5

 

e

E

n

N

v

V

6

 

f

F

o

O

w

W

7

 

g

G

p

P

x

X

8

 

h

H

q

Q

y

Y

9

 

i

I

r

R

z

Z

The structure implicit in the above table will become more obvious when we compare
the binary forms of the hexadecimal digits used for the zone part of the code.

Upper Case C = 1100 D = 1101 E = 1110
Lower Case 8 = 1000 9 = 1001 A = 1010

Note that it is only one bit in the zone that differentiates upper case from lower case.
In binary, this would be noted as 0100 or X4. As this will operate on the zone field of a character field, we extend this to the two hexadecimal digits X40. The student should verify that the onescomplement of this value is XBF. Consider the following operations.

UPPER CASE

A X1100 0001 X1100 0001
OR X 40 X0100 0000 AND X BF X1011 1111
X1100 0001 X1000 0001

Converted to A a

Lower case

a X1000 0001 X1000 0001
OR X 40 X0100 0000 AND X BF X1011 1111
X1100 0001 X1000 0001

Converted to A a

We now have a general method for changing the case of a character, if need be.
Assume that the character is in a one byte field at address LETTER.

Convert a character to upper case. OI,LETTER,=X40

This leaves upper case characters unchanged.

Convert a character to lower case. NI,LETTER,=XBF

This leaves lower case characters unchanged.

Change the case of the character. XI,LETTER,=X40
This changes upper case to lower case and lower case to upper case.