Chapter 18 – The Cray Line of Supercomputers

We now turn to a discussion of the most significant line of supercomputers from the 20th century.  Admittedly, we could also say the last third of the 20th century as there were no supercomputers, in any sense of the word, until about 1964 when the CDC–6600 was introduced.  The year 1964 was also the year that the IBM System/360 was announced.  This is considered a mainframe computer and not a supercomputer.

The difference between a mainframe and a supercomputer is entirely a matter of definition, though one that does make sense.  A supercomputer is a large, powerful computer that is designed to bring computational power to the solution of a single large problem.  It solves one problem at a time.  A mainframe should be considered as a transaction processor, though this is not strictly necessary.  Each transaction is quite simple: update a bank balance, print a payroll check, post an airline reservation, etc.  What makes the mainframe unique is its ability to handle massive numbers of transactions per unit time, and to do so with very high reliability.  Mainframes and supercomputers address different classes of problems.

Within the confines of this chapter, we shall use the terms “supercomputer” and “vector processor” as synonyms, as they were used for about 25 years.  The first vector processor was the Cray–1, designed by Seymour Cray and installed at Los Alamos National Lab in 1976.  The term “vector processor” is distinguished from “scalar processor” in the literature.

The terms “scalar” and “vector” are adopted from the field of mathematics, and retain their definitions.  A scalar is a single number of one of several varieties, typically integer or real.  It is true that complex numbers are also scalars in the mathematical sense, but they are treated differently in computer science.  A vector is just a collection of scalar quantities; e.g., the (X, Y, Z) coordinates of a point.  In this field, we ignore the difference between the coordinates of a point and the vector from the origin of coordinates to that point.

The difference between scalar and vector designs can be seen in the following FORTRAN fragment, which adds two arrays to produce a third array.

      DO 200, I = 1, 1000
        DO 200, J = 1, 1000
          C(I, J) = A(I, J) + B(I, J)
200   CONTINUE

Each of A(I, J), B(I, J), and C(I, J) is a scalar element of its respective array.  In a scalar processor, the key instruction is the statement C(I, J) = A(I, J) + B(I, J).  This will be put in a loop and executed a million times.  In a vector processor, there is more of the flavor of a vector addition: C = A + B.  Here there would be no loop structure, and the key instruction would be fetched from memory only once, rather than one million times, as in the scalar processor.  As we shall see later, the vector processor uses a particularly safe form of pipelining, one that has few data hazards.
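The two styles can be contrasted in a short Python sketch.  Python is used here only as notation; nothing like it ran on a Cray, and the function names scalar_add and vector_add are hypothetical.  The first function spells out one addition per element, as the scalar processor does; the second states the whole–row addition once and leaves the element–by–element work to the underlying machinery, which is the flavor of C = A + B.

```python
from operator import add


def scalar_add(a, b):
    """Scalar style: one explicit add per (i, j) element, inside nested loops."""
    rows, cols = len(a), len(a[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            c[i][j] = a[i][j] + b[i][j]
    return c


def vector_add(a, b):
    """Vector style: each row is added as a unit, with no per-element loop
    written out in the program text."""
    return [list(map(add, row_a, row_b)) for row_a, row_b in zip(a, b)]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[10.0, 20.0], [30.0, 40.0]]
print(scalar_add(a, b) == vector_add(a, b))  # True
```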

The Cray series of computers, which are the subject of this chapter, were equipped with both scalar and vector execution units.  When operated only in scalar mode, the computer would become a standard processor, though a fairly powerful one.

The Cray–1, delivered in 1976, was the first of the Cray line of supercomputers, and the first true vector processor.  It was an immediate hit among scientists who needed considerable computational power, mostly because the vector facilities seemed a perfect match to the array–heavy computation that was the life work of these scientists.  Below is a figure of the Cray–1 at the Deutsches Museum in Munich, Germany.

Figure: The Cray–1 at the Deutsches Museum in Munich, Germany

Note the design.  The low parts contained the power supplies for the computer proper, which was housed in the taller parts.  In 1976, the magazine Computerworld called the Cray–1 “the world’s most expensive love seat”.

History of Seymour Cray and His Companies

In order to understand the Cray line of computers, we must look at the personal history of Seymour Cray, the “father of the supercomputer”.  Cray began work at Control Data Corporation soon after its founding in 1957 and remained there until 1972.  He designed several computers, including the CDC 1604, CDC 6600, and CDC 7600.  The CDC 1604 was intended just to be a good computer; all computers beginning with the CDC 6600 were designed for speed.  The CDC 6600 is often called the first RISC (Reduced Instruction Set Computer), due to the simplicity of its instruction set.  The reason for its simplicity was the desire for speed.  Cray also put a lot of effort into matching the memory and I/O speed with the CPU speed.  As he later noted, “Anyone can build a fast CPU.  The trick is to build a fast system.”  Full disclosure: Your author has programmed on the CDC 6600, CDC 7600, and Cray–1; he found each to be an excellent machine with a very clean architecture.

The CDC 6600 led to the more successful CDC 7600.  The CDC 8600 was to be a follow–on to the CDC 7600.  While an excellent design, it proved too complex to manufacture successfully, and was abandoned.  Cray left Control Data Corporation in 1972 to found Cray Research, based in Chippewa Falls, Wisconsin.  The main reason for his departure was the fact that CDC could not bankroll his project leading to development of the Cray–1.  The parting was cordial; CDC did invest some money in Cray’s new company.

In 1989, Cray left Cray Research in order to found the Cray Computer Corporation.  His reason for leaving was that he wanted to spend more time on research, rather than just churning out the very profitable computers that his previous company was manufacturing.  This led to an interesting name game:

Cray Research, Inc.          producing a large number of commercial computers.

Cray Computer Corporation    mostly invested in research on future machines.

The 16–processor Cray–3 was announced in 1993 but never delivered; only a single, smaller Cray–3 was ever built.  Work on the Cray–4, a smaller version of the Cray–3 with a 1 GHz clock, ended when the Cray Computer Corporation went bankrupt in 1995.  Seymour Cray died on October 5, 1996.

In 1993, Cray Research moved away from pure vector processors, producing its first massively parallel processing (MPP) system, the Cray T3D™.  We shall discuss this decision a bit later, in another chapter of this book, when we discuss the rise of MPPs.  Cray Research merged with SGI (Silicon Graphics, Inc.) in February 1996.  It was spun off as a separate business unit in August 1999.  In March 2000, Cray Research was merged with Tera Computer Company to form Cray, Inc.  The company still exists and has a very nice web site (www.cray.com).

Here is a schematic of the Cray–1.  At the base, it is more than 8 feet in diameter.

We may think this a large computer, but for its time the Cray–1 was surprisingly small.  Your author recalls the first time he actually saw a Cray–1.  The first thought was “Is this all there is?”  Admittedly, there were a number of other units associated with this, including disk farms, tape drives, and a variety of other I/O devices.  As your author recalls, there was a dedicated Front End Processor, possibly a CDC–7600, to manage the tapes and disks.  The division of work load into computational and I/O has a long history, for the simple reason that it works.

Processor Specifications of the Cray–1

Source: Cray–1 Computer System Hardware Reference Manual, Publication 2240004, Revision C, November 1977 [R106].

Note that the memory size, without error correction, would be 8MB.  Each word has 64 data bits (8 bytes) as well as 8 bits for error correction.  Other material indicates that the memory was low–order interleaved.

Here is a considerable surprise.  We are discussing what was agreed to be the most powerful computer of the late 1970’s.  Yet it had only eight megabytes of memory.  The reader is invited to revisit this textbook’s chapter on memory and note the price of memory in 1976.

The Cray–1 Vector Registers

It is important to understand the structure and function of the vector registers, as it is this set of structures that made the computer a vector processor.  Each of the vector registers is a vector of registers, best viewed as a collection of sixty–four registers each holding 64 bits.  A vector register held 4,096 bits.  Vector registers are loaded from primary memory and store results back to primary memory.  One common use would be to load a vector register from sixty–four consecutive memory words. Nonconsecutive words could be handled if they appeared in a regular pattern, such as every other word or every fourth word, etc.
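The loading pattern just described can be sketched as follows.  This is an illustration only: the function name load_vector, the toy memory, and the bounds checks are assumptions made for the example and do not reproduce the Cray–1 instruction set.  The sketch shows sixty–four words being gathered either from consecutive addresses or at a regular stride.

```python
VECTOR_LENGTH = 64  # elements per vector register, as described above


def load_vector(memory, base, stride=1, length=VECTOR_LENGTH):
    """Return `length` words from `memory`, starting at `base`, stepping by `stride`."""
    if not 1 <= length <= VECTOR_LENGTH:
        raise ValueError("a vector register holds at most 64 elements")
    last = base + (length - 1) * stride
    if base < 0 or last < 0 or base >= len(memory) or last >= len(memory):
        raise IndexError("vector load runs outside memory")
    return [memory[base + k * stride] for k in range(length)]


memory = list(range(1000))                          # toy memory: address n holds n
consecutive = load_vector(memory, 100)              # words 100, 101, ..., 163
every_fourth = load_vector(memory, 100, stride=4)   # words 100, 104, ..., 352
```

The regular stride is what matters: the address of every element is known as soon as the base and the stride are known.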

One might consider each register as an array, but that does not reflect its use.  One of the key design features of the Cray is the placement of a large number of registers between the memory and the CPU units.  These function much as a cache memory does.

Note that the scalar and address registers also have auxiliary registers.

In some sense, we can say that the T registers function as a cache for the S registers, and the B registers function as a cache for the A registers.  Without the register storage provided by the B, T, and V registers, the CRAY–1’s [memory] bandwidth of only 80 million words per second would be a serious impediment to performance.  Each word is 8 bytes; 80 million words per second is 640 million bytes per second, or one byte every 1.6 nanoseconds.
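The bandwidth arithmetic above is easy to check.  The short sketch below uses only the figures quoted in the text (80 million words per second, 8 bytes per word); the variable names are ours.  It shows that the “1.6 nanoseconds” per byte is a rounding of 1.5625.

```python
WORDS_PER_SECOND = 80_000_000   # memory bandwidth quoted above
BYTES_PER_WORD = 8              # 64 data bits per word

bytes_per_second = WORDS_PER_SECOND * BYTES_PER_WORD
ns_per_word = 1e9 / WORDS_PER_SECOND
ns_per_byte = 1e9 / bytes_per_second

print(bytes_per_second)  # 640000000
print(ns_per_word)       # 12.5
print(ns_per_byte)       # 1.5625
```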

Note that many of the advantages of a vector processor depend on the structure of the problem for which the program has been written.

1. Each result is independent of previous results, which enables deep pipelines and high clock rates without generating any data hazards.

2. A single vector instruction performs a great deal of work, which means fewer instruction fetches in general, and fewer branch instructions and so fewer mispredicted branches.

3. Vector instructions access memory a block at a time, which allows memory latency to be amortized over, say, 64 elements.

4. Vector instructions access memory with known patterns, which allows multiple memory banks to simultaneously supply operands.

These last two advantages mean that vector processors do not need to rely on high hit rates of data caches to have high performance.  They tend to rely on low-latency main memory, often made from SRAM, and have as many as 1024 memory banks to get high memory bandwidth [R80].

Amount of Work

Note that a single vector instruction can correspond to the execution of 64 scalar instructions in a normal SISD architecture.  This makes very good use of the bandwidth to the Instruction Cache that fronts the memory.  One instruction is fetched and does the work of sixty–four, with 63 fewer instruction fetches.  Remember that the goal is to make the best use of primary memory, which is inherently much slower than the CPU execution units.
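A rough count makes the saving concrete.  The sketch below is a simplification under stated assumptions: it counts only the add instructions, ignores loads, stores, and loop control, and assumes that an array longer than 64 elements is processed in blocks of up to 64.  The function names are hypothetical.

```python
import math

VECTOR_LENGTH = 64  # elements handled by one vector instruction


def scalar_instruction_count(n_elements):
    """One add instruction is issued per element."""
    return n_elements


def vector_instruction_count(n_elements, vector_length=VECTOR_LENGTH):
    """One vector add is issued per block of up to `vector_length` elements."""
    return math.ceil(n_elements / vector_length)


n = 1000 * 1000  # the million-element array addition used earlier in the chapter
print(scalar_instruction_count(n))   # 1000000
print(vector_instruction_count(n))   # 15625
```

Under these assumptions the number of add instructions issued drops by a factor of 64.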

Locality of Memory Access

All vector instructions accessing memory have a predictable access pattern.  This pattern provides a good match for an interleaved primary memory, of the type we discussed in a previous lecture on matching the cache and main memory.  Consider a vector with 64 entries, each a 64–bit double word.  The vector has size of 512 bytes.  Given a heavily interleaved memory (1024–way is not uncommon) the cost to access the first byte to be transferred is amortized over the entire vector.  If it takes 80 nanoseconds to retrieve the first byte and 4 nanoseconds each to retrieve the remaining 511, the average access time is
(80 + 4·511)/512 = 2124/512 ≈ 4.15 ns.
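The same amortization can be written as a small function, which makes it easy to try other vector sizes.  The function name and parameter names are ours; the figures of 80 ns and 4 ns are the ones used in the example above.

```python
def average_access_ns(first_ns, next_ns, n_items):
    """Average time per item when the first access is slow and the rest are fast."""
    total_ns = first_ns + next_ns * (n_items - 1)
    return total_ns / n_items


avg = average_access_ns(first_ns=80, next_ns=4, n_items=512)
print(round(avg, 2))  # 4.15
```

With n_items = 1 the function returns the full 80 ns, which is the case in which nothing is amortized.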

Control Hazards

Most of the control hazards in a scalar unit have to do with conditional branches, such as those taken at the end of a loop.  Many of these loops are now represented as a single vector instruction, which obviously has no branch hazard.

The only branch hazards in a vector pipeline occur due to conditional branches, such as the IF … THEN … ELSE tests.

Evolution of the Cray–1

In this course, the main significance of the CDC 6600 and CDC 7600 computers lies in their influence on the design of the Cray–1 and other computers in the series.  Remember that Seymour Cray was the principal designer of all three computers.

Here is a comparison of the CDC 7600 and the Cray–1.

Item                        CDC 7600               Cray–1

Circuit Elements            Discrete Components    Integrated Circuitry
Memory                      Magnetic Core          Semiconductor (50 nanoseconds)
Scalar (word) size          60 bits                64 bits (plus 8 ECC bits)
Vector Registers            None                   Eight, each holding 64 scalars
Scalar Registers            Eight: X0 – X7         Eight: S0 – S7
Scalar Buffer Registers     None                   Sixty–four: T0 – T77
Address Registers           Eight: A0 – A7         Eight: A0 – A7
Address Buffer Registers    None                   Sixty–four: B0 – B77

Octal numbering was used for the buffer registers, so T0 – T77 and B0 – B77 each name sixty–four registers.

Two main changes:

1. Addition of the eight vector registers.

2. Addition of fast buffer registers for the A and S registers.

Chaining in the Cray–1

Here is how the technique is described in the 1978 article [R107].

“Through a technique called ‘chaining’, the CRAY–1 vector functional units, in combination with scalar and vector registers, generate interim results and use them again immediately without additional memory references, which slow down the computational process in other contemporary computer systems.”

This is exactly the technique, called “forwarding”, used to handle data hazards in a pipelined control unit.  Essentially a result to be written to a register is made available to computational units in the CPU before it is stored back into the register file.

Consider the following example using the vector multiply and vector addition operators.

MULTV     V1, V2, V3    // V1[K] = V2[K] · V3[K]

ADDV      V4, V1, V5    // V4[K] = V1[K] + V5[K]

Without chaining (forwarding), the vector multiplication operation would have to finish before the vector addition could begin.  Chaining allows a vector operation to start as soon as the individual elements of its vector source become available.  The only restriction is that operations being chained belong to distinct functional units, as each functional unit can do only one thing at a time.
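The benefit of chaining can be estimated with a very simple timing model.  The sketch below is an idealization, not a description of actual Cray–1 timing: it assumes each functional unit is fully pipelined, accepts one element per clock, and delivers element K a fixed number of clocks after it enters.  The latencies of 7 clocks for the multiply and 6 for the add are illustrative assumptions, as are the function names.

```python
def unchained_cycles(n, mul_latency, add_latency):
    """The add may not start until the multiply has delivered its last element."""
    multiply_done = mul_latency + (n - 1)
    return multiply_done + add_latency + (n - 1)


def chained_cycles(n, mul_latency, add_latency):
    """Each product V1[k] is forwarded to the adder as soon as it emerges."""
    # Element k leaves the multiplier at clock mul_latency + k and
    # leaves the adder add_latency clocks later.
    return max(mul_latency + k + add_latency for k in range(n))


# Assumed latencies, in clocks, for a 64-element vector.
print(unchained_cycles(64, mul_latency=7, add_latency=6))  # 139
print(chained_cycles(64, mul_latency=7, add_latency=6))    # 76
```

In this model chaining nearly halves the time for the pair of operations, because the two pipelines overlap for almost the whole length of the vector.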

Vector Startup Times

Vector processing involves two basic steps: startup of the vector unit and pipelined operation.  As in other pipelined designs, the maximum rate at which the vector unit executes instructions is called the “Initiation Rate”, the rate at which new vector operations are initiated when the vector unit is running at “full speed”.

The initiation rate is often expressed as a time, so that a vector unit that operated at 100 million operations per second would have an initiation rate of 10 nanoseconds.  The time to process a vector depends on the length of the vector.  For a vector with length N (containing N elements) we have

T(N) = Start–Up_Time + N · Initiation_Rate

The time per result is then T(N) / N = Start–Up_Time / N + Initiation_Rate.  For short vectors (small values of N), this time may exceed the initiation rate of the scalar execution unit.  An important measure of the balance of the design is the vector size at which the vector unit can process faster than the scalar unit.  For a Cray–1, this crossover size was between 2 and 4; 2 ≤ N ≤ 4.  For N > 4, the vector unit was always faster.
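The crossover idea can be made concrete with a small calculation.  The sketch below applies the per–result formula above; the function names are ours, and the numbers in the example are hypothetical values chosen only to show the shape of the trade–off.  They are not measured Cray–1 figures.

```python
def vector_time_per_result(n, startup, initiation):
    """Per-result time for a vector of length n: startup / n + initiation."""
    return startup / n + initiation


def crossover_length(startup, initiation, scalar_time):
    """Smallest n at which the vector unit beats the scalar unit per result."""
    if scalar_time <= initiation:
        return None  # the vector unit never catches up in this model
    n = 1
    while vector_time_per_result(n, startup, initiation) >= scalar_time:
        n += 1
    return n


# Hypothetical figures, in clock ticks, for illustration only.
print(crossover_length(startup=12, initiation=1, scalar_time=5))  # 4
```

A small start–up time relative to the scalar time per result is what gives a low crossover point.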

Here are some comparative data for mathematical operations (logarithm, cosine, square root, and exponential), showing the per–result times as a function of vector length.  Note the low crossover point: for vectors longer than N = 5, the vector unit is much faster.

The time cost is given in clock ticks, not nanoseconds.  See Russell, 1978 [R107].

The Cooling Issue

The CRAY–1 computer weighed 5.25 tons (10,500 pounds, or about 4,763 kilograms).  It consumed 130 kilowatts of power.  All that power emerged as heat.  The problem was keeping the computer cool enough to operate.  The goal was not to keep the semiconductor circuitry from melting, but to keep it from rising to the temperature at which thermal noise made it inoperable.  Thermal noise refers to the fact that all semiconductors generate electric signals as a result of their temperature.  The maximum voltage of the signals due to thermal noise increases as the temperature increases; for reasonable temperatures these signals are so much smaller than the voltages intentionally generated that they can be ignored.

The CRAY–1 and its derivative computers (CRAY X–MP, CRAY Y–MP, etc.) employed a cold plate approach to cooling.  All circuit elements were mounted on a ground plane (used to assert zero volts) that was a copper sheet also used to cool the circuit.  The copper sheet was attached to an aluminum cold bar, maintained at 25 degrees Celsius by a Freon refrigerant flowing through stainless steel tubes.

It is worth noting that it took a year and a half before the first good cold bar was built.  A major problem was the discovery that aluminum is porous.  The difficulty was not due to loss of Freon, but due to the oil that contaminated the Freon; it could damage the circuits.

The Cray X–MP and Cray Y–MP

The fundamental tension at Cray Research, Inc. was between Seymour Cray’s desire to develop new and more powerful computers and the need to keep the cash flow going.  Seymour Cray realized the need for a cash flow at the start.  As a result, he decided not to pursue his ideas based on the CDC 8600 design and chose to develop a less aggressive machine.  The result was the Cray–1, which was still a remarkable machine.

With its cash flow assured, the company then organized its efforts into two lines of work.

1.   Research and development on the CDC 8600 follow–on, to be called the Cray–2.

2.   Production of a line of computers that were derivatives of the Cray–1 with improved technologies.  These were called the X–MP, Y–MP, etc.

The X–MP was introduced in 1982.  It was a dual–processor computer with a 9.5 nanosecond (105 MHz) clock and 16 to 128 megawords of static RAM main memory.  A four–processor model was introduced in 1984 with an 8.5 nanosecond clock.

The Y–MP was introduced in 1988, with up to eight processors that used VLSI chips.  It had a 32–bit address space, with up to 64 megawords of static RAM main memory.

The Y–MP M90, introduced in 1992, was a large–memory variant of the Y–MP that replaced the static RAM memory with up to 4 gigawords of DRAM.

The Cray–2

While his assistant, Steve Chen, oversaw the production of the commercially successful X–MP and Y–MP series, Seymour Cray pursued his development of the Cray–2, a design based on the CDC 8600, which Cray had started while at the Control Data Corporation.  The original intent was to build the VLSI chips from gallium arsenide (GaAs), which would allow much faster circuitry.  The technology for manufacturing GaAs chips was not then mature enough to be useful as circuit elements in a large computer.

The Cray–2 was a four–processor computer that had 64 to 512 megawords of 128–way interleaved DRAM memory.  The computer was built very small in order to be very fast; as a result, the circuit boards were built as very compact stacked cards.  Note the hand holding the stacked circuitry in the figure below.

Due to the card density, it was not possible to use air cooling.  The entire system was immersed in a tank of Fluorinert™, an inert liquid intended to be a blood substitute.  When introduced in 1985, the Cray–2 was not significantly faster than the X–MP.  It sold only thirty copies, all to customers needing its large main memory capacity.

Whatever Happened to Gallium Arsenide?

In his 1981 paper on the CRAY–1, J. S. Kolodzey listed the advantages of Gallium Arsenide as a semiconductor material and hinted that future Cray computers would use circuits fabricated from the material.  Whatever happened?

It is true that the one and only CRAY–3 was built with GaAs circuits.  This was shortly before the Cray Computer Corporation went bankrupt.

The advantages of Gallium Arsenide are seen in the following table of switching times.

Material              Switching Speed    Relative to Silicon

Silicon               400 picoseconds    1
Gallium Arsenide      80 picoseconds     5 times faster
Josephson Junction    15 picoseconds     27 times faster

The clear winner is the Josephson junction.  The difficulty is that it only operates at superconducting temperatures, which at the time were 4 Kelvins (– 452 F), the temperature of liquid Helium.  The discovery of high–temperature superconductivity in the late 1980’s may push this up to 77 Kelvins (– 321 F), the temperature of liquid Nitrogen.

Operation of any circuitry at 4 Kelvins requires use of liquid Helium, which is expensive.  Operation at 77 Kelvins requires use of liquid Nitrogen, which is plentiful and cheap.  Nevertheless, most computer operators do not want to bother with it.  This option is out.

Why not Gallium Arsenide?

Gallium Arsenide was hard to fabricate and, unlike Silicon, did not form a stable oxide that could be used as an insulator.  This made Silicon the preferred choice for the new high–speed circuit technology, called CMOS, which was increasingly used in the late 1980’s and 1990’s.  The economy of scale available to the Silicon industry also played a large role in inhibiting the adoption of Gallium Arsenide.  This is another example of mass–market products that are good enough for the job freezing out low–volume, high–technology products that theoretically are better.

The Cray–3 and the End of an Era

After the Cray–2, Seymour Cray began another very aggressive design: the Cray–3.  This was to be a very small computer that fit into a cube one foot on a side.  Such a design would require retention of the Fluorinert cooling system.  It would also be difficult to manufacture as it would require robotic assembly and precision welding.  It would also have been very difficult to test, as there was no direct access to the internal parts of the machine.

The Cray–3 had a 2 nanosecond cycle time (500 MHz).  A single processor machine would have a performance of 948 megaflops; the 16–processor model would have operated at 15.2 gigaflops.  The 16–unit model was never built.  The Cray–3 was delivered in 1993.  In 1994, Cray Research, Inc. released the T90 with a 2.2 nanosecond clock time and eight times the performance of the Cray–3.

In the end, the development of traditional supercomputers ran into several problems.

1.   The end of the cold war reduced the pressing need for massive computing facilities.

2.   The rise of microprocessor technology allowed much faster and cheaper processors.

3.   The rise of VLSI technology made multiple–processor systems more feasible.

Supercomputers vs. Multiprocessor Clusters

“If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?”  Seymour Cray

Although Seymour Cray said it more colorfully, there were many objections to the transition from the traditional vector supercomputer (with a few processors) to the massively parallel computing that replaced it.  The key issue in assessing the commercial viability of a multiple–processor system is the speedup factor: how much faster a system with N processors is than a system with one.  Here are two opinions from the 1984 IEEE tutorial on supercomputers [R108], an overview that assessed the commercial viability of traditional vector processors and multiprocessor systems.

“The speedup factor of using an n–processor system over a uniprocessor system has been theoretically estimated to be within the range (log₂ n, n/log₂ n).  For example, the speedup range is less than 6.9 for n = 16.  Most of today’s commercial multiprocessors have only 2 to 4 processors in a system.”

“By the late 1980s, we may expect systems of 8–16 processors.  Unless the technology changes drastically, we will not anticipate massive multiprocessor systems until the 90s.”

As we shall see soon, technology has changed drastically.