Loosely Coupled Multiprocessors
Our
previous discussions of multiprocessors focused on systems built with a modest
number of processors (no more than about 50), which communicate via a shared
bus.
The class of computers we shall consider in this and the next lecture is called “MPP”, for “Massively Parallel Processor”.
As we shall see, the development of MPP systems was resisted for a long
time, due to the belief that such designs could not be cost effective.
We shall
see that MPP systems finally evolved due to a number of factors, at least one
of which only became operative in the late 1990’s.
1. The availability of small and inexpensive microprocessors (the Intel 80386, etc.) that could be efficiently packaged into a small unit.
2. The
discovery that many very important problems were quite amenable to
parallel implementation.
3. The
discovery that many of these important problems had structures of such
regularity that sequential code
could be automatically translated for parallel
execution with little loss in
efficiency.
The process of converting a sequential program for
parallel execution is often called “parallelization”. One speaks of “parallelizing an algorithm”; often this is misspoken as “paralyzing an algorithm” – which
unfortunately might be true.
The Speed–Up Factor
In an
earlier lecture, we spoke of the speed–up factor, S(N),
which denotes how much faster a program will execute on N processors than on
one processor. At the time, we
referenced early opinions that the maximum speed–up for N processors would be
somewhere in the range [log2(N), N/log2(N)].
We shall show some data for multicomputers with 2 to 65,536 processors in the next few slides. Recall that 65,536 = 2^16. The processor counts were chosen so that I could perform the calculation log2(N) in my head; log2(65,536) = 16 because 65,536 = 2^16.
The plots are log–log plots, in which each axis is scaled logarithmically. This allows data spanning several orders of magnitude to be displayed clearly. The top line represents linear speedup, the theoretical upper limit.
In the
speedup graph, we see that the speed–up factor N/log2(N)
might be acceptable, though it is not impressive.
The next
graph is what I call “cost efficiency”.
It is the speed–up factor divided by the number of processors: S(N)/N. This factor measures the economic viability of the
design.
Given the assumptions above, it is easy to see why large MPP designs did not appear to be attractive in the early 1990’s and before.
The Speed–Up Factor: S(N)
Examine a few values for the N/log2(N)
speedup: S(1024) = 102 and S(65536) = 4096.
These might be acceptable under certain specific circumstances.
The Cost Efficiency Factor: S(N)/N
This
chart shows the real problem that was seen for MPP systems before the late
1990’s. Simply put, they were thought to
be very cost inefficient.
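To make these figures concrete, here is a minimal sketch in C (my own illustration, not taken from any of the references) that tabulates the pessimistic estimate S(N) = N/log2(N) and the corresponding cost efficiency S(N)/N for the processor counts used in the plots.

    #include <math.h>
    #include <stdio.h>

    /* Tabulate the pessimistic speed-up estimate S(N) = N / log2(N) and the
       cost-efficiency factor S(N) / N for N = 2, 4, ..., 65,536.
       Compile with:  cc speedup.c -lm                                        */
    int main(void)
    {
        printf("%8s %14s %12s\n", "N", "S(N)=N/log2N", "S(N)/N");
        for (long n = 2; n <= 65536; n *= 2) {
            double s = n / log2((double)n);   /* pessimistic speed-up estimate */
            printf("%8ld %14.1f %12.4f\n", n, s, s / n);
        }
        return 0;
    }

For N = 1,024 the program prints a speed–up of about 102 and a cost efficiency of about 0.10; for N = 65,536 it prints 4,096 and 0.0625, which is precisely the source of the pessimism.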
Linear Speedup: The MPP Goal
The goal
of MPP system design is called “linear
speedup”, in which the performance of an N–processor system is
approximately N times that of a single processor system.
Earlier
in this lecture we made the comment that “Many important problems, particularly
ones that apply regular computations to massive data sets, are quite amenable
to parallel implementations”.
Within the context of our lectures, the ambiguous phrase “quite amenable to parallel implementations” acquires a specific meaning: there are well–known algorithms to solve the problem, and these algorithms can display a nearly–linear speedup when implemented on MPP systems.
As noted
in an earlier lecture “Characteristics of Numerical Applications”, problems
that can be solved by algorithms in the class called “continuum models” are
likely to show near–linear speedup. This
is due to the limited communication between cells in the continuum grid.
There are other situations in which MPP systems might be used. In 1990, Hennessy and Patterson [Ref. 2, page 575] suggested that “a multiprocessor may be more effective for a timesharing workload than a SISD [single processor]”. This seems to be the usage on the large IBM mainframe used by CSU to teach the course in assembly language.
Linear Speedup: The View from the Early 1990’s
Here is
what Harold Stone [Ref. 3] said in his textbook. The first thing to note is that he uses the
term “peak performance” for what we call “linear speedup”.
His
definition of peak performance is quite specific. I quote it here.
“When a multiprocessor is operating at peak
performance, all processors are engaged in useful work. No processor is idle, and no processor is
executing an instruction that would not be executed if the same algorithm were
executing on a single processor. In this
state of peak performance, all N processors are contributing to effective
performance, and the processing rate is increased by a factor of N. Peak performance is a very special state that
is rarely achievable.” [Ref. 3, page
340].
Stone
notes a number of factors that introduce inefficiencies and inhibit peak
performance. Here is his list.
1. The
delays introduced by interprocessor communication.
2. The
overhead in synchronizing the work of one processor with another.
3. The
possibility that one or more processors will run out of tasks and do nothing.
4. The processing cost of controlling the system and scheduling the tasks.
Early History: The C.mmp
While this lecture will focus on multicomputers, it is instructive to begin with a review of a paper on the C.mmp, a shared–memory multiprocessor developed at Carnegie Mellon University.
The
C.mmp is described in a paper by Wulf and Harbison [Ref. 6], which has been
noted as “one of the most thorough and balanced research–project retrospectives
… ever seen”. Remarkably, this paper
gives a thorough description of the project’s failures.
The
C.mmp is described [Ref. 6] as “a multiprocessor composed of 16 PDP–11’s, 16
independent memory banks, a crosspoint [crossbar] switch which permits any
processor to access any memory, and a typical complement of I/O
equipment”. It includes an independent
bus, called the “IP bus”, used to communicate control signals.
As of 1978, the system included the following 16 processors, together with shared memory.
5 PDP–11/20’s, each rated at 0.20 MIPS (that is, 200,000 instructions per second)
11 PDP–11/40’s, each rated at 0.40 MIPS
3 megabytes of shared memory (650 nsec core and 300 nsec semiconductor)
The
system was observed to compute at 6 MIPS.
The Design Goals of the C.mmp
The goal
of the project seems to have been the construction of a simple system using as
many commercially available components as possible.
The C.mmp was intended to be a research project not only in distributed processors, but also in distributed software. The native operating system designed for the C.mmp was called “Hydra”. It was designed as an OS kernel, intended to provide only minimal services and to encourage experimentation in system software.
As of
1978, the software developed on top of the Hydra kernel included file systems,
directory systems, schedulers and a number of language processors.
Another
part of the project involved the development of performance evaluation tools,
including the Hardware Monitor for recording the signals on the PDP–11 data bus
and software tools for analyzing the performance traces.
One of
the more important software tools was the Kernel Tracer, which was built into
the Hydra kernel. It allowed selected
operating system events, such as context swaps and blocking on semaphores, to
be recorded while a set of applications was running.
The Hydra kernel was originally designed based on some
common assumptions. When experimentation
showed these to be false, the Hydra kernel was redesigned.
The C.mmp: Lessons Learned
The
researchers were able to implement the C.mmp as “a cost–effective, symmetric
multiprocessor” and distribute the Hydra kernel over all of the processors.
The use of two variants of the PDP–11 was considered a mistake, as it complicated the process of making the necessary processor and operating system modifications. The authors had used
newer variants of the PDP–11 in order to gain speed, but concluded that “It
would have been better to have had a single processor model, regardless of
speed”.
The
critical component was expected to be the crossbar switch. Experience showed the switch to be “very
reliable, and fast enough”. Early
expectations that the “raw speed” of the switch would be important were not
supported by experience.
The
authors concluded that “most applications are sped up by decomposing their
algorithms to use the multiprocessor structure, not by executing on processors
with short memory access times”.
The
simplicity of the Hydra kernel, with much system software built on top of it,
yielded benefits, such as few software errors caused by inadequate
synchronization.
The C.mmp: More Lessons Learned
Here I
quote from Wulf & Harbison [Ref. 6], arranging their comments in an order
not found in their original. The PDP–11
was a memory–mapped architecture with a single bus, called the UNIBUS, that connected the CPU to both memory and I/O
devices.
1. “Hardware (un)reliability was our largest
day–to–day disappointment … The aggregate mean–time–between–failure (MTBF) of
C.mmp/Hydra fluctuated between two to six hours.”
2. “About two–thirds of the failures were
directly attributable to hardware problems.
There is insufficient fault detection built into the hardware.”
3. “We found the PDP–11 UNIBUS to be especially
noisy and error–prone.”
4. “The crosspoint [crossbar] switch is too
trusting of other components; it can be hung by malfunctioning memories or
processors.”
My favorite lesson learned is summarized in the
following two paragraphs in the report.
“We made a serious error in not writing good
diagnostics for the hardware. The
software developers should have written such programs for the hardware.”
“In our experience, diagnostics written by the
hardware group often did not test components under the type of load generated
by Hydra, resulting in much finger–pointing between groups.”
Task Management in Multicomputers
The
basic idea behind both multicomputers and multiprocessors is to run multiple
tasks or multiple task threads at the same time. This goal leads to a number of requirements,
especially since it is commonly assumed that any user program will be able to
spawn a number of independently executing tasks or processes or threads.
According
to Baron and Higbie [Ref. 5], any multicomputer or multiprocessor system must
provide facilities for these five task–management capabilities.
1. Initiation: A process must be able to spawn another process; that is, generate another process and activate it.
2. Synchronization: A process must be able to suspend itself or another process until some sort of external synchronizing event occurs.
3. Exclusion: A process must be able to monopolize a shared resource, such as data or code, to prevent “lost updates”.
4. Communication: A process must be able to exchange messages with any other active process that is executing on the system.
5. Termination: A process must be able to terminate itself and release all resources being used, without any memory leaks.
These facilities are more efficiently provided if
there is sufficient hardware support.
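As an illustration only (it is not taken from Baron and Higbie), the short C program below shows one way the five capabilities map onto the POSIX threads interface on a shared–memory system: initiation is pthread_create, synchronization is the condition variable, exclusion is the mutex, termination is the thread’s return combined with pthread_join, and communication is approximated here by a shared variable rather than true message passing. The names are mine.

    /* Compile with:  cc -pthread tasks.c */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;  /* exclusion       */
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;   /* synchronization */
    static int message = 0;                                     /* communication   */

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);              /* exclusion: monopolize shared data  */
        while (message == 0)
            pthread_cond_wait(&ready, &lock);   /* synchronization: wait for an event */
        printf("worker received %d\n", message);
        pthread_mutex_unlock(&lock);
        return NULL;                            /* termination: thread exits cleanly  */
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);   /* initiation: spawn a new thread */

        pthread_mutex_lock(&lock);
        message = 42;                               /* communication: hand over a value */
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);

        pthread_join(tid, NULL);                    /* reclaim the thread's resources */
        return 0;
    }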
Hardware Support for Multitasking
Any
processor or group of processors that supports multitasking will do so more
efficiently if the hardware provides an appropriate primitive operation.
A test–and–set operation with a binary semaphore (also called a “lock variable”) can be used for both
mutual exclusion and process synchronization.
This is best implemented as an atomic
operation, which in this context is one that cannot be interrupted until it
completes execution. It either executes
completely or fails.
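As one possible illustration (this is portable C11, not any particular machine’s test–and–set instruction), a spin lock built from an atomic test–and–set on a binary semaphore might look like the sketch below.

    #include <stdatomic.h>

    /* The binary semaphore, or "lock variable".  ATOMIC_FLAG_INIT is the clear state. */
    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void acquire(void)
    {
        /* atomic_flag_test_and_set() atomically sets the flag and returns its
           previous value, so the loop spins until it observes the clear state. */
        while (atomic_flag_test_and_set(&lock))
            ;                               /* busy-wait until the lock is free */
    }

    void release(void)
    {
        atomic_flag_clear(&lock);           /* clear the semaphore atomically   */
    }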
The MIPS architecture provides another set of instructions to support synchronization. In this design, synchronization is achieved by using a pair of instructions, issued in sequence. After the first instruction in the sequence is executed, the second is executed and returns a value from which it can be deduced whether or not the instruction pair was executed as if it were a single atomic instruction.
The MIPS instruction pair is load linked (ll) and store conditional (sc).
These
are often used in a spin lock
scenario, in which a processor executes in a tight loop awaiting the
availability of the shared resource that has been locked by another processor.
In fact, the spin lock is not a necessary part of the design; it is simply the most common use of the instruction pair.
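MIPS assembly is beyond the scope of these notes, but the retry pattern can be sketched in portable C11: atomic_compare_exchange_weak is allowed to fail spuriously, much as a store conditional may, and on processors that provide load linked / store conditional it is typically compiled into exactly such a loop. The atomic-add function below is my own example, not the textbook’s.

    #include <stdatomic.h>

    /* Atomically add 'delta' to *counter and return the new value, using a
       load-linked / store-conditional style retry loop.                      */
    int atomic_add(_Atomic int *counter, int delta)
    {
        int old = atomic_load(counter);     /* analogous to load linked        */
        /* If another processor changed *counter in the meantime, the exchange
           fails (like a failed store conditional), 'old' is refreshed with the
           current value, and we simply retry.                                 */
        while (!atomic_compare_exchange_weak(counter, &old, old + delta))
            ;
        return old + delta;
    }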
Clusters, Grids, and the Like
There
are many applications amenable to an even looser grouping of
multicomputers. These often use
collections of commercially available computers, rather than just connecting a
number of processors together in a special network.
In the past there have been problems in administering large clusters of computers, with the cost of administration scaling as a linear function of the number of processors. Recent developments in automated tools for remote management are likely to help here.
It
appears that blade servers are one
of the more recent adaptations of the cluster concept. The major advance represented by blade
servers is the ease of mounting and interconnecting the individual computers,
called “blades”, in the cluster.
In this respect, the blade server hearkens back to the 1970’s and the innovation in instrumentation called “CAMAC”, which was a rack with a standard bus structure for interconnecting instruments. This replaced the jungle of interconnecting wires, often so complex that it required a dedicated technician to keep the connections intact.
Clusters
can be placed in physical proximity, as in the case of blade servers, or at
some distance and communicate via established networks, such as the
Internet. When a network is used for
communication, it is often designed using TCP/IP on top of Ethernet simply due
to the wealth of experience with this combination.
References
In this lecture, material from one or more of the
following references has been used.
1. Computer Organization and Design, David A. Patterson & John L. Hennessy, Morgan Kaufmann (3rd Edition, Revised Printing), 2007. (The course textbook.) ISBN 978-0-12-370606-5.
2. Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 1990. There is a later edition. ISBN 1-55860-069-8.
3. High-Performance Computer Architecture, Harold S. Stone, Addison-Wesley (Third Edition), 1993. ISBN 0-201-52688-3.
4. Structured Computer Organization, Andrew S. Tanenbaum, Pearson/Prentice-Hall (Fifth Edition), 2006. ISBN 0-13-148521-0.
5. Computer Architecture, Robert J. Baron and Lee Higbie, Addison-Wesley Publishing Company, 1992. ISBN 0-201-50923-7.
6. W. A. Wulf and S. P. Harbison, “Reflections in a pool of processors / An experience report on C.mmp/Hydra”, Proceedings of the National Computer Conference (AFIPS), June 1978.