
Short form Guide to Vectorization/Parallelization
The Cray YMPEL has four identical vector processors.
In order to make optimal use of the machine, there is a clear priority
of goals:
1) Make sure your code vectorizes well.
This ensures that one individual processor may be put to
good use for your code.
2) Only after #1 has been taken care of, maybe you can also
parallelize your code, i.e. ensure that a good part of it
can be broken into four pieces which can excute concurrently
in parallel on the four processors.
Both tasks will be described below. The vectorization task
is equally important for other vector processors, like the
Cray1, XMP, YMP, C90, T90, J90, the CDC Cyber 205/ETA10,
or the modern CrayX1. The parallelization task comes in
only for parallel processors, i.e. the CrayXMP, YMP, C90,
T90, J90, X1, but also MPP (massively parallel) systems
like the CrayT3E. There is no principal difference between
the very first parallel vector machine, the CrayXMP, and the
most modern CrayX1 in this respect.
The actual transformation of your code into vector operations
and parallel tasks can be taken care of automatically by the
compiling/optimizing systems. The compilers available on our
Cray YMPEL are:
cc Cray Ccompiler (details see man cc)
cf77 Cray FORTRAN77 compiler (details see man cf77)
It is the task of the programmer, however, to make sure that
the compiler/optimizer can do good work, that is that the
algorithms used are fit for vectorization/parallelization.
That is the topic of the present text.
Vectorization essentially implies that there is an inner
loop where the individual iterations are independent of
each other. Typical example:
x(i) = a(i) * c + b(i)
Here, the calculation for one value of i is totally
independent of the calculation for any other value of i.
This independence ensures that the results
do not depend upon the sequence of operations, i.e. some
operations can be carried out quasisimultaneously.
The pipelining inherent in vector operations is logically
equivalent to parallel operations.
Let us start with a simple example. We will use FORTRAN77
(in reality, all we do is FORTRAN IV)
throughout in the hope that it will be understandable to
everyone.
do 1 i=2,64000
1 x(i) = x(i1) * x(i)
or sequentially
x(2)=x(1)*x(2) x(3)=x(2)*x(3) x(4)=x(3)*x(4) etc
This loop is recursive. Each multiply needs to wait for the
result of the previous multiply. Nothing can be done in parallel.
This loop neither vectorizes nor can it be parallelized
without changing the algorithm. Recursion is to be strictly
avoided. This is even bad for scalar processors which tend
to be somewhat pipelined since many years (CDC 6600 onward).
This property changes drastically if i1 is replaced by i+1
within above loop 1:
x(i) = x(i+1) * x(i)
or sequentially
x(1)=x(2)*x(1) x(2)=x(3)*x(2) x(3)=x(4)*x(2)
Now, all multiplies are independent of each other, the results are not reused.
The only requirement is not to change the sequence of
operations. That is sufficient for pipelining: no change in
sequence, but result #1 will not be available quite some
time before calculation #2 is begun. Therefore, this new
loop vectorizes (all operations independent) as well as
parallelizes perfectly. Of course, we have an entirely
different algorithm now presumably solving a totally
different problem.
A slightly more complex case is
do 2 i=1,n
do 2 k=1,n
s=0.0
do 3 j=1,n
3 s=s+a(i,j)*b(j,k)
2 c(i,k)=s
This obviously is a matrix multiply by inner products. The
addition is recursive like the multiplication in loop 1.
Therefore, this code will not vectorize. One could say that
all the (i,k) operations are independent of each other;
therefore the code can be parallelized, i.e. split into
four tasks for the four processors. Each task is doing some
of the scalar products. However, each processor
would be used very inefficiently. This should not be done
therefore; there are more efficient ways for matrix multiply.
One generally helpful recipe for vectorization of recursive
loops (which can be applied only by the programmer, not by any
automatic vectorizer) is the exchange of inner and outer
loops. Look at this rewrite of our matrix multiply
do 4 i=1,n
do 4 k=1,n
4 c(i,k)=0.0
do 5 j=1,n
do 5 i=1,n
do 5 k=1,n
5 c(i,k)=c(i,k)+a(i,j)*b(j,k)
Loop 4 is negligible with respect to loop 5. The inner one
of the loops 5 (kloop) is vectorizable: the operations
for all k, given fixed i,j are independent of each other.
So our primary goal, vbectorizable loop. has been achieved.
However, the code is far from optimal yet. One reason is
that FORTRAN arrays are stored by column, i.e. the first index
runs fastest, i.e. addresses consecutive storage locations.
It is always advisable that the innermost loop should increment
addresses consecutively, since that is the most efficient way
of memory access. Therefore, it is good to exchange the iloop
and the kloop
do 5 j=1,n
do 5 k=1,n
do 5 i=1,n
5 c(i,k)=c(i,k)+a(i,j)*b(j,k)
Now, the inner loop not only vectorizes, but it does so
with stride 1, the optimum memory access pattern. For given j,k,
all the i elements are independent of each other, and to be found
in storage with address increment 1. This is the optimal vector
loop. Furthermore, n should be a multiple of 64 or large with
respect to 64. This condition arises since the length of the
vector registers within the early Cray machines is 64, so that
any vector operation is broken down into chunks of size 64.
Next, let us consider parallelization. For fixed j, all operations
in k are independent of each other. Therefore, one could dispatch
n/4 of the k operations to the 4 processors, to be executed
independently of each other, and not necessarily in sequence.
However, because of the recursion in j, this must be done for each
value of j independently. That necessitates n synchronization
operations between the 4 processors, quite a big overhead.
Can we improve upon that ? Again, exchanging the loop sequence
solves the problem:
do 5 k=1,n
do 5 j=1,n
do 5 i=1,n
5 c(i,k)=c(i,k)+a(i,j)*b(j,k)
The inner (i) loop is the same as before, optimally vectorizable.
The j loop is recursive and can neither be vectorized nor
parallelized. The koperations are independent, however. So we
can dispatch k=1...n/4, k=n/4+1...n/2, k=n/2+1...3n/4,
k=3n/4+1...n to the four processors to be executed independently.
We do not care about the sequence of k executions. We have only
one single rendevouz operation left: all processors done. This
is the optimum parallelization.
Another topic arises in vectorization: irregular memory access
patterns. This may arise in several different forms; most of these
can be dealt with by the gather operation:
do 6 i=1,n
6 v(i)=x(index(i))
and the scatter operation
do 7 i=1,n
7 v(index(i))=x(i)
These two operations are implemenetd on the hardware level in
modern vector computers like our Cray YMPEL. In order to work on
irregularly stored data elements, the rule of good
programming practice is to gather the required input data
into contiguous arrays (gather), work during the computation with these
contiguous arrays, and then distrbute the results to the
desired locations (scatter operation). The loops 6 and 7 above will
be recognized by the Cray vectorizer and will be transformed
into the corresponding hardware instructions.
Unfortunately the matrix multiply example above is not useful
for trying it out on the YMPEL. This is so since the compiler
in many cases recognizes the intent: matrix multiply, and instead
of compiling individual instructions generates a call to a
highly optimized assembler routine for matrix multiply.
While this is a very short guide, the basic principles outlined here
should carry you a long way to write good Cray YMPEL programs.
There also is a vast body of literature on these topics; search for
"vectorization" and/or "parallelization". The very
significant importance of these topics arises since there can be factors
as large as 100 in runtime between unoptimized or bad code,
i.e. single processor scalar code, and optimized parallel vector code.
For example matrix multiply ranges from 4.7 MFLOPS worstcase
to 450 MFLOPS best case on our YMPEL.
