King Abdullah University of Science & Technology (KAUST), 2011 Summer

High Performance Computing, eXtreme Technical Computing

Professor Craig C. Douglas



Homework Covers Worth
hw1 Memory bandwidth 20%
hw2 OpenMP and multithreading 20%
hw3 Cache aware algorithms 30%
hw4 MPI and distributed computing with Lapack and ScaLapack 30%

Advice, Hints, whatever...

All homework should be emailed to me before class on the date due unless another specific time is listed. Only one person in your group needs to send me the solution (preferably as a .tgz or .zip file with everything in it). Always put MA5490 in the Subject line of your message. I will send you a reply (from GMail) when I get your mail. If you do not get a reply, I might not have received your email.

You are free to program in C, C++, Fortran, or something else if you confirm it with me. I find C to be convenient myself, however, and the examples and handouts will all be in C.

If I give you software, check the Notes page often to see if there is an update. I take suggestions for improved software. If you think you found a bug, please send me information about it. I am always happy to see bug fixes or better code. Just because I have been programming since 1968 does not mean I write the best code.

What you should turn in:

As stated in the first class, it is nearly impossible to cheat in this class as long as your group works on the assignments in the computer lab and turns in what you worked on. You are allowed to discuss concepts with other groups. Please do not copy verbatim another group's code, however.



Find the STREAM benchmark on the Internet and investigate its home web site. Download the code and then benchmark

  1. A computer in the classroom.
  2. Your personal computer(s), preferably using more than one operating system.
  3. Any other computer that you find interesting to benchmark including multicore or distributed memory computers. The more the merrier to a degree.

What to turn in

Turn in a report describing the computers benchmarked and their scores. During the semester, you may want to add to your list as we progress through OpenMP and MPI.


We will explore overlapped I/O and computing using multiple threads on a single CPU:

1. Download the software (hw1.tgz or and familiarize yourself with it. Inside the packed files are several files:

A simple way to see how this all works is to unpack the files. Then in a Terminal window in the hw-mm directory type the command make run.

2. Start by implementing MM-mult using a simple formula without OpenMP.

Use the simple formula for cij in (2) above. The tricky part is that you have to get the right blocks of A and B into memory before you can compute any element of C. Work that out on paper before programming and include it in your homework documentation.

3. Add OpenMP last.

You will need to implement a way of communicating with different threads using shared memory to tell one or more threads what disk block(s) to read or write. Your computing thread will need to know when data is available. It will also need to schedule blocks to be brought into memory (so it can compute on blocks already in memory). Once you have read enough blocks into memory, you should be able to make MM-mult compute bound (able to compute without waiting for input from the disk files).

What to turn in

Turn in a report describing the results, the files needed to make an executable code, and how to make and run the code. Give conditions when your code is compute bound based on the computer you used (and state what that was in the report).


We will explore several key areas for cache aware programming in this homework:

1. Download the software (hw2.tgz or and familiarize yourself with it. Inside the packed files are containing several files in a folder hw2:

A simple way to see how this all works is to create a directory called hw-db, change to it, and unpack the files. Then in a Terminal window in the hw-db directory type the command make run.

2. You should add functionality to make accessing the hash table cache aware once the table has been created (but possibly before it has been accessed a lot of times).

3. Do experiments to demonstrate that your code really is better at using the cache than the original code and clearly document what experiments you did and why your code is better at using the cache. It is easy to do this assignment all wrong. Here are some hints:

4. Useful things to think about are

What to turn in


Solve numerically the heat equation on a linear mesh:

(1)  ut = uxx in [0,1]x[0,T], T>>0

with an initial condition at t = 0 of

(2)   u(x,0) = 63sin(8(pi)x).

Due to the initial conditions (2) and the partial differential equation (1),

        u(0,t) = u(1,t) = 0, for all 0 <= t <= T.

We can define a mesh in both space and time for computing numerical approximation of the solution to (1)-(2). Let h = (N-1)-1 and define xi = ih, i = 0, 1, ..., N. Then {xi} is a mesh for the spatial domain [0,1]. Let the difference in time lines be defined by H = T(L-1)-1 and define tk = kH, k = 0, 1, ..., L. Then {tk} is a mesh for the temportal domain [0,T]. The unknowns are

        {u(xi,tk)}, i = 1, 2, ..., N-1 and k = 1, 2, ..., L-1.

The spatial discretization of (1) is based on central differences:

(3)   uxx = h-2(-2Ui,k + Ui-1,k + Ui+1,k), where Ui,k = u(xi,tk).

There are numerous ways to discretize in time, some explicit and others implicit. The former is very fast and parallelizes easily and latter requires a direct solve every time step. Each has useful properties, which you will investigate. The theta methods are defined for a variable that controls how implicit the time stepping method is. We use the variables

(4)   c = Hh-2 and d = theta

in the formula below:

(5)   Ui,k+1 - Ui,k = c [ d( -2Ui,k+1 + Ui-1,k+1 + Ui+1,k+1 ) + (1-d)( -2Ui,k + Ui-1,k + Ui+1,k ) ].

The important values are d = 0 (Forward Euler), d = 1/2 (Crank-Nicolson), and d = 1 (Backward Euler). For convergence, we must have

       c <= (1/2)(1-2d)-1, if 0 <= d < (1/2)
       any c if (1/2) <= d <= 1.

To do:

(1) Write a serial code that solves (1)-(2) for the three important values of d. Look up Lapack (the SIAM book is online for free, see to find a tridiagonal solver to use for implicit schemes.
(2) Write a parallel code that solves (1)-(2) of (1). Use ScaLapack for the parallel solver.
(3) Use Matlab or something similar to graph the solutions. A movie from the data would be ideal.
(4) Explain when Forward Euler, Crank-Nicolson, or Backward Euler should be used over the other two options. Base your argument over the number of processors and time estimates for a given accuracy (you need to research the convergence rates to answer this question).

What to turn in

A report describing everything you did, the results and comparisons, and how to make and run your codes. It should be all inclusive. Turn in the codes.


Craig C. Douglas

Last modified: