University of Wyoming ERS 4990, MA 5490/4800, and COSC 4010/5010, 2016 Fall

1:20-2:35 on Tuesdays (EIC 201, Encana Auditorium) and Thursdays (ENG 2106)

Office hours (Ross Hall 227): T/R 11:00-12:00 and W 1:00-2:00

High Performance Computing, eXtreme Technical Computing

Professor Craig C. Douglas



Homework | Due                                   | Covers                                                             | Worth
Reading  | --                                    | Chapters 1-2 in Pacheco: why parallel programming and some basics  | --
hw1      | 9/8/2016                              | Historical data and predictions                                    | 10%
Reading  | --                                    | Chapter 5 in Pacheco: OpenMP                                       | --
hw2      | Step 2: 9/27/2016*; Steps 3-5: 12/8/2016 | Numerical linear algebra and OpenMP                             | 10%
hw3      | 10/18/2016                            | STREAM benchmark                                                   | 15%
Reading  | --                                    | Chapter 4 in Pacheco: MPI                                          | --
hw4      | 12/2/2016                             | Parlib and distributed communication                               | 15%
project  | optional                              | Build a HPC flow simulator and run it                              | --

* Bring hw2 in on time and we will talk about your reports. You can then modify it for 2 days.

Advice, Hints, whatever...

All homework should be emailed to me before class on the date due unless another specific time is listed. Only one person in your group needs to send me the solution (preferably as a .tgz or .zip file with everything in it). Always put HPC-XTC in the Subject line of your message. I will send you a reply when I get your mail. If you do not get a reply, I might not have received your email.

You are free to program in C, C++, Fortran, or something else if you confirm it with me. I find C to be convenient myself, however, and the examples and handouts will typically be in C.

If I give you software, check the Notes page often to see if there is an update. I take suggestions for improved software. If you think you found a bug, please send me information about it. I am always happy to see bug fixes or better code. Just because I have been programming since 1968 does not mean I write the best code.

What you should turn in:

As stated in the first class, it is nearly impossible to cheat in this class as long as your group works on the assignments in the computer lab and turns in what you worked on. You are allowed to discuss concepts with other groups. Please do not copy verbatim another group's code, however.


Part 1

For the following computer systems, research information about them (consult as a start):

Find out the following information for each system:

  1. Clock speed
  2. Number of nodes
  3. Number of cores or processors on a node
  4. Number of GPU nodes
  5. GPUs per node
  6. Memory per node
  7. Peak speed (in floating point operations per second) per processing element, per node, and for the whole system.
  8. Linpack performance for this system.
  9. Memory system bandwidth: how many bytes can be transferred within a node of the system per second.
  10. Network architecture.
  11. Network bandwidth: how many bytes can be sent off of a node in a second.
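For item 7, the peak speed is just a product of factors. A minimal sketch with made-up numbers (no real system implied; the function name is this sketch's own):

```c
/* Peak flops = clock rate (Hz) x flops per cycle per core
 *            x cores per node x number of nodes.
 * All inputs here are illustrative, not from any real machine. */
double peak_flops(double ghz, double flops_per_cycle,
                  int cores_per_node, int nodes)
{
    return ghz * 1e9 * flops_per_cycle
         * (double)cores_per_node * (double)nodes;
}
```

For example, a hypothetical 2.5 GHz core doing 8 double-precision flops per cycle gives 20 GFLOPS per processing element; multiply up by cores and nodes for the node and system peaks.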

Part 2

Review the historical data on the Top 500 systems in the world, which is available on the Top500 web site or in the free Top500 app in the Apple App Store. For example, the Top 500 lists are available in Microsoft Excel format for each year since 1993. From the historical data, make plots of the performance of the #1 system, the #100 system, and the #500 system for each list since June 2010. Using these plots, project what performance the #1, #100, and #500 systems will have in June 2019 and November 2022. Use Matlab to make the plots.

What to turn in

For Part 1, turn in a report using one of the writing systems in the Advice section. A table is sufficient. For Part 2, turn in your Matlab script and a report on the raw numbers with the graphs.


We will explore overlapping I/O and computation using multiple threads on a single node (which can have multiple CPUs, but a shared memory).

Equation (2) can be implemented with 3, 5, or 6 loops. For this assignment use 3 loops unless a part specifically says otherwise.

1. Download the software and familiarize yourself with it. The packed archive contains several files.

A simple way to see how this all works is to unpack the files. Then in a Terminal window in the hw2-mm directory type the command make run.

2. Start by implementing MM-mult using a simple formula without OpenMP and run on only one core with only one block per matrix.

Use the simple formula for cij in (2) above. The tricky part is that you have to get the right blocks of A and B into memory before you can compute any element of C. Work that out on paper before programming and include it in your homework documentation. Remember that you have control over the block shapes for each of the three matrices. You should enforce block shape restrictions for each matrix!
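For reference, the simple formula applied to a whole in-memory matrix is the classic triple loop; the tricky blocked, out-of-core version builds on it. A minimal sketch with row-major contiguous arrays (function and layout are this sketch's assumptions, not the assignment's files):

```c
#include <stddef.h>

/* Plain 3-loop matrix multiply, C = A * B, with A (n x p),
 * B (p x m), C (n x m), all row-major in contiguous storage. */
void mm_mult(size_t n, size_t p, size_t m,
             const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            double sum = 0.0;              /* c_ij accumulator */
            for (size_t k = 0; k < p; k++)
                sum += A[i * p + k] * B[k * m + j];
            C[i * m + j] = sum;
        }
}
```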

3. Add OpenMP last.

You will need to implement a way of communicating with different threads using shared memory to tell one or more threads what disk block(s) to read or write. Your computing thread will need to know when data is available. It will also need to schedule blocks to be brought into memory (so it can compute on blocks already in memory). Once you have read enough blocks into memory, you should be able to make MM-mult compute bound (i.e., able to compute without waiting for input from the disk files).
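One way to arrange the hand-off is a shared buffer plus a progress counter that the I/O thread advances and the compute thread polls. A toy sketch of that idea (the names, the stand-in "disk read", and the structure are all assumptions, not the assignment's software; compiled without OpenMP the two sections simply run one after the other):

```c
/* One I/O section fills buf[] and publishes progress through
 * "produced"; the compute section waits until block i is available
 * and then works on it.  Real code would read disk blocks and
 * multiply them instead of summing integers. */
#define NBLOCKS 8

double run_pipeline(void)
{
    double buf[NBLOCKS];          /* blocks "read from disk"      */
    int produced = 0;             /* blocks made available so far */
    double total = 0.0;

    #pragma omp parallel sections num_threads(2) shared(buf, produced, total)
    {
        #pragma omp section       /* I/O thread: fills the buffer */
        for (int i = 0; i < NBLOCKS; i++) {
            buf[i] = (double)(i + 1);   /* stand-in for a disk read */
            #pragma omp flush           /* publish buf[i] first     */
            #pragma omp atomic write
            produced = i + 1;
        }

        #pragma omp section       /* compute thread: drains blocks */
        for (int i = 0; i < NBLOCKS; i++) {
            int p;
            do {                        /* wait until block i exists */
                #pragma omp atomic read
                p = produced;
            } while (p <= i);
            #pragma omp flush
            total += buf[i];            /* stand-in for computation  */
        }
    }
    return total;                 /* 1 + 2 + ... + 8 = 36 */
}
```

A real version needs more than one buffer slot so the reader can stay ahead of the computation; that look-ahead is what makes MM-mult compute bound.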

4. Make cache aware.

You will experiment with different ways to make MM-mult run faster using the cache memory and tricks (including 5 or 6 loops instead of 3 and anything else you can figure out). Create a new MM-mult-ca function. Do not modify your old MM-mult function.

Suppose the matrices are square with N rows and columns, C is initialized to zero, and the block size s divides N. An example of six loops is given by

     for( i = 0 ; i < N ; i += s )
       for( j = 0 ; j < N ; j += s )
         for( k = 0 ; k < N ; k += s )
           for( l = i ; l < i+s ; l++ )
             for( m = j ; m < j+s ; m++ )
               for( n = k ; n < k+s ; n++ )
                 C[l][m] += A[l][n] * B[n][m];

Compare your 6 (or 5) loop implementation to your code in the previous step and to using DGEMM from the BLAS (if you need to download and create your own BLAS library, consider ATLAS).

5. Mt Moran

Run your codes from steps 2, 3, and 4 on Mt Moran at ARCC. Provide two graphs (one each of the run times and speedups) for 1, 2, 4, 8, 16, and 32 threads using the same, large matrices A and B. Then repeat using matrices that double in size each time you double the number of threads. Use the batch system to submit your jobs.

What to turn in

Turn in a report describing the results, the files needed to make an executable code, and how to make and run the code. Give conditions under which your code is compute bound on the computer you used (and state what that computer was in the report). Do not delete your report until told it is okay to do so, since more parts may be added beyond those currently listed.

As you complete each step of the homework, add to your report. Describe in detail what you did and how I can run your code to see similar results. Be specific (give block sizes, matrix sizes, any restrictions you imposed, etc.).


Find the STREAM benchmark on the Internet and investigate its home web site. Download the code and then benchmark:

  1. A computer in either EN 2106 or Ross Hall 241.
  2. Your personal computer(s), preferably using more than one operating system.
  3. Any other computer that you find interesting to benchmark, including multicore or distributed memory computers. The more the merrier, to a degree.

What to turn in

Turn in a report describing the computers benchmarked and their scores.


Complete the codes in the Parlib lecture. This is an individual assignment (no groups). You should turn in your codes and a report in which you do a scaling study to show wall clock timings, speedup, and parallel efficiency. As it says in the Parlib lecture, you should never, ever use anything beginning with MPI_...

Optional Project

Download the Open Porous Media flow code (get the source code) and at least one dataset. Build the code on Mt. Moran (or equivalent) and run it.

Now, to be realistic, before you build the code from source files, download a binary executable and experiment with a working version.

You are welcome to show off your results after the course is over and ask questions. You will probably find this to be an interesting way to spend a day or two over the break while your ARCC accounts are still active.


Craig C. Douglas
