## Homework

| Homework | Due | Covers | Worth |
|----------|-----|--------|-------|
| hw1 | 1/19/2012 | Historical data and future predictions | 15% |
| hw2 | 1/26/2012 | Memory bandwidth | 15% |
| hw3 | 2/29/2012 | OpenMP and multithreading | 20% |
| hw4 | 3/31/2012 | Cache aware algorithms | 20% |
| hw5 | 4/27/2012 | MPI and distributed computing with Lapack and ScaLapack | 30% |

## Advice, Hints, whatever...

All homework should be emailed to me before class on the date due unless another specific time is listed. Only one person in your group needs to send me the solution (preferably as a .tgz or .zip file with everything in it).

Always put MA5490 in the Subject line of your message. I will send you a reply (from GMail) when I get your mail. If you do not get a reply, I might not have received your email.

You are free to program in C, C++, Fortran, or something else if you confirm it with me. I find C to be convenient myself, however, and the examples and handouts will all be in C.

If I give you software, check the Notes page often to see if there is an update. I take suggestions for improved software. If you think you found a bug, please send me information about it. I am happy to see bug fixes or better code. Just because I have been programming since 1968 does not mean I write the best code.

What you should always turn in:

- All codes in a compact manner.
- A description of the codes in a file (PDF, RTF, Word, Google Docs, or LaTeX).
- How to make the code (this could be as simple as saying, "run the make command" with or without some argument(s)).
- How to run the code (if possible, have a make run option in your Makefile).

As stated in the first class, it is nearly impossible to cheat in this class as long as your group works on the assignments in the computer lab and turns in what you worked on. You are allowed to discuss concepts with other groups. Please do not copy verbatim another group's code, however.

## HW1

## Part 1

For the following computer systems, research information about them (consult http://www.top500.org):

- Cray 1
- K System at the RIKEN Advanced Institute for Computational Science (Kobe, Japan)
- NUDT YH MPP (Tianhe-1A) at National Supercomputing Center in Tianjin
- Cray XT5-HE (Jaguar) at Oak Ridge National Lab
Find out the following information for each system:

- Clock Speed
- Number of Nodes
- Number of Cores or Processors on a Node
- Memory Per Node
- Peak Speed (in floating point operations per second) per processing element, per node, and for the whole system.
- Linpack performance for this system.
- Memory System Bandwidth: how many bytes can be transferred within a node of the system per second.
- Network Architecture.
- Network Bandwidth: how many bytes can be sent off of a node in a second.
## Part 2

Review the historical data on the top 500 systems in the world, available at http://www.top500.org or through the free Top500 app in the Apple App Store. For example, the top 500 lists are available in Microsoft Excel format for each year since 1993. From the historical data, make plots of the performance of the #1 system, the #100 system, and the #500 system for each year since June 2000. Using these plots, project what performance the #1, #100, and #500 systems will have in June 2012, November 2015, and November 2018. Use Matlab to make the plots.

## What to turn in

For Part 1, turn in a report using one of the writing systems in the Advice section. A table is sufficient. For Part 2, turn in your Matlab script and a report on the raw numbers with the graphs.

## HW2

Find the STREAM benchmark on the Internet and investigate its home web site. Download the code and then benchmark

- A computer in the classroom.
- Your personal computer(s), preferably using more than one operating system.
- Any other computer that you find interesting to benchmark, including multicore or distributed memory computers. The more the merrier, to a degree.
## What to turn in

Turn in a report describing the computers benchmarked and their scores. During the semester, you may want to add to your list as we progress through OpenMP and MPI.

## HW3

We will explore overlapped I/O and computing using multiple threads on a single CPU:

- OpenMP for simple threading on a coarse level.
- Overlapped computing and input/output to a set of disk files using threads to completely hide the input/output steps.
- Matrix-matrix multiplication (MM-mult): C = AB, where A, B, and C are matrices. Recall that if A is m x k and B is k x n, then C is m x n. If we denote the elements of the matrices by

(1) A = [ a_{ij}], B = [ b_{ij}],

and

(2) C = [ c_{ij} ], then c_{ij} = a_{i1}*b_{1j} + a_{i2}*b_{2j} + ... + a_{ik}*b_{kj}.

1. Download the software (hw-mm.tgz or hw-mm.zip) and familiarize yourself with it. Inside the packed files is a subdirectory hw-mm containing several files:

- mkmatrices.c contains an example program to create A and B in blocks as disk files.
- matrices.c contains functions to allocate memory for, print, write, and read the blocks.
- matrices.h contains headers for the functions in matrices.c.
- timer.c contains a high resolution wall clock timer.
- Makefile contains information for the program make. There are instructions on using it at the top of the file.

A simple way to see how this all works is to unpack the files. Then in a Terminal window in the hw-mm directory type the command

make run.

2. Start by implementing MM-mult using a simple formula without OpenMP.

Use the simple formula for c_{ij} in (2) above. The tricky part is that you have to get the right blocks of A and B into memory before you can compute any element of C. Work that out on paper before programming and include it in your homework documentation.

3. Add OpenMP last.

You will need to implement a way of communicating with different threads using shared memory to tell one or more threads what disk block(s) to read or write. Your computing thread will need to know when data is available. It will also need to schedule blocks to be brought into memory (so it can compute on blocks already in memory). Once you have read enough blocks into memory, you should be able to make MM-mult compute bound (able to compute without waiting for input from the disk files).

## What to turn in

Turn in a report describing the results, the files needed to make an executable code, and how to make and run the code. Give conditions when your code is compute bound based on the computer you used (and state what that was in the report).

## HW4

We will explore several key areas for cache aware programming in this homework:

1. Download the software (hw-db.tgz or hw-db.zip) and familiarize yourself with it. Inside the packed files are several files:

- mkhashtbl.c contains an example program to create and print a hash table.
- hashtbl.c contains functions to allocate, lookup inside of, add to, and print elements in a hash table.
- hashtbl.h contains headers for the functions in hashtbl.c.
- timer.c contains a high resolution wall clock timer.
- Makefile contains information for the program make. There are instructions on using it at the top of the file.

A simple way to see how this all works is to create a directory called hw-db, change to it, and unpack the files. Then in a Terminal window in the hw-db directory type the command

make run.

2. You should add functionality to make accessing the hash table cache aware once the table has been created (but possibly before it has been accessed a lot of times).

3. Do experiments to demonstrate that your code really is better at using the cache than the original code and clearly document what experiments you did and why your code is better at using the cache. It is easy to do this assignment all wrong. Here are some hints:

- Do not change my code. Let it build a complete hash table with a lot of <key,data> pairs. Consider this table fixed and assume that no more <key,data> pairs will be added in the future. It is static, not dynamic at this point.
- Rebuild the hash table using cache aware data structures that also provide extra information that I did not store about the <key,data> pairs in order to reduce the time it takes to do a look up. Do not make any assumptions about the size of the hash table nor how many collisions you might have.
- Create a lot of sample keys: some in the hash table and some new random ones that might not be in the table.
- Time how long it takes to look up all of the sample keys using my look up function on my version of the hash table.
- Time how long it takes to look up all of the sample keys using your look up function on your version of the hash table.
4. Useful things to think about are

- Suppose there are 1,000 - 100,000 <key,data> pairs in a hash table of size at least 101 and 100,000 - 1,000,000 look ups in hints 4-5 above.
- While not part of the assignment, how would either OpenMP or MPI be useful?
## What to turn in

- Your code (a separate file from hashtbl.c that I gave you), updated Makefile and mkhashtbl.c, and how to run your code with your inputs to any questions asked by the main program.
- Your write up of the timing results and a comparison.
- An explanation why your new data structures use cache better than the simple minded linked list implementation I gave you.
- Some remarks on how to add parallelism using OpenMP, MPI, or both.

## HW5

Solve numerically the heat equation on a linear mesh:

(1) u_{t}= u_{xx}in [0,1]x[0,T], T>>0

with an initial condition at t = 0 of

(2) u(x,0) = 63 sin(8 pi x).

Due to the initial conditions (2) and the partial differential equation (1),

u(0,t) = u(1,t) = 0, for all 0 <= t <= T.

We can define a mesh in both space and time for computing a numerical approximation of the solution to (1)-(2). Let h = (N-1)^{-1} and define x_{i} = ih, i = 0, 1, ..., N-1. Then {x_{i}} is a mesh for the spatial domain [0,1]. Let the difference in time lines be defined by H = T(L-1)^{-1} and define t_{k} = kH, k = 0, 1, ..., L-1. Then {t_{k}} is a mesh for the temporal domain [0,T]. The unknowns are

{u(x_{i},t_{k})}, i = 1, 2, ..., N-2 and k = 1, 2, ..., L-1.

The spatial discretization of (1) is based on central differences:

(3) u_{xx} = h^{-2}(-2U_{i,k} + U_{i-1,k} + U_{i+1,k}), where U_{i,k} = u(x_{i},t_{k}).

There are numerous ways to discretize in time, some explicit and others implicit. The former are very fast and parallelize easily; the latter require a direct solve every time step. Each has useful properties, which you will investigate. The theta methods are defined by a parameter that controls how implicit the time stepping method is. We use the variables

(4) c = Hh^{-2} and d = theta

in the formula below:

(5) U_{i,k+1} - U_{i,k} = c [ d(-2U_{i,k+1} + U_{i-1,k+1} + U_{i+1,k+1}) + (1-d)(-2U_{i,k} + U_{i-1,k} + U_{i+1,k}) ].

The important values are d = 0 (Forward Euler), d = 1/2 (Crank-Nicolson), and d = 1 (Backward Euler). For convergence, we must have

c <= (1/2)(1-2d)^{-1}, if 0 <= d < (1/2)

and

any c if (1/2) <= d <= 1.

To do:

(1) Write a serial code that solves (1)-(2) for the three important values of d. Look up Lapack (the SIAM book is online for free, see http://www.netlib.org/lapack/lug) to find a tridiagonal solver to use for implicit schemes.

(2) Write a parallel code that solves (1)-(2) as in item (1). Use ScaLapack for the parallel solver.

(3) Use Matlab or something similar to graph the solutions. A movie from the data would be ideal.

(4) Explain when Forward Euler, Crank-Nicolson, or Backward Euler should be used over the other two options. Base your argument on the number of processors and time estimates for a given accuracy (you will need to research the convergence rates to answer this question).

## What to turn in

A report describing everything you did, the results and comparisons, and how to make and run your codes. It should be all inclusive. Turn in the codes.