## Homework, Projects, Quizzes, and Exams

All homework should be emailed to me. Send your files as a single zip or gzipped tar file, named lastname.zip or lastname.tgz.

I will send you a reply when I get your email. If you do not get a reply, I did not get your email. You are responsible for having an email account that can communicate successfully with me. I am somewhat flexible about which of my email accounts you use; I will send email from GMail, however. Please send email to my GMail account, since that is where mail to my university address eventually forwards anyway.

| Description | Homework | Quiz | Due | Worth |
|---|---|---|---|---|
| hw1 | X | | 1/12/2008 | 20 |
| hw2 | X | | 1/28/2008 | 100 |
| quiz1 and solution | | X | 2/2/2009 | 100 |
| hw3 | X | | 2/27/2009 | 100 |
| hw4 | X | | 3/6/2009 | 100 |
| hw5 | X | | 4/3/2009 | 150 |
| hw6 | X | | 4/10/2009 | 150 |
| hw7 (extra credit) | X | | 4/30/2009 | 100 |

## Projects

You are expected to research a topic in parallel computing and make a presentation using the computer projection system in the classroom by the end of April. The presentations should be about 20 minutes in length. A good rule of thumb is no more than 20 slides in 20 minutes. Do not read the slides to the class. Talk about the slide's contents and use the slide as a guide to the spoken material. Your presentation should be of the same quality as one given at an international conference (meaning of higher quality than a student seminar) or a thesis defense (though without as much depth as a thesis defense). You should provide the following material:

- Define your topic.
- Define the mathematics and algorithms used in parallel computing implementations.
- Give a summary of results in the field.
- State what the issues are in scaling this problem up to massive parallelism (and state how parallel this problem currently is).
You may find the examples from a course I taught at the University of Kentucky useful for seeing what students did in 75 minutes on a variety of topics. Obviously, I do not want to see those slides again. If I fall asleep during your presentation, then I will not know what you did and will not grade it highly. I suggest coming to see me a week before your presentation to the class to get feedback from me.

| Who | Topic | When |
|---|---|---|
| Cerwinsky/Wiblimo | Random number generation | 4/22/2009 |
| Burgess/Flynt | Computational fluid dynamics | 4/22/2009 |
| Rigelo/Price | Galerkin methods and adaptive mesh generation? | 4/27/2009 |
| Grace/Tang | General relativity | 4/27/2009 |
| Skari/Wibisono | Subsurface flow modeling | 4/27/2009 |
| Raghu/Sivasankaran | Graph algorithms | 4/29/2009 |

The project is worth the sum of the points for hw1 - hw6.

## Hw1

Fill out the course questionnaire and email it back to me as an attachment. Note that it is a .doc Word document. Please return it as such (not pdf, docx, txt, rtf, xls, or anything else).

## Hw2

Use the Matlab code gen_sparse.m to generate a large sparse matrix. Write a C program that is modular. You should write several small functions including ones that do the following:

- Reads a sparse matrix A into dynamically allocated memory.
- Prints out a sparse matrix. Use it to print A after reading it from the file A.txt.
- Prints out a vector.
- Computes a sparse matrix-vector multiplication, e.g., b = Ax. Use your vector-printing function to print both x and b.
Be careful how you write your code since you will have to modify it at least twice more this semester in later homeworks. At some point you will want to time your code to see how fast it is. This will come later in the course. Right now concentrate on getting codes that work correctly.

Things you should or should not do include the following:

- Store only the nonzeroes of your sparse matrix A. No dense matrices with lots of zeroes (i.e., do *not* define a matrix A by `double A[n][n]`, where n is the order of the matrix) are allowed.
- Consider storing the sparse matrix using 3 vectors or as 1 vector of length the number of nonzeroes in the matrix.
- Being modular will really help you later. Write general functions, but make your test cases small and simple. For example, you should be able to open any file (the file name should be a char* argument in the function definition) and call the function with the file name 'A.txt' to show that it works.
- Make A*0 = 0 (here, 0 is the zero vector of length N) your first test. Make A*1 = sum of row elements your second case. Make Ax your last case for some nonconstant and nontrivial x.
- Try to keep each function and the main program simple. Do not try to solve all problems in the world all at once. That is what Obama was elected to do immediately.
Turn in your code along with your test cases. In a single file, write a short description of each file that you turn in and call it Readme.{txt,doc, or whatever}.

## Hw3

Parallel codes are frequently written to use only one parallel computing communication system, e.g., MPI. For example, calls to specific MPI functions are embedded directly into a code. This is problematic if you ever have to change communications systems (e.g., to something other than MPI). A cleaner way to do communication is to write a simple library of routines that do what you want a parallel communications system to do independent of the specific system. Some simple functionality that you need for most or all parallel codes includes the following:

- Start the parallel communications library up. This also provides information about how many processes there are and which one the caller is. In MPI, this requires calling three functions. MPI function(s) of interest: MPI_Init, MPI_Comm_size, MPI_Comm_rank.
- Send data to another process. This should be flexible enough to send any type of data. Options should include whether this is a blocking or nonblocking send. MPI function(s) of interest: MPI_Send, MPI_Isend.
- Receive data from another process. This should be flexible enough to receive any type of data. Options should include whether this is a blocking or nonblocking receive. MPI function(s) of interest: MPI_Recv, MPI_Irecv.
- Broadcast to all processes. Send information from one process to all processes. MPI function(s) of interest: MPI_Bcast.
- Global operation collection. Collect the result from all processes for an operation (e.g., vector sum, smallest or largest value, etc.). MPI function(s) of interest: MPI_Reduce.
- Wait for an event. This should include waiting for a nonblocking send or receive or a barrier. MPI function(s) of interest: MPI_Wait, MPI_Waitall, MPI_Waitany.
- Wall clock timing. Get the wall clock time so that you can time your code. MPI function(s) of interest: MPI_Wtime.
- Shut the parallel communications library down. You need to decide whether or not this is going to return. MPI function(s) of interest: MPI_Finalize.

Test your library by revising an existing MPI-based C program, such as matmul_mpi.c, rewriting the existing code to call your library routines instead of MPI functions directly. Compare the answers from the original code and the modified one that uses your library. The results should be the same.
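One way to structure such a library is a small header of system-neutral declarations. Everything below is an invented sketch, not a required interface; the comments note which MPI routines would sit behind each call in an MPI implementation.

```c
/* par_comm.h -- sketch of a communication-system-neutral interface.
   All names and signatures are illustrative. */
#ifndef PAR_COMM_H
#define PAR_COMM_H

#include <stddef.h>

typedef struct par_request par_request;  /* opaque handle for nonblocking ops */

/* Start up; report process count and the caller's id.
   (MPI_Init, MPI_Comm_size, MPI_Comm_rank) */
int par_init(int *argc, char ***argv, int *nprocs, int *myid);

/* Send len bytes to process dest; blocking or not. (MPI_Send / MPI_Isend) */
int par_send(const void *buf, size_t len, int dest, int tag,
             int blocking, par_request *req);

/* Receive up to len bytes from process src. (MPI_Recv / MPI_Irecv) */
int par_recv(void *buf, size_t len, int src, int tag,
             int blocking, par_request *req);

/* One process sends to all. (MPI_Bcast) */
int par_broadcast(void *buf, size_t len, int root);

/* Combine a value from all processes: sum, min, max, ... (MPI_Reduce) */
typedef enum { PAR_SUM, PAR_MIN, PAR_MAX } par_op;
int par_reduce(const void *sendbuf, void *recvbuf, size_t len,
               par_op op, int root);

/* Wait for nonblocking operations or a barrier.
   (MPI_Wait, MPI_Waitall, MPI_Waitany) */
int par_wait(par_request *reqs, int nreqs);
int par_barrier(void);

/* Wall clock time in seconds. (MPI_Wtime) */
double par_wtime(void);

/* Shut the library down. (MPI_Finalize) */
int par_finalize(void);

#endif
```

The payoff of hiding the byte-oriented send/receive behind this layer is that switching away from MPI later means rewriting one file, not every call site.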

## Hw4

Modify the brandt_mc77_openmp.c file to make it more efficient from a parallelism viewpoint. The routine `relax` does one Gauss-Seidel iteration when run on one processor. In its current form in the file, it does a block Jacobi iteration, with one Gauss-Seidel iteration within the blocks, when more than one processor is used. The routine `multig` takes the reported error and decides whether to do another relaxation step, coarsen to the next coarser level, interpolate to the next finer level, or accept the approximate solution on the finest level after the return from `relax`. It would be much more efficient if some of the logic in `multig` were in `relax` instead. Modify both routines so that `relax` returns when all of the relaxation iterations for a level are complete (so that `multig` only changes levels or returns after `relax` returns).

The routine `relax` should run in parallel for all relaxation sweeps without going back to serial. You will need to add a `while` loop (run in parallel by OpenMP) that runs until the variable `err` has been reduced enough to consider changing levels. You have to be careful updating `err` in parallel (it needs to be done atomically to be safe). You also need a barrier before checking whether `err` has been reduced sufficiently. The current `for` loop that has a `#pragma` on the line above it no longer needs the OpenMP directive, since it will be inside the `while` loop that has its own `#pragma` above it.

Turn in your modified brandt_mc77_openmp.c file (and call it something else). Do your compiling, running, and testing on mgser1.uwyo.edu since that is what I will use to evaluate your homework.

## Hw5

Write a parallel sorting program using one of the three algorithms (LS3, 4 way merge sort, or rotate sort) using a 2×2 array and the C programming language. While I do not care if you use integers, floating point numbers, or complicated structures, I personally would use integers or doubles. Please do the following:

- Use your communications library from Hw3.
- Provide a Readme.{txt,doc,pdf,...} file that describes what you did, how to make and run your examples, and anything else you think I need to know.
- Provide a simple Makefile that makes and runs your code. It should clean up the junk files, too.
- Generate an initial random data set of size about 32 and save it in a file that you can turn in and use for simple debugging.
- Be able to generate an arbitrarily large random data set to use once your code works correctly.
- What is the computational complexity of your code, based on the sorting method(s) and the communications costs? Work out the complexity using the number of comparisons and merges plus the message passing costs in big-O notation. You should show all three parts of the overall complexity formula that you derive.
Hint: Do the homework first in Matlab. This is known as prototyping in the parallel computing world. You can write the Matlab code quite simply or in a complicated style that simulates how the parallel code would be written. It is up to you how much effort you want to put into this step.

## Hw6

Write a parallel program to transmit data in a two dimensional mesh to neighboring processors. You should look in your textbook for pictures of this topic and hints. You may use a Cartesian tensor product mesh (this is simply a matrix in how it should be stored on a computer) or an unstructured mesh (only do this if you really know what you are getting into; this is a request from Mr. Burgess). Please do the following:

- Use your communications library from Hw3.
- Provide a Readme.{txt,doc,pdf,...} file that describes what you did, how to make and run your examples, and anything else you think I need to know.
- Provide as simple a Makefile as you can that makes and runs your code for several cases. It should clean up the junk files, too.
- Create data for the entire domain, though it will be distributed across all of your processors.
- Describe how your domain is decomposed for p = 2, 4, 6, and 8 processors. You may also decompose it for 3, 5, and 7 processors if you wish, though it is not required.
- Only data immediately on the border of a domain should be transmitted to the correct neighbor.
- Provide a table with wall clock timing for the following communications cases:

- Transmit one element of data at a time between processors.
- Transmit as much data at a time between processors as you can.
**Do not do a library/internet search for a solution to this assignment. Do it yourself.** You will learn absolutely nothing by just copying and pasting someone else's solution to a well known problem.

Hints:

- I debug codes like this by setting the data that will be transmitted to be the (processor id (rank)) * 10000 + row * 100 + column. Use bigger factors if you debug on 100×100 or larger decomposed domains.
- Debug with 2×1, then 1×2, 2×2, and finally 4×2 or 2×4.
- Do very small steps at a time and back up everything that works separately. You never know when you need to back up more than one step in your design.
## Hw7

This is strictly extra credit or for fun. Modify your homework 2 so that you do a sparse matrix-vector multiplication using CUDA on the nVidia GP-GPU on my workstation. Let me know if you intend to work on this after the semester ends.
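For orientation only, a COO sparse matrix-vector product kernel might look like the sketch below; all names and the one-nonzero-per-thread strategy are illustrative. Note that `atomicAdd` on `double` requires a GPU of compute capability 6.0 or later; on older hardware you would need `float` or a different reduction strategy.

```cuda
// Sketch: one thread per nonzero.  atomicAdd is needed because several
// nonzeroes can target the same entry of b.
__global__ void spmv_coo(int nnz, const int *row, const int *col,
                         const double *val, const double *x, double *b)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nnz)
        atomicAdd(&b[row[k]], val[k] * x[col[k]]);
}

// Host side (error checking omitted): copy row/col/val/x to the device,
// zero b, then launch roughly one thread per nonzero:
//   int threads = 256, blocks = (nnz + threads - 1) / threads;
//   spmv_coo<<<blocks, threads>>>(nnz, d_row, d_col, d_val, d_x, d_b);
```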