Parallel Computing for Partial Differential Equations

Sommersemester 2012 - Matthias K. Gobbert

Detailed Schedule - Last Updated 05/23/12


This schedule is designed to give you an overview of the material to be covered and is tentative in nature. This is a living document and will be updated throughout the semester.
The chapter numbers refer to Peter S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, 1997.
Class Date / Room Main Topic / Handouts and Attachments / Summary
Lecture 1 Tu. 04/10/12 Introduction: Motivation for parallel computing
Room 1403 HPCF-2010-2; HW 1; PDF transcript; class summary
I used some research slides to introduce UMBC and myself very briefly. Then I asked the students to say a few words about themselves. I clarified a few things about the course and its schedule; see PDF transcript.
I showed several webpages that will be relevant for the course, starting from my homepage at http://www.math.umbc.edu/~gobbert to get to the course's page in the Teaching area and from there to the IT Servicezentrum on the cluster; here, we discovered the problem that those pages are in German.
We looked through the tech. rep. HPCF-2010-2 to motivate some issues of parallel computing, such as that a certain number of parallel MPI processes can be obtained in multiple ways as combinations of nodes and processes-per-node. I then proceeded with Section 2 of the report to explain the test problem we will also want to solve this semester. Building on the beginning of Section 2, I developed the finite difference discretization in detail to end up with the exact shape of the system matrix. We will continue from here with its properties and methods to set it up in Matlab.
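For reference, the generic form of the discretization we arrived at (a sketch only; the precise test problem, right-hand side, and boundary conditions are spelled out in HPCF-2010-2): with mesh spacing h = 1/(N+1) and interior mesh points (x_i, y_j) = (i*h, j*h) for i, j = 1, ..., N, the five-point stencil approximation to the Poisson equation -Delta u = f reads
  ( 4 u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1} ) / h^2 = f_{i,j},
where u_{i,j} = 0 is used whenever an index touches the boundary. Ordering the unknowns by k = i + N*(j-1) turns this into a linear system whose N^2-by-N^2 system matrix is block tridiagonal.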
Lecture 2 We. 04/11/12 Matlab program for Poisson Test Problem
Room 1403 PDF transcript; class summary
Using the result from last time as a starting point, we developed Matlab code that solves the Poisson test problem from tech. rep. HPCF-2010-2. I did not assume that you knew Matlab for this lecture, so this is also a very fast and fairly high-level introduction to both the philosophy of and programming in this software package. To make full sense of everything, you should read a more basic introduction, e.g., Matlab's own Getting Started guide.
Lab 1 We. 04/11/12 Linux, Matlab, solution of Poisson equation in Matlab
HW 1 due Room 2421 HW 2; class summary
We logged into the lab computers first. Then we tried logging in to the Linux cluster in the IT Servicezentrum of the Uni Kassel by connecting to username@its-cs1.its.uni-kassel.de.
I showed as an example how to run Matlab on the cluster and pointed out a few more things about Matlab: (i) I showed how to develop code by editing a script m-file and running it repeatedly during code development; from the Matlab command line, use edit driver_ge.m to start the Matlab editor for an m-file driver_ge.m. (ii) I showed how to cut-and-paste code from the development script into a Matlab function m-file; I showed how to write a function header and how to call the function from the other script. (iii) I demonstrated that many of Matlab's own functions are read-accessible by saying, e.g., edit pcg.
Lecture 3 Tu. 04/17/12 Chapter 3: Greetings; Chapter 4: An Application: Numerical Integration
Room 1403 PDF transcript; greetings.c, mpigreetings.slurm; class summary
This lecture covered both Chapters 3 and 4 in Pacheco. The purpose and didactic role of both chapters is to introduce the idea of message passing (-> MPI = Message Passing Interface) in an SPMD (= single program multiple data) program. That is, all parallel MPI processes execute the same program, but different results can be computed on the processes through branches or index algebra that depend on the process rank id.
First, I discussed the minimal MPI programs that are possible, one with only using mpicc to compile, one that additionally includes the header file mpi.h, then one that starts and stops MPI with MPI_Init and MPI_Finalize, and finally one that obtains id and np from MPI_Comm_rank and MPI_Comm_size. These most basic commands are demonstrated in the hello_parallel.c code on the cluster webpage; see Lab 2 for the URL of this page. This page also supplies a submission script for the executable hello_parallel and usage tutorials for compiling and running. The next code is based on seeing this example run: We notice that the output is not in order.
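For reference, a minimal program along these lines might look as follows (a sketch in the spirit of hello_parallel.c; the code posted on the cluster webpage may differ in details such as also printing the hostname):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int id, np;  /* process rank and number of processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    printf("Hello from process %d of %d\n", id, np);

    MPI_Finalize();
    return 0;
  }

Compile with mpicc and submit with the supplied script; running it a few times makes the out-of-order output easy to see.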
The purpose of the greetings program in Chapter 3 is to introduce MPI_Send and MPI_Recv. The motivation is that we might want to force the hello-world messages from the processes to appear in order; another purpose could be to program in such a way that only Process 0 needs access to stdout. We started by learning about MPI_Send from its man page; there are man pages for each MPI command, and you can often find other useful commands from the See Also entry. We developed the greetings.c program by copying hello_parallel.c and modifying it. The result is posted above. I also showed how to obtain the submission script by copying the previous one and changing job name and executable.
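A sketch of the structure we ended up with (illustrative only; the posted greetings.c is the authoritative version):

  #include <stdio.h>
  #include <string.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int id, np, partner;
    char message[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (id != 0) {
      /* every process other than 0 sends its greeting to Process 0 */
      sprintf(message, "Greetings from process %d!", id);
      MPI_Send(message, strlen(message)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
      /* Process 0 receives the greetings in rank order and prints them */
      for (partner = 1; partner < np; partner++) {
        MPI_Recv(message, 100, MPI_CHAR, partner, 0, MPI_COMM_WORLD, &status);
        printf("%s\n", message);
      }
    }

    MPI_Finalize();
    return 0;
  }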
For Chapter 4, we started by posing the mathematical problem and the task to parallelize it. I stated the idea for the parallelization and developed a hand-written version of the heart of the code. As part of this, I motivated that we want to distinguish global variables that have the same meaning and value on all processes from so-called local variables that can have a different value on each process, hence the chosen variable names. Notice that we applied this also to local_n, and we introduced the properly named local_In and In. This code shows a situation where the existing serial code can be used on each process; not all parallel code can be developed in this way. I also demonstrated the use of a wildcard, here MPI_ANY_SOURCE, in the code, since we do not actually care in which order the local_In are accumulated into In, and the most general format conversion for a double-precision variable (see the printf in the last line of code in the PDF transcript).
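The heart of the hand-written code might be reconstructed roughly as follows (a sketch, not the posted code; the integrand f is a placeholder, and Trap is a double-precision version of the serial routine in Pacheco's trap.c):

  #include <stdio.h>
  #include <math.h>
  #include <mpi.h>

  double f(double x) { return exp(x); }  /* placeholder integrand */

  /* serial composite trapezoidal rule on [local_a,local_b] with local_n trapezoids */
  double Trap(double local_a, double local_b, int local_n, double h)
  {
    double integral = (f(local_a) + f(local_b)) / 2.0;
    double x = local_a;
    int i;
    for (i = 1; i < local_n; i++) {
      x = x + h;
      integral += f(x);
    }
    return integral * h;
  }

  int main(int argc, char *argv[])
  {
    int id, np, source, local_n;
    double a = 0.0, b = 1.0;   /* global integration limits */
    int n = 1024;              /* global number of trapezoids */
    double h, local_a, local_b, local_In, In;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    h = (b - a) / n;           /* global: same value on every process */
    local_n = n / np;          /* assumes that np divides n evenly */
    local_a = a + id * local_n * h;
    local_b = local_a + local_n * h;
    local_In = Trap(local_a, local_b, local_n, h);

    if (id == 0) {
      In = local_In;
      for (source = 1; source < np; source++) {
        /* MPI_ANY_SOURCE: the order of accumulation into In does not matter */
        MPI_Recv(&local_In, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
        In += local_In;
      }
      printf("Integral approximation In = %24.16e\n", In);
    } else {
      MPI_Send(&local_In, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
  }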
Notice that our developed codes are not exactly the same as Pacheco's codes in Chapters 3 and 4. I tried to be cleaner in some ways, although not in all ways (e.g., the overwriting of local_In): more typical variable names (id, np, local_...), the order of the if-statement on id==0, and consistent use of double-precision (double and its format conversion). You should compare to Pacheco's original codes.
Lecture 4 We. 04/18/12 Chapter 5: Collective Communication
Room 1403 PDF transcript; class summary
We covered Sections 5.1 through 5.6 of Pacheco. All commands in this chapter are collective, meaning that all parallel MPI processes need to execute them. The example of MPI_Reduce to replace the if-statement and for-loop in the trapezoidal rule program shows that collective communication commands are often a lot simpler to code. The contrast of MPI_Reduce and MPI_Allreduce shows that different versions of MPI functions exist that should take care of most situations requiring communication. I stressed at the end that we have to go back to discussing how arrays like x, y should be defined in C.
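For instance, in the trapezoidal rule sketch under Lecture 3 above, the whole if-statement with the receive loop can be replaced by a single collective call (same variable names assumed):

  /* every process calls MPI_Reduce; the sum of all local_In ends up in In on Process 0 */
  MPI_Reduce(&local_In, &In, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (id == 0)
    printf("Integral approximation In = %24.16e\n", In);

  /* the MPI_Allreduce variant has no root argument and leaves In on all processes:
     MPI_Allreduce(&local_In, &In, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);  */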
Lab 2 We. 04/18/12 Format Conversion in C and Setup of Performance Studies (for HW 3)
HW 2 due Room 2421 HW 3; Cluster webpage; class summary
Notice that the three problems on the homework walk you through setting up and getting started with running MPI jobs. In Problem 1, you should download the code in a tar.Z file from the webpage of Pacheco's book; a link is on our course webpage. Use gunzip to decompress the Z file and get a plain tar file. Then use tar xf on the tar file, that is:
gunzip ppmpi_c.tar.Z
tar xf ppmpi_c.tar
This gets you a directory ppmpi_c, which contains code organized by chapter. We want to use trap.c from chap04 in Problem 3 below.
In Problem 2, you download two files from the cluster webpage; link is above. The goal is merely to run with different numbers of nodes and tasks-per-node and confirm from the printout of the hostnames that things are working as intended.
The first step in Problem 3 is to obtain the trap.c code, compile it with
mpicc trap.c -o trap
and create a submission script mpitrap.slurm for trap. I showed how to switch to the public partition; since that has two types of nodes in it and since we want to use only one type in one run, I showed the --constraint option, which can have values 12cores or 4cores. So, you should have the two lines
#SBATCH --partition=public
#SBATCH --constraint=12cores
in your submission script, or alternatively 4cores.
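Putting the pieces together, a complete mpitrap.slurm might look roughly like the following sketch (the job name, output file names, node counts, and the launch line are assumptions about typical SLURM usage; copy the actual conventions from the posted scripts on the cluster webpage, in particular whether the last line uses srun, mpirun, or mpiexec):

  #!/bin/bash
  #SBATCH --job-name=trap
  #SBATCH --output=slurm.out
  #SBATCH --error=slurm.err
  #SBATCH --partition=public
  #SBATCH --constraint=12cores
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=4
  srun ./trap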
I showed concretely how to program better and more complete output. I showed how to set up a directory structure as suggested in Problem 3 and how to use the grep command to access information from the output.
Lecture 5 Tu. 04/24/12 Chapter 5: Collective Communication
Room 1403 PDF transcript; class summary
We covered the rest of Chapter 5. That is, after MPI_Bcast, MPI_Reduce, and MPI_Allreduce last time, we talked about MPI_Scatter, MPI_Gather, and MPI_Allgather today. First, we did a sensible example of all three commands applied to an n-vector x that is divided into l_n-vectors l_x with l_n=n/np on each of the np processes.
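A sketch of this example in code (illustrative only; it assumes that np divides n evenly so that l_n = n/np):

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int id, np, i, l_n, n = 16;
    double *x = NULL, *l_x;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    l_n = n / np;                       /* assumes np divides n evenly */
    l_x = (double*) malloc(l_n * sizeof(double));
    if (id == 0) {                      /* only the root needs the full n-vector */
      x = (double*) malloc(n * sizeof(double));
      for (i = 0; i < n; i++) x[i] = (double) i;
    }

    /* distribute consecutive blocks of length l_n of x on Process 0 to l_x everywhere */
    MPI_Scatter(x, l_n, MPI_DOUBLE, l_x, l_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < l_n; i++) l_x[i] *= 2.0;   /* some local work on the piece */

    /* collect the pieces back into x on Process 0;
       MPI_Allgather would fill a full-length x on every process instead */
    MPI_Gather(l_x, l_n, MPI_DOUBLE, x, l_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (id == 0) {
      for (i = 0; i < n; i++) printf(" %5.1f", x[i]);
      printf("\n");
      free(x);
    }
    free(l_x);
    MPI_Finalize();
    return 0;
  }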
Then, we attempted to move towards the example in Pacheco that is based on a square n-by-n matrix A. In this process, we discussed that C does not have true multi-dimensional arrays, but that a declaration double l_A[l_n][n] is really l_n arrays of length n. This is necessarily row-oriented storage, by virtue of our mathematical understanding of l_A[i][j], where i is the row index and j the column index. We discussed that we really want to use dynamic memory allocation, and we wrote the standard C way of defining matrices (i.e., arrays of arrays in the double-pointer double **l_A). We ended with the observation that the memory allocated for l_A in this standard C way does not have to be consecutive and that MPI_Gather therefore cannot be applied to it.
We will overcome this next time by using 1-D arrays only and handling the 2-D indexing ourselves. We saw this already in Lecture 1 in the idea of k=i+N*(j-1) which is systematic and produces consecutive column-oriented storage.
Lecture 6 We. 04/25/12 Effective Representation of Matrices in C and Use of Software Packages
Room 1403 PDF transcript; class summary
We talked about some aspects of HW 3 that are suitable to a lecture format. Then, I discussed the existence of packages like BLAS, LAPACK, etc. For BLAS, I characterized the difference between level-1, level-2, and level-3 BLAS and quoted some example results that show the benefit of using a package like that. This served as further motivation that we want to use 1-D arrays in C to store our matrices. Through code and some examples, I explained how to do this. Finally, we talked about HW 4 explicitly with some code ideas and suggestions for how to go about designing the algorithm for the matrix-vector product, which is the heart of the algorithm and which will determine the performance of your code.
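A minimal sketch of the idea (a hypothetical stand-alone example, not the supplied HW 4 code): an m-by-n matrix is stored column by column in one consecutive 1-D array, so that entry (i,j) sits at a[i + m*j] with 0-based indices; in 1-based terms this is the k = i + m*(j-1) ordering from Lecture 1, and because the memory is consecutive, commands like MPI_Gather can be applied to it directly.

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
    int m = 4, n = 3, i, j;
    /* one consecutive block of memory for an m-by-n matrix in column-oriented storage */
    double *a = (double*) calloc(m * n, sizeof(double));

    for (j = 0; j < n; j++)
      for (i = 0; i < m; i++)
        a[i + m*j] = (i + 1) + 0.1 * (j + 1);   /* entry (i,j) of the matrix */

    for (i = 0; i < m; i++) {                   /* print in the usual matrix layout */
      for (j = 0; j < n; j++)
        printf("%8.2f", a[i + m*j]);
      printf("\n");
    }

    free(a);
    return 0;
  }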
Lab 3 We. 04/25/12 Discussion of HW 3 and Guidance for HW 4 (Makefile, setup_example)
HW 3 due Room 2421 HW 4, ver0.0suppliedfiles.tgz, Cleve Moler Matlab Digest article; class summary
The article by Cleve Moler, the inventor of Matlab, gives an example of a practical application of the power method in a large-scale context.
After you put the tgz in a desired location on the cluster, use
tar zxf ver0.0suppliedfiles.tgz
to expand it into a directory ver0.0suppliedfiles. It contains the Matlab code for the power method, Makefile, c-code with header files, and a submission script.
We discussed HW 3. I showed how to use the grep utility in Linux to look at all results and errors for one value of n. This can confirm whether runs with different numbers of processes gave the same result. This can often bring out issues that are not even related to parallel computing, for instance the following: I showed two mistakes in the way that the trapezoidal rule is implemented in Pacheco, namely x=x+h accumulates round-off and (f(local_a)+f(local_b)) adds numbers that can be very different in scale. These mistakes degrade the results for large n like those that we need to demonstrate scalability of the code, such as n=1e7, 1e8, 1e9. We noted that one should use -O3 to get the best performance of the code; we found roughly an order of magnitude difference in serial time without this flag.
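One possible way to address both mistakes in the serial Trap routine (a sketch, not the posted solution): compute each quadrature node directly as local_a + i*h instead of accumulating x = x + h, and add the two endpoint terms to the (typically large) interior sum one at a time rather than to each other first.

  /* composite trapezoidal rule on [local_a,local_b] with local_n sub-intervals of width h */
  double Trap(double local_a, double local_b, int local_n, double h)
  {
    double sum = 0.0;
    int i;
    for (i = 1; i < local_n; i++)
      sum += f(local_a + i*h);   /* node computed directly, no accumulated round-off */
    sum += 0.5 * f(local_a);     /* endpoint terms added separately to the large sum, */
    sum += 0.5 * f(local_b);     /* not to each other first */
    return sum * h;
  }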
Then we looked at the files supplied for HW 4. I explained the structure of the Makefile; notice the invisible TAB stops! I showed the code in setup_example, which is analogous to the example in lecture, but also demonstrates use of l_j for an index that is limited by l_n and the conversion from the local index l_j on Process id to the mathematical global index j.
Lecture 7 Tu. 05/01/12 Tag der Arbeit - Labor Day Holiday - No Class!
Lecture 8 We. 05/02/12 Chapter 10: Design and Coding of Parallel Programs (Jacobi and CG methods)
Room 1403 PDF transcript; class summary
I explained that we are going to write the Matlab version of the algorithm in the lab, and that it will be our starting point and reference point for testing the serial C code. But in order to transition to C, we first collected what we have in Matlab so far; we developed some missing pieces.
Then we transitioned to C. The starting point should be the Homework 4 code, with everything related to the power method deleted, but keeping all auxiliary functions. We noted certain small changes to the Makefile and the job submission script. We looked through the supplied function cg.c with cg.h of the CG method (attached to the lab below); this is an example of a black-box solver, so our focus needs to be on just understanding the interface; make sure to read the comments in cg.c carefully. Then we discussed the data structure, ending up with realizing already, for next time, what kind of communications between parallel processes will be necessary. For the serial C code, we set some variables as they will be needed in the future, but the only actual code to be added is the Ax() function that we developed at the end.
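For illustration, a serial matrix-free Ax() for the five-point stencil might look like the following sketch (the names, the argument conventions, and the placement of the 1/h^2 factor are assumptions; match them to the supplied files and to your own setup of the right-hand side):

  /* matrix-free v = A*u for the Poisson test problem on an N-by-N mesh of interior points;
     u and v are 1-D arrays of length N*N with mesh point (i,j) stored at k = i + N*j
     (0-based), i.e., the k = i + N*(j-1) ordering from Lecture 1; the stencil uses 4 on
     the diagonal and -1 for every neighbor that exists in the interior */
  void Ax(double *v, const double *u, int N)
  {
    int i, j, k;
    for (j = 0; j < N; j++) {
      for (i = 0; i < N; i++) {
        k = i + N*j;
        v[k] = 4.0 * u[k];
        if (i > 0)   v[k] -= u[k - 1];   /* neighbor (i-1,j) */
        if (i < N-1) v[k] -= u[k + 1];   /* neighbor (i+1,j) */
        if (j > 0)   v[k] -= u[k - N];   /* neighbor (i,j-1) */
        if (j < N-1) v[k] -= u[k + N];   /* neighbor (i,j+1) */
      }
    }
  }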
Lab 4 We. 05/02/12 Parallel Code Development on Example of Power Method (HW 4)
HW 4 due Room 2421 HW 5, 6, and 7, cg.c, cg.h, Outline of CG (Allen, UMBC Review), Krylov subspace method decision tree (Demmel); Ax.m; class summary
We did Homework 2, Problems 5 and 4, as Quizzes 1 and 2, respectively. The upshot is to obtain a working Matlab code with a matrix-free implementation of the CG method; this is in effect the third version of Homework 2 (Gaussian elimination; CG with matrix A; matrix-free CG). Notice that this all works because the CG method only needs the action of the matrix in a matrix-vector multiplication, not the matrix itself. This is the case for all Krylov subspace methods (sometimes the action of the transpose of the matrix is also needed), hence the relevance of this test problem. Attached here is Ax.m for the matrix-free implementation of v=A*u. To use this function with Matlab's pcg function, set the input argument A to the function handle of Ax.m, that is,
A = @Ax;
see the documentation on the pcg function in Matlab.
Lecture 9 Tu. 05/08/12 Chapter 13: Advanced Point-to-Point Communication
Room 1403 PDF transcript; class summary
I made some remarks on HW 4 that should be useful for HW 5.
Then, the goal of today was to develop a parallel version of the matrix-free matrix-vector function Ax() for v=A*u, now with input l_u and output l_v. I tried simultaneously to give hints on how to do this step by step as bug-free code development and to slowly develop the idea of what we need to communicate and why. We ended with concrete code on the last page of the PDF file; we talked about how it cannot be quite right the way it is stated, since an MPI_Recv blocks (i.e., waits right there) until a corresponding MPI_Send (match the tag numbers!) is posted. The situation is not as simple as interchanging their order, which we will discuss in the next class. Moreover, we still have to develop a non-blocking version of this code, in which you would do calculations while the communications are in progress.
Lecture 10 We. 05/09/12 Chapter 13: Advanced Point-to-Point Communication
Room 1403 PDF transcript; class summary
We compared some results of HW 5's serial C code for N=2048. The iteration count should probably be 3192; other numbers point to a small bug. Error norm should be 7.81e-7. Timing results appear to be on the order of 300 seconds on a six-core CPU (--constraint=12cores) or over 600 seconds on a dual-core CPU (--constraint=4cores). It should be possible to run for N=4096 on one node of this cluster.
We first discussed, in somewhat general terms, issues such as deadlocking and safe MPI code associated with MPI_Send/MPI_Recv, as introduced in Chapter 3. I mentioned some of the advanced point-to-point communication commands of Chapter 13 (as opposed to the collective communication commands of Chapter 5). We can restrict our attention to the classical blocking MPI_Send/MPI_Recv and the non-blocking MPI_Isend/MPI_Irecv. I then discussed concretely how to finish the parallelization of the code for Ax() from last time. To make it safe for blocking MPI_Send/MPI_Recv, I showed the standard trick of a case distinction between even-numbered (id%2==0) and odd-numbered processes. Then we developed the code for MPI_Isend/MPI_Irecv.
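A sketch of the non-blocking version of the neighbor exchange (the names l_u, gl, gr and the assumption that each process owns l_n consecutive mesh columns of N points are illustrative; adapt this to how your data structure actually splits the mesh):

  /* exchange the boundary mesh columns with the neighboring processes: l_u holds the
     l_n local columns of N points each (consecutive in memory), gl receives the ghost
     column from the left neighbor id-1, gr the ghost column from the right neighbor id+1 */
  MPI_Request requests[4];
  MPI_Status  statuses[4];
  int nreq = 0;

  if (id > 0) {      /* a left neighbor exists */
    MPI_Irecv(gl, N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD, &requests[nreq++]);
    MPI_Isend(&l_u[0], N, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD, &requests[nreq++]);
  }
  if (id < np-1) {   /* a right neighbor exists */
    MPI_Irecv(gr, N, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &requests[nreq++]);
    MPI_Isend(&l_u[(l_n-1)*N], N, MPI_DOUBLE, id+1, 0, MPI_COMM_WORLD, &requests[nreq++]);
  }

  /* ... work on the interior columns that does not need ghost data can overlap here ... */

  MPI_Waitall(nreq, requests, statuses);   /* after this, gl and gr are safe to use */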
Finally, I gave some concrete suggestions for how to code the norm calculation and how to obtain the constant pi.
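For example, along these lines (a sketch with illustrative names; math.h is needed for fabs, sqrt, and atan):

  /* infinity norm of a distributed vector l_v of local length l_n*N: local maximum of
     the absolute values, then MPI_Allreduce with MPI_MAX; for the Euclidean norm, sum
     the squares locally, reduce with MPI_SUM, and take sqrt of the global sum */
  double l_max = 0.0, norm;
  int k;
  for (k = 0; k < l_n*N; k++)
    if (fabs(l_v[k]) > l_max) l_max = fabs(l_v[k]);
  MPI_Allreduce(&l_max, &norm, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

  /* one portable way to obtain the constant pi to full double precision */
  double pi = 4.0 * atan(1.0);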
Lab 5 We. 05/09/12 Discussion of HW 5 - Correctness of Convergence Study
HW 5 due Room 2421 class summary
I made a number of suggestions on the report and on details such as which partition to run on, but the bulk of the time was taken up by meetings with each team for feedback purposes.
Lecture 11 Tu. 05/15/12 Load-Balancing and Ringsend, Review of Other Aspects of All Homework
Room 1403 PDF transcript; class summary
I started with some remarks on how to structure a report by making the main point already in Section 1.
Then we talked about the partition thphysik with InfiniBand interconnect.
The remainder of class was dedicated to some standard issues that come up in parallel computing:
  • how to do a ringsend with destination = (id+1)%np and source = (id-1+np)%np; notice the "+np" to ensure that source=np-1 for id=0; a small code sketch follows after the load-balancing code below;
  • load-balancing when distributing n items to np processes, such as the n sub-intervals of [a,b] in the trapezoidal rule program; for this problem, my notes in the PDF attachment show a very rudimentary code (actually including a redundant if-statement); a better organized version might be
              rem = n%np; /* remainder when splitting n to np processes */
              if (id < rem) { /* if id is 0, 1, ..., rem-1 then ... */
                l_n = n/np + 1;     /* ... one more trapezoid */
                l_ia = id*(n/np+1); /* local starting index of integral limit */
                l_ib = l_ia + l_n;  /* local ending index of integral limit */
              } else {
                l_n = n/np;         /* local number of trapezoids */
                l_ia = rem*(n/np+1)+(id-rem)*(n/np); /* local starting index */
                l_ib = l_ia + l_n;  /* local ending index of integral limit */
              }
              local_a = a + l_ia*h;
              local_b = a + l_ib*h;
For both examples, I also made the debugging point that one should focus on the integer quantities in the test output.
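Regarding the first bullet, a small sketch of the ringsend (MPI_Sendrecv is used here as one safe way to pair the send and the receive; the even/odd case distinction or MPI_Isend/MPI_Irecv from Lecture 10 work just as well):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int id, np, dest, source, sendval, recvval;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    dest   = (id + 1) % np;        /* right neighbor in the ring */
    source = (id - 1 + np) % np;   /* left neighbor; "+np" gives source=np-1 for id=0 */
    sendval = id;

    /* the paired send/receive cannot deadlock, no matter how large the message */
    MPI_Sendrecv(&sendval, 1, MPI_INT, dest,   0,
                 &recvval, 1, MPI_INT, source, 0,
                 MPI_COMM_WORLD, &status);

    printf("Process %d received %d from process %d\n", id, recvval, source);

    MPI_Finalize();
    return 0;
  }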
Lecture 12 We. 05/16/12 Chapter 10: Design and Coding of Parallel Programs (Sorting)
Room 1403 PDF transcript; class summary
We talked about Ch. 10 on algorithm design, specifically the second part on sorting. Really, this portion introduces the MPI command MPI_Alltoall and then shows the need for variable ("v") versions of MPI commands such as MPI_Alltoallv, MPI_Gatherv, MPI_Scatterv, etc. In these commands, a vector of counts replaces the scalar integer count for how many pieces to communicate.
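For instance, gathering local pieces l_x of possibly different lengths l_n into x on Process 0 might look like the following fragment (illustrative names; for simplicity the counts and displacements are computed on every process, although MPI_Gatherv only needs them on the root):

  int p;
  int *recvcounts = (int*) malloc(np * sizeof(int));
  int *displs     = (int*) malloc(np * sizeof(int));

  /* recvcounts[p] = how many entries come from process p; collected with MPI_Allgather */
  MPI_Allgather(&l_n, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);

  displs[0] = 0;                       /* displs[p] = where the piece from p starts in x */
  for (p = 1; p < np; p++)
    displs[p] = displs[p-1] + recvcounts[p-1];

  MPI_Gatherv(l_x, l_n, MPI_DOUBLE,
              x, recvcounts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);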
At the end of class, I gave a very brief summary of Ch. 6 on grouping data in communications by using derived datatypes. This motivated why send_type and recv_type can be different in MPI commands.
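As a small taste of Ch. 6 (a sketch with illustrative names): a derived datatype can describe, for example, one row of an m-by-n matrix a stored column by column as in Lecture 6, whose n entries are not consecutive in memory but m doubles apart.

  MPI_Datatype row_type;
  MPI_Type_vector(n, 1, m, MPI_DOUBLE, &row_type);   /* n blocks of 1 double, stride m */
  MPI_Type_commit(&row_type);
  MPI_Send(&a[i], 1, row_type, dest, 0, MPI_COMM_WORLD);  /* send row i (0-based) */
  MPI_Type_free(&row_type);

The receiver may receive this message as n consecutive entries of type MPI_DOUBLE, which illustrates why send_type and recv_type need not be the same.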
Lab 6 We. 05/16/12 Discussion of HW 4 - Example of Algorithm Design
HW 6 due Room 2421 class summary
We mainly discussed the solution to HW 4 on the power method. I showed several versions of the matrix-vector product function that used MPI_Reduce and MPI_Scatter; MPI_Allreduce and a local copy; and MPI_Reduce_scatter. I also demonstrated that it is possible to replace the parallel dot product calls by serial ones, since we already have the full result of y=A*x from MPI_Allreduce in the matrix-vector product; that reduces the number of MPI calls and improves efficiency. However, the most striking demonstration had to do with a serial issue, namely the order of the two loops in the local matrix-vector product l_A*l_x: it can make two orders of magnitude difference and be much more important than differences in MPI calls.
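To illustrate the loop-order point (a sketch with generic names, not the HW 4 solution): for a local block l_A with nrows rows and ncols columns in column-oriented storage, the j-outer/i-inner ordering walks through l_A consecutively in memory, whereas the interchanged ordering strides by nrows at every step and uses the cache poorly.

  /* l_y = l_A * l_x with entry (i,j) of l_A stored at l_A[i + nrows*j] */
  for (i = 0; i < nrows; i++)
    l_y[i] = 0.0;
  for (j = 0; j < ncols; j++)          /* j outside, i inside: consecutive memory access */
    for (i = 0; i < nrows; i++)
      l_y[i] += l_A[i + nrows*j] * l_x[j];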
Lecture 13 Tu. 05/22/12 Chapter 7: Communicators and Topologies
Room 1403 PDF transcript; class summary
We discussed how to create communicators that are sub-groups of the original communicator MPI_COMM_WORLD. In the most general commands, one can list the ID numbers of the processes to include or give a condition based on the ID number. For the special case of sub-divisions in the form of a Cartesian grid, there are special MPI commands. Fox's Algorithm, which is really a basic example of a block algorithm for matrix-matrix multiplication, provides a good application of Cartesian grids. Its MPI code is remarkably short and posted in Pacheco; my version looks different since I used a collective MPI_Bcast so that the local product looks like l_C+=l_A*l_B on all processes.
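For the Cartesian grid itself, a sketch of the special commands (illustrative; it assumes np = q*q processes):

  /* arrange the np processes in a periodic q-by-q Cartesian grid and create one
     communicator per grid row and one per grid column, as used in Fox's Algorithm */
  MPI_Comm grid_comm, row_comm, col_comm;
  int dims[2], periods[2], coords[2], remain[2], grid_id;

  dims[0] = dims[1] = q;
  periods[0] = periods[1] = 1;                       /* wrap around in both dimensions */
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);
  MPI_Comm_rank(grid_comm, &grid_id);
  MPI_Cart_coords(grid_comm, grid_id, 2, coords);    /* my (row, column) in the grid */

  remain[0] = 0; remain[1] = 1;                      /* keep only the column dimension */
  MPI_Cart_sub(grid_comm, remain, &row_comm);        /* communicator for my grid row */
  remain[0] = 1; remain[1] = 0;                      /* keep only the row dimension */
  MPI_Cart_sub(grid_comm, remain, &col_comm);        /* communicator for my grid column */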
Then I showed some timing results for code for C=A*B using level-1, -2, -3 BLAS functions ddot, dger, and dgemm, respectively, as computational core. It is stunning how much faster the BLAS3 code is, because it uses a block algorithm inside.
Lecture 14 We. 05/23/12 Chapter 15: Parallel Libraries: ScaLAPACK (incl. BLAS and LAPACK)
Room 1403 PDF transcript; class summary
After some remarks on HW 7, I talked about BLAS and finally LAPACK, BLACS, and ScaLAPACK. To learn some background about BLAS, we went to www.netlib.org, a large repository of mathematical software. Via "Browse" and "blas", you can navigate to a list of BLAS functions in Fortran. The code here is from the 1970s, but it provides a precise reference for what the functions do via readable source code. We looked at the ddot.f function for the dot product of two vectors. In the code, we learned the meaning of loop-unrolling (here in fixed steps of 5, i.e., not cache-aware) and of increments in the indices of the input vectors. Using the example of a matrix-matrix product, I demonstrated what non-unit increments can be used for.
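As an illustration of what the increments buy you, here is a sketch on the simpler case of a matrix-vector product (it uses the C interface cblas_ddot, which is an assumption about the installed BLAS; the Fortran ddot takes the same increment arguments): with increment m, ddot runs along one row of an m-by-n matrix stored column by column, so y = A*x can be written as one ddot call per row.

  #include <stdio.h>
  #include <stdlib.h>
  #include <cblas.h>   /* C interface to the BLAS; availability depends on the installation */

  int main(void)
  {
    int m = 4, n = 3, i, j;
    double *a = (double*) malloc(m * n * sizeof(double));  /* m-by-n, column by column */
    double *x = (double*) malloc(n * sizeof(double));
    double *y = (double*) malloc(m * sizeof(double));

    for (j = 0; j < n; j++) {
      x[j] = 1.0;
      for (i = 0; i < m; i++)
        a[i + m*j] = (double)(i + 1);
    }

    /* row i of a starts at a[i] and its entries are m apart in memory,
       so the increment argument m makes ddot run along that row */
    for (i = 0; i < m; i++)
      y[i] = cblas_ddot(n, &a[i], m, x, 1);

    for (i = 0; i < m; i++)
      printf("y[%d] = %g\n", i, y[i]);

    free(a); free(x); free(y);
    return 0;
  }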
Then I showed a portion of my research slides that derive the method of lines using finite elements for a system of parabolic reaction-diffusion equations. The purpose was to show how the CG method for Poisson appears as computational kernel in such a problem.
Lab 7 We. 05/23/12 Discussion of HW 7
HW 7 due Room 2421 class summary
I discussed the final report of HW 7 with each team individually, both to understand their work and conclusions on the one hand and to make suggestions for their best presentation and some corrections on the other hand.

Copyright © 2001-2012 by Matthias K. Gobbert. All Rights Reserved.
This page version 3.3, May 2012.