Class | Date / Room | Main Topic / Handouts and Attachments / Summary | |
Lecture 1 | Mo. 12/05/11 | Motivation for parallel computing | |
Room 2421 | HPCF-2010-2; HW 1; research slides; PDF transcript; class summary | ||
The purpose of today's class was to
(i) motivate the need for parallel computing and its goals,
(ii) demonstrate the teaching techniques, and
(iii) demonstrate the learning techniques.

For (i), I showed slides 2, 3, and 4 as information about myself, slides 13, 14, 17, 20, and 21 about a prototypical PDE problem, slides 60-67 to introduce what a parallel computing cluster looks like and what the issues are with distributed-memory clusters, and slides 78, 79, 81, 82, and 83 to motivate the following conclusion: The most important purpose of parallel computing is to pool the memory of several compute nodes to allow the solution of larger problems than is possible on a serial node. The second most important purpose is to speed up calculations so that they become feasible within a reasonable amount of time.

For (ii), I used my iPad2 with NoteTakerHD to write and project hand-written lectures, which will be posted as PDF transcripts here; I annotated the slides in red (slides posted without annotations here); and I used iSSH to log in briefly to the cluster hrz-cs400.hrz.uni-kassel.de that we will use in this class.

We talked about (iii) only in as much as everybody introduced themselves (oral English) and we noted the homeworks (written English); I will return to this in the future to clarify further that this class has the goal of preparing you for research, so learning how to learn from written material (by itself or in preparation for a lecture), writing reports of progressive sizes, and acting more and more independently are all built into the course. We briefly flipped through the various webpages for this course and related pages that you are asked to visit on HW 1; see the URLs there. You should read the first four pages of the tech. report HPCF-2010-2 (notice: year 2010, not 2011) that is linked here via the HPCF Publications page. | |||
Lab 1 | Tu. 12/06/11 | Linux, hello_serial.c, hello_parallel.c, running jobs | |
HW 1 due | Room 2421 | HW 2; mpich2_pgi.qsub; hello_parallel.c; hello_serial.c; class summary | |
We effectively tried to work our way through
Problems 1 and 3 of HW 2, by me demonstrating using
the projector and students following along. We ran out of
time, so you should continue by experimenting yourself,
which is what the homework asks you to do.
Problem 1 introduces several things: We logged in to the
cluster hrz-cs400.hrz.uni-kassel.de; when doing this and
the following, we cross-checked against information on the
webpage of the IT Servicezentrum. I demonstrated basic
Unix/Linux commands such as pwd, ls -l, mkdir, cd, rm.
For Problem 1, I showed how to program in the language C,
by writing a "Hello, world!" program from scratch;
see attached file hello_serial.c. To do this, we had to
learn the basics of the vi editor: to enter it,
simply type "vi filename", which puts you in command mode
of the editor; type "i" to enter input mode; then
you can enter text. Use ESC to leave input mode and
return to command mode. In command mode, you can use the
keyboard keys "h", "j", "k", "l" (or the arrow keys) to
go left, down, up, and right, respectively. Finally, to
quit the editor and save ('write to file') the result,
type ":wq" and hit return.
Problem 3 asks you to download two posted files.
The easiest method at this point turned out to be to
click on the link for each posted file to display it in
the web browser, highlight all of its contents, and copy-and-paste
them into a file with the desired name. I realize that all this can be very confusing at first. Our goal is to get something working as fast as possible, without explaining everything; watch for the next lectures to explain more and more about what is actually going on. | |||
Lecture 2 | We. 12/07/11 | Chapter 3: Greetings! | |
Room 0450 | PDF transcript; video recording; greetings.c; class summary | ||
We reviewed the steps associated
with Problem 3 of Homework 2, which was mostly completed
yesterday in the lab. I made the point that for a more
systematic study, it will be useful to create one directory
for each desired combination of nodes and ppn.
In this process, I showed a number of Unix features
such as how to show hidden files ("ls -al"),
how to define short-hand commands (so-called aliases)
for the Korn shell (/bin/ksh, see output of "echo $SHELL")
in its run-control file .kshrc (alias ll="ls -l"),
mkdir, cp -p (to preserve date/time stamps and permissions),
cp -r (recursive copy of directories and sub-directories
and their files), diff (to compare two files or directories),
mv (move or rename), and more.
Regarding use of this cluster, we used qsub to submit a job,
then qstat to see the status of all currently running or
queued jobs (recommended options are -a and -n) and
the Ganglia software,
which you access by pointing a web browser to
We noticed that this hello_parallel.c does not actually
communicate between processes. To this end, I formulated a
task (see comments in greetings.c) and then showed how to
design an algorithm and code from the task.
See Chapter 3 of Pacheco for a longer written explanation
and for an equivalent code using a slightly different style.
The point was to learn about MPI_Send and MPI_Recv.
We learned how to use them from their man pages
and then implemented code in the attached greetings.c.
Its output now results in an ordered list of processes.
Please notice that this also shows you how to do for-loops
and if-statements in C.
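A sketch in the spirit of the greetings.c described above is shown below; the variable names and the exact message text are assumptions, not necessarily those of the posted file:

```c
/* Sketch in the spirit of greetings.c: every process sends a message
   to process 0, which prints them in order of rank. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int id, np, source;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  if (id != 0) {
    sprintf(message, "Greetings from process %d!", id);
    MPI_Send(message, strlen(message) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  } else {
    printf("Greetings from process %d!\n", id);
    for (source = 1; source < np; source++) {
      /* receive in order of rank, which gives the ordered output */
      MPI_Recv(message, 100, MPI_CHAR, source, 0, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }

  MPI_Finalize();
  return 0;
}
```
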
We compiled this code with | |||
Lecture 3 | Mo. 12/12/11 | Chapter 4: An Application: Numerical Integration | |
Room 2404 | PDF transcript; video recording; class summary | ||
We started by looking at a portion of the
taped lecture from last time. It works very smoothly on the iPad,
but the recommendation for other laptops/desktops is to download
the mpeg-file first, then play it.

The purpose of today is to extend our use of MPI communications to a context where the communications are part of a numerical algorithm and serve to compute some final result that requires all processes to work together. We use the trapezoidal rule program from Chapter 4 in Pacheco as an example. I went through the background of the method and its parallelization idea. The idea is to divide up the significant calculation work among all processes and to pre-compute as much as possible locally on each process, and then to communicate as little as possible (here: we form local integrals and then communicate a single real number local_In, as opposed to communicating all f(x_i) and then accumulating the entire sum for In on Process 0).

My presentation tried to demonstrate the development of code and its documentation in text. I demonstrated the idea of using "local_" (or later "l_" for short) as a prefix for variables to indicate that they are local in the sense that they can have different values on each process (and not global in the sense that the value of that variable is the same on all processes). Like last time, we designed the parallel code from the explicitly stated algorithmic idea. Notice that we designed the algorithm from its parallelization idea, not by looking at a serial code; this leads us to focus more on the higher level and to bury calculation details in a function. When actually writing code from scratch, you will still likely start bottom-up, but with the given tactic you can design more appropriate variable names.

We ended by reading Pacheco's trap.c file from Chapter 4 and noticed how some details differ from my design, including the choice of variable names. Regarding C programming, the code provides examples of function definitions and the need for a prototype of a function before calling it, of float as the intrinsic datatype for single-precision real variables, and of the ampersand to obtain the address of a variable (used in the first argument of MPI_Send and MPI_Recv). | |||
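A condensed sketch of the design described above: each process integrates its own subinterval and only the single number local_In is communicated to Process 0. The integrand f and the values of a, b, and n are placeholders for illustration, not the actual example used in class.

```c
/* Sketch of the parallel trapezoidal rule in the spirit of the lecture. */
#include <stdio.h>
#include <mpi.h>

float f(float x) { return x * x; }          /* example integrand (placeholder) */

/* local trapezoidal rule on [local_a, local_b] with local_n trapezoids */
float Trap(float local_a, float local_b, int local_n, float h)
{
  float integral = (f(local_a) + f(local_b)) / 2.0;
  float x = local_a;
  int i;
  for (i = 1; i < local_n; i++) {
    x += h;
    integral += f(x);
  }
  return integral * h;
}

int main(int argc, char *argv[])
{
  int id, np, source, local_n, n = 1024;    /* n assumed divisible by np */
  float a = 0.0, b = 1.0, h, local_a, local_b, local_In, In;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  h = (b - a) / n;                          /* step size, same on all processes */
  local_n = n / np;                         /* trapezoids per process */
  local_a = a + id * local_n * h;           /* left end of this process's piece */
  local_b = local_a + local_n * h;
  local_In = Trap(local_a, local_b, local_n, h);

  if (id != 0) {                            /* send one number to Process 0 */
    MPI_Send(&local_In, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
  } else {                                  /* Process 0 accumulates the total */
    In = local_In;
    for (source = 1; source < np; source++) {
      MPI_Recv(&local_In, 1, MPI_FLOAT, source, 0, MPI_COMM_WORLD, &status);
      In += local_In;
    }
    printf("With n = %d trapezoids, integral from %f to %f = %24.16e\n",
           n, a, b, In);
  }

  MPI_Finalize();
  return 0;
}
```
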
Lab 2 | Tu. 12/13/11 | Ch. 3 and 4 | |
HW 2 due | Room 2421 | HW 3; class summary | |
At the beginning, we went through some preparatory steps:
we created directory lab02, copied trap.c from Pacheco's Chapter 4
(ppmpi_c/chap04/trap.c) to lab02, compiled it with mpicc
(mpicc trap.c -o trap), and ran it using qsub;
for the latter point, I suggested not to download a new
submission script from somewhere, but to copy the one
from the last lab and modify it (mpijob=./trap). Some students still had to download the code from Pacheco; this is easiest done by navigating to the book's webpage (link from the course homepage), then choosing "save the link address" (by right-mouse button or similar) on the link to the C code (labeled "C (updated 2000/01/23)" on Pacheco's page), then pasting this link address as argument to a wget command at the Linux prompt on the cluster, as in wget http://www.cs.usfca.edu/~peter/ppmpi/ppmpi_c.tar.Z This puts the file ppmpi_c.tar.Z into the current directory. This file is zipped, i.e., compressed, and can be unzipped by gunzip ppmpi_c.tar.Z which gives the file ppmpi_c.tar. To untar this, i.e., expand this directory's sub-directories and their contents, use tar xf ppmpi_c.tar You then have a directory called ppmpi_c. Change directory into it and list the sub-directories, which shows, among others, chap04, which contains, among others, trap.c.

We discussed some ideas for how to improve this code. Two specific issues right at the beginning are that the compilation with mpicc gives a warning (this should be fixed) and the output is poor. I pointed out that the output raises the question of what mathematical problem we are solving anyhow; looking through the code for a, b, and f(x) gives the information, and this allows us to obtain the true integral value I=1/3. Using this, I wanted to teach the reasonable habit of outputting all available information, such as, at least, the number of subintervals n, the step size h, the true value I, the approximation In (which is the variable total in Pacheco's code), and the error I-In (coded as I-total); in fact, since the error should have magnitude h^2, you may also want to print out h*h as well.

I spent some time demonstrating some pitfalls of C here, such as coding I=1/3 just like that, which gives the integer division 1/3=0 and hence float I=0.0. Moreover, this outputs as I=0.000000 for format conversion %f, which is what Pacheco uses in several places. Once we fixed the computation of I as I=1.0/3.0, we still saw something strange like error=I-total=0.000000, which is odd, since there should be some error on the order of h^2=1e-7. The issue is that the output format %f simply does not show enough digits. I strongly suggest to always use %24.16e, which is exponential format, normalized to one non-zero decimal digit before the decimal point and with 16 decimal digits after the decimal point, followed by a power of 10 as indicated by a notation such as e-12 at the end. Single-precision numbers have about 8 significant decimal digits of accuracy, so the above is clearly too many (%15.8e would suffice). But for very large n values, we will have reason to switch to double-precision numbers anyway, so I anticipate this with the %24.16e. The 24 indicates the field width of the output field. We need to allow for 8 additional characters here, since double-precision numbers range approximately from 1e-308 to 1e+308, hence up to 3 digits can be necessary in the exponent of 10, as in e-123. (See the short C sketch at the end of this class summary for these output issues.)

In the process of working on trap.c and the submission script, we learned a number of new vi commands: Lowercase x deletes the character under the cursor. When in command mode, while lowercase i switches into insert mode before (!) the position of the cursor, lowercase a switches into insert mode after (!) the position of the cursor; we needed this to append something at the end of a line (hence a for append); to jump to the end of the current line for insertion, use uppercase A.
When in command mode, lowercase o inserts a blank line and enters insert mode after (!) the line where the cursor is, while uppercase O inserts a blank line and enters insert mode before (!) the line where the cursor is. To jump to the beginning of the file, use ":1", which is a special case of going to any line by saying ":n" for any line number n. To jump to the end of the file, use uppercase G in command mode. CTRL-F and CTRL-B move forward and backward by one page, respectively. CTRL-U and CTRL-D move up and down by half a page, respectively. | |||
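The C sketch referred to above, illustrating the integer-division and output-format pitfalls from this lab (the numbers are illustrative only):

```c
/* Sketch of the two pitfalls discussed in this lab. */
#include <stdio.h>

int main(void)
{
  double I_wrong = 1 / 3;       /* integer division: 1/3 == 0, so I_wrong = 0.0 */
  double I       = 1.0 / 3.0;   /* correct: floating-point division */
  double h       = (1.0 - 0.0) / 3000.0;   /* example step size */
  double error   = 1.0e-7;      /* example error of magnitude about h^2 */

  printf("I_wrong = %f\n", I_wrong);        /* prints 0.000000 */
  printf("I       = %f\n", I);              /* only 6 digits: 0.333333 */
  printf("error   = %f\n", error);          /* prints 0.000000: too few digits */
  printf("error   = %24.16e\n", error);     /* shows the full picture */
  printf("h*h     = %24.16e\n", h * h);
  return 0;
}
```
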
Lecture 4 | We. 12/14/11 | Chapter 5: Collective Communication | |
Room 0450 | PDF transcript; video recording; class summary | ||
I started the class by discussing the expectations for the
performance study of Homework 3, Part (f). The point is to
run on as many nodes as possible and see if the code runs
faster. Since the amount of work needs to be large enough
for a speedup to become visible, expect to make n larger
(recall that Pacheco uses 1024, which is very small in
this sense). I explained the expectation in terms of a
graphical representation of speedup. To learn about this,
take a closer look now at the tech. rep. HPCF-2010-2,
which is posted on the HPCF webpage www.umbc.edu/hpcf
under Publications.

I also explained the C programming issues of command-line arguments, how to do test output of them, and how to use them to get the variables a, b, n in the example of the trapezoidal rule program. Notice, by the way, again that code that performs well is only acceptable if its results are numerically correct, so you should check your printouts of h*h and the error.

Then we started with Chapter 5 on Collective Communication. This is a chapter with powerful commands that implement essentially all reasonable communications involving all parallel processes. We covered the commands MPI_Bcast, MPI_Reduce, and MPI_Allreduce. | |||
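A minimal sketch combining the two ingredients from today, command-line arguments and the collective commands: Process 0 reads n from the command line, MPI_Bcast distributes it, and MPI_Reduce collects a result on Process 0. The "work" here is a trivial local sum chosen just for illustration, not the trapezoidal rule code itself.

```c
/* Sketch: command-line argument on Process 0, MPI_Bcast, MPI_Reduce. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int id, np, n = 0, i;
  double local_sum = 0.0, total = 0.0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  if (id == 0) {
    if (argc > 1) n = atoi(argv[1]);   /* e.g., ./a.out 1000000 */
    printf("Process 0 read n = %d from the command line\n", n);
  }

  /* every process needs n, so broadcast it from Process 0 */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

  /* trivial local work: each process sums its share of 1, ..., n */
  for (i = id + 1; i <= n; i += np)
    local_sum += (double) i;

  /* combine all local sums on Process 0; MPI_Allreduce would instead
     give the result to every process */
  MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (id == 0)
    printf("total = %24.16e (expected n*(n+1)/2)\n", total);

  MPI_Finalize();
  return 0;
}
```
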
Lecture 5 | Mo. 12/19/11 | Chapter 5: Collective Communication | |
Room 2404 | PDF transcript; video recording; class summary | ||
We continued with Chapter 5. The starting point was to
develop code for a parallel dot product as in Sec. 5.5,
but already using MPI_Allreduce from Sec. 5.6.
This and more are differences between my presentation and Pacheco's.

The design of a dot product led us to clarify first how to split vectors onto parallel processes. Then we introduced MPI_Gather, MPI_Scatter, and MPI_Allgather from Sec. 5.7, but using strictly vectors x and l_x as examples (not matrices as in Pacheco).

Finally, I lectured a little on the Unix make utility with an example of C code split into two files file1.c and file2.c. The make utility makes this more efficient by re-compiling only the files that have changed since the last compilation. These date/time-stamp comparisons are driven by the dependency lines (those with a colon ":") in the Makefile. To show this, I logged in to the cluster and showed the files supplied for HW 4 and 5, which include a Makefile and code split into files main.c and memory.c. I demonstrated "make" as well as "make clean". We will look at this more as part of the lab.
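A sketch of the parallel dot product described at the beginning of this summary, using MPI_Allreduce so that every process ends up with the result; the way the local pieces l_x and l_y are filled here is purely for illustration.

```c
/* Sketch: parallel dot product with MPI_Allreduce. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int id, np, i, n = 1024, l_n;
  double *l_x, *l_y, l_dot = 0.0, dot;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  l_n = n / np;                         /* assume np divides n for simplicity */
  l_x = (double *) malloc(l_n * sizeof(double));
  l_y = (double *) malloc(l_n * sizeof(double));

  /* fill the local pieces; the global index is id*l_n + i */
  for (i = 0; i < l_n; i++) {
    l_x[i] = 1.0;
    l_y[i] = (double) (id * l_n + i);
  }

  /* local partial dot product */
  for (i = 0; i < l_n; i++)
    l_dot += l_x[i] * l_y[i];

  /* sum the partial results; every process receives the global dot product */
  MPI_Allreduce(&l_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (id == 0)
    printf("dot = %24.16e\n", dot);

  free(l_x);
  free(l_y);
  MPI_Finalize();
  return 0;
}
```
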
MISTAKE ALERT: The examples of malloc() and calloc() in class
just say "x = malloc(...)". But malloc() and calloc()
return a pointer of the generic type void *, i.e., one that
does not point to a defined type. To use this for x, which is defined as
"double *x", we need to cast the output to this type, so
the example should read x = (double *) malloc(...). | |||
Lab 3 | Tu. 12/20/11 | Introduction to HW 4 and 5 on Power Method | |
HW 3 due | Room 2421 | HW 4 and 5; ver0.0suppliedfiles.tgz; class summary | |
We discussed the code contained in the ver0.0suppliedfiles
directory. There are two Matlab m-files; use driver_power.m to run an example of the power method for a small matrix (dimension n=4) in Matlab. This is what we want to translate into C in HW 5, namely to have a C function for the power method like my_power.m and the contents (in particular the output) of driver_power.m in the main() program in C.

There is the Makefile that controls the compilation and linking of the two C files main.c and memory.c. I showed how to systematically create a header file from a C file, using the example of a utilities.c, and I demonstrated how to compile one file at a time and then link, which is what the Makefile does. The executable is power, and the supplied file mpich2_pgi_power.qsub shows how to run this executable in the line that defines mpijob; the point of this demonstration is to show how to call power with command-line arguments.
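A sketch of the header-file pattern discussed here; the function shown is a made-up example, not the actual contents of utilities.c:

```c
/* utilities.c (sketch): function definitions */
#include "utilities.h"

double vector_sum(const double *x, int n)   /* hypothetical example function */
{
  int i;
  double s = 0.0;
  for (i = 0; i < n; i++)
    s += x[i];
  return s;
}
```

```c
/* utilities.h (sketch): one prototype per function defined in utilities.c,
   protected by an include guard */
#ifndef UTILITIES_H
#define UTILITIES_H

double vector_sum(const double *x, int n);

#endif
```
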
MISTAKE ALERT: The file
mpich2_pgi_power.qsub may have a mistake in it
(depending on which version I posted):
The mpijob line should read

SMALL TYPO in hw4and5.pdf: The text refers to the output file slurm.out; that should say qsub.out. | |||
Lecture 6 | We. 12/21/11 | Chapter 5: Collective Communication | |
Room 1403 | PDF transcript; video recording; class summary | ||
We covered the MPI commands of Chapter 5 in Pacheco
in the last two lectures. We used sensible examples
involving vectors and discussed how to create these in C.
Today's purpose is to discuss how to create matrices in C
and then use them in MPI commands.
My main point was that the C standard does not guarantee
anything about how the memory of a matrix in C is organized,
which is a problem for our purposes of parallel computing with MPI.
This is actually a problem for scientific computing in general,
since C does not really understand multi-dimensional arrays
at all, which makes it incompatible with other languages.
This shows up as an issue when you try to use existing libraries
such as BLAS or LAPACK.
To make our C code efficient and compatible, we will use
one-dimensional arrays only, with a column-major ordering of
the data, hence split a matrix A onto parallel processes as l_A
as blocks of columns to maintain the consecutiveness of the memory.
I explained this in hand-writing, see PDF transcript,
and then discussed the code supplied for HW 4 and 5,
which demonstrates this in the definition and example of l_A,
see main.c in the supplied files.

We used HW 3's trap code to demonstrate the need to include header files for C functions, using the example of pow() from math.h, and mentioned linking to the math library with -lm; see LDFLAGS in the Makefile in the files supplied for HW 4 and 5. I showed some of my solutions to HW 3 in terms of performance and accuracy. These aspects of the lecture were not videotaped. | |||
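A sketch of the storage convention described above: an n-by-n matrix A is allocated as a single one-dimensional array and entry A(i,j) (0-based) is accessed as A[i + j*n], so that each column is contiguous in memory; a block of columns stored locally as l_A then also occupies contiguous memory. The dimension and example values here are placeholders.

```c
/* Sketch: column-major storage of a matrix in a one-dimensional array. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int n = 4, i, j;
  /* allocate n*n doubles in one contiguous block */
  double *A = (double *) malloc(n * n * sizeof(double));

  /* A(i,j) is stored at A[i + j*n]: column j occupies A[j*n], ..., A[j*n + n-1] */
  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      A[i + j * n] = (double) (i == j);   /* identity matrix as an example */

  printf("A(1,1) = %f\n", A[1 + 1 * n]);

  free(A);
  return 0;
}
```
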
--- Christmas Break --- | |||
Lecture 7 | Mo. 01/09/12 | IEEE Standard for Floating-Point Arithmetic | |
Room 2404 | PDF transcript; video recording; class summary | ||
Handout: lecture copy of IEEE standard, Matlab examples, and
article on the Patriot missile. We covered the IEEE Standard 754 for binary floating-point arithmetic; more specifically, we covered its portion on number representation. It is implied that operations among such numbers are also of limited accuracy, which causes round-off errors of the same type as those caused by the representation itself. Since a number of students could not make it today, we covered this topic, which stands on its own and is not part of the ongoing development of MPI techniques. However, its ideas explain a number of observations that come up regarding the accuracy of the results of Homework 3 on the trapezoidal rule and which we will discuss in the next lab. | |||
Lab 4 | Tu. 01/10/12 | IEEE Standard for Floating-Point Arithmetic | |
HW 4 due | Room 2421 | class summary | |
We computed several numbers of the IEEE standard for
double-precision floating-point numbers concretely,
specifically, the largest number,
the smallest normalized number,
the smallest de-normalized number, and machine epsilon,
as well as the number x=0.1.
We looked at the Matlab commands eps, realmax, and realmin
and their documentation.
We obtained their binary representations in mathematical
notation, in terms of 64 bits, as well as the 16 hexadecimal
digits corresponding to these bits.
I demonstrated that Matlab can output this using "format hex"
and that we were able to predict this output. Finally, I did a little Matlab test in which you add h=0.1 ten times; notice that the final result is not 1, which you can see by subtracting 1 from the result. This shows the potential for accumulated round-off in repeated additions, even for a fairly large number like 0.1 added only a few times. One strategy to avoid this is to pick h as a computer number (e.g., here h=1/16). Another, more general one is to avoid repeated additions wherever possible. | |||
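The same quantities can also be inspected in C via float.h; here is a sketch (in C rather than Matlab) that additionally repeats the 0.1 experiment from the lab:

```c
/* Sketch: IEEE double-precision constants from float.h and the 0.1 experiment. */
#include <stdio.h>
#include <float.h>

int main(void)
{
  int i;
  double sum = 0.0, h = 0.1;

  printf("largest double       = %24.16e\n", DBL_MAX);
  printf("smallest normalized  = %24.16e\n", DBL_MIN);
  printf("machine epsilon      = %24.16e\n", DBL_EPSILON);

  /* add h = 0.1 ten times; the result is not exactly 1 */
  for (i = 0; i < 10; i++)
    sum += h;
  printf("sum of ten 0.1       = %24.16e\n", sum);
  printf("sum - 1              = %24.16e\n", sum - 1.0);
  return 0;
}
```
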
Lecture 8 | We. 01/11/12 | Chapter 11: Performance | |
Room 1403 | PDF transcript; video recording; class summary | ||
We discussed a number of considerations and best practices that go into designing and executing meaningful performance studies for parallel code. This presentation also includes remarks that explain the contents of slides 64 to 67 of the research slides presented on the first day of the semester. | |||
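For reference, the standard definitions used in such studies: if T_p(N) denotes the wall clock time using p parallel processes for a problem of size N, then the speedup is S_p(N) = T_1(N) / T_p(N) and the efficiency is E_p(N) = S_p(N) / p, with optimal values S_p(N) = p and E_p(N) = 1. These are the textbook definitions; I believe they match the conventions used in HPCF-2010-2.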
Lecture 9 | Mo. 01/16/12 | The Library Packages BLAS and LAPACK | |
Room 2404 | PDF transcript; video recording; class summary | ||
I introduced the BLAS (Basic Linear Algebra Subprograms) today.
For one thing, they are historically significant
and present an example of black-box software that
one might download and use inside larger code.
But we also looked at them to motivate several features
that I stressed so far, such as the one-dimensional array
storage for matrices, which exactly interfaces with BLAS.
I also discussed the order of indexing, which needs to be
right to avoid a dramatic difference in performance;
we can see this by trying i-j vs. j-i ordering in the
homework on the power method, while in turn the BLAS
function for matrix-vector multiplication is faster still.
To demonstrate BLAS, we went to www.netlib.org
and 'browsed' for 'blas', then looked at the code for ddot.f
as example. In that, we saw the idea of loop unrolling,
which makes efficient use of the CPU's cache and
can make code much faster. I want to stress here that this was a lecture and looking at historic Fortran code should be viewed as looking at examples. In reality, you would use libraries supplied with or for your compiler, such as ACML with the Portland Group compiler suite. Their code has been, at a minimum, compiled with an optimizing compiler and with the particulars of your hardware in mind (e.g., cache size, machine epsilon), and might be different in code from the original Fortran. The files supplied with the next lab show how to access the ACML libraries. In the lab, we will also look at the actual documentation available on our system. | |||
Lab 5 | Tu. 01/17/12 | How to Use BLAS | |
HW 5 due | Room 2421 | HW 6; Makefile; utilities.c; utilities.h; class summary | |
The assignment, as stated, expects programming in C to
time four different ways to compute the matrix-matrix product
C=A*B. Our goal for the lab itself should be to solve
this problem on paper, that is, the point is to
find each appropriate BLAS function and learn how to
use it on paper. Some of these functions are more complicated
to use than the ddot() function from yesterday's class. I would like to demonstrate that you can proceed as follows. Example: Consider the goal of using BLAS for a matrix-vector product.
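As one illustration of how such a "paper" solution might translate into an actual call (this is a sketch assuming the standardized CBLAS interface from cblas.h; the interface of the library actually used on the cluster, e.g., ACML, may spell this differently), the BLAS2 routine dgemv computes y = alpha*A*x + beta*y:

```c
/* Sketch: matrix-vector product y = A*x via the BLAS2 routine dgemv,
   assuming a CBLAS interface (cblas.h) is available. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
  int n = 4, i, j;
  double *A = (double *) malloc(n * n * sizeof(double));   /* column-major */
  double *x = (double *) malloc(n * sizeof(double));
  double *y = (double *) malloc(n * sizeof(double));

  for (j = 0; j < n; j++) {
    x[j] = 1.0;
    for (i = 0; i < n; i++)
      A[i + j * n] = (i == j) ? 2.0 : 0.0;    /* example matrix 2*I */
  }

  /* y = 1.0 * A * x + 0.0 * y, with A stored column-major and lda = n */
  cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 0.0, y, 1);

  for (i = 0; i < n; i++)
    printf("y[%d] = %f\n", i, y[i]);

  free(A); free(x); free(y);
  return 0;
}
```
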
| |||
Lecture 10 | We. 01/18/12 | Class cancelled due to Vollversammlung | |
Lecture 11 | Mo. 01/23/12 | Chapter 6: Grouping Data for Communication | |
Room 2404 | PDF transcript; video recording; class summary | ||
After a couple of weeks on scientific computing (not necessarily
parallel), we returned to learning about MPI.
Derived datatypes can be used in MPI to simplify the
calling sequences of MPI communication commands.
We introduced these following the examples in Pacheco,
with only slight modifications and additional comments
to motivate them better. We will continue to make the commands discussed more applicable next time. | |||
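A sketch in the spirit of the Chapter 6 example in Pacheco: the three values a, b, n are bundled into one derived datatype so that a single MPI_Bcast suffices. The MPI-1 calls MPI_Type_struct and MPI_Address are used here, matching the book's vintage; newer MPI versions spell these MPI_Type_create_struct and MPI_Get_address. Function and variable names are assumptions.

```c
/* Sketch: build a derived datatype that bundles a, b, n and broadcast it
   with a single MPI_Bcast (in the spirit of Pacheco, Chapter 6). */
#include <mpi.h>

void Get_data(float *a, float *b, int *n, int my_rank)
{
  int          block_lengths[3] = {1, 1, 1};
  MPI_Datatype typelist[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
  MPI_Aint     displacements[3], start_address, address;
  MPI_Datatype mesg_mpi_t;

  if (my_rank == 0) {
    /* Process 0 would obtain a, b, n here (e.g., from the command line) */
  }

  /* displacements of b and n relative to a, computed on each process */
  MPI_Address(a, &start_address);
  displacements[0] = 0;
  MPI_Address(b, &address);
  displacements[1] = address - start_address;
  MPI_Address(n, &address);
  displacements[2] = address - start_address;

  MPI_Type_struct(3, block_lengths, displacements, typelist, &mesg_mpi_t);
  MPI_Type_commit(&mesg_mpi_t);

  /* one collective call sends a, b, and n together from Process 0 */
  MPI_Bcast(a, 1, mesg_mpi_t, 0, MPI_COMM_WORLD);

  MPI_Type_free(&mesg_mpi_t);
}
```
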
Lab 6 | Tu. 01/24/12 | Use of BLAS2 and BLAS3 | |
HW 6 due | Room 2421 | HW 7; class summary | |
By working on Problem 1 of HW 7, we in effect continued from last time. We talked about solutions with naive code and BLAS1, then developed BLAS3 and BLAS2 solutions. I pointed out the strategy of working on BLAS3 first, since that is in fact easier than BLAS2. Then I showed my solution. We observed that BLAS3 is a factor of 8-10 faster than BLAS2, which in turn is the same factor faster than BLAS1. Naive code can behave very differently, depending on how the for loops are ordered. I suggested to really formulate the lessons learned from Problem 1 before working on Problem 2. | |||
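A sketch of the loop-ordering point for the naive code: with column-major storage, the innermost loop should run down a column (over i), so that A and C are accessed contiguously; an ordering with i outermost instead strides through memory. The matrices are assumed to be stored as one-dimensional arrays as in the homework.

```c
/* Sketch: naive C = A*B for n-by-n matrices in column-major storage,
   with the loops ordered so that the innermost index runs down columns. */
void matmat_naive(const double *A, const double *B, double *C, int n)
{
  int i, j, k;

  for (j = 0; j < n; j++)             /* initialize C to zero */
    for (i = 0; i < n; i++)
      C[i + j * n] = 0.0;

  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
      double b_kj = B[k + j * n];
      for (i = 0; i < n; i++)         /* contiguous access to A(:,k) and C(:,j) */
        C[i + j * n] += A[i + k * n] * b_kj;
    }
}
```
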
We. 01/25/12 | |||
Room 1403 | Class cancelled due to field trip conflict for several students | ||
Lecture 12 | Mo. 01/30/12 | Chapter 6: Grouping Data for Communication | |
Room 2404 | PDF transcript; video recording; class summary | ||
We continued the material from last time, but this time applied to the data structure of one-dimensional arrays with column-oriented storage for matrices. This was useful for thinking through the issues of a derived data type again, and then for realizing facts about them and the MPI_Scatter function. This led to the solution shown in Pacheco in Sec. 8.4.5. | |||
Lab 7 | Tu. 01/31/12 | Discussion of HW 3 on the Trapezoidal Rule | |
HW 7 due | Room 2421 | class summary | |
We discussed the solution to Homework 3 on the trapezoidal rule.
Specifically, I showed the actual C code that implements
the load-balancing of splitting n trapezoids onto np processes,
as explained in class yesterday. I continued by showing
the rest of the code, for instance, test output
that lists this splitting in a nice and readable form. Then I showed computed results of the error for several values of n=1e3, 1e4, ..., 1e9 and essentially all possible numbers of parallel processes in each case. We learned what it might look like when results are not reliable. We discussed the influence of single- vs. double-precision, accumulated round-off, and the most accurate implementation of the trapezoidal rule. We ended by looking at my Matlab code that collects the data from output files, organizes them into tables of wall clock times, speedup, and efficiency, and plots speedup and efficiency. | |||
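The core of such a load-balancing computation can be sketched in a few lines (a fragment, with id the process rank and np the number of processes; the first n % np processes get one extra trapezoid):

```c
/* Sketch: split n trapezoids onto np processes as evenly as possible. */
int local_n, offset, remainder;

remainder = n % np;
local_n   = n / np + (id < remainder ? 1 : 0);     /* trapezoids on this process */

/* global index of this process's first trapezoid */
offset = id * (n / np) + (id < remainder ? id : remainder);
```
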
Lecture 13 | We. 02/01/12 | Chapter 7: Communicators and Topologies | |
Room 1403 | PDF transcript; video recording; class summary | ||
We introduced the MPI commands for creating a communicator from a subgroup of the processes of a given communicator and for subdividing a communicator into several communicators. We discussed in particular the commands needed for the special case of a Cartesian communicator grid. We will use those for an implementation of Fox's algorithm next time. | |||
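A sketch of the Cartesian-grid calls we will need for Fox's algorithm next time: a q-by-q process grid with row and column communicators, where q is assumed to be the square root of the number of processes (names and structure are illustrative, not the code from class).

```c
/* Sketch: create a q x q Cartesian grid communicator and split it into
   row and column communicators, as needed for Fox's algorithm. */
#include <math.h>
#include <mpi.h>

void setup_grid(MPI_Comm *grid_comm, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
  int np, q, grid_rank;
  int dims[2], periods[2], coords[2], remain_dims[2];

  MPI_Comm_size(MPI_COMM_WORLD, &np);
  q = (int) (sqrt((double) np) + 0.5);      /* assume np is a perfect square */

  dims[0] = dims[1] = q;
  periods[0] = periods[1] = 1;              /* wraparound, used for shifting */
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, grid_comm);

  /* (coords[0], coords[1]) are this process's row and column in the grid */
  MPI_Comm_rank(*grid_comm, &grid_rank);
  MPI_Cart_coords(*grid_comm, grid_rank, 2, coords);

  remain_dims[0] = 0;  remain_dims[1] = 1;  /* keep dimension 1: same grid row */
  MPI_Cart_sub(*grid_comm, remain_dims, row_comm);

  remain_dims[0] = 1;  remain_dims[1] = 0;  /* keep dimension 0: same grid column */
  MPI_Cart_sub(*grid_comm, remain_dims, col_comm);
}
```
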
Lecture 14 | Mo. 02/06/12 | Chapter 10: Design and Coding of Parallel Programs (Sorting) | |
Room 2404 | PDF transcript; video recording; class summary | ||
We discussed the sorting example in Chapter 10.
The narrow purpose is to show the process of designing
MPI code and to introduce a new MPI command MPI_Alltoall.
The broader purposes include introducing the v-versions
of many MPI commands and pointing out how to find
MPI commands for particular purposes when needed.
We used a version that is simpler (since no dynamic memory
allocation is needed inside the algorithm); see the note at the end
of class. Notice that we interrupted the presentation of Chapter 7 with this topic. The reason is that the code for Fox's algorithm uses some of the spirit presented here as well as some pieces from the lab tomorrow. | |||
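A minimal sketch of MPI_Alltoall, the new command from this chapter: every process sends one distinct block to every other process (block size 1 here for simplicity; the v-version MPI_Alltoallv allows varying block sizes, as needed in the sorting code).

```c
/* Sketch: MPI_Alltoall with one integer per destination process. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int id, np, i, *sendbuf, *recvbuf;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  sendbuf = (int *) malloc(np * sizeof(int));
  recvbuf = (int *) malloc(np * sizeof(int));

  for (i = 0; i < np; i++)
    sendbuf[i] = 100 * id + i;       /* block i goes to process i */

  /* block i of sendbuf on process id arrives as block id of recvbuf on process i */
  MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

  printf("Process %d received:", id);
  for (i = 0; i < np; i++)
    printf(" %d", recvbuf[i]);
  printf("\n");

  free(sendbuf); free(recvbuf);
  MPI_Finalize();
  return 0;
}
```
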
Lab 8 | Tu. 02/07/12 | Discussion of HW 2 on Ringsend | |
Room 2421 | class summary | ||
We discussed Problem 4 of HW 2 from earlier in the semester. First, I feel that it needs to be stated clearly what we are asked to do: if we are Process id, then we are to send a message to the process with the next higher id number, and to Process 0 if we are the process with the highest id number. This latter fact is expressed in Pacheco by saying (id+1)%np with the mod operation, instead of just saying id+1, which would leave open what to do for the process with the highest number id=np-1. This problem is known as a ringsend, and like the load-balancing code, this is useful code to have tested carefully for use in other problems, like Fox's algorithm in Ch. 7.

Second, we jointly developed a solution on the board; the key is merely to design code for the destination dest in the MPI_Send call and the source in MPI_Recv. I then proceeded to show several versions of my solution. It brings out that apparently working code might not actually be correct. I used this to motivate a number of lessons learned:
| |||
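As a small illustration of the destination/source arithmetic at the heart of the ringsend (a fragment, not the solution code shown in class; id and np are assumed to hold the rank and number of processes):

```c
/* Sketch: ringsend indices for process id out of np processes. */
int msg_out = id, msg_in;
MPI_Status status;

int dest   = (id + 1) % np;        /* next process; process np-1 wraps around to 0 */
int source = (id - 1 + np) % np;   /* previous process; process 0 receives from np-1 */

/* Note: a blocking MPI_Send before MPI_Recv on every process relies on
   buffering and can deadlock for large messages -- one of the pitfalls
   a problem like this can expose. */
MPI_Send(&msg_out, 1, MPI_INT, dest,   0, MPI_COMM_WORLD);
MPI_Recv(&msg_in,  1, MPI_INT, source, 0, MPI_COMM_WORLD, &status);
```
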
Lecture 15 | We. 02/08/12 | Chapter 7: Fox's Algorithm | |
Room 1403 | PDF transcript; video recording; class summary | ||
We discussed the idea and implementation of Fox's algorithm. This is an example of two things: a block-matrix-based calculation, like those inside BLAS and similar functions, and an application of a Cartesian grid communicator. Notice that this is also the first time that our MPI commands do not use MPI_COMM_WORLD, but rather row_comm and col_comm. As the note at the end clarifies, my version of the code differs from Pacheco's. | |||
Lab 9 | Fr. 02/10/12 | Discussion of HWs 4 and 5 on the Power Method | |
Room 2421 | driver_plot.m; class summary | ||
This lab was a make-up for the class cancellation on 01/25/12. I presented the thinking behind how you design the algorithm for the matrix-vector product that is at the heart of the algorithm. Then we looked at its implementation as well as the main program, the utility functions, and the function for the power method.

We looked at my results for various versions of the algorithm. We started with results for an inefficient ordering of the i-j loops in the local matrix-vector product. This made an order of magnitude difference in the timings. The speedup and efficiency plots were better than optimal, which flags the case of an inefficient one-process run. Then we looked at several other versions of the code, which for instance saved one or two of the MPI commands per iteration. We also looked at a version using BLAS and the MPI_Reduce_scatter command. We then compared to results obtained on the cluster tara in the UMBC High Performance Computing Facility (www.umbc.edu/hpcf). That cluster has a network with lower latency, which was visible in the results by scaling to larger numbers of processes. I finally showed how I had used a Matlab code to assemble the results and produce the plots; see the attached file driver_plot.m. | |||