Applied Computational Science II, Computer Lab Instructions for Tuesday, 08 October 2013, 3:30-6:00pm, room 152 Dirac Science Library.
This lab introduces OpenMP, which can be used to write parallel programs on shared memory systems.
The lab machines are dual-processor, with each processor having 4 cores, which means all eight cores can cooperate on a parallel program, so we can do OpenMP experiments directly on the lab machines. This will require copying certain files into your directory, using the editor to make some changes, invoking the correct compiler with the appropriate switches, running the executable program, and comparing the execution times for different numbers of parallel threads.
The OpenMP skills you learn can also be used on FSU's Research Computing Center (RCC) cluster, which includes nodes with 48 processors. To do so, you would need to request an account on the RCC system, and learn a little about how to use the non-interactive batch system to execute jobs. OpenMP is a widely used system, so the skills you practice today can be used on your laptop, desktop, or any research cluster that uses C, C++, or FORTRAN.
The exercises involve the following programs: hello, quad, md, jacobi, and heated_plate.
For each exercise, there is a source code program that you can use. The source code is available in C, C++, FORTRAN77 and FORTRAN90 versions, so you can stick with your favorite language. (If you are a fan of PYTHON, the only way I know to use OpenMP requires you to write a PYTHON program whose parallel loops are actually written in C. You are welcome to try such an approach.) You can copy each code, in a language that suits you, by using a browser.
You may want to refer to http://people.sc.fsu.edu/~jburkardt/presentations/acs2_openmp_2013.pdf, the lecture notes on OpenMP.
For our first exercise, we're just going to try to run a program that uses OpenMP. Pick a version of the hello program, in the language of your choice.
You don't need to look at the text of the program, but if you do, you'll see it's a little more complicated than the average "Hello, world!" program, because I added some extra sample OpenMP calls.
Compile the program. The lab machines have the Gnu compilers. Sample compilation statements include:
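For example, assuming you saved the source file as hello.c, hello.cpp, hello.f or hello.f90, the usual GNU invocations would be:

gcc -fopenmp hello.c
g++ -fopenmp hello.cpp
gfortran -fopenmp hello.f
gfortran -fopenmp hello.f90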
The compilation should create an executable program called "a.out". Rename your compiled program to "hello":
mv a.out hello
Run your compiled program, using the command
./hello

Although the program is set up with OpenMP, you probably haven't defined the number of threads to use, so the program will use the default, which might be 8 (one thread per core).
Explicitly set the number of threads to 2, run the program again, and note the difference.
export OMP_NUM_THREADS=2    <=== NO SPACES around the = sign!
./hello
Notice that the program itself did not change at all, only the environment, the value of OMP_NUM_THREADS. You can experiment with other values of this quantity. On some systems, your thread request can't exceed the number of cores available. On others, the thread request can be as high as you like.
If you wish to save a copy of the output file, simply use the output redirection command:
./hello > hello_output.txt

Now the file hello_output.txt contains the information that the program would otherwise have printed to the screen.
For this exercise, pick a version of the quad program, in the language of your choice.
The quad program approximates the integral of

f(x) = 50 / ( pi * ( 2500 x^2 + 1 ) )

from 0 to 10, by evaluating the function at n equally spaced points, and multiplying the sum by (10-0)/n.
Your program should be ready to compile and run. Try to do it. I suggest you rename your compiled program quad. The program prints out a "wallclock time" measurement, but this is currently zero, because the program is not calling the necessary OpenMP function to measure this number. In fact, it isn't using OpenMP at all!
Your task is to make 4 modifications to the program so that it can take advantage of OpenMP. These changes will involve: adding a reference to the OpenMP include file; calling omp_get_wtime() before and after the main loop, so that the wallclock time can actually be measured; adding a parallel directive with appropriate private and shared clauses; and adding a for (or do) directive with a reduction clause for the variable that accumulates the sum. A sketch of the result appears below.
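As a rough guide only, here is a minimal C sketch of what the fully modified program might look like; the variable names are illustrative and will not match the distributed quad.c exactly:

#include <stdio.h>
#include <math.h>
#include <omp.h>                          /* change 1: the OpenMP include file */

int main ( )
{
  double a = 0.0;
  double b = 10.0;
  int i;
  int n = 1000000;
  double q = 0.0;
  double wtime;
  double x;

  wtime = omp_get_wtime ( );              /* change 2: start the timer */

# pragma omp parallel shared ( a, b, n ) private ( i, x )    /* change 3 */
# pragma omp for reduction ( + : q )                         /* change 4 */
  for ( i = 0; i < n; i++ )
  {
    x = ( ( double ) ( n - 1 - i ) * a + ( double ) i * b ) / ( double ) ( n - 1 );
    q = q + 50.0 / M_PI / ( 2500.0 * x * x + 1.0 );
  }

  q = q * ( b - a ) / ( double ) n;

  wtime = omp_get_wtime ( ) - wtime;      /* change 2, continued: stop the timer */

  printf ( "  Estimate = %g\n", q );
  printf ( "  Wallclock time = %g seconds\n", wtime );

  return 0;
}

A program like this would be compiled with something like gcc -fopenmp quad.c -lm.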
Once you've made your changes, compile the program. Your compile statement must now include a switch indicating that OpenMP is in use. For instance, the C program would now be compiled by
gcc -fopenmp quad.c
You set the number of threads with a command like
export OMP_NUM_THREADS=4

Run the program 4 times, setting the number of threads to 1, 2, 4 and 8, and recording the value of the wall clock time each time. Do you see a pattern?
For this exercise, pick a version of the md program, in the language of your choice.
The md program is a simple example of a molecular dynamics code. It randomly places many particles in a 3D region, gives them an initial velocity, and then tracks their movements. The particles are influenced not just by their own momentum, but by attractions to other particles, whose strength depends on distance.
The program is divided into a few large sections. One section, the compute() function, uses a large part of the computational time. We are going to try to make some simple modifications to this function so that the program runs in parallel, and faster.
The first change you must make to the program is to add a reference to the OpenMP include file.
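Depending on your language, this reference looks like:

#include <omp.h>        (C or C++)
include 'omp_lib.h'     (FORTRAN77)
use omp_lib             (FORTRAN90)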
The second change will allow us to report the time taken by the big loop in the main program. Just before the loop, call omp_get_wtime() and save the value as wtime. Just after the loop, call omp_get_wtime() again, update the value of wtime, and print it.
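In C, for instance, the timing change might look like this sketch, assuming you have declared a double precision variable wtime:

  wtime = omp_get_wtime ( );

  ...the big loop...

  wtime = omp_get_wtime ( ) - wtime;
  printf ( "  The big loop took %g seconds.\n", wtime );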
Our third change is to parallelize the loop in the compute() routine. This is actually a nested loop. Our OpenMP directives will go just before the first of the two loop statements.
If you're using C or C++, your parallel directive should have the form:
# pragma omp parallel private ( ... ) shared ( ... )

where you must place every loop variable into one list or the other (except for any reduction variables...and we will have two of those!)
If you're using C or C++, your "for" directive should have the form:
# pragma omp for reduction ( + : pot, kin )

because both pot and kin are reduction variables.
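Stacked together, the top of the parallelized loop might look roughly like this; the private and shared lists here are placeholders, which you must fill in with the actual variable names that appear in your copy of md:

# pragma omp parallel private ( i, j, ... ) shared ( nd, np, ... )
# pragma omp for reduction ( + : pot, kin )
  for ( i = 0; i < np; i++ )
  {
    ...
  }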
Remember that, in the "private" and "shared" lists, OpenMP only wants to see true variables. The compiler will complain if it sees the names of FORTRAN parameters or C/C++ "defined" quantities, because these are not actually variables. If you have a quantity called pi2 showing up in this loop, make sure you know whether it is really a variable or not.
Once you have made the changes, compile and run the program again, comparing the wall clock times you get with 1, 2, 4 and 8 threads.
If you are interested in this topic, here are two more things you can look into, some other time:
For example, consider nested loops written in this order:

do i = 1, nd
  do j = 1, np
    ...

We can still parallelize this set of loops, but can you see why it might not be a good idea, assuming the value of nd is 3?
For this exercise, pick a version of the jacobi program, in the language of your choice.
The jacobi program solves a linear system using the Jacobi iteration. The program already includes some OpenMP information, such as the include statement and the timing calls, but more changes are needed if the program is to run in parallel.
The problem size is set in the main program, in the variable n; you can change the value of this variable to make the problem small for debugging (n=10) or big for a better test of the timing (n=500).
We are going to modify the routine called jacobi which solves the linear system. Inside this routine there is an iteration, and the iteration involves a number of loops or vector operations. We can't make the iteration loop itself parallel. But inside that loop, there are several smaller loops that we can work with:
iteration loop begins
  copy x to x_old loop
  compute new value of x loop
  compute ||x-x_old|| loop
iteration loop ends
We can use a single OpenMP parallel statement to apply to all three inner loops. In FORTRAN the end parallel directive makes it clear where we are stopping. In C and C++, this will only work if we also use a pair of curly brackets to enclose the three loops:
iteration loop begins
  parallel directive private(...) shared(...)
  {
    copy x to x_old loop
    compute new value of x loop
    compute ||x-x_old|| loop
  }
iteration loop ends
The loops are still sequential until we apply the appropriate OpenMP directive to request that they be executed in parallel. Mark each of the three loops. There is a reduction variable in the third loop. There is NOT a reduction variable in the second loop!
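As a rough guide, and assuming the linear system is stored in arrays a and b, with solution vectors x and x_old (your copy of jacobi may use different names and a different matrix indexing), the parallelized region might look like this C sketch:

  diff = 0.0;

# pragma omp parallel private ( i, j ) shared ( a, b, n, x, x_old )
{
/* Loop 1: copy x into x_old; no reduction needed. */
# pragma omp for
  for ( i = 0; i < n; i++ )
  {
    x_old[i] = x[i];
  }

/* Loop 2: the Jacobi update; each x[i] is written by exactly one thread, so no reduction. */
# pragma omp for
  for ( i = 0; i < n; i++ )
  {
    x[i] = b[i];
    for ( j = 0; j < n; j++ )
    {
      if ( j != i )
      {
        x[i] = x[i] - a[i+j*n] * x_old[j];
      }
    }
    x[i] = x[i] / a[i+i*n];
  }

/* Loop 3: the size of the change; diff IS a reduction variable. */
# pragma omp for reduction ( + : diff )
  for ( i = 0; i < n; i++ )
  {
    diff = diff + ( x[i] - x_old[i] ) * ( x[i] - x_old[i] );
  }
}

Note that diff must be initialized before the parallel region begins.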
When you think your code is correct, compile and run it with 1, 2 and 4 threads. You should expect to see a significant improvement in time.
For the assignment, pick a version of the heated_plate program, in the language of your choice.
The heated_plate program is designed to solve for the steady state temperature of a rectangular metal plate which has three sides held at 100 degrees, and one side at 0 degrees.
A mesh of m by n points is defined. Points on the boundary are set to the prescribed values. Points in the interior are given an initial value, but we then want to try to make their values reflect the way heat behaves in real systems. Our simplified algorithm is an iteration. Each step of the iteration updates the temperature at an interior point by replacing it by the average of its north, south, east and west neighbors. We continue the iteration until the average change between old and new values is less than some user-specified tolerance epsilon.
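In C, and assuming the old and new temperatures are stored in arrays u and w (the distributed program may use different names), one sweep of the iteration might look like this sketch:

  diff = 0.0;
  for ( i = 1; i < m - 1; i++ )
  {
    for ( j = 1; j < n - 1; j++ )
    {
      w[i][j] = ( u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1] ) / 4.0;
      diff = diff + fabs ( w[i][j] - u[i][j] );
    }
  }
  diff = diff / ( double ) ( ( m - 2 ) * ( n - 2 ) );   /* the average change */

The iteration continues as long as diff exceeds epsilon.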
This program expects to read the value of epsilon from the command line. Thus, if you were going to run the program interactively, you might type something like
./heated_plate 0.001
As in the jacobi program, there is an iteration loop that we cannot parallelize, but this loop contains three computational loops, each of which can be made parallel.
Your assignment is to modify the program to take advantage of OpenMP, and to produce a table showing the execution time required to run your modified program on 1, 2 and 4 threads.
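Your table might look something like this, with the blanks replaced by your measured times:

Threads    Wall clock time (seconds)
-------    -------------------------
   1       _______
   2       _______
   4       _______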
To get credit for this assignment, you must submit the following information to the lab instructor: your modified version of the heated_plate program, and your table of execution times for 1, 2 and 4 threads.
Last revised on 09 October 2013.