last edited: 2022-05-12 15:05:47 +0000

CS 758: Programming Multicore Processors (Fall 2013 Section 1 of 1)

GPU: 10/30

You should do this assignment alone. No late assignments.

Filelist for the assignment:

The purpose of this assignment is for you to familiarize yourself with GPGPU computing platforms (CUDA) and to gain experience with GPGPU-specific optimizations. You will be given a basic implementation of an algorithm that runs on the GPU, and you will improve it step by step by applying GPGPU optimization principles.

Important: CUDA can be tricky, especially if you make a mistake. Error messages are often cryptic and uninformative. Start this assignment early! If you run into any problems, post to the email list.

The problem

For this assignment you will again be implementing the Ocean algorithm. You will compare the performance of your GPU-optimized algorithm to your solution from Homework 1. A simple solution to Homework 1 is also included in the template files; feel free to use it if you want.

The hardware

You will be using the Euler cluster. You should have received, or will soon receive, an email with a username and temporary password. (MAKE SURE YOU RESET YOUR PASSWORD!) Read the above tutorial that describes the hardware configuration.

Job submission

This assignment was originally set up to submit jobs to the Torque queue. For this assignment, please just run jobs directly on euler01.

To get started:

local $ ssh user@euler.wacc.wisc.edu
euler $ ssh euler01
euler01 $ scp <username>@ale-01.cs.wisc.edu:/p/course/cs758-david/public/html/Fall2013/handouts/homeworks/hw6-dist.tgz .
euler01 $ tar -x -f hw6-dist.tgz
euler01 $ mv hw4-dist hw6
euler01 $ cd hw6
euler01 $ make
euler01 $ ./serial_ocean.sh

You shouldn’t have any problems as long as your code finishes quickly and you don’t leave cuda-gdb open for long periods of time (there have been a few bugs where cuda-gdb sometimes blocks access to all other GPUs).

Information on the hardware provided in the Euler cluster is available here. You will use one of the Fermi cards (Tesla 2070/2050 or GTX 480). The Tesla cards each have 448 CUDA cores (14 SMs); the GTX 480 has 480 CUDA cores (15 SMs).

Distributed with CUDA 5.5 is an application called computeprof, which concisely presents the performance counters available on NVIDIA GPUs. To use this program, log in to the Euler cluster with @@ssh -X@@ so that X11 is forwarded. You can then run it using @@/usr/local/cuda/5.5.22/cuda/bin/computeprof@@. I recommend doing this from on campus, where the network bandwidth is much higher. You can use computeprof to diagnose the bottlenecks in each implementation of the algorithm.

Additional Information

Dan Negrut is currently teaching a GPU Computing course (ME964). If you need additional info for your homework, you may find what you need at his course web page: http://sbel.wisc.edu/Courses/ME964/2013/ There is also a forum where students in the class post questions/answers. It is here: http://sbel.wisc.edu/Forum/viewforum.php?f=15

Step 1: Porting the CPU algorithm

I have included this implementation of the @@ocean_kernel@@ in the template files. You can find it in cuda_ocean_kernels.cu after @@#ifdef VERSION1@@. Although considerably more verbose, it is a mostly literal translation of the algorithm in @@omp_ocean.c@@ with OpenMP static partitioning. Each thread gets a chunk of locations within the red/black ocean grid and updates those locations. Study this code and be sure you understand how it works.
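To make the static partitioning concrete, here is a minimal host-side sketch in C of how a contiguous chunk might be assigned to each thread. It is not the code in cuda_ocean_kernels.cu; the names `chunk_t` and `chunk_for_thread` are hypothetical, and the grid is treated as a flat run of `n` cells. In a CUDA kernel, `tid` would typically be computed as `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: half-open range [begin, end) of cell indices. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Hypothetical helper: contiguous chunk owned by thread `tid` when `n`
 * cells are divided among `nthreads` threads (static partitioning). */
chunk_t chunk_for_thread(size_t n, size_t nthreads, size_t tid)
{
    assert(nthreads > 0 && tid < nthreads);

    /* Round up so every cell is covered even when n is not a multiple
     * of nthreads; trailing threads may get a short or empty chunk. */
    size_t per_thread = (n + nthreads - 1) / nthreads;

    chunk_t c;
    c.begin = tid * per_thread;
    if (c.begin > n) {
        c.begin = n;
    }
    c.end = c.begin + per_thread;
    if (c.end > n) {
        c.end = n;
    }
    return c;
}
```

With this layout, at any given loop step neighbouring threads touch addresses that are `per_thread` cells apart, which is the kind of access pattern that shows up as memory divergence on the GPU.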

Step 2: Reduce memory divergence (Convert algorithm to “SIMD”)

Implement @@VERSION2@@ of @@ocean_kernel@@. This version of the kernel takes a step towards reducing memory divergence. Instead of giving each thread a chunk of the array to work on, rewrite the algorithm so that the threads in each block work on adjacent elements of the same color (i.e., for a red iteration, thread 0 will work on element 0, thread 1 on element 2, thread 2 on element 4, thread 3 on element 6, etc.).
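The thread-to-element mapping described here can be sketched as plain C index arithmetic. This is only an illustration under stated assumptions: the grid is a row-major `n` x `n` array with `n` even, red cells are those where `row + col` is even, and boundary handling is ignored. The names `RED`, `BLACK`, and `cell_for_thread` are hypothetical and do not come from the template files.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical color codes: red cells have (row + col) even. */
enum { RED = 0, BLACK = 1 };

/* Hypothetical helper: flat row-major index of the cell that thread
 * `tid` updates during an iteration over `color`, so that consecutive
 * threads handle consecutive cells of that color. Assumes n is even. */
size_t cell_for_thread(size_t tid, size_t n, int color)
{
    assert(n > 0 && n % 2 == 0);
    assert(color == RED || color == BLACK);

    size_t half = n / 2;      /* cells of one color per row */
    size_t row = tid / half;  /* row this thread lands in */
    size_t k = tid % half;    /* position among that row's same-color cells */

    /* The first cell of a given color shifts by one column on every
     * other row, which is what produces the checkerboard. */
    size_t col = 2 * k + ((row + (size_t)color) % 2);

    return row * n + col;
}
```

For a red iteration on the first row this gives threads 0, 1, 2, 3 the elements 0, 2, 4, 6, matching the example in the text. Neighbouring threads are now two elements apart rather than a whole chunk apart.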

Step 3: Further reduce memory divergence (Modify data structure to be GPU-centric).

Implement @@VERSION3@@ of @@ocean_kernel@@. Instead of using one flat array to represent the ocean grid, split it into two arrays, one for the red cells and one for the black cells. You should start by writing two other kernels: one that splits the grid object into red_grid and black_grid, and one that merges red_grid and black_grid back into the grid object.
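As a starting point for thinking about the two helper kernels, here is a sequential C sketch of the split and merge index arithmetic. It rests on assumptions that may not match the template files: a row-major `n` x `n` grid with `n` even, `float` elements, red cells where `row + col` is even, and the function names `split_grid` and `merge_grid`, which are hypothetical. In a CUDA version, each `(row, k)` pair of the loops would most likely become one thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy an n x n row-major grid into two packed
 * arrays of n * n / 2 elements each, one per color. Assumes n is even
 * and that red cells are those with (row + col) even. */
void split_grid(const float *grid, size_t n,
                float *red_grid, float *black_grid)
{
    assert(n % 2 == 0);
    size_t half = n / 2;
    for (size_t row = 0; row < n; ++row) {
        for (size_t k = 0; k < half; ++k) {
            size_t red_col = 2 * k + (row % 2);
            size_t black_col = 2 * k + ((row + 1) % 2);
            red_grid[row * half + k] = grid[row * n + red_col];
            black_grid[row * half + k] = grid[row * n + black_col];
        }
    }
}

/* Hypothetical helper: inverse of split_grid, writing the packed red
 * and black arrays back into the interleaved n x n grid. */
void merge_grid(float *grid, size_t n,
                const float *red_grid, const float *black_grid)
{
    assert(n % 2 == 0);
    size_t half = n / 2;
    for (size_t row = 0; row < n; ++row) {
        for (size_t k = 0; k < half; ++k) {
            size_t red_col = 2 * k + (row % 2);
            size_t black_col = 2 * k + ((row + 1) % 2);
            grid[row * n + red_col] = red_grid[row * half + k];
            grid[row * n + black_col] = black_grid[row * half + k];
        }
    }
}
```

After the split, the cells of one color in a row are contiguous in memory, so consecutive threads working on the same color read and write consecutive addresses.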

If you’re feeling adventurous, feel free to add any other optimizations to this implementation. Just describe them in your write-up.

Tips and Tricks

What to Hand In

Please turn this homework in on paper at the beginning of lecture. You must include: