Saturday, October 18, 2014

Using Multiple NVIDIA GPUs with OpenCV Part 1

Image processing can be a computation intensive task. If the user needs real time performance in processing high quality video, there is a good chance that  a single GPU will not suffice.

At this time the latest OpenCV release is 3.0-alpha, the library does not provide assistance in utilizing multiple Nvidia GPUs. Here is a link for reference: OpenCV CUDA Doc. It basically tells you that in order to split tasks between GPUs one needs to create threads and use cuda::setDevice(int) or gpu::setDevice(int) depending on what version of OpenCV you have.

Hopefully in this tutorial I can give you a good description of how to best create a program that utilizes 'X' amount of Nvidia graphics cards. Realize however that this is my own method of solving the problem, I didn't take this from anyone else and I didn't see any examples on the web detailing a way to solve the problem. If you have a better way of sharing data between threads let me know. I did most of the coding in a 'C' type fashion, shying away from the C++ thread class and settling for pthreads instead.

Program Steps:

1. Figure out how many CUDA devices are in the system, this can be accomplished with:

So this code will let us know that we are in fact using CUDA enabled devices, have OpenCV complied for CUDA support, and will let us know some device information. Also note that we are using namespace cv and namespace cv::gpu or cv::cuda. With this information we are going to create "cuda_device_count" amount of threads, each thread is going to manage its own CUDA devcie.

2. Create thread arguments. We haven't created any threads yet, but we need to have some sort of object that can be shared between threads that provide input data to them. Do accomplish this, we need to create a structure that contains all of the thread's initial arguments.

Displayed is the data structure that is going to be passed into the pthread as its argument. This may appear overwhelming, but it's actually quite simple with some further explanation. The first argument is the device ID, I just assign each thread its own id, starting at zero. The next argument is a pointer to an integer array "vc_cond_i", basically this variable lets the threads know who's turn it is to read from the VideoCapture Object. We will discuss the CircularBuffer_t type at a later time, but its used to pass data from the gpu threads back to the main() thread. Next is a pointer to a VideoCapture object, we create a video capture object in the main thread, and pass a pointer to the object into the gpu thread's arguments. All of the gpu threads are using pointers to the same VideoCapture object. Finally we have the pthread condition and mutex variables that protect the VideoCapture object from multiple threads trying to read from it at the same time. We will get into their uses later.

3. Let us initialize some thread arguments and create the pthread_t.

So what we have done is create an array of thread arguments using our type 'pc_args_t', an array of pthread condition variables, a single mutex, an array of integers 'vc_cond_i' that I explained earlier, and the actual pthread types. Lets initialize our thread_args structure with the following function.

And use the function with the arguments we created.

4. Start the threads. Spawn the threads by passing in the arguments and giving it a pointer to the function that will run in the thread.

The code above creates a thread for however many CUDA devices you have in your computer, passing in the arguments we initialized earlier, and a pointer to a function called gpu_routine. Lets delve into the gpu_routine function.

5. The gpu_routine function Here is the function prototype.

Inside the function we need to cast the arguments as pc_args_t and do some more initializing inside the thread.

So the comments are self documenting. But notice on Line 7 where we use 'setDevice' (cv::gpu::setDevice(int)) it lets that thread communicate with a given CUDA device.

6. Reading from the VideoCapture Object inside the threads. It is very important to have system in place in which the threads are reading from the VideoCapture object in a cooperative manner, processing the data, and giving the processed data back to the main() thread. For Part 1, I am going to explain how the threads share the same object without data races and reading collisions. "Talk is cheap show me the code."

I hope the code is semi self explanatory. What is going on is that all the gpu threads are competing on getting the lock for the VideoCapture object. I explained earlier that the thread argument vc_cond_i is for determining what thread's turn is it to read from the VideoCapture object. If a thread gets the lock and its not its turn to read, it unlocks the mutex and waits for its condition to be met. This all goes on in lines 8-10. Once the a thread has the lock and it is its turn to read, we can get a frame from the VideoCapture object, on line 12. After reading from the object, we need to signal the next thread that it is its turn to read, and then release the lock on the mutex.

7. Do some image processing Finally! So now you can take your data from mframe, do some image processing, upload it to the CUDA device and run some OpenCV CUDA routines, and download that data back into host memory. The next step of the tutorial is how to pass the processed image data back to the main() thread and display it inside a namedWindow. Note, you cannot share a namedWindow between threads, they only work inside the main() thread.