At the time of writing, the latest OpenCV release is 3.0-alpha, and the library does not provide any built-in help for using multiple Nvidia GPUs. Here is a link for reference: OpenCV CUDA Doc. It basically says that to split tasks between GPUs you need to create threads and call cuda::setDevice(int) (OpenCV 3.x) or gpu::setDevice(int) (OpenCV 2.4.x), depending on which version of OpenCV you have.
Hopefully this tutorial gives you a good description of how to write a program that uses any number of Nvidia graphics cards. Keep in mind that this is my own method of solving the problem: I didn't take it from anyone else, and I didn't find any examples on the web showing how to do it. If you have a better way of sharing data between threads, let me know. I did most of the coding in a C-style fashion, shying away from the C++ thread class and settling for pthreads instead.
Program Steps:
1. Figure out how many CUDA devices are in the system. This can be accomplished with:
/* getCudaEnabledDeviceCount() returns 0 when OpenCV was built without CUDA and -1 when the driver is missing or incompatible */
if((cuda_device_count = getCudaEnabledDeviceCount()) <= 0) {
    printf("No GPU found or the library is compiled without GPU support\n");
    return -1;
}
printf("Number of enabled Cuda devices: %d\n", cuda_device_count);
for(int device = 0; device<cuda_device_count; device++) {
    printShortCudaDeviceInfo(device);
}
2. Create thread arguments. We haven't created any threads yet, but we need some sort of object, shared between the threads, that provides their input data. To accomplish this, we create a structure that contains all of a thread's initial arguments.
/* Thread Arguments */
typedef struct {
    int dev_id;
    int* vc_cond_i;
    CircularBufffer_t* cb;
    VideoCapture* vc;
    pthread_cond_t* vc_cond;
    pthread_mutex_t* vc_lock;
} pc_args_t;
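To make the idea concrete, here is a minimal, self-contained sketch of the same pattern without OpenCV: each thread gets its own struct with a private id, while the pointer fields aim at data shared by all threads. The names worker_args_t, worker, and run_workers are made up for illustration and are not part of this tutorial's code.

```cpp
#include <cassert>
#include <pthread.h>
#include <vector>

// Hypothetical stand-in for pc_args_t: one private field, two shared pointers.
struct worker_args_t {
    int dev_id;             // private to each thread
    int* shared_counter;    // same address in every struct
    pthread_mutex_t* lock;  // guards shared_counter
};

// Each worker adds (dev_id + 1) to the shared counter while holding the lock.
static void* worker(void* p) {
    worker_args_t* a = static_cast<worker_args_t*>(p);
    pthread_mutex_lock(a->lock);
    *a->shared_counter += a->dev_id + 1;
    pthread_mutex_unlock(a->lock);
    return NULL;
}

// Starts n workers, waits for them, and returns the final counter value.
int run_workers(int n) {
    assert(n > 0);
    int counter = 0;
    pthread_mutex_t lock;
    pthread_mutex_init(&lock, NULL);

    // Both vectors are sized up front so the addresses handed to the
    // threads stay valid for as long as the threads run.
    std::vector<worker_args_t> args(n);
    std::vector<pthread_t> threads(n);

    int started = 0;
    for (int i = 0; i < n; i++) {
        args[i].dev_id = i;
        args[i].shared_counter = &counter;
        args[i].lock = &lock;
        if (pthread_create(&threads[i], NULL, worker, &args[i]) != 0)
            break;  // stop on the first failure; join only what started
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    pthread_mutex_destroy(&lock);
    return counter;
}
```

Calling run_workers(4) starts four threads that add 1, 2, 3, and 4 to the shared counter, so it returns 10.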
3. Let us initialize some thread arguments and create the pthread_t.
/* Init Thread Args */
pc_args_t thread_args[cuda_device_count];
pthread_cond_t vc_cond[cuda_device_count];
pthread_mutex_t vc_mut;
int vc_cond_i[cuda_device_count];
pthread_t threads[cuda_device_count];
void init_thread_ctr( CircularBufffer_t* cb, pc_args_t* args, pthread_cond_t* vc_cond, int* vc_cond_i, pthread_mutex_t &vc_mut, VideoCapture &vc, int cuda_device_count ){
    dev_count = cuda_device_count;
    /*** Don't worry about this part yet! ***/
    /* Init Circular Buffer */
    for(int i=0; i<BUFFER_SIZE; i++){
        pthread_cond_init(&cb->data[i].full, NULL);
        pthread_cond_init(&cb->data[i].empty, NULL);
        pthread_mutex_init(&cb->data[i].lock, NULL);
        sprintf(cb->data[i].c, "Nothing has been written yet");
        cb->data[i].id = 555;
    }
    /*** Worry about this part! ***/
    /* Point everything in the right direction, and assign args.
     * vc_mut is taken by reference so that &vc_mut is the caller's mutex,
     * not a local copy that disappears when this function returns. */
    for(int i=0;i<cuda_device_count;i++){
        args[i].vc_lock = &vc_mut;
        args[i].cb = cb;
        args[i].dev_id = i;
        args[i].vc = &vc;
        args[i].vc_cond = vc_cond;
        args[i].vc_cond_i = vc_cond_i;
    }
    /* Init pthread arguments. The mutex and the condition variables are
     * shared by all threads, so each one is initialized exactly once. */
    pthread_mutex_init(&vc_mut, NULL);
    for(int n=0;n<cuda_device_count;n++){
        pthread_cond_init(&vc_cond[n], NULL);
    }
    printf("Init Data Structures Done\n");
}
/* Init thread Controllers */
init_thread_ctr(cb, thread_args, vc_cond, vc_cond_i, vc_mut, vc, cuda_device_count);
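4. Create one thread per CUDA device, passing each thread its own argument struct.
for (int dev=0;dev<cuda_device_count;dev++) {
    int err = pthread_create(&threads[dev], NULL, gpu_routine, (void*)&thread_args[dev]);
    if (err != 0)
        /* pthread_create returns the error number directly and does not set errno, so perror() would print the wrong message */
        fprintf(stderr, "cannot create thread: %s\n", strerror(err));
}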
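The snippets in this part never show the main thread waiting for its workers. Assuming you want main() to block until every thread has finished and to see which ones gave up, a sketch along these lines could work. The routine demo_routine, the constant WORKER_FAILED, and the function count_failed_workers are hypothetical stand-ins; WORKER_FAILED mirrors the (void*)-1 value that gpu_routine returns when a read fails.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <pthread.h>
#include <vector>

// Same sentinel as the tutorial's "return ((void*)-1)", spelled with explicit casts.
static void* const WORKER_FAILED =
    reinterpret_cast<void*>(static_cast<intptr_t>(-1));

// Hypothetical worker: odd ids pretend their read failed.
static void* demo_routine(void* p) {
    int id = *static_cast<int*>(p);
    return (id % 2 == 1) ? WORKER_FAILED : NULL;
}

// Creates n threads, joins every thread that actually started, and
// returns how many of them reported WORKER_FAILED.
int count_failed_workers(int n) {
    assert(n >= 0);
    std::vector<int> ids(n);
    std::vector<pthread_t> threads(n);
    std::vector<bool> started(n, false);

    for (int i = 0; i < n; i++) {
        ids[i] = i;
        int err = pthread_create(&threads[i], NULL, demo_routine, &ids[i]);
        if (err != 0) {
            // The error code is the return value, not errno.
            fprintf(stderr, "cannot create thread %d: %s\n", i, strerror(err));
            continue;
        }
        started[i] = true;
    }

    int failed = 0;
    for (int i = 0; i < n; i++) {
        if (!started[i]) continue;  // never join a thread that was not created
        void* status = NULL;
        if (pthread_join(threads[i], &status) == 0 && status == WORKER_FAILED)
            failed++;
    }
    return failed;
}
```

With four workers, ids 1 and 3 report failure, so count_failed_workers(4) returns 2.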
5. The gpu_routine function. Here is the function prototype.
void* gpu_routine(void *args);
void* gpu_routine(void *args){
    pc_args_t *args_s = (pc_args_t*) args;
    /* Set GPU Device for the given thread */
    int dev_id = args_s->dev_id;
    setDevice(dev_id);
    /* Set Conditions for threads. The turn flags are shared, so they are
     * reset only once, by whichever thread arrives first, and under the lock.
     * A thread that starts late must not clear a turn already handed on. */
    static bool turns_ready = false;
    pthread_mutex_lock(args_s->vc_lock);
    if(!turns_ready){
        memset(args_s->vc_cond_i, 0, dev_count * sizeof(int));
        args_s->vc_cond_i[0] = 1; /* Lets the thread that is managing device zero know it is its turn to read from the VideoCapture object */
        turns_ready = true;
    }
    pthread_mutex_unlock(args_s->vc_lock);
    printf("Starting thread on GPU device %d\n", dev_id);
    int w_i = dev_id; /* Used for writing into the circular buffer, worry about this in Part 2 */
    Mat mframe, mframe_gray;
    /* More to come */
6. Reading from the VideoCapture object inside the threads. It is very important to have a system in place in which the threads read from the VideoCapture object in a cooperative manner, process the data, and give the processed data back to the main() thread. For Part 1, I am going to explain how the threads share the same object without data races and read collisions. "Talk is cheap. Show me the code."
/* Inside the gpu_routine */
for(;;){
    /* Reads Data from VideoCapture object
     * Each GPU device thread reads from the VC in the order
     * of their GPU device ID.
     */
    if(dev_count >= 2){
        pthread_mutex_lock(args_s->vc_lock);
        while(args_s->vc_cond_i[dev_id] != 1){ // Not this device ID's thread's turn to read from VideoCapture
            pthread_cond_wait(&args_s->vc_cond[dev_id], args_s->vc_lock);
        }
        bool got_frame = args_s->vc->read(mframe);
        /* Reset condition for current Device */
        args_s->vc_cond_i[dev_id] = 0;
        /* Signal the next Device ID's thread to read from video capture (the last device wraps back to device 0) */
        int next_dev = (dev_id + 1) % dev_count;
        args_s->vc_cond_i[next_dev] = 1;
        pthread_cond_signal(&args_s->vc_cond[next_dev]);
        pthread_mutex_unlock(args_s->vc_lock);
        if(!got_frame){
            /* No success, or end of video file. The turn has been passed on and the
             * lock released first, so the other threads are not left waiting forever. */
            return ((void*)-1);
        }
    } else {
        if(!args_s->vc->read(mframe)){
            /* No success, or end of video file */
            return ((void*)-1);
        }
    }
    /* Process the data in mframe */
    /* More to Come in Part 2 */
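If you want to try the turn-taking scheme without a camera or OpenCV, the following self-contained sketch may help. It is only an illustration under assumptions: a plain counter (next_frame) stands in for the VideoCapture object, and turn_state_t, rr_args_t, rr_worker, and run_round_robin are hypothetical names that do not appear in the tutorial's code. It records which worker "read" each frame so the ordering can be checked.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <pthread.h>
#include <vector>

// Hypothetical shared state; next_frame plays the role of the VideoCapture.
struct turn_state_t {
    int n_workers;
    int total_frames;
    int next_frame;                    // next frame number to hand out
    std::vector<int> turn;             // turn[i] == 1 means worker i may read
    std::vector<pthread_cond_t> cond;  // one condition variable per worker
    pthread_mutex_t lock;              // guards everything in this struct
    std::vector<int> reader_of_frame;  // reader_of_frame[f] = worker that read frame f
};

struct rr_args_t { int id; turn_state_t* st; };

static void* rr_worker(void* p) {
    rr_args_t* a = static_cast<rr_args_t*>(p);
    turn_state_t* st = a->st;
    for (;;) {
        pthread_mutex_lock(&st->lock);
        while (st->turn[a->id] != 1)  // loop guards against spurious wakeups
            pthread_cond_wait(&st->cond[a->id], &st->lock);

        bool got_frame = st->next_frame < st->total_frames;
        if (got_frame) {
            st->reader_of_frame.push_back(a->id);
            st->next_frame++;
        }
        // Pass the turn on even at end of stream, so the next worker wakes
        // up, sees the end as well, and exits instead of waiting forever.
        int next = (a->id + 1) % st->n_workers;
        st->turn[a->id] = 0;
        st->turn[next] = 1;
        pthread_cond_signal(&st->cond[next]);
        pthread_mutex_unlock(&st->lock);

        if (!got_frame) return NULL;
    }
}

// Runs the ring and returns, for each frame, the id of the worker that read it.
std::vector<int> run_round_robin(int n_workers, int total_frames) {
    assert(n_workers > 0 && total_frames >= 0);
    turn_state_t st;
    st.n_workers = n_workers;
    st.total_frames = total_frames;
    st.next_frame = 0;
    st.turn.assign(n_workers, 0);
    st.turn[0] = 1;  // worker 0 goes first; set before any thread exists
    st.cond.resize(n_workers);
    pthread_mutex_init(&st.lock, NULL);
    for (int i = 0; i < n_workers; i++) pthread_cond_init(&st.cond[i], NULL);

    std::vector<rr_args_t> args(n_workers);
    std::vector<pthread_t> threads(n_workers);
    for (int i = 0; i < n_workers; i++) {
        args[i].id = i;
        args[i].st = &st;
        int err = pthread_create(&threads[i], NULL, rr_worker, &args[i]);
        if (err != 0) {  // a ring with a missing member would stall, so give up
            fprintf(stderr, "cannot create thread: %s\n", strerror(err));
            abort();
        }
    }
    for (int i = 0; i < n_workers; i++) pthread_join(threads[i], NULL);

    for (int i = 0; i < n_workers; i++) pthread_cond_destroy(&st.cond[i]);
    pthread_mutex_destroy(&st.lock);
    return st.reader_of_frame;
}
```

With three workers and seven frames, run_round_robin(3, 7) yields the reader sequence 0, 1, 2, 0, 1, 2, 0, which is the device-ID ordering described in this step. Note that the turn flags are set up once, before any thread is created, which sidesteps the question of which thread initializes them.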
7. Do some image processing. Finally! Now you can take your data from mframe, do some image processing, upload it to the CUDA device, run some OpenCV CUDA routines on it, and download the result back into host memory. The next step of the tutorial covers how to pass the processed image data back to the main() thread and display it inside a namedWindow. Note that you cannot share a namedWindow between threads; they only work inside the main() thread.