Yet another CUDA article: why?
Habr already has plenty of good articles about CUDA. However, a search for "CUDA scan" returned only two articles, and neither has anything to do with the scan algorithm on the GPU, which is one of the most fundamental GPU algorithms. So, inspired by the Udacity course Intro to Parallel Programming that I recently took, I decided to write a more complete series of articles about CUDA. I should say up front that the series is based on that course, and if you have the time, taking the course itself will be far more useful.
Contents
The following articles are currently planned:
Part 1: Introduction.
Part 2: GPU hardware and parallel communication patterns.
Part 3: Fundamental GPU algorithms: reduce, scan, and histogram.
Part 4: Fundamental GPU algorithms: compact, segmented scan, sort. Practical applications of some of the algorithms.
Part 5: Optimizing GPU programs.
Part 6: Examples of parallelizing sequential algorithms.
Part 7: Additional topics in parallel programming; dynamic parallelism.
Latency and throughput

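The first question anyone should ask before using a GPU to solve a problem is what the GPU is good at and when it should be used. To answer it, we need to define two concepts:
Latency: the time it takes to complete one instruction/operation.
Throughput: the number of instructions/operations completed per unit of time.
A simple example: we have a passenger car that travels at 90 km/h and carries 4 people, and a bus that travels at 60 km/h and carries 20 people. If one operation is moving one person over one kilometer, then the car's latency is 3600/90 = 40 s (the time one person needs to cover 1 km), and its throughput is 4/40 = 0.1 operations/s. The bus's latency is 3600/60 = 60 s, and its throughput is 20/60 ≈ 0.33 operations/s.
So the CPU is the car and the GPU is the bus: the GPU has higher latency but also higher throughput. If the latency of each individual operation matters less to your task than the number of operations completed per second, the GPU is worth considering.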
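The car-and-bus arithmetic can be written down as a small sketch. The `Vehicle` struct and both function names are illustrative assumptions, not part of any library; one "operation" is moving one person over one kilometer, as in the example above.

```cpp
#include <cassert>
#include <cmath>

// Illustrative model only: one "operation" = moving one person over 1 km.
struct Vehicle {
    double speedKmh;  // cruising speed in km/h
    int    seats;     // people carried at once
};

// Latency: seconds needed to complete one operation (cover 1 km).
double latencySeconds(const Vehicle& v) {
    assert(v.speedKmh > 0.0);
    return 3600.0 / v.speedKmh;
}

// Throughput: operations completed per second (all seats move together).
double throughputPerSecond(const Vehicle& v) {
    return v.seats / latencySeconds(v);
}
```

With `Vehicle car{90.0, 4}` and `Vehicle bus{60.0, 20}`, the car wins on latency (40 s against 60 s) while the bus wins on throughput (about 0.33 against 0.1 operations/s), which is the CPU/GPU trade-off in miniature.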
Basic CUDA concepts and terminology
Now let's go over the CUDA terminology.

- Device: the GPU. It plays the role of a "subordinate" and does only what the CPU tells it to do.
- Host: the CPU. It plays the controlling role: it launches tasks on the device, allocates memory on the device, and moves data to and from the device. And yes, CUDA assumes that the device and the host each have their own memory.
- Kernel: a task that the host launches on the device.
With CUDA you simply write code in your favorite programming language (a number of languages are supported besides C and C++), and the CUDA compiler then generates code separately for the host and for the device. One small caveat: device code must be written in C with a few "CUDA extensions".
Main stages of a CUDA program
- The host allocates the required amount of memory on the device.
- The host copies data from its own memory to device memory.
- The host launches the required kernels on the device.
- The device executes the kernels.
- The host copies the results from device memory back to its own memory.
Naturally, to use the GPU as efficiently as possible, the ratio of time spent executing kernels to time spent allocating memory and moving data should be as large as possible.
Kernels
Let's look in more detail at how kernels and the code that launches them are written. The key principle is that a kernel is written as (practically) an ordinary sequential program: you will not see any threads being created or started in the kernel code itself. Instead, to organize the parallel computation, the GPU launches many copies of the same kernel in different threads; more precisely, you tell it how many threads to launch. And, coming back to the question of GPU efficiency: the more threads you launch, the better, provided that all of them do useful work.
Kernel code differs from ordinary sequential code in the following ways:
- Inside a kernel you can find out the "identifier", or more simply the position, of the thread that is currently executing. Using this position, the same kernel works on different data depending on the thread it runs in. Incidentally, this way of organizing parallel computation is called SIMD (Single Instruction, Multiple Data): several processors simultaneously perform the same operation on different data. (NVIDIA's own term for the CUDA flavor of this model is SIMT, Single Instruction, Multiple Threads.)
- In some cases kernel code needs to use various synchronization mechanisms.
How do you specify the number of threads a kernel is launched with? The GPU is still a graphics processing unit, and this naturally influenced the CUDA model, in particular the way the number of threads is specified:
- First, the dimensions of the so-called grid are set in 3D coordinates: grid_x, grid_y, grid_z. As a result, the grid consists of grid_x * grid_y * grid_z blocks.
- Next, the block dimensions are set in 3D coordinates: block_x, block_y, block_z. As a result, a block consists of block_x * block_y * block_z threads, which gives grid_x * grid_y * grid_z * block_x * block_y * block_z threads in total. An important note: the maximum number of threads per block is limited and depends on the GPU model. Typical values are 512 (older models) and 1024 (newer models).
- Inside a kernel, the variables threadIdx and blockIdx are available with the fields x, y, z. They hold the 3D coordinates of the thread within its block and of the block within the grid, respectively. The variables blockDim and gridDim, with the same fields, are also available and hold the block dimensions and the grid dimensions, respectively.
As you can see, this way of launching threads is well suited to processing 2D and 3D images. For example, if every pixel of a 2D or 3D image has to be processed in some way, then once the block size has been chosen (it depends on the image size, the processing method, and the GPU model), the grid size is chosen so that the whole image is covered, possibly with some excess if the image dimensions are not exact multiples of the block dimensions.
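To make the grid/block arithmetic concrete, here is a minimal CPU-only sketch that emulates a 2D launch with plain loops. The helper names `blocksNeeded` and `emulateLaunch2D` are hypothetical; the index expression mirrors the usual `blockDim.x * blockIdx.x + threadIdx.x` pattern, and the grid is sized by rounding up so that the whole image is covered.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest number of blocks of `blockSize` threads that covers `n` elements
// (integer division rounded up).
int blocksNeeded(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// CPU emulation of a 2D launch: visits every (block, thread) pair, computes
// the global pixel coordinate the way a kernel would, and counts how many
// threads touch each pixel. Threads that fall outside the image are skipped.
std::vector<int> emulateLaunch2D(int numCols, int numRows, int blockX, int blockY) {
    const int gridX = blocksNeeded(numCols, blockX);
    const int gridY = blocksNeeded(numRows, blockY);
    std::vector<int> hits(static_cast<std::size_t>(numCols) * numRows, 0);
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockY; ++ty)
                for (int tx = 0; tx < blockX; ++tx) {
                    const int x = blockX * bx + tx;  // blockDim.x*blockIdx.x + threadIdx.x
                    const int y = blockY * by + ty;  // blockDim.y*blockIdx.y + threadIdx.y
                    if (x >= numCols || y >= numRows) continue;  // excess threads do nothing
                    ++hits[static_cast<std::size_t>(y) * numCols + x];
                }
    return hits;
}
```

For a 37 × 19 image with 16 × 8 blocks this gives a 3 × 3 grid; every pixel is visited exactly once, and the threads beyond the right and bottom edges are filtered out by the bounds check.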
Writing our first CUDA program
Enough theory, time to write some code. Instructions for installing and configuring CUDA on various operating systems are at docs.nvidia.com/cuda/index.html. To make working with image files easier we will use OpenCV, and to compare CPU and GPU performance we will use OpenMP.
The task is quite simple: convert a color image to grayscale. The brightness Y of a pixel pix in the grayscale image is computed by the formula Y = 0.299 * pix.R + 0.587 * pix.G + 0.114 * pix.B.
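As a point of reference, the formula can be applied pixel by pixel in a plain single-threaded loop, which is handy for checking the output of the parallel versions. This is only a sketch: the `Rgba` struct is a hypothetical stand-in for CUDA's `uchar4` so that it builds without the CUDA headers, and `luminance` and `toGrayscaleReference` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's uchar4 (x, y, z, w mapped to r, g, b, a).
struct Rgba { unsigned char r, g, b, a; };

// Y = 0.299*R + 0.587*G + 0.114*B, truncated to an 8-bit value.
unsigned char luminance(const Rgba& pix) {
    return static_cast<unsigned char>(0.299f * pix.r + 0.587f * pix.g + 0.114f * pix.b);
}

// Sequential reference conversion of a whole image stored as a flat array.
std::vector<unsigned char> toGrayscaleReference(const std::vector<Rgba>& image) {
    std::vector<unsigned char> gray(image.size());
    for (std::size_t i = 0; i < image.size(); ++i) {
        gray[i] = luminance(image[i]);
    }
    return gray;
}
```

Note that the result is truncated rather than rounded, which matches the implicit float-to-`unsigned char` conversion used in the listings of this article.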
First, let's write the skeleton of the program:
main.cpp
#include <chrono>
#include <iostream>
#include <cstring>
#include <string>

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/opencv.hpp>

#include <vector_types.h>

#include "openMP.hpp"
#include "CUDA_wrappers.hpp"
#include "common/image_helpers.hpp"

using namespace cv;
using namespace std;

int main( int argc, char** argv )
{
  using namespace std::chrono;

  if( argc != 2)
  {
    cout <<" Usage: convert_to_grayscale imagefile" << endl;
    return -1;
  }

  Mat image, imageGray;
  uchar4 *imageArray;
  unsigned char *imageGrayArray;

  // Load the image and get raw pointers to the input and output pixel data
  prepareImagePointers(argv[1], image, &imageArray, imageGray, &imageGrayArray, CV_8UC1);

  int numRows = image.rows, numCols = image.cols;

  // CPU version: time the whole OpenMP conversion
  auto start = system_clock::now();
  RGBtoGrayscaleOpenMP(imageArray, imageGrayArray, numRows, numCols);
  auto duration = duration_cast<milliseconds>(system_clock::now() - start);
  cout<<"OpenMP time (ms):" << duration.count() << endl;

  // Clear the output so the CUDA version starts from a clean buffer
  memset(imageGrayArray, 0, sizeof(unsigned char)*numRows*numCols);

  // GPU version: timing is done inside the wrapper
  RGBtoGrayscaleCUDA(imageArray, imageGrayArray, numRows, numCols);

  return 0;
}
Everything here is straightforward: we read the image file, prepare pointers to the color and grayscale images, and run the OpenMP and CUDA variants, measuring the time. The prepareImagePointers function looks like this:
prepareImagePointers
template <class T1, class T2>
void prepareImagePointers(const char * const inputImageFileName,
                          cv::Mat& inputImage,
                          T1** inputImageArray,
                          cv::Mat& outputImage,
                          T2** outputImageArray,
                          const int outputImageType)
{
  using namespace std;
  using namespace cv;

  inputImage = imread(inputImageFileName, IMREAD_COLOR);

  if (inputImage.empty())
  {
    cerr << "Couldn't open input file." << endl;
    exit(1);
  }

  //allocate memory for the output
  outputImage.create(inputImage.rows, inputImage.cols, outputImageType);

  // imread returns BGR data; convert to RGBA so that the .x/.y/.z fields of
  // uchar4 hold R/G/B, as the grayscale formula expects
  // (the original listing used COLOR_BGR2BGRA, which swaps the R and B weights)
  cvtColor(inputImage, inputImage, cv::COLOR_BGR2RGBA);

  *inputImageArray = (T1*)inputImage.ptr<char>(0);
  *outputImageArray = (T2*)outputImage.ptr<char>(0);
}
I resorted to a small trick here. We do very little work per pixel, so the CUDA version runs straight into the problem mentioned above: the ratio of time spent on useful work to time spent allocating memory and copying data is poor, and as a result the total time of the CUDA version would be greater than that of the OpenMP version, whereas we want to show that CUDA is faster :) So for CUDA we measure only the time spent on the actual image conversion, excluding memory operations. In my defense, for a large class of tasks the useful work still dominates, and CUDA is faster even when memory operations are taken into account.
Next, the code for the OpenMP version:
openMP.hpp
#include <stdio.h>
#include <omp.h>
#include <vector_types.h>

void RGBtoGrayscaleOpenMP(uchar4 *imageArray, unsigned char *imageGrayArray, int numRows, int numCols)
{
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < numRows; ++i)
  {
    for (int j = 0; j < numCols; ++j)
    {
      const uchar4 pixel = imageArray[i*numCols+j];
      imageGrayArray[i*numCols+j] = 0.299f*pixel.x + 0.587f*pixel.y+0.114f*pixel.z;
    }
  }
}
Everything is quite simple: we just add the omp parallel for directive to the single-threaded code, and that is the beauty and power of OpenMP. I tried playing with the schedule parameter, but the results were always worse than without it.
Finally, on to CUDA. Here we will go into more detail. First we need to allocate memory for the input data, copy it from the CPU to the GPU, and allocate memory for the output:
void RGBtoGrayscaleCUDA(const uchar4 * const h_imageRGBA, unsigned char* const h_imageGray, size_t numRows, size_t numCols)
{
  uchar4 *d_imageRGBA;
  unsigned char *d_imageGray;
  const size_t numPixels = numRows * numCols;

  cudaSetDevice(0);
  checkCudaErrors(cudaGetLastError());

  //allocate memory on the device for both input and output
  checkCudaErrors(cudaMalloc(&d_imageRGBA, sizeof(uchar4) * numPixels));
  checkCudaErrors(cudaMalloc(&d_imageGray, sizeof(unsigned char) * numPixels));

  //copy input array to the GPU
  checkCudaErrors(cudaMemcpy(d_imageRGBA, h_imageRGBA, sizeof(uchar4) * numPixels, cudaMemcpyHostToDevice));
Note the standard CUDA naming convention for variables: data on the CPU is prefixed with h_ (host), and data on the GPU with d_ (device). checkCudaErrors is a macro taken from the Udacity course's GitHub repository. It looks like this:
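#include <cuda.h>
#include <cuda_runtime.h>  // cudaSuccess, cudaGetErrorString
#include <cstdlib>         // exit
#include <iostream>        // std::cerr

// Wraps a CUDA call and reports the failing expression, file and line
#define checkCudaErrors(val) check( (val), #val, __FILE__, __LINE__)

template<typename T>
void check(T err, const char* const func, const char* const file, const int line)
{
  if (err != cudaSuccess)
  {
    std::cerr << "CUDA error at: " << file << ":" << line << std::endl;
    std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
    exit(1);
  }
}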
cudaMalloc is the GPU analog of malloc. cudaMemcpy is the analog of memcpy, with an additional enum parameter that specifies the kind of copy: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.
Next we need to set the grid and block dimensions and call the kernel, not forgetting to measure the time:
  dim3 blockSize;
  dim3 gridSize;
  int threadNum;

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  threadNum = 1024;
  blockSize = dim3(threadNum, 1, 1);
  gridSize = dim3(numCols/threadNum+1, numRows, 1);

  cudaEventRecord(start);
  rgba_to_grayscale_simple<<<gridSize, blockSize>>>(d_imageRGBA, d_imageGray, numRows, numCols);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

  float milliseconds = 0;
  cudaEventElapsedTime(&milliseconds, start, stop);
  std::cout << "CUDA time simple (ms): " << milliseconds << std::endl;
Note the kernel call syntax: kernel_name<<<gridSize, blockSize>>>. The kernel code itself is not very complicated either:
rgba_to_grayscale_simple
__global__
void rgba_to_grayscale_simple(const uchar4* const d_imageRGBA,
                              unsigned char* const d_imageGray,
                              int numRows, int numCols)
{
  int y = blockDim.y*blockIdx.y + threadIdx.y;
  int x = blockDim.x*blockIdx.x + threadIdx.x;
  if (x>=numCols || y>=numRows)
    return;
  const int offset = y*numCols+x;
  const uchar4 pixel = d_imageRGBA[offset];
  d_imageGray[offset] = 0.299f*pixel.x + 0.587f*pixel.y+0.114f*pixel.z;
}
Here we compute the y and x coordinates of the pixel to process, using the threadIdx, blockIdx, and blockDim variables described earlier, and perform the conversion. Note the check (x >= numCols || y >= numRows): the image dimensions are not necessarily exact multiples of the block dimensions, so some blocks may extend past the edge of the image, which is why the check is needed. Also, a kernel function must be marked with the __global__ specifier.
The last step is to copy the result from the GPU back to the CPU and free the allocated memory:
  checkCudaErrors(cudaMemcpy(h_imageGray, d_imageGray, sizeof(unsigned char) * numPixels, cudaMemcpyDeviceToHost));

  cudaFree(d_imageGray);
  cudaFree(d_imageRGBA);
By the way, CUDA lets you use a C++ compiler for the host code, so you can easily write wrappers that free memory automatically.
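One possible shape for such a wrapper, sketched with `std::unique_ptr` and a custom deleter. The helper name `adoptBuffer` is an assumption made for illustration; the sketch is deliberately independent of the CUDA headers and takes the release function as a parameter, so with CUDA you would pass something that calls `cudaFree`.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Takes ownership of a raw buffer and calls `release(ptr)` exactly once when
// the returned owner goes out of scope (never for a null pointer).
// Hypothetical helper; with CUDA, `release` would typically wrap cudaFree.
template <typename T, typename Release>
std::unique_ptr<T, Release> adoptBuffer(T* raw, Release release) {
    return std::unique_ptr<T, Release>(raw, std::move(release));
}

// Intended use with CUDA (not compiled here, shown as an assumption):
//   uchar4* raw = nullptr;
//   checkCudaErrors(cudaMalloc(&raw, sizeof(uchar4) * numPixels));
//   auto d_imageRGBA = adoptBuffer(raw, [](uchar4* p) { cudaFree(p); });
//   ... pass d_imageRGBA.get() to cudaMemcpy and to the kernel ...
```

The design choice here is to reuse the standard smart pointer rather than write a dedicated class: early returns and exceptions on the host side then no longer leak device memory.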
So, let's run the measurements (the input image is 10,109 × 4,542 pixels):
OpenMP time (ms):45
CUDA time simple (ms): 43.1941
Configuration of the machine the tests were run on:
Processor: Intel® Core(TM) i7-3615QM CPU @ 2.30GHz.
GPU: NVIDIA GeForce GT 650M, 1024 MB, 900 MHz.
RAM: DDR3, 2x4GB, 1600 MHz.
OS: OS X 10.9.5.
Compiler: g++ (GCC) 4.9.2 20141029.
CUDA compiler: Cuda compilation tools, release 6.0, V6.0.1.
Supported OpenMP version: OpenMP 4.0.
Somehow that is not very impressive :) The problem is the same as before: too little work is done per pixel. We launch thousands of threads, each of which finishes almost instantly. On the CPU this problem does not arise: OpenMP starts a relatively small number of threads (8 in my case) and divides the work evenly among them, so the processor is kept almost 100% busy, whereas on the GPU we are, in effect, not using all of its power. The solution is fairly obvious: process several pixels per thread. The new, optimized kernel looks like this:
rgba_to_grayscale_optimized
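#define WARP_SIZE 32

__global__
void rgba_to_grayscale_optimized(const uchar4* const d_imageRGBA,
                                 unsigned char* const d_imageGray,
                                 int numRows, int numCols,
                                 int elemsPerThread)
{
  int y = blockDim.y*blockIdx.y + threadIdx.y;
  int x = blockDim.x*blockIdx.x + threadIdx.x;
  // Guard for launches whose grid is taller than the image
  // (a no-op for the launch configuration used in this article)
  if (y >= numRows)
    return;
  // First pixel handled by this thread: each warp owns a contiguous run of
  // WARP_SIZE*elemsPerThread pixels, and a thread starts at its lane within that run
  const int loop_start = (x/WARP_SIZE * WARP_SIZE)*(elemsPerThread-1)+x;
  for (int i=loop_start, j=0; j<elemsPerThread && i<numCols; i+=WARP_SIZE, ++j)
  {
    const int offset = y*numCols+i;
    const uchar4 pixel = d_imageRGBA[offset];
    d_imageGray[offset] = 0.299f*pixel.x + 0.587f*pixel.y+0.114f*pixel.z;
  }
}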
Things are not as simple here as in the previous kernel. As you can see, each thread now processes elemsPerThread pixels, and not consecutive ones but pixels WARP_SIZE apart. What WARP_SIZE is, why it equals 32, and why the pixels have to be processed in this strided fashion will be explained in more detail in the following parts; for now I will only say that this gives more efficient memory access. Since each thread now processes elemsPerThread pixels at a stride of WARP_SIZE, the x coordinate of a thread's first pixel is now computed from the thread's position with a somewhat more complicated formula than before.
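The start-index formula is easier to trust once it has been checked on the CPU. The sketch below replays the kernel's inner loop for a single image row and counts how often each column is written; `emulateStridedRow` and `kWarpSize` are illustrative names, and the loop body is copied from the kernel with the pixel conversion replaced by a counter.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;  // same value as WARP_SIZE in the kernel

// For one image row, returns how many times each column is written when
// `totalThreadsX` threads each handle up to `elemsPerThread` pixels at a
// stride of kWarpSize, using the kernel's start-index formula.
std::vector<int> emulateStridedRow(int numCols, int totalThreadsX, int elemsPerThread) {
    assert(numCols >= 0 && totalThreadsX > 0 && elemsPerThread > 0);
    std::vector<int> writes(numCols, 0);
    for (int x = 0; x < totalThreadsX; ++x) {
        const int loopStart = (x / kWarpSize * kWarpSize) * (elemsPerThread - 1) + x;
        for (int i = loopStart, j = 0; j < elemsPerThread && i < numCols; i += kWarpSize, ++j) {
            ++writes[i];
        }
    }
    return writes;
}
```

Writing x as 32*w + lane, the formula simplifies to loopStart = 32*w*elemsPerThread + lane, so warp w owns the contiguous columns from 32*w*elemsPerThread up to (but not including) 32*(w+1)*elemsPerThread. For the article's image width of 10,109 with 128 threads per block and 16 pixels per thread, every column is written exactly once.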
This kernel is launched as follows:
  threadNum=128;
  const int elemsPerThread = 16;
  blockSize = dim3(threadNum, 1, 1);
  gridSize = dim3(numCols / (threadNum*elemsPerThread) + 1, numRows, 1);

  cudaEventRecord(start);
  rgba_to_grayscale_optimized<<<gridSize, blockSize>>>(d_imageRGBA, d_imageGray, numRows, numCols, elemsPerThread);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

  milliseconds = 0;
  cudaEventElapsedTime(&milliseconds, start, stop);
  std::cout << "CUDA time optimized (ms): " << milliseconds << std::endl;
The number of blocks along the x coordinate is now computed as numCols / (threadNum * elemsPerThread) + 1 instead of numCols / threadNum + 1. Everything else is unchanged.
Let's run it:
OpenMP time (ms):44
CUDA time simple (ms): 53.1625
CUDA time optimized (ms): 15.9273
We got a 2.76× speedup over OpenMP (again, not counting the time for memory operations), which is quite good for such a simple problem. Yes, this task is too simple, and the CPU handles it very well. As the second run shows, a simple GPU implementation can even be slower than the CPU implementation.
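For completeness, the speedup figure is just the ratio of the two measured times. A tiny sketch (the function name `speedup` is illustrative) using the numbers printed above:

```cpp
#include <cassert>

// Ratio of a baseline time to an improved time, both in the same unit.
// Values above 1 mean the second measurement is faster.
double speedup(double baselineMs, double improvedMs) {
    assert(baselineMs > 0.0 && improvedMs > 0.0);
    return baselineMs / improvedMs;
}
```

Here `speedup(44.0, 15.9273)` is about 2.76 for the optimized kernel, while `speedup(44.0, 53.1625)` is below 1 for the simple kernel in the same run, meaning it was slower than the OpenMP version.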
That's all for today. In the next part we will look at GPU hardware and the basic parallel communication patterns.
All the source code is available on Bitbucket.