What is this about?
In this post we describe one part of the runtime system (RTS; hereafter, the runtime system) of the DVMH compilers. The part in question, as the title suggests, concerns the handling of user arrays on the GPU, namely the automatic transformation, or rearrangement, of arrays in accelerator memory. These transformations are performed so that GPU memory is accessed efficiently in computational loops. What DVMH is, how it adapts to the computation, and why this is done automatically is explained below.
What is DVMH?
Since this post is devoted to a review of array transformation algorithms, we describe only briefly what DVMH is; this is needed to explain the principles of its operation. We have to start with the DVM (Distributed Virtual Memory) system, which is designed for developing programs for clusters whose nodes contain accelerators (Nvidia GPUs and Intel Xeon Phi) and multi-core processors. With this system, not only can large sequential programs that work mainly with structured grids or structured data types be parallelized easily, but the resulting program can also be mapped easily onto a cluster whose nodes may contain devices of different architectures. The DVM system includes:
- Compilers for the C programming language (and, in the future, C++, though with restrictions) and for Fortran: C-DVMH and Fortran-DVMH. Together they are usually called the DVMH languages and compilers. DVMH is an extension of the programming language in question with pragmas or special comments (by analogy with OpenMP or OpenACC) that are invisible to a standard compiler. It is therefore enough for the programmer to place a small number of directives indicating how the data are distributed and how the computations in loops are mapped onto the distributed data. After that, the user has both a sequential and a parallel program at once. The resulting program can be executed on a cluster with a different number of nodes, on one or several GPUs within a single node, and it can also use, for example, the multi-core processor, graphics accelerators and Intel Xeon Phi accelerators at the same time (if the server for the problem has them). More details can be found here.
- The Lib-DVMH support library, or the RTSH runtime system (the H stands for heterogeneous; it appears in the names of many components and was added after the system was extended to support GPUs and Xeon Phi). The entire execution of the user program is orchestrated through this system while the program is running.
- Performance-debugging and debugging tools for DVMH programs (so far only for Fortran-DVMH programs).
The main goal of creating such a system is to simplify the user's life both when parallelizing existing programs and when writing new parallel programs. The DVMH compilers convert a program containing DVMH directives into a program that uses MPI, OpenMP, CUDA and calls to RTSH. Thus, a user program can easily be parallelized with computation-distribution directives (almost the same as in OpenMP or OpenACC) and data-distribution directives. Moreover, the program remains a single coherent one, which is important for its development and support. For this reason, writing a good sequential program and placing directives in it for the DVMH compilers is easier than performing the parallelization by hand.
Levels of parallel execution
After this brief introduction and the presentation of background, let us consider how a parallel program (namely, its computational loops) is mapped onto the different levels of parallelism within RTSH. At present, in order to obtain high computing power, instead of increasing the frequency of a single core, a large number of threads within a single device are used. This leads to the need to master not only the standard for interaction between processes on different nodes (MPI, of course), but also the emerging heterogeneous architectures and the various parallel languages (both high- and low-level) that appear with them. And all of this complicates the life of the end user. At the same time, to obtain a high-performance application, all the capabilities of a particular computing cluster have to be used.
At the moment two levels of parallelism can be distinguished: between nodes and within a node. Within a node, one or more GPUs and a multi-core processor can be used. In this scheme, an Intel Xeon Phi is regarded as a separate node containing a multi-core processor. Below is a general picture of a computing cluster onto which a DVMH program can be mapped.
Naturally, the problem of distributing the load across devices arises (a DVMH mechanism exists for this), but it is beyond the scope of this article. All further considerations concern a single GPU within a single node. The transformations described below are performed independently on every GPU on which the DVMH program runs.
Why the data need to be rearranged
Finally, after the long introduction, let us get to the rearrangement problem itself. Why is all this necessary? So, what is it for? Consider a computational loop of the following kind:
double ARRAY[Z][Y][X][5];
for (int I = 1; I < Z; ++I)
    for (int J = 1; J < Y; ++J)
        for (int K = 1; K < X; ++K)
            ARRAY[I][J][K][4] = Func(ARRAY[I][J][K][2], ARRAY[I][J][K][0]);
Suppose, for example, that we have a four-dimensional array whose first (fastest) dimension contains five physical quantities, while the remaining dimensions are the spatial coordinates of the computational domain. The array is declared in the program so that this fastest dimension (the one closest in memory) consists of five elements, which lets the processor cache work well in the computational loop. In this example, every iteration of the triple loop needs to access several elements of the fastest dimension (here elements 0, 2 and 4). It is also worth noting that there is no loop over this dimension, since, for example, the five quantities have different properties and are computed differently.
Thus the loop over I, J and K can be executed in parallel. In this example the loop can be mapped onto the elements of the ARRAY array, for example, as follows:
#pragma dvm array distribute[block][block][block][*]
double ARRAY[Z][Y][X][5];

#pragma dvm parallel([I][J][K] on ARRAY[I][J][K][*])
for (int I = 1; I < Z; ++I)
    for (int J = 1; J < Y; ++J)
        for (int K = 1; K < X; ++K)
            ARRAY[I][J][K][4] = Func(ARRAY[I][J][K][2], ARRAY[I][J][K][0]);
That is, both the data distribution and the distribution of the computations are specified. The first DVMH directive above says that the array should be distributed in equal blocks across three dimensions, while the fourth (fastest) dimension is replicated. With this specification, the array can be mapped onto the processor lattice given when the DVMH program is launched. The second directive says that the loop iterations (I, J, K) should be executed according to the distribution rule of the ARRAY array; in other words, the PARALLEL directive sets the mapping of loop iterations onto array elements. Thanks to this, RTSH knows at run time how the arrays are laid out and how the computations are arranged.
The whole iteration space of this loop can be mapped both onto OpenMP threads and onto CUDA threads, since there are no dependences in any of the three loops. We are interested in the mapping onto the CUDA architecture. As everyone knows, a CUDA block has three dimensions (x, y, z), of which the first, x, is the fastest; the warps of a CUDA block are laid out along this fastest dimension. Why mention all this? Because GPU global memory (GDDR5) is rather slow and is the bottleneck of almost any computation, and because memory is organized so that fast access by a single warp is possible only when the elements it requests lie contiguously. In the loop above there are six ways to map the iteration space (I, J, K) onto a CUDA block (x, y, z), but none of them gives efficient access to the ARRAY array.
Where does this come from? Looking at the array declaration, we see that its fastest dimension contains five elements and is not traversed by any loop. Consequently, neighbouring elements of the second-fastest dimension lie 40 bytes apart (5 elements of type double), so a warp of 32 threads touches addresses spread over 32 * 40 = 1280 bytes, and the number of transactions to GPU memory grows (up to 32 transactions per warp instead of one). All of this overloads the memory bus and degrades performance.
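To make this concrete, below is a minimal hand-written CUDA sketch (an illustration only, not the code the DVMH compiler actually generates; the kernel name loop_kernel and the stand-in Func are assumptions) of one of the six mappings, K to x, J to y, I to z. Even with this, the most natural, mapping the threads of a warp address elements 40 bytes apart:

// Illustrative stand-in for the Func used in the example above.
__device__ double Func(double a, double b) { return a + b; }

// One of the six possible mappings of (I, J, K) onto a CUDA block (x, y, z):
// K -> x, J -> y, I -> z. Even here, neighbouring threads of a warp (adjacent K)
// address elements 5 doubles (40 bytes) apart, so the accesses are not coalesced.
__global__ void loop_kernel(double *ARRAY, int Z, int Y, int X)
{
    int K = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int J = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int I = blockIdx.z * blockDim.z + threadIdx.z + 1;
    if (I < Z && J < Y && K < X) {
        // Linearized index of ARRAY[I][J][K][0] in the [Z][Y][X][5] layout.
        size_t base = (((size_t)I * Y + J) * X + K) * 5;
        ARRAY[base + 4] = Func(ARRAY[base + 2], ARRAY[base + 0]);
    }
}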
In this case, to solve the problem, it is enough to swap the first and second dimensions, that is, to transpose a two-dimensional X by 5 matrix Y*Z times, or, more precisely, to perform Y*Z independent transpositions. But what does it mean to swap the dimensions of an array by hand? The following problems can arise (a sketch of such a manual rewrite is given after the list):
- A loop performing this transformation has to be added to the code.
- If the dimensions are rearranged in the program code, the whole program has to be revised, because as soon as the array dimensions are permuted at one point of the program, the computations in all the other loops become incorrect. In a large program it is easy to make many mistakes.
- It is not obvious what effect the permutation gives or how it affects other loops; the array may have to be restored to its original layout or permuted yet again.
- This loop will then no longer run efficiently on the CPU, so two versions of the program have to be maintained.
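For illustration, here is a rough sketch of what such a manual rewrite looks like (ARRAY2 and the chosen index order are assumptions introduced only for this example): an extra conversion loop appears, and every loop that uses the array has to be rewritten consistently.

/* Illustration only: the two fastest dimensions swapped by hand. */
double ARRAY2[Z][Y][5][X];

/* 1) an extra conversion loop has to be added ... */
for (int I = 0; I < Z; ++I)
    for (int J = 0; J < Y; ++J)
        for (int K = 0; K < X; ++K)
            for (int e = 0; e < 5; ++e)
                ARRAY2[I][J][e][K] = ARRAY[I][J][K][e];

/* 2) ... and every loop that touches the array has to be rewritten to match. */
for (int I = 1; I < Z; ++I)
    for (int J = 1; J < Y; ++J)
        for (int K = 1; K < X; ++K)
            ARRAY2[I][J][4][K] = Func(ARRAY2[I][J][2][K], ARRAY2[I][J][0][K]);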
Implementation of the different rearrangements in RTSH
To solve the problems described above, an automatic array-transformation mechanism was devised in RTSH. It allows the user's DVMH program to be sped up considerably (by several times) in cases of unfortunate access to GPU memory, compared with execution without this feature. Before considering the types of transformation and their CUDA implementation, let us point out some indisputable advantages of this approach:
- The user has a single DVMH program and can concentrate on creating the algorithm.
- The mode is enabled during compilation of the DVMH program by specifying just one DVMH compiler option, -autoTfm. So, without changing the program, the user can try both modes and evaluate the speedup.
- The transformation is performed on demand. This means that if the order of the array dimensions was changed before a computational loop, no reverse rearrangement is performed after the computation, because that placement may also turn out to be advantageous for the next loop.
- A considerable speedup of the program (up to 6x) compared with the same program executed without this option.
1. Swapping physically adjacent array dimensions
The example above fits this type of transformation. In this case a two-dimensional plane has to be transposed, and it may lie in any two adjacent dimensions of the array. If the two fastest dimensions have to be transposed, the well-known matrix transposition algorithm that uses shared memory is suitable:
__shared__ T temp[BLOCK_DIM][BLOCK_DIM + 1];
CudaSizeType x1Index = (blockIdx.x + pX) * blockDim.x + threadIdx.x;
CudaSizeType y1Index = (blockIdx.y + pY) * blockDim.y + threadIdx.y;
CudaSizeType x2Index = (blockIdx.y + pY) * blockDim.y + threadIdx.x;
CudaSizeType y2Index = (blockIdx.x + pX) * blockDim.x + threadIdx.y;
CudaSizeType zIndex = blockIdx.z + pZ;
CudaSizeType zAdd = zIndex * dimX * dimY;
CudaSizeType idx1 = x1Index + y1Index * dimX + zAdd;
CudaSizeType idx2 = x2Index + y2Index * dimY + zAdd;

if ((x1Index < dimX) && (y1Index < dimY)) {
    temp[threadIdx.y][threadIdx.x] = inputMatrix[idx1];
}
__syncthreads();
if ((x2Index < dimY) && (y2Index < dimX)) {
    outputMatrix[idx2] = temp[threadIdx.x][threadIdx.y];
}
In the case of a square matrix, the transposition can be done in place, without allocating extra GPU memory.
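As an illustration of this remark, here is a minimal sketch of an in-place transposition of a square matrix (an assumption for illustration, not the actual RTSH kernel; the element type is fixed to double): each CUDA block loads a pair of mirrored tiles into shared memory and writes them back swapped and transposed.

#define BLOCK_DIM 16

// In-place transpose of a square dimX x dimX matrix: each block handles a pair of
// mirrored tiles (including the diagonal ones) and exchanges them through shared memory.
__global__ void transposeInPlace(double *m, int dimX)
{
    __shared__ double tileA[BLOCK_DIM][BLOCK_DIM + 1];
    __shared__ double tileB[BLOCK_DIM][BLOCK_DIM + 1];

    if (blockIdx.y > blockIdx.x) return;            // process only the lower triangle of tiles

    int xA = blockIdx.x * BLOCK_DIM + threadIdx.x;
    int yA = blockIdx.y * BLOCK_DIM + threadIdx.y;
    int xB = blockIdx.y * BLOCK_DIM + threadIdx.x;  // mirrored tile
    int yB = blockIdx.x * BLOCK_DIM + threadIdx.y;

    if (xA < dimX && yA < dimX) tileA[threadIdx.y][threadIdx.x] = m[yA * dimX + xA];
    if (xB < dimX && yB < dimX) tileB[threadIdx.y][threadIdx.x] = m[yB * dimX + xB];
    __syncthreads();

    // write each tile back to the mirrored position, transposed via shared memory
    if (xB < dimX && yB < dimX) m[yB * dimX + xB] = tileA[threadIdx.x][threadIdx.y];
    if (xA < dimX && yA < dimX) m[yA * dimX + xA] = tileB[threadIdx.x][threadIdx.y];
}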
2. Swapping physically non-adjacent array dimensions
This type covers the permutation of any two dimensions of an array. It is worth distinguishing two kinds of such permutations: those in which the fastest dimension changes, and those in which any of the remaining dimensions are swapped. The reason for this division should be clear: the elements of the fastest dimension lie contiguously, and access to them must remain contiguous. Shared memory can be used here as well:
__shared__ T temp[BLOCK_DIM][BLOCK_DIM + 1];
CudaSizeType x1Index = (blockIdx.x + pX) * blockDim.x + threadIdx.x;
CudaSizeType y1Index = (blockIdx.y + pY) * blockDim.y + threadIdx.y;
CudaSizeType x2Index = (blockIdx.y + pY) * blockDim.y + threadIdx.x;
CudaSizeType y2Index = (blockIdx.x + pX) * blockDim.x + threadIdx.y;
CudaSizeType zIndex = blockIdx.z + pZ;
CudaSizeType zAdd = zIndex * dimX * dimB * dimY;
CudaSizeType idx1 = x1Index + y1Index * dimX * dimB + zAdd;
CudaSizeType idx2 = x2Index + y2Index * dimY * dimB + zAdd;

for (CudaSizeType k = 0; k < dimB; k++) {
    if (k > 0)
        __syncthreads();
    if ((x1Index < dimX) && (y1Index < dimY)) {
        temp[threadIdx.y][threadIdx.x] = inputMatrix[idx1 + k * dimX];
    }
    __syncthreads();
    if ((x2Index < dimY) && (y2Index < dimX)) {
        outputMatrix[idx2 + k * dimY] = temp[threadIdx.x][threadIdx.y];
    }
}
If the other dimensions have to be rearranged relative to each other, shared memory is not needed, because the accesses along the fastest dimension of the array remain "correct": adjacent threads work with adjacent cells of GPU memory.
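A minimal sketch of such a rearrangement without shared memory is shown below (an illustration, not the RTSH kernel): the two outer dimensions d2 and d1 of a [d2][d1][d0] array are swapped, while the fastest dimension d0 stays mapped onto threadIdx.x, so both reads and writes remain coalesced.

// Swapping two non-fast dimensions: src has layout [d2][d1][d0], dst has layout [d1][d2][d0].
// The fastest dimension d0 is mapped onto threadIdx.x, so no shared memory is required.
__global__ void swapOuterDims(const double *src, double *dst, int d2, int d1, int d0)
{
    int i0 = blockIdx.x * blockDim.x + threadIdx.x;   // fastest dimension
    int i1 = blockIdx.y * blockDim.y + threadIdx.y;
    int i2 = blockIdx.z * blockDim.z + threadIdx.z;
    if (i0 < d0 && i1 < d1 && i2 < d2) {
        size_t from = ((size_t)i2 * d1 + i1) * d0 + i0;
        size_t to   = ((size_t)i1 * d2 + i2) * d0 + i0;
        dst[to] = src[from];
    }
}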
3. Diagonalization of the array
This type of rearrangement is non-standard and is needed for the parallel execution of loops with regular data dependences. Such a rearrangement provides "correct" access when processing loops with dependences. Consider an example of such a loop:
#pragma dvm parallel([ii][j][i] on A[i][j][ii]) across(A[1:1][1:1][1:1])
for (ii = 1; ii < K - 1; ii++)
    for (j = 1; j < M - 1; j++)
        for (i = 1; i < N - 1; i++)
            A[i][j][ii] = A[i + 1][j][ii] + A[i][j + 1][ii] + A[i][j][ii + 1] +
                          A[i - 1][j][ii] + A[i][j - 1][ii] + A[i][j][ii - 1];
In this case there is a dependence along all three dimensions of the loop, or, equivalently, along all three dimensions of the array A. To tell the DVMH compiler that the loop has a regular dependence (one whose dependent elements can be expressed as a*x + b, where a and b are constants), there is the ACROSS clause. This loop has both flow and anti dependences. The iteration space of the loop forms a parallelepiped (in a particular case, a cube). The planes of this parallelepiped rotated by 45 degrees with respect to its faces can be executed in parallel, while the planes themselves are processed one after another. Because of this, accesses to diagonal elements of the two fastest dimensions of the array A appear. To improve GPU performance, a diagonal transformation of the array has to be performed. In the simple case, the transformation of one plane looks as follows:
This transformation can be performed in the same way as matrix transposition, using shared memory. The only difference from matrix transposition is that the processed blocks are not squares but have the shape of parallelograms, so that GPU memory bandwidth is used efficiently both on reads and on writes (only the first strip is shown diagonalized; everything else is shifted in the same way):
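As a rough illustration of such a skewed layout (an assumed mapping chosen only for this sketch; the actual index transformation used by RTSH is the one computed by dvmh_convert_XY), element (x, y) of an Rx by Ry plane can be stored so that the elements of one wrapped diagonal become contiguous:

// Illustration only (an assumed skewed layout, not the actual dvmh_convert_XY formula):
// elements with the same value of (x + y) mod Rx, i.e. one wrapped diagonal,
// end up next to each other, so a "diagonal" warp reads contiguous memory.
__host__ __device__ inline long diagIndex(long x, long y, long Rx, long Ry)
{
    long diag = (x + y) % Rx;   // which wrapped diagonal the element belongs to
    return diag * Ry + y;       // position: diagonal number first, then row inside it
}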
The following types of diagonalization are implemented (Rx and Ry are the sizes of the rectangle being diagonalized):
- parallel to the secondary diagonal with Rx == Ry;
- parallel to the secondary diagonal with Rx < Ry;
- parallel to the secondary diagonal with Rx > Ry;
- parallel to the main diagonal with Rx == Ry;
- parallel to the main diagonal with Rx < Ry;
- parallel to the main diagonal with Rx > Ry.
The common core of the diagonalization looks as follows:
__shared__ T data[BLOCK_DIM][BLOCK_DIM + 1];
__shared__ IndexType sharedIdx[BLOCK_DIM][BLOCK_DIM + 1];
__shared__ bool conditions[BLOCK_DIM][BLOCK_DIM + 1];
bool condition;
IndexType shift;
int revX, revY;

if (slash == 0) {
    shift = -threadIdx.y;
    revX = BLOCK_DIM - 1 - threadIdx.x;
    revY = BLOCK_DIM - 1 - threadIdx.y;
} else {
    shift = threadIdx.y - BLOCK_DIM;
    revX = threadIdx.x;
    revY = threadIdx.y;
}

IndexType x = (IndexType)blockIdx.x * blockDim.x + threadIdx.x + shift;
IndexType y = (IndexType)blockIdx.y * blockDim.y + threadIdx.y;
IndexType z = (IndexType)blockIdx.z * blockDim.z + threadIdx.z;

dvmh_convert_XY<IndexType, slash, cmp_X_Y>(x, y, Rx, Ry, sharedIdx[threadIdx.y][threadIdx.x]);
condition = (0 <= x && 0 <= y && x < Rx && y < Ry);
conditions[threadIdx.y][threadIdx.x] = condition;
if (back == 1)
    __syncthreads();

#pragma unroll
for (int zz = z; zz < z + manyZ; ++zz) {
    IndexType normIdx = x + Rx * (y + Ry * zz);
    if (back == 0) {
        if (condition && zz < Rz)
            data[threadIdx.y][threadIdx.x] = src[normIdx];
        __syncthreads();
        if (conditions[revX][revY] && zz < Rz)
            dst[sharedIdx[revX][revY] + zz * Rx * Ry] = data[revX][revY];
    } else {
        if (conditions[revX][revY] && zz < Rz)
            data[revX][revY] = src[sharedIdx[revX][revY] + zz * Rx * Ry];
        __syncthreads();
        if (condition && zz < Rz)
            dst[normIdx] = data[threadIdx.y][threadIdx.x];
    }
}
In this case the condition values and the coordinates computed by dvmh_convert_XY have to be handed over to other threads through shared memory, because, unlike transposition, both coordinates (where to read from and where to write to) cannot be computed explicitly.
In total, the following rearrangements are implemented:
- permutation of two adjacent array dimensions;
- permutation of two non-adjacent array dimensions;
- diagonalization of the two fastest, adjacent array dimensions;
- [planned] diagonalization of any two array dimensions (the diagonalized dimensions become the fastest ones);
- copying of a slice of a diagonalized array (for example, to update shadow edges when computing on several GPUs).
Performance evaluation
To demonstrate the effectiveness of the approach, we present graphs showing the performance of the rearrangements themselves and the results of two programs: an LU decomposition for a gas dynamics problem and a synthetic test implementing the method of successive over-relaxation for solving a three-dimensional Dirichlet problem. All tests were run on a GTX Titan GPU with the Nvidia CUDA Toolkit 7.0 and an Intel Xeon E5 1660 v2 processor with version 15 of the Intel compiler.
An array rearrangement is implemented as copying one part of memory into another according to a certain rule, so all transformations are compared with the kernel of a plain copy. The copy kernel looks like this:
__global__ void copyGPU(const double *src, double *dist, unsigned elems)
{
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < elems)
        dist[idx] = src[idx];
}
The transformation speed is reported only for the algorithms that use shared memory, because synchronization inside a CUDA block and accesses to shared memory (the L1 cache) introduce extra overhead that lowers the copy performance. The other rearrangements reach the speed of the copyGPU kernel in 100% of cases; its overhead is minimal, and this kernel achieves nearly the maximum GPU memory bandwidth.
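As a sketch of how such figures can be obtained (an assumed measurement scheme, not necessarily the one used for the graphs below), the copyGPU kernel shown above can be timed with CUDA events and its effective bandwidth computed; a transformation kernel is then timed in the same way and the two numbers are compared.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyGPU(const double *src, double *dist, unsigned elems);  // the kernel shown above

int main()
{
    const unsigned elems = 64u * 1024u * 1024u;          // 512 MB per buffer
    double *src, *dst;
    cudaMalloc((void **)&src, elems * sizeof(double));
    cudaMalloc((void **)&dst, elems * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int repeats = 20;
    cudaEventRecord(start);
    for (int i = 0; i < repeats; ++i)
        copyGPU<<<(elems + 255) / 256, 256>>>(src, dst, elems);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    ms /= repeats;
    // one read plus one write of the whole buffer per launch
    double gbs = 2.0 * elems * sizeof(double) / (ms * 1.0e-3) / 1.0e9;
    printf("copyGPU: %.2f ms, %.1f GB/s\n", ms, gbs);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}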
The first graph shows how much slower the transformation (transposition) and the diagonalization of a two-dimensional matrix are. The matrix sizes range from a few megabytes to one gigabyte. The graph shows that for two-dimensional matrices the performance drops by 20-25% compared with the copyGPU kernel. Since the diagonalization algorithm is more complex than matrix transposition, the diagonalization takes about 5% longer.
The second graph shows how much slower the transformation (transposition) and the diagonalization of three-dimensional matrices are. The matrices were taken in two forms: four-dimensional of the type N*N*N*5 and arbitrary X*Y*Z. The matrix sizes range from 10 megabytes to 500 megabytes. On small matrices the transformation speed drops by 40%, while on large ones the transposition reaches 90% and the diagonalization 80% of the copy speed.
The third graph shows the execution time of the synthetic test implementing the method of successive over-relaxation. The computational loop of this method has dependences along all three dimensions (this loop in C was described above). The graph shows the execution time of the same DVMH program (the source code in Fortran is attached at the end of the article). In this case the diagonalization has to be performed only once, before the iterative computations.
The fourth graph shows the speedup of an application that solves a synthetic system of nonlinear partial differential equations (a three-dimensional Navier-Stokes system for a compressible liquid or gas) using the symmetric successive over-relaxation method (the SSOR algorithm, the LU benchmark). This test is part of the standard NASA benchmark suite (the latest version, 3.3.1). The suite provides the source code of all tests not only in MPI and OpenMP versions but also as serial versions.
This graph shows the speedup of the program relative to the serial version, executed on one core of the Xeon E5, on six threads of the Xeon E5, and on the GPU in two modes. In this program only two loops require diagonalization and restoring the arrays to their original state; that is, on every iteration of the "bad" loops all the arrays they need are diagonalized and, after the loops finish, diagonalized back. Note that this program contains about 250 thousand lines in Fortran 90 style (without hyphenation; the code is attached at the end of the article). By adding 125 DVMH directives we made it possible to run this program on a cluster, on one node with different devices, and in serial mode.
This program is well optimized at the level of the serial code (as indicated by the 8x speedup on 6 Xeon E5 cores) and maps well not only onto the GPU architecture but also onto multi-core processors. With the -Minfo option, the DVMH compiler can show the number of registers required by the CUDA kernels corresponding to each loop being mapped (this information is obtained from the Nvidia compiler). In this case, each of the three main computational loops requires about 160 registers per thread (out of the 255 available), and the ratio of operations to global-memory accesses is roughly 10 to 1. Therefore the speedup obtained from the rearrangement is not that large, but it is still there: on large problems it is 1.5x compared with the same program executed without this option. In addition, this test runs 3 times faster on the GPU than on 6 CPU cores.
Conclusion
In this post we considered a method for automatically rearranging data on the GPU in the support system for executing DVMH programs. Here we are talking about complete automation of the process: RTSH has, during program execution, all the information needed to determine the required type of rearrangement. This approach makes it possible to obtain a decent speedup in programs for which it is impossible to write a "good" sequential program whose loops access GPU global memory in the best possible way. When the transformations are performed, up to 90% of the speed of the fastest in-device memory copy kernel is reached, which itself runs close to the global GPU memory bandwidth (about 240 GB/s on a GTX Titan).
References
1) The DVM system
2) Fortran-DVMH source code of LU and SOR
3) The NAS benchmarks