Caffe, for example, is a popular platform for developing deep learning neural networks. It was created at the Berkeley Vision and Learning Center (BVLC) and is favored by a community of independent developers who contribute to it. The platform is under continuous development, as the statistics on its GitHub project page show. Caffe is described as "a fast, open framework for deep learning". Is it possible to make such a "fast" toolset even faster? With that question in mind, we decided to optimize Caffe for Intel architecture.
Looking ahead, note that thanks to integration with Intel Math Kernel Library 2017 and a series of optimizations carried out according to the plan outlined in this article, Caffe now runs on Intel processors more than 10 times faster than the base version, which we call BVLC Caffe below. The version optimized for Intel architecture is called Intel Caffe for brevity. Its source code is available at the linked repository.
The main areas of performance improvement (described in detail below) are code refactoring, optimization based on vector instruction sets such as Intel AVX2, fine-tuning of compilation, and more efficient multithreaded execution with OpenMP. Testing was performed on a system with two Intel Xeon processors. In particular, we examined the speed of a neural network built with Caffe when it works with images from the CIFAR-10 set. The results were analyzed with Intel VTune Amplifier XE 2017 and other tools.
A similar approach can be used to improve the performance of many other programs, including other platforms for deep learning of neural networks.
Before moving on to optimization, we describe deep learning algorithms and the tasks they help solve.
About deep learning algorithms
Deep learning algorithms belong to a broader class of machine learning algorithms that in recent years have shown significant results in pattern recognition in photos and video, speech recognition, natural language processing, and other fields that require processing large volumes of information and solving data analysis problems. The success of deep learning rests on the ability to process large datasets and on recent advances in computing and algorithms. Such algorithms work by passing data through the layers of a network, where the information is transformed and increasingly complex features are extracted from it.
Below is an example of how each level of a deep neural network is trained to identify increasingly complex features. It shows a small set of features recognized by a deep network, visualized as grayscale images, together with the original color images whose processing led to the selection of these features. The image is taken from the linked source.
Convolutional neural networks
Supervised deep learning algorithms need a labeled dataset. Three common types of supervised deep neural networks are the multilayer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN). In these networks the input data undergoes a series of linear and nonlinear transformations as it passes through each layer, and the network output is produced as a result. The network response is compared with the expected result and the error is found. The gradient vector of the error surface is then computed for the output layer, taking into account the activation function and the contribution of the neurons' synaptic weights, and the same procedure is carried out for the other layers using the previously obtained data. This training method is called the error backpropagation algorithm; applying it gradually changes the weight coefficients of the network's neurons.
In a multilayer perceptron, the input data of each layer (represented as a vector) is first multiplied by a dense weight matrix unique to that layer. In recurrent networks this matrix is the same for every layer (because the layers are recurrent), and the properties of the network depend on the input signal. Convolutional networks resemble multilayer perceptrons, but they use sparse matrices in their hidden layers, which are called convolutional layers. In such networks the matrix multiplication is expressed as a convolution of the matrix representation of the weights with the matrix representation of the layer's input data. Convolutional networks are popular in image recognition and can also be applied to speech recognition and natural language processing. More detail on such networks is available in the linked material.
Caffe, CIFAR-10, and image classification
As already mentioned, we optimize BVLC Caffe, a common platform for creating and exploring deep learning networks, for Intel architecture. We test the original and the optimized versions of the platform using the CIFAR-10 dataset, which is often used in image classification tasks, and a neural network model built in Caffe.
Example images from the CIFAR-10 set
The CIFAR-10 dataset consists of 60,000 color images of 32x32 pixels, divided into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The classes do not overlap. For example, there is no overlap between the "automobile" and "truck" classes: "automobile" includes sedans and SUVs, while "truck" includes only large trucks. Pickup trucks are not included in either group.
The network used in the performance tests contains several types of layers: layers with a sigmoid activation function (layer type Sigmoid in Caffe terms), convolutional layers (type Convolution), spatial pooling or subsampling layers (type Pooling), batch normalization layers (type BatchNorm), and fully connected layers (type InnerProduct). At the output of the network there is a layer with the Softmax activation function (type SoftmaxWithLoss). This network and its layers are described in more detail below. For now, we start with an analysis of the original version of Caffe.
Initial performance analysis
One way to evaluate the performance of BVLC Caffe and Intel Caffe is the time command, which measures the time the signal needs to pass through the layers in the forward and backward directions. The command is very useful for measuring the time spent on computation in each layer and for comparing the execution times of different models.
    ./build/tools/caffe time \
        --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
        -iterations 1000
Here an "iteration" (set by the iterations parameter) is one forward and one backward pass of a batch of images. The command above prints the average execution time over 1000 iterations, both for individual layers and for the whole network. The result of this command for BVLC Caffe is shown below.
Output of the time command for BVLC Caffe
The tests ran on a two-socket system. Each socket held an Intel Xeon processor E5-2699 v3 (2.3 GHz) with 18 physical cores, and Intel Hyper-Threading Technology was disabled. The system therefore had 36 physical processor cores and the same number of OpenMP threads, set through the OMP_NUM_THREADS environment variable. Unless stated otherwise, only this configuration was used in the experiments. We recommend letting Intel Caffe set the OpenMP environment variables automatically instead of setting them manually. The system had 64 GB of DDR4 memory running at 2,133 MHz.
The results above show what Intel engineers achieved by optimizing the code. To measure performance, we used the following tools:
- Callgrind from the Valgrind toolkit.
- Intel VTune Amplifier XE 2017 beta.
Intel VTune Amplifier XE provides the following information:
- The functions that place the greatest load on the system (hotspots).
- System calls (including task switching).
- CPU and cache usage.
- Load distribution across OpenMP threads.
- Thread locks.
- Memory usage.
Performance analysis helps find good candidates for optimization, such as functions that load the system heavily and function calls that take a relatively long time to complete.
The following figure shows the Intel VTune summary of the BVLC Caffe performance analysis after 100 iterations. The Elapsed Time indicator at the top of the figure is 37 seconds: the time it took to execute the code on the test system. The CPU Time indicator is 1,306 seconds, slightly less than 37 seconds multiplied by 36 cores (1,332 seconds). CPU Time is the total code execution time across all threads involved in the computation (or across all cores, since Intel HT Technology is disabled).
General results of the BVLC Caffe performance analysis on the CIFAR-10 dataset in Intel VTune Amplifier XE 2017 beta
The CPU usage histogram at the bottom of the figure shows how often a given number of threads were active simultaneously during the test. Of the 37 seconds, 14 were spent on a single thread (that is, a single core). For the rest of the time the multithreading is very inefficient: mostly fewer than 20 threads take part in the work.
The Top Hotspots section in the middle of the figure shows the functions that account for most of the computation, listing the function calls and their contributions to the total processor time. The kmp_fork_barrier function belongs to the OpenMP runtime and took 1,130 seconds of processor time. This means that about 87% of the processor time was spent by threads idling at this barrier without doing any useful work.
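The arithmetic behind these figures can be checked with a few lines of C++. This is only an illustrative sketch: the helper names `cpu_time_budget` and `share_percent` are made up for this example and are not part of Caffe or VTune, and the inputs are the values quoted in the text.

```cpp
#include <cassert>
#include <cmath>

// Upper bound on CPU time: wall-clock seconds times the number of cores.
// Hypothetical helper for illustration only.
double cpu_time_budget(double elapsed_s, int cores) {
  assert(cores > 0);
  return elapsed_s * cores;
}

// Share of `part` in `total`, rounded to the nearest whole percent.
long share_percent(double part, double total) {
  assert(total > 0.0);
  return std::lround(100.0 * part / total);
}
```

With the numbers from the text, `cpu_time_budget(37.0, 36)` gives 1332 seconds and `share_percent(1130.0, 1306.0)` gives 87.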
The BVLC Caffe source code contains no #pragma omp parallel lines: the code itself does not explicitly use the OpenMP library to organize multithreaded data processing. Intel MKL, however, uses OpenMP threads internally to parallelize some basic mathematical computations. To confirm this parallelization, you can use the Bottom-up tab of Intel VTune XE; its contents after testing BVLC Caffe on the CIFAR-10 dataset are shown in the next figure. The tab lists the function calls and additional information about them. Of particular interest are the Effective Time by Utilization indicator (upper part of the tab) and the distribution of the load created by the hotspot functions (lower part).
Visualization of function timing and the list of functions that load the system most when BVLC Caffe runs on the CIFAR-10 dataset
The gemm_omp_driver_v2 function is part of the libmkl_intel_thread.so library and is a general implementation of matrix multiplication (GEMM) in Intel MKL. Internally it uses OpenMP multithreading. The Intel MKL matrix multiplication function is the main function used in forward and backward propagation, that is, in obtaining the network response and in training. Intel MKL uses multithreaded execution, which usually shortens GEMM computations. In this particular case, however, the convolution for 32x32 images does not create enough work to use all 36 OpenMP threads on 36 cores efficiently in a single GEMM operation. A different scheme for multithreading and parallelizing the code is therefore needed, as shown below.
To demonstrate the extra load caused by working with many OpenMP threads, we ran the same code with the environment variable OMP_NUM_THREADS=1 and compared the execution time with the previous result. The outcome is shown in the next figure: the elapsed time is 31.1 seconds instead of the 37 seconds of the previous test. Setting the environment variable to one made OpenMP create a single thread and use it to execute the code. The resulting difference of almost 6 seconds shows the additional load caused by initializing and synchronizing OpenMP threads.
General results of the BVLC Caffe performance analysis on the CIFAR-10 dataset in Intel VTune Amplifier XE 2017 beta using a single thread
The middle part of the figure lists the functions that load the system most heavily. Among them we found three main candidates for optimization: the functions im2col_cpu, col2im_cpu, and PoolingLayer::Forward_cpu.
Code optimization
On the CIFAR-10 dataset, Caffe optimized for Intel architecture is about 13.5 times faster than BVLC Caffe. The following figure shows the average results after 1000 iterations, with BVLC Caffe on the left and Intel Caffe on the right. The total execution time was 270 milliseconds in the first case and 20 milliseconds in the second.
Performance comparison of BVLC Caffe and Intel Caffe
See the linked material for details on how the layer computation parameters are set.
The following sections describe the optimizations used to improve the performance of the computations in various layers. We followed the guidelines of the Intel Modern Code program. Some of the optimizations are based on the basic math functions of Intel MKL 2017.
Scalar and sequential optimizations
Code vectorization
After profiling the BVLC Caffe code and identifying the hottest functions, the ones consuming the most CPU time, we started vectorizing the code. The changes included the following:
- Improved use of the Basic Linear Algebra Subprograms (BLAS) library, namely a switch from Automatically Tuned Linear Algebra Software (ATLAS) to Intel MKL.
- Assembly-level optimizations (using the Xbyak JIT assembler).
- Code vectorization using the GNU Compiler Collection (GCC) and OpenMP.
BVLC Caffe can use Intel MKL BLAS function calls or other implementations of the same routines. The GEMM function, for example, is optimized for vectorization, multithreaded execution, and efficient use of cache memory. To improve vectorization further, we used Xbyak, a JIT assembler for the x86 (IA-32) and x64 (AMD64 or x86-64) architectures. Xbyak supports the following instruction sets: MMX, Intel Streaming SIMD Extensions (Intel SSE), Intel SSE2, Intel SSE3, Intel SSE4, the floating-point unit, Intel AVX, Intel AVX2, and Intel AVX-512.
Xbyak is an x86/x64 assembler for C++, a library designed specifically to improve code execution efficiency. It is supplied as a header file and can dynamically assemble mnemonic instructions for the x86 and x64 architectures. JIT generation of binary code at run time opens up additional optimization opportunities, for example optimizing quantization (an operation that divides the elements of one array by the elements of another) or optimizing polynomial computation by automatically creating the required functions while the program runs. With its support for the Intel AVX and Intel AVX2 vector instruction sets, Xbyak makes it possible to reach a high level of code vectorization in Caffe optimized for Intel architecture. The latest version of Xbyak supports the Intel AVX-512 vector instruction set, which improves computing performance on Intel Xeon Phi x200 family processors.
Better vectorization lets the code generated with Xbyak process more data at once with SIMD instructions and use data parallelism more efficiently. We used Xbyak to optimize the code and significantly improved the performance of the computations in the spatial pooling layers. When the pooling parameters are known, we can generate assembly code for a specific pooling model that uses a specific data-processing window or algorithm. The result is plain assembly code that has proven to run more efficiently than C++ code compiled without Xbyak.
General code optimization
Other sequential optimizations included the following:
- Reducing algorithmic complexity.
- Reducing the amount of computation.
- Loop unrolling.
Eliminating repeated execution of code whose result does not change is one of the scalar optimization techniques we applied. The goal is to compute in advance whatever would otherwise be recomputed inside the most deeply nested loop.
For example, consider the following code fragment:
    for (int h_col = 0; h_col < height_col; ++h_col) {
      for (int w_col = 0; w_col < width_col; ++w_col) {
        int h_im = h_col * stride_h - pad_h + h_offset;
        int w_im = w_col * stride_w - pad_w + w_offset;
        // ...
      }
    }
In the third line of this fragment, the computation of the variable h_im does not use w_col, the index of the inner loop. Nevertheless, the variable is recomputed on every iteration of the nested loop. We can move this line out of the inner loop, which brings the code to the following form:
    for (int h_col = 0; h_col < height_col; ++h_col) {
      int h_im = h_col * stride_h - pad_h + h_offset;
      for (int w_col = 0; w_col < width_col; ++w_col) {
        int w_im = w_col * stride_w - pad_w + w_offset;
        // ...
      }
    }
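The effect of hoisting the loop-invariant line can be illustrated with a small self-contained sketch. The parameter names follow the fragments above; the summation is an assumption added only so that the two loop nests produce a result that can be compared, and the function names are hypothetical.

```cpp
#include <cassert>

// Recomputes h_im on every inner iteration, as in the first fragment.
long sum_naive(int height_col, int width_col, int stride_h, int stride_w,
               int pad_h, int pad_w, int h_offset, int w_offset) {
  assert(height_col >= 0 && width_col >= 0);
  long sum = 0;
  for (int h_col = 0; h_col < height_col; ++h_col) {
    for (int w_col = 0; w_col < width_col; ++w_col) {
      int h_im = h_col * stride_h - pad_h + h_offset;  // does not depend on w_col
      int w_im = w_col * stride_w - pad_w + w_offset;
      sum += h_im + w_im;
    }
  }
  return sum;
}

// Computes h_im once per outer iteration, as in the second fragment.
long sum_hoisted(int height_col, int width_col, int stride_h, int stride_w,
                 int pad_h, int pad_w, int h_offset, int w_offset) {
  assert(height_col >= 0 && width_col >= 0);
  long sum = 0;
  for (int h_col = 0; h_col < height_col; ++h_col) {
    const int h_im = h_col * stride_h - pad_h + h_offset;
    for (int w_col = 0; w_col < width_col; ++w_col) {
      int w_im = w_col * stride_w - pad_w + w_offset;
      sum += h_im + w_im;
    }
  }
  return sum;
}
```

Both functions return the same value for the same arguments; the second one simply performs the h_im arithmetic height_col times instead of height_col * width_col times. An optimizing compiler may perform this hoisting on its own, so the gain depends on the build settings.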
Processor- and system-specific optimizations and other general approaches to improving the code
Some additional general code optimizations that we applied:
- Improvement of the im2col_cpu and col2im_cpu functions.
- Reduced complexity of the batch normalization operation.
- Processor- and system-specific optimizations.
- Use of one core per compute thread.
- Elimination of thread migration between compute cores.
Intel VTune Amplifier XE showed that im2col_cpu is one of the functions that load the system most heavily, which makes it a good candidate for performance optimization. The im2col_cpu function implements a standard step of the direct convolution operation. Each local fragment is unrolled into a separate vector, and the whole image is converted into a larger matrix (which makes memory handling more intensive) whose rows correspond to the many locations where the filters are applied.
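To make the transformation concrete, here is a minimal sketch of an im2col-style unrolling. It is not the Caffe implementation: it assumes a single channel, stride 1, and no padding, and the function name `im2col_simple` is hypothetical. The index order follows the code fragments in this article, with the kernel offset as the slow index and the output position as the fast index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unrolls every kernel x kernel patch of a single-channel image.
// Slow index r selects the kernel offset (r / kernel, r % kernel);
// the fast index walks over the output positions in row-major order.
std::vector<float> im2col_simple(const std::vector<float>& image,
                                 int height, int width, int kernel) {
  assert(kernel > 0 && height >= kernel && width >= kernel);
  assert(image.size() == static_cast<std::size_t>(height) * width);
  const int height_col = height - kernel + 1;
  const int width_col = width - kernel + 1;
  std::vector<float> data_col(static_cast<std::size_t>(kernel) * kernel *
                              height_col * width_col);
  std::size_t idx = 0;
  for (int r = 0; r < kernel * kernel; ++r) {
    const int h_offset = r / kernel;
    const int w_offset = r % kernel;
    for (int h = 0; h < height_col; ++h) {
      for (int w = 0; w < width_col; ++w) {
        data_col[idx++] = image[(h + h_offset) * width + (w + w_offset)];
      }
    }
  }
  return data_col;
}
```

For a 3x3 image holding the values 1 to 9 and a 2x2 kernel, the result has 16 elements, four times more than the four output positions alone, which illustrates why the step is memory-intensive.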
One way to optimize im2col_cpu is to reduce the number of operations needed to access the data. The BVLC Caffe code has three nested loops that iterate over the image pixels:
    for (int c_col = 0; c_col < channels_col; ++c_col)
      for (int h_col = 0; h_col < height_col; ++h_col)
        for (int w_col = 0; w_col < width_col; ++w_col)
          data_col[(c_col * height_col + h_col) * width_col + w_col] = // ...
In this snippet BVLC Caffe computes the index of the data_col element on every iteration, although the elements of this array are simply visited in order. The four arithmetic operations (two additions and two multiplications) can therefore be replaced with a single index increment. In addition, the condition check can be simplified using the following function:
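The equivalence of the two ways of addressing data_col can be sketched as follows. The function names and the value written into each element are assumptions made for the example; only the index arithmetic mirrors the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Addresses data_col with the full index computation on every iteration.
std::vector<int> fill_indexed(int channels_col, int height_col, int width_col) {
  assert(channels_col >= 0 && height_col >= 0 && width_col >= 0);
  std::vector<int> data_col(static_cast<std::size_t>(channels_col) *
                            height_col * width_col);
  for (int c_col = 0; c_col < channels_col; ++c_col)
    for (int h_col = 0; h_col < height_col; ++h_col)
      for (int w_col = 0; w_col < width_col; ++w_col)
        data_col[(c_col * height_col + h_col) * width_col + w_col] =
            100 * c_col + 10 * h_col + w_col;
  return data_col;
}

// Addresses data_col with a running index that is incremented once per element.
std::vector<int> fill_incremented(int channels_col, int height_col,
                                  int width_col) {
  assert(channels_col >= 0 && height_col >= 0 && width_col >= 0);
  std::vector<int> data_col(static_cast<std::size_t>(channels_col) *
                            height_col * width_col);
  std::size_t index = 0;
  for (int c_col = 0; c_col < channels_col; ++c_col)
    for (int h_col = 0; h_col < height_col; ++h_col)
      for (int w_col = 0; w_col < width_col; ++w_col)
        data_col[index++] = 100 * c_col + 10 * h_col + w_col;
  return data_col;
}
```

Because the loops visit the elements in exactly the order in which they are laid out in memory, both functions fill the array identically.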
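    /* Casting int to unsigned makes it possible to check with a single
       comparison that a is greater than or equal to zero and less than b.
       b is always positive, so after the cast it stays below 0x800...,
       while a negative a turns into a value of 0x800... or above. */
    inline bool is_a_ge_zero_and_a_lt_b(int a, int b) {
      return static_cast<unsigned>(a) < static_cast<unsigned>(b);
    }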
The BVLC Caffe code checks conditions of the form if (x >= 0 && x < N), where x and N are signed integers and N is always positive. Converting both integers to unsigned changes the comparison interval, so a single comparison is enough instead of two comparisons combined with a logical AND:
    if (((unsigned) x) < ((unsigned) N))
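A short sketch can confirm that the single unsigned comparison agrees with the two-part check whenever the upper bound is positive. The function names here are hypothetical and chosen for the example.

```cpp
#include <cassert>

// Reference check with two comparisons and a logical AND.
inline bool in_range_two_checks(int x, int n) {
  return x >= 0 && x < n;
}

// Single comparison after casting to unsigned. A negative x wraps around
// to a value of at least 0x80000000 (for 32-bit int), which is never
// below a positive n.
inline bool in_range_one_check(int x, int n) {
  assert(n > 0);  // the trick relies on a positive upper bound
  return static_cast<unsigned>(x) < static_cast<unsigned>(n);
}
```

The precondition matters: if n could be negative, the cast would turn it into a very large unsigned value and the two checks would no longer agree.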
To keep the operating system from moving threads between compute cores, we used the OpenMP environment variable KMP_AFFINITY=compact,granularity=fine. Placing neighboring threads compactly improves the performance of GEMM operations, because threads that share the same last-level cache (LLC) can reuse data previously written to cache lines.
Further material covers the details of optimizations related to cache blocking, optimal data layout, and vectorization.
Code parallelization with OpenMP
Neural network layers
The following neural network mechanisms were optimized by applying OpenMP parallelization:
- Convolutional layer.
- Deconvolution layer.
- Local response normalization (LRN) layer.
- Rectified linear unit (ReLU) layer.
- Layer with the Softmax activation function.
- Concatenation layer.
- Utilities for OpenBLAS optimization, such as the vPowx operation y[i] = x[i]^β and the caffe_set, caffe_copy, and caffe_rng_bernoulli operations.
- Spatial pooling or subsampling (pooling) layer.
- Dropout layer, which thins the network to prevent overfitting.
- Batch normalization layer.
- Data layer.
- Layer that performs element-wise operations (Eltwise).
Convolutional layer
The convolutional layer, true to its name, convolves the input data with a set of weights, or filters, that change as the network is trained. Each filter produces one feature map in the output image. The optimization described here prevents the hardware from being underused when a single set of input feature maps is processed.
    template <typename Dtype>
    void ConvolutionLayer<Dtype>::Forward_cpu(
        const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
      const Dtype* weight = this->blobs_[0]->cpu_data();
      // The images of a batch are distributed across OpenMP threads
      // (36 on the test system); each thread calls MKL for its own image.
      for (int i = 0; i < bottom.size(); ++i) {
        const Dtype* bottom_data = bottom[i]->cpu_data();
        Dtype* top_data = top[i]->mutable_cpu_data();
    #ifdef _OPENMP
    #pragma omp parallel for num_threads(this->num_of_threads_)
    #endif
        for (int n = 0; n < this->num_; ++n) {
          this->forward_cpu_gemm(bottom_data + n * this->bottom_dim_, weight,
                                 top_data + n * this->top_dim_);
          if (this->bias_term_) {
            const Dtype* bias = this->blobs_[1]->cpu_data();
            this->forward_cpu_bias(top_data + n * this->top_dim_, bias);
          }
        }
      }
    }
The code processes k = min(num_threads, batch_size) sets of input feature maps at once. For example, k im2col operations run in parallel, followed by k calls to Intel MKL. Intel MKL automatically switches to single-threaded execution, and the overall performance is better than when Intel MKL processed one batch at a time. This behavior is defined in the src/caffe/layers/base_conv_layer.cpp source file. The multithreaded implementation optimized with OpenMP is in the src/caffe/layers/conv_layer.cpp source file.
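The batch-level scheme can be sketched in isolation as follows. This is a simplified illustration, not the Intel Caffe code: `process_item` is a hypothetical stand-in for the per-image im2col and GEMM work, and the other names are assumptions as well. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of batch items processed concurrently, k = min(num_threads, batch_size).
int concurrent_items(int threads, int batch_size) {
  assert(threads > 0 && batch_size >= 0);
  return std::min(threads, batch_size);
}

// Hypothetical per-image work; it only scales the input so that the
// result is easy to verify.
void process_item(const float* in, float* out, int dim, float weight) {
  for (int i = 0; i < dim; ++i) out[i] = weight * in[i];
}

// Distributes the images of a batch across OpenMP threads. Iterations are
// independent because each one writes only to its own slice of `top`.
void forward_batch(const std::vector<float>& bottom, std::vector<float>& top,
                   int num, int dim, float weight, int threads) {
  assert(bottom.size() == top.size());
  assert(bottom.size() == static_cast<std::size_t>(num) * dim);
  if (num == 0) return;
  const int k = concurrent_items(threads, num);
  (void)k;  // unused when compiled without OpenMP
#ifdef _OPENMP
#pragma omp parallel for num_threads(k)
#endif
  for (int n = 0; n < num; ++n) {
    process_item(bottom.data() + n * dim, top.data() + n * dim, dim, weight);
  }
}
```

Parallelizing over whole images gives each thread a sizable, independent piece of work, in contrast to splitting one small GEMM across all cores.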
Subsampling (pooling) layer
Max-pooling, average-pooling, and stochastic pooling (not yet implemented) are different downsampling techniques, of which max-pooling is the most common. The pooling layer partitions the results obtained from the previous layer into a set of usually non-overlapping rectangular tiles. For each tile the layer outputs the maximum (max-pooling), the arithmetic mean (average-pooling), or, in the future, a stochastic value (stochastic pooling) drawn from a multinomial distribution formed from the activations of the tile.
Pooling layers are useful in convolutional networks for three main reasons:
- Pooling reduces the dimensionality of the problem and the computational load on the upper layers.
- Pooling in the lower layers lets the convolution kernels of the upper layers cover larger regions of the input data and therefore learn more complex features. For example, kernels in a lower layer usually learn to recognize small elements of an image, while kernels in an upper layer can learn to recognize more complex structures, such as images of forests or beaches.
- Max-pooling makes the network more robust to image shifts.
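As an illustration of the tiling described above, here is a minimal max-pooling sketch. It is not the Caffe PoolingLayer: it assumes a single channel, non-overlapping square windows, and image dimensions that are multiples of the window size, and the function name `max_pool` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Max-pooling over non-overlapping window x window tiles of one channel.
std::vector<float> max_pool(const std::vector<float>& in, int height,
                            int width, int window) {
  assert(window > 0 && height % window == 0 && width % window == 0);
  assert(in.size() == static_cast<std::size_t>(height) * width);
  const int out_h = height / window;
  const int out_w = width / window;
  std::vector<float> out(static_cast<std::size_t>(out_h) * out_w);
  for (int oh = 0; oh < out_h; ++oh) {
    for (int ow = 0; ow < out_w; ++ow) {
      // Start from the top-left element of the tile, then scan the tile.
      float best = in[(oh * window) * width + ow * window];
      for (int dh = 0; dh < window; ++dh) {
        for (int dw = 0; dw < window; ++dw) {
          best = std::max(best,
                          in[(oh * window + dh) * width + ow * window + dw]);
        }
      }
      out[oh * out_w + ow] = best;
    }
  }
  return out;
}
```

For a 4x4 image holding the values 1 to 16 in row-major order and a 2x2 window, the output is the 2x2 image {6, 8, 14, 16}, a quarter of the original size.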
The pooling layer was optimized with Xbyak and additionally parallelized with OpenMP. The parallel loop has the following form:
    #ifdef _OPENMP
    #pragma omp parallel for collapse(2)
    #endif
    for (int image = 0; image < num_batches; ++image)
      for (int channel = 0; channel < num_channels; ++channel)
        generator_func(bottom_data, top_data, top_count, image, image + 1,
                       mask, channel, channel + 1, this, use_top_mask);
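With the collapse(2) clause, the OpenMP directive #pragma omp parallel for applies to both nested loops: the loop over the images of the batch and the loop over the image channels are merged into a single loop, which is then parallelized.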
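The following sketch shows the collapse(2) pattern on a simpler operation than pooling. The operation (adding a per-channel bias) and all names are assumptions chosen for the example; only the loop structure mirrors the fragment above. Without OpenMP enabled at compile time the pragma is ignored and the loops run serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds bias[channel] to every element of each (image, channel) plane.
// The two outer loops are perfectly nested, as collapse(2) requires, and
// every (image, channel) pair touches a distinct plane, so the merged
// iterations are independent.
void add_channel_bias(std::vector<float>& data, const std::vector<float>& bias,
                      int num_batches, int num_channels, int plane_size) {
  assert(bias.size() == static_cast<std::size_t>(num_channels));
  assert(data.size() ==
         static_cast<std::size_t>(num_batches) * num_channels * plane_size);
#ifdef _OPENMP
#pragma omp parallel for collapse(2)
#endif
  for (int image = 0; image < num_batches; ++image)
    for (int channel = 0; channel < num_channels; ++channel) {
      float* plane = data.data() +
          (static_cast<std::size_t>(image) * num_channels + channel) *
              plane_size;
      for (int i = 0; i < plane_size; ++i) plane[i] += bias[channel];
    }
}
```

Merging the loops exposes num_batches * num_channels iterations to the scheduler instead of only num_batches, which helps when the batch is smaller than the number of threads.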
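Softmax
The output of the network passes through the softmax function (layer type SoftmaxWithLoss). For a K-dimensional input vector x, softmax produces K output values, one for each index j.
In the optimized code, the loop of this layer that divides the values by the scale data is parallelized over channels with OpenMP: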
    // Division step, parallelized over channels.
    #ifdef _OPENMP
    #pragma omp parallel for
    #endif
    for (int j = 0; j < channels; j++) {
      caffe_div(inner_num_, top_data + j * inner_num_, scale_data,
                top_data + j * inner_num_);
    }
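For reference, the standard textbook definition of the softmax function for a K-dimensional vector x is given below. It is stated here on the assumption that this is the formula the section refers to with K, j, and x.

```latex
\sigma(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}, \qquad j = 1, \dots, K
```

Under that assumption, the caffe_div loop above corresponds to the division by the sum in the denominator, with scale_data presumably holding that sum.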
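ReLU
ReLU is an element-wise activation layer: it takes an input blob (the Caffe data structure) and produces an output blob of the same size.
For a positive input x the layer returns x; otherwise it returns x multiplied by the negative_slope parameter.
When negative_slope is zero, this is the standard ReLU function max(x, 0). Because the elements are processed independently of each other, the forward pass can be parallelized as follows: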
    template <typename Dtype>
    void ReLULayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
                                       const vector<Blob<Dtype>*>& top) {
      const Dtype* bottom_data = bottom[0]->cpu_data();
      Dtype* top_data = top[0]->mutable_cpu_data();
      const int count = bottom[0]->count();
      Dtype negative_slope = this->layer_param_.relu_param().negative_slope();
    #ifdef _OPENMP
    #pragma omp parallel for
    #endif
      for (int i = 0; i < count; ++i) {
        top_data[i] = std::max(bottom_data[i], Dtype(0))
            + negative_slope * std::min(bottom_data[i], Dtype(0));
      }
    }
The backward pass is parallelized in the same way:
    template <typename Dtype>
    void ReLULayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
                                        const vector<bool>& propagate_down,
                                        const vector<Blob<Dtype>*>& bottom) {
      if (propagate_down[0]) {
        const Dtype* bottom_data = bottom[0]->cpu_data();
        const Dtype* top_diff = top[0]->cpu_diff();
        Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
        const int count = bottom[0]->count();
        Dtype negative_slope = this->layer_param_.relu_param().negative_slope();
    #ifdef _OPENMP
    #pragma omp parallel for
    #endif
        for (int i = 0; i < count; ++i) {
          bottom_diff[i] = top_diff[i] * ((bottom_data[i] > 0)
              + negative_slope * (bottom_data[i] <= 0));
        }
      }
    }
The same approach applies to the sigmoid function S(x) = 1 / (1 + exp(-x)):
    #ifdef _OPENMP
    #pragma omp parallel for
    #endif
    for (int i = 0; i < count; ++i) {
      top_data[i] = sigmoid(bottom_data[i]);
    }
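The per-element formulas used in these fragments can be written as small standalone functions, which makes their behavior easy to check. This is a sketch for illustration: the function names `relu_forward` and `relu_backward_factor` are hypothetical, and float is used in place of the Dtype template parameter.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Forward value of the ReLU with a configurable negative slope.
float relu_forward(float x, float negative_slope) {
  return std::max(x, 0.0f) + negative_slope * std::min(x, 0.0f);
}

// Factor by which the incoming gradient is multiplied in the backward pass:
// 1 for positive inputs, negative_slope otherwise.
float relu_backward_factor(float x, float negative_slope) {
  return (x > 0.0f ? 1.0f : 0.0f) + negative_slope * (x <= 0.0f ? 1.0f : 0.0f);
}

// Logistic sigmoid S(x) = 1 / (1 + exp(-x)).
float sigmoid(float x) {
  return 1.0f / (1.0f + std::exp(-x));
}
```

Each function depends only on a single input element, which is what makes a plain `#pragma omp parallel for` over the element index safe in the layer code.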
For the ReLU layer, an assembly implementation using Xbyak was also explored on Intel Xeon processors, in addition to parallelizing the existing C++ code.
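Conclusion
The optimizations described above rely on OpenMP and Intel MKL.
Performance analysis of Caffe optimized for Intel architecture on the CIFAR-10 dataset in Intel VTune Amplifier XE 2017 beta
With Caffe optimized for Intel architecture, the elapsed time dropped from 37 seconds for BVLC Caffe to 3.6 seconds, an improvement of more than 10 times.
Alongside Elapsed Time, the summary still reports Spin Time, and OpenMP remains a source of overhead.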
Intel Modern Code
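Intel VTune Amplifier XE 2017 beta was used to find the code most in need of optimization. The code was vectorized with GCC, and the Xbyak JIT assembler was used to generate SIMD instructions.
The code was parallelized with OpenMP following the recommendations of the Intel Modern Code program. For Intel Xeon Phi x200 processors, MCDRAM and NUMA also need to be taken into account.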