ããã¯ç¿»èš³ã§ã¯ãªãããããã¯ã«é¢ããç§ã®å°è±¡ãšã³ã¡ã³ãã§ãã
Sparkã®äººæ°ã®çç±ã¯äœã§ããïŒ
åŒçšïŒ
Apache Sparkãéåžžã«äººæ°ãããçç±ã¯ç°¡åã«ããããŸãã ã€ã³ã¡ã¢ãªãåæ£ãããã³å埩èšç®ãè¡ããŸããããã¯ãæ©æ¢°åŠç¿ã¢ã«ãŽãªãºã ã䜿çšããå Žåã«ç¹ã«äŸ¿å©ã§ãã ä»ã®ããŒã«ã§ã¯ãäžéçµæããã£ã¹ã¯ã«æžã蟌ãã§ã¡ã¢ãªã«èªã¿èŸŒãå¿ èŠãããå Žåããããå埩ã¢ã«ãŽãªãºã ã®äœ¿çšãéåžžã«é ããªãå¯èœæ§ããããŸãããããããããã¯ã»ãšãã©ã®å Žåå®å šã«çå®ã§ã¯ãããŸããã ã¡ã¢ãªã«ïŒ ãŸããã¯ããSparkã¯è©Šè¡ããŸãããããã§ä»ã®ããŒã«ã«ã€ããŠæžãããŠããããšãè¡ãããŸãã æçµçã«ãã¡ã¢ãªãããã»ããµã³ã¢ãããã³ãããã¯ãŒã¯ã¯ãªãœãŒã¹ãéãããŠãããããé ããæ©ããããŒã«ã¯å¶éã«äŸåããŸãã
ããæå³ã§ã¯ãSparkã¯åŸæ¥ã®map-reduceãããã¡ã¢ãªå ã«ããããšã¯ãããŸããã äœããã®æ¹æ³ã§ãããŒã¿ã¯æçµçã«ãã£ã¹ã¯ã«ä¿åããããïŒãšãããããšã©ãŒããã確å®ã«ä¹ãåãããšãã§ããæåããèšç®ãéå§ããªãïŒããããã¯ãŒã¯ãä»ããŠè»¢éãããŸãïŒã·ã£ããã«ãªã©ã®ããã»ã¹ïŒã ããã°ã©ããŒãšããŠã¯ãçªç¶å¿ èŠã«ãªã£ãå Žåã«äžéçµæããã£ã¹ã¯ã«ä¿åããŠä¿åããããšã劚ããããšã¯ã»ãšãã©ãããŸããã ãã©ãã€ãã®ããŒã¿ãèšãå Žåãããããã¡ã¢ãªã«ä¿åããŸããïŒ ç§ã¯ãããçããŸãã
ä»ã®ããŒã«ïŒéåžžã¯åŸæ¥ã®map-reduceãšããŠç解ãããïŒãšã¯ç°ãªããSparkã§ã¯ããªãœãŒã¹ã®æé©ãªäœ¿çšã«ã€ããŠå°ãèããããšãèš±å¯ãããã®äœ¿çšèªäœãããæé©åããŸãã ãããŠãæçµçãªé床ã¯ãæçµçã«ã¯ããããããã°ã©ã ãæžã人ã®æã®çã£çŽããã«äŸåããŸãã
ããã«ãèè ã¯ã圌ã«ãšã£ãŠæãè¯ããšæãããSparkã®å質ããªã¹ãããŠããŸãã
é åçãªAPIãšé 延å®è¡
äžè¬çã«ãç§ã¯ããã«åæããŸãã éçºããŒã«ãšããŠã®Sparkã¯ãåŸæ¥ã®map-reduceãããã¯ããã«äŸ¿å©ã§ãããApache Crunchãªã©ã®æ¡ä»¶ä»ãã第2ãäžä»£ã®ããŒã«ããããããã䟿å©ã§ãã ãŸããããšãã°Hiveãããå€å°æè»æ§ããããSQLèªäœã«éå®ãããŸããã
é 延ããã©ãŒãã³ã¹ã¯åžžã«è¯ããšã¯éããŸããã HiveãšDataSetã®åè·¯ã®éãã¯ããã¹ãŠã®ããŒã¿ããã§ã«åŠçããããšãã§ã¯ãªããå°ãæ©ã蚺æããããã¹ãŠãæ°æé/æ¥ã§ã¯ãªãèµ·åæã«ãã¹ãŠèšºæããããšèšãæ¹ãè¯ãå ŽåããããŸãã
ç°¡åãªå€æ
ããã§ãèè ã¯äž»ã«SparkãšPython / Pandasæ§é ã®éã®å€æã念é ã«çœ®ããŠããŸããã ç§ã¯ããã«ã¯çšé ãã®ã§ãçºèšããŸããã ããããã以äžã§pySparkã«ã€ããŠèª¬æããŸãã
ç°¡åãªå€æ
Sparkã®ãã1ã€ã®å©ç¹ã¯ãããããåŽåå ããããŒããã£ã¹ãæ¹åŒã§ãã ãã®æ¹æ³ã¯ãããŒãã«ã®1ã€ãä»ã®ããŒãã«ãããå°ãããåã ã®ãã·ã³ã«å®å šã«åãŸãå Žåã«ãçµåãå€§å¹ ã«é«éåããŸãã å°ããæ¹ã¯ãã¹ãŠã®ããŒãã«éä¿¡ãããããã倧ããæ¹ã®ããŒãã«ã®ããŒã¿ã移åããå¿ èŠã¯ãããŸããã ããã¯ãã¹ãã¥ãŒã®åé¡ã軜æžããã®ã«ã圹ç«ã¡ãŸãã 倧ããªããŒãã«ã®çµåããŒã«å€§ããªã¹ãã¥ãŒãããå Žåã倧ããªããŒãã«ããå°æ°ã®ããŒãã«å€§éã®ããŒã¿ãéä¿¡ããŠãçµåãå®è¡ãããããã®ããŒããå§åããããšããŸããpythonã®æ©èœã¯ããããŸããããç§ãã¡ã®ãšãªã¢ã§ã¯ããããåŽã®çµåã¯çŽ æã§ãCrunshãªã©ã®ããŒã«ã§ãç°¡åã«è¡ããŸãã ããã«ã¯ç¹å¥ãªå©ç¹ã¯ãããŸããããå€ãã®äººã¯ãããšãã°Hiveãç¥ã£ãŠããŸãã HadoopãããåŽã®çµåãšã³ã·ã¹ãã ã«ã¯ã€ã³ããã¯ã¹ãäºå®äžååšããªããããããããäžè¬çãªçµåæé©åããŒã«ã®1ã€ã§ãã
å€æçšã®APIã¯éåžžã«äŸ¿å©ã§ãããç°çš®ã®ãã®ã§ãã ãå€ããRDD APIã¯ãããããããå°ãæè»æ§ããããŸãããåæã«ãç¹ã«åºå®æ§é ã¯ã©ã¹ïŒJava BeansïŒã®ã¬ãã«ã§ã¯ãªããRowãšæè»ãªããŒã¿æ§é ã§äœæ¥ããŠããå Žåããã¹ãç¯ãç¯å²ãåºãããŸãã ãã®å Žåãå®éã®Sparkã¹ããŒã ãšäºæ³ãããSparkã¹ããŒã ã®äžäžèŽã¯éåžžã«äžè¬çã§ãã
DataSet APIã«é¢ããŠã¯ãéåžžã«åªããŠãããšèšããŸãã ããçšåºŠç·Žç¿ããã°ãSQLãšåããããç°¡åã«ãã¹ãŠãèšè¿°ããUDFã§è£å®ããŠæè»æ§ãé«ããããšãã§ããŸãã åæã«ãUDFèªäœã¯Hiveãããç°¡åã«èšè¿°ã§ããè€éãªããŒã¿æ§é ïŒé åãããããæ§é ïŒããJavaã«æ»ãå Žåã«ã®ã¿ããããŠããããScalaã«æ§é ãæåŸ ããããããããã€ãã®å°é£ãçããŸãã
JavaããŒãpymorphy2ã®ãããªãã®ãUDFã®åœ¢ã§éåžžã«ç°¡åã«äœ¿çšã§ãããšããŸãããã ãŸãã¯ãžãªã³ãŒããŒã æ¬è³ªçã«ãå¿ èŠãªã®ã¯ãSparkã·ãªã¢ã«åã®æ©èœãèŠããŠãUDFãé©åã«åæåããããšã ãã§ãã
ããããäžæ¹ã§Spark ML APIã¯ããŸã£ããç°ãªã人ã ã«ãã£ãŠèšèšãããŠããããã«èŠããŸãã ããã¯åœŒãæªããšããæå³ã§ã¯ãããŸãã-圌ã¯ãã éãã ãã§ãã
ãªãŒãã³ãœãŒã¹ã³ãã¥ããã£
Sparkã®èåŸã«ã¯ã倧èŠæš¡ãªãªãŒãã³ãœãŒã¹ã³ãã¥ããã£ããããŸãã ã³ãã¥ããã£ã¯ã³ã¢ãœãããŠã§ã¢ãæ¹åããå®çšçãªã¢ããªã³ããã±ãŒãžãæäŸããŠããŸãã ããšãã°ãããŒã ãSparkçšã®èªç¶èšèªåŠçã©ã€ãã©ãªãéçºããŸããã 以åã¯ããŠãŒã¶ãŒã¯ä»ã®ãœãããŠã§ã¢ã䜿çšããããNatural Language Toolkitãªã©ã®Pythonããã±ãŒãžã掻çšããããã«é ããŠãŒã¶ãŒå®çŸ©é¢æ°ã«äŸåããå¿ èŠããããŸãããããã§äžè¬çã«è¿œå ãããã®ã¯ãããŸããã ã³ãã¥ããã£ã¯æ¬åœã«å€§ãããã¹ãã«ãããããã¬ã³ããªãŒã§ãã Sparkçšã«èšå€§ãªæ°ã®æ¡åŒµæ©èœãäœæãããŠããŸãã
é ãUDFã«é¢ãã次ã®æç« ã¯ãPythonistã®è¯å¿ã«ä»»ããŸããScala/ Java UDFã¯ããã»ã©é ãã¯ãªããåæã«éåžžã«äŸ¿å©ã§ãã
èªåããè¿œå ãããã®ïŒ
ç°ãªãèšèªã§ã®éçº
ãããããã®äººæ°ã®çç±ã®1ã€ã¯ãããã€ãã®éçºèšèªïŒScalaãJavaãPythonãããã³RïŒã®ãµããŒãã§ãã æŠããŠãããŸããŸãªèšèªã®APIã¯ã»ãŒåçã«äŸ¿å©ã§ããããã®ãµããŒããçæ³ãšã¯åŒã³ãŸããã Sparkã¢ããªã±ãŒã·ã§ã³ãèµ·åãããšãããã«Java / ScalaãšPythonã®ãããããéžæã§ããäžåºŠã«èšèªãçµã¿åãããããšã¯ã§ããŸããã ãããã£ãŠãpySparkäžã®ã¢ããªã±ãŒã·ã§ã³ã®éšåïŒMLãŸãã¯NLPã®éšåãé »ç¹ã«èšè¿°ãããïŒãšJava / Scalaã®çµ±åã¯ãå®éã«ã¯ãã¡ã€ã«/ããŒã¿ããŒã¹ãä»ããŠã®ã¿å¯èœã§ãã ãŸãããŸãã¯ã«ãã«ãRESTãªã©ã®ãªãã·ã§ã³ã®ãããªãã®ã
ã¹ããªãŒãã³ã°
Spark StreamingïŒå®å šã«ç°ãªãHadoop Streamingãšæ··åããªãã§ãã ããïŒãããã¯Sparkã®æ©èœã®ãã1ã€ã®é åçãªéšåã§ãã 1ã€ã®æã§èª¬æãããšãããã¯ãããšãã°ãKafkaãZeroMQãªã©ããã®ã¹ããªãŒãã³ã°ããŒã¿ã®åŠçã§ãã ããŒã¿ããŒã¹ããååŸããããŒã¿ãšåãæ段ã§ã
ãã¹ãŠã®é åã¯ãå¹³åããŸã£ããåãã§ãããšããäºå®ã«æ£ç¢ºã«ãããŸãã å®éã«ã¯ãKafkaããã®ããŒã¿ã®åŠçãéå§ããããã«ããã°ã©ã å ã§äœãå€æŽããå¿ èŠã¯ãããŸããã map reduceãCrunchãCascadingã®ããããããã®ãããªããªãã¯ãè¡ãããšãèš±å¯ããŸããã
çæ
ããããã«ç¬èªã®æ¬ ç¹ããããŸãïŒcïŒã Sparkã䜿çšãããšãã«çŽé¢ããåé¡ã¯äœã§ããïŒ
ã¯ã©ã¹ã¿ãŒç®¡ç
Sparkã¯ã調æŽãšä¿å®ãé£ããããšã§æåã§ãã ã€ãŸããæé«ã®ããã©ãŒãã³ã¹ã確ä¿ããŠãããŒã¿ãµã€ãšã³ã¹ã®è² è·ã倧ãããªã£ãŠãè² è·ãããããªãããã«ããããšã¯å°é£ã§ãã ã¯ã©ã¹ã¿ãŒãé©åã«ç®¡çãããŠããªãå Žåãäžèšã§èª¬æããããã«ãããã¯ãè¯ãããç¡å¹ã«ããå¯èœæ§ããããŸãã ã¡ã¢ãªäžè¶³ãšã©ãŒã§å€±æãããžã§ãã¯éåžžã«äžè¬çã§ãããå€ãã®åæãŠãŒã¶ãŒããããšãªãœãŒã¹ç®¡çãããã«é£ãããªããŸãã誰ããçŽæããŸãããïŒ å®éãç§ã¯ãã§ã«ããã¹ãŠãçŽ æŽããããã®ã§ãããéåžžã«å€§ããªã¿ã¹ã¯ãæããªãå ŽåããŸãã¯å¿ èŠãªã ãã®ãªãœãŒã¹ãæã£ãŠããå Žåãªã©ã1ã€ã®ã±ãŒã¹ã§æ£ç¢ºã«ãªãããããšãæ¢ã«æžããŸãããã€ãŸããã¿ã¹ã¯ã¯ããã»ã©è€éã§ã¯ãããŸããã
æãæçœãªä»ã®ã±ãŒã¹ã§ã¯ãSparkã¢ããªã±ãŒã·ã§ã³ã調æŽãæ§æãããã³ä¿å®ããå¿ èŠããããŸãã
åºå®ãŸãã¯åçã¡ã¢ãªå²ãåœãŠã䜿çšããŸããïŒ Sparkã§äœ¿çšã§ããã¯ã©ã¹ã¿ãŒã®ã³ã¢ã¯ããã€ãããŸããïŒ åãšã°ãŒãã¥ãŒã¿ãŒã¯ã©ã®ãããã®ã¡ã¢ãªãååŸããŸããïŒ SparkãããŒã¿ãã·ã£ããã«ãããšãã«äœ¿çšããããŒãã£ã·ã§ã³ã¯ããã€ã§ããïŒ ããããã¹ãŠã®èšå®ãããŒã¿ãµã€ãšã³ã¹ã¯ãŒã¯ããŒãã«é©åã«å¯Ÿå¿ãããããšã¯å°é£ã§ããããšãã°ããšã°ãŒãã¥ãŒã¿ã®æ°ãéžæããã®ã¯æ¯èŒçç°¡åãªäœæ¥ã®ããã«æããŸãã ååãšããŠãããŒã¿ã«ã€ããŠäœããç¥ã£ãŠããã°ããã®æ°ãå®å šã«èšç®ã§ããŸãã ãããããªãœãŒã¹ã䜿çšããã ãã§ãªãããã¹ãŠããã楜ãããªããŸãã ããã»ã¹ã«ä»ã®ã¢ããªã±ãŒã·ã§ã³ãžã®ã¢ã¯ã»ã¹ãå«ãŸããå Žåã...
ããšãã°ãéãžãªã³ãŒãã£ã³ã°æ©èœãåããã¢ããªã±ãŒã·ã§ã³ããããŸãã ãŸãã圌ã¯å¥ã®ArcGISãµãŒããŒã«åŸäºããŠããŸãã åæã«ãArcGISã«ã¯4ã€ã®ã³ã¢ãããªããSparkãå®è¡ãããŠããHadoopã¯ã©ã¹ã¿ãŒã«ã¯å€æ°ã®ããŒãããããŸãããã®ããã8ã€ã®ãšã°ãŒãã¥ãŒã¿ãŒã®ã¿ã§Sparkãéžæããå ŽåãArcGISããã»ããµãŒã®è² è·æ²ç·ã¯100ïŒ ã«ãžã£ã³ããããã®ãŸãŸæ®ããŸãæ°æéã®ã¢ããªã±ãŒã·ã§ã³æäœã ãã®ã¿ã¹ã¯ãSparkã«è»¢éãããšïŒã¢ããªã±ãŒã·ã§ã³ã³ãŒãã以åã«æžãæããåŸïŒããã®ã¿ã¹ã¯ã«ãã¯ã©ã¹ã¿ãŒãªãœãŒã¹ã䜿çšã§ãããããåäœæéãæ°æ¡ççž®ãããŸãã
ã€ãŸããäžå®éã®ãªãœãŒã¹ãå²ãåœãŠããããããããã®ãªãœãŒã¹ãå¥ã®æ¹æ³ã§ç®¡çããããšããããã«ããã¯ããããããŸãïŒSparkã圱é¿ãäžããããšã¯ã§ããŸããïŒã ãããã£ãŠãSparkããããã®ãªãœãŒã¹ã®äœ¿çšãæé©åããããšãæåŸ ããã®ã¯åçŽã§ãã
ãããã°
ãããçå®ã§ãã ãã ããæåŸ ãããŠããŸãã åæ£äžŠåã·ã¹ãã ãããããã®ãããã°ãšç£èŠã¯éèŠãªã¿ã¹ã¯ã§ãã SparkUIã¯ç£èŠã®åé¡ãããçšåºŠè§£æ±ºããSpark Metricsã¯ããã©ãŒãã³ã¹æž¬å®ã解決ããŸãããããšãã°ããããã¬ãŒã䜿çšããŠå®è¡å¯èœã¢ããªã±ãŒã·ã§ã³ã«æ¥ç¶ããŠã¿ãŸããåäœãããã¹ããæ¥ç¶ããããŒããããããŸããã éåžžã®ã¢ããªã±ãŒã·ã§ã³ã®å Žåãšåãã¡ããªãã¯ã¹ããããšãã°JMXããç°¡åã«ååŸã§ããŸããåæ£ã¢ããªã±ãŒã·ã§ã³ã®å Žåã¯ããããã¯ãŒã¯ãä»ããŠéä¿¡ããå¿ èŠãããããã®åŸã§ã®ã¿åéã§ããŸãã ã¯ããããã¯ãã¹ãŠæ¯èŒçæªãã§ãã
PySparkã§ã®UDFããã©ãŒãã³ã¹ã®äœäžïŒPySpark UDFã®é床äœäžïŒ
ããŠãããã§ç§ã¯äœãèšãããšãã§ããŸããïŒ åœŒããæŠã£ããã®ã®ããã«ã圌ãã¯äœãã«åºããããŸããã ç§ã®ç¥ãéããPythonã®UDFã¯ãã¢ããªã±ãŒã·ã§ã³ãšUDFã®éã§ããŒã¿ã®äºéå€æãè¡ããããšããäºå®ã«ã€ãªãããŸãã PythonãSparkãå®è¡ãããJVMãšã³ã·ã¹ãã ã®ç°è³ªãªèšèªã§ãããUDFãå€éšã§å®è¡ãããããã§ãã
ããã§ã¢ããã€ã¹ã§ããã®ã¯1ã€ã ãã§ããPythonã§èšè¿°ãããScala / Javaã§èšè¿°ããŠãã ããã ãã®ã¢ããã€ã¹ãåžžã«æãŸããŠããããã§ã¯ãªããåŸãããšãã§ããããšã¯æããã§ãããPythonã®ããŒãžã§ã³ãç£æ¥ã¬ãã«ã«ãªã£ããšãã«Graalã ãããã®åé¡ãã°ããŒãã«ã«è§£æ±ºã§ããã®ã§ã¯ãªãããšæããŸãã
䞊ååŠçã®æ倧ã¬ãã«ãä¿èšŒããããšã¯å°é£ã§ãïŒããŒãããŒã®ã£ã©ã³ãã£æ倧䞊ååŠçïŒ
Sparkã®éèŠãªäŸ¡å€ææ¡ã®1ã€ã¯åæ£èšç®ã§ãããSparkãå¯èœãªéãèšç®ã䞊ååããããšãä¿èšŒããããšã¯å°é£ã§ãã Sparkã¯ããžã§ãã®ããŒãºã«åºã¥ããŠããžã§ãã䜿çšãããšã°ãŒãã¥ãŒã¿ãŒã®æ°ã匟æ§çã«ã¹ã±ãŒãªã³ã°ããããšããŸãããå€ãã®å Žåãããèªäœã§ã¯ã¹ã±ãŒã«ã¢ããã§ããŸããã ãã®ããããšã°ãŒãã¥ãŒã¿ãŒã®æå°æ°ãäœãèšå®ãããããšããžã§ãã¯å¿ èŠãªãšãã«ãã以äžãšã°ãŒãã¥ãŒã¿ãŒãå©çšã§ããªããªãå¯èœæ§ããããŸãã ãŸããSparkã¯RDDïŒResilient Distributed DatasetïŒ/ DataFramesãããŒãã£ã·ã§ã³ã«åå²ããŸããããŒãã£ã·ã§ã³ã¯ããšã°ãŒãã¥ãŒã¿ãŒãå®è¡ããæå°ã®äœæ¥åäœã§ãã èšå®ããããŒãã£ã·ã§ã³ãå°ãªãããå Žåããã¹ãŠã®ãšã°ãŒãã¥ãŒã¿ãŒãäœæ¥ããã®ã«ååãªäœæ¥ãã£ã³ã¯ããªãå¯èœæ§ããããŸãã ãŸããããŒãã£ã·ã§ã³ãå°ãªããšããŒãã£ã·ã§ã³ã倧ãããªãããšã°ãŒãã¥ãŒã¿ã®ã¡ã¢ãªãäžè¶³ããå¯èœæ§ããããŸããããã ããç°¡åã ã£ããã ç°¡åãªãã®ããå§ããŸããã-éå§ã®ããã®ãã©ã¡ãŒã¿ãŒã¯ãç¹å®ã®ã¯ã©ã¹ã¿ãŒããšã«èª¿æŽããå¿ èŠããããŸãã prodã¯ã©ã¹ã¿ãŒã«ã¯ã1æ¡ä»¥äžã®ããŒãããããåããŒãã§äœåãã®ã¡ã¢ãªã䜿çšã§ããŸãã Devã¯ã©ã¹ã¿ãŒã®èšå®ã¯ãProdã§èµ·åãããšãã«ããããéå°è©äŸ¡ãããŸãã çŸåšã®ã¯ã©ã¹ã¿ãŒã®èªã¿èŸŒã¿ã¿ã¹ã¯ãèæ ®ãå§ãããšãããã¯ãã¹ãŠããã«è€éã«ãªããŸãã äžè¬çã«ãã¯ã©ã¹ã¿ãŒãªãœãŒã¹ãå²ãåœãŠããã®ã¿ã¹ã¯ã¯æé©åã¿ã¹ã¯ã§ãããéåžžã«éèŠã§ãããåäžã®æ£ãã解決çã¯ãããŸããã
ããŒãã£ã·ã§ã³ãå°ãªãå Žåã䞊åæ§ã¯äžååã§ãã ãŸããããããå€ãããå Žåãããããã®ãµã€ãºã¯ãHDFSãããã¯ã®ãµã€ãºãªã©ãæ¡ä»¶ä»ãã®äžéãããäœããªãå¯èœæ§ããããŸãã åã¿ã¹ã¯ã¯èµ·åã«è²»ãããããªãœãŒã¹ã§ãããããæããã«ãªãŒããŒãããã®ã³ã¹ãã¯çç£æ§ãããéãå¢å ãããããã¿ã¹ã¯ã®ãµã€ãºã«ã¯äžéãããããã以äžã«äžããå¿ èŠã¯ãããŸããã
ç°¡åãªäŸã¯ãããªãã®éã®ãã£ã¬ã¯ããªãå¿ èŠãšããã¢ããªã±ãŒã·ã§ã³ã§ãã Hadoopã§ã®ãéåžžã®ãmap-reduceã¿ã¹ã¯ã®å ŽåãéåžžãããŒã¿ã«ã³ãŒããé ä¿¡ããŸãã ã¢ããªã±ãŒã·ã§ã³ïŒSparkããŒãïŒããã¡ã€ã«ïŒãã¡ã€ã«ïŒãé 眮ãããŠããã¯ã©ã¹ã¿ãŒã®ããŒãã«ã³ããŒãããšããã£ã¬ã¯ããªã¯æ¢ã«ãããåŽã®çµåã«äŒŒãŠãããããã³ãŒããšäžç·ã«é ä¿¡ããå¿ èŠããããŸãã ãããŠçªç¶ãåããŒãã«é ä¿¡ãããããŒã¿ã®ãµã€ãºãæ°æ¡å€§ãããªããŸãããããšãã°ã10ã¡ã¬ãã€ãïŒSparkèªäœã®ãªãå°ããªSparkã¢ããªã±ãŒã·ã§ã³ïŒãããšãã°20ã®ã¬ãã€ãïŒéåžžã«çŸå®çãªå Žåãã¢ãã¬ã¹ãé»è©±ãªã©ã®ããŒã¿ãæ£èŠåããããã«å¿ èŠãªãã£ã¬ã¯ããªïŒãã®ãããªããªã¥ãŒã ã«ãã£ãŠããªãåŒã£åŒµãããïŒã ããŠãããã«ãããŸã-é床ã®äžŠååŠçã®ä»£åã¯æããã§ãã
ãããããç¹å®ã®èªç¶æ°ã®ããŒãã£ã·ã§ã³ããããŸããããã¯ãã¬ããªã±ãŒã·ã§ã³ä¿æ°ãèæ ®ããŠãå ¥åãã¡ã€ã«ãåå²ãããããã¯ã®æ°ã«ãã£ãŠæ±ºãŸããŸãã ãã®æ°å€ã¯ãããŒã¿ã®èªã¿åãã«é¢ããŠæé©ã«è¿ãå¯èœæ§ããããŸãã ã€ãŸãããã¡ã€ã«ã«3ã€ã®ãããã¯ããããåãããã¯ã«ã¯ã©ã¹ã¿ãŒã®2ã€ã®ããŒãã«ã³ããŒãããå Žåã6ã€ã®ã¹ã¬ãããèªç¶ã«äžŠååŠçãããã®ããŒãã§åã¬ããªã«ãåŠçã§ããŸãã ãã¡ãããSparkã¯ãªãœãŒã¹ãåçã«å²ãåœãŠããšãã«ãããã®ãã©ã¡ãŒã¿ãŒãèæ ®ããŸãã
æ®å¿µãªããããŸãã¯å¹žããªããšã«ãSparkã¯ã¯ã©ã¹ã¿ãŒãªãœãŒã¹ãã©ã³ããŒã§ã¯ãããŸããã ããšãã°ã糞ã§ãã ãã®ãããSparkã«ã¯ããã¹ãŠã®ãªãœãŒã¹ã®äœ¿çšãæé©ã«èšç»ããã®ã«ååãªæ å ±ããªãå ŽåããããŸãã
Hiveãšã®çµ±åãããŸãè¯ããªã
äžæ¹ã§ã¯ãSparkã¯HiveããŒã¿ãšã¡ã¿ããŒã¿ã§ããŸãæ©èœããŸãã ç§ãåºäŒã£ãã»ãšãã©ã®ã¢ããªã±ãŒã·ã§ã³ã¯ãŸãã«ãããããŠããããšã ãšæããŸãã ããããè¿·æãªåé¡ããªãããã§ã¯ãããŸããã Sparkã§partitionByããã³bucketByããŒã«ã䜿çšããããšãããšãHiveãäœæ¥çµæã衚瀺ããªãå¯èœæ§ãéåžžã«é«ããšèšããŸãã ããã«ããã°ã®ã©ããã«èŠåã衚瀺ãããã ãã§ãã
äºææ§
æ®å¿µãªããããã®ãããã¯ã«é¢ããç§ã®çµéšã¯ããªãæªãã§ãã Sparkã®ããŒãžã§ã³ãäºæ³ãšç°ãªãã¯ã©ã¹ã¿ãŒã§ã¢ããªã±ãŒã·ã§ã³ãå®è¡ããããšãããšãè€æ°ã®åé¡ã«ééããŸããã Spark 2.2.0ã§éçºããå Žåã2.1ããã³2.3ã§éå§ãããšãã«åé¡ããããŸããã
ç§ãã¡ã®å Žåãäœããã®çç±ã§Sparkãã³ãŒããã¯ã®1ã€ïŒã€ãŸããsnappyïŒãããŒãžã§ã³2.3ã§å®è¡ããŠãããšãã«èŠã€ããããšãã§ããªãã£ããšããŸãã ããŒã¿ãæžã蟌ãå¿ èŠãããå ŽåïŒããã¯èšé²æã«ã³ãŒããã¯ãæå®ããããã¯ãããŠããªãããŒã¿ãå«ãä»»æã®ãã®ãéžæã§ããŸãïŒãããã¯ããã»ã©æ·±å»ãªåé¡ã§ã¯ãããŸããããæ¥ã«ããã¯ããããã®ãèªãå¿ èŠãããå Žåãæããã«éãæªãã§ãã
ããããåé¡ã®äžéšã¯ã€ã³ã¹ããŒã©ãŒã®ãšã©ãŒãåå ã§ããããããã¯ããã»ã©ç°¡åã§ã¯ãããŸããã ããã«ããããããããã€ããŒããŒãžã§ã³éã®ç§»è¡ã¯ããã¹ã ãŒãºã«ãªã£ãŠããã¯ãã§ãã
æ²ããããªãSparkã¯ãåãã©ã€ã³ïŒåã2.2ãš2.3ïŒã®2ã€ã®ç°ãªãããŒãžã§ã³ã®1ã€ã®ã¯ã©ã¹ã¿ãŒãžã®ãã«ã¿ã€ã ã®äžŠåã€ã³ã¹ããŒã«ãæå³ããŸããã
æãããããŒãã£ãŒ
APIã®åä»ã
Spark APIã®å€ãã¯éåžžã«ãšã¬ã¬ã³ããªã®ã§ãæŽç·ŽãããŠããªãéšåãéç«ã£ãŠããŸãã ããšãã°ãé åèŠçŽ ãžã®ã¢ã¯ã»ã¹ã¯ãSparkã©ã€ãã®ofãéšåã§ãããšèããŠããŸããé åã®æäœãããã»ã©ã²ã©ããšã¯èšããŸããã Spark APIã¯ããšããšScalaã§äœæãããŠãããããäžäŸ¿ãªç¹ãããã€ããããããã«ã¯ç¬èªã®ã³ã¬ã¯ã·ã§ã³æ§é ããããJavaããæ©èœãããããSkalovã«æžããå¿ èŠããããŸãã ãããã£ãŠãUDFãèšè¿°ã§ããã°ãé åã䜿ã£ãŠäœã§ãã§ããŸãã ãããã¯ã-Pythonã§ã¯ãUDFã®ãã¹ãŠãæªãã§ãããã€ãå¿ããŠããŸãã
ããŸã䟿å©ã§ã¯ãªããããŸãå¹æçã§ããªã-ã¯ããå€åã ããã¯ãè€éãªæ§é ãæ±ãããã®æ°ããé«éé¢æ°ãå°å ¥ããSpark 2.4ã®æ°ããããŒãžã§ã³ã解決ããããšããŠããŸãïŒããã«ãããexplode / collectã®äœ¿çšãåé¿ãããŸãïŒã
ç§ã®æèŠã§ã¯ãAPIã®ã¯ããã«äžäŸ¿ãªåŽé¢ã¯ãã³ãŒããèŠããšãã©ã®éšåããã©ã€ããŒã§å®è¡ãããã©ã®éšåãä»ã®ããŒãã§å®è¡ãããããåžžã«æããã§ã¯ãªããšããããšã§ãã åæã«ãããŒãéã§ã³ãŒããé åžããã¡ã«ããºã ã«ã¯ïŒäœããã®æ¹æ³ã§ã®ïŒã·ãªã¢ã«åãå«ãŸãããšã°ãŒãã¥ãŒã¿ãŒã§å®è¡ãããã³ãŒãã¯ã·ãªã¢ã«åå¯èœã§ãªããã°ãªããŸããã ã·ãªã¢ã«åãšã©ãŒãç解ãããšãã³ãŒãã«é¢ããå€ãã®æ°ããèå³æ·±ãæ å ±ãåŠã¶ããšãã§ããŸã:)ã
ã¯ã©ã¹ããŒããŒ
æ®å¿µãªãããSparkã³ãŒãããã¢ããªã±ãŒã·ã§ã³ã³ãŒããåé¢ããåé¡ã¯ååã«è§£æ±ºãããŠããŸããã ãã ããåŸæ¥ã®map-reduce Hadoopã¢ããªã±ãŒã·ã§ã³ã«ãåãããšãåœãŠã¯ãŸããŸãã åæã«ãHadoopã³ãŒãã¯Google Guavaãªã©ã®ã©ã€ãã©ãªã®å€ãããŒãžã§ã³ã䜿çšããŸãããä»ã®ã©ã€ãã©ãªã¯ãŸã£ããæ°ãããã®ã§ã¯ãããŸããã Guavaã®äœæè ãéæšå¥šã®ã¡ãœãããåé€ããŠAPIã«åŸæ¹äºææ§ãå°å ¥ããããšã奜ãããšãæãåºããšãå®å šã«éŠ¬é¹¿ããç»åãåŸãããŸã-Guavaã§ã³ãŒããæ°ããããŒãžã§ã³ã§èšè¿°ããå®è¡ãããšã¯ã©ãã·ã¥ããŸã-æ¬åœã«GuavaããŒãžã§ã³ã§äœæ¥ããŠããããã§ãHadoopããïŒã¯ããã«å€ãïŒãã³ãŒããæ°ããããŒãžã§ã³ã®ã¡ãœãããèŠã€ããããªãããæ°ããããŒãžã§ã³ãšäºææ§ããªãããã«Hadoopãã¯ã©ãã·ã¥ããŸãã ããã¯éåžžã«å žåçãªãæ®å¿µãªããšã«ãéçºè ã2人ããã«ééããå¯èœæ§ãé«ãåé¡ã§ãã Apache Http Componentsã©ã€ãã©ãªãåæ§ã®åé¡ã®å¥ã®äŸã§ãã
ãã€ã³ãå€æ°ãªãã®SQL
æ®å¿µãªããããã¢ã§ã¯ãšãªãå®è¡ããããã®å žåçãªã³ãŒãã¯æ¬¡ã®ããã«ãªããŸãã
val sqlDF = spark.sqlïŒ "SELECT * FROM people WHERE id = 1"ïŒ
APIã¯ãid =ïŒãªã¯ãšã¹ããå®è¡ãããªãã·ã§ã³ãæäŸããŸãã å®è¡ããšã®ãã©ã¡ãŒã¿çœ®æã ããŠãSQLã€ã³ãžã§ã¯ã·ã§ã³ã®åé¡ã¯äœæè ãæ©ãŸããããšã¯ãããŸããããéçºè ã¯ã¯ãšãªã®ãã©ã¡ãŒã¿ãŒã眮ãæããå¿ èŠããããŸãããããã£ãŠãç¹æ®æåã®çœ®ãæãã¯å®å šã«ããªã次第ã§ãã 客芳æ§ã®ããã«ãHiveãåæ§ã®åé¡ãæ±ããŠããŸãããã©ã¡ãŒã¿ãŒã䜿çšããŠã¯ãšãªãå®çŸ©ããããšãã§ããŸããã
ãã ããããããªããšã«ãJDBCãœãŒã¹ã®å Žåãã¯ãšãªãèšè¿°ããããšããæ£åŒã«ã¯äžå¯èœã§ããåã®ã¿ã§ã¯ãªãããŒãã«ã®ã¿ãæå®ã§ããŸãã éå ¬åŒã«ã¯ãããŒãã«ã®ä»£ããã«ïŒdããaãbãcãéžæïŒtã®ãããªãã®ãæžãããšãã§ããŸãããããããã¹ãŠã®å Žåã«æ©èœããå Žåã誰ãããªãã«ç¢ºå®ã«äŒããŸããã
æç床ãšæ©èœã®å®å šæ§ã®æ¬ åŠ
ããŒã ä»äººã®é -éã
ãã1ã€ã®æ©èœã®ã®ã£ããã®äŸã¯ãSparkã§é£ç¶ããäžæã®ã¬ã³ãŒãèå¥åãäœæããã®ãé£ããããšã§ãã é£ç¶ããäžæã®ã€ã³ããã¯ã¹åã¯ãäžéšã®ã¿ã€ãã®åæã«åœ¹ç«ã¡ãŸãã ããã¥ã¡ã³ãã«ãããšããmonotonically_increasing_idïŒïŒãã¯åè¡ã«äžæã®IDãçæããŸãããIDãé£ç¶ããŠããããšãä¿èšŒãããã®ã§ã¯ãããŸããã é£ç¶ããIDãéèŠãªå ŽåãSparkã®å€ãRDD圢åŒã䜿çšããå¿ èŠããããŸããç§ã¯ãã®ãããªäž»åŒµãç解ããŠããŸããã ãœãŒã¹ãå©çšå¯èœã§ãããèŠããŠãå°ãªããšãã³ã¡ã³ããèªãããšã¯ããªãå¯èœã§ãïŒ
å調ã«å¢å ãã64ãããæŽæ°ãè¿ããŸããã€ãŸãããã®é¢æ°ã¯ããŒãã£ã·ã§ã³çªå·ãååŸããããã«ã«ãŠã³ã¿ãŒãè¿œå ããã ãã§ãã åœç¶ã2åã®é£ç¶ããåŒã³åºãã®éã«èª°ããããåŒã³åºããªããšããä¿èšŒã¯ãããŸããã 1ã€ã®Sparkã¢ããªã±ãŒã·ã§ã³ã¯ãã¯ã©ã¹ã¿ãŒã®ç°ãªãããŒãã§å®è¡ãããŠããå€ãã®JVMã§ããå¯èœæ§ãããããããã1ã€ã®JVMå ã§å®è¡ãããå€ãã®ã¹ã¬ããã§ãã
- çæãããIDã¯ãå調ã«å¢å ããäžæã§ããããšãä¿èšŒãããŸãããé£ç¶çã§ã¯ãããŸããã
- çŸåšã®å®è£ ã§ã¯ãããŒãã£ã·ã§ã³IDã¯äžäœ31ãããã«ãäžäœ33ãããã«é 眮ãããŸã
- åããŒãã£ã·ã§ã³å ã®ã¬ã³ãŒãçªå·ãè¡šããŸãã ä»®å®ã¯ãããŒã¿ãã¬ãŒã ã
- 10åæªæºã®ããŒãã£ã·ã§ã³ãåããŒãã£ã·ã§ã³ã®ã¬ã³ãŒãã¯80åæªæºã§ãã
äœæè ã«ããå°ãèããŠãåäžã®çæãã€ã³ããäœæããã«ïŒæå³çã«ããã«ããã¯ã«ãªããŸãïŒããããã¯ããã«ïŒããã¯åãã«ãªããŸãïŒããã®äžŠååæ£ã·ã¹ãã ã§å¿ èŠãªIDãæ£ç¢ºã«çæããæ¹æ³ãèããŠã¿ãŸãããããèªäœã§ïŒã
Spark 2.4ã«æåŸ ããããš
ãã§ã«è¿°ã¹ãé«éé¢æ°
ããã¯æ¬åœã«ããã§ãã äž»ãªããšã¯åãããšã§ãã
å®éãããã¯é åãŸãã¯ããããæäœããããã®çµã¿èŸŒã¿é¢æ°ã®ã»ããã§ãããç¬èªã®é¢æ°ïŒã©ã ãïŒã䜿çšããŠãããã«å¯ŸããŠå€æãå®è¡ããæ©èœã§ãã
ããã§ãããã€ãã®äœ¿çšäŸãèŠãããšãã§ããŸãã
æ°ããå®è¡ã¢ãŒã
ããã¯ãããããbarierã¹ã±ãžã¥ãŒã©ããã³ã©ã³ã¿ã€ã ã§ãã èè ã¯ãããæ©æ¢°åŠç¿ã¿ã¹ã¯çšã«æå³ããŠãããããã®ãããªã¿ã¹ã¯ã®ã»ããã¯ãã¡ããããåºãã å®éããããã¯Spark map-reduceã«ã¯äžè¬çã§ã¯ãªãã¿ã¹ã¯ã§ãã ç§ãç解ããŠããããã«ããããã¯ã»ãšãã©ãäžåºŠèµ·åããããã¯ã©ãã·ã¥ããå Žåã®ã¡ãã»ãŒãžã³ã°ã³ã³ããŒãã³ãã§ãã
ãã®ãããªã¿ã¹ã¯ããµããŒãããAPIã䟿å©ã§ããã°ãééããªããã®å¿ èŠæ§ããããŸãã ç§ãã¡ã®äŒç€Ÿã§ã¯ããã®ãããªã³ã³ããŒãã³ãã¯Yarnã¢ããªã±ãŒã·ã§ã³ãšããŠèšèšãããŠãããSparkããã¯å€å°å¥ã ã«åäœããŸãã Sparkå ã®ããç·å¯ã§äŸ¿å©ãªçµ±åã¯äŸ¡å€ããããŸãã
AvroãµããŒãã®æ¹å
Avroã®ãµããŒãã¯äžè¬çã«è¯å¥œã§ããã ããã€ãã®è¿œå ã®ããŒã¿åãã€ãŸã10é²æ°ãæ¥ä»ãæå»ãæéãªã©ãå«ãããããããè«çåãïŒå®éã«ã¯ããã€ãã®æŽŸçåïŒããµããŒããããŠããŸãã
ççŽã«èšã£ãŠãHiveã®äœè ïŒãããŠåæã«SparkãïŒãå¯æšçŽ°å·¥ããããããµããŒããããã®ã¬ã€ã¢ãŠãã«åºã¥ããŠããŒãã«ãäœæããæ¹æ³ãåŠã¶ãšããç§ã¯ãã£ãšåŸ ã¡ãŸãã ãããå¯èœã«ãªããŸããããAvroã§ã¯èŠãç®ãåäœããã䟿å©ã«ãªããŸããã
ããã§è©³çŽ°ãèªãããšãã§ããŸã ã
Scala 2.12ã®ãµããŒãïŒå®éšçïŒ
Javaããã°ã©ããŒãšããŠã¯éèŠã§ã¯ãªãããã«æããŸããããã®ãããžã§ã¯ãã®ãã¬ãŒã ã¯ãŒã¯å ã§ã¯ãJava 8ãšã®çžäºäœçšãããšãã°ã©ã ãã®ã·ãªã¢ã«åãæ¹åããããšãçŽæããŸããã