ãšã³ããªãŒ
æè¿ãApache Sparkãããžã§ã¯ãã¯å€§ããªæ³šç®ãéããŠãããå€æ°ã®å°ããªå®çšçãªèšäºãæžãããHadoop 2.0ã®äžéšã«ãªããŸããã ããã«ã圌ã¯Spark StreamingãSparkMLãSpark SQLãGraphXãªã©ã®è¿œå ãã¬ãŒã ã¯ãŒã¯ã§ããã«å€§ãããªãããããã®ãå ¬åŒããã¬ãŒã ã¯ãŒã¯ã«å ããŠãããŸããŸãªã³ãã¯ã¿ãã¢ã«ãŽãªãºã ãã©ã€ãã©ãªãªã©ãå€ãã®ãããžã§ã¯ããç»å ŽããŸããã Sparkã«ã¯ä»ã®Berkeleyãããžã§ã¯ãïŒBlinkDBãªã©ïŒã®ããããçš®é¡ã®åºæ¬çãªèŠçŽ ãå«ãŸããŠãããšããäºå®ãèæ ®ãããšãæ·±å»ãªããã¥ã¡ã³ãããªãããããã®åç©åããã°ããèªä¿¡ãæã£ãŠç解ã§ããŸããããã¯ç°¡åãªäœæ¥ã§ã¯ãããŸããã ãããã£ãŠãç§ã¯å¿ãã人ã ã®ç掻ã楜ã«ããããã«ãã®èšäºãæžãããšã«ããŸããã
å°ãã®èæ¯ïŒ
Sparkã¯ã2009幎é ã«å§ãŸã£ãUC Berkeleyã®ã©ããããžã§ã¯ãã§ãã Sparkã®åµèšè ã¯ããŒã¿ããŒã¹åéã®æåãªç§åŠè ã§ããã圌ãã®å²åŠã«ããã°ãSparkã¯äœããã®åœ¢ã§MapReduceã«å¯Ÿããçãã§ãã Sparkã¯çŸåšãApacheã®ãå±æ ¹ãã®äžã«ãããŸãããã€ããªãã®ãŒå®¶ãšã³ã¢éçºè ã¯åã人ãã¡ã§ãã
ãã¿ãã¬ïŒ2ã¯ãŒãã®ã¹ããŒã¯
Sparkã¯ã次ã®ãããª1ã€ã®æã§èª¬æã§ããŸããããã¯ã倧èŠæš¡ãªäžŠåDBMSã®ãšã³ãžã³ã®å éšã§ãã ã€ãŸããSparkã¯ã¹ãã¬ãŒãžãææ ŒãããŸããããä»ã®ãã®ãããåªå ãããŸãïŒHDFS-åæ£ãã¡ã€ã«ã·ã¹ãã Hadoopãã¡ã€ã«ã·ã¹ãã ãHBaseãJDBCãCassandraãªã©ïŒã çå®ã¯ãIndexedRDDãããžã§ã¯ãã¯ããã«èšåãã䟡å€ããããšããããšã§ã-Sparkã®ããŒ/å€ã¹ãã¬ãŒãžã¯ãããããããã«ãããžã§ã¯ãã«çµ±åãããã§ãããããŸããSparkã¯ãã©ã³ã¶ã¯ã·ã§ã³ãæ°ã«ããŸããããããã§ãªããã°MPP DBMSãšã³ãžã³ã§ãã
RDD-Sparkã®ã³ã¢ã³ã³ã»ãã
Sparkãç解ããéµã¯RDDïŒResilient Distributed Datasetã§ãã å®éãããã¯ä¿¡é Œã§ããåæ£ããŒãã«ã§ãïŒå®éãRDDã«ã¯ä»»æã®ã³ã¬ã¯ã·ã§ã³ãå«ãŸããŠããŸããããªã¬ãŒã·ã§ãã«ããŒãã«ã®ããã«ã¿ãã«ãæäœããã®ãæã䟿å©ã§ãïŒã RDDã¯å®å šã«ä»®æ³ã§ããããã®çºçæ¹æ³ãææ¡ããã ãã§ãããšãã°ããŒãã«é害ãçºçããå Žåã«åŸ©æ§ã§ããŸãã ãããŠããããå ·äœåããããšãã§ããŸã-åæ£ãã¡ã¢ãªãŸãã¯ãã£ã¹ã¯ïŒãŸãã¯ãã£ã¹ã¯ãžã®æŒãåºãã䌎ãã¡ã¢ãªïŒã ãŸããå éšçã«ãRDDã¯ããŒãã£ã·ã§ã³åãããŠããŸããããã¯ãåäœæ¥ããŒãã§åŠçãããRDDã®æå°éã§ãã
Sparkã§çºçããèå³æ·±ãããšã¯ãã¹ãŠãRDDã®æäœãéããŠçºçããŸãã ã€ãŸããéåžžãSparkã®ã¢ããªã±ãŒã·ã§ã³ã¯æ¬¡ã®ããã«ãªããŸã-RDDãäœæãïŒããšãã°ãHDFSããããŒã¿ãååŸããŸãïŒããããå°ç¡ãã«ãïŒmapãreduceãjoinãgroupByãaggregateãreduceã...ïŒãçµæã§äœããè¡ããŸã-ããšãã°ã HDFSã
ããŠããã§ã«ãã®ç解ã«åºã¥ããŠãSparkã¯ãã¿ã¹ã¯ã調æŽãããã¹ã¿ãŒãšãå®è¡ã«åå ããå€æ°ã®äœæ¥ããŒããååšãããè€éãªåæã¿ã¹ã¯ã®äžŠåç°å¢ãšèŠãªãã¹ãã§ãã
ãã®ãããªåçŽãªã¢ããªã±ãŒã·ã§ã³ã詳现ã«èŠãŠã¿ãŸãããïŒScalaã§äœæããŸã-ããã¯ãã®ãã¡ãã·ã§ããã«ãªèšèªãåŠã¶æ©äŒã§ãïŒã
Sparkã¢ããªã±ãŒã·ã§ã³ã®äŸïŒãã¹ãŠãå«ãããã§ã¯ãããŸãããã€ã³ã¯ã«ãŒããªã©ïŒ
åã¹ãããã§äœãèµ·ããããåå¥ã«åæããŸãã
def main(args: Array[String]){ // , val conf = new SparkConf().setAppName(appName).setMaster(master) val sc = new SparkContext(conf) // HDFS, RDD val myRDD = sc.textFile("hdfs://mydata.txt") // . . // , ( // ) - "" val afterSplitRDD = myRDD.map( x => ( x.split(" ")( 0 ), x ) ) // : - val groupByRDD = afterSplitRDD.groupByKey( x=>x._1 ) // - val resultRDD = groupByRDD.map( x => ( x._1, x._2.length )) // HDFS resultRDD.saveAsTextFile("hdfs://myoutput.txt") }
ããã§äœãèµ·ãã£ãŠããŸããïŒ
ã§ã¯ããã®ããã°ã©ã ã調ã¹ãŠãäœãèµ·ãããèŠãŠã¿ãŸãããã
ãŸããããã°ã©ã ã¯ã¯ã©ã¹ã¿ãŒã®ãã¹ã¿ãŒã§å®è¡ãããããŒã¿ã®äžŠååŠçãè¡ãããåã«ã1ã€ã®ã¹ã¬ããã§éãã«äœããããæ©äŒããããŸãã ããã«-æ¢ã«ç®ç«ã£ãŠããããã«-RDDã§ã®åæäœã¯ç°ãªãRDDãäœæããŸãïŒsaveAsTextFileãé€ãïŒã åæã«ããã¡ã€ã«ã«æžã蟌ãããããšãã°ãã¹ã¿ãŒã«ã¢ããããŒãããããã«èŠæ±ããå Žåã«ã®ã¿ãRDDã¯ãã¹ãŠé 延ããŠäœæãããŸããå®è¡ãéå§ãããŸãã ã€ãŸããã³ã³ãã¢ã«ãã£ãŠã¯ãšãªãã©ã³ã®ããã«å®è¡ãããŸããã³ã³ãã¢èŠçŽ ã¯ããŒãã£ã·ã§ã³ã§ãã
HDFSãã¡ã€ã«ããäœæããæåã®RDDã¯ã©ããªããŸããïŒ Sparkã¯HadoopãšããŸãçµ±åãããŠãããããåäœæ¥ããŒãã§ç¬èªã®ããŒã¿ãµãã»ãããã¢ããããŒããããããŒãã£ã·ã§ã³ïŒHDFSã®å Žåã¯ãããã¯ãšäžèŽïŒã«ãã£ãŠã¢ããããŒããããŸãã ã€ãŸãããã¹ãŠã®ããŒããæåã®ãããã¯ãããŠã³ããŒãããèšç»ã«åŸã£ãŠå®è¡ãããã«é²ã¿ãŸããã
ãã£ã¹ã¯ããèªã¿åã£ãåŸããããããããŸã-åäœæ¥ããŒãã§ç°¡åã«å®è¡ãããŸãã
次ã¯groupByã§ãã ããã¯ãã¯ãåçŽãªãã€ãã©ã€ã³æäœã§ã¯ãªããå®éã®åæ£ã°ã«ãŒãåã§ãã 幞ããªããšã«ããã®æŒç®åã¯ããŸãè³¢ãå®è£ ãããŠããŸããããããŒã¿ã®å±ææ§ã®è¿œè·¡ãäžååã§ãããåæ£ãœãŒãã«å¹æµããããã©ãŒãã³ã¹ã«ãªãããããã®æŒç®åãé¿ããæ¹ãè¯ãã§ãããã ããŠãããã¯èæ ®ãã¹ãæ å ±ã§ãã
groupByã®å®è¡æã®ç¶æ³ã«ã€ããŠèããŠã¿ãŸãããã ãã¹ãŠã®RDDã¯ä»¥åã«ãã€ãã©ã€ã³åãããŠããŸãããã€ãŸããã©ãã«ãäœãä¿åãããŸããã§ããã é害ãçºçããå Žåã圌ãã¯åã³äžè¶³ããŠããããŒã¿ãHDFSããååŸãããã€ãã©ã€ã³ã«æž¡ããŸãã ããããgroupByã¯ãã€ãã©ã€ã³ãå£ãããã®çµæããã£ãã·ã¥ãããRDDãååŸããŸãã æ倱ãçºçããå Žåããã¹ãŠã®RDDãgroupByã«å®å šã«ããçŽãå¿ èŠããããŸãã
Sparkã®è€éãªã¢ããªã±ãŒã·ã§ã³ã®é害ã«ãããã€ãã©ã€ã³å šäœãåèšç®ããå¿ èŠãããç¶æ³ãåé¿ããããã«ãSparkã§ã¯ãŠãŒã¶ãŒãpersistã¹ããŒãã¡ã³ãã§ãã£ãã·ã¥ãå¶åŸ¡ã§ããããã«ããŸãã ã¡ã¢ãªïŒãã®å Žåãã¡ã¢ãªã§ããŒã¿ã倱ããããšåã«ãŠã³ããçºçããŸã-ãã£ãã·ã¥ããªãŒããŒãããŒãããšçºçããå¯èœæ§ããããŸãïŒããã£ã¹ã¯ïŒåžžã«ååã«é«éã§ã¯ãªãïŒããŸãã¯ãã£ãã·ã¥ãªãŒããŒãããŒã®å Žåã¯ãã£ã¹ã¯ãžã®æåºã䌎ãã¡ã¢ãªã«ãã£ãã·ã¥ã§ããŸãã
ãã®åŸãåã³ããããšHDFSã®ãšã³ããªããããŸãã
ããŠãSparkã®å éšã§äœãèµ·ãã£ãŠãããã¯ãåçŽãªã¬ãã«ã§å€ããå°ãªããæããã§ãã
ãããã詳现ã¯ã©ãã§ããïŒ
ããšãã°ãgroupByæäœã®ä»çµã¿ãç¥ãããã§ãã ãŸãã¯ãreduceByKeyæäœãããã³ãããgroupByãããã¯ããã«å¹ççã§ããçç±ã ãŸãã¯ãjoinãšleftOuterJoinã®ä»çµã¿ã æ®å¿µãªããããããŸã§ã®è©³çŽ°ã®ã»ãšãã©ã¯ãSparkã®ãœãŒã¹ããã®ã¿ããŸãã¯ã¡ãŒãªã³ã°ãªã¹ãã§è³ªåããããšã§æãç°¡åã«åŠã¶ããšãã§ããŸãïŒã¡ãªã¿ã«ãSparkã§æ·±å»ãªãŸãã¯éæšæºã®æäœãè¡ãå Žåã¯ã賌èªããããšããå§ãããŸãïŒã
ããã«æªãããšã«ãããŸããŸãªSparkã³ãã¯ã¿ã§äœãèµ·ãã£ãŠããã®ããç解ããŠããŸãã ãããŠãããããã©ãã ã䜿çšã§ãããã ããšãã°ãSparkã³ãã¯ã¿ã®ãµããŒããç解ã§ããªããããCassandraãšã®çµ±åãšããèããäžæçã«æŸæ£ããªããã°ãªããŸããã§ããã ããããè¿ãå°æ¥ãé«å質ã®ããã¥ã¡ã³ããç»å ŽããããšãæåŸ ããŠããŸãã
Sparkã®äžã«ã©ããªé¢çœããã®ããããŸããïŒ
- SparkSQLïŒSparkäžã®SQLãšã³ãžã³ã ãã§ã«èŠãããã«ãSparkeã«ã¯ãã¹ãã¬ãŒãžãã€ã³ããã¯ã¹ãããã³ç¬èªã®çµ±èšãé€ããŠãããã«é¢ããã»ãšãã©ãã¹ãŠã®æ©èœãæ¢ã«åãã£ãŠããŸãã ããã¯æé©åãéåžžã«è€éã«ããŸãããSparkSQLããŒã ã¯æ°ããæé©åãã¬ãŒã ã¯ãŒã¯ãèŠãŠãããšäž»åŒµããŠãããAMP LABïŒSparkãè²ã£ãç 究æïŒã¯Sharkãããžã§ã¯ããæåŠããŸãã-Apache HIVEã®å®å šãªä»£æ¿å
- Spark MLibïŒããã¯æ¬è³ªçã«Apache Mahaoutã®ä»£æ¿ã§ãããã¯ããã«æ·±å»ã§ãã å¹ççãªäžŠåæ©æ¢°åŠç¿ïŒRDDã ãã§ãªããè¿œå ã®ããªããã£ãã䜿çšïŒã«å ããŠãSparkMLã¯ãBreezeãã€ãã£ãç·åœ¢ä»£æ°ããã±ãŒãžã䜿çšããŠããŒã«ã«ããŒã¿ãåŠçããFortranã³ãŒããã¯ã©ã¹ã¿ãŒã«åŒãä»ããŸãã ãŸããéåžžã«ããèšèšãããAPIã ç°¡åãªäŸïŒçžäºæ€èšŒã¯ã©ã¹ã¿ãŒã§åæã«ãã¬ãŒãã³ã°ããŸãã
- BlinkDBïŒââéåžžã«èå³æ·±ããããžã§ã¯ã-倧éã®ããŒã¿ã«å ããŠäžæ£ç¢ºãªSQLã¯ãšãªã äžéšã®ãã£ãŒã«ãã®å¹³åãèšç®ãããã®ã§ããã5ç§ä»¥å
ã«ïŒç²ŸåºŠã倱ã£ãŠïŒãããå®è¡ãããã§ã-ãé¡ãããŸãã äžãããããã®ä»¥äžã®ãšã©ãŒãæã€çµæãå¿
èŠã§ã-ãããé©åã§ãã ãšããã§ããã®BlinkDBã®äžéšã¯Sparkå
ã§èŠã€ããããšãã§ããŸãïŒããã¯å¥ã®ã¯ãšã¹ããšèŠãªãããšãã§ããŸãïŒã
- ããŠãå€ãã®å€ãã®ããšãSparkã®äžã«æžãããŠããã®ã§ãç§ã¯èªåã®èŠ³ç¹ããæãèå³æ·±ããããžã§ã¯ãã®ã¿ããªã¹ãããŸãã