Exporting events from Pub/Sub to hourly buckets with Dataflow
Most of the jobs Spotify runs today are batch jobs. They require events to be reliably exported to persistent storage. Traditionally we have used the Hadoop Distributed File System (HDFS) and Hive as that persistent storage. To keep up with Spotify's growth, measured both in the size of the stored data and in the number of engineers, we are gradually switching from HDFS to Cloud Storage and from Hive to BigQuery.
The Extract, Transform and Load (ETL) job is the component we use to export data to HDFS and Cloud Storage. Exports to Hive and BigQuery are handled by batch jobs that transform the data in the hourly buckets on HDFS and Cloud Storage.
All exported data is partitioned into hourly buckets according to its timestamps. This is a public interface that was introduced by our very first event delivery system. That system was based on the scp command and copied hourly syslog files from all servers to HDFS.
The ETL job must determine, with high probability, that all data for an hourly bucket has been written to persistent storage. Once no more data is expected for an hourly bucket, the bucket is marked as complete.
Data that arrives late for an already completed bucket cannot be appended to it, because jobs generally read the data from a bucket only once. To solve this, the ETL job has to handle late data separately. All late data is written to the bucket for the current hour, which shifts the event's timestamp forward into the future.
To write the ETL job, we decided to try Dataflow. The choice was driven by our wish to carry as little operational responsibility as possible and to have others solve the hard problems for us. Dataflow is both a framework for writing data pipelines and a fully managed Google Cloud service for running them. It works out of the box with Cloud Pub/Sub, Cloud Storage and BigQuery.
Writing pipelines in Dataflow is very similar to writing pipelines in Apache Crunch. That is not surprising, since both projects were inspired by FlumeJava. The difference is that Dataflow offers a unified model for streaming and batch work, whereas Crunch only has a batch model.
To get good end-to-end latency, we wrote the ETL as a streaming job. Because it runs continuously, it can fill individual hourly buckets incrementally as data arrives. This gives us lower latency than a batch job that exports the data once an hour.
The ETL job uses Dataflow's windowing concept to partition data into hourly buckets based on event time. In Dataflow, windows can be assigned by both event time and processing time. Being able to create windows from event timestamps gives Dataflow an advantage over other streaming frameworks. So far, Apache Flink is the only other framework that supports windowing in both event time and processing time.
Every window consists of one or more panes, and every pane contains a set of elements. The trigger assigned to a window determines how its panes are created. Panes are only emitted after the data has passed through a GroupByKey. Because GroupByKey groups by key and window, all aggregated elements in a pane share the same key and belong to the same window.
Dataflow provides a mechanism called the "watermark", which can be used to decide when to close a window. (Here the word means a limit or boundary line, not a mark on an image or a note.) The watermark uses the event times of the incoming data stream to calculate the point in time by which all events for a particular window have most likely arrived.
ETL implementation details
This section looks at some of the problems we ran into while building the Dataflow ETL job for event delivery. It may be a little hard to follow if you have no experience with Dataflow or similar systems. If the concepts and terminology are new to you, Google's Dataflow paper is a helpful companion.
In the event delivery system there is a one-to-one mapping between event types and Cloud Pub/Sub topics. A single ETL job consumes the stream of a single event type. We use independent ETL jobs to process the data from all event types.
䜿çšå¯èœãªãã¹ãŠã®ã¯ãŒã«ãŒã«è² è·ãåçã«åæ£ããããã«ãããŒã¿ã¹ããªãŒã ã¯ãåã€ãã³ãã«ãŠã£ã³ããŠãå²ãåœãŠãå€æãåããåã«åå²ãããŸãã 䜿çšããã·ã£ãŒãã®æ°ã¯ãã¿ã¹ã¯ã«å²ãåœãŠãããã¯ãŒã«ãŒã®æ°ã®é¢æ°ã§ãã
"Window" is a composite transform. In its first step, it assigns a fixed one-hour window to every event in the incoming stream. A window is considered closed once the watermark passes the end of its hour.
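    @Override
    public PCollection<KV<String, Iterable<Gabo.EventMessage>>> apply(
        final PCollection<KV<String, Gabo.EventMessage>> shardedEvents) {
      return shardedEvents
          // Assign every event to a fixed one-hour window based on its event time.
          .apply("Assign Hourly Windows",
              Window.<KV<String, Gabo.EventMessage>>into(FixedWindows.of(ONE_HOUR))
                  // Events arriving more than one day late are dropped.
                  .withAllowedLateness(ONE_DAY)
                  .triggering(
                      AfterWatermark.pastEndOfWindow()
                          // Before the window closes: emit a pane every maxEventsInFile elements.
                          .withEarlyFirings(AfterPane.elementCountAtLeast(maxEventsInFile))
                          // After it closes: emit on element count or after 10 seconds
                          // of processing time, whichever comes first.
                          .withLateFirings(
                              AfterFirst.of(
                                  AfterPane.elementCountAtLeast(maxEventsInFile),
                                  AfterProcessingTime.pastFirstElementInPane()
                                      .plusDelayOf(TEN_SECONDS))))
                  // Each pane contains only the elements that arrived since the last firing.
                  .discardingFiredPanes())
          // Group by key and window; this is where panes are materialized.
          .apply("Aggregate Events", GroupByKey.create());
    }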
When we assign windows, we configure an early trigger that emits a pane every N elements until the window is closed. Thanks to this trigger, the hourly buckets are filled continuously as data arrives. Configuring the trigger this way not only gives us lower export latency, it also helps us work around a limitation of GroupByKey: because GroupByKey is an in-memory transform, the amount of data collected in a pane must fit into the memory of a worker machine.
Once the window is closed, pane emission is controlled by a late trigger. This trigger emits a pane either after N elements or after 10 seconds of processing time. Events that arrive more than one day late are dropped.
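The three outcomes described above (on time, late, dropped) can be made concrete with a small sketch that uses only `java.time`. It is a simplified model of the windowing rules, not Dataflow code: the `HourlyWindowing` class, its method names and the `Timing` enum are hypothetical, and a real runner makes this decision internally.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class HourlyWindowing {
    enum Timing { ON_TIME, LATE, DROPPED }

    // Mirrors withAllowedLateness(ONE_DAY) from the pipeline snippet.
    static final Duration ALLOWED_LATENESS = Duration.ofDays(1);

    // Start of the fixed one-hour window that contains the event.
    static Instant windowStart(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    // Exclusive end of that window.
    static Instant windowEnd(Instant eventTime) {
        return windowStart(eventTime).plus(1, ChronoUnit.HOURS);
    }

    // Classifies an arriving event against the current watermark.
    static Timing classify(Instant eventTime, Instant watermark) {
        Instant end = windowEnd(eventTime);
        if (watermark.isBefore(end)) {
            return Timing.ON_TIME;   // window still open: early firings apply
        }
        if (watermark.isBefore(end.plus(ALLOWED_LATENESS))) {
            return Timing.LATE;      // window closed: late firings apply
        }
        return Timing.DROPPED;       // more than one day late
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2016-03-01T10:15:00Z");
        System.out.println(classify(event, Instant.parse("2016-03-01T10:30:00Z")));
        System.out.println(classify(event, Instant.parse("2016-03-01T11:05:00Z")));
        System.out.println(classify(event, Instant.parse("2016-03-02T12:00:00Z")));
    }
}
```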
Panes are materialized, that is, actually produced as grouped collections of events, in the "Aggregate Events" transform, which is nothing more than a GroupByKey transform.
Number of incoming events per second
To track the number of incoming events per second that pass through the ETL job, we apply the "Monitor Average RPS of Timely and Late Events" transform to the output of "Assign Hourly Windows". All metrics from this transform are sent to Cloud Monitoring as custom metrics. The metrics are calculated over five-minute sliding windows that are emitted every minute.
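As a rough illustration of the arithmetic behind such a metric, the sketch below computes an average events-per-second figure over a five-minute window that slides forward one minute at a time. It works on an in-memory array of per-minute counts; the `SlidingRps` class and its method are hypothetical, and for simplicity it only emits once a full window of data is available.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingRps {
    // Returns one average-RPS value per minute, each covering the
    // previous windowMinutes minutes (inclusive of the current one).
    static List<Double> averageRps(long[] eventsPerMinute, int windowMinutes) {
        List<Double> result = new ArrayList<>();
        long sum = 0;
        for (int i = 0; i < eventsPerMinute.length; i++) {
            sum += eventsPerMinute[i];
            if (i >= windowMinutes) {
                // Drop the minute that just slid out of the window.
                sum -= eventsPerMinute[i - windowMinutes];
            }
            if (i >= windowMinutes - 1) {
                result.add(sum / (windowMinutes * 60.0));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] counts = {600, 600, 600, 600, 600, 1200};
        System.out.println(averageRps(counts, 5)); // prints [10.0, 12.0]
    }
}
```

Because each value averages five minutes of traffic, a one-minute burst moves the reported number only gradually, which keeps the graph readable.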
Information about an event's timeliness only becomes available after the event has been assigned to a window. We obtain it by comparing the maximum timestamp of the element's window with the current watermark. Because watermark data is not synchronized between transforms, detecting timeliness this way can be inaccurate. The number of events currently misdetected as late is very small: fewer than one per day.
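We could detect timeliness exactly if the monitoring transform ("Monitor Average RPS of Timely and Late Events") were applied to the output of "Aggregate Events". The downside of that approach is that the metrics would become unpredictable, because panes are emitted based on both the number of elements and event time.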
The "Write to HDFS/GCS" transform writes the data to HDFS or Cloud Storage. The mechanics of writing are the same for HDFS and Cloud Storage; the only difference is the file system API that is used. In our implementation, both APIs are hidden behind the IOChannelFactory interface.
To guarantee that only a single file is written per pane, even in the face of failures, every pane receives a unique ID. The pane ID is then used as the unique ID of the file that is written. Files are written in Avro format, using a schema that matches the event's schema ID.
Timely panes are written to buckets based on event time. Late panes are written to the bucket for the current hour, because adding data to an already closed bucket is undesirable for the way data is consumed at Spotify. To find out whether a pane is on time, we use the PaneInfo object, which is constructed when the pane is created.
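The bucket choice and the one-file-per-pane naming can be sketched as follows. This is an illustration under assumptions: the `BucketPaths` class, the `yyyy-MM-dd/HH` directory layout, the `.avro` suffix and the example event type are all hypothetical, and the `paneIsLate` flag stands in for whatever the job reads from the pane's PaneInfo.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class BucketPaths {
    // Hypothetical layout: one directory per UTC hour.
    private static final DateTimeFormatter HOUR_BUCKET =
        DateTimeFormatter.ofPattern("yyyy-MM-dd/HH").withZone(ZoneOffset.UTC);

    // Timely panes go to the bucket of their window; late panes go to
    // the bucket of the current hour so closed buckets are never touched.
    static String bucketFor(Instant windowStart, boolean paneIsLate, Instant now) {
        Instant hour = paneIsLate ? now.truncatedTo(ChronoUnit.HOURS) : windowStart;
        return HOUR_BUCKET.format(hour);
    }

    // Using the pane ID as the file name means a retried write targets
    // the same path instead of creating a second file.
    static String filePath(String eventType, String bucket, String paneId) {
        return eventType + "/" + bucket + "/" + paneId + ".avro";
    }

    public static void main(String[] args) {
        Instant window = Instant.parse("2016-03-01T10:00:00Z");
        Instant now = Instant.parse("2016-03-01T13:42:00Z");
        System.out.println(filePath("example_event", bucketFor(window, false, now), "pane-7"));
        System.out.println(filePath("example_event", bucketFor(window, true, now), "pane-8"));
    }
}
```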
The completeness marker for an hourly bucket is written only once. To achieve this, the main output of the "Write Pane" transform is re-windowed into hourly windows and aggregated in "Aggregated Write Successes".
Number of files written per second
Watermark lag in milliseconds
Metrics are emitted as a side output of the "Write Pane" transform. We collect the number of files written per second, the average lateness of events, and the lag of the watermark relative to the current time. All of these metrics are calculated over five-minute windows and emitted every minute.
Because the watermark lag is measured after the data has been written to HDFS/Cloud Storage, it relates directly to the end-to-end latency of the system. The watermark lag graph shows that the lag currently stays mostly below 200 seconds (about 3.5 minutes). The same graph also shows occasional spikes of up to 1,500 seconds (about 25 minutes). These spikes are caused by flakiness when writing to the Hadoop cluster over VPN. For comparison, the latency of the old system is two hours on its best days and three hours on average.
Next steps for the ETL job
The ETL job implementation is still at the prototype stage. So far we have four ETL jobs running (see the graph of incoming events per second). The smallest job consumes about 30 events per second, while the largest peaks at 100,000 events per second.
We have not yet found a good way to calculate the optimal number of workers for an ETL job. So far the numbers are set manually, after trial and error. We use 2 workers for the smallest job and 42 workers for the largest. Note that job performance also depends on memory: one pipeline that handles about 20K events per second uses 24 workers, while a second pipeline that handles events at the same rate, but with an average message size four times smaller, uses only 4. We intend to make use of autoscaling.
We need to make sure that no data is lost when a job is shut down. That is currently not the case if a job update fails. We are actively working with the Dataflow engineers to find a solution to this problem.
The behavior of the watermark is still something of a mystery to us. We need to verify that its calculation is predictable both when failures occur and during normal operation.
Finally, we need to define a proper CI/CD model for fast and reliable ETL job updates. This is a significant task: we have to manage one ETL job per event type, and there are more than 1,000 of them.
Event delivery system in the cloud
We are actively working on bringing the new system into production. The preliminary numbers we obtained from running the new system in its experimental phase are very encouraging. The worst end-to-end latency in the new system is one quarter of the end-to-end latency of the old platform.
But better performance is not the only thing we get from the new system. We expect cloud products to significantly reduce our operational overhead. That, in turn, means we will have more time to improve Spotify's products.