A Bad Example of a Good Test
Lately, break-room discussions comparing the performance of the various storage formats used with Apache Hadoop (CSV, JSON, Apache Avro, Apache Parquet, and so on) have become frequent. Most participants immediately dismiss the text formats as obvious outsiders, which leaves the contest between Avro and Parquet as the main intrigue.
The prevailing opinions were unconfirmed rumors: one format supposedly looks "better" when working with the whole dataset, while the other is "better" at queries over a subset of columns.
Like any self-respecting engineer, I thought it would be a good idea to run proper performance tests and finally check who is right. The results of the comparison are below.
Translator's note:
This article was originally conceived as a free translation of a text by Don Drake (@dondrake) for the Cloudera Engineering Blog about his experience comparing Apache Avro and Apache Parquet using Apache Spark. During translation, however, I dug into the details and found many debatable points in the testing. I added a subtitle to the article, and the text now includes comments that point out the inaccuracies.
Test datasets
I decided the right thing was to use real data and real queries for the tests. That way, production performance can be expected to behave the same as in the test environment. In other words, there would be no counting rows of surrogate data in these tests.
Choosing "real data" and "real queries" for a test looks highly debatable. Everyone has different real data and different queries. Typical storage performance tests, such as the TPC benchmarks, are synthesized precisely to solve this problem.
Looking through the datasets I had worked with recently, I found two that suited the tests best. The first, which I will call "narrow", has only 3 columns and 82.3 million rows, and takes up 3.9 GB as CSV.
As shown below, this yields 750 to 1000 MB of serialized data, processed by 50 workers. Each worker gets 15 to 20 MB of data. Most likely, initializing a worker takes longer than reading and processing its data.
The second, which I will call "wide", has 103 columns and 694 million rows, which makes a 194 GB CSV file. I think this approach lets us judge which format works better on large and on small files.
The "wide" dataset is not just about 30 times wider, it is also 8 times longer, and 49 times larger in source size. It would be more accurate to call the datasets "small" and "large".
Moreover, judging by the size ratio, the two datasets contain columns of quite different types. The differences between data types are ignored throughout this work, yet they are an important aspect of a storage format.
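The ratios quoted in this comment can be checked with a few lines of Scala. This is only a sketch: the `Dataset` case class and the `bytesPerCell` helper are names made up for illustration, and 1 GB is assumed to mean 10^9 bytes. The per-cell figure is a rough average that counts delimiters along with the values.

```scala
// Figures quoted in the text for the two datasets.
case class Dataset(name: String, columns: Int, rowsMillions: Double, csvGb: Double)

object DatasetRatios {
  val narrow = Dataset("narrow", 3, 82.3, 3.9)
  val wide   = Dataset("wide", 103, 694.0, 194.0)

  val widthRatio: Double  = wide.columns.toDouble / narrow.columns   // about 34
  val lengthRatio: Double = wide.rowsMillions / narrow.rowsMillions  // about 8.4
  val sizeRatio: Double   = wide.csvGb / narrow.csvGb                // about 49.7

  // Average CSV bytes per cell (assumes 1 GB = 1e9 bytes).
  // A large gap between datasets hints at different column types.
  def bytesPerCell(d: Dataset): Double =
    d.csvGb * 1e9 / (d.rowsMillions * 1e6 * d.columns)
}
```

Pasted into spark-shell with :paste, `DatasetRatios.bytesPerCell(DatasetRatios.narrow)` comes out near 16 bytes per cell, while the wide dataset averages under 3. That gap is consistent with the two datasets holding quite different column types.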
Test methodology
I chose Apache Spark 1.6 as the workhorse for the tests. Spark supports Parquet out of the box, while Avro and CSV support is plugged in separately. All operations ran on a CDH 5.5.x cluster of more than 100 machines.
I was interested in measuring how the formats perform across different kinds of processing: loading data, simple queries, non-trivial queries, processing the whole dataset, and the amount of disk space used.
I ran the tests through spark-shell with the same configuration for both datasets (the only difference was the number of executors). The shell's :paste mode was a lifesaver: it let me copy Scala code straight into the REPL without worrying about multi-line commands that could confuse the interpreter.
    #!/bin/bash -x

    # Drake
    export HADOOP_CONF_DIR=/etc/hive/conf
    export SPARK_HOME=/home/drake/coolstuff/spark/spark-1.6.0-bin-hadoop2.6
    export PATH=$SPARK_HOME/bin:$PATH

    # use Java8
    export JAVA_HOME=/usr/java/latest
    export PATH=$JAVA_HOME/bin:$PATH

    # NARROW
    NUM_EXECUTORS=50
    # WIDE
    NUM_EXECUTORS=500

    spark-shell --master yarn-client \
        --conf spark.eventLog.enabled=true \
        --conf spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory \
        --conf spark.yarn.historyServer.address=http://yarnhistserver.allstate.com:18088 \
        --packages com.databricks:spark-csv_2.10:1.3.0,com.databricks:spark-avro_2.10:2.0.1 \
        --driver-memory 4G \
        --executor-memory 2G \
        --num-executors $NUM_EXECUTORS \
        ...
I took the query execution times from the Jobs tab of the Spark Web UI. Each test was repeated three times and the average time was computed. The queries against the narrow dataset ran on a relatively busy cluster, while the queries against the wide dataset ran when the cluster was completely idle. This was not deliberate, it just happened that way.
Using different environments for different tests (a different number of workers and a different cluster load) makes it impossible to compare absolute values.
Running on a loaded cluster in itself hurts the reproducibility of the measurements between runs, and the effect is not trivial.
Three repetitions of an experiment look statistically insignificant: the confidence interval of the estimate will be very wide. The author, however, does not mention confidence intervals at all.
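As a rough illustration of how wide the interval gets with only three runs, the sketch below computes a 95% confidence interval for the mean using Student's t distribution. The timings in `exampleRuns` are hypothetical, not the author's measurements, and `RunStats` is a name invented here. The constant 4.303 is the standard two-sided critical value for 2 degrees of freedom.

```scala
object RunStats {
  // Two-sided 95% Student's t critical value for 2 degrees of freedom (n = 3 runs).
  val T95Df2 = 4.303

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sample standard deviation (n - 1 in the denominator).
  def sampleStd(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1))
  }

  // Half-width of the 95% confidence interval for the mean of exactly three runs.
  def ci95HalfWidth(xs: Seq[Double]): Double = {
    require(xs.size == 3, "T95Df2 is only valid for three runs")
    T95Df2 * sampleStd(xs) / math.sqrt(xs.size)
  }

  // Hypothetical timings in seconds, NOT taken from the original measurements.
  val exampleRuns: Seq[Double] = Seq(40.0, 45.0, 50.0)
}
```

For these made-up timings the mean is 45 s and the half-width is about 12.4 s, so a 5-second spread between runs already gives an interval of roughly 33 to 57 seconds.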
Data preprocessing
When reading the narrow dataset from CSV I did not use schema inference, because I needed to convert a column of type String to Timestamp. I did not include the time of this conversion in the results, since it has nothing to do with the storage format. For the wide dataset I did use schema inference, but I did not count its time either.
Schema inference ("infer schema" in the original) here means the implicit conversion of an RDD to a DataFrame using reflection.
During testing I was surprised to find that it was impossible to save an Avro file containing a column of type Timestamp. In fact, Avro 1.7.x fundamentally supports neither Date nor Timestamp.
Avro 1.8 does support the logical types Date and Timestamp and their derivatives. In essence, they are just wrappers over int or long.
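One way to live with the Avro 1.7.x limitation is to store the timestamp as a plain long of epoch milliseconds, which is what the division by 1000 in the Avro query further down suggests was done. The sketch below shows the round trip in plain Scala. `TimestampAsLong` and its methods are illustrative names, and the choice of milliseconds is an assumption.

```scala
import java.sql.Timestamp

object TimestampAsLong {
  // Epoch milliseconds, the value a long-backed column would hold.
  // Sub-millisecond precision of the Timestamp is dropped.
  def toMillis(ts: Timestamp): Long = ts.getTime

  // Rebuild a Timestamp from stored epoch milliseconds.
  def fromMillis(millis: Long): Timestamp = new Timestamp(millis)

  // Whole seconds since the epoch, the unit that from_unixtime expects.
  def toUnixSeconds(millis: Long): Long = millis / 1000
}
```

The round trip is lossless only down to the millisecond, which is one practical difference between a bare long and a real timestamp type.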
Testing the narrow dataset
First I measured how long it takes to write the narrow dataset to disk in Avro and in Parquet. I counted only the write time itself, after the data had already been read into a DataFrame. The difference was within statistical error, so write performance on the narrow dataset is about the same for both formats.
Even allowing for network overhead and the like, the serialization time comes out far too long. After all, each worker's output is under 20 MB.
It looks as though the author failed to separate the read-and-process time from the write time. If so, most of the reported time probably went into reading the 4 GB CSV file, possibly even in a single thread, and everything else took 5 to 10 seconds.
Chart: time to write the narrow dataset to disk, in seconds (lower is better).
Next I looked at how long a simple row count over the narrow dataset takes. Avro and Parquet were equally fast. For comparison, and to scare the reader, I also measured the count time for uncompressed CSV.
Parquet files store the number of objects in each block in their metadata. With this amount of data per worker, each worker gets just one Parquet block. So to count the rows, every worker only has to read one number, and a global reduce then produces the total.
For Avro the task is considerably harder. An Avro block also records the number of objects it contains, but the blocks themselves are much smaller (64 KB by default), so a file contains many of them. In theory, counting all the objects in an Avro file should take longer. In practice, with files this small, the difference may go unnoticed.
To count the rows in a CSV file the whole file has to be read, just as with Avro. If the 4 GB file is split properly, each worker gets 80 MB of data, which can be read in a few seconds. Yet the author's read takes 45 seconds, which suggests the file was not parallelized well enough.
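To make the block arithmetic in the comments above concrete, here is a toy model in Scala. It does not read real Parquet or Avro files. `Block`, `countRows`, and `blocksFor` are hypothetical names, and the 20 MB per-worker share is the upper figure quoted earlier. The Parquet block size is taken as 128 MB, and the Avro block size as 64,000 bytes, which is an assumption about what "64 KB" means here.

```scala
// Toy model of a block: only the object count stored in its metadata matters for counting.
case class Block(objectCount: Long, payloadBytes: Long)

object RowCountModel {
  // Sum the per-block counts; payloads are never deserialized.
  def countRows(blocks: Seq[Block]): Long = blocks.map(_.objectCount).sum

  // Number of blocks (and so metadata reads) needed to cover totalBytes, rounded up.
  def blocksFor(totalBytes: Long, blockBytes: Long): Long =
    (totalBytes + blockBytes - 1) / blockBytes

  val perWorkerBytes: Long = 20L * 1000 * 1000  // about 20 MB per worker
  val parquetBlocks: Long  = blocksFor(perWorkerBytes, 128L * 1024 * 1024)
  val avroBlocks: Long     = blocksFor(perWorkerBytes, 64000L)
}
```

Under these assumptions a worker touches one Parquet block but a few hundred Avro blocks. This matches the comment that the difference exists in theory yet stays small on files of this size.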
Chart: time to count the rows of the narrow dataset, in seconds (lower is better).
After that I tried a more complex query with GROUP BY. One of the columns in this dataset is a timestamp, and I computed the sum of another column for each day. Because Avro does not support Date and Timestamp, I had to tweak the query to get an equivalent result.
The query for Parquet:
    val sums = sqlContext.sql("""select to_date(precise_ts) as day, sum(replacement_cost)
    from narrow_parq
    group by to_date(precise_ts)
    """)
The query for Avro:
    val a_sums = sqlContext.sql("""select to_date(from_unixtime(precise_ts/1000)) as day, sum(replacement_cost)
    from narrow_avro
    group by to_date(from_unixtime(precise_ts/1000))
    """)
On the grouped query, Parquet was 2.6 times faster than Avro.
Next I decided to run a .map() transformation on the DataFrame to simulate processing the whole dataset. I chose a transformation that counts the number of columns in each row and collects all the distinct values.
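    import org.apache.spark.sql.Row

    // Number of columns in a row.
    def numCols(x: Row): Int = {
      x.length
    }

    // Map every row to its column count and collect the distinct counts.
    val numColumns = narrow_parq.rdd.map(numCols).distinct.collect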
The .distinct() operation complicates the task considerably. For simplicity, we can assume that it adds a reduce phase to the process. That means the measurement covers not only the .map() over the whole dataset but also the overhead of exchanging data between workers.
This is not quite the kind of task you run in real data processing, but it still forces the whole dataset to be processed. Again, Parquet is almost 2 times faster than Avro.
The last thing to do is to compare the sizes of the datasets on disk. The chart shows the sizes in bytes. Avro was configured to use the Snappy compression codec, while Parquet used the default settings.
The Parquet dataset turned out to be 25% smaller than the Avro one.
Using the default compression settings without even finding out what they are is very bad practice.
Parquet, however, uses gzip by default, and gzip compresses noticeably harder than Snappy. What if the difference in size comes down to nothing more than the difference in codecs? For a fair comparison, the dataset sizes should be measured with the same compression, or with no compression at all.
Also, in fairness, note that plain text files can usually be compressed too. I suspect that the source CSV file, gzipped aggressively, would not exceed 1.5 GB. So the advantage of the binary formats would not be so dramatic.
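A fair size comparison would pin the codec explicitly for both formats before writing. The fragment below is a sketch for the spark-shell session described earlier, assuming Spark 1.6 with the spark-avro package loaded. `narrowDf` and the output paths are placeholders. The two property names are given as I understand them from the Spark SQL and spark-avro documentation, so check them against the versions in use.

```scala
// Placeholder: narrowDf stands for the DataFrame already loaded from CSV.

// Use the same codec for both formats so that only the format itself differs.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")

// Hypothetical output paths.
narrowDf.write.format("parquet").save("/tmp/narrow_parquet_snappy")
narrowDf.write.format("com.databricks.spark.avro").save("/tmp/narrow_avro_snappy")

// Repeat with "uncompressed" for both properties to compare the raw encodings.
```

Comparing the resulting directory sizes would show how much of the 25% gap belongs to the format and how much to the codec.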
Testing the wide dataset
I ran the same operations on the large "wide" dataset. Recall that this dataset contains 103 columns and 694 million rows, which comes to a 194 GB uncompressed CSV file.
Looking ahead: this produces 5 GB of Parquet and 17 GB of Avro. Spread over 500 workers, that is a load of about 10 MB per worker for Parquet and 34 MB for Avro. Parquet obviously wins on compact storage. But the Avro files turn out to contain more blocks, which means processing could be sped up by adding workers. So if the number of workers were computed dynamically instead of being picked out of thin air, Avro could perform better than it did in these tests.
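The per-worker figures follow directly from dividing the on-disk sizes by the 500 executors configured for the wide dataset. A minimal sketch, with `WorkerLoad` as an invented name and 1 GB taken as 1000 MB:

```scala
object WorkerLoad {
  // Average megabytes per worker, treating 1 GB as 1000 MB.
  def mbPerWorker(totalGb: Double, workers: Int): Double = totalGb * 1000 / workers

  val parquetMb: Double = mbPerWorker(5.0, 500)    // 10 MB per worker
  val avroMb: Double    = mbPerWorker(17.0, 500)   // 34 MB per worker

  // How many times more data each worker handles with Avro than with Parquet.
  val avroToParquet: Double = avroMb / parquetMb   // about 3.4
}
```

The ratio of 3.4 between the two is the factor that the comments further down refer to when they say Avro could have been given more workers.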
First I measured the time to save the wide dataset in both formats. Parquet was faster than Avro every time.
On the row count, Parquet utterly crushed Avro, returning the result in under 3 seconds.
By default Parquet uses a block size of 128 MB, which is larger than the average amount of data per worker. So with Parquet the trick from the "narrow" dataset still works: to count the rows in the dataset, it is enough to read a single number from the metadata.
With Avro files, the dataset has to be read in full, interpreting only the metadata of each block and skipping over the data itself without deserializing it. That translates into "real" work for the disks. With CSV things are even worse: every single byte has to be parsed.
On the more complex GROUP BY query, Parquet takes the lead again.
Keep in mind here that 3.4 times more workers could have been launched for Avro. Would Parquet still keep its lead then?
And even on the .map() transformation over the whole dataset, Parquet again wins by a convincing margin.
Here too, keep in mind that 3.4 times more workers could have been launched for Avro. And what share of the operation's time went to .distinct(), and what share to actually reading from disk?
The final test, disk usage efficiency, showed impressive results for both contenders. Parquet managed to compress the original 194 GB down to 4.7 GB, a reduction of more than 97%. Avro also showed an impressive result, compressing the data to 16.9 GB (a 91% reduction). Kudos to both contenders!
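The quoted percentages are space savings relative to the 194 GB CSV. A short check in Scala, where `Compression` and `savedPercent` are illustrative names:

```scala
object Compression {
  // Share of the original size that was saved, as a percentage.
  def savedPercent(originalGb: Double, compressedGb: Double): Double =
    (1.0 - compressedGb / originalGb) * 100.0

  val parquetSaved: Double = savedPercent(194.0, 4.7)    // about 97.6
  val avroSaved: Double    = savedPercent(194.0, 16.9)   // about 91.3

  // How many times larger the Avro output is than the Parquet output.
  val avroToParquetSize: Double = 16.9 / 4.7             // about 3.6
}
```

The size ratio of roughly 3.6 is in line with the statement in the conclusion that Avro had to read about 3.5 times more data than Parquet.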
Conclusion
In the end, Parquet performed at least no worse than Avro in every test, and its advantage became evident as the data volume grew. Parquet's good results are partly due to its better compression: Avro had to read 3.5 times more data than Parquet. And Avro did not show the high performance on full-dataset reads that rumor had attributed to it.
When you have to choose a storage format for Hadoop, there are many factors to consider, such as integration with third-party applications, schema evolution, and support for specific data types. But if you put performance first, the tests above show convincingly that Parquet is the best choice.
And a word of my own. This is a perfectly reasonable picture of how the formats compare on performance, and it is confirmed by numerous observations from our team's day-to-day experience. Even so, the test methodology suffers from two flaws: at times it measures extraneous operations (reading CSV, GROUP BY, .distinct(), and so on), and at times it completely ignores important questions (compression, data types, and so on). I understand that running a standard test with "distilled" metrics is not easy. But that is exactly what I expected from the Cloudera blog.