Introduction
One of our clients needed to move the logs of most of its enterprise applications and their databases "somewhere else", and wanted to be able to analyze them systematically over a period of many years. Moving the logs was of course not the primary goal; based on the full set of requirements we chose Hadoop, in the Cloudera distribution (CDH 5).
Among other things, the requirements stated that the solution must be able to search for and display lists of events (from the logs) by given criteria, preferably quickly. In addition, some applications had to be reworked so that their log-viewing forms used Hadoop instead of the database.
One option is to use the SolrCloud search module that ships with Cloudera's Hadoop package. Out of the box, Cloudera includes tools for unloading data from application databases and indexing it in batches rather than row by row. This approach works, but it turned out to take more tuning effort and to be less predictable than, say, fetching the data with Impala. So we decided to share how we did it, hoping to save time for anyone facing a similar task.
This article describes the configuration details and the quirks we ran into along the way.
Scenario
- Unload the data from Oracle into files on HDFS. The file format is Avro. Tool: sqoop ( http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.htm ).
The Avro format has many advantages: it is binary, the data compresses well, and, unlike CSV, you do not have to worry about line breaks or commas inside text fields. The schema is stored in the file itself, and schema evolution is supported. More generally, Avro is promoted in Hadoop as a unified format for storing and transferring data between components, and many tools and components support it. It has one more advantage for our task, described further below.
- Create a "collection" in SolrCloud. Tool: solrctl ( http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Search/Cloudera-Search-User-Guide/csug_solrctl_ref.html ).
A collection is a logical index in SolrCloud. Think of it as a folder of index files that is associated with a set of configuration files and consists of one or more shards. If there is more than one shard, the index is distributed.
- Start the MapReduce driver ( https://developer.yahoo.com/hadoop/tutorial/module4.html#driver ), which does the following:
- reads all the records from the Avro files;
- passes them through an ETL process described as a morphline script. The output of this process is shards containing the new data (index files in Solr format, placed in a specified HDFS directory);
- merges the resulting shards into the active SolrCloud collection without taking it offline (the go-live step) :)
Tool: the hadoop command, which launches the org.apache.solr.hadoop.MapReduceIndexerTool driver ( http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Search/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html ) that carries out this sequence.
We run everything from the primary NameNode, although this is not essential.
So, step by step...
Unloading data from Oracle into Avro files
sqoop import --connect jdbc:oracle:thin:@oraclehost:1521/SERVICENAME \
  --username ausername --password apassword --table ASCHEMA.LOG_TABLE \
  --as-avrodatafile --compression-codec snappy \
  -m 16 --split-by NUM_BEG \
  --map-column-java NUM_BEG=Integer,DTM_BEG=String,KEY_TYPE=String,OLD_VALUE=String,NEW_VALUE=String,NUM_PARENT=Integer,NUM_END=Integer,EVENT=String,TRACELEVEL=String,KEY_USER=String,COMPUTER_NAME=String,PRM=String,OPERATION=Integer,KEY_ENTITY=String,MODULE_NAME=String \
  --target-dir /user/$USER/solrindir/tmlogavro
A few words about the parameters:
- connect: the connection string for the Oracle database of one of the applications.
- as-avrodatafile and compression-codec: the data is written to Avro files with the specified compression. On our data structure this compresses the data about 10 times on average.
- -m: the number of map tasks that unload data from the table. The tasks run in parallel; each one fetches a subset of the table's records and saves it to a separate file. To determine the subsets, sqoop runs select min(<split-by>), max(<split-by>) from the table, splits the resulting numeric range into 16 parts (in this example), and each task uses its sub-range as a filter in its SQL query to select the records it needs. By default, the split column is the first column of the table's primary key.
- map-column-java: specifies the column types in Sqoop terms. In principle Sqoop can digest most Oracle column types on its own, but sometimes it needs hints, which this parameter provides.
- target-dir: the HDFS directory in which the files are saved.
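As a rough illustration of the range splitting described for -m and split-by, the sketch below cuts a numeric range into contiguous sub-ranges and prints the filter each map task would receive. split_ranges is a hypothetical helper, and the boundaries 1 and 1600 are made-up values; Sqoop's own boundary arithmetic may differ in detail.

```shell
#!/usr/bin/env bash
# Illustration only: cut the [min, max] range of the split-by column into
# N contiguous sub-ranges, one per map task.
# split_ranges is a hypothetical helper, not part of Sqoop.
split_ranges() {
  local min=$1 max=$2 parts=$3
  local span=$(( max - min + 1 ))
  local lo=$min i size
  for (( i = 0; i < parts; i++ )); do
    # Spread the remainder over the first (span % parts) sub-ranges.
    size=$(( span / parts + (i < span % parts ? 1 : 0) ))
    (( size == 0 )) && continue
    echo "NUM_BEG >= $lo AND NUM_BEG <= $(( lo + size - 1 ))"
    lo=$(( lo + size ))
  done
}

# Example: min(NUM_BEG) = 1, max(NUM_BEG) = 1600, sixteen map tasks (-m 16).
split_ranges 1 1600 16
```

If the values of the split column are spread unevenly, equal numeric sub-ranges still produce files of very different sizes, which matters for the memory limits discussed at the end of the article.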
Creating the collection
Here we use the solrctl utility, which manages a deployed SolrCloud.
First, generate on the local disk a collection instance directory that holds the file structure of the future collection. In it we create or change the collection settings locally and then replicate them to the ZooKeeper configuration service, from which SolrCloud reads the settings it needs.
solrctl instancedir --generate $HOME/solr_configs_for_tm_log
Here the parameter is the path to the local directory to be created.
By default, the files created in the directory are already filled with demo settings for the data schema and the search handlers, so the superfluous parts have to be removed.
Open the conf/schema.xml file in the created directory. This is the main collection file; it describes the structure of the indexed data. Remove the existing tags, with their contents, that the following block replaces, and insert it in their place:
<fields>
  <field name="num_beg" type="int" indexed="true" stored="true" multiValued="false" />
  <field name="dtm_beg" type="date" indexed="true" stored="true" multiValued="false" />
  <field name="key_type" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="old_value" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="new_value" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="num_parent" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="num_end" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="event" type="text_general" indexed="true" stored="true" multiValued="false" />
  <field name="tracelevel" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="key_user" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="computer_name" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="prm" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="operation" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="key_entity" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="module_name" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="_version_" type="long" indexed="true" stored="true" required="true" />
  <!-- catchall field, containing all other searchable text fields
       (implemented via copyField further on in this schema) -->
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>

<!-- Field to use to determine and enforce document uniqueness.
     Unless this field is marked with required="false", it will be a required field -->
<uniqueKey>num_beg</uniqueKey>

<copyField source="event" dest="text"/>
Note that the _version_ field does not exist in the data source. Solr needs it for internal purposes such as optimistic locking and the partial-update mechanism. It is enough to declare the field in schema.xml; Solr manages its contents itself.
The text field is not in the source either. We declared it, together with the copyField instruction, for full-text search through HUE (Cloudera's user interface to Hadoop). If you connect the created collection to HUE (through its configuration UI forms), the search string entered in that collection's search interface is matched against the text field.
Now for one workaround. The generated sample files include the elevator, a search-engine mechanism that promotes results according to certain criteria, much like the sponsored listings at the top of Yandex search results. The sample configuration assumes that the key field of the schema has the type string (example phrases are in conf/elevate.xml), whereas ours is int. Because of this, the whole indexing process failed with a type-mismatch error. Since this mechanism is of no interest for our task, open conf/solrconfig.xml in the created directory and delete (or comment out) the tags <searchComponent name="elevator" ...> and <requestHandler name="/elevate" ...> together with their contents. Also delete conf/elevate.xml from the created directory so that it does not get in the way.
Next, register (clone) the whole configuration of the future collection in SolrCloud, or more precisely in the ZooKeeper naming service, from which all deployed SolrCloud servers read the configuration (and receive updates to it):
solrctl instancedir --create tm_log_avro $HOME/solr_configs_for_tm_log
Here the parameters are the name of the future collection and the path to the local directory with the configuration files, which we created above.
The last step of this stage is to create the collection with the specified number of shards:
solrctl collection --create tm_log_avro -s 1
This command creates the collection from the configuration registered in ZooKeeper. The first parameter is the name of the collection, the second is the number of shards (we use 1 for simplicity).
Running the collection indexing process
First we set up the ETL process for indexing. Cloudera promotes the Kite SDK and, in particular, its Morphlines component. In essence, Morphlines is an interpreter for a scripting language that describes, as a hierarchy of command sequences, what to do with an input data stream (a sequence of "record" objects), how to transform it, and where to pass it on. For example, there is a command that reads Avro files. You can also plug in your own commands. Cloudera has written a command that builds a Solr index from every record in the incoming stream; it goes at the end of our script.
The essence of the process:
- a "record" object with information about a file arrives at the input;
- a command is launched that reads this file and returns its rows as a sequence of "record" objects;
- the data in each row is transformed as needed (for example, date-time field values are converted from the regional time zone to UTC);
- each row is converted into a Solr document, and the whole sequence is returned from the MapReduce Mapper.
To configure this process, create the file $HOME/solr_configs_for_tm_log_morphlines/morphlines.conf with the following contents:
# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : tm_log_avro

  # ZooKeeper ensemble
  zkHost : "hadoop-n1.custis.ru:2181,hadoop-n2.custis.ru:2181,hadoop-n3.custis.ru:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on its way to
# Solr.
morphlines : [
  {
    # Name used to identify a morphline. For example, used if there are multiple
    # morphlines in a morphline config file.
    id : morphline1

    # Import all morphline commands in these java packages and their subpackages.
    # Other commands that may be present on the classpath are not visible to this
    # morphline.
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        # Parse Avro container file and emit a record for each Avro object
        readAvroContainer {
          # Optionally, require the input to match one of these MIME types:
          # supportedMimeTypes : [avro/binary]

          # Optionally, use a custom Avro schema in JSON format inline:
          # readerSchemaString : """<json can go here>"""

          # Optionally, use a custom Avro schema file in JSON format:
          # readerSchemaFile : /path/to/syslog.avsc
        }
      }

      {
        # Consume the output record of the previous command and pipe another
        # record downstream.
        #
        # extractAvroPaths is a command that uses zero or more Avro path
        # expressions to extract values from an Avro object. Each expression
        # consists of a record output field name, which appears to the left of the
        # colon ':' and zero or more path steps, which appear to the right.
        # Each path step is separated by a '/' slash. Avro arrays are
        # traversed with the '[]' notation.
        #
        # The result of a path expression is a list of objects, each of which
        # is added to the given record output field.
        #
        # The path language supports all Avro concepts, including nested
        # structures, records, arrays, maps, unions, and others, as well as a flatten
        # option that collects the primitives in a subtree into a flat list. In the
        # paths specification, entries on the left of the colon are the target Solr
        # field and entries on the right specify the Avro source paths. Paths are read
        # from the source that is named to the right of the colon and written to the
        # field that is named on the left.
        extractAvroPaths {
          flatten : true
          paths : {
            computer_name : /COMPUTER_NAME
            dtm_beg : /DTM_BEG
            event : /EVENT
            key_entity : /KEY_ENTITY
            key_type : /KEY_TYPE
            key_user : /KEY_USER
            module_name : /MODULE_NAME
            new_value : /NEW_VALUE
            num_beg : /NUM_BEG
            num_end : /NUM_END
            num_parent : /NUM_PARENT
            old_value : /OLD_VALUE
            operation : /OPERATION
            prm : /PRM
            tracelevel : /TRACELEVEL
          }
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # convert timestamp field to native Solr timestamp format
      # such as 2012-09-06 07:14:34 to 2012-09-06T07:14:34.000Z in UTC
      {
        convertTimestamp {
          field : dtm_beg
          inputFormats : ["yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd"]
          inputTimezone : Europe/Moscow
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that is not specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
A few words about the commands used:
- readAvroContainer: this is where the Avro format pays off. The file itself carries all the metadata about the data structure that this command needs to produce the stream of record objects. With CSV, for example, we would have to describe again here the name, type, length, and position of every field. With Avro this information was generated automatically at the first step, when Sqoop unloaded the data from Oracle.
- extractAvroPaths: specifies which fields to take from each incoming record and which fields of the outgoing record to put them in. The target names are the field names that the SolrCloud collection "knows". The last command passes them on for indexing.
- convertTimestamp: called for every incoming record; converts the string field into a date-time in UTC format.
- loadSolr: converts the record objects into Solr documents. The sequence of documents is then passed to the MapReduce Reducer, which does the actual indexing.
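To make the convertTimestamp step concrete, here is a minimal sketch of the same kind of conversion done with GNU date. to_solr_utc is a hypothetical helper, not a Morphlines command; it assumes GNU coreutils and takes a fixed UTC offset rather than a zone name (Moscow time was UTC+4 on the sample date in 2012). Milliseconds are hard-coded to .000 because the input has one-second resolution.

```shell
#!/usr/bin/env bash
# Illustration only: convert a local "yyyy-MM-dd HH:mm:ss" string with a known
# UTC offset into the Solr date format in UTC.
# to_solr_utc is a hypothetical helper; it relies on GNU date (-d and -u).
to_solr_utc() {
  local local_time=$1 utc_offset=$2
  date -u -d "$local_time $utc_offset" '+%Y-%m-%dT%H:%M:%S.000Z'
}

# A log timestamp recorded in Moscow time (UTC+4 in September 2012).
to_solr_utc '2012-09-06 07:14:34' '+0400'   # prints 2012-09-06T03:14:34.000Z
```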
Launch
Now everything is ready to run. Two commands are launched together:
- org.apache.solr.hadoop.HdfsFindTool, which is essentially a partial implementation of the Linux find command (for some reason HDFS itself still has no such command, although the issue has been open for a long time). Its output (a list of files) is passed to the second command;
- the MapReduce driver org.apache.solr.hadoop.MapReduceIndexerTool with a set of parameters.
sudo -u hdfs hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.HdfsFindTool -find \
  hdfs://$NNHOST:8020/user/$USER/solrindir/tmlogavro -type f \
  -name 'part-m-000*.avro' |\
sudo -u hdfs hadoop --config /etc/hadoop/conf.cloudera.yarn \
  jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --libjars /usr/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.0.0.jar \
  --log4j $HOME/solr_configs_for_tm_log_morphlines/log4j.properties \
  --morphline-file $HOME/solr_configs_for_tm_log_morphlines/morphlines.conf \
  --output-dir hdfs://$NNHOST:8020/user/$USER/solroutdir \
  --verbose --go-live --zk-host $ZKHOST \
  --collection tm_log_avro \
  --input-list -
A few words about the parameters of the second command:
- jar: the path to the driver jar. The path shown is the one from the standard Cloudera distribution.
- org.apache.solr.hadoop.MapReduceIndexerTool: the name of the driver class inside the jar.
- libjars: additional libraries.
- log4j: the path to the log4j configuration file. You can use the standard one from /usr/lib/hadoop-yarn/etc/hadoop.
- morphline-file: the path to the morphline script file created above.
- output-dir: the name of the HDFS directory where all the indexes are stored before they are merged into the SolrCloud servers.
- input-list: the list of files to index. The dash after the parameter means that the list is read from standard input.
- the $ZKHOST variable holds hadoop-n1.custis.ru:2181,hadoop-n2.custis.ru:2181,hadoop-n3.custis.ru:2181/solr
This command creates and runs the following MapReduce tasks:
- The Map task takes a file, passes it through the Morphline ETL, converts the resulting log records into Solr document objects, and hands them to the next task. There are as many task instances as there are files.
- The Reduce task takes the incoming documents and indexes them into a separate directory on disk (a subdirectory of <output-dir>). There are the same number of instances.
- A Reduce-only task then takes all the indexes from that directory and merges them into SolrCloud. There are as many task instances as there are shards in the collection; in our case, 1.
Some results
MapReduceIndexerTool and Solr itself turned out to be very picky about available RAM. On our data structure, each Reduce task that indexed a file from the list needed a Java heap of about half the size of the uncompressed file; otherwise it failed with OutOfMemoryError. So when unloading to files with sqoop, control the file size, for example with the -m parameter (the number of mappers, each of which creates one file).
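The heap observation above can be turned into a back-of-the-envelope estimate of the smallest -m value to try. min_files is a hypothetical helper and the sizes are made-up numbers; the sketch assumes that the ratio we observed (heap of about half the uncompressed file size) also holds for your data, and that sqoop produces files of roughly equal size, which is only true when the split column is evenly distributed.

```shell
#!/usr/bin/env bash
# Illustration only: estimate how many files (sqoop mappers) are needed so that
# each uncompressed file stays within about twice the reducer heap.
# min_files is a hypothetical helper; both arguments are in megabytes.
min_files() {
  local total_uncompressed_mb=$1 reducer_heap_mb=$2
  # Largest file one reducer can index under the "heap is about half" rule.
  local max_file_mb=$(( reducer_heap_mb * 2 ))
  # Integer ceiling of total / max_file.
  echo $(( (total_uncompressed_mb + max_file_mb - 1) / max_file_mb ))
}

# Hypothetical sizes: 20 GB of uncompressed data, 1 GB of heap per Reduce task.
min_files 20480 1024   # prints 10
```

Treat the result as a lower bound and leave headroom, since real files rarely come out exactly equal in size.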
Regardless of how much memory is available to the Map and Reduce tasks, the success of the final merge step depends directly on the memory available to the Solr server and on the size of the data already indexed in the collection. For example, on our data structure, a Java heap of 6 GB allocated to a single Solr instance was enough to merge 30 GB into one shard.
One more peculiarity: the index merge mechanism used here does not detect duplicate records. If the files being indexed contain records that are already in the collection, those records are duplicated. So when re-indexing, make sure that the files contain a unique set of records each time. This is easy to arrange with sqoop's incremental import feature (through a sqoop job). Do not forget to delete the old files from the folder before starting the unload; otherwise they will be indexed again.
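One possible way to set this up is sketched below. print_incremental_import is a hypothetical helper that only prints the commands for review and runs nothing on the cluster; the job name tm_log_import is made up. The sketch assumes that NUM_BEG only grows for new rows and that 0 is below every existing value. The password is left out here and has to be supplied in whatever way your setup uses.

```shell
#!/usr/bin/env bash
# Illustration only (dry run): print the commands for an incremental unload.
# print_incremental_import is a hypothetical helper; it executes nothing.
print_incremental_import() {
  local job=$1 dir=$2
  cat <<EOF
# one-time setup: a saved job remembers the last imported NUM_BEG value
sqoop job --create $job -- import \\
  --connect jdbc:oracle:thin:@oraclehost:1521/SERVICENAME \\
  --username ausername --table ASCHEMA.LOG_TABLE \\
  --as-avrodatafile --compression-codec snappy \\
  --incremental append --check-column NUM_BEG --last-value 0 \\
  --target-dir $dir

# before every run: remove files that were already indexed, then unload
hdfs dfs -rm -r -f $dir
sqoop job --exec $job
EOF
}

print_incremental_import tm_log_import "/user/$USER/solrindir/tmlogavro"
```

After each run, the indexing command from the Launch section can be pointed at the same directory, which then contains only the newly unloaded records.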