The ability to search for information on the Internet is indispensable. You click the "Search" button in your favorite search engine and a second later you have an answer.
Most people never think about what happens behind the scenes, yet a search engine is not only a useful tool but also a complex piece of technology. Modern search systems draw on almost every advanced achievement of the computer industry: big data, graph and network theory, natural-language text analysis, machine learning, personalization, and ranking. Understanding how a search engine works gives a sense of the current level of technology, which is useful for any engineer.
In a series of articles I will explain step by step how a search engine works and, so as not to make unsupported claims, build a small search engine of my own for illustration. This engine is of course a teaching example and greatly simplifies what happens inside Google or Yandex, but I will try not to simplify it too much.
The first step is data collection, also known as crawling.
The web is a graph
The part of the Internet we are interested in consists of web pages. For a search system to find a particular web page in response to a user's query, it must know in advance that the page exists and that it contains information relevant to the query. Users usually learn about the existence of a web page from a search engine. But how does the search engine itself learn about it? After all, nobody is obliged to report a new page to it explicitly.
Fortunately, web pages do not exist in isolation: they link to one another. A search robot can follow these links and discover new web pages.
In fact, pages and the links between them form a data structure called a graph. By definition, a graph consists of vertices (here, web pages) and edges (here, the hyperlinks between them).
Other examples of graphs are a social network (people are the nodes, friendships are the edges), a road map (cities are the nodes, roads between cities are the edges), and even the set of all possible chess positions (each position is a vertex, and an edge joins two positions if one move leads from one to the other).
Graphs are directed or undirected, depending on whether the edges have a direction. A hyperlink can be followed in one direction only, so the Internet is a directed graph.
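As a small illustration of this definition, a directed web graph can be written as a mapping from each page to the pages it links to. This is only a sketch: the page names are made up, and `out_links` and `in_links` are hypothetical helpers, not part of the project's code.

```python
# Toy directed web graph: each key is a page, each value lists the pages it links to.
# All page names are invented for illustration.
web_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["a.html"],  # d links to a, but nothing links to d
}


def out_links(graph, page):
    """Pages reachable from `page` by following one hyperlink."""
    return graph.get(page, [])


def in_links(graph, page):
    """Pages that link to `page` (the same edges, viewed backwards)."""
    return sorted(src for src, targets in graph.items() if page in targets)


print(out_links(web_graph, "a.html"))  # ['b.html', 'c.html']
print(in_links(web_graph, "a.html"))   # ['c.html', 'd.html']
print(in_links(web_graph, "d.html"))   # []
```

Because the edges are directed, `a.html` linking to `b.html` says nothing about a link back, and a page such as `d.html` can have outgoing links without being discoverable from anywhere.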
For the rest of the discussion we assume that the Internet is a strongly connected graph, that is, starting from any point on the Internet you can reach any other point. This assumption is obviously false (I can easily create a new web page that nothing links to, so it cannot be reached), but it is acceptable here, because pages that nothing links to are of little interest for search.
A small part of the web graph:
Graph traversal algorithms: breadth-first and depth-first search
Depth-first search
There are two classic graph traversal algorithms. The first, simple and powerful, is called depth-first search (DFS). It is based on recursion and consists of the following sequence of actions:
- Mark the current vertex as processed.
- Process the current vertex (for a search robot, processing simply means saving a copy).
- For every vertex that can be reached from the current one, process it recursively if it has not been processed yet.
Python code that implements this approach literally:
    # get, save_html, get_filtered_links and START_URL are helpers not shown here
    seen_links = set()

    def dfs(url):
        seen_links.add(url)
        print('processing url ' + url)
        html = get(url)
        save_html(url, html)
        for link in get_filtered_links(url, html):
            if link not in seen_links:
                dfs(link)

    dfs(START_URL)
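The standard Linux utility wget does much the same when given the -r option, which tells it to download a site recursively (although, according to the GNU Wget manual, it walks HTTP links breadth-first rather than depth-first):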
wget -r habrahabr.ru
Depth-first search is perfectly adequate for crawling the pages of a small site, but it is not very convenient for crawling the whole Internet:
- The recursive calls it relies on do not parallelize well.
- With this implementation the algorithm goes deeper and deeper along the links, and sooner or later the recursion call stack is likely to run out of space, which results in a stack overflow error.
Both problems can be solved in general, but instead we will use another classic algorithm: breadth-first search.
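To illustrate how the stack-overflow problem alone could be handled, here is a sketch of depth-first traversal with an explicit stack instead of recursion. It runs on an in-memory toy graph rather than on real pages; `dfs_iterative` and the graphs are illustrative assumptions, not the project's code.

```python
def dfs_iterative(graph, start):
    """Depth-first traversal that keeps its own stack instead of recursing."""
    seen = set()
    stack = [start]
    order = []
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)  # "processing" a page here just records the visit
        # Push links in reverse so they are visited in their original order.
        for link in reversed(graph.get(page, [])):
            if link not in seen:
                stack.append(link)
    return order


small = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(dfs_iterative(small, "a"))  # ['a', 'b', 'd', 'c']

# A chain 0 -> 1 -> ... -> 50000, far deeper than Python's default recursion limit.
deep_chain = {i: [i + 1] for i in range(50_000)}
print(len(dfs_iterative(deep_chain, 0)))  # 50001
```

The stack now lives on the heap, so depth is limited by memory rather than by the interpreter's recursion limit. The parallelization problem remains, which is one reason to switch to a queue-based traversal.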
Breadth-first search
Breadth-first search (BFS) works much like depth-first search, but it visits the vertices of the graph in order of their distance from the start page. To do this, the algorithm uses a queue: a data structure in which elements are added at the end and taken from the front.
The algorithm can be described as follows:
1. Add the start vertex to the queue and to the set of "seen" vertices.
2. If the queue is not empty, take the next vertex from it for processing.
3. Process it.
4. For every edge from the processed vertex that leads to a vertex not yet in the "seen" set:
   - add that vertex to "seen";
   - add it to the queue.
5. Go to step 2.
Python code:
    from queue import Queue

    # get, save_html, get_filtered_links and START_URL are helpers not shown here
    def bfs(start_url):
        queue = Queue()
        queue.put(start_url)
        seen_links = {start_url}

        while not queue.empty():
            url = queue.get()
            print('processing url ' + url)
            html = get(url)
            save_html(url, html)
            for link in get_filtered_links(url, html):
                if link not in seen_links:
                    queue.put(link)
                    seen_links.add(link)

    bfs(START_URL)
The queue first holds the vertices one link away from the start page, then those two links away, then three, and so on. In other words, breadth-first search always reaches a vertex by a shortest path.
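The shortest-path property can be checked on a toy graph. The sketch below records, for each page, how many links separate it from the start page; `bfs_distances` and the graph are made up for illustration.

```python
from collections import deque


def bfs_distances(graph, start):
    """Return {page: number of links on a shortest path from start}."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in graph.get(page, []):
            if link not in distance:  # first visit is always via a shortest path
                distance[link] = distance[page] + 1
                queue.append(link)
    return distance


toy = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": ["e"],
    "e": [],
}
print(bfs_distances(toy, "start"))
# {'start': 0, 'a': 1, 'b': 1, 'c': 2, 'd': 2, 'e': 3}
```

Page `e` can be reached through several routes, but it is first discovered at distance 3, and later routes to it are ignored because it is already in the dictionary.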
One more important point: the queue and the set of "seen" vertices are used here only through a simple interface (add, take, check membership), so they can easily be moved to a separate server that talks to its clients through that interface. This makes a multi-threaded graph traversal possible: several processors can run at the same time, all using the same queue.
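A minimal single-machine sketch of that idea, with threads standing in for separate worker servers and a dictionary standing in for the web. The names `GRAPH` and `crawl_parallel` are assumptions made for this example; the real project uses a separate queue server.

```python
import threading
from queue import Queue

# Toy link graph standing in for the web (hypothetical page names).
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}


def crawl_parallel(start, num_workers=4):
    """Traverse GRAPH with several threads sharing one queue and one seen set."""
    queue = Queue()
    seen = {start}
    lock = threading.Lock()  # protects `seen` and `processed`
    processed = []
    queue.put(start)

    def worker():
        while True:
            page = queue.get()
            try:
                links = GRAPH.get(page, [])  # stands in for download + link extraction
                new_links = []
                with lock:
                    processed.append(page)
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            new_links.append(link)
                for link in new_links:
                    queue.put(link)
            finally:
                queue.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    queue.join()  # returns once every queued page has been processed
    return sorted(processed)


print(crawl_parallel("a"))  # ['a', 'b', 'c', 'd']
```

New links are put on the queue before the current page is marked done, so `queue.join()` cannot return while work is still pending. With several workers the processing order is no longer strictly breadth-first, which is acceptable for a crawler.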
Robots.txt
Before describing the actual implementation, I should point out that a well-behaved crawler respects the restrictions that a website's owner sets in the robots.txt file. For example, here is the robots.txt of lenta.ru:
    User-agent: YandexBot
    Allow: /rss/yandexfull/turbo

    User-agent: Yandex
    Disallow: /search
    Disallow: /check_ed
    Disallow: /auth
    Disallow: /my
    Host: https://lenta.ru

    User-agent: GoogleBot
    Disallow: /search
    Disallow: /check_ed
    Disallow: /auth
    Disallow: /my

    User-agent: *
    Disallow: /search
    Disallow: /check_ed
    Disallow: /auth
    Disallow: /my

    Sitemap: https://lenta.ru/sitemap.xml.gz
It defines several sections of the site that the Yandex robot, Google, and everyone else are forbidden to visit. To take robots.txt into account in Python, you can use the filter implementation included in the standard library:
    In [1]: from urllib.robotparser import RobotFileParser
       ...: rp = RobotFileParser()
       ...: rp.set_url('https://lenta.ru/robots.txt')
       ...: rp.read()
       ...:

    In [3]: rp.can_fetch('*', 'https://lenta.ru/news/2017/12/17/vivalarevolucion/')
    Out[3]: True

    In [4]: rp.can_fetch('*', 'https://lenta.ru/search?query=big%20data#size=10|sort=2|domain=1|modified,format=yyyy-MM-dd')
    Out[4]: False
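The same parser can also be given the rules directly through `parse()`, which makes it possible to try out a crawler's filter without network access. A small sketch: the rules and the example.com URLs below are made up, and `allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules in the same style as the file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /auth
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse in-memory lines instead of fetching a URL


def allowed(url, user_agent="*"):
    """True if the parsed rules permit `user_agent` to fetch `url`."""
    return rp.can_fetch(user_agent, url)


print(allowed("https://example.com/news/2017/12/17/some-article/"))  # True
print(allowed("https://example.com/search?query=big%20data"))        # False
```

In a crawler, a check like this would sit in the link-filtering step, so that disallowed URLs never reach the download queue.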
Implementation
So, we need to traverse the Internet and store it for further processing.
Of course, for a demonstration it is not possible to crawl and store the entire Internet, since that would be very expensive, but I will write the code with the expectation that it could in principle be scaled to the size of the whole Internet.
This means that we need to work on a large number of servers at the same time and save the results to some kind of storage from which they can easily be processed.
I chose Amazon Web Services as the basis of the solution, because it makes it easy to start a given number of machines, process the results, and save them to the Amazon S3 distributed storage. Similar services are offered by, for example, Google, Microsoft, and Yandex.
Architecture of the solution
The central element of my data collection scheme is the queue server. It stores the queue of URLs to be downloaded and processed, and the set of URLs that the workers have already "seen". In my implementation these are based on the simplest Python queue and set data structures.
In a real production system it would probably be worth replacing them with existing solutions: a queue such as Kafka, and distributed storage for the set, for example an in-memory key-value database such as Aerospike. That would give full horizontal scalability, but the load on the queue server is generally not very high, so at the small scale of my demo project it would not be worthwhile.
The worker servers periodically take a new batch of URLs to download (many at once, so as not to put unnecessary load on the queue), download the web pages, save them to S3, and add the newly found URLs to the download queue.
URLs are also added in batches to reduce the load (all the new URLs found on a web page are added in a single call). In addition, the set of "seen" URLs is periodically synchronized to the worker servers, so that pages that have already been added can be filtered out in advance on the worker side.
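The batching described above can be sketched with an in-memory class. This is an illustration only: `UrlQueue`, `get_batch` and `add_batch` are assumed names, and the real queue server is a separate network service whose interface is not shown in this article.

```python
from collections import deque


class UrlQueue:
    """In-memory URL queue plus "seen" set, with batch operations."""

    def __init__(self, start_urls):
        self.seen = set(start_urls)
        self.queue = deque(start_urls)

    def get_batch(self, size):
        """Hand out up to `size` URLs for one worker to download."""
        batch = []
        while self.queue and len(batch) < size:
            batch.append(self.queue.popleft())
        return batch

    def add_batch(self, urls):
        """Add all links found on one page in a single call; return how many were new."""
        added = 0
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)
                added += 1
        return added


q = UrlQueue(["https://example.com/"])
print(q.get_batch(10))  # ['https://example.com/']
print(q.add_batch(["https://example.com/a", "https://example.com/b", "https://example.com/"]))  # 2
print(q.get_batch(1))   # ['https://example.com/a']
```

One request per batch instead of one per URL keeps the number of round trips to the queue server small, which is why both taking and adding are grouped.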
The downloaded web pages are saved to distributed cloud storage (S3), which is convenient for distributed processing later on.
The queue periodically sends statistics on the number of added and processed requests to a statistics server. The statistics are sent both in total and separately for each worker node, so that it is clear whether the download is proceeding normally. Reading the logs of every individual worker machine is not feasible, so I monitor the behavior on charts. I chose Graphite as the solution for monitoring the download.
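For reference, Graphite's Carbon daemon accepts metrics over a plaintext protocol: one line per data point in the form `<path> <value> <timestamp>`, by default on TCP port 2003. The sketch below shows how per-worker counters could be formatted and sent; the metric names, `format_metric` and `send_metrics` are assumptions for illustration, not the project's actual code.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one data point as a line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host, port=2003):
    """Send already formatted lines to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric name: total plus per-worker counters could share a prefix.
line = format_metric("crawler.worker1.pages_downloaded", 42, timestamp=1513500000)
print(repr(line))  # 'crawler.worker1.pages_downloaded 42 1513500000\n'
```

Dotted metric paths such as `crawler.worker1...` and `crawler.total...` are what make it possible to plot each worker separately or all of them together.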
Launching the crawler
As I have already written, downloading the whole Internet would take enormous resources, so I limited myself to a small part of it: the sites habrahabr.ru and geektimes.ru. The limitation is fairly arbitrary, though, and extending the crawl to other sites is only a matter of how much hardware is available. To run the crawler, I wrote simple scripts that create a new cluster in the Amazon cloud, copy the source code there, and start the corresponding service:
    # deploy_queue.py
    from deploy import *

    def main():
        master_node = run_master_node()
        deploy_code(master_node)
        configure_python(master_node)
        setup_graphite(master_node)
        start_urlqueue(master_node)

    if __name__ == '__main__':
        main()
    # deploy_workers.py
    # run as: http://<queue_ip>:88889
    from deploy import *

    def main():
        master_node = run_master_node()
        deploy_code(master_node)
        configure_python(master_node)
        setup_graphite(master_node)
        start_urlqueue(master_node)

    if __name__ == '__main__':
        main()
The code of the deploy.py script contains all the functions that are called.
Using Graphite as the statistics tool makes it possible to draw nice charts.
The red graph shows the URLs discovered, the green graph shows the URLs downloaded, and the blue graph shows the URLs in the queue. Over the whole period, 5.5 million pages were downloaded.
The number of pages crawled per minute, broken down by worker node. The graphs are not interrupted, so the crawl is proceeding normally.
Results
Downloading habrahabr and geektimes took three days.
It could have been done much faster by running more worker instances on each machine and by adding more worker machines, but then the load on Habr itself would have been very high. Why cause problems for a favorite site?
Along the way I added several filters to the crawler and started excluding obviously unnecessary pages that are irrelevant to building a search engine.
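The article does not list the actual filters, so the following is only a guess at what such filtering might look like: keep links on the two crawled sites, drop fragments, and skip obvious non-page resources. `filter_links`, `ALLOWED_HOSTS` and `SKIPPED_EXTENSIONS` are hypothetical names, and the extension list is an assumption.

```python
from urllib.parse import urldefrag, urljoin, urlparse

ALLOWED_HOSTS = {"habrahabr.ru", "geektimes.ru"}
SKIPPED_EXTENSIONS = (".png", ".jpg", ".gif", ".css", ".js")  # assumed, not from the project


def filter_links(page_url, hrefs):
    """Resolve raw href values against page_url and keep only crawlable pages."""
    result = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # e.g. mailto: links
        if parts.hostname not in ALLOWED_HOSTS:
            continue  # stay on the sites being crawled
        if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
            continue  # images, styles, scripts
        if absolute not in result:
            result.append(absolute)
    return result


links = filter_links(
    "https://habrahabr.ru/post/1/",
    ["/post/2/#comments", "https://example.com/", "/images/logo.png",
     "mailto:someone@example.com", "/post/2/"],
)
print(links)  # ['https://habrahabr.ru/post/2/']
```

Stripping the fragment matters because `/post/2/` and `/post/2/#comments` are the same document, and without it the crawler would download the page twice.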
The crawler I built is a demo, but it is scalable in general and can be used to collect large amounts of data from many sites at once. In a production environment, though, it probably makes sense to rely on an existing crawling solution such as Heritrix. A real crawler also has to run periodically rather than just once, and has to implement many additional features that I have ignored so far.
While the crawler was running, I spent about 60 dollars on the Amazon cloud. In total, 5.5 million pages were downloaded, with a combined volume of 668 gigabytes.
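A few derived figures put these numbers in perspective. The calculation below is rough: it assumes decimal gigabytes, treats "three days" and "about 60 dollars" as exact, and spreads the download evenly over the whole period.

```python
pages = 5_500_000
total_bytes = 668 * 10**9  # 668 GB, assuming decimal gigabytes
days = 3
cost_usd = 60

avg_page_kb = total_bytes / pages / 1000
pages_per_second = pages / (days * 24 * 60 * 60)
usd_per_million_pages = cost_usd / (pages / 1_000_000)

print(round(avg_page_kb))               # 121
print(round(pages_per_second, 1))       # 21.2
print(round(usd_per_million_pages, 2))  # 10.91
```

So the crawl averaged roughly 121 KB per page, about 21 pages per second across all workers, and on the order of 11 dollars per million pages.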
In the next article of the series I will use big data technologies to build an index over the downloaded web pages and design the simplest possible engine for actually searching them.
The project code is available on GitHub.