Many resources on the web talk about the one true way to get data from a web page. In reality, there are several solutions and tools you can use for this.
- What are the options for programmatically getting data from a web page?
- What are the pros and cons of each approach?
- How can cloud resources be used to increase the degree of automation?
This article aims to help you answer these questions.
I assume you already know what HTTP requests, the DOM (Document Object Model), HTML, CSS selectors, and asynchronous JavaScript are.
If not, I recommend reading up on the theory first and then coming back to this article.
Static content
HTML source
Let's start with the simplest approach.
If you are planning to scrape a web page, this is the first method to try. It needs very little computing power and the least time to run.
However, it only works if the HTML source code contains the data you are after. To check this in Chrome, right-click the page and choose [View page source]. The HTML source code is then displayed.
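The same check can be done from code by searching the raw source for a piece of text that you can see on the rendered page. The following is only a minimal sketch: the helper name sourceContains and the sample strings are illustrative assumptions, not part of any library.

```javascript
// Returns true if the raw HTML source already contains the text we want.
// Whitespace is collapsed and case is ignored so markup line breaks do not matter.
function sourceContains(html, text) {
  const normalize = (s) => s.replace(/\s+/g, ' ').trim().toLowerCase();
  return normalize(html).includes(normalize(text));
}

// Usage sketch (needs network access and Node 18+ for the global fetch):
// fetch('https://example.com/')
//   .then((res) => res.text())
//   .then((html) => console.log(sourceContains(html, 'Example Domain')));
```

If this returns false for text that is clearly visible in the browser, the page is probably filling in its content with JavaScript, and the static approach will not be enough.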
Once you have found the data, write a CSS selector for the element that wraps it so you can refer to it later.
To implement this, send an HTTP GET request to the page URL; the response is the HTML source code.
In Node, you can use a tool called CheerioJS to parse the raw HTML and extract the data with a selector. The code looks something like this:
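const fetch = require('node-fetch'); // CommonJS require works with node-fetch v2
const cheerio = require('cheerio');

const url = 'https://example.com/';
const selector = '.example';

fetch(url)
  .then(res => res.text())          // raw HTML source as a string
  .then(html => {
    const $ = cheerio.load(html);   // parse the HTML
    const data = $(selector);       // select the wrapping element
    console.log(data.text());
  })
  .catch(err => console.error(err));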
Dynamic content
In many cases, you cannot get the information from the raw HTML because the DOM is manipulated by JavaScript running in the background. A typical example is an SPA (single-page application), where the HTML document contains minimal information and JavaScript fills it in at runtime.
The solution in this situation is to build the DOM and execute the scripts in the HTML source code, just as a browser does. The data can then be extracted from that object with selectors.
Headless browsers
A headless browser is the same as a normal browser but has no user interface. It runs in the background, and you control it programmatically instead of clicking with a mouse and typing on a keyboard.
Puppeteer is one of the most popular choices for headless browsing. It is an easy-to-use Node library that provides a high-level API for controlling Chrome in headless mode. It can also be configured to run non-headless, which is very handy during development. The following code does the same thing as before, but it also works with dynamic pages.
const puppeteer = require('puppeteer');

async function getData(url, selector) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // This callback runs inside the page context
    return await page.evaluate(selector => {
      return document.querySelector(selector).innerText;
    }, selector);
  } finally {
    await browser.close(); // close the browser even if something fails
  }
}

const url = 'https://example.com';
const selector = '.example';
getData(url, selector)
  .then(result => console.log(result));
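Running non-headless during development comes down to the options passed to puppeteer.launch. The sketch below assumes a hypothetical helper, buildLaunchOptions, and an arbitrary delay of 100 ms; headless and slowMo themselves are Puppeteer launch options.

```javascript
// Build Puppeteer launch options: headless by default, visible while debugging.
// `slowMo` delays each Puppeteer operation by the given number of milliseconds,
// which makes it easier to follow what the script is doing.
function buildLaunchOptions({ debug = false } = {}) {
  return debug ? { headless: false, slowMo: 100 } : { headless: true };
}

// Usage sketch (requires the puppeteer package):
// const browser = await puppeteer.launch(buildLaunchOptions({ debug: true }));
```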
Of course, you can do much more interesting things with Puppeteer, so check out the documentation. Here is a code snippet that navigates to a URL, takes a screenshot, and saves it:
const puppeteer = require('puppeteer');

async function takeScreenshot(url, path) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path }); // save the screenshot to the given file
  } finally {
    await browser.close();
  }
}

const url = 'https://example.com';
const path = 'example.png';
takeScreenshot(url, path);
A browser needs far more computing power than sending a simple GET request and parsing the response, so execution is relatively slow. On top of that, adding a browser as a dependency makes the package huge.
On the other hand, this method is very flexible. You can use it to navigate between pages, simulate clicks, mouse movements, and keyboard input, fill in forms, take screenshots or generate PDFs of pages, execute commands in the console, and select elements to extract their text content. Basically, anything you can do manually in a browser.
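To make a few of these interactions concrete, here is a rough sketch that fills in a search form, clicks a button, and reads a result. The function name searchAndRead and all three selectors are made-up placeholders; page is assumed to be a Puppeteer Page object created as in the earlier examples.

```javascript
// Fill in a search form, submit it, and read the first result.
// `page` is a Puppeteer Page; the selectors are placeholders for a real page.
async function searchAndRead(page, url, query) {
  await page.goto(url);
  await page.type('#search-input', query);  // simulate keyboard input
  await page.click('#search-button');       // simulate a mouse click
  await page.waitForSelector('.result');    // wait until the result is in the DOM
  return page.$eval('.result', (el) => el.textContent.trim());
}

// Usage sketch (requires the puppeteer package):
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
// console.log(await searchAndRead(page, 'https://example.com', 'cats'));
// await browser.close();
```

Taking the page as a parameter keeps the browser setup in one place and lets the same function be reused for several pages.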
Building the DOM
You might think it is overkill to simulate an entire browser just to build a DOM. In fact, it is, at least in certain situations.
Jsdom is a Node library that parses the HTML you pass it, just as a browser does. However, it is not a browser: it is a tool for building a DOM from given HTML source code while also executing the JavaScript code in that HTML.
Thanks to this abstraction, Jsdom can run faster than a headless browser. If it is faster, why not always use it instead of a headless browser?
To quote the documentation:
People often have trouble with asynchronous script loading when using jsdom. Many pages load scripts asynchronously, but there is no way to tell when they're done doing so, and thus when it's a good time to run your code and inspect the resulting DOM structure. This is a fundamental limitation.
A workaround is shown in the example: every 100 ms, it checks whether the element has appeared or the timeout (2 seconds) has passed.
Jsdom also throws error messages when the page uses a browser feature that it does not implement, such as "Error: Not implemented: window.alert..." or "Error: Not implemented: window.scrollTo...". This problem can also be solved with some workarounds (virtual consoles).
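Besides silencing the messages with a virtual console, another possible workaround is to replace the missing functions before the page's scripts run. The sketch below assumes a hypothetical helper, stubMissingApis, and an assumed list of function names; jsdom's beforeParse option, which receives the window before the HTML is parsed, is the hook it is meant for.

```javascript
// Replace browser APIs that jsdom does not implement with harmless no-ops.
// The default list of names is an assumption; extend it for the page you scrape.
function stubMissingApis(window, names = ['alert', 'scrollTo']) {
  for (const name of names) {
    window[name] = () => {};
  }
  return window;
}

// Usage sketch (requires the jsdom package):
// const dom = await JSDOM.fromURL(url, {
//   runScripts: 'dangerously',
//   beforeParse: stubMissingApis,
// });
```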
It is generally a lower-level API than Puppeteer, so you have to implement some things yourself.
As the example shows, this makes it a little more complicated to use. In return, Jsdom offers a fast solution for the same job.
Let's look at the same example again, this time using Jsdom.
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

async function getData(url, selector, timeout) {
  // Keep jsdom's "not implemented" errors out of the console output
  const virtualConsole = new jsdom.VirtualConsole();
  virtualConsole.sendTo(console, { omitJSDOMErrors: true });
  const dom = await JSDOM.fromURL(url, {
    runScripts: "dangerously", // execute the page's scripts
    resources: "usable",       // load external scripts as well
    virtualConsole
  });
  try {
    // Poll every 100 ms until the element appears or the timeout expires
    return await new Promise((res, rej) => {
      const started = Date.now();
      const timer = setInterval(() => {
        const element = dom.window.document.querySelector(selector);
        if (element) {
          clearInterval(timer);
          res(element.textContent);
        } else if (Date.now() - started > timeout) {
          clearInterval(timer);
          rej(new Error("Timed out"));
        }
      }, 100);
    });
  } finally {
    dom.window.close(); // always release the window
  }
}

const url = "https://example.com/";
const selector = ".example";
getData(url, selector, 2000).then(result => console.log(result));
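The polling loop is not specific to Jsdom, so it can be pulled out into a small helper and reused. This is just one way to write it; the name waitFor and the default values (2000 ms timeout, 100 ms interval) are assumptions taken from the numbers used in this article.

```javascript
// Poll `check` every `interval` ms until it returns a truthy value,
// or reject once more than `timeout` ms have passed.
function waitFor(check, timeout = 2000, interval = 100) {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    const timer = setInterval(() => {
      let value;
      try {
        value = check();
      } catch (err) {
        clearInterval(timer);
        reject(err); // a throwing check ends the wait immediately
        return;
      }
      if (value) {
        clearInterval(timer);
        resolve(value);
      } else if (Date.now() - started > timeout) {
        clearInterval(timer);
        reject(new Error('Timed out'));
      }
    }, interval);
  });
}

// Usage sketch with a jsdom instance named `dom`:
// const el = await waitFor(() => dom.window.document.querySelector('.example'));
```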
Reverse engineering
Jsdom is a fast and easy solution, but things can be made even simpler.
Do we even need to simulate the DOM?
The web page you want to scrape is made of the same HTML, the same JavaScript, and the same technologies you already know. So if you find the piece of code that produces the target data, you can repeat the same operation and get the same result.
To simplify things, the data you are looking for can be:
- part of the HTML source code (as in the first section of this article),
- part of a static file referenced in the HTML document (for example, a line in a JavaScript file),
- a response to a network request (for example, some JavaScript code sent an AJAX request to a server, which responded with a JSON string).
All of these data sources can be accessed with network requests. It does not matter whether the web page uses HTTP, WebSockets, or some other communication protocol, because all of them are reproducible in theory.
Once you have found the resource that contains the data, you can send a similar network request to the same server as the original page does. The response contains the target data, which can easily be extracted with regular expressions, string methods, JSON.parse, and so on.
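As a rough illustration of that extraction step, the sketch below covers two cases: a response body that is already JSON, and a script file in which the data is assigned to a variable. The function names and the variable name window.__DATA__ are hypothetical, and the regular expression only handles object literals that are valid JSON and do not contain "};" inside a string.

```javascript
// Case 1: the response body is JSON, so JSON.parse is enough.
function parseJsonBody(body) {
  return JSON.parse(body);
}

// Case 2: the data is assigned to a variable inside a script file.
// `variableName` is whatever name the page happens to use.
function extractAssignedJson(source, variableName) {
  // Escape regex metacharacters in the variable name (e.g. the dot in "window.x")
  const escaped = variableName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Capture the shortest {...} that is followed by a semicolon
  const pattern = new RegExp(escaped + '\\s*=\\s*(\\{[\\s\\S]*?\\})\\s*;');
  const match = source.match(pattern);
  return match ? JSON.parse(match[1]) : null;
}

// Usage sketch:
// const script = 'window.__DATA__ = {"items":[{"id":1}]}; init();';
// console.log(extractAssignedJson(script, 'window.__DATA__').items.length);
```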
Simply put, instead of processing and loading everything, you can go straight to the resource where the data lives. The problem from the previous examples can then be solved with a single HTTP request instead of controlling a browser or a complex JavaScript object.
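Under the assumption that the data comes from a JSON endpoint, the whole job might shrink to something like the following. The endpoint URL and the pick function are placeholders for whatever you discover while inspecting the page's requests; fetchImpl defaults to the global fetch available in Node 18 and later.

```javascript
// Fetch a JSON resource directly and pick the wanted value out of it.
// `pick` receives the parsed JSON and returns the part you are interested in.
async function getData(url, pick, fetchImpl = fetch) {
  const res = await fetchImpl(url, { headers: { Accept: 'application/json' } });
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return pick(await res.json());
}

// Usage sketch with a hypothetical endpoint:
// getData('https://example.com/api/items', (json) => json.items[0].name)
//   .then((name) => console.log(name));
```

Some servers only answer when certain headers or cookies are present, so in practice the request may need to mirror the original one more closely.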
This solution looks simple in theory, but in most cases it is time-consuming and requires experience with web pages and servers.
A good place to start is observing the network traffic. The best tool for this is the [Network] tab in Chrome DevTools. It shows every outgoing request and its response (including static files, AJAX requests, and so on), so you can go through them one by one and look for the data.
The process gets slower if the response is modified by code before it is displayed on the screen. In that case, you have to find that piece of code and understand what is going on.
As you can see, this approach may require far more work than the methods above. On the other hand, it provides the best performance.
This chart shows the required run time and package size compared with Jsdom and Puppeteer.
The results are not based on precise measurements and may vary from case to case, but they show the approximate difference between these methods.
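If you want numbers for your own target page, a small timing helper is enough to compare the approaches. The following is a minimal sketch; the name averageRuntime and the default of five runs are arbitrary choices.

```javascript
const { performance } = require('perf_hooks');

// Run an async function `runs` times and return the average duration in ms.
async function averageRuntime(fn, runs = 5) {
  let total = 0;
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    total += performance.now() - start;
  }
  return total / runs;
}

// Usage sketch: time one of the getData implementations from this article.
// averageRuntime(() => getData(url, selector)).then((ms) => console.log(ms));
```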
Cloud service integration
Suppose you have implemented one of these solutions. One way to run the script is to turn on your computer, open a terminal, and start it manually.
But this quickly becomes tedious and inefficient, so it would be better to just upload the script to a server and have it run the code regularly according to its configuration.
This can be done by running an actual server and configuring when the script should be executed. Otherwise, cloud functions are an easy way to go.
A cloud function is a container designed to execute the uploaded code when an event occurs. This means you do not have to manage servers; the cloud provider does that automatically.
The trigger can be a schedule, a network request, or many other events. You can save the collected data in a database, write it to a Google Sheet, or send it by email. It all depends on your imagination.
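As an illustration of a network-request trigger, here is a rough sketch of a handler in the (req, res) style used by HTTP-triggered functions on Google Cloud Functions. The function name scrape, the url query parameter, and the stubbed getData are assumptions made so that the sketch runs on its own; in a real function, getData would be one of the scraping functions from this article.

```javascript
// Placeholder for any of the scraping functions shown in this article.
async function getData(url) {
  return `scraped content of ${url}`; // stub so the sketch is self-contained
}

// HTTP-triggered handler: reads ?url=... and responds with the scraped data.
async function scrape(req, res) {
  const url = req.query.url;
  if (!url) {
    res.status(400).json({ error: 'Missing "url" query parameter' });
    return;
  }
  try {
    const data = await getData(url);
    res.status(200).json({ url, data });
  } catch (err) {
    res.status(500).json({ error: String(err) });
  }
}

// When deploying, export the handler, for example: exports.scrape = scrape;
```

Instead of returning the data in the response, the handler could just as well store it in a database or send it on by email.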
Popular cloud providers are Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.
These services can be used for free, though not indefinitely.
If you use Puppeteer, Google Cloud Functions is the simplest solution. The zipped package size of headless Chrome (about 130 MB) exceeds AWS Lambda's maximum allowed archive size (50 MB). There are ways to make it work on Lambda, but GCP functions support headless Chrome by default, so you only need to include Puppeteer as a dependency in package.json.
To learn more about cloud functions in general, look up information on serverless architectures. Many excellent guides have already been written on the subject, and most providers have easy-to-follow documentation.