Web scraping tools are designed to extract data from websites. They are useful for anyone who needs to collect data from the internet. Web scraping is a technique that lets you obtain data without opening many pages and copying and pasting by hand. With these tools you can fetch new or updated data manually or automatically and save it for later use. For example, a web scraping tool can extract product and price information from an online store.
Possible scenarios for using web scraping tools:
- Collecting data for marketing research.
- Extracting contact information (email addresses, phone numbers, and so on) from various sites to build your own lists of suppliers, manufacturers, or other parties of interest.
- Downloading solutions from StackOverflow (or other similar Q&A sites) so that data from various sites can be read or stored offline, reducing dependence on internet access.
- Searching for jobs or vacancies.
- Tracking product prices in different stores.
There are many tools that extract data from websites without writing a single line of code (see "10 Web Scraping Tools to Extract Online Data"). A tool can be a standalone application, a website, or a browser plugin. Before writing your own web scraper, it is worth studying the existing tools. At the very least they are useful because many of them come with very good video tutorials that explain how everything works.
A web scraper can be written in Python ("Web scraping with Python") or in R ("Examples of using R to solve practical business problems").
I will write mine in C# (although I believe the approach does not depend on the development language). I will try to pay special attention to the annoying mistakes I made by believing that everything would work simply and easily.
You can read about why I did this and how I use the result here:
- Downloading data from the open data site data.gov.ru
- Analysis of datasets from the open data portal data.gov.ru
- Using datasets from the Russian open data portal data.gov.ru
And in brief:
I was curious how useful the open data published on data.gov.ru really is, and the API and links available on the portal itself do not allow downloading the relevant data.
So I extract the information about datasets from the Russian open data portal data.gov.ru and save it as a plain text file in csv format for further processing. The datasets are shown as a paginated list, and each list item contains brief information about one dataset.
To get the detailed information, you have to follow a link.
Therefore, to obtain the information about the datasets, you need to:
- Walk through all the pages that list datasets.
- Extract the brief information about each dataset and the link to the page with the full information.
- Open each page with the full information.
- Extract the full information from that page.
What I am not going to do is load pages one by one with HttpClient or WebRequest and parse them myself.
I use the ScrapySharp framework. ScrapySharp has a built-in web client that can emulate a real web browser. It also makes it easy to parse HTML with CSS selectors and LINQ. The framework is an add-on to HtmlAgilityPack. As an alternative, you could consider AngleSharp, for example.
To start using ScrapySharp, it is enough to add the corresponding NuGet package.
After that, the built-in web browser can be used to load a page.
ScrapingBrowser browser = new ScrapingBrowser();
WebPage page = browser.NavigateToPage(new Uri("http://data.gov.ru/opendata/"));
The page is returned as an object of type WebPage. The page itself is represented as a set of nodes of type HtmlNode. The InnerHtml property gives the HTML code of an element, and InnerText gives the text inside it.
In essence, to extract the information you need, you have to find the right page element and take the text from it.
Question: how do you look at the page source and find the element you need?
You can simply view the page source in the browser. You can also use a tool such as Fiddler, as some articles recommend.

However, I found it more convenient to use the Developer Tools in Google Chrome.

To make analyzing the markup easier, I installed the XPath Helper extension for Chrome. It becomes clear almost immediately that every item in the list has the same CSS class, .node-dataset. To verify this, you can search for the CSS class with one of the console functions.
The class is found 30 times on the page, which corresponds exactly to the list items that contain the brief dataset information.
With ScrapySharp, all list items that have the .node-dataset class are obtained like this:
var Table1 = page.Html.CssSelect(".node-dataset");
Then all the div elements that contain the text are extracted from a list item:
var divs = item.SelectNodes("div");
In fact, there are many ways to get at the data you need, and I followed the principle "if it works, don't touch it".
For example, the link to the extended information can be taken from the about attribute.
<div rel="dc:hasPart" about="/opendata/1435111685-maininfo" typeof="sioc:Item foaf:Document dcat:Dataset" class="ds-1col node node-dataset node-teaser gosudarstvo view-mode-teaser clearfix" property="dc:title" content=" () ()">
In ScrapySharp this can be done as follows:
String link = item.Attributes["about"]?.Value;
In fact, this is all that is needed to extract the dataset information from the list.
Everything looks as if it works, but it does not.
Mistake #1. Assuming that the data is always uniform and contains no errors (I have already written about the quality of the downloaded data in "Analysis of datasets from the open data portal data.gov.ru").
For example, some datasets carry the text "Recommended". I do not need this information, so I had to add a check.
if (innerText != "") { items.Add(innerText); }
I store the extracted information in an ordinary list.
List<string> items = new List<string>();
The reason is that the data contains errors. Had I built an elaborate structure, I would have had to handle those errors right away. I took the simpler route and saved the data to a csv file. What happens to it afterwards does not worry me much for now.
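Saving scraped text to csv needs a little care, because a value may itself contain the separator, quotes, or line breaks. The following is only a minimal sketch, not the code used for this article: the class CsvSketch, its method names, and the choice of ';' as the separator are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: none of these names come from the article.
public static class CsvSketch
{
    // Quotes a field when it contains the separator, a quote, or a line break,
    // doubling any embedded quotes (the usual csv convention).
    public static string Escape(string field, char separator = ';')
    {
        if (field == null) return "";
        // Scraped text often carries stray leading and trailing whitespace.
        string value = field.Trim();
        bool mustQuote = value.IndexOfAny(new[] { separator, '"', '\r', '\n' }) >= 0;
        return mustQuote ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
    }

    // Joins already-extracted fields into one csv line.
    public static string ToLine(IEnumerable<string> fields, char separator = ';')
    {
        return string.Join(separator.ToString(), fields.Select(f => Escape(f, separator)));
    }

    // Appends one row to the file at path (assumed to be writable).
    public static void AppendRow(string path, IEnumerable<string> fields)
    {
        File.AppendAllText(path, ToLine(fields) + Environment.NewLine, Encoding.UTF8);
    }
}
```

With a list such as items, a call like CsvSketch.AppendRow("datasets.csv", items) would write one dataset per line; the file name here is just a placeholder.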
To walk through all the pages, I did not reinvent the wheel. Look at the structure of the pager links:
<div class="item-list"><ul class="pager">
<li class="pager-first first"><a title=" " href="/opendata?query=">« </a></li>
<li class="pager-previous"><a title=" " href="/opendata?query=&page=32">‹ </a></li>
<li class="pager-item"><a title=" 30" href="/opendata?query=&page=29">30</a></li>
<li class="pager-item"><a title=" 31" href="/opendata?query=&page=30">31</a></li>
…
<li class="pager-item"><a title=" 37" href="/opendata?query=&page=36">37</a></li>
<li class="pager-item"><a title=" 38" href="/opendata?query=&page=37">38</a></li>
<li class="pager-next"><a title=" " href="/opendata?query=&page=34"> ›</a></li>
<li class="pager-last last"><a title=" " href="/opendata?query=&page=423"> »</a></li>
</ul></div>
Any page can be reached with a direct link:
http://data.gov.ru/opendata?query=&page={0}
Of course, you could look for the link to the next page instead, but then how would you request pages in parallel? All that is needed is to find out how many pages there are in total.
WebPage page = _Browser.NavigateToPage(new Uri("http://data.gov.ru/opendata"));
var lastPageLink = page.Html.SelectSingleNode("//li[@class='pager-last last']/a");
if (lastPageLink != null)
{
    string href = lastPageLink.Attributes["href"].Value;
    …
A simple XPath query is used to get the required element.
I tested it beforehand in the Chrome Developer Tools console.
Now it is possible to go through all the pages, extract the brief information about each dataset, and get the link to the page with the full information (the dataset passport).
Mistake #2. Assuming that the server always returns the requested page.
It is naive to believe that everything will work as intended. The internet connection may drop, the last page may be deleted (this happened to me), or the server may decide it is under a DDoS attack. And yes, at some point the server stopped responding to me: too many requests.
To cope with these errors, I used the following strategy:
- If the server does not return a page, repeat the request n times (not indefinitely, since the page may no longer exist).
- If the server did not return a page, do not request it again immediately; wait k milliseconds first, and increase that wait on each further error for the same page.
- Do not request all pages at once; request them with a small delay between requests.
Only this way did I actually manage to get all the pages.
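The strategy above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the code behind the article: the class RetrySketch, its method names, and the parameters maxAttempts, initialDelayMs, and pauseMs are hypothetical, and doubling the delay is only one possible way to "increase" it. The page loader is passed in as a delegate so that the sketch does not depend on any particular library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper: none of these names come from the article.
public static class RetrySketch
{
    // Calls fetch up to maxAttempts times. After a failure it waits
    // (initialDelayMs at first, doubled after every further failure)
    // before trying again. Returns null when every attempt failed.
    public static async Task<T> FetchWithRetryAsync<T>(
        Func<Task<T>> fetch, int maxAttempts, int initialDelayMs) where T : class
    {
        int delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                T result = await fetch();
                if (result != null) return result;
            }
            catch (Exception)
            {
                // Any failure counts as "no page"; a real scraper would log it here.
            }
            if (attempt < maxAttempts)
            {
                await Task.Delay(delayMs);
                delayMs *= 2; // assumption: doubling is how the delay grows
            }
        }
        return null;
    }

    // Requests the pages one after another with a short pause between requests.
    public static async Task<List<T>> FetchAllAsync<T>(
        IEnumerable<string> urls, Func<string, Task<T>> fetchPage,
        int maxAttempts, int initialDelayMs, int pauseMs) where T : class
    {
        var pages = new List<T>();
        foreach (string url in urls)
        {
            T page = await FetchWithRetryAsync<T>(() => fetchPage(url), maxAttempts, initialDelayMs);
            if (page != null) pages.Add(page); // pages that never loaded are skipped
            await Task.Delay(pauseMs);
        }
        return pages;
    }
}
```

With ScrapySharp, the fetchPage delegate could, for instance, wrap the NavigateToPage call shown earlier in Task.Run. A bounded number of attempts matters here because a page that has been deleted will never come back, no matter how long the scraper waits.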
Getting the dataset passport turned out to be a simple task. All the information was in a table, and I only had to extract the text from the right-hand column.
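List<string> passport = new List<string>();
var table = page.Html.CssSelect(".sticky-enabled").FirstOrDefault();
if (table != null)
{
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var rows = table.SelectNodes("tbody/tr");
    if (rows != null)
    {
        foreach (var row in rows)
        {
            // td[2] is the right-hand column, which holds the value.
            var cell = row.SelectSingleNode("td[2]");
            if (cell != null) { passport.Add(cell.InnerText); }
        }
    }
}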
Each dataset has a rating, which is determined by the votes of portal users. The score is not in the table but in a separate p tag.
To get the score, find the p tag with the .vote-current-score class.
var score = PageResult.Html.SelectSingleNode("//p[@class='vote-current-score']");
The problem is solved. The data has been obtained and can be saved to a text file.
To test the resulting web scraper properly, I wrapped it in a simple REST service that starts a background worker process internally.
Then I deployed it to Azure.
To make the process easier to control, I added a simple interface.
The service fetches the data and saves it as a file. In addition, it compares the extracted data with the previous version and stores information about the number of datasets that were added, deleted, or changed.
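The comparison with the previous version could look roughly like the sketch below. It is an assumption about how such a comparison might be done, not the service's actual code: the class SnapshotDiffSketch is a hypothetical name, and it is assumed that each dataset is identified by a stable key (for example, the link taken from the about attribute) and stored as one text row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: none of these names come from the article.
public static class SnapshotDiffSketch
{
    public sealed class Summary
    {
        public int Added;
        public int Removed;
        public int Changed;
    }

    // Both dictionaries map a dataset key to its saved row.
    public static Summary Compare(
        IDictionary<string, string> previous, IDictionary<string, string> current)
    {
        var summary = new Summary();
        foreach (var pair in current)
        {
            string oldRow;
            if (!previous.TryGetValue(pair.Key, out oldRow))
            {
                summary.Added++;    // key is new in the current snapshot
            }
            else if (!string.Equals(oldRow, pair.Value, StringComparison.Ordinal))
            {
                summary.Changed++;  // same key, different content
            }
        }
        // Keys that existed before but are missing now count as removed.
        summary.Removed = previous.Keys.Count(key => !current.ContainsKey(key));
        return summary;
    }
}
```

Comparing whole rows as strings is deliberately crude: any difference in a row, even in whitespace, is reported as a change.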
Conclusion
Writing a web scraper is not a difficult task.
To write one, it is enough to understand what HTML is and how CSS selectors and XPath are used.
There are ready-made frameworks that make the task much easier and let you concentrate directly on extracting the data.
The Google Chrome Developer Tools are sufficient for working out what to extract and how.
There are many ways to extract the data, and if the result is achieved, all of them are correct.