What is Grab?
It is a library for parsing web sites. Its main features:
- preparing network requests (cookies, HTTP headers, POST/GET data)
- requesting the server (going through an HTTP/SOCKS proxy is possible)
- receiving the server response and its initial processing (parsing the headers, parsing the cookies, determining the document encoding, handling redirects; redirects via the meta refresh tag are supported as well)
- working with the DOM tree of the response (for HTML documents)
- working with forms (filling in, auto-filling)
- debugging: logging the process to the console, logging network requests and responses to files
Below I describe each item in more detail. First, let's talk about initializing the working object and preparing a network request. Here is an example of code that requests a page from Yandex and saves it to a file:
>>> g = Grab(log_file='out.html')
>>> g.go('http://yandex.ru')
Actually, the `log_file` parameter is meant for debugging: it specifies where to save the response body for later inspection. But nothing stops you from using it to download a file.
We have seen one way to configure a Grab object: through its constructor. There are several variations of the same code:
>>> g = Grab()
>>> g.setup(url='http://yandex.ru', log_file='out.html')
>>> g.request()
or
>>> g = Grab()
>>> g.go('http://yandex.ru', log_file='out.html')
or, the shortest form:
>>> Grab(log_file='out.html').go('http://yandex.ru')
To summarize: Grab can be configured through the constructor, through the `setup` method, or through the `go` and `request` methods. With the `go` method, the requested URL can be passed as a positional argument; in the other cases it must be passed as a named argument. The difference between `go` and `request` is that `go` requires the URL as its first parameter, while `request` requires nothing and simply uses the URL configured earlier.
Besides the log_file option there is a log_dir option, which makes debugging a multi-step parser much easier:
>>> import logging
>>> from grab import Grab
>>> logging.basicConfig(level=logging.DEBUG)
>>> g = Grab()
>>> g.setup(log_dir='log/grab')
>>> g.go('http://yandex.ru')
DEBUG:grab:[02] GET http://yandex.ru
>>> g.setup(post={'hi': u', !'})
>>> g.request()
DEBUG:grab:[03] POST http://yandex.ru
Note that each request received a number. The response to each request was written to the file /tmp/[number].html, and a /tmp/[number].log file containing the HTTP headers of the response was created as well. So what does the code above do? It goes to the Yandex front page and then makes a pointless POST request to the same page. Note that no URL is given in the second request: by default, the URL of the previous request is used.
Let's look at another Grab debugging option.
>>> g = Grab()
>>> g.setup(debug=True)
>>> g.go('http://youporn.com')
>>> g.request_headers
{'Accept-Language': 'en-us;q=0.9,en,ru;q=0.3', 'Accept-Encoding': 'gzip', 'Keep-Alive': '300', 'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.3', 'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.2) Gecko/2008091620 Firefox/3.0.2', 'Accept-Charset': 'utf-8,windows-1251;q=0.7,*;q=0.7', 'Host': 'www.youporn.com'}
We made a request to youporn.com. With the debug option you can save the headers of outgoing requests. If something is not working as expected, you can check exactly what is being sent to the server. The request_headers attribute contains a dictionary with the keys and values of the HTTP headers of the request.
Now let's look at the basic options for composing requests.
HTTP request methods
POST request. Very simple: in the `post` option, specify a dictionary of keys and values. Grab automatically changes the request type to POST.
>>> g = Grab()
>>> g.setup(post={'act': 'login', 'redirec_url': '', 'captcha': '', 'login': 'root', 'password': '123'})
>>> g.go('http://habrahabr.ru/ajax/auth/')
>>> print g.xpath_text('//error')
GET request. If neither POST data nor a request method has been set explicitly, Grab issues a GET request.
PUT, DELETE and HEAD methods. In theory, everything should work if you set the option method='delete', method='put' or method='head'. In practice I have hardly used these methods, so I cannot vouch for how well they work.
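If you do want to try them, a minimal sketch could look like this (the hammer-mode example later in the article passes method='head' to `go` in the same way; the URL here is only an illustration):
>>> g = Grab()
>>> # fetch only the headers, not the body
>>> g.go('http://yandex.ru', method='head')
>>> # the response headers are then available as usual
>>> g.response.headers.get('Content-Length')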
An important note about POST requests. Grab is designed to remember all the options you set and to reuse them in subsequent requests. The only option it does not remember is the `post` option. If it did remember it, then in the following example the POST request would also be sent to the second URL, which is almost certainly not what you want.
>>> g.setup(post={'login': 'root', 'password': '123'})
>>> g.go('http://example.com/login')
>>> g.go('http://example.com/news/recent')
Configuring HTTP headers
Now let's see how to configure the HTTP headers that are sent: just pass a dictionary of headers in the `headers` option. By default, Grab generates browser-like headers such as Accept, Accept-Language, Accept-Charset and Keep-Alive. They too can be changed through the `headers` option:
>>> g = Grab()
>>> g.setup(headers={'Accept-Encoding': ''})
>>> g.go('http://digg.com')
>>> print g.response.headers.get('Content-Encoding')
None
>>> g.setup(headers={'Accept-Encoding': 'gzip'})
>>> g.go('http://digg.com')
>>> print g.response.headers['Content-Encoding']
gzip
Working with cookies
By default, Grab saves the cookies it receives and sends them with subsequent requests. You get emulation of a user session out of the box. If you do not need this, disable the `reuse_cookies` option. You can set cookies manually with the `cookies` option; it should contain a dictionary, which is processed similarly to the data passed in the `post` option.
>>> g.setup(cookies={'secureid': '234287a68s7df8asd6f'})
With the `cookiefile` option you can specify a file to use as cookie storage. This way cookies can be preserved between runs of your program.
At any moment you can write the cookies of a Grab object to a file with the dump_cookies method, or load them from a file with the load_cookies method. To clear the cookies of a Grab object, use the `clear_cookies` method.
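A rough sketch of how these pieces might fit together (the file name is arbitrary, and the calls assume the methods take a file path exactly as their names suggest):
>>> g = Grab()
>>> g.setup(cookiefile='cookies.txt')   # persist cookies between program runs
>>> g.go('http://habrahabr.ru')
>>> g.dump_cookies('cookies.txt')       # explicitly write the current cookies to a file
>>> g.clear_cookies()                   # forget all cookies stored in this Grab object
>>> g.load_cookies('cookies.txt')       # restore them from the file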
User-Agent
By default, Grab pretends to be a real browser. It has a list of various User-Agent strings, one of which is picked at random when the Grab object is created. Of course, you can set your own User-Agent with the `user_agent` option.
>>> from grab import Grab
>>> g = Grab()
>>> g.go('http://whatsmyuseragent.com/')
>>> g.xpath('//td[contains(./h3/text(), "Your User Agent")]').text_content()
'The Elements of Your User Agent String Are:\nMozilla/5.0\r\nWindows\r\nU\r\nWindows\r\nNT\r\n5.1\r\nen\r\nrv\r\n1.9.0.1\r\nGecko/2008070208\r\nFirefox/3.0.1'
>>> g.setup(user_agent='Porn-Parser')
>>> g.go('http://whatsmyuseragent.com/')
>>> g.xpath('//td[contains(./h3/text(), "Your User Agent")]').text_content()
'The Elements of Your User Agent String Are:\nPorn-Parser'
Working with a proxy server
Everything is trivial. In the proxy option you pass the proxy address in the form "server:port", and in the proxy_type option you pass its type: "http", "socks4" or "socks5". If the proxy requires authorization, use the proxy_userpwd option, whose value has the form "user:password".
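Putting the three options together (a sketch; the proxy address and credentials here are placeholders):
>>> g = Grab()
>>> g.setup(proxy='proxy.example.com:8080',    # "server:port"
...         proxy_type='http',                 # or 'socks4' / 'socks5'
...         proxy_userpwd='user:password')     # only needed if the proxy requires auth
>>> g.go('http://google.com')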
The simplest proxy checker, based on a Google search:
>>> from grab import Grab, GrabError
>>> from urllib import quote
>>> import re
>>> g = Grab()
>>> g.go('http://www.google.ru/search?num=100&q=' + quote('free proxy +":8080"'))
>>> rex = re.compile(r'(?:(?:[-a-z0-9]+\.)+)[a-z0-9]+:\d{2,4}')
>>> for proxy in rex.findall(g.drop_space(g.css_text('body'))):
...     g.setup(proxy=proxy, proxy_type='http', connect_timeout=5, timeout=5)
...     try:
...         g.go('http://google.com')
...     except GrabError:
...         print proxy, 'FAIL'
...     else:
...         print proxy, 'OK'
...
210.158.6.201:8080 FAIL
...
proxy2.com:80 OK
...
210.107.100.251:8080 OK
...
Working with the response
Suppose we have made a network request with Grab. What next? The `go` and `request` methods return a Response object, which is also available through the `response` attribute of the Grab object. The following attributes and methods of the Response object may be of interest: code, body, headers, url, cookies, charset.
- code: the HTTP response code. If it differs from 200, no exception is raised; keep that in mind.
- body: the actual response body, without the HTTP headers
- headers: the headers, as a dictionary
- url: can differ from the original URL if there was a redirect
- cookies: the cookies, as a dictionary
- charset: the document encoding; it is looked for in the META tag of the document, in the Content-Type HTTP response header, and in the xml declaration of XML documents
The Grab object also has a `response_unicode_body` method, which returns the response body converted to unicode; note that HTML entities are not converted into the corresponding unicode characters.
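For example (a sketch that assumes the method is called on the Grab object exactly as named above):
>>> g = Grab()
>>> g.go('http://habrahabr.ru')
>>> text = g.response_unicode_body()  # the body decoded to unicode; HTML entities are left as-is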
The Response object of the last request is always stored in the `response` attribute of the Grab object.
>>> g = Grab()
>>> g.go('http://aport.ru')
>>> g.response.code
200
>>> g.response.cookies
{'aportuid': 'AAAAGU5gdfAAABRJAwMFAg=='}
>>> g.response.headers['Set-Cookie']
'aportuid=AAAAGU5gdfAAABRJAwMFAg==; path=/; domain=.aport.ru; expires=Wed, 01-Sep-21 18:21:36 GMT'
>>> g.response.charset
'windows-1251'
Working with the response text (grab.ext.text extension)
The search method checks whether the given string is present in the response body; the search_rex method takes a regular-expression object as its parameter. The assert_substring and assert_rex methods raise a DataNotFound exception if the argument is not found. The extension also provides the methods "find_number", which finds the first occurrence of a number, "drop_space", which removes all whitespace, and "normalize_space", which replaces sequences of whitespace with a single space.
>>> g = Grab()
>>> g.go('http://habrahabr.ru')
>>> g.search(u'Google')
True
>>> g.search(u'')
False
>>> g.search(u'')
False
>>> g.search(u'')
False
>>> g.search(u'')
True
>>> g.search('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "grab/ext/text.py", line 37, in search
    raise GrabMisuseError('The anchor should be byte string in non-byte mode')
grab.grab.GrabMisuseError: The anchor should be byte string in non-byte mode
>>> g.search('', byte=True)
True
>>> import re
>>> g.search_rex(re.compile('Google'))
<_sre.SRE_Match object at 0xb6b0a6b0>
>>> g.search_rex(re.compile('Google\s+\w+', re.U))
<_sre.SRE_Match object at 0xb6b0a6e8>
>>> g.search_rex(re.compile('Google\s+\w+', re.U)).group(0)
u'Google Chrome'
>>> g.assert_substring(' ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "grab/ext/text.py", line 62, in assert_substring
    if not self.search(anchor, byte=byte):
  File "grab/ext/text.py", line 37, in search
    raise GrabMisuseError('The anchor should be byte string in non-byte mode')
grab.grab.GrabMisuseError: The anchor should be byte string in non-byte mode
>>> g.assert_substring(u' ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "grab/ext/text.py", line 63, in assert_substring
    raise DataNotFound('Substring not found: %s' % anchor)
grab.grab.DataNotFound
>>> g.drop_spaces('foo bar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Grab' object has no attribute 'drop_spaces'
>>> g.drop_space('foo bar')
'foobar'
>>> g.normalize_space(' foo \n \t bar')
'foo bar'
>>> g.find_number('12 ')
'12'
Working with the DOM tree (grab.ext.lxml extension)
Now we come to the most interesting part. Thanks to the wonderful lxml library, Grab lets you use xpath expressions to search for data. In short: through the `tree` attribute you have access to the DOM tree with the ElementTree interface. The tree is built with the parser of the lxml library. You can work with the DOM tree using two query languages: xpath and css.
Methods for working with xpath:
- xpath: returns the first element matching the query
- xpath_list: returns all matching elements
- xpath_text: returns the text content of the element (and of all nested elements)
- xpath_number: returns the first number occurring in the text of the element (and of all nested elements)
If the element is not found, the `xpath`, `xpath_text` and `xpath_number` functions raise a DataNotFound exception.
The `css`, `css_list`, `css_text` and `css_number` functions work the same way, with one exception: their argument must be a css selector rather than an xpath path.
>>> g = Grab()
>>> g.go('http://habrahabr.ru')
>>> g.xpath('//h2/a[@class="topic"]').get('href')
'http://habrahabr.ru/blogs/qt_software/127555/'
>>> print g.xpath_text('//h2/a[@class="topic"]')
Qt Creator 2.3.0
>>> print g.css_text('h2 a.topic')
Qt Creator 2.3.0
>>> print 'Comments:', g.css_number('.comments .all')
Comments: 5
>>> from urlparse import urlsplit
>>> print ', '.join(urlsplit(x.get('href')).netloc for x in g.css_list('.hentry a') if not 'habrahabr.ru' in x.get('href') and x.get('href').startswith('http:'))
labs.qt.nokia.com, labs.qt.nokia.com, thisismynext.com, www.htc.com, www.htc.com, droider.ru, radikal.ru, www.gosuslugi.ru, bit.ly
Forms (grab.ext.lxml_form extension)
When I implemented the automatic form-filling features I was very happy. You will like it too! So, there is the `set_input` method, which fills in the field with the given name, `set_input_by_id`, which selects the field by the value of its id attribute, and `set_input_by_number`, which selects it simply by number. These methods work on a form that can be chosen manually, but usually Grab itself correctly guesses which form you want to fill in. If there is only one form, everything is clear, but what if there are several? Grab takes the form that contains the most fields. To choose a form manually, use the `choose_form` method. The filled-in form is sent with the submit method. Grab itself builds the POST/GET request for the fields you did not fill in explicitly (for example, hidden fields), and works out the form's action and the request method. There is also a `form_fields` method that returns all the fields and values of the form as a dictionary.
>>> g.go('http://ya.ru/')
>>> g.set_input('text', u' ')
>>> g.submit()
>>> print ', '.join(x.get('href') for x in g.css_list('.b-serp-url__link'))
http://gigporno.ru/, http://drochinehochu.ru/, http://porno.bllogs.ru/, http://www.pornoflv.net/, http://www.plombir.ru/, http://vuku.ru/, http://www.carol.ru/, http://www.Porno-Mama.ru/, http://kashtanka.com/, http://www.xvidon.ru/
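The other form helpers mentioned above could be used roughly like this (a sketch: the form number, field id and values are made up for illustration, and it assumes `choose_form` accepts a form number):
>>> g.go('http://example.com/login')
>>> g.choose_form(0)                           # pick the first form on the page explicitly
>>> g.set_input_by_id('login-field', 'root')   # fill a field by its id attribute
>>> g.set_input_by_number(2, 'secret')         # ...or simply by its number
>>> g.form_fields()                            # dictionary of all form fields and their values
>>> g.submit()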
Transports
By default, Grab uses pycurl for all network operations. This functionality is also implemented as an extension, so you can plug in a different transport extension, for example one that makes requests through the urllib2 library. There is just one problem: that extension still needs to be written :) Work on the urllib2 extension is in progress, but very slowly; I am 100% satisfied with pycurl. I think the capabilities of a pycurl extension and a urllib2 extension would be similar, except that urllib2 cannot work through SOCKS proxies. All examples in this article use the pycurl transport, which is enabled by default.
>>> g = Grab()
>>> g.curl
<pycurl.Curl object at 0x9d4ba04>
>>> g.extensions
[<grab.ext.pycurl.Extension object at 0xb749056c>, <grab.ext.lxml.Extension object at 0xb749046c>, <grab.ext.lxml_form.Extension object at 0xb6de136c>, <grab.ext.django.Extension object at 0xb6a7e0ac>]
Hammer mode
This mode is enabled by default. Grab has a timeout for every request. In hammer mode, when a timeout occurs, Grab does not immediately raise an exception; instead it retries the request several more times with ever larger timeouts. This mode can significantly improve the stability of your program, since micro-outages and channel lags happen all the time when you work with a site. The mode is controlled with the `hammer_mode` option; the number and length of the timeouts are configured with the `hammer_timeouts` option, to which you pass a list of numeric pairs: the connect timeout and the total timeout for receiving the response.
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> g = Grab()
>>> g.setup(hammer_mode=True, hammer_timeouts=((1, 1), (2, 2), (30, 30)))
>>> URL = 'http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz'
>>> g.go(URL, method='head')
DEBUG:grab:[01] HEAD http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz
>>> print 'File size: %d Mb' % (int(g.response.headers['Content-Length']) / (1024 * 1024))
File size: 3 Mb
>>> g.go(URL, method='get')
DEBUG:grab:[02] GET http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz
DEBUG:grab:Trying another timeouts. Connect: 2 sec., total: 2 sec.
DEBUG:grab:[03] GET http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz
DEBUG:grab:Trying another timeouts. Connect: 30 sec., total: 30 sec.
DEBUG:grab:[04] GET http://download.wikimedia.org/enwiki/20110803/enwiki-20110803-stub-articles5.xml.gz
>>> print 'Downloaded: %d Mb' % (len(g.response.body) / (1024 * 1024))
Downloaded: 3 Mb
Django extension (grab.ext.django)
Yes, yes. There is even that :-) Suppose you have a Movie model with an ImageField field called `picture`. Here is how you can download an image and save it into a Movie object.
>>> obj = Movie.objects.get(pk=797)
>>> g = Grab()
>>> g.go('http://img.yandex.net/i/www/logo.png')
>>> obj.picture = g.django_file()
>>> obj.save()
What else can Grab do?
There are other goodies, but I am worried the article is getting too long. The main rule for users of the Grab library: if something is unclear, look at the code. The documentation is still weak.
Development plans
I have been using Grab for many years, including on production sites, for example an aggregator where you can buy discount coupons for Moscow and other cities. In 2011 I started writing tests and documentation. Maybe I will write functionality for asynchronous requests based on multicurl. And maybe I will finish the urllib transport.
How can you support the project? Simply use it and send me bug reports. You can also order the writing of a parser, grabber or information-processing script: I regularly write such things with Grab.
The official project repository: bitbucket.org/lorien/grab. The library can also be installed from pypi.python.org, but the code in the repository is usually more up to date.
UPD: In the comments, people suggested various alternatives to Grab. I decided to summarize them in a list, together with a few from my own memory. There are in fact a whole lot of them; I suppose every Nth programmer decides one day to write his own utility for network requests:
- github.com/lispython/human_curl
- docs.python-requests.org/en/latest/index.html
- wwwsearch.sourceforge.net/mechanize/
- github.com/mattseh/python-web/
- code.google.com/p/urllib3/
- scrapy.org/
UPD2: Please post questions about the library in the Google group: groups.google.com/group/python-grab/. The questions and answers will be useful for other Grab users as well.
UPD3: The latest documentation is available at docs.grablib.org/
UPD4: The current project site: grablib.org
UPD5: I have fixed the source-code examples in the article. After one of its upgrades, Habrahabr broke the formatting of code in old articles for reasons I do not understand. Thanks to Alexei Mazanov for fixing the article. He would also like to join Habr, so if you have an invite, his email is egocentrist@me.com