Something went wrong #6

@101142TS

Description

The IPs from the pool can't crawl the target site. I want to crawl wandoujia, but requests made through the fetched proxy IPs all time out.
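Before handing a free proxy to the spider, it helps to verify that it can actually reach the target. A minimal liveness check using only the standard library (the default URL and timeout below are illustrative choices, not taken from the project's code):

```python
import urllib.request


def proxy_works(proxy: str, url: str = "https://www.wandoujia.com",
                timeout: float = 5.0) -> bool:
    """Return True if `url` is reachable through `proxy` within `timeout` seconds."""
    # Route both plain and TLS traffic through the candidate proxy.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # Connection refused, timeout, bad gateway, etc. -- treat all as dead.
        return False
```

Free proxies scraped from sites like Kuaidaili often die within minutes of being listed, so running a check like this immediately before each request filters out most of the timeouts visible in the log below.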

/Users/icst/Desktop/test_proxy/wandoujia/proxyPool/ProxyPoolWorker.py:81: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if proxy is not '':
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymysql/cursors.py:170: Warning: (1681, b'Integer display width is deprecated and will be removed in a future release.')
result = self._query(query)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymysql/cursors.py:170: Warning: (3719, b"'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.")
result = self._query(query)
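The SyntaxWarning from `ProxyPoolWorker.py` line 81 is worth fixing: `proxy is not ''` compares object identity rather than string content, so whether it matches an empty string depends on interpreter-level string interning. A sketch of the equivalent value-based checks (the function names here are illustrative, not from the project):

```python
def has_proxy(proxy: str) -> bool:
    # Value comparison: true exactly when the string is non-empty.
    return proxy != ''


def has_proxy_idiomatic(proxy: str) -> bool:
    # More idiomatic: a non-empty string is truthy, an empty one is falsy.
    return bool(proxy)
```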
Crawling Kuaidaili…
115.216.56.92 | 9999 | high anonymity | HTTP | Hangzhou, Zhejiang (Telecom) | 3 s
123.149.136.127 | 9999 | high anonymity | HTTP | Luoyang, Henan (Telecom) | 1 s
111.72.25.153 | 9999 | high anonymity | HTTP | Fuzhou, Jiangxi (Telecom) | 0.5 s
183.166.111.11 | 9999 | high anonymity | HTTP | Huainan, Anhui (Telecom) | 2 s
171.35.211.234 | 9999 | high anonymity | HTTP | Xinyu, Jiangxi (Unicom) | 3 s
114.239.110.93 | 9999 | high anonymity | HTTP | Suqian, Jiangsu (Telecom) | 2 s
110.243.2.58 | 9999 | high anonymity | HTTP | Tangshan, Hebei (Unicom) | 2 s
114.99.22.104 | 9999 | high anonymity | HTTP | Tongling, Anhui (Telecom) | 2 s
124.113.250.171 | 9999 | high anonymity | HTTP | Suzhou, Anhui (Telecom) | 3 s
123.149.141.209 | 9999 | high anonymity | HTTP | Luoyang, Henan (Telecom) | 1 s
183.146.156.254 | 9999 | high anonymity | HTTP | Jinhua, Zhejiang (Telecom) | 0.7 s
123.149.136.121 | 9999 | high anonymity | HTTP | Luoyang, Henan (Telecom) | 3 s
163.204.247.139 | 9999 | high anonymity | HTTP | Shanwei, Guangdong (Unicom) | 1 s
123.163.27.220 | 9999 | high anonymity | HTTP | Luoyang, Henan (Telecom) | 0.8 s
1.196.177.218 | 9999 | high anonymity | HTTP | Luoyang, Henan (Telecom) | 0.7 s
2020-02-09 23:15:11 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: wandoujia)
2020-02-09 23:15:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform macOS-10.14.1-x86_64-i386-64bit
2020-02-09 23:15:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'wandoujia', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'wandoujia.spiders', 'SPIDER_MODULES': ['wandoujia.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
2020-02-09 23:15:11 [scrapy.extensions.telnet] INFO: Telnet Password: 79f3a3cb43e725d1
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['proxyPool.scrapy.middlewares.RetryMiddleware',
'proxyPool.scrapy.middlewares.ProxyMiddleware',
'proxyPool.scrapy.middlewares.CatchExceptionMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'proxyPool.scrapy.RandomUserAgentMiddleware.RandomUserAgentMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'wandoujia.middlewares.WandoujiaDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled item pipelines:
['wandoujia.pipelines.MyFilesPipeline']
2020-02-09 23:15:11 [scrapy.core.engine] INFO: Spider opened
2020-02-09 23:15:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:15:11 [main] INFO: Spider opened: main
2020-02-09 23:15:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-09 23:15:11 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://123.149.136.121:9999 】 =====
2020-02-09 23:16:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:17:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:18:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:18:11 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.wandoujia.com/apps/665777> (failed 1 times): User timeout caused connection failure: Getting https://www.wandoujia.com/apps/665777 took longer than 180.0 seconds..
2020-02-09 23:18:11 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://110.243.2.58:9999 】 =====
2020-02-09 23:19:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:19:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.wandoujia.com/apps/665777> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2020-02-09 23:19:27 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://1.196.177.218:9999 】 =====
2020-02-09 23:19:27 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.wandoujia.com/apps/665777> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2020-02-09 23:19:27 [root] DEBUG: === success to update 1.196.177.218 proxy ===
2020-02-09 23:19:27 [root] DEBUG: === success to update 1.196.177.218 proxy ===
2020-02-09 23:19:27 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.wandoujia.com/apps/665777>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 61: Connection refused.
2020-02-09 23:19:27 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-09 23:19:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 1,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 1,
'downloader/request_bytes': 918,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'elapsed_time_seconds': 256.041098,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 2, 9, 15, 19, 27, 373921),
'log_count/DEBUG': 8,
'log_count/ERROR': 1,
'log_count/INFO': 15,
'memusage/max': 67170304,
'memusage/startup': 66805760,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.TCPTimedOutError': 1,
'retry/reason_count/twisted.internet.error.TimeoutError': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2020, 2, 9, 15, 15, 11, 332823)}
2020-02-09 23:19:27 [scrapy.core.engine] INFO: Spider closed (finished)
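The stats above show each dead proxy burning the full 180-second default download timeout before Scrapy retried (elapsed_time_seconds ≈ 256 for three requests). One way to fail fast is to tighten the timeout and raise the retry budget in the project's `settings.py`; the values below are suggestions under the assumption that each retry draws a fresh proxy, not the project's actual configuration:

```python
# settings.py -- fail fast on dead proxies instead of waiting 180 s each.
DOWNLOAD_TIMEOUT = 10   # Scrapy default is 180; a free proxy silent for 10 s rarely recovers
RETRY_TIMES = 5         # allow more retries, since each one picks a different proxy
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]  # also retry server-side errors
```

With a 10 s timeout and 5 retries, a request cycles through six proxies in about a minute instead of spending up to nine minutes on three, as in the run above.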
