In Python3 crawlers, using a proxy is a common way to prevent your IP from being banned and to improve crawling speed; it is mainly used to simulate access from multiple IP addresses. Proxies fall into two types, free and paid: free proxies are unstable, while paid proxies are comparatively more reliable and stable.
The following are common scenarios and applications of proxies in Python3 crawlers:
Preventing IP bans: some websites limit how often an IP may access them; once the limit is exceeded, the IP is blocked from the site. Using a proxy prevents this from happening.
Improving crawling speed: a proxy lets you establish multiple connections at the same time, so the target data can be crawled more quickly.
Bypassing geographical restrictions: some websites serve different content in different regions. If you need to visit a site that is only open to a specific region, a proxy can bypass the restriction so you can get the required data.
In short, proxy IPs play a very important role in Python3 crawlers. However, using a proxy also brings some security concerns, so you need to choose an appropriate proxy service provider and strictly abide by network security regulations.
Preparation
We first need to obtain an available proxy. A proxy is the combination of an IP address and a port, in the format ip:port. If the proxy requires authentication, a username and password are also needed.
Here I have installed proxy software on my local machine. It creates an HTTP proxy server on local port 7890, i.e. the proxy is 127.0.0.1:7890, and it also creates a SOCKS proxy server on port 7891, i.e. the proxy is 127.0.0.1:7891. As long as this proxy is set, the local IP is switched to the IP of the server the proxy software connects to.
In the examples in this chapter, I use the above proxy to demonstrate the setup; you can replace it with your own available proxy.
After setting the proxy, we test it against http://httpbin.org/get. Visiting this link returns information about the request; the origin field of the result is the client's IP. We can use it to judge whether the proxy is set up successfully, i.e. whether the real IP is successfully masked.
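To make this origin check repeatable, here is a small helper (the function name and its handling of multi-IP origins are my own sketch, not code from this chapter) that parses the httpbin response and reports whether the proxy's IP is the one being seen:

```python
import json

def proxy_applied(httpbin_body: str, expected_ip: str) -> bool:
    """Check whether the origin reported by httpbin matches the proxy's public IP.

    httpbin_body is the JSON text returned by http://httpbin.org/get;
    expected_ip is the public IP you expect the proxy to expose.
    httpbin may report several comma-separated IPs, so each one is checked.
    """
    origin = json.loads(httpbin_body).get('origin', '')
    return expected_ip in [ip.strip() for ip in origin.split(',')]
```

If the check returns False, the request went out from your real IP and the proxy settings were not picked up.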
Ok, let's take a look at the proxy setting methods of each request library.
1. Obtaining a proxy for your Python3 crawler
Some websites detect frequent access to their data and take steps to block it. Using proxy servers disperses the access sources, reduces the chance of being detected, and thus increases the crawl success rate.
Best U.S. static proxy IPs
IPRoyal is a proxy service provider that is extremely friendly to users in China, and its residential proxy solution is very attractive.
Cheapest static proxies
Proxy-seller is a data center proxy provider popular with many small internet marketers.
Most affordable static proxies
Shifter.io is a well-known proxy service provider that aims to give users privacy protection and a better Internet experience.
2. urllib proxy settings
First, let's take the most basic library, urllib, as an example to see how to set up a proxy. The code is as follows:
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy = '127.0.0.1:7890'
proxy_handler = ProxyHandler({
    'http': 'http://' + proxy,
    'https': 'http://' + proxy
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-60e9a1b6-0a20b8a678844a0b2ab4e889"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
Here we set the proxy with the help of ProxyHandler. Its parameter is a dictionary whose keys are protocol types and whose values are proxies. Note that the proxy value must include the scheme, i.e. http:// or https://. When the requested link uses the HTTP protocol, the proxy under the http key is used; when it uses HTTPS, the proxy under the https key is used. However, the proxy itself here speaks the HTTP protocol, so the prefix is uniformly set to http://; therefore, whether you access an HTTP or an HTTPS link, the HTTP-protocol proxy we configured is used to make the request.
After creating the ProxyHandler object, we pass it to the build_opener method to create an Opener; the Opener then has the proxy configured. Next, we directly call the Opener's open method to access the link we want.
The output is JSON with an origin field indicating the client's IP. Verify that it is indeed the proxy's IP rather than your real IP; if so, we have successfully set up the proxy and hidden the real IP.
If we encounter a proxy that requires authentication, we can set it up as follows:
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy = 'username:password@127.0.0.1:7890'
proxy_handler = ProxyHandler({
    'http': 'http://' + proxy,
    'https': 'http://' + proxy
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
What changes here is only the proxy variable: just add the proxy's authentication username and password in front of it, separated by a colon. For example, if the username is foo and the password is bar, the proxy becomes foo:bar@127.0.0.1:7890.
If the proxy is of SOCKS5 type, it can be set like this:
import socks
import socket
from urllib import request
from urllib.error import URLError

socks.set_default_proxy(socks.SOCKS5, '127.0.0.1', 7891)
socket.socket = socks.socksocket
try:
    response = request.urlopen('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
The socks module is required here; it can be installed with the following command:
pip3 install PySocks
This requires a SOCKS5 proxy running locally on port 7891. On success, the output is the same as with the HTTP proxy above:
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-60e9a1b6-0a20b8a678844a0b2ab4e889"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
The origin field of the result is again the proxy's IP; the proxy has been set up successfully.
3. requests proxy settings
For requests, the proxy setting is very simple: we only need to pass the proxies parameter.
Here I take my local proxy as an example to look at the HTTP proxy settings of requests; the code is as follows:
import requests

proxy = '127.0.0.1:7890'
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
try:
    response = requests.get('https://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e8f358d-87913f68a192fb9f87aa0323"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
As with urllib, when the requested link uses the HTTP protocol, the proxy under the http key is used; when it uses HTTPS, the proxy under the https key is used. However, here the HTTP-protocol proxy is used uniformly.
If the origin in the running result is the proxy server's IP, the proxy has been set up successfully.
If the proxy requires authentication, just add the username and password in front of it; the proxy string becomes:
proxy = 'username:password@127.0.0.1:7890'
Here you only need to replace username and password with your own.
If you need a SOCKS proxy, you can set it up as follows:
import requests

proxy = '127.0.0.1:7891'
proxies = {
    'http': 'socks5://' + proxy,
    'https': 'socks5://' + proxy
}
try:
    response = requests.get('https://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
Here we need to install the additional requests[socks] package; the command is as follows:
pip3 install "requests[socks]"
The results are exactly the same:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e8f364a-589d3cf2500fafd47b5560f2"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
In addition, there is another way: using the socks module, which, as above, requires installing the PySocks library. This setup is as follows:
import requests
import socks
import socket

socks.set_default_proxy(socks.SOCKS5, '127.0.0.1', 7891)
socket.socket = socks.socksocket
try:
    response = requests.get('https://httpbin.org/get')
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
This method also sets a SOCKS proxy and gives exactly the same result. Compared with the first method, this one is a global setting; you can choose between the two approaches depending on the situation.
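Because socks.set_default_proxy patches socket.socket for the whole process, every connection in the program, not just requests, goes through the proxy. One possible way to confine the effect (my own helper sketch, not part of PySocks) is a context manager that restores the original socket class on exit:

```python
import socket
from contextlib import contextmanager

@contextmanager
def patched_socket(replacement):
    """Temporarily replace socket.socket (e.g. with socks.socksocket)
    so the global proxy only applies inside the with block."""
    original = socket.socket
    socket.socket = replacement
    try:
        yield
    finally:
        # always restore, even if the block raises
        socket.socket = original
```

With PySocks installed, you would call socks.set_default_proxy(...) once and then wrap only the requests that should be proxied in `with patched_socket(socks.socksocket): ...`.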
4. httpx proxy settings
httpx itself is used much like requests, so it also sets the proxy through the proxies parameter. Unlike requests, however, the key names of the proxies parameter cannot be http or https; they must be changed to http:// and https://. Other settings are the same.
For an HTTP proxy, the setup is as follows:
import httpx

proxy = '127.0.0.1:7890'
proxies = {
    'http://': 'http://' + proxy,
    'https://': 'http://' + proxy,
}
with httpx.Client(proxies=proxies) as client:
    response = client.get('https://httpbin.org/get')
    print(response.text)
For a proxy that requires authentication, just change the value of proxy:
proxy = 'username:password@127.0.0.1:7890'
Here you only need to replace username and password with your own.
The running result is similar to that of requests:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-httpx/0.18.1",
    "X-Amzn-Trace-Id": "Root=1-60e9a3ef-5527ff6320484f8e46d39834"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
For SOCKS proxies, we need to install the httpx-socks library; the installation command is as follows:
pip3 install "httpx-socks[asyncio]"
This installs support for both synchronous and asynchronous modes.
For the synchronous mode, the setup is as follows:
import httpx
from httpx_socks import SyncProxyTransport

transport = SyncProxyTransport.from_url('socks5://127.0.0.1:7891')
with httpx.Client(transport=transport) as client:
    response = client.get('https://httpbin.org/get')
    print(response.text)
Here we create a transport object configured with the SOCKS proxy address and pass it in via the transport parameter when declaring the httpx Client object; the running result is the same as before.
For the asynchronous mode, the setup is as follows:
import httpx
import asyncio
from httpx_socks import AsyncProxyTransport

transport = AsyncProxyTransport.from_url('socks5://127.0.0.1:7891')

async def main():
    async with httpx.AsyncClient(transport=transport) as client:
        response = await client.get('https://httpbin.org/get')
        print(response.text)

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
Unlike in the synchronous mode, the transport object we use is AsyncProxyTransport rather than SyncProxyTransport, and the Client object must be changed to AsyncClient; everything else is unchanged, and the running result is the same.
5. Selenium proxy settings
Selenium can also be configured with a proxy; here we take Chrome as an example to introduce the setup.
For a proxy without authentication, the setup is as follows:
from selenium import webdriver
proxy = '127.0.0.1:7890'
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://' + proxy)
browser = webdriver.Chrome(options=options)
browser.get('https://httpbin.org/get')
print(browser.page_source)
browser.close()
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5e8f39cd-60930018205fd154a9af39cc"
  },
  "origin": "210.173.1.204",
  "url": "http://httpbin.org/get"
}
The proxy is set up successfully, and origin is the proxy IP's address.
If the proxy requires authentication, the setup is relatively cumbersome, as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import zipfile

ip = '127.0.0.1'
port = 7890
username = 'foo'
password = 'bar'

manifest_json = """
{
  "version": "1.0.0",
  "manifest_version": 2,
  "name": "Chrome Proxy",
  "permissions": [
    "proxy", "tabs", "unlimitedStorage", "storage",
    "<all_urls>", "webRequest", "webRequestBlocking"
  ],
  "background": {
    "scripts": ["background.js"]
  }
}
"""

background_js = """
var config = {
    mode: "fixed_servers",
    rules: {
        singleProxy: {
            scheme: "http",
            host: "%(ip)s",
            port: %(port)s
        }
    }
}

chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

function callbackFn(details) {
    return {
        authCredentials: {
            username: "%(username)s",
            password: "%(password)s"
        }
    }
}

chrome.webRequest.onAuthRequired.addListener(
    callbackFn,
    {urls: ["<all_urls>"]},
    ['blocking']
)
""" % {'ip': ip, 'port': port, 'username': username, 'password': password}

plugin_file = 'proxy_auth_plugin.zip'
with zipfile.ZipFile(plugin_file, 'w') as zp:
    zp.writestr("manifest.json", manifest_json)
    zp.writestr("background.js", background_js)

options = Options()
options.add_argument("--start-maximized")
options.add_extension(plugin_file)
browser = webdriver.Chrome(options=options)
browser.get('https://httpbin.org/get')
print(browser.page_source)
browser.close()
Here we write a manifest.json configuration file and a background.js script to set up the authentication proxy; running the code packages them into a local proxy_auth_plugin.zip file, which is then loaded into Chrome as an extension.
The running result is the same as in the example above; origin is also the proxy IP.
Setting a SOCKS proxy is also simple: just change the scheme to socks5. For example, a proxy without password authentication is set like this:
from selenium import webdriver
proxy = '127.0.0.1:7891'
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=socks5://' + proxy)
browser = webdriver.Chrome(options=options)
browser.get('https://httpbin.org/get')
print(browser.page_source)
browser.close()
The results are the same.
6. aiohttp proxy settings
For aiohttp, we can set the proxy directly through the proxy parameter of the request method. The HTTP proxy setup is as follows:
import asyncio
import aiohttp

proxy = 'http://127.0.0.1:7890'

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://httpbin.org/get', proxy=proxy) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
If the proxy has a username and password, modify proxy as with requests:
proxy = 'http://username:password@127.0.0.1:7890'
Here you only need to replace username and password with your own.
For SOCKS proxies, we need to install the aiohttp-socks support library; the installation command is as follows:
pip3 install aiohttp-socks
We can set the SOCKS proxy with the help of this library's ProxyConnector; the code is as follows:
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

connector = ProxyConnector.from_url('socks5://127.0.0.1:7891')

async def main():
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get('https://httpbin.org/get') as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
The results are the same.
In addition, this library also supports SOCKS4, HTTP proxies, and the corresponding proxy authentication; you can refer to its official documentation.
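aiohttp-socks, like requests and httpx, accepts credentials embedded in the proxy URL, e.g. socks5://user:pass@host:port. One detail the URL form requires is percent-encoding of reserved characters in the credentials; here is a small stdlib-only helper (my own sketch, not an aiohttp-socks API) that builds such a URL safely:

```python
from urllib.parse import quote

def build_proxy_url(scheme, host, port, username=None, password=None):
    """Build a proxy URL such as socks5://user:pass@host:port,
    percent-encoding credentials that contain reserved characters."""
    auth = ''
    if username is not None and password is not None:
        auth = quote(username, safe='') + ':' + quote(password, safe='') + '@'
    return f'{scheme}://{auth}{host}:{port}'
```

The resulting string can be passed to ProxyConnector.from_url the same way as the unauthenticated URL above.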
7. Pyppeteer proxy settings
For Pyppeteer, since it uses Chromium (a Chrome-like browser) by default, its setup is the same as Selenium's for Chrome. For example, an HTTP proxy without authentication is set through the args startup option, implemented as follows:
import asyncio
from pyppeteer import launch

proxy = '127.0.0.1:7890'

async def main():
    browser = await launch({'args': ['--proxy-server=http://' + proxy], 'headless': False})
    page = await browser.newPage()
    await page.goto('https://httpbin.org/get')
    print(await page.content())
    await browser.close()

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3494.0 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5e8f442c-12b1ed7865b049007267a66c"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
You can also see that the setting is successful.
A SOCKS proxy is set the same way; just change the scheme to socks5. The code is implemented as follows:
import asyncio
from pyppeteer import launch

proxy = '127.0.0.1:7891'

async def main():
    browser = await launch({'args': ['--proxy-server=socks5://' + proxy], 'headless': False})
    page = await browser.newPage()
    await page.goto('https://httpbin.org/get')
    print(await page.content())
    await browser.close()

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
The result is also the same.
8. Playwright proxy settings
Compared with Selenium and Pyppeteer, Playwright's proxy setting is more convenient: it reserves a proxy parameter that can be set when launching the browser.
For an HTTP proxy, it can be set like this:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        'server': 'http://127.0.0.1:7890'
    })
    page = browser.new_page()
    page.goto('https://httpbin.org/get')
    print(page.content())
    browser.close()
When calling the launch method, we can pass a proxy parameter, which is a dictionary with a required field called server; here we can fill in the HTTP proxy address directly.
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Sec-Ch-Ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"92\"",
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4498.0 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-60e99eef-4fa746a01a38abd469ecb467"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
For a SOCKS proxy, the setup is exactly the same; we only need to replace the value of the server field with the SOCKS proxy address:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        'server': 'socks5://127.0.0.1:7891'
    })
    page = browser.new_page()
    page.goto('https://httpbin.org/get')
    print(page.content())
    browser.close()
The running result is exactly the same as just now.
For a proxy with a username and password, Playwright's setup is also very simple: we only need to set the username and password fields in the proxy parameter. If the username and password are foo and bar respectively, the setup is as follows:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        'server': 'http://127.0.0.1:7890',
        'username': 'foo',
        'password': 'bar'
    })
    page = browser.new_page()
    page.goto('https://httpbin.org/get')
    print(page.content())
    browser.close()
In this way, we can easily configure an authenticated proxy in Playwright.
9. Summary
Above we summarized the proxy setup methods for each request library. The methods are similar across libraries; once you have learned them, if you run into IP blocking in the future, you can easily solve it by adding a proxy.
By using proxy servers in different geographical locations, you can simulate crawling from those locations and obtain region-specific data, which is useful for tasks that need geographically related data. Using a proxy server also hides the crawler's real IP address, avoiding bans or rate limits from the website and helping protect your privacy and security. Some websites block crawlers by identifying too many requests from the same IP address; proxy servers can bypass such anti-crawling mechanisms, letting you crawl faster and with a higher success rate.
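To make the idea of dispersing access sources concrete, here is a minimal proxy-pool sketch (the class and the proxy address are hypothetical, not from this chapter) that picks a random proxy per request and drops ones that fail:

```python
import random

class ProxyPool:
    """A minimal pool that spreads requests across several proxies."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def pick(self):
        # a random choice makes consecutive requests come from different IPs
        if not self.proxies:
            raise RuntimeError('no proxies left in the pool')
        return random.choice(self.proxies)

    def discard(self, proxy):
        # drop a proxy that failed, e.g. after a ConnectionError
        if proxy in self.proxies:
            self.proxies.remove(proxy)
```

With requests, you would wrap each call in try/except, pass {'http': pool.pick(), 'https': pool.pick()} as proxies, and discard the proxy when a ConnectionError occurs.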