In Python3 crawlers, using a proxy is a common way to prevent your IP from being banned and to improve crawling speed; it is mainly used to simulate access from multiple IP addresses. Proxies fall into two types: free and paid. Free proxies tend to be unstable, while paid proxies are comparatively more reliable and stable.

The following are common use cases for proxies in Python3 crawlers:

Preventing IP bans: some websites limit how often an IP may access them; once the limit is exceeded, the IP is banned from the site. Using a proxy can prevent this from happening.

Improving crawling speed: with proxies you can establish multiple connections at the same time, which helps fetch the target data quickly.
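The idea of distributing requests across several connections can be sketched as follows. This is a minimal illustration of my own, not code from the original article: the proxy addresses are placeholders, and fetch_via stands in for a real request (for example, urllib with a ProxyHandler).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

def fetch_via(url, proxy):
    # Placeholder for a real request through the given proxy;
    # here it only reports which proxy a URL would be fetched through.
    return f'{url} -> {proxy}'

def crawl(urls, proxies, workers=3):
    # Assign proxies round-robin, then issue the requests concurrently.
    jobs = list(zip(urls, cycle(proxies)))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(lambda job: fetch_via(*job), jobs))

# Hypothetical proxy pool; replace with your own working proxies.
results = crawl(['https://httpbin.org/get'] * 4,
                ['127.0.0.1:7890', '127.0.0.1:7891'])
for line in results:
    print(line)
```

Pairing each URL with a proxy before submitting the jobs keeps the rotation deterministic even though the requests themselves run in parallel.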

Bypassing geographical restrictions: some websites do not provide service in certain regions. If you need to visit sites that are only open to specific regions, a proxy lets you bypass this restriction and obtain the required data.

In short, proxy IPs play a very important role in Python3 crawlers. However, using a proxy also brings some security concerns, so you need to choose an appropriate proxy service provider and strictly abide by network security regulations.

Preparation

We first need to obtain an available proxy. A proxy is the combination of an IP address and a port, in the format ip:port. If the proxy requires authentication, two additional pieces of information are needed: a username and a password.
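As a quick illustration of this format, a small helper (the function name is my own, not from any library) can assemble the proxy string from its parts:

```python
def format_proxy(host, port, username=None, password=None):
    """Return an ip:port proxy string, with an optional
    username:password@ prefix for authenticated proxies."""
    auth = f'{username}:{password}@' if username and password else ''
    return f'{auth}{host}:{port}'

print(format_proxy('127.0.0.1', 7890))                # 127.0.0.1:7890
print(format_proxy('127.0.0.1', 7890, 'foo', 'bar'))  # foo:bar@127.0.0.1:7890
```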

Here I installed proxy software on my local machine, which creates an HTTP proxy server on local port 7890, i.e. the proxy is 127.0.0.1:7890. The software also creates a SOCKS proxy server on port 7891, i.e. the proxy is 127.0.0.1:7891. As long as this proxy is set, the local IP is switched to the IP of the server the proxy software connects to.

In the examples in this chapter, I use the above proxy to demonstrate the setup; you can replace it with your own available proxy.

After setting the proxy, we test against the URL http://httpbin.org/get. Visiting this link returns information about the request; the origin field of the result is the client's IP. We can use it to judge whether the proxy was set up successfully, that is, whether the IP was successfully disguised.
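To make this check concrete, here is a sketch that parses httpbin's JSON body and compares the origin field against your known real IP. The sample response and IP addresses below are made up for illustration.

```python
import json

def proxy_worked(response_text, real_ip):
    """Return True if the origin reported by httpbin differs from real_ip,
    i.e. the request appears to have gone through the proxy."""
    origin = json.loads(response_text)['origin']
    return origin != real_ip

# Simulated httpbin.org/get response body.
sample = '{"args": {}, "origin": "210.173.1.204", "url": "https://httpbin.org/get"}'
print(proxy_worked(sample, '1.2.3.4'))        # True: IP was disguised
print(proxy_worked(sample, '210.173.1.204'))  # False: still the real IP
```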

Ok, let's take a look at the proxy setting methods of each request library.

Obtaining a Proxy for Python3 Crawlers

Some websites detect frequent access to their data and take steps to block it. Using proxy servers disperses the access sources, reduces the chance of being detected, and thus increases the crawl success rate.

Best U.S. static proxy IPs

IPRoyal is a proxy service provider that is extremely friendly to users in the China region, and its residential proxy plans are very attractive.

Cheapest static proxy

Proxy-seller is a data center proxy provider popular with many small internet marketers.

The most affordable static proxy

Shifter.io is a well-known proxy service provider that aims to provide users with privacy protection and a better Internet experience.


2. urllib

First, let's take the most basic urllib as an example to see how to set up the proxy. The code is as follows:

 from urllib.error import URLError
 from urllib.request import ProxyHandler, build_opener

 proxy = '127.0.0.1:7890'
 proxy_handler = ProxyHandler({
    'http': 'http://' + proxy,
    'https': 'http://' + proxy
 })
 opener = build_opener(proxy_handler)
 try:
    response = opener.open('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
 except URLError as e:
    print(e.reason)
 

The result of the operation is as follows:

 {
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-60e9a1b6-0a20b8a678844a0b2ab4e889"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
 }
 

Here we set the proxy with the help of ProxyHandler. Its parameter is a dictionary whose keys are protocol types and whose values are proxies. Note that the protocol must be prepended to the proxy, i.e. http:// or https://. When the requested link uses the HTTP protocol, the proxy under the http key is used; when it uses the HTTPS protocol, the proxy under the https key is used. Since the proxy itself here is an HTTP proxy, the prefix is uniformly set to http://; therefore, whether you access an HTTP or an HTTPS link, the request goes through the HTTP proxy we configured.

After creating the ProxyHandler object, we pass it to the build_opener method to create an Opener, which then has the proxy set. Next, we directly call the open method of the Opener object to access the desired link.

The output is JSON containing an origin field, which indicates the client's IP. Verify that this IP is indeed the proxy's IP and not your real IP; if so, we have successfully set the proxy and hidden the real IP.

If we encounter a proxy that requires authentication, we can set it up in the following way:

 from urllib.error import URLError
 from urllib.request import ProxyHandler, build_opener

 proxy = 'username:password@127.0.0.1:7890'
 proxy_handler = ProxyHandler({
    'http': 'http://' + proxy,
    'https': 'http://' + proxy
 })
 opener = build_opener(proxy_handler)
 try:
    response = opener.open('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
 except URLError as e:
    print(e.reason)
 

Only the proxy variable changed here. We simply prepend the proxy's authentication credentials, where username is the username and password is the password. For example, if the username is foo and the password is bar, the proxy becomes foo:bar@127.0.0.1:7890.
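One caveat (my own note, not from the original text): if the username or password contains characters such as @ or :, they must be percent-encoded before being embedded in the proxy string, for example with urllib.parse.quote. The credentials below are hypothetical:

```python
from urllib.parse import quote

username = 'foo@example'  # hypothetical credentials containing '@' and ':'
password = 'p:ss'
# safe='' forces every reserved character to be encoded.
proxy = f'{quote(username, safe="")}:{quote(password, safe="")}@127.0.0.1:7890'
print(proxy)  # foo%40example:p%3Ass@127.0.0.1:7890
```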

If the proxy is of the SOCKS5 type, it can be set as follows:

 import socks
 import socket
 from urllib import request
 from urllib.error import URLError

 socks.set_default_proxy(socks.SOCKS5, '127.0.0.1', 7891)
 socket.socket = socks.socksocket
 try:
    response = request.urlopen('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
 except URLError as e:
    print(e.reason)
 

The socks module is required here; it can be installed with the following command:

 pip3 install PySocks