In Python3 crawlers, using a proxy is a common way to prevent your IP from being banned and to improve crawling speed; it is mainly used to simulate access from multiple IP addresses. Proxies fall into two types, free and paid: free proxies are unstable, while paid proxies are comparatively more reliable and stable.
The following are common scenarios and applications of proxies in Python3 crawlers:
Preventing IP bans: some websites limit how often an IP may access them; once the limit is exceeded, the IP is blocked from the site. Using a proxy prevents this from happening.
Improving crawling speed: a proxy lets you establish multiple connections at the same time, so the target data can be crawled more quickly.
Bypassing geographical restrictions: some websites serve different content in different regions. If you need to visit a site that is only open to a specific region, a proxy can bypass the restriction so you can get the required data.
In short, proxy IPs play a very important role in Python3 crawlers. However, using a proxy also brings some security concerns, so you need to choose an appropriate proxy service provider and strictly abide by network security regulations.
Preparation
We first need to obtain an available proxy. A proxy is the combination of an IP address and a port, in the format ip:port. If the proxy requires authentication, a username and password are also needed.
Here I have installed proxy software on my local machine. It creates an HTTP proxy server on local port 7890, i.e. the proxy is 127.0.0.1:7890, and it also creates a SOCKS proxy server on port 7891, i.e. the proxy is 127.0.0.1:7891. As long as this proxy is set, the local IP is switched to the IP of the server the proxy software connects to.
In the examples in this chapter, I use the above proxy to demonstrate the setup; you can replace it with your own available proxy.
After setting the proxy, we test it against http://httpbin.org/get. Visiting this link returns information about the request; the origin field of the result is the client's IP. We can use it to judge whether the proxy is set up successfully, i.e. whether the real IP is successfully masked.
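To make this origin check repeatable, here is a small helper (the function name and its handling of multi-IP origins are my own sketch, not code from this chapter) that parses the httpbin response and reports whether the proxy's IP is the one being seen:

```python
import json

def proxy_applied(httpbin_body: str, expected_ip: str) -> bool:
    """Check whether the origin reported by httpbin matches the proxy's public IP.

    httpbin_body is the JSON text returned by http://httpbin.org/get;
    expected_ip is the public IP you expect the proxy to expose.
    httpbin may report several comma-separated IPs, so each one is checked.
    """
    origin = json.loads(httpbin_body).get('origin', '')
    return expected_ip in [ip.strip() for ip in origin.split(',')]
```

If the check returns False, the request went out from your real IP and the proxy settings were not picked up.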
Ok, let's take a look at the proxy setting methods of each request library.
1. Obtaining a proxy for your Python3 crawler
Some websites detect frequent access to their data and take steps to block it. Using proxy servers disperses the access sources, reduces the chance of being detected, and thus increases the crawl success rate.
Best U.S. static proxy IPs
IPRoyal is a proxy service provider that is extremely friendly to users in China, and its residential proxy solution is very attractive.
Cheapest static proxies
Proxy-seller is a data center proxy provider popular with many small internet marketers.
Most affordable static proxies
Shifter.io is a well-known proxy service provider that aims to give users privacy protection and a better Internet experience.
2. urllib proxy settings
First, let's take the most basic library, urllib, as an example to see how to set up a proxy. The code is as follows:
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy = '127.0.0.1:7890'
proxy_handler = ProxyHandler({
    'http': 'http://' + proxy,
    'https': 'http://' + proxy
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-60e9a1b6-0a20b8a678844a0b2ab4e889"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
Here we set the proxy with the help of ProxyHandler. Its parameter is a dictionary whose keys are protocol types and whose values are proxies. Note that the proxy value must include the scheme, i.e. http:// or https://. When the requested link uses the HTTP protocol, the proxy under the http key is used; when it uses HTTPS, the proxy under the https key is used. However, the proxy itself here speaks the HTTP protocol, so the prefix is uniformly set to http://; therefore, whether you access an HTTP or an HTTPS link, the HTTP-protocol proxy we configured is used to make the request.
After creating the ProxyHandler object, we pass it to the build_opener method to create an Opener; the Opener then has the proxy configured. Next, we directly call the Opener's open method to access the link we want.
The output is JSON with an origin field indicating the client's IP. Verify that it is indeed the proxy's IP rather than your real IP; if so, we have successfully set up the proxy and hidden the real IP.
If we encounter a proxy that requires authentication, we can set it up as follows:
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy = 'username:password@127.0.0.1:7890'
proxy_handler = ProxyHandler({
    'http': 'http://' + proxy,
    'https': 'http://' + proxy
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
What changes here is only the proxy variable: just add the proxy's authentication username and password in front of it, separated by a colon. For example, if the username is foo and the password is bar, the proxy becomes foo:bar@127.0.0.1:7890.
If the proxy is of SOCKS5 type, it can be set like this:
import socks
import socket
from urllib import request
from urllib.error import URLError

socks.set_default_proxy(socks.SOCKS5, '127.0.0.1', 7891)
socket.socket = socks.socksocket
try:
    response = request.urlopen('https://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
The socks module is required here; it can be installed with the following command:
pip3 install PySocks
This requires a SOCKS5 proxy running locally on port 7891. On success, the output is the same as with the HTTP proxy above:
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-60e9a1b6-0a20b8a678844a0b2ab4e889"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
The origin field of the result is again the proxy's IP; the proxy has been set up successfully.
3. requests proxy settings
For requests, the proxy setting is very simple: we only need to pass the proxies parameter.
Here I take my local proxy as an example to look at the HTTP proxy settings of requests; the code is as follows:
import requests

proxy = '127.0.0.1:7890'
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
try:
    response = requests.get('https://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e8f358d-87913f68a192fb9f87aa0323"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
As with urllib, when the requested link uses the HTTP protocol, the proxy under the http key is used; when it uses HTTPS, the proxy under the https key is used. However, here the HTTP-protocol proxy is used uniformly.
If the origin in the running result is the proxy server's IP, the proxy has been set up successfully.
If the proxy requires authentication, just add the username and password in front of it; the proxy string becomes:
proxy = 'username:password@127.0.0.1:7890'
Here you only need to replace username and password with your own.
If you need a SOCKS proxy, you can set it up as follows:
import requests

proxy = '127.0.0.1:7891'
proxies = {
    'http': 'socks5://' + proxy,
    'https': 'socks5://' + proxy
}
try:
    response = requests.get('https://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
Here we need to install the additional requests[socks] package; the command is as follows:
pip3 install "requests[socks]"
The results are exactly the same:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e8f364a-589d3cf2500fafd47b5560f2"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
In addition, there is another way: using the socks module, which, as above, requires installing the PySocks library. This setup is as follows:
import requests
import socks
import socket

socks.set_default_proxy(socks.SOCKS5, '127.0.0.1', 7891)
socket.socket = socks.socksocket
try:
    response = requests.get('https://httpbin.org/get')
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
This method also sets a SOCKS proxy and gives exactly the same result. Compared with the first method, this one is a global setting; you can choose between the two approaches depending on the situation.
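Because socks.set_default_proxy patches socket.socket for the whole process, every connection in the program, not just requests, goes through the proxy. One possible way to confine the effect (my own helper sketch, not part of PySocks) is a context manager that restores the original socket class on exit:

```python
import socket
from contextlib import contextmanager

@contextmanager
def patched_socket(replacement):
    """Temporarily replace socket.socket (e.g. with socks.socksocket)
    so the global proxy only applies inside the with block."""
    original = socket.socket
    socket.socket = replacement
    try:
        yield
    finally:
        # always restore, even if the block raises
        socket.socket = original
```

With PySocks installed, you would call socks.set_default_proxy(...) once and then wrap only the requests that should be proxied in `with patched_socket(socks.socksocket): ...`.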
4. httpx proxy settings
httpx itself is used much like requests, so it also sets the proxy through the proxies parameter. Unlike requests, however, the key names of the proxies parameter cannot be http or https; they must be changed to http:// and https://. Other settings are the same.
For an HTTP proxy, the setup is as follows:
import httpx

proxy = '127.0.0.1:7890'
proxies = {
    'http://': 'http://' + proxy,
    'https://': 'http://' + proxy,
}
with httpx.Client(proxies=proxies) as client:
    response = client.get('https://httpbin.org/get')
    print(response.text)
For a proxy that requires authentication, just change the value of proxy:
proxy = 'username:password@127.0.0.1:7890'
Here you only need to replace username and password with your own.
The running result is similar to that of requests:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-httpx/0.18.1",
    "X-Amzn-Trace-Id": "Root=1-60e9a3ef-5527ff6320484f8e46d39834"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
For SOCKS proxies, we need to install the httpx-socks library; the installation command is as follows:
pip3 install "httpx-socks[asyncio]"
This installs support for both synchronous and asynchronous modes.
For the synchronous mode, the setup is as follows:
import httpx
from httpx_socks import SyncProxyTransport

transport = SyncProxyTransport.from_url('socks5://127.0.0.1:7891')
with httpx.Client(transport=transport) as client:
    response = client.get('https://httpbin.org/get')
    print(response.text)
Here we create a transport object configured with the SOCKS proxy address and pass it in via the transport parameter when declaring the httpx Client object; the running result is the same as before.
For the asynchronous mode, the setup is as follows:
import httpx
import asyncio
from httpx_socks import AsyncProxyTransport

transport = AsyncProxyTransport.from_url('socks5://127.0.0.1:7891')

async def main():
    async with httpx.AsyncClient(transport=transport) as client:
        response = await client.get('https://httpbin.org/get')
        print(response.text)

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
Unlike in the synchronous mode, the transport object we use is AsyncProxyTransport rather than SyncProxyTransport, and the Client object must be changed to AsyncClient; everything else is unchanged, and the running result is the same.
5. Selenium proxy settings
Selenium can also be configured with a proxy; here we take Chrome as an example to introduce the setup.
For a proxy without authentication, the setup is as follows:
from selenium import webdriver
proxy = '127.0.0.1:7890'
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://' + proxy)
browser = webdriver.Chrome(options=options)
browser.get('https://httpbin.org/get')
print(browser.page_source)
browser.close()
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5e8f39cd-60930018205fd154a9af39cc"
  },
  "origin": "210.173.1.204",
  "url": "http://httpbin.org/get"
}
The proxy is set up successfully, and origin is the proxy IP's address.
If the proxy requires authentication, the setup is relatively cumbersome, as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import zipfile

ip = '127.0.0.1'
port = 7890
username = 'foo'
password = 'bar'

manifest_json = """
{
  "version": "1.0.0",
  "manifest_version": 2,
  "name": "Chrome Proxy",
  "permissions": [
    "proxy", "tabs", "unlimitedStorage", "storage",
    "<all_urls>", "webRequest", "webRequestBlocking"
  ],
  "background": {
    "scripts": ["background.js"]
  }
}
"""

background_js = """
var config = {
    mode: "fixed_servers",
    rules: {
        singleProxy: {
            scheme: "http",
            host: "%(ip)s",
            port: %(port)s
        }
    }
}

chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

function callbackFn(details) {
    return {
        authCredentials: {
            username: "%(username)s",
            password: "%(password)s"
        }
    }
}

chrome.webRequest.onAuthRequired.addListener(
    callbackFn,
    {urls: ["<all_urls>"]},
    ['blocking']
)
""" % {'ip': ip, 'port': port, 'username': username, 'password': password}

plugin_file = 'proxy_auth_plugin.zip'
with zipfile.ZipFile(plugin_file, 'w') as zp:
    zp.writestr("manifest.json", manifest_json)
    zp.writestr("background.js", background_js)

options = Options()
options.add_argument("--start-maximized")
options.add_extension(plugin_file)
browser = webdriver.Chrome(options=options)
browser.get('https://httpbin.org/get')
print(browser.page_source)
browser.close()
Here we write a manifest.json configuration file and a background.js script to set up the authentication proxy; running the code packages them into a local proxy_auth_plugin.zip file, which is then loaded into Chrome as an extension.
The running result is the same as in the example above; origin is also the proxy IP.
Setting a SOCKS proxy is also simple: just change the scheme to socks5. For example, a proxy without password authentication is set like this:
from selenium import webdriver
proxy = '127.0.0.1:7891'
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=socks5://' + proxy)
browser = webdriver.Chrome(options=options)
browser.get('https://httpbin.org/get')
print(browser.page_source)
browser.close()
The results are the same.
6. aiohttp proxy settings
For aiohttp, we can set the proxy directly through the proxy parameter of the request method. The HTTP proxy setup is as follows:
import asyncio
import aiohttp

proxy = 'http://127.0.0.1:7890'

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://httpbin.org/get', proxy=proxy) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
If the proxy has a username and password, modify proxy as with requests:
proxy = 'http://username:password@127.0.0.1:7890'
Here you only need to replace username and password with your own.
For SOCKS proxies, we need to install the aiohttp-socks support library; the installation command is as follows:
pip3 install aiohttp-socks
We can set the SOCKS proxy with the help of this library's ProxyConnector; the code is as follows:
import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

connector = ProxyConnector.from_url('socks5://127.0.0.1:7891')

async def main():
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get('https://httpbin.org/get') as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
The results are the same.
In addition, this library also supports SOCKS4, HTTP proxies, and the corresponding proxy authentication; you can refer to its official documentation.
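aiohttp-socks, like requests and httpx, accepts credentials embedded in the proxy URL, e.g. socks5://user:pass@host:port. One detail the URL form requires is percent-encoding of reserved characters in the credentials; here is a small stdlib-only helper (my own sketch, not an aiohttp-socks API) that builds such a URL safely:

```python
from urllib.parse import quote

def build_proxy_url(scheme, host, port, username=None, password=None):
    """Build a proxy URL such as socks5://user:pass@host:port,
    percent-encoding credentials that contain reserved characters."""
    auth = ''
    if username is not None and password is not None:
        auth = quote(username, safe='') + ':' + quote(password, safe='') + '@'
    return f'{scheme}://{auth}{host}:{port}'
```

The resulting string can be passed to ProxyConnector.from_url the same way as the unauthenticated URL above.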
7. Pyppeteer proxy settings
For Pyppeteer, since it uses Chromium (a Chrome-like browser) by default, its setup is the same as Selenium's for Chrome. For example, an HTTP proxy without authentication is set through the args startup option, implemented as follows:
import asyncio
from pyppeteer import launch

proxy = '127.0.0.1:7890'

async def main():
    browser = await launch({'args': ['--proxy-server=http://' + proxy], 'headless': False})
    page = await browser.newPage()
    await page.goto('https://httpbin.org/get')
    print(await page.content())
    await browser.close()

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3494.0 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5e8f442c-12b1ed7865b049007267a66c"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
You can also see that the setting is successful.
A SOCKS proxy is set the same way; just change the scheme to socks5. The code is implemented as follows:
import asyncio
from pyppeteer import launch

proxy = '127.0.0.1:7891'

async def main():
    browser = await launch({'args': ['--proxy-server=socks5://' + proxy], 'headless': False})
    page = await browser.newPage()
    await page.goto('https://httpbin.org/get')
    print(await page.content())
    await browser.close()

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
The result is also the same.
8. Playwright proxy settings
Compared with Selenium and Pyppeteer, Playwright's proxy setting is more convenient: it reserves a proxy parameter that can be set when launching the browser.
For an HTTP proxy, it can be set like this:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        'server': 'http://127.0.0.1:7890'
    })
    page = browser.new_page()
    page.goto('https://httpbin.org/get')
    print(page.content())
    browser.close()
When calling the launch method, we can pass a proxy parameter, which is a dictionary with a required field called server; here we can fill in the HTTP proxy address directly.
The result of the operation is as follows:
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Sec-Ch-Ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"92\"",
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4498.0 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-60e99eef-4fa746a01a38abd469ecb467"
  },
  "origin": "210.173.1.204",
  "url": "https://httpbin.org/get"
}
For a SOCKS proxy, the setup is exactly the same; we only need to replace the value of the server field with the SOCKS proxy address:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        'server': 'socks5://127.0.0.1:7891'
    })
    page = browser.new_page()
    page.goto('https://httpbin.org/get')
    print(page.content())
    browser.close()
The running result is exactly the same as just now.
For a proxy with a username and password, Playwright's setup is also very simple: we only need to set the username and password fields in the proxy parameter. If the username and password are foo and bar respectively, the setup is as follows:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        'server': 'http://127.0.0.1:7890',
        'username': 'foo',
        'password': 'bar'
    })
    page = browser.new_page()
    page.goto('https://httpbin.org/get')
    print(page.content())
    browser.close()
In this way, we can easily configure an authenticated proxy in Playwright.
9. Summary
Above we summarized the proxy setup methods for each request library. The methods are similar across libraries; once you have learned them, if you run into IP blocking in the future, you can easily solve it by adding a proxy.
By using proxy servers in different geographical locations, you can simulate crawling from those locations and obtain region-specific data, which is useful for tasks that need geographically related data. Using a proxy server also hides the crawler's real IP address, avoiding bans or rate limits from the website and helping protect your privacy and security. Some websites block crawlers by identifying too many requests from the same IP address; proxy servers can bypass such anti-crawling mechanisms, letting you crawl faster and with a higher success rate.
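To make the idea of dispersing access sources concrete, here is a minimal proxy-pool sketch (the class and the proxy address are hypothetical, not from this chapter) that picks a random proxy per request and drops ones that fail:

```python
import random

class ProxyPool:
    """A minimal pool that spreads requests across several proxies."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def pick(self):
        # a random choice makes consecutive requests come from different IPs
        if not self.proxies:
            raise RuntimeError('no proxies left in the pool')
        return random.choice(self.proxies)

    def discard(self, proxy):
        # drop a proxy that failed, e.g. after a ConnectionError
        if proxy in self.proxies:
            self.proxies.remove(proxy)
```

With requests, you would wrap each call in try/except, pass {'http': pool.pick(), 'https': pool.pick()} as proxies, and discard the proxy when a ConnectionError occurs.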