Advanced Python Web Scraping: How to Elegantly Bypass Anti-Scraping Mechanisms and Implement Distributed Collection
Release time: 2024-11-28 09:31:39
Copyright Statement: This article is an original work of the website and follows the CC 4.0 BY-SA copyright agreement. Please include the original source link and this statement when reprinting.

Article link: https://melooy.com/en/content/aid/2084?s=en%2Fcontent%2Faid%2F2084

Preface

Have you ever encountered situations where your carefully written scraping program got its IP banned by the target website after running for a short while? Or maybe you used proxy IPs but the collection efficiency was frustratingly low? Today I'll share some practical experience and techniques accumulated from years of scraper development.

Common Mistakes

To be honest, when I first started working with web scrapers, I made quite a few mistakes. For example, blindly pursuing scraping speed only to get completely blocked by websites, or using cheap proxies that resulted in terrible data quality. These are common pitfalls for beginners.

I remember once when scraping product data from an e-commerce site, I thought my code was flawless. But the program ran for less than 30 minutes before receiving a CAPTCHA challenge from the website. I was really frustrated at the time. After deeper research, I discovered that the request frequency was too high, triggering the website's anti-scraping mechanisms.
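
Looking back, even a crude randomized delay between requests would probably have kept me under that site's radar. Here's a minimal sketch of the kind of throttling I mean; the 1 to 3 second range is just an illustrative choice that you would tune per site:

import time
import random
import requests

def polite_get(url, min_delay=1.0, max_delay=3.0, **kwargs):
    # Sleep for a random interval before each request so the traffic
    # pattern looks less like a machine firing at a fixed rate
    time.sleep(random.uniform(min_delay, max_delay))
    kwargs.setdefault('timeout', 10)
    return requests.get(url, **kwargs)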

Technical Breakthroughs

Speaking of anti-scraping, we must address websites' increasingly complex protection measures. To tackle these challenges, we need to master some key technologies.

First is request header spoofing. Many people think changing the User-Agent is enough, but it's far from sufficient. Modern websites check many header fields, like Accept, Accept-Encoding, Accept-Language, etc. I recommend using complete request headers from real browsers.

Here's example code:

import random
import requests
from fake_useragent import UserAgent

class CustomRequests:
    def __init__(self):
        # Reuse one session so cookies and connections persist across requests
        self.ua = UserAgent()
        self.session = requests.Session()

    def get_headers(self):
        # Build a full set of browser-like headers with a randomized User-Agent
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0'
        }

    def get(self, url, **kwargs):
        # Fall back to spoofed headers and a 10-second timeout unless the caller overrides them
        kwargs.setdefault('headers', self.get_headers())
        kwargs.setdefault('timeout', 10)
        return self.session.get(url, **kwargs)

This wrapper keeps one persistent session, attaches a full set of browser-like headers with a randomized User-Agent to every request, and applies a sensible default timeout.
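
To show how this fits into a real script, here's a minimal usage sketch; the product-listing URL is only a placeholder:

client = CustomRequests()
# The URL below is a placeholder, not a real target
response = client.get('https://www.example.com/products?page=1')
if response.status_code == 200:
    print(response.text[:200])  # Preview the start of the page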

Proxy Strategy

Speaking of proxy IPs, this is a topic worth exploring in depth. I've seen too many people frustrated by cheap proxies. Low-cost proxy IPs often have poor stability, slow response times, and may even leak your real IP.

In my view, proxy IP selection should consider three key factors: stability, response speed, and geographic distribution. A good proxy pool should include high-quality proxies from different regions and automatically test proxy availability.

Here's a proxy pool manager I often use:

import time
import asyncio
import aiohttp
from typing import Dict, Optional

class ProxyPool:
    def __init__(self):
        self.proxies: Dict[str, dict] = {}
        self.check_interval = 300  # Re-check proxies every 5 minutes

    async def check_proxy(self, proxy: str):
        # Probe one proxy and record whether it works and how fast it responds
        async with aiohttp.ClientSession() as session:
            try:
                start = time.time()
                async with session.get('http://www.example.com',
                                       proxy=proxy,
                                       timeout=aiohttp.ClientTimeout(total=5)) as response:
                    if response.status == 200:
                        speed = time.time() - start
                        self.proxies[proxy].update({
                            'available': True,
                            'speed': speed,
                            'last_check': time.time()
                        })
                    else:
                        self.proxies[proxy]['available'] = False
            except (aiohttp.ClientError, asyncio.TimeoutError):
                self.proxies[proxy]['available'] = False

    async def check_all_proxies(self):
        # Validate every proxy in the pool concurrently
        tasks = [self.check_proxy(proxy) for proxy in self.proxies]
        await asyncio.gather(*tasks)

    def get_fastest_proxy(self) -> Optional[str]:
        # Return the available proxy with the lowest measured latency, or None
        available_proxies = {k: v for k, v in self.proxies.items()
                             if v['available']}
        if not available_proxies:
            return None
        return min(available_proxies.items(),
                   key=lambda x: x[1]['speed'])[0]

The pool checks every proxy concurrently, records each one's availability and measured response time, and can always hand back the fastest proxy that is currently working.
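
For context, here's roughly how I wire the pool into a script. The proxy addresses are placeholders, and each entry is seeded with a default status record before the first check, since the class above has no separate add method:

# Usage sketch; the proxy addresses below are placeholders
pool = ProxyPool()
for addr in ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']:
    pool.proxies[addr] = {'available': False, 'speed': float('inf'), 'last_check': 0}

asyncio.run(pool.check_all_proxies())
print('Fastest working proxy:', pool.get_fastest_proxy())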

Distributed Architecture

As data volume grows, single-machine scrapers quickly hit bottlenecks. This is where distributed scraping comes in handy. I usually pair Scrapy with Redis to spread the crawl across multiple machines.

Here's a simplified distributed scraping framework:

from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class DistributedSpider(RedisCrawlSpider):
    # RedisCrawlSpider reads start URLs from Redis and still supports CrawlSpider rules
    name = 'distributed_spider'
    redis_key = 'spider:urls'

    custom_settings = {
        # Share the scheduler queue and the duplicate filter through Redis
        'SCHEDULER': "scrapy_redis.scheduler.Scheduler",
        'DUPEFILTER_CLASS': "scrapy_redis.dupefilter.RFPDupeFilter",
        'REDIS_HOST': 'localhost',
        'REDIS_PORT': 6379,
        'CONCURRENT_REQUESTS': 32,
        'DOWNLOAD_DELAY': 1,
    }

    rules = (
        Rule(
            LinkExtractor(allow=r'item/\d+'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        # Extract a few basic fields from each matched page
        item = {}
        item['url'] = response.url
        item['title'] = response.css('h1::text').get()
        item['content'] = response.css('article::text').getall()
        return item

Every worker running this spider pulls its start URLs from the shared spider:urls key in Redis, while the Redis-backed scheduler and duplicate filter make sure the nodes never crawl the same page twice.
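
To feed the cluster, each seed URL just needs to be pushed onto the spider:urls list in Redis, and whichever worker is idle picks it up. Here's a quick sketch using the redis Python client; the seed URL is a placeholder:

import redis

# Push a seed URL onto the shared queue that every worker watches
r = redis.Redis(host='localhost', port=6379)
r.lpush('spider:urls', 'https://www.example.com/item/12345')

You can then start the same spider on as many machines as you like, and they will coordinate through the shared scheduler and duplicate filter.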

Data Storage

Regarding data storage, I think many people rely too heavily on relational databases. For web scraping, document databases like MongoDB are often a better choice. Their schema-less nature is particularly suitable for storing web page data with varying structures.

Here's my commonly used data storage module:

from pymongo import MongoClient
from datetime import datetime

class DataStorage:
    def __init__(self):
        self.client = MongoClient('mongodb://localhost:27017/')
        self.db = self.client['crawler_db']

    def save_item(self, collection: str, item: dict):
        # Timestamp the document before inserting it
        item['created_at'] = datetime.now()
        result = self.db[collection].insert_one(item)
        return result.inserted_id

    def batch_save(self, collection: str, items: list):
        # Insert many documents in one round trip; much faster than one-by-one saves
        if not items:
            return []

        for item in items:
            item['created_at'] = datetime.now()

        result = self.db[collection].insert_many(items)
        return result.inserted_ids

    def find_items(self, collection: str, query: dict):
        # Return a cursor over documents matching the query
        return self.db[collection].find(query)

The module timestamps every document on insert and supports single inserts, batched inserts, and simple queries, which works well for scraped items whose structure varies from page to page.
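
Here's how it looks with a scraped item; the collection name and fields are just an example:

storage = DataStorage()
item = {
    'url': 'https://www.example.com/item/12345',
    'title': 'Sample product',
    'price': 199.0,
}
inserted_id = storage.save_item('products', item)
print('Saved with _id:', inserted_id)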

Practical Experience

After discussing so many technical details, I want to share some lessons learned from practical experience.

I remember once when scraping a news website, I discovered data was always missing. After investigation, I found it was because the website's content was loaded dynamically through Ajax. In such cases, you need to use Selenium or directly analyze Ajax requests to get the data.
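
In practice, calling the Ajax endpoint directly is usually the lighter option: open the browser's developer tools, find the XHR request that returns the data, and reproduce it in code. Here's a rough sketch of what that looks like; the endpoint and parameters are invented purely for illustration:

import requests

# Hypothetical JSON endpoint found in the browser's Network panel
api_url = 'https://www.example.com/api/news/list'
params = {'page': 1, 'size': 20}
headers = {'X-Requested-With': 'XMLHttpRequest'}

resp = requests.get(api_url, params=params, headers=headers, timeout=10)
for article in resp.json().get('data', []):
    print(article.get('title'))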

Additionally, many websites now use WebSocket to push data. Traditional request methods don't work well for these situations. I recommend using aiohttp or websockets libraries to interact directly with WebSocket interfaces.
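
For WebSocket feeds, a small async client is usually all you need. This sketch uses the websockets library against a made-up endpoint, just to show the shape of the code:

import asyncio
import json
import websockets

async def listen(url):
    # Connect once, then keep reading whatever the server pushes
    async with websockets.connect(url) as ws:
        async for message in ws:
            data = json.loads(message)
            print(data)

# The endpoint below is hypothetical
asyncio.run(listen('wss://www.example.com/ws/feed'))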

Future Outlook

Looking ahead, I believe web scraping technology will develop in several directions:

First is intelligence. With the development of AI technology, future scrapers may automatically learn website structures and adapt to website changes.

Second is decentralization. Blockchain-based distributed scraping systems may emerge, better solving resource scheduling and data sharing issues.

Finally, compliance. As data protection regulations in various countries improve, legal and compliant scraping will become mainstream.

Closing Thoughts

After reading all this, do you have a new understanding of Python web scraping? Actually, scraping technology isn't mysterious - the key is understanding the underlying principles and continuously accumulating experience through practice.

If you encounter problems in practice, feel free to discuss them in the comments. After all, technological progress relies on communication and sharing. What do you think?
