Preface
Have you ever had a carefully written scraper get its IP banned by the target website after running for only a short while? Or used proxy IPs only to find the collection efficiency frustratingly low? Today I'll share some practical experience and techniques accumulated over years of scraper development.
Common Mistakes
To be honest, when I first started working with web scrapers, I made quite a few mistakes. For example, blindly pursuing scraping speed only to get completely blocked by websites, or using cheap proxies that resulted in terrible data quality. These are common pitfalls for beginners.
I remember once, while scraping product data from an e-commerce site, I thought my code was flawless. But the program had been running for less than 30 minutes before the site started throwing CAPTCHA challenges at it. I was really frustrated at the time. After digging deeper, I found that the request frequency was too high and had triggered the site's anti-scraping mechanisms.
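The most direct fix for that is simply throttling the request rate. Below is a minimal sketch of randomized delays between requests; the URLs and the 2-5 second range are placeholders you would tune per site.

import time
import random
import requests

# Hypothetical list of pages to fetch; tune the delay range for the target site
urls = [f'https://example.com/page/{i}' for i in range(1, 6)]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    # Sleep a random 2-5 seconds so the traffic pattern looks less machine-like
    time.sleep(random.uniform(2, 5))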
Technical Breakthroughs
Speaking of anti-scraping, we must address websites' increasingly complex protection measures. To tackle these challenges, we need to master some key technologies.
First is request header spoofing. Many people think changing the User-Agent is enough, but it's far from sufficient. Modern websites check many header fields, like Accept, Accept-Encoding, Accept-Language, etc. I recommend using complete request headers from real browsers.
Here's some example code:
import requests
from fake_useragent import UserAgent


class CustomRequests:
    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()

    def get_headers(self):
        # Full browser-like headers, not just a random User-Agent
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0'
        }

    def get(self, url, **kwargs):
        # Apply default headers and timeout unless the caller overrides them
        kwargs.setdefault('headers', self.get_headers())
        kwargs.setdefault('timeout', 10)
        return self.session.get(url, **kwargs)
Proxy Strategy
Speaking of proxy IPs, this is a topic worth exploring in depth. I've seen too many people frustrated by cheap proxies. Low-cost proxy IPs often have poor stability, slow response times, and may even leak your real IP.
In my view, proxy IP selection should consider three key factors: stability, response speed, and geographic distribution. A good proxy pool should include high-quality proxies from different regions and automatically test proxy availability.
Here's the proxy pool manager I often use:
import time
import asyncio
import aiohttp
from typing import Dict, Optional


class ProxyPool:
    def __init__(self):
        self.proxies: Dict[str, dict] = {}
        self.check_interval = 300  # Re-check every 5 minutes

    async def check_proxy(self, proxy: str):
        # Measure response time through the proxy and record availability
        async with aiohttp.ClientSession() as session:
            try:
                start = time.time()
                async with session.get('http://www.example.com',
                                       proxy=proxy,
                                       timeout=aiohttp.ClientTimeout(total=5)) as response:
                    if response.status == 200:
                        speed = time.time() - start
                        self.proxies[proxy].update({
                            'available': True,
                            'speed': speed,
                            'last_check': time.time()
                        })
                    else:
                        self.proxies[proxy]['available'] = False
            except Exception:
                self.proxies[proxy]['available'] = False

    async def check_all_proxies(self):
        # Validate every proxy in the pool concurrently
        tasks = [self.check_proxy(proxy) for proxy in self.proxies]
        await asyncio.gather(*tasks)

    def get_fastest_proxy(self) -> Optional[str]:
        available_proxies = {k: v for k, v in self.proxies.items()
                             if v.get('available')}
        if not available_proxies:
            return None
        return min(available_proxies.items(),
                   key=lambda x: x[1]['speed'])[0]
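As a quick usage sketch building on the ProxyPool class above (the proxy address here is a placeholder, not a real server):

async def main():
    pool = ProxyPool()
    # Hypothetical proxy entry; a real pool would load these from a provider API or a proxy list
    pool.proxies['http://127.0.0.1:8080'] = {'available': False, 'speed': None}
    await pool.check_all_proxies()
    print(pool.get_fastest_proxy())

asyncio.run(main())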
Distributed Architecture
As data volume grows, single-machine scrapers quickly hit bottlenecks. This is where distributed scraping comes in handy. I often use Scrapy with Redis to implement distributed scraping.
Here's a simplified distributed scraping framework:
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


class DistributedSpider(RedisCrawlSpider):
    # RedisCrawlSpider is needed here: plain RedisSpider does not support crawl rules
    name = 'distributed_spider'
    redis_key = 'spider:urls'

    custom_settings = {
        'SCHEDULER': "scrapy_redis.scheduler.Scheduler",
        'DUPEFILTER_CLASS': "scrapy_redis.dupefilter.RFPDupeFilter",
        'REDIS_HOST': 'localhost',
        'REDIS_PORT': 6379,
        'CONCURRENT_REQUESTS': 32,
        'DOWNLOAD_DELAY': 1,
    }

    rules = (
        Rule(
            LinkExtractor(allow=r'item/\d+'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        item = {}
        item['url'] = response.url
        item['title'] = response.css('h1::text').get()
        item['content'] = response.css('article::text').getall()
        return item
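One detail worth noting: with scrapy_redis, the spider sits idle until seed URLs are pushed into the Redis key it listens on (spider:urls above). A minimal sketch with the redis client, assuming Redis runs locally as in the settings and using a placeholder URL:

import redis

# Seed the queue the spider reads from; the URL is just a placeholder
r = redis.Redis(host='localhost', port=6379)
r.lpush('spider:urls', 'https://example.com/item/1')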
Data Storage
Regarding data storage, I think many people rely too heavily on relational databases. For web scraping, document databases like MongoDB are often a better choice. Their schema-less nature is particularly suitable for storing web page data with varying structures.
Here's my commonly used data storage module:
from pymongo import MongoClient
from datetime import datetime


class DataStorage:
    def __init__(self):
        self.client = MongoClient('mongodb://localhost:27017/')
        self.db = self.client['crawler_db']

    def save_item(self, collection: str, item: dict):
        item['created_at'] = datetime.now()
        result = self.db[collection].insert_one(item)
        return result.inserted_id

    def batch_save(self, collection: str, items: list):
        if not items:
            return []
        # Stamp every document, then do a single bulk insert
        for item in items:
            item['created_at'] = datetime.now()
        result = self.db[collection].insert_many(items)
        return result.inserted_ids

    def find_items(self, collection: str, query: dict):
        return self.db[collection].find(query)
Practical Experience
After discussing so many technical details, I want to share some lessons learned from practical experience.
I remember once when scraping a news website, I discovered data was always missing. After investigation, I found it was because the website's content was loaded dynamically through Ajax. In such cases, you need to use Selenium or directly analyze Ajax requests to get the data.
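When the data comes from an Ajax call, hitting the endpoint directly is usually faster than rendering the page in a browser. Here's a minimal sketch, assuming a hypothetical JSON endpoint and response fields discovered through the browser's Network panel:

import requests

# Hypothetical Ajax endpoint spotted in the browser's Network tab
api_url = 'https://news.example.com/api/articles?page=1'

headers = {
    'User-Agent': 'Mozilla/5.0',
    # Many Ajax endpoints expect this header on XHR-style requests
    'X-Requested-With': 'XMLHttpRequest',
}

response = requests.get(api_url, headers=headers, timeout=10)
for article in response.json().get('articles', []):  # field names are assumptions
    print(article.get('title'))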
Additionally, many websites now push data over WebSocket, where traditional request-based methods fall short. I recommend using the aiohttp or websockets libraries to talk to the WebSocket interface directly.
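As a rough sketch with the websockets library (the endpoint URL and subscribe message are assumptions, not a real service):

import asyncio
import json
import websockets

async def listen():
    # Hypothetical WebSocket endpoint; real services usually require auth or a subscribe handshake
    uri = 'wss://stream.example.com/news'
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({'action': 'subscribe', 'channel': 'headlines'}))
        while True:
            message = await ws.recv()
            print(json.loads(message))

asyncio.run(listen())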
Future Outlook
Looking ahead, I believe web scraping technology will develop in several directions:
First is intelligence. With the development of AI technology, future scrapers may automatically learn website structures and adapt to website changes.
Second is decentralization. Blockchain-based distributed scraping systems may emerge, better solving resource scheduling and data sharing issues.
Finally, compliance. As data protection regulations in various countries improve, legal and compliant scraping will become mainstream.
Closing Thoughts
After reading all this, do you have a new understanding of Python web scraping? Actually, scraping technology isn't mysterious; the key is understanding the underlying principles and continuously accumulating experience through practice.
If you encounter problems in practice, feel free to discuss them in the comments. After all, technological progress relies on communication and sharing. What do you think?