php curl 实现爬虫
PHP cURL 实现爬虫的方法
使用 PHP 的 cURL 库可以高效地实现网页爬虫功能。以下是具体实现步骤和注意事项:
基本 cURL 请求
初始化 cURL 会话并获取网页内容:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
设置用户代理
模拟浏览器访问避免被屏蔽:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
处理重定向
自动跟随重定向:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
设置超时时间
防止长时间等待:
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
处理 HTTPS
忽略 SSL 证书验证(测试用):
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
获取请求信息
获取详细请求信息:
$info = curl_getinfo($ch);
echo 'HTTP Code: ' . $info['http_code'];
echo 'Total Time: ' . $info['total_time'];
POST 请求示例
发送 POST 请求:
$postData = [
'username' => 'test',
'password' => '123456'
];
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
处理 Cookie
保存和发送 Cookie:
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
处理压缩内容
自动解压 gzip 内容:
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
多线程处理
使用 curl_multi 实现并发请求:
$mh = curl_multi_init();
$handles = [];
for ($i = 0; $i < 5; $i++) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://example.com/page/$i");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($mh, $ch);
$handles[] = $ch;
}
do {
curl_multi_exec($mh, $running);
curl_multi_select($mh);
} while ($running > 0);
foreach ($handles as $ch) {
$content = curl_multi_getcontent($ch);
// 处理内容
curl_multi_remove_handle($mh, $ch);
curl_close($ch);
}
curl_multi_close($mh);
使用代理
设置代理服务器:
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888');
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
解析 HTML 内容
结合 DOMDocument 解析 HTML:
$dom = new DOMDocument();
@$dom->loadHTML($output);
$xpath = new DOMXPath($dom);
$titles = $xpath->query('//h1');
foreach ($titles as $title) {
echo $title->nodeValue;
}
错误处理
检查 cURL 错误:
if (curl_errno($ch)) {
echo 'cURL Error: ' . curl_error($ch);
}
性能优化建议
- 复用 cURL 句柄减少连接开销
- 合理设置并发数量避免被封禁
- 使用缓存减少重复请求
- 遵守 robots.txt 协议
- 设置合理的请求间隔
完整示例代码
function fetchUrl($url) {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
CURLOPT_TIMEOUT => 30,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_ENCODING => 'gzip'
]);
$result = curl_exec($ch);
if (curl_errno($ch)) {
throw new Exception(curl_error($ch));
}
curl_close($ch);
return $result;
}
try {
$html = fetchUrl('https://example.com');
// 处理获取的HTML内容
} catch (Exception $e) {
echo 'Error: ' . $e->getMessage();
}
以上方法涵盖了 PHP cURL 实现爬虫的主要功能和技术要点,可根据实际需求进行调整和扩展。







