PHP crawler implementation
How to Implement a Web Crawler in PHP
A web crawler can be built in PHP in several ways. The common approaches and key techniques are outlined below.
Basic cURL usage. cURL is a powerful network-request tool in PHP and can be used to fetch page content:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of echoing it
$response = curl_exec($ch);                     // false on failure
curl_close($ch);
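Since curl_exec() returns false on failure, it is worth checking the result and the HTTP status code; a minimal sketch extending the snippet above:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$response = curl_exec($ch);
if ($response === false) {
    // curl_error() explains the failure (DNS error, timeout, ...)
    echo 'Request failed: ' . curl_error($ch);
}
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // e.g. 200, 404, 503
curl_close($ch);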
Parsing HTML with DOMDocument. Once a page has been fetched, the HTML content needs to be parsed:
$dom = new DOMDocument();
@$dom->loadHTML($response); // @ silences warnings caused by malformed real-world HTML
$xpath = new DOMXPath($dom);
$titles = $xpath->query("//h1");
foreach ($titles as $title) {
echo $title->nodeValue;
}
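The @ operator simply discards libxml's warnings about malformed markup; a sketch of the usual alternative, which buffers the warnings so they can be inspected instead:
libxml_use_internal_errors(true);   // collect parse warnings rather than emitting them
$dom = new DOMDocument();
$dom->loadHTML($response);
foreach (libxml_get_errors() as $error) {
    // log $error->message here if the warnings matter to you
}
libxml_clear_errors();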
The Guzzle HTTP client. A more modern way to make requests:
require 'vendor/autoload.php'; // Guzzle is installed via Composer (composer require guzzlehttp/guzzle)
$client = new GuzzleHttp\Client();
$res = $client->request('GET', 'https://example.com');
echo $res->getBody();
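Guzzle also accepts per-request options as a third argument; a short sketch reusing the $client from above (the timeout and User-Agent values are illustrative):
$res = $client->request('GET', 'https://example.com', [
    'timeout' => 10,                                // abort after 10 seconds
    'headers' => ['User-Agent' => 'MyCrawler/1.0'], // identify the crawler
]);
if ($res->getStatusCode() === 200) {
    $html = (string) $res->getBody();               // cast the body stream to a string
}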
Handling JavaScript-rendered pages. For content that is loaded dynamically:
- Use Panther (a Symfony-based browser-automation tool):
$client = \Symfony\Component\Panther\Client::createChromeClient();
$crawler = $client->request('GET', 'https://dynamic-site.com');
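Because the content arrives asynchronously, you normally wait for an element before reading it; a sketch continuing from the two lines above, assuming the page renders a .content element:
$client->waitFor('.content');              // block until the element appears in the DOM
echo $crawler->filter('.content')->text(); // read the rendered text
$client->quit();                           // shut the browser process down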
Data storage options. Crawled data can be stored in several ways:
// File storage
file_put_contents('data.txt', $content);
// Database storage
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare("INSERT INTO pages(content) VALUES(?)");
$stmt->execute([$content]);
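Before PHP 8.0, PDO only raised warnings on a failed query by default; switching it to exception mode makes failures explicit (PHP 8.0 and later already default to this):
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); // throw on SQL errors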
Respecting robots.txt. Check the target site's crawler policy before fetching:
$robots = file_get_contents("https://example.com/robots.txt");
if (strpos($robots, "Disallow: /admin") !== false) {
// honor the disallow rule and skip these URLs
}
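A bare strpos() check is crude: it ignores which user agent a rule targets and matches anywhere in the file. A slightly more careful sketch that tests a path against the Disallow lines in the $robots string from above (isDisallowed is a hypothetical helper, and this is still far from a full robots.txt parser):
function isDisallowed(string $robots, string $path): bool {
    foreach (explode("\n", $robots) as $line) {
        // match lines like "Disallow: /admin" and prefix-test the path
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)
            && strpos($path, $m[1]) === 0) {
            return true;
        }
    }
    return false;
}
if (isDisallowed($robots, '/admin/users')) {
    // skip this URL
}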
Throttling requests to avoid bans
// random delay of 1 to 3 seconds between requests
sleep(rand(1, 3));
Dealing with CAPTCHAs. For sites that require solving a CAPTCHA:
- Use a third-party CAPTCHA-recognition service API
- Consider a manual-intervention fallback
Rotating proxy IPs. A strategy for coping with IP blocks:
$proxies = ['1.1.1.1:8080', '2.2.2.2:3128'];
curl_setopt($ch, CURLOPT_PROXY, $proxies[array_rand($proxies)]); // pick a random proxy per request
User-agent spoofing
$userAgents = [
'Mozilla/5.0 (Windows NT 10.0)',
'Googlebot/2.1 (+http://www.google.com/bot.html)'
];
curl_setopt($ch, CURLOPT_USERAGENT, $userAgents[array_rand($userAgents)]);
A complete example skeleton
<?php
function crawlPage($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT => 30
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return []; // request failed, nothing to parse
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    // Extract the data we need, here the href of every link on the page
    $xpath = new DOMXPath($dom);
    $links = $xpath->query("//a/@href");
    $data = [];
    foreach ($links as $link) {
        $data[] = $link->nodeValue;
    }
    return $data;
}
$results = crawlPage("https://example.com");
print_r($results);
Caveats
- Comply with the target site's terms of service
- Keep the request rate low to avoid burdening the server
- Handle the various HTTP status codes (404, 503, and so on)
- Consider a queue system to manage the URLs waiting to be crawled
- Implement de-duplication so the same page is not crawled twice (a minimal sketch follows this list)
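The last two points fit in a few lines; a minimal sketch using the crawlPage() function from the example above (it assumes the extracted hrefs are absolute URLs, which a real crawler would have to normalize):
$queue   = ['https://example.com'];  // URLs waiting to be crawled
$visited = [];                       // URLs already processed
while ($url = array_shift($queue)) {
    if (isset($visited[$url])) {
        continue;                    // de-duplication: skip what we have seen
    }
    $visited[$url] = true;
    foreach (crawlPage($url) as $link) {
        if (!isset($visited[$link])) {
            $queue[] = $link;        // enqueue newly discovered links
        }
    }
    sleep(rand(1, 3));               // throttle between requests
}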
For large-scale crawling projects, consider a dedicated framework such as:
- Scrapy (Python)
- Apify (JavaScript)
- or an off-the-shelf PHP crawler library such as PHPCrawl