How to implement a web crawler in PHP

2026-02-16 15:59:26 PHP

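Ways to implement a crawler in PHP

PHP can implement a web crawler in several ways, mainly with built-in functions and third-party libraries. Common approaches are described below: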

Using the file_get_contents() function

This is the simplest way to fetch the content of a web page:

$url = 'https://example.com';
$html = file_get_contents($url);
echo $html;

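This requires allow_url_fopen to be set to On in php.ini. It suits simple cases, but a bare call gives no control over cookies, HTTP headers, or timeouts; those have to be supplied through a stream context or handled with cURL.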
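Custom headers and a timeout can still be passed to file_get_contents() through a stream context. The following is a minimal sketch, not a definitive implementation: the helper names buildHttpContextOptions and fetchWithContext, the MyCrawler/1.0 User-Agent string, and the 10-second timeout are all assumptions made for illustration.

```php
<?php
// Build stream-context options for an HTTP GET request.
// The header values and the timeout are illustrative assumptions.
function buildHttpContextOptions(string $userAgent, int $timeoutSeconds): array
{
    return [
        'http' => [
            'method'        => 'GET',
            'header'        => "User-Agent: {$userAgent}\r\nAccept: text/html\r\n",
            'timeout'       => $timeoutSeconds,
            'ignore_errors' => true, // still return the body on 4xx/5xx responses
        ],
    ];
}

// Fetch a URL with the given context options; returns null on failure.
function fetchWithContext(string $url, array $options): ?string
{
    $context = stream_context_create($options);
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

$options = buildHttpContextOptions('MyCrawler/1.0', 10);
// $html = fetchWithContext('https://example.com', $options);
```

Returning null instead of false lets the caller distinguish a failed request from an empty page with a simple null check.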

Using the cURL extension

cURL provides more powerful request handling:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

More options can be set before calling curl_exec():

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // set the User-Agent
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt'); // read cookies from this file
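The options above can be combined with timeouts and basic error reporting. This is a sketch under stated assumptions: fetchWithCurl is a hypothetical helper, and the timeout and redirect limits are arbitrary example values. CURLOPT_COOKIEFILE only reads cookies, so the sketch also sets CURLOPT_COOKIEJAR to write them back to the same file.

```php
<?php
// Hypothetical helper: fetch a URL and report success, status code and error text.
function fetchWithCurl(string $url, string $cookiePath): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_MAXREDIRS      => 5,           // but not endlessly (example value)
        CURLOPT_CONNECTTIMEOUT => 5,           // seconds to wait for the connection
        CURLOPT_TIMEOUT        => 15,          // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'Mozilla/5.0',
        CURLOPT_COOKIEFILE     => $cookiePath, // read cookies from this file
        CURLOPT_COOKIEJAR      => $cookiePath, // write cookies here when the handle is released
    ]);

    $body = curl_exec($ch);
    $result = [
        'ok'     => $body !== false,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'error'  => curl_error($ch),
        'body'   => $body === false ? '' : $body,
    ];
    curl_close($ch);

    return $result;
}

// Example call (commented out so the sketch makes no network request):
// $result = fetchWithCurl('https://example.com', 'cookies.txt');
// if (!$result['ok']) { echo 'Request failed: ' . $result['error']; }
```

Checking the return value of curl_exec() matters because it returns false on failure, and passing that on to an HTML parser would cause an error further down the pipeline.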

Parsing HTML with DOMDocument

Once the page content has been fetched, the HTML needs to be parsed:

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$titles = $xpath->query('//h1');
foreach ($titles as $title) {
    echo $title->nodeValue;
}
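Instead of silencing warnings with the @ operator, libxml's internal error handling can be switched on while parsing. The sketch below wraps the same //h1 query in a function; extractHeadings and the sample markup are hypothetical names and data used only for illustration.

```php
<?php
// Return the trimmed text of every <h1> element in an HTML string.
function extractHeadings(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $xpath = new DOMXPath($dom);
    $titles = [];
    foreach ($xpath->query('//h1') as $node) {
        $titles[] = trim($node->textContent);
    }

    return $titles;
}

$sample = '<html><body><h1> First </h1><p>text</p><h1>Second</h1></body></html>';
$headings = extractHeadings($sample);
```

Note that loadHTML() assumes ISO-8859-1 unless the document declares its encoding, so pages with non-ASCII text may need a charset declaration to parse correctly.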

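Using third-party libraries

Goutte has long been a popular PHP scraping library (the project is now deprecated in favor of Symfony's HttpBrowser, which it wraps):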

require 'vendor/autoload.php';
$client = new \Goutte\Client();
$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('h1')->each(function ($node) {
    echo $node->text();
});

Install it with Composer:

composer require fabpot/goutte

Handling JavaScript-rendered pages

For dynamically loaded content, Panther can be used:

$client = \Symfony\Component\Panther\Client::createChromeClient();
$crawler = $client->request('GET', 'https://example.com');
$crawler = $client->waitFor('.dynamic-content'); // wait for the element, then use the refreshed crawler
$html = $crawler->html();

Storing crawled data

Save the results to a database:

$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare("INSERT INTO pages (url, content) VALUES (?, ?)");
$stmt->execute([$url, $html]);
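A crawler often revisits the same URL, so it can help to key the table on the URL and overwrite the stored row. The sketch below swaps MySQL for an in-memory SQLite database so that it runs without a server; this assumes the pdo_sqlite extension is available. The table layout, the fetched_at column, and the savePage helper are hypothetical. On MySQL, the comparable statement would be INSERT ... ON DUPLICATE KEY UPDATE.

```php
<?php
// Hypothetical helper: insert a page, replacing any earlier row for the same URL.
function savePage(PDO $pdo, string $url, string $content): void
{
    $stmt = $pdo->prepare(
        'INSERT OR REPLACE INTO pages (url, content, fetched_at) VALUES (?, ?, ?)'
    );
    $stmt->execute([$url, $content, date('c')]);
}

// In-memory SQLite database, used here only to keep the example self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); // throw on SQL errors
$pdo->exec(
    'CREATE TABLE pages (
        url TEXT PRIMARY KEY,
        content TEXT NOT NULL,
        fetched_at TEXT NOT NULL
    )'
);

// Saving the same URL twice keeps a single, updated row.
savePage($pdo, 'https://example.com', '<h1>old</h1>');
savePage($pdo, 'https://example.com', '<h1>new</h1>');

$count = (int) $pdo->query('SELECT COUNT(*) FROM pages')->fetchColumn();
$content = $pdo->query('SELECT content FROM pages')->fetchColumn();
```

Setting PDO::ERRMODE_EXCEPTION makes a failed insert raise an exception rather than pass unnoticed, which is useful in a long-running crawl.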

Respecting robots.txt

Check the target site's crawling rules:

$robotsUrl = 'https://example.com/robots.txt';
$robotsContent = file_get_contents($robotsUrl);
if (strpos($robotsContent, 'Disallow: /private/') !== false) {
    // /private/ is disallowed: skip URLs under this path
}
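A plain strpos() check only detects one hard-coded rule. The sketch below reads the Disallow lines that apply to a crawler and tests a path against them. It is deliberately simplified and is not a complete robots.txt parser: it ignores Allow lines and wildcard patterns, and it merges the rules of every matching User-agent group instead of picking the most specific one. The function names and the MyCrawler agent name are assumptions.

```php
<?php
// Collect the Disallow prefixes that apply to $userAgent (or to "*").
function parseDisallowRules(string $robotsTxt, string $userAgent = '*'): array
{
    $rules = [];
    $applies = false;      // does the current group apply to this crawler?
    $inAgentLines = false; // consecutive User-agent lines form one group

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $applies = false; // a new group starts here
            }
            $inAgentLines = true;
            if ($value === '*' || ($value !== '' && stripos($userAgent, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $applies && $value !== '') {
                $rules[] = $value;
            }
        }
    }

    return $rules;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
function isPathAllowed(string $path, array $disallowRules): bool
{
    foreach ($disallowRules as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

$robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/ # scratch\n\n"
        . "User-agent: BadBot\nDisallow: /\n";
$rules = parseDisallowRules($robots, 'MyCrawler');
```

With this sample, isPathAllowed('/private/data.html', $rules) is false while isPathAllowed('/blog/post', $rules) is true, because the blanket rule for BadBot does not apply to MyCrawler.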

Adding delays to avoid being blocked

sleep(rand(1, 3)); // random delay of 1 to 3 seconds
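When a crawl covers several sites, a fixed sleep slows every request, including those to hosts that have not been contacted recently. One possible refinement, sketched here with hypothetical helper names and an arbitrary 2-second interval, is to track the last request time per host and wait only for the remainder.

```php
<?php
// Seconds still to wait before $host may be requested again.
// $lastRequestAt maps host names to the timestamp of their last request.
function secondsToWait(array $lastRequestAt, string $host, float $minInterval, float $now): float
{
    if (!isset($lastRequestAt[$host])) {
        return 0.0; // first request to this host: no wait needed
    }
    $elapsed = $now - $lastRequestAt[$host];
    return max(0.0, $minInterval - $elapsed);
}

// Sleep if needed, then record the request time for the URL's host.
function throttle(array &$lastRequestAt, string $url, float $minInterval): void
{
    $host = (string) parse_url($url, PHP_URL_HOST);
    $wait = secondsToWait($lastRequestAt, $host, $minInterval, microtime(true));
    if ($wait > 0) {
        usleep((int) round($wait * 1000000)); // usleep() takes microseconds
    }
    $lastRequestAt[$host] = microtime(true);
}

// Example use inside a crawl loop (commented out so the sketch does not sleep):
// $lastRequestAt = [];
// foreach ($urls as $url) {
//     throttle($lastRequestAt, $url, 2.0);
//     // ... fetch $url here ...
// }
```

Keeping the waiting-time calculation in its own function, with the current time passed in, makes the logic easy to check without actually sleeping.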

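Handling CAPTCHAs

When a CAPTCHA appears, the options include:

Using a commercial CAPTCHA-solving service
Entering the CAPTCHA manually
Lowering the request rate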

Complete example

A complete example that combines cURL with DOM parsing:

function crawlPage($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT => 'Mozilla/5.0'
    ]);
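    $html = curl_exec($ch);
    curl_close($ch);

    // curl_exec() returns false on failure, and an empty body cannot be parsed
    if ($html === false || $html === '') {
        return [];
    }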

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $data = [];
    $links = $xpath->query('//a/@href');
    foreach ($links as $link) {
        $data[] = $link->nodeValue;
    }

    return $data;
}
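The href values returned by crawlPage() are raw: they can include duplicates, fragments, and non-HTTP schemes such as mailto:. A small clean-up step is sketched below. The cleanLinks helper is hypothetical, and it simply drops relative links instead of resolving them against the page URL.

```php
<?php
// Keep unique absolute http(s) links, with any #fragment removed.
function cleanLinks(array $links): array
{
    $clean = [];
    foreach ($links as $link) {
        // Trim whitespace and strip the fragment part.
        $link = preg_replace('/#.*$/', '', trim($link));
        if ($link === '') {
            continue;
        }
        $scheme = strtolower((string) parse_url($link, PHP_URL_SCHEME));
        if (!in_array($scheme, ['http', 'https'], true)) {
            continue; // skips mailto:, javascript:, and relative links
        }
        $clean[$link] = true; // array keys give de-duplication for free
    }

    return array_keys($clean);
}

$raw = [
    'https://example.com/a',
    'https://example.com/a#top',
    'mailto:me@example.com',
    'javascript:void(0)',
    '/relative',
    ' http://example.com/b ',
];
$links = cleanLinks($raw);
```

The result can then be fed back into a queue of URLs to visit, for example with cleanLinks(crawlPage($url)).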

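The methods above range from simple to complex ways of building a crawler in PHP; choose whichever fits the task. Follow the target site's terms of use and keep the crawl rate reasonable so that it does not overload the target server.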