
Implementing a Web Crawler in PHP

2026-02-28 09:36:24 PHP

Basic Methods for Implementing a Crawler in PHP

A crawler can be built in PHP in several ways, relying mainly on built-in functions or third-party libraries. Some common approaches follow:

Fetching page content with file_get_contents or cURL

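file_get_contents fetches a page in a single call (it requires allow_url_fopen to be enabled), but it offers little control over the request: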

$url = 'https://example.com';
$content = file_get_contents($url);
echo $content;

cURL allows finer control, such as setting request headers, a proxy, and timeouts:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
echo $content;
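
The request headers, proxy, and timeouts mentioned for cURL can be collected in one array and applied with curl_setopt_array. The following is a minimal sketch under assumptions: the helper names buildCurlOptions and fetchWithCurl, the default timeout values, and the redirect limit are hypothetical choices, not part of the cURL extension.

```php
<?php
// buildCurlOptions() is a hypothetical helper that gathers crawler-friendly options.
function buildCurlOptions(string $url, array $headers = [], ?string $proxy = null, int $timeout = 10): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,        // assumed limit, adjust as needed
        CURLOPT_CONNECTTIMEOUT => 5,        // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole transfer
        CURLOPT_HTTPHEADER     => $headers, // e.g. ['Accept: text/html']
    ];
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;   // e.g. 'http://127.0.0.1:8080'
    }
    return $options;
}

// fetchWithCurl() is a hypothetical wrapper that turns a failed transfer into an exception.
function fetchWithCurl(string $url, array $headers = [], ?string $proxy = null): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $headers, $proxy));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body; // since PHP 8.0 the handle is released when $ch goes out of scope
}
```

Keeping the option array separate from the transfer makes the configuration easy to inspect or reuse across requests.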

Parsing HTML with DOMDocument or SimpleHTMLDom

PHP's built-in DOMDocument can parse HTML and extract data:

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect malformed-HTML warnings instead of hiding them with @
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//div[@class="target"]');
foreach ($elements as $element) {
    echo $element->nodeValue;
}
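
One detail of the XPath query: @class="target" matches only elements whose class attribute is exactly "target", so an element with class="box target" is skipped. A minimal sketch of a token-based match follows; the helper name extractByClass and the sample HTML are hypothetical, and $class is assumed to be a plain class name without quotes.

```php
<?php
// extractByClass() is a hypothetical helper: it returns the trimmed text of every
// element whose class attribute contains $class as a whole token.
function extractByClass(string $html, string $class): array
{
    $previous = libxml_use_internal_errors(true); // silence warnings for imperfect HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Pad the class list with spaces so " target " cannot match "targeted".
    $query = sprintf('//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);

    $texts = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$html = '<div class="target">A</div><div class="box target">B</div><div class="targeted">C</div>';
print_r(extractByClass($html, 'target')); // contains 'A' and 'B', but not 'C'
```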

The third-party library SimpleHTMLDom offers a more concise syntax:

include 'simple_html_dom.php';
$html = file_get_html('https://example.com');
foreach ($html->find('div.target') as $element) {
    echo $element->innertext;
}

Sending HTTP requests with GuzzleHTTP

Guzzle is a popular PHP HTTP client that suits more complex crawling needs:

require 'vendor/autoload.php';
$client = new GuzzleHttp\Client();
$response = $client->request('GET', 'https://example.com');
echo $response->getBody();

Handling dynamically loaded content

For dynamically rendered pages (such as content generated by JavaScript), a headless browser tool can be used:

Using Symfony Panther

Symfony Panther is a PHP library built on ChromeDriver that supports scraping dynamic content:

require 'vendor/autoload.php';
$client = \Symfony\Component\Panther\Client::createChromeClient();
$client->request('GET', 'https://example.com');
$crawler = $client->waitFor('.dynamic-content');
echo $crawler->filter('.dynamic-content')->text();

Data storage and deduplication

Scraped data usually needs to be stored in a database or in files:

Storing in a MySQL database

$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');
$stmt = $pdo->prepare('INSERT INTO pages (url, content) VALUES (?, ?)');
$stmt->execute([$url, $content]);

Deduplicating with a Bloom filter

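A Bloom filter is a space-efficient probabilistic data structure for deduplication. It may occasionally report an unseen URL as already seen (a false positive), but it never misses a URL that was added: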

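// BloomFilter is not built into PHP; this assumes a third-party class whose
// constructor takes (expected item count, false-positive rate).
$filter = new BloomFilter(100000, 0.01);
if (!$filter->has($url)) {
    $filter->add($url);
    // Crawl logic goes here
}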
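
PHP has no built-in BloomFilter class, so the has/add interface used here has to come from a library or from your own code. As a rough illustration of what such a class does internally, here is a minimal sketch. SimpleBloomFilter is a hypothetical name, the crc32/md5 double-hashing scheme is just one reasonable choice, and the code assumes 64-bit PHP 7.4 or later.

```php
<?php
// SimpleBloomFilter is a hypothetical, minimal implementation for illustration only.
final class SimpleBloomFilter
{
    private string $bits; // bit array packed into a binary string
    private int $size;    // number of bits (m)
    private int $hashes;  // number of hash functions (k)

    public function __construct(int $expectedItems, float $falsePositiveRate)
    {
        // Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2
        $this->size = (int) ceil(-$expectedItems * log($falsePositiveRate) / (M_LN2 ** 2));
        $this->hashes = max(1, (int) round($this->size / $expectedItems * M_LN2));
        $this->bits = str_repeat("\0", intdiv($this->size + 7, 8));
    }

    /** @return int[] the k bit positions for $item, derived by double hashing */
    private function positions(string $item): array
    {
        $h1 = crc32($item);
        $h2 = hexdec(substr(md5($item), 0, 8)) | 1; // force an odd step size
        $positions = [];
        for ($i = 0; $i < $this->hashes; $i++) {
            $positions[] = ($h1 + $i * $h2) % $this->size;
        }
        return $positions;
    }

    public function add(string $item): void
    {
        foreach ($this->positions($item) as $p) {
            $byte = intdiv($p, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($p % 8)));
        }
    }

    public function has(string $item): bool
    {
        foreach ($this->positions($item) as $p) {
            if ((ord($this->bits[intdiv($p, 8)]) & (1 << ($p % 8))) === 0) {
                return false; // at least one bit is unset: definitely never added
            }
        }
        return true; // all bits set: probably added (false positives are possible)
    }
}

$filter = new SimpleBloomFilter(100000, 0.01);
$filter->add('https://example.com/a');
var_dump($filter->has('https://example.com/a')); // bool(true): added items are always found
```

With 100000 expected items and a 1% false-positive rate, the sizing formulas give roughly 958,506 bits (about 117 KB) and 7 hash functions, far less memory than storing every URL in an array.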

Respecting robots.txt and applicable laws

A crawler should follow the target site's robots.txt rules and avoid high-frequency requests that could get it blocked:

$robotsTxt = file_get_contents('https://example.com/robots.txt');
// Naive check: a plain substring match ignores User-agent groups and Allow rules.
if (strpos($robotsTxt, 'Disallow: /target-path') === false) {
    // Crawling is allowed
}
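
A substring check on the raw robots.txt text does not take User-agent groups into account. The sketch below goes one step further, under stated simplifications: it collects only Disallow rules for the matching user agent (or *), treats each rule as a plain path prefix, and ignores Allow rules and wildcards. The helper names parseDisallowRules and isPathAllowed and the sample file contents are hypothetical.

```php
<?php
// parseDisallowRules() is a hypothetical helper implementing a simplified reading
// of robots.txt: only Disallow lines in groups that apply to $userAgent are kept.
function parseDisallowRules(string $robotsTxt, string $userAgent = '*'): array
{
    $rules = [];
    $applies = false;      // does the current group apply to us?
    $lastWasAgent = false; // was the previous directive a User-agent line?
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $matches = ($value === '*' || strcasecmp($value, $userAgent) === 0);
            // Consecutive User-agent lines belong to the same group.
            $applies = $lastWasAgent ? ($applies || $matches) : $matches;
            $lastWasAgent = true;
            continue;
        }
        $lastWasAgent = false;
        if ($field === 'disallow' && $applies && $value !== '') {
            $rules[] = $value;
        }
    }
    return $rules;
}

// isPathAllowed() is a hypothetical helper: a path is blocked if it starts with any rule.
function isPathAllowed(string $path, array $disallowRules): bool
{
    foreach ($disallowRules as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}

$robotsTxt = "User-agent: *\nDisallow: /target-path\nDisallow: /private/ # internal\n";
$rules = parseDisallowRules($robotsTxt);
var_dump(isPathAllowed('/target-path/page', $rules)); // bool(false)
var_dump(isPathAllowed('/public', $rules));           // bool(true)
```

For production use, a maintained robots.txt parser that implements the full matching rules is the safer choice.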

Exception handling and logging

Add exception handling and logging to improve stability:

try {
    $content = file_get_contents($url);
    if ($content === false) {
        throw new Exception('Failed to fetch URL');
    }
} catch (Exception $e) {
    // Append a newline so successive log entries do not run together
    file_put_contents('crawler.log', $e->getMessage() . PHP_EOL, FILE_APPEND);
}
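
Transient network failures are common when crawling, so one way to build on this pattern is to retry with a growing delay and log each failure. The following is a minimal sketch under assumptions: fetchWithRetry, its parameters, and the log line format are hypothetical, and the fetcher is passed in as a callable so that any of the fetching methods can be plugged in.

```php
<?php
// fetchWithRetry() is a hypothetical helper: it calls $fetch($url) up to $maxAttempts
// times, logs every failure, and waits longer between attempts (exponential backoff).
function fetchWithRetry(
    callable $fetch,
    string $url,
    int $maxAttempts = 3,
    string $logFile = 'crawler.log',
    int $baseDelayMs = 500
): ?string {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $content = $fetch($url);
            if ($content === false) {
                throw new RuntimeException('Failed to fetch URL');
            }
            return $content;
        } catch (Throwable $e) {
            $line = sprintf(
                "[%s] attempt %d/%d %s: %s\n",
                date('c'), $attempt, $maxAttempts, $url, $e->getMessage()
            );
            file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
            if ($attempt < $maxAttempts) {
                // Delay doubles each time: base, 2x base, 4x base, ...
                usleep($baseDelayMs * 1000 * 2 ** ($attempt - 1));
            }
        }
    }
    return null; // every attempt failed
}
```

Passing `fn (string $u) => @file_get_contents($u)` as `$fetch` reproduces the file_get_contents call used in this section; the `@` hides the PHP warning, and the failure is still detected through the `false` return value.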

Distributed crawler architecture

For large-scale crawling, a message queue (such as RabbitMQ) can distribute tasks:

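require 'vendor/autoload.php'; // assumes php-amqplib is installed via Composer
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('crawl_tasks', false, true, false, false); // third argument: durable queue
$channel->basic_publish(new AMQPMessage($url), '', 'crawl_tasks');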

With the methods above, you can build a full-featured PHP crawler that handles both static and dynamic pages.
