
Implementing a Web Crawler in PHP

2026-02-14 10:28:03 | PHP

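Ways to Implement a Crawler in PHP

A crawler can be written in PHP with built-in functions, third-party libraries, or frameworks. Several common approaches follow.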

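Using file_get_contents and regular expressions

file_get_contents fetches the page content, and a regular expression then extracts the data. This suits simple scraping tasks. Fetching a URL this way requires allow_url_fopen to be enabled.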

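$url = 'https://example.com';
// file_get_contents returns false on failure, so check before parsing.
$html = file_get_contents($url);
if ($html === false) {
    exit("Failed to fetch $url\n");
}

// Capture the inner text of every <h1>...</h1> pair.
preg_match_all('/<h1>(.*?)<\/h1>/', $html, $matches);
print_r($matches[1]);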
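
The bare pattern above only matches `<h1>` tags that carry no attributes and contain no line breaks. A slightly more tolerant sketch follows. The helper name extractHeadings is hypothetical, and regular expressions remain a fragile way to parse HTML in general.

```php
<?php

/**
 * Extract the text of every <h1> element from an HTML string.
 * extractHeadings() is an illustrative helper, not part of PHP.
 */
function extractHeadings(string $html): array
{
    // [^>]* tolerates attributes; "i" ignores case; "s" lets "." span newlines.
    if (preg_match_all('/<h1\b[^>]*>(.*?)<\/h1>/is', $html, $matches) === false) {
        return [];
    }

    // Drop nested tags, decode entities, and trim surrounding whitespace.
    return array_map(
        static fn (string $inner): string =>
            trim(html_entity_decode(strip_tags($inner), ENT_QUOTES | ENT_HTML5, 'UTF-8')),
        $matches[1]
    );
}

// Sample markup with an attribute, a nested tag, and an uppercase tag name.
$sample = "<h1 class=\"title\">Hello <em>World</em></h1><H1>Tom &amp; Jerry</H1>";
print_r(extractHeadings($sample));
```

Keeping the extraction in a function that takes a string makes it easy to try against saved HTML without touching the network.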

Using the cURL extension

cURL offers more flexible HTTP requests, including custom request headers and POST data.

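$url = 'https://example.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
$html = curl_exec($ch);
if ($html === false) {
    // curl_exec returns false on failure; curl_error explains why.
    exit('cURL error: ' . curl_error($ch) . "\n");
}
curl_close($ch);

// Capture the href value of each <a> tag, even when other attributes are present.
preg_match_all('/<a\s[^>]*?href="([^"]*)"/i', $html, $links);
print_r($links[1]);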
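
The description mentions request headers and POST data, which the snippet above does not show. Here is a minimal sketch, assuming two hypothetical helpers, buildCurlOptions() and fetchUrl(); the timeout value is an arbitrary choice for illustration.

```php
<?php

/**
 * Build a cURL option array for a GET or POST request.
 * buildCurlOptions() is a hypothetical helper; the CURLOPT_* constants are real.
 */
function buildCurlOptions(string $url, array $headers = [], ?array $postFields = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,       // give up after 10 seconds (arbitrary)
        CURLOPT_HTTPHEADER     => $headers, // e.g. ['User-Agent: MyCrawler/1.0']
    ];

    if ($postFields !== null) {
        // A URL-encoded string body is sent as application/x-www-form-urlencoded.
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }

    return $options;
}

/** Perform the request and return the body, or throw on a transport error. */
function fetchUrl(string $url, array $headers = [], ?array $postFields = null): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $headers, $postFields));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}
```

A call such as fetchUrl('https://example.com/search', ['User-Agent: MyCrawler/1.0'], ['q' => 'php']) would send a form-encoded POST; omitting the third argument sends a GET.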

Using DOMDocument and DOMXPath

DOMDocument parses the HTML into a tree, and XPath queries then select nodes more precisely than a regular expression can.

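$url = 'https://example.com';
$html = file_get_contents($url);

// Collect parser warnings about malformed HTML internally instead of using @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Print the text of every <h1> element.
$titles = $xpath->query('//h1');
foreach ($titles as $title) {
    echo $title->nodeValue . "\n";
}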
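
XPath can also filter on attributes, which is where it is most useful compared with regular expressions. A sketch, assuming a hypothetical helper extractLinks() that returns the href and visible text of each link:

```php
<?php

/**
 * Return the href and text of every <a> element that has an href attribute.
 * extractLinks() is an illustrative helper; $html must be a non-empty string.
 */
function extractLinks(string $html): array
{
    // Buffer libxml warnings about imperfect markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // "//a[@href]" skips anchors that have no href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

$sample = '<p><a href="/docs" class="nav"> Docs </a>'
        . '<a name="top">anchor</a>'
        . '<a href="https://example.com">Home</a></p>';
print_r(extractLinks($sample));
```

The returned href values are left exactly as they appear in the markup; resolving relative paths against the page URL would be a separate step.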

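Using a third-party library (such as Goutte)

Goutte is a PHP scraping library built on Symfony components that simplifies fetching and parsing pages. Goutte has since been deprecated in favor of the HttpBrowser class from Symfony's BrowserKit component (Symfony\Component\BrowserKit\HttpBrowser), which new projects may prefer.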

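// Requires: composer require fabpot/goutte
require 'vendor/autoload.php';
use Goutte\Client;

$client = new Client();
// request() returns a Symfony DomCrawler object for the fetched page.
$crawler = $client->request('GET', 'https://example.com');

// filter() takes a CSS selector; each() runs the callback on every match.
$crawler->filter('h1')->each(function ($node) {
    echo $node->text() . "\n";
});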

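Handling JavaScript-rendered pages

PHP's HTTP functions do not execute a page's scripts, so dynamically loaded content needs a headless browser such as Puppeteer (a Node.js tool) or an external rendering service called over its API.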

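// Call Puppeteer or another headless-browser service over HTTP.
$url = 'https://example.com';
$apiKey = 'YOUR_API_KEY';
// http_build_query URL-encodes the target URL so its own "?" and "&" survive.
$query = http_build_query(['url' => $url, 'key' => $apiKey]);
$response = file_get_contents("https://api.headlessbrowser.com/?$query");
$data = json_decode($response, true);
echo $data['content'];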
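
The snippet above assumes the request succeeds and that the response is JSON with a `content` field. A hedged sketch of more defensive handling follows. The endpoint and the `content` field are taken from the snippet above and are not verified against any real service, and buildRenderUrl() and parseRenderedContent() are hypothetical helpers.

```php
<?php

/** Build the request URL with every query parameter URL-encoded. */
function buildRenderUrl(string $endpoint, string $targetUrl, string $apiKey): string
{
    return $endpoint . '?' . http_build_query(['url' => $targetUrl, 'key' => $apiKey]);
}

/** Decode the JSON body and return its "content" field, or throw if it is unusable. */
function parseRenderedContent(string $json): string
{
    // JSON_THROW_ON_ERROR raises JsonException instead of silently returning null.
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($data) || !isset($data['content']) || !is_string($data['content'])) {
        throw new UnexpectedValueException('Response has no string "content" field');
    }
    return $data['content'];
}

// The endpoint and the response body below are placeholders for illustration only.
echo buildRenderUrl(
    'https://api.headlessbrowser.com/',
    'https://example.com/?page=2&sort=asc',
    'YOUR_API_KEY'
), "\n";
echo parseRenderedContent('{"content": "<h1>Rendered</h1>"}'), "\n";
```

Separating URL construction from response parsing lets each part be checked with fixed strings before any real request is made.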