
Implementing a Web Crawler in PHP

2026-02-15 11:34:12 PHP

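Methods for Implementing a Web Crawler in PHP

A web crawler can be built in PHP in several ways. Each involves the same three steps: sending an HTTP request, parsing the returned HTML, and extracting the data of interest. Some common approaches follow: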

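Using file_get_contents and DOMDocument

This approach is simple and direct, and suits basic crawling needs. Fetching a URL with file_get_contents requires the allow_url_fopen setting to be enabled: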

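$url = 'https://example.com';
$html = file_get_contents($url);
if ($html === false) {
    exit("Failed to fetch $url\n");
}

// Collect markup warnings internally instead of silencing them with @
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Print the text of every <h1> element
$xpath = new DOMXPath($dom);
$titles = $xpath->query('//h1');
foreach ($titles as $title) {
    echo $title->nodeValue . "\n";
}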
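
A crawler usually needs the links on a page as well as its headings. The following is a minimal sketch of how the DOMXPath approach above could be extended to collect them. The function name extractLinks and the sample markup are assumptions made for illustration; it parses an HTML string directly, so no network access is involved:

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> elements.
function extractLinks(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so return early.
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally and restore the old setting afterwards.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[$href] = true; // array keys de-duplicate repeated links
        }
    }
    return array_keys($links);
}

// Sample markup (made up for this sketch).
$sample = '<html><body><h1>Demo</h1>'
    . '<a href="/a">A</a><a href="/b">B</a><a href="/a">A again</a>'
    . '</body></html>';
print_r(extractLinks($sample));
```

The returned values are whatever the page contains, so relative links such as /a would still need to be resolved against the page URL before they can be fetched.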

Using the cURL library

cURL offers more control and flexibility, such as redirect handling, timeouts, and custom headers:

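$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds

$html = curl_exec($ch);
if ($html === false) {
    exit('cURL error: ' . curl_error($ch) . "\n");
}
curl_close($ch);

libxml_use_internal_errors(true); // collect markup warnings instead of using @
$dom = new DOMDocument();
$dom->loadHTML($html);
// Parsing logic follows...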
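
The extra control that cURL offers can be gathered into a reusable helper. The sketch below is one possible arrangement, not a definitive implementation: the function names crawlerCurlOptions and fetchPage, the timeout values, and the user-agent string are all assumptions made for illustration. Only standard cURL options are used:

```php
<?php
// Hypothetical helper: builds the cURL options for one request.
function crawlerCurlOptions(string $url, int $timeoutSeconds = 10): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,             // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,             // follow redirects...
        CURLOPT_MAXREDIRS      => 5,                // ...but not indefinitely
        CURLOPT_CONNECTTIMEOUT => 5,                // seconds to establish a connection
        CURLOPT_TIMEOUT        => $timeoutSeconds,  // seconds for the whole request
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // placeholder identifier
    ];
}

// Hypothetical helper: fetches a page or throws on transport and HTTP errors.
function fetchPage(string $url): string
{
    $ch = curl_init();
    curl_setopt_array($ch, crawlerCurlOptions($url));

    $body = curl_exec($ch);
    $error = curl_error($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);

    if ($body === false) {
        throw new RuntimeException("Request failed: $error");
    }
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status $status for $url");
    }
    return $body;
}

// Usage (commented out so the sketch itself makes no network request):
// $html = fetchPage('https://example.com');
```

Keeping the options in a separate function makes them easy to inspect or override per site, for example to use a longer timeout for slow hosts.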

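Using a third-party library

Goutte is a popular PHP crawling library. Its maintainers have since deprecated it in favor of Symfony's HttpBrowser, which Goutte 4 wraps. Basic usage looks like this: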

require 'vendor/autoload.php'; // Composer autoloader
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');

// Select every <h1> with a CSS selector and print its text
$crawler->filter('h1')->each(function ($node) {
    echo $node->text() . "\n";
});

Handling dynamic content

Pages rendered by JavaScript may require a headless browser, for example through Symfony Panther:

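require 'vendor/autoload.php'; // Composer autoloader
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->request('GET', 'https://dynamic-site.com');

// Wait until the element exists; waitFor() returns a refreshed crawler
$crawler = $client->waitFor('h1');
echo $crawler->filter('h1')->text();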

Storing the data

Crawled data usually needs to be saved to a database or a file:

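// MySQL storage example
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); // throw on SQL errors
$stmt = $pdo->prepare('INSERT INTO pages (title, content) VALUES (?, ?)');

// Placeholder values; in practice these come from the parsing step
$title = 'Example title';
$content = 'Example content';

$stmt->execute([
    $title,
    $content
]);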
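
For the file option, one simple possibility is a JSON Lines file, with one JSON object per line. The sketch below assumes that format; the helper names appendRecord and readRecords, the temporary file location, and the sample records are all made up for illustration:

```php
<?php
// Hypothetical helper: appends one record to a JSON Lines file.
function appendRecord(string $path, array $record): void
{
    // json_encode() escapes embedded newlines, so each record stays on one line.
    $line = json_encode($record, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);

    // LOCK_EX guards against interleaved writes from parallel workers.
    if (file_put_contents($path, $line . "\n", FILE_APPEND | LOCK_EX) === false) {
        throw new RuntimeException("Could not write to $path");
    }
}

// Hypothetical helper: reads every record back as an associative array.
function readRecords(string $path): array
{
    $records = [];
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $records[] = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
    }
    return $records;
}

// Demo with a temporary file (placeholder location and sample data).
$path = tempnam(sys_get_temp_dir(), 'pages_');
appendRecord($path, ['title' => 'Example Domain', 'content' => 'First page']);
appendRecord($path, ['title' => '示例', 'content' => 'Second page']);

$stored = readRecords($path);
echo count($stored) . " records stored\n";
unlink($path); // clean up the temporary file
```

An append-only file like this suits small crawls and quick inspection; a database such as the MySQL example above is the better fit once the data needs to be queried or deduplicated.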