Implementing a Web Crawler in PHP

2026-01-29 12:22:18 PHP

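Basic Methods for Building a Crawler in PHP

PHP offers several ways to build a web crawler. The sections below cover the common approaches and tools.

Fetching Page Content with cURL

cURL is PHP's general-purpose extension for sending HTTP requests, and it is well suited to fetching page content.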

$url = "https://example.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
echo $response;

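Setting CURLOPT_RETURNTRANSFER to true makes curl_exec() return the response as a string instead of printing it directly. HTTPS URLs work without extra options, because cURL verifies the server certificate by default. The option below turns that verification off. It must be set before curl_exec() is called, and because it exposes the request to man-in-the-middle attacks, it is only appropriate for local testing against a self-signed certificate:

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);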
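
The basic request above does not check whether the transfer succeeded. The following is a minimal sketch of a more defensive fetch. The helper name fetchPage(), the MyBot/1.0 user agent, and the timeout and redirect limits are assumptions chosen for illustration, not values from the original example.

```php
<?php
/**
 * Fetch a URL with cURL and report the outcome.
 * fetchPage() is a hypothetical helper name used for illustration.
 *
 * @return array{status:int, body:?string, error:?string}
 */
function fetchPage(string $url, int $timeoutSeconds = 10): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,               // assumed limit, adjust as needed
        CURLOPT_CONNECTTIMEOUT => 5,               // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => 'MyBot/1.0',     // assumed identifier for this crawler
    ]);

    $body   = curl_exec($ch);                      // false on failure
    $error  = $body === false ? curl_error($ch) : null;
    $status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    // The handle is released when $ch goes out of scope.

    return [
        'status' => $status,
        'body'   => $body === false ? null : $body,
        'error'  => $error,
    ];
}
```

A caller would typically check that `error` is null and that `status` is in the 2xx range before parsing `body`.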

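Simplifying Requests with file_get_contents

For simple GET requests, file_get_contents is a lighter-weight option. It requires allow_url_fopen to be enabled in php.ini, and it returns false when the request fails.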

$url = "https://example.com";
$content = file_get_contents($url);
echo $content;

To pass request parameters such as a User-Agent header, build a stream context with stream_context_create:

$options = [
    'http' => [
        'method' => 'GET',
        'header' => 'User-Agent: MyBot/1.0'
    ]
];
$context = stream_context_create($options);
$content = file_get_contents($url, false, $context);
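
Because file_get_contents returns false on failure, the result should be checked before use. The sketch below wraps the context setup and the check in two hypothetical helpers, buildHttpOptions() and fetchWithStreams(); the timeout value and the ignore_errors setting are assumptions added for illustration.

```php
<?php
/**
 * Build stream-context options for an HTTP GET request.
 * buildHttpOptions() and fetchWithStreams() are hypothetical helper names.
 */
function buildHttpOptions(string $userAgent, float $timeoutSeconds = 10.0): array
{
    return [
        'http' => [
            'method'        => 'GET',
            'header'        => "User-Agent: {$userAgent}\r\n",
            'timeout'       => $timeoutSeconds, // read timeout in seconds
            'ignore_errors' => true,            // still return the body on 4xx/5xx responses
        ],
    ];
}

/**
 * Return the content at $url, or null if it could not be read.
 */
function fetchWithStreams(string $url, string $userAgent = 'MyBot/1.0'): ?string
{
    $context = stream_context_create(buildHttpOptions($userAgent));
    // file_get_contents() raises a warning and returns false on failure;
    // the warning is suppressed here because the return value is checked.
    $content = @file_get_contents($url, false, $context);

    return $content === false ? null : $content;
}
```

With ignore_errors enabled, an error page body is returned instead of false, so a caller that cares about the status code would also need to inspect the response headers.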

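Parsing HTML Content

After fetching a page, the next step is usually to parse the HTML and extract data. PHP's DOMDocument and DOMXPath classes are the usual tools for this.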

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings from malformed HTML instead of printing them
$dom->loadHTML($response);
$xpath = new DOMXPath($dom);
$titles = $xpath->query("//h1");
foreach ($titles as $title) {
    echo $title->nodeValue . "\n";
}
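
A crawler usually needs the links on a page as well as its headings, so that it can decide what to fetch next. The sketch below applies the same DOMDocument and DOMXPath approach to collect them. The function name extractLinks() is hypothetical, and relative URLs are returned as written rather than resolved against a base URL.

```php
<?php
/**
 * Collect the href and anchor text of every link in an HTML string.
 * extractLinks() is a hypothetical helper name used for illustration.
 *
 * @return array<int, array{href:string, text:string}>
 */
function extractLinks(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true); // collect parser warnings silently
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }

    return $links;
}
```

Anchors without an href attribute are skipped by the `//a[@href]` expression.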

Using Third-Party Libraries

PHP has several third-party libraries designed for crawling that can simplify development.

Goutte

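Goutte is a crawling library built on Symfony components and suits simple scraping tasks. Note that its maintainers have deprecated the project: since version 4 it is a thin proxy to the HttpBrowser class in Symfony's BrowserKit component, which is the recommended replacement for new code.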

require 'vendor/autoload.php';
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
$crawler->filter('h1')->each(function ($node) {
    echo $node->text() . "\n";
});

Guzzle + PHP-DI

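Guzzle is an HTTP client library that handles more complex requests, such as custom headers, cookies, and concurrent requests. PHP-DI is a dependency injection container; it can supply a configured client to a larger application but plays no part in sending requests, so the example here uses Guzzle alone.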

require 'vendor/autoload.php';
use GuzzleHttp\Client;

$client = new Client();
$response = $client->get('https://example.com');
echo $response->getBody();

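Handling Dynamic Content

Dynamically loaded content, such as pages rendered by JavaScript, requires a headless browser. The example here uses the chrome-php/chrome library (namespace HeadlessChromium), which drives a locally installed Chrome or Chromium from PHP in a way similar to Puppeteer.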

require 'vendor/autoload.php';
use HeadlessChromium\BrowserFactory;

$browserFactory = new BrowserFactory();
$browser = $browserFactory->createBrowser();
$page = $browser->createPage();
$page->navigate('https://example.com')->waitForNavigation();
echo $page->evaluate('document.documentElement.outerHTML')->getReturnValue();
$browser->close();

Storing Data

Crawled data usually needs to be saved to a database or a file.

Saving to a File

file_put_contents('data.txt', $content, FILE_APPEND);
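
Appending raw page content to one text file makes it hard to tell records apart later. One possible alternative, sketched below, is to write each record as a line of JSON. The helper name appendRecord() and the record fields are assumptions for illustration.

```php
<?php
/**
 * Append one crawled record to a JSON Lines file (one JSON object per line).
 * appendRecord() and the record fields are hypothetical.
 */
function appendRecord(string $path, array $record): void
{
    $line = json_encode(
        $record,
        JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
    );

    // LOCK_EX prevents interleaved writes if several processes append at once.
    if (file_put_contents($path, $line . "\n", FILE_APPEND | LOCK_EX) === false) {
        throw new RuntimeException("Could not write to {$path}");
    }
}
```

Each line can later be read back independently with json_decode, which keeps the file usable even if the crawl stops partway through.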

Saving to a Database

Connect to the database with PDO and insert the data through a prepared statement.

$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');
$stmt = $pdo->prepare("INSERT INTO pages (content) VALUES (?)");
$stmt->execute([$content]);
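
The sketch below extends the insert with the page URL and a fetch time. It swaps in an in-memory SQLite database so that it runs without a database server; with MySQL, only the DSN and credentials would differ. The function savePage() and the url, content, and fetched_at columns are assumptions, not a schema from the original example.

```php
<?php
/**
 * Insert one page into a `pages` table and return the new row id.
 * savePage() and the column names are hypothetical.
 */
function savePage(PDO $pdo, string $url, string $content): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO pages (url, content, fetched_at) VALUES (:url, :content, :fetched_at)'
    );
    $stmt->execute([
        ':url'        => $url,
        ':content'    => $content,
        ':fetched_at' => date('Y-m-d H:i:s'),
    ]);

    return (int) $pdo->lastInsertId();
}

// In-memory SQLite is used here only to keep the sketch self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); // fail loudly on SQL errors
$pdo->exec(
    'CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, content TEXT, fetched_at TEXT)'
);

$id = savePage($pdo, 'https://example.com/', '<h1>Example Domain</h1>');
```

Named placeholders keep the statement readable as columns are added, and the prepared statement ensures page content is never interpolated into the SQL string.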

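Respecting robots.txt and the Law

When developing a crawler, always check the target site's robots.txt file and avoid crawling pages it disallows. Also comply with applicable laws and regulations, and avoid sending requests so frequently that they overload the server.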

$robotsUrl = "https://example.com/robots.txt";
$robotsContent = file_get_contents($robotsUrl);
echo $robotsContent;
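
Printing robots.txt only helps a human reader; a crawler has to act on the rules. The following is a deliberately simplified sketch of that check. It handles only Disallow prefix rules and exact user-agent names, and ignores Allow precedence, wildcards, and Crawl-delay. The function names disallowedPrefixes() and isPathAllowed() are hypothetical.

```php
<?php
/**
 * Return the Disallow prefixes that apply to a user agent.
 * Simplified: prefix rules only, exact (case-insensitive) agent names.
 * disallowedPrefixes() and isPathAllowed() are hypothetical helper names.
 *
 * @return string[]
 */
function disallowedPrefixes(string $robotsTxt, string $userAgent = '*'): array
{
    $groups = [];        // lowercased agent name => list of Disallow prefixes
    $currentAgents = []; // agents named by the group being read
    $inRules = false;    // true once the current group has seen a rule line

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) { // a User-agent line after rules starts a new group
                $currentAgents = [];
                $inRules = false;
            }
            $agent = strtolower($value);
            $currentAgents[] = $agent;
            $groups[$agent] = $groups[$agent] ?? [];
        } elseif ($field === 'disallow' || $field === 'allow') {
            $inRules = true;
            // An empty Disallow value means "nothing is disallowed".
            if ($field === 'disallow' && $value !== '') {
                foreach ($currentAgents as $agent) {
                    $groups[$agent][] = $value;
                }
            }
        }
    }

    // Prefer the group for this agent, then fall back to the "*" group.
    return $groups[strtolower($userAgent)] ?? $groups['*'] ?? [];
}

function isPathAllowed(string $robotsTxt, string $path, string $userAgent = '*'): bool
{
    foreach (disallowedPrefixes($robotsTxt, $userAgent) as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }

    return true;
}
```

For production use, a maintained robots.txt parser that implements the full matching rules would be a safer choice than this sketch.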

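Adding Delays and a Random User-Agent

To reduce the chance of being blocked by the target site, add a delay between requests and pick the User-Agent at random.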

$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15'
];
$randomUserAgent = $userAgents[array_rand($userAgents)];

$options = [
    'http' => [
        'header' => "User-Agent: $randomUserAgent"
    ]
];
$context = stream_context_create($options);
$content = file_get_contents($url, false, $context);

sleep(rand(1, 5)); // wait a random 1 to 5 seconds
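
When several pages are fetched in a row, the delay belongs between requests. The sketch below shows one way to structure that loop. The helper politeFetchAll() is hypothetical, and the actual request is passed in as a callable so that any of the fetching approaches described earlier can be plugged in.

```php
<?php
/**
 * Fetch URLs one by one, pausing a random interval between requests.
 * politeFetchAll() is a hypothetical helper; $fetch is any callable that
 * takes (string $url, string $userAgent) and returns the page body or null.
 *
 * @param string[] $urls
 * @param string[] $userAgents
 * @return array<string, ?string> page body keyed by URL
 */
function politeFetchAll(
    array $urls,
    array $userAgents,
    callable $fetch,
    int $minDelayMs = 1000,
    int $maxDelayMs = 5000
): array {
    $results = [];
    $remaining = count($urls);

    foreach ($urls as $url) {
        $userAgent = $userAgents[array_rand($userAgents)];
        $results[$url] = $fetch($url, $userAgent);

        // Sleep only between requests, not after the last one.
        if (--$remaining > 0) {
            $delayMs = random_int($minDelayMs, $maxDelayMs);
            sleep(intdiv($delayMs, 1000));    // whole seconds
            usleep(($delayMs % 1000) * 1000); // remaining milliseconds
        }
    }

    return $results;
}
```

Passing the fetcher as a callable also makes the loop easy to exercise with a stub function before pointing it at a real site.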

With the methods above, you can build a basic PHP crawler and extend it as your needs grow.