Implementing a Web Crawler in PHP

2026-03-12 18:01:11 PHP

Methods for Implementing a PHP Crawler

A crawler can be built in PHP in several ways. The sections below cover the most common methods and tools.

Fetching page content with cURL

cURL is the network request extension most commonly used in PHP, and it can fetch page content:

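$url = 'https://example.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$response = curl_exec($ch);
if ($response === false) {
    // curl_exec() returns false on failure; report the reason
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo $response;
}
curl_close($ch);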
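In practice a crawler usually also needs timeouts, redirect handling, and a check on the HTTP status code. The following is a minimal sketch of how the snippet above could be wrapped into a reusable function. The name fetchPage and its default values are assumptions made for illustration, not part of PHP or cURL.

```php
<?php
// Hypothetical helper (not a built-in): returns the response body,
// or null on a transport error or a non-2xx HTTP status.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,               // but not indefinitely
        CURLOPT_CONNECTTIMEOUT => 5,               // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole request
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    curl_close($ch);

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

A caller would then write $html = fetchPage('https://example.com'); and treat a null result as a failed request.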

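Simple fetching with file_get_contents

For simple needs, the file_get_contents function can fetch a URL directly (this requires allow_url_fopen to be enabled in php.ini):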

$url = 'https://example.com';
$html = file_get_contents($url); // returns false on failure
if ($html === false) {
    exit("Failed to fetch $url\n");
}
echo $html;

Parsing HTML with DOMDocument

Once the page content has been fetched, DOMDocument can parse it and extract data:

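// $html holds the page content fetched earlier
libxml_use_internal_errors(true); // collect parse warnings from real-world HTML instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$titles = $xpath->query('//h1');
foreach ($titles as $title) {
    echo trim($title->nodeValue), PHP_EOL;
}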
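The same DOMXPath approach can collect the links a crawler should visit next. The sketch below assumes a hypothetical helper named extractLinks and runs on an inline sample string rather than a live page. It returns href values exactly as written in the markup, so relative URLs would still need to be resolved against the page URL.

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> elements,
// skipping empty values and in-page anchors such as "#top".
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '' && $href[0] !== '#') {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

$sample = '<html><body><a href="/a">A</a><a href="/b">B</a>'
        . '<a href="/a">again</a><a href="#top">top</a></body></html>';
print_r(extractLinks($sample));
```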

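Using the third-party library Goutte

Goutte is a PHP crawling library built on Symfony components. The project is now deprecated: since version 4 it has been a thin proxy for the HttpBrowser class in symfony/browser-kit, which its maintainers recommend for new code. The example below uses the Goutte API: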

require 'vendor/autoload.php'; // Composer autoloader

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://example.com');
// Print the text of every <h1> on the page
$crawler->filter('h1')->each(function ($node) {
    echo $node->text(), PHP_EOL;
});

Handling JavaScript-rendered pages

For pages whose content only appears after JavaScript runs, Symfony Panther can drive a real browser:

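require 'vendor/autoload.php'; // Composer autoloader

use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->request('GET', 'https://example.com');
// waitFor() blocks until the element exists and returns a refreshed crawler
$crawler = $client->waitFor('h1');
echo $crawler->filter('h1')->text();
$client->quit();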

Respecting robots.txt

When developing a crawler, check the target site's robots.txt file and follow its crawling rules:

$robotsUrl = 'https://example.com/robots.txt';
$robotsContent = file_get_contents($robotsUrl);
echo $robotsContent;
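Printing the file only helps if someone reads it, so a crawler would normally check each path against the rules before requesting it. The following is a deliberately simplified sketch: the helper name isPathAllowed is hypothetical, it only honors Disallow prefixes in groups addressed to "User-agent: *", and it ignores Allow lines, wildcards, and agent-specific groups. A production crawler would need a fuller parser.

```php
<?php
// Hypothetical, simplified check: only "User-agent: *" groups and
// Disallow path prefixes are considered.
function isPathAllowed(string $robotsTxt, string $path): bool
{
    $appliesToAll = false; // are we inside a group that targets "*"?
    $lastWasAgent = false; // consecutive User-agent lines form one group
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if (!$lastWasAgent) {
                $appliesToAll = false; // a new group starts here
            }
            $appliesToAll = $appliesToAll || $value === '*';
            $lastWasAgent = true;
            continue;
        }
        $lastWasAgent = false;
        // An empty Disallow value means "nothing is disallowed"
        if ($field === 'disallow' && $appliesToAll && $value !== ''
            && strpos($path, $value) === 0) {
            return false;
        }
    }
    return true;
}

$robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp\n\n"
        . "User-agent: BadBot\nDisallow: /\n";
var_dump(isPathAllowed($robots, '/private/data.html')); // bool(false)
var_dump(isPathAllowed($robots, '/public/index.html')); // bool(true)
```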

Setting a request interval

To avoid putting excessive load on the target site, leave a reasonable interval between requests:

$urls = ['https://example.com/page1', 'https://example.com/page2'];
foreach ($urls as $url) {
    $html = file_get_contents($url);
    if ($html !== false) {
        // process the HTML content
    }
    sleep(1); // wait 1 second between requests
}
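A fixed sleep(1) adds a full second even when processing the page already took longer than that. One possible refinement, sketched below with a hypothetical Throttle class, is to sleep only for whatever remains of the minimum interval since the previous request.

```php
<?php
// Hypothetical helper: enforces a minimum gap between consecutive requests.
final class Throttle
{
    private float $minInterval;          // seconds between requests
    private float $lastRequestAt = 0.0;  // timestamp of the previous request

    public function __construct(float $minIntervalSeconds)
    {
        $this->minInterval = $minIntervalSeconds;
    }

    // Sleeps only as long as needed, so time spent processing
    // the previous page counts toward the interval.
    public function wait(): void
    {
        $elapsed = microtime(true) - $this->lastRequestAt;
        if ($elapsed < $this->minInterval) {
            usleep((int) round(($this->minInterval - $elapsed) * 1000000));
        }
        $this->lastRequestAt = microtime(true);
    }
}
```

Inside the loop, a call to $throttle->wait() placed before each request would replace the fixed sleep(1), with $throttle created once as new Throttle(1.0).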

Handling login and sessions

For sites that require a login, cURL can maintain the session through cookies:

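$loginUrl = 'https://example.com/login';
$postData = ['username' => 'user', 'password' => 'pass'];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $loginUrl);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postData));
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');  // write cookies here when the handle is closed
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt'); // send stored cookies on later requests
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

// Reuse the same handle so the session cookie is sent with the next request
curl_setopt($ch, CURLOPT_URL, 'https://example.com/profile'); // hypothetical protected page
curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back from POST to GET
$protectedHtml = curl_exec($ch);
curl_close($ch);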

Storing crawled data

Crawled data can be stored in a file or in a database:

// Save to a file ($data is the array of scraped records)
file_put_contents('data.json', json_encode($data, JSON_UNESCAPED_UNICODE));

// Save to a MySQL database
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO pages (url, content) VALUES (?, ?)');
$stmt->execute([$url, $html]);
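Writing the whole data set with a single file_put_contents call overwrites the file each time, which is awkward for a crawler that produces records one at a time. One alternative, swapped in here for illustration, is the JSON Lines layout (one JSON object per line, appended as it arrives). The helper names appendRecord and readRecords are hypothetical.

```php
<?php
// Hypothetical helper: appends one record per line (JSON Lines) to $path.
function appendRecord(string $path, array $record): void
{
    $line = json_encode(
        $record,
        JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
    );
    // FILE_APPEND keeps earlier records; LOCK_EX avoids interleaved writes
    file_put_contents($path, $line . "\n", FILE_APPEND | LOCK_EX);
}

// Hypothetical helper: reads every record back as an associative array.
function readRecords(string $path): array
{
    $records = [];
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $records[] = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
    }
    return $records;
}
```

Each crawled page could then be saved with a call such as appendRecord('data.jsonl', ['url' => $url, 'title' => $title]), where the field names are again only examples.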

Handling anti-crawling measures

Some sites apply anti-crawling measures. Setting request headers such as User-Agent can help:

$options = [
    'http' => [
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)\r\n"
    ]
];
$context = stream_context_create($options);
$html = file_get_contents($url, false, $context);
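When several headers are needed, building the header string by hand gets error-prone. The sketch below assumes a hypothetical helper, buildRequestContext, that turns an associative array into the header block expected by the http stream wrapper and also sets a timeout. The header values shown are examples only.

```php
<?php
// Hypothetical helper: builds a stream context that sends the given headers.
function buildRequestContext(array $headers, int $timeoutSeconds = 10)
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => implode("\r\n", $lines) . "\r\n", // one header per CRLF-terminated line
            'timeout' => $timeoutSeconds,                  // read timeout in seconds
        ],
    ]);
}

$context = buildRequestContext([
    'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language' => 'en-US,en;q=0.9',
]);
```

The resulting $context is passed as the third argument of file_get_contents, exactly as in the snippet above.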

The methods above cover the basic techniques for building a crawler in PHP. Choose and combine them according to your specific requirements.