How do you write a web crawler in Java?
How to Implement a Web Crawler in Java
Parsing HTML with Jsoup
Jsoup is a Java HTML parsing library well suited to static pages. Add the Maven dependency:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>
Basic crawl example:
Document doc = Jsoup.connect("https://example.com").get(); // fetch and parse the page
Elements links = doc.select("a[href]"); // every anchor tag that has an href
for (Element link : links) {
    System.out.println(link.attr("href"));
}
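The href values returned by attr("href") are often relative, so before following them a crawler typically resolves each one against the page URL. A small sketch using java.net.URI (the class name is illustrative):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative helper: resolve a possibly-relative href against the page
// it was found on before enqueueing it for crawling.
public class LinkResolver {
    public static String resolve(String pageUrl, String href) throws URISyntaxException {
        // URI.resolve handles relative paths, "../" segments, and
        // returns absolute hrefs unchanged.
        return new URI(pageUrl).resolve(href).toString();
    }
}
```

Jsoup can also do this directly: link.attr("abs:href") returns the absolute URL when the document was loaded via Jsoup.connect.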
Using HttpClient for Complex Requests
Apache HttpClient suits scenarios that need fine-grained control over cookies, headers, and other HTTP details. Add the dependency:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>
GET request example (the response is closed via try-with-resources to avoid leaking the connection):
CloseableHttpClient client = HttpClients.createDefault();
HttpGet request = new HttpGet("https://api.example.com/data");
try (CloseableHttpResponse response = client.execute(request)) {
    String responseBody = EntityUtils.toString(response.getEntity());
    System.out.println(responseBody);
}
client.close();
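The same header and cookie handling is also available in the JDK's built-in java.net.http client (Java 11+), with no extra dependency. A minimal sketch of building a GET with a custom User-Agent and cookie (the URL and cookie value are placeholders):

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Illustrative helper using the JDK 11+ HttpClient API: build a GET
// request that carries a custom User-Agent and a Cookie header.
public class RequestBuilder {
    public static HttpRequest buildGet(String url, String cookie) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(10)) // fail instead of hanging
                .header("User-Agent", "Mozilla/5.0")
                .header("Cookie", cookie)
                .GET()
                .build();
    }
}
```

The built request would then be sent with java.net.http.HttpClient.send().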
Handling JavaScript-Rendered Pages
For pages whose content is loaded dynamically, use Selenium WebDriver. Install the matching browser driver first, then add the dependency:
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.8.1</version>
</dependency>
Example code:
WebDriver driver = new ChromeDriver(); // requires a matching chromedriver on the PATH
driver.get("https://dynamic.example.com");
WebElement element = driver.findElement(By.cssSelector(".dynamic-content"));
System.out.println(element.getText());
driver.quit();
Storing the Data
Crawled data can be written to files or a database. JDBC example (try-with-resources closes the connection and statement automatically):
String url = "jdbc:mysql://localhost:3306/crawler";
try (Connection conn = DriverManager.getConnection(url, "user", "password");
     PreparedStatement stmt = conn.prepareStatement(
             "INSERT INTO pages(url, content) VALUES(?, ?)")) {
    stmt.setString(1, pageUrl);
    stmt.setString(2, pageContent);
    stmt.executeUpdate();
}
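For the file option mentioned above, a simple approach is one record per line. A sketch that appends tab-separated (url, content) records (the format and method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Illustrative file store: append one tab-separated record per crawled page.
public class FileStore {
    public static void save(String file, String url, String content) throws IOException {
        // Flatten newlines so each record stays on a single line.
        String line = url + "\t" + content.replace("\n", " ") + System.lineSeparator();
        Files.writeString(Path.of(file), line,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static List<String> load(String file) throws IOException {
        return Files.readAllLines(Path.of(file));
    }
}
```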
Respecting robots.txt
Before crawling, check the target site's robots.txt. Note that robots.txt is plain text, not HTML, so tell Jsoup to ignore the content type instead of parsing it as a document:
String robotsUrl = "https://example.com/robots.txt";
String robotsTxt = Jsoup.connect(robotsUrl)
        .ignoreContentType(true) // robots.txt is text/plain
        .execute()
        .body();
System.out.println(robotsTxt);
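A crawler usually needs to act on the Disallow rules rather than just print them. A deliberately simplified checker for the wildcard (User-agent: *) section is sketched below; a real parser would also handle Allow rules, per-agent sections, and wildcard patterns:

```java
// Simplified sketch: return false if any "Disallow:" prefix in the
// "User-agent: *" section matches the given path. Not a full parser.
public class RobotsChecker {
    public static boolean isAllowed(String robotsTxt, String path) {
        boolean inWildcardSection = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                // Track whether we are inside the wildcard section.
                inWildcardSection = line.substring(11).trim().equals("*");
            } else if (inWildcardSection && line.toLowerCase().startsWith("disallow:")) {
                String rule = line.substring(9).trim();
                // An empty Disallow means "allow everything".
                if (!rule.isEmpty() && path.startsWith(rule)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```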
Error Handling and Rate Limiting
Add exception handling and an interval between requests. Note that Thread.sleep throws InterruptedException, which must be caught separately from IOException:
try {
    Thread.sleep(1000); // 1-second interval between requests
    Document doc = Jsoup.connect(url)
            .timeout(10000)
            .userAgent("Mozilla/5.0")
            .get();
} catch (IOException e) {
    System.err.println("Error fetching URL: " + url);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
}
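Transient network failures are common when crawling, so beyond logging the error it often pays to retry a few times with a growing delay. A small hypothetical wrapper (names and backoff policy are illustrative):

```java
import java.util.concurrent.Callable;

// Illustrative retry helper: run a fetch up to maxAttempts times,
// sleeping a growing interval between attempts to avoid hammering the server.
public class Retry {
    public static <T> T withRetry(Callable<T> task, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMs * attempt); // linear backoff
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

A fetch would then be wrapped as Retry.withRetry(() -> Jsoup.connect(url).get(), 3, 1000).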
Using Proxy IPs
A proxy can help with anti-crawling measures. Note that Jsoup uses HttpURLConnection underneath, so the https.* properties must also be set for HTTPS URLs:

System.setProperty("http.proxyHost", "proxy.example.com");
System.setProperty("http.proxyPort", "8080");
System.setProperty("https.proxyHost", "proxy.example.com"); // needed for HTTPS targets
System.setProperty("https.proxyPort", "8080");
Document doc = Jsoup.connect(url).get();
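A single proxy can itself get blocked, so a common extension is rotating requests over a small pool. A hypothetical round-robin pool (the addresses in the test are placeholders):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative round-robin pool: spread requests across several proxies
// instead of routing everything through one address.
public class ProxyPool {
    private final List<String> proxies;
    private final AtomicInteger next = new AtomicInteger(0);

    public ProxyPool(List<String> proxies) {
        this.proxies = proxies;
    }

    // Return the next proxy in rotation; thread-safe via AtomicInteger.
    public String nextProxy() {
        return proxies.get(Math.floorMod(next.getAndIncrement(), proxies.size()));
    }
}
```

Each returned "host:port" string would then be split and applied per request, e.g. via Jsoup's Connection.proxy(host, port).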
Note: in real projects, follow the target site's terms of service and avoid putting excessive load on its servers.