How to write a web crawler in Java
Basic steps for implementing a crawler in Java
Add the required libraries
The crawler libraries most commonly used in Java are Jsoup and HttpClient: Jsoup is well suited to parsing HTML, while HttpClient handles sending HTTP requests. In a Maven project, add the following dependencies:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>
Send an HTTP request
Use HttpClient to send a GET request and fetch the page content:
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://example.com");
CloseableHttpResponse response = httpClient.execute(httpGet);
String html = EntityUtils.toString(response.getEntity());
response.close();
httpClient.close();
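Note that the snippet above leaks the client and response if an exception is thrown before the close() calls. A slightly more robust sketch uses try-with-resources and checks the status code before reading the body (treating the page as UTF-8 is an assumption; use the charset the site actually declares):
import java.nio.charset.StandardCharsets;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// try-with-resources closes the client and response even when an exception is thrown
try (CloseableHttpClient httpClient = HttpClients.createDefault();
     CloseableHttpResponse response = httpClient.execute(new HttpGet("https://example.com"))) {
    int status = response.getStatusLine().getStatusCode();
    if (status == 200) {
        // Assumes UTF-8; adjust to whatever the site declares
        String html = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        System.out.println(html.length());
    } else {
        System.err.println("Unexpected HTTP status: " + status);
    }
}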
Parse the HTML
Use Jsoup to parse the HTML you just fetched:
Document doc = Jsoup.parse(html);
// Select every anchor tag that has an href attribute and print its target
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}
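If you give Jsoup the page's base URL, relative links can be resolved to absolute ones with absUrl(). A short sketch; the CSS selector and base URL below are only illustrative:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Passing the base URI lets absUrl() turn relative hrefs into absolute URLs
Document doc = Jsoup.parse(html, "https://example.com");
System.out.println(doc.title());
Elements headlines = doc.select("h2.article-title a");   // hypothetical selector for this example
for (Element link : headlines) {
    System.out.println(link.text() + " -> " + link.absUrl("href"));
}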
Handle dynamic content
For content that is loaded dynamically by JavaScript, you can use Selenium WebDriver:
WebDriver driver = new ChromeDriver();
driver.get("https://example.com");
WebElement element = driver.findElement(By.tagName("body"));
System.out.println(element.getText());
driver.quit();
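Dynamically loaded elements often do not exist in the DOM right after driver.get() returns, so an explicit wait is usually needed. A minimal sketch, assuming Selenium 4 (where WebDriverWait takes a Duration) and a hypothetical #content element:
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

ChromeOptions options = new ChromeOptions();
options.addArguments("--headless=new");   // run Chrome without a visible window (flag syntax for recent Chrome)
WebDriver driver = new ChromeDriver(options);
try {
    driver.get("https://example.com");
    // Wait up to 10 seconds for the dynamically loaded element (#content is hypothetical)
    WebElement content = new WebDriverWait(driver, Duration.ofSeconds(10))
            .until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector("#content")));
    System.out.println(content.getText());
} finally {
    driver.quit();   // always release the browser, even on failure
}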
Store the crawled data
Write the crawled data to a file or a database:
try (FileWriter writer = new FileWriter("output.txt")) {
    writer.write(data);
}
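For a database, plain JDBC with a prepared statement is enough for small crawls. A sketch assuming a local MySQL database, the MySQL JDBC driver on the classpath, a hypothetical pages(url, title) table, and pageUrl/pageTitle variables holding the crawled values:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Connection details and the pages(url, title) table are placeholders for illustration
String jdbcUrl = "jdbc:mysql://localhost:3306/crawler";
try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
     PreparedStatement ps = conn.prepareStatement("INSERT INTO pages (url, title) VALUES (?, ?)")) {
    ps.setString(1, pageUrl);     // pageUrl and pageTitle are assumed to hold the crawled values
    ps.setString(2, pageTitle);
    ps.executeUpdate();
}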
Respect robots.txt
Before crawling, check the target site's robots.txt file to make sure your crawler is allowed to access the pages:
String robotsTxt = Jsoup.connect("https://example.com/robots.txt").execute().body();
System.out.println(robotsTxt);
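The file still has to be interpreted. A deliberately naive sketch that only checks Disallow rules, ignoring per-agent groups, Allow lines and wildcards; a real crawler should use a proper robots.txt parser:
// Naive check: is the path blocked by any Disallow rule?
String path = "/some/page";   // hypothetical path we want to fetch
boolean allowed = true;
for (String line : robotsTxt.split("\\r?\\n")) {
    line = line.trim();
    if (line.toLowerCase().startsWith("disallow:")) {
        String rule = line.substring("disallow:".length()).trim();
        if (!rule.isEmpty() && path.startsWith(rule)) {
            allowed = false;
        }
    }
}
System.out.println("Allowed to fetch " + path + ": " + allowed);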
Set request headers
Mimic a real browser so that basic anti-bot checks are less likely to block your requests:
httpGet.setHeader("User-Agent", "Mozilla/5.0");
httpGet.setHeader("Accept-Language", "en-US,en;q=0.5");
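If you fetch pages with Jsoup directly, the same headers can be set on the connection. A short sketch; the User-Agent string and timeout are arbitrary examples:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("https://example.com")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")   // example User-Agent string
        .header("Accept-Language", "en-US,en;q=0.5")
        .timeout(10_000)                                          // 10-second timeout
        .get();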
Handle exceptions
Make the code robust by handling the exceptions that may occur:
try {
    // crawling code goes here
} catch (IOException e) {
    e.printStackTrace();
}
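Network errors are often transient, so a simple retry loop helps. A sketch assuming a hypothetical fetch(url) helper that returns the page HTML and throws IOException on failure:
import java.io.IOException;

String html = null;
int maxRetries = 3;
for (int attempt = 1; attempt <= maxRetries && html == null; attempt++) {
    try {
        html = fetch(url);   // fetch(url) is a hypothetical helper that downloads the page
    } catch (IOException e) {
        System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
        if (attempt == maxRetries) {
            throw new RuntimeException("Giving up after " + maxRetries + " attempts", e);
        }
    }
}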
Use a proxy
To reduce the risk of your IP being blocked, route requests through a proxy:
HttpHost proxy = new HttpHost("proxy.example.com", 8080);
RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
httpGet.setConfig(config);
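The proxy can also be configured once on the client instead of per request, which is convenient when every request should go through it. A sketch using HttpClientBuilder.setProxy (host and port are placeholders):
import org.apache.http.HttpHost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Every request made through this client goes via the proxy
HttpHost proxy = new HttpHost("proxy.example.com", 8080);
CloseableHttpClient proxiedClient = HttpClients.custom()
        .setProxy(proxy)
        .build();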
Crawl with multiple threads
To speed up crawling, use a thread pool:
ExecutorService executor = Executors.newFixedThreadPool(10);
executor.submit(() -> {
    // crawl task goes here
});
executor.shutdown();
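A slightly fuller sketch that submits one task per URL, waits for the pool to drain, and sleeps inside each task so the target site is not hammered; the URL list and the 1-second delay are arbitrary choices:
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerPool {
    public static void main(String[] args) throws InterruptedException {
        List<String> urls = List.of("https://example.com/a", "https://example.com/b");   // placeholder URLs
        ExecutorService executor = Executors.newFixedThreadPool(10);
        for (String url : urls) {
            executor.submit(() -> {
                try {
                    Thread.sleep(1000);   // crude politeness delay between requests
                    System.out.println("Would fetch " + url);   // fetch and parse the page here instead
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        executor.shutdown();                              // stop accepting new tasks
        executor.awaitTermination(5, TimeUnit.MINUTES);   // wait for submitted tasks to finish
    }
}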






