How to write a web crawler in Java
Simple page scraping with Jsoup
Jsoup is a Java library for working with HTML documents and extracting data from them. It provides a simple API for fetching URLs and parsing the returned markup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch the page and parse it into a DOM
        Document doc = Jsoup.connect("https://example.com").get();
        // Select every anchor element that has an href attribute
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
}
Making more complex requests with HttpClient
Apache HttpClient is better suited to scenarios that need custom headers, cookie management, or POST requests.
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
public class HttpClientExample {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes both the client and the response
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet request = new HttpGet("https://example.com");
            try (CloseableHttpResponse response = httpClient.execute(request)) {
                String html = EntityUtils.toString(response.getEntity());
                System.out.println(html);
            }
        }
    }
}
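For POST requests with custom headers, Java 11+ also ships a built-in client in java.net.http, which avoids the Apache dependency. A minimal sketch (the login URL and form fields here are hypothetical placeholders):

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class PostRequestSketch {
    public static HttpRequest buildPost(String url, String body) {
        // Build a POST request with a custom User-Agent and form body;
        // nothing is sent until HttpClient.send() is called with it
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .header("User-Agent", "Mozilla/5.0")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildPost("https://example.com/login", "user=demo&pass=demo");
        System.out.println(request.method()); // POST
        System.out.println(request.uri());
    }
}
```

To actually send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.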
Handling dynamically loaded content
For content generated by JavaScript, use a headless browser such as Selenium WebDriver.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
public class SeleniumExample {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com");
            // getPageSource() returns the DOM after JavaScript has run
            String pageSource = driver.getPageSource();
            System.out.println(pageSource);
        } finally {
            driver.quit(); // always close the browser, even on errors
        }
    }
}
Storing and processing the data
Scraped data usually needs to be written to a database or a file.
import java.io.FileWriter;
import java.io.IOException;
public class DataStorage {
    public static void saveToFile(String data, String filename) throws IOException {
        try (FileWriter writer = new FileWriter(filename)) {
            writer.write(data);
        }
    }
}
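For tabular results such as the link list above, appending rows to a CSV file is a common lightweight alternative to a database. A minimal sketch using only java.nio (it assumes fields contain no commas or newlines, since it does no CSV quoting):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class CsvStorage {
    // Append one comma-separated line per record; the file is created
    // on first use and extended on later calls
    public static void appendRows(Path file, List<String[]> rows) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            sb.append(String.join(",", row)).append(System.lineSeparator());
        }
        Files.writeString(file, sb.toString(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path out = Path.of("links.csv");
        appendRows(out, List.of(
                new String[]{"https://example.com", "Example Domain"}));
        System.out.println(Files.readString(out));
    }
}
```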
Respecting robots.txt
Before crawling, check the target site's robots.txt file and honor its rules.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class RobotsChecker {
    public static void main(String[] args) throws Exception {
        // robots.txt is served as text/plain, so tell Jsoup to accept it;
        // without ignoreContentType(true) this call throws an exception
        Document robots = Jsoup.connect("https://example.com/robots.txt")
                .ignoreContentType(true)
                .get();
        System.out.println(robots.text());
    }
}
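Printing the file is only the first step; the crawler should also parse the rules and skip disallowed paths. A simplified sketch that collects the Disallow entries applying to all crawlers (it ignores Allow lines, wildcards, and per-bot sections from the full robots.txt convention):

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsRules {
    // Collect Disallow paths under "User-agent: *" groups
    public static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("user-agent:")) {
                // A new User-agent line starts a new group of rules
                applies = trimmed.substring(11).trim().equals("*");
            } else if (applies && trimmed.toLowerCase().startsWith("disallow:")) {
                String path = trimmed.substring(9).trim();
                if (!path.isEmpty()) {
                    rules.add(path);
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n";
        System.out.println(disallowedForAll(robots)); // [/private/, /tmp/]
    }
}
```

Before fetching a URL, check whether its path starts with any of the returned prefixes.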
Dealing with anti-scraping measures
Common countermeasures include IP rate limiting, CAPTCHAs, and User-Agent checks.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class AntiCrawler {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com")
                .userAgent("Mozilla/5.0") // present a browser-like User-Agent
                .timeout(5000)            // fail fast instead of hanging
                .cookie("auth", "token")  // send a session cookie if required
                .get();
        System.out.println(doc.title());
    }
}
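Rate limits are usually triggered by request frequency, so pausing between fetches matters as much as the headers above. A minimal sketch of polite crawling with a fixed delay (the fetch method here is a stand-in for the real request, e.g. the Jsoup call shown above):

```java
import java.util.List;

public class PoliteCrawler {
    private static final long DELAY_MS = 1000; // at least 1 s between requests

    public static void crawlAll(List<String> urls) throws InterruptedException {
        for (String url : urls) {
            fetch(url);
            Thread.sleep(DELAY_MS); // fixed pause to avoid triggering rate limits
        }
    }

    // Placeholder for the real fetch logic
    private static void fetch(String url) {
        System.out.println("fetching " + url);
    }

    public static void main(String[] args) throws InterruptedException {
        crawlAll(List.of("https://example.com/a", "https://example.com/b"));
    }
}
```

A randomized delay (e.g. 1-3 seconds) makes the traffic pattern look less mechanical than a fixed interval.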
Using proxy servers
When you hit IP-based rate limits, route requests through a proxy server.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class ProxyExample {
    public static void main(String[] args) throws Exception {
        // The target URL is https, so the https.* proxy properties are the
        // ones that apply; set both pairs when fetching a mix of http and
        // https URLs
        System.setProperty("https.proxyHost", "proxy.example.com");
        System.setProperty("https.proxyPort", "8080");
        System.setProperty("http.proxyHost", "proxy.example.com");
        System.setProperty("http.proxyPort", "8080");
        Document doc = Jsoup.connect("https://example.com").get();
    }
}
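System properties apply one proxy to the whole JVM. To spread requests across several proxies, a round-robin rotator over java.net.Proxy objects is a common pattern; the proxy hostnames below are placeholders for real endpoints:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ProxyRotator {
    private final List<Proxy> proxies;
    private final AtomicInteger index = new AtomicInteger();

    public ProxyRotator(List<Proxy> proxies) {
        this.proxies = proxies;
    }

    // Round-robin: each call returns the next proxy in the list
    public Proxy next() {
        return proxies.get(Math.floorMod(index.getAndIncrement(), proxies.size()));
    }

    public static void main(String[] args) {
        // createUnresolved avoids a DNS lookup at construction time
        ProxyRotator rotator = new ProxyRotator(List.of(
                new Proxy(Proxy.Type.HTTP,
                        InetSocketAddress.createUnresolved("proxy1.example.com", 8080)),
                new Proxy(Proxy.Type.HTTP,
                        InetSocketAddress.createUnresolved("proxy2.example.com", 8080))));
        // A java.net.Proxy can be passed to URL.openConnection(proxy)
        System.out.println(rotator.next().address());
        System.out.println(rotator.next().address());
    }
}
```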