How do I write a web crawler in Java?
Approaches to Building a Java Web Crawler
Parsing HTML with the Jsoup Library
Jsoup is a Java HTML parsing library well suited to static pages. After adding the dependency, you can fetch a page and extract data like this:
Document doc = Jsoup.connect("https://example.com").get();  // fetch and parse the page
Elements links = doc.select("a[href]");                     // CSS selector: all links with an href
for (Element link : links) {
    System.out.println(link.attr("href"));
}
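Jsoup can also parse HTML from an in-memory string, which is handy for testing selectors without any network access. A minimal self-contained sketch of the same link extraction (the HTML fragment and class name are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    // Parse an in-memory HTML fragment and collect every href value
    public static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);          // no network call involved
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<ul><li><a href=\"/a\">A</a></li>"
                    + "<li><a href=\"/b\">B</a></li></ul>";
        System.out.println(extractLinks(html));    // [/a, /b]
    }
}
```

Separating the parsing logic from the network fetch this way also makes the crawler much easier to unit-test.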
Sending Requests with HttpClient
Apache HttpClient suits scenarios that need more complex HTTP requests. This example sends a GET request:
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://example.com");
try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
    String content = EntityUtils.toString(response.getEntity());
}
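Since JDK 11 the standard library ships its own java.net.http.HttpClient, which covers many of the same cases without an external dependency. A sketch that only builds and inspects a request, so no network call is made (the URL and User-Agent value are placeholders):

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class RequestDemo {
    // Configure method, timeout, and headers up front on the request object
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .GET()
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "my-crawler/1.0")  // identify your crawler politely
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("https://example.com");
        System.out.println(req.method() + " " + req.uri());  // GET https://example.com
    }
}
```

The request would then be sent with `HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString())`.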
Handling Dynamically Loaded Content
For pages rendered by JavaScript, use Selenium WebDriver:

WebDriver driver = new ChromeDriver();          // requires chromedriver on the PATH
driver.get("https://example.com");
WebElement element = driver.findElement(By.tagName("div"));
System.out.println(element.getText());
driver.quit();                                  // always release the browser
Storing and Processing the Data
Once scraped, data can be stored in a database or in files. A JDBC + MySQL example (table and column names are illustrative):
Connection conn = DriverManager.getConnection(DB_URL, USER, PASS);
PreparedStatement ps = conn.prepareStatement(
    "INSERT INTO pages(url, title) VALUES(?, ?)");  // parameterized to avoid SQL injection
ps.setString(1, url);
ps.setString(2, title);
ps.executeUpdate();
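When a database is overkill, scraped records can go straight to a file. A minimal sketch using only java.nio (the filename, fields, and class name are illustrative, and a real crawler would need to escape commas and quotes in the values):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvStore {
    // Write one comma-separated record per line
    public static void save(Path file, List<String[]> rows) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            sb.append(String.join(",", row)).append('\n');
        }
        Files.writeString(file, sb.toString());
    }

    public static void main(String[] args) throws IOException {
        Path out = Path.of("pages.csv");
        save(out, List.of(new String[]{"https://example.com", "Example"}));
        System.out.print(Files.readString(out));   // https://example.com,Example
    }
}
```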
Respecting robots.txt
Before crawling, check the target site's robots.txt and set a reasonable delay between requests:

Thread.sleep(1000); // wait 1 second between requests
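The flat sleep above can be wrapped into a small rate limiter so each request only waits out the remainder of the interval since the previous one. A sketch (the interval value is arbitrary):

```java
public class RateLimiter {
    private final long intervalMillis;
    private long lastRequest = 0;                  // timestamp of the previous acquire

    public RateLimiter(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Block until at least intervalMillis has passed since the last call
    public synchronized void acquire() throws InterruptedException {
        long wait = lastRequest + intervalMillis - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastRequest = System.currentTimeMillis();
    }

    public static void main(String[] args) throws InterruptedException {
        RateLimiter limiter = new RateLimiter(200);
        long start = System.currentTimeMillis();
        limiter.acquire();                         // first call returns immediately
        limiter.acquire();                         // second call waits ~200 ms
        System.out.println(System.currentTimeMillis() - start >= 200);  // true
    }
}
```

Calling `acquire()` before every fetch keeps the crawl rate bounded regardless of how long each request itself takes.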
Dealing with Anti-Scraping Measures
Sites may use CAPTCHAs, IP bans, or required logins. For pages behind a login, simulate the form submission:
HttpPost httpPost = new HttpPost("https://example.com/login");
List<NameValuePair> params = new ArrayList<>();
params.add(new BasicNameValuePair("username", "user"));
params.add(new BasicNameValuePair("password", "pass"));
httpPost.setEntity(new UrlEncodedFormEntity(params));
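The form body above is just URL-encoded key=value pairs, and the encoding itself can be done with the standard library. A sketch (field names and values are placeholders):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBody {
    // Build an application/x-www-form-urlencoded request body
    public static String encode(Map<String, String> fields) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "user");
        fields.put("password", "p&ss word");       // special characters get escaped
        System.out.println(encode(fields));        // username=user&password=p%26ss+word
    }
}
```

Send the result as the request body with the `Content-Type: application/x-www-form-urlencoded` header set.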
Using Proxy IPs
Route requests through a proxy server to avoid IP bans:
HttpHost proxy = new HttpHost("proxy.example.com", 8080);
RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
httpGet.setConfig(config);   // this request now goes through the proxy
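The JDK's built-in client supports proxies as well, via a ProxySelector. A sketch that only configures the client, so no connection is attempted (the proxy host and port are placeholders):

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;

public class ProxyDemo {
    // Route all requests from this client through the given proxy
    public static HttpClient withProxy(String host, int port) {
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(host, port)))
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = withProxy("proxy.example.com", 8080);
        System.out.println(client.proxy().isPresent());   // true: a proxy selector is set
    }
}
```

Because the proxy is set on the client rather than per request, every call made through this client uses it.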