How to write a web crawler in Java
Writing a crawler in Java
A Java crawler typically relies on third-party libraries such as Jsoup or Apache HttpClient. The sections below show concrete implementations:
Using the Jsoup library
Jsoup is a Java library built specifically for fetching and parsing HTML, which makes it well suited to processing web page content:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupCrawler {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com";
        // Fetch and parse the page in one step
        Document doc = Jsoup.connect(url).get();
        // Select every anchor tag that has an href attribute
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // "abs:href" resolves relative links against the page URL
            System.out.println(link.attr("abs:href"));
        }
    }
}
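Beyond links, the same Document object exposes the rest of the page. A minimal sketch pulling the title and heading text; the "h1" selector and class name here are illustrative assumptions, not part of the original example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExtractor {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com").get();
        // Contents of the page's <title> element
        System.out.println(doc.title());
        // Combined text of all <h1> elements; "h1" is an illustrative selector
        System.out.println(doc.select("h1").text());
    }
}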
Using the HttpClient library
Apache HttpClient gives finer control over HTTP requests and is paired with an HTML parser for the actual extraction (a sketch of that pairing follows the example):
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientCrawler {
    public static void main(String[] args) throws Exception {
        HttpGet request = new HttpGet("https://example.com");
        // try-with-resources closes both the client and the response
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(request)) {
            String html = EntityUtils.toString(response.getEntity());
            // Hand the raw HTML to Jsoup or another parser (see below)
        }
    }
}
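A minimal sketch of that pairing, assuming Jsoup is also on the classpath. Jsoup.parse takes the raw HTML plus a base URL so that relative links can still resolve:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HttpClientJsoupCrawler {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com";
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            String html = EntityUtils.toString(response.getEntity());
            // Parse the fetched HTML; the base URL lets "abs:href" resolve
            Document doc = Jsoup.parse(html, url);
            System.out.println(doc.title());
        }
    }
}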
Handling dynamic content
For pages rendered by JavaScript, use Selenium WebDriver, which drives a real browser:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumCrawler {
    public static void main(String[] args) {
        // Point Selenium at a locally installed chromedriver binary
        System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com");
            // HTML after JavaScript has run in the browser
            String pageSource = driver.getPageSource();
            // Process the retrieved page source
        } finally {
            // Always shut the browser down, even on failure
            driver.quit();
        }
    }
}
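Calling getPageSource() immediately after get() can still miss late-loading content. A sketch using Selenium's explicit waits (the Duration-based constructor is Selenium 4); the "#content" selector is an illustrative assumption for whatever element marks the page as ready:

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SeleniumWaitCrawler {
    public static void main(String[] args) {
        // Assumes chromedriver is resolvable, e.g. via the system property set earlier
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com");
            // Block (up to 10 seconds) until the JavaScript-rendered element appears;
            // "#content" is an illustrative selector for the target page
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("#content")));
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}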
Storing crawled data
Crawled data can be written to a file or a database; the file version is shown first, with a database sketch after it:
import java.io.FileWriter;
import java.io.IOException;

public class DataStorage {
    public static void saveToFile(String data, String filename) throws IOException {
        // Open in append mode so repeated calls accumulate lines
        try (FileWriter writer = new FileWriter(filename, true)) {
            writer.write(data + "\n");
        }
    }
}
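For the database side, a minimal JDBC sketch. The SQLite URL, the "pages" table, and its schema are assumptions for illustration, and a matching driver (e.g. sqlite-jdbc) must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DbStorage {
    // "jdbc:sqlite:crawler.db" and the "pages" table are illustrative choices
    private static final String DB_URL = "jdbc:sqlite:crawler.db";

    public static void savePage(String url, String title) throws SQLException {
        try (Connection conn = DriverManager.getConnection(DB_URL)) {
            try (PreparedStatement create = conn.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")) {
                create.execute();
            }
            // Parameterized insert keeps page content from breaking the SQL
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO pages (url, title) VALUES (?, ?)")) {
                insert.setString(1, url);
                insert.setString(2, title);
                insert.executeUpdate();
            }
        }
    }
}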
Respecting robots.txt
A crawler should honor a site's robots.txt rules:
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RobotsChecker {
    // Crude heuristic: only rejects a site whose robots.txt blocks everything
    public static boolean isAllowed(String url) throws Exception {
        URL parsed = new URL(url);
        String robotsUrl = parsed.getProtocol() + "://" + parsed.getHost() + "/robots.txt";
        // robots.txt is served as text/plain, which Jsoup rejects by default
        Document doc = Jsoup.connect(robotsUrl).ignoreContentType(true).get();
        return !doc.text().contains("Disallow: /");
    }
}
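The check above only catches a blanket "Disallow: /". A slightly more faithful sketch that collects the Disallow paths in the "*" user-agent group and matches them as URL-path prefixes; this is still a simplification that ignores Allow rules, wildcards, and stacked user-agent lines:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SimpleRobotsRules {
    public static boolean isAllowed(String url) throws Exception {
        URL parsed = new URL(url);
        URL robots = new URL(parsed.getProtocol() + "://" + parsed.getHost() + "/robots.txt");
        List<String> disallowed = new ArrayList<>();
        boolean inStarGroup = false;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    // Track whether we are inside the catch-all "*" group
                    inStarGroup = line.substring(11).trim().equals("*");
                } else if (inStarGroup && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) {
                        disallowed.add(path);
                    }
                }
            }
        }
        String path = parsed.getPath().isEmpty() ? "/" : parsed.getPath();
        // Prefix match: a "/private" rule blocks "/private/page.html"
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }
}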
Exception handling and rate limiting
Add exception handling and a delay between requests so the crawler stays polite:
import java.util.concurrent.TimeUnit;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PoliteCrawler {
    public static void crawl(String url) {
        try {
            // Wait one second between requests to avoid hammering the server
            TimeUnit.SECONDS.sleep(1);
            Document doc = Jsoup.connect(url)
                    .timeout(5000) // give up after 5 seconds
                    .get();
            // Process the document
        } catch (Exception e) {
            System.err.println("Error crawling " + url + ": " + e.getMessage());
        }
    }
}
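Putting the pieces together, a minimal end-to-end sketch under stated assumptions: Jsoup only, a fixed page budget, and no same-host filtering, for brevity. Names like MAX_PAGES and MiniCrawler are illustrative:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MiniCrawler {
    private static final int MAX_PAGES = 20; // illustrative crawl budget

    public static void main(String[] args) throws InterruptedException {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("https://example.com");

        while (!frontier.isEmpty() && visited.size() < MAX_PAGES) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled
            }
            TimeUnit.SECONDS.sleep(1); // politeness delay between requests
            try {
                Document doc = Jsoup.connect(url).timeout(5000).get();
                System.out.println(url + " -> " + doc.title());
                // Enqueue newly discovered absolute links
                for (Element link : doc.select("a[href]")) {
                    String next = link.attr("abs:href");
                    if (!next.isEmpty() && !visited.contains(next)) {
                        frontier.add(next);
                    }
                }
            } catch (Exception e) {
                System.err.println("Error crawling " + url + ": " + e.getMessage());
            }
        }
    }
}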