Use HttpClient or the HtmlUnit library; both can serve as crawler tools for fetching web pages. With HtmlUnit, for example, OP can get the page source like this:
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class helloHtmlUnit {
    public static void main(String[] args) throws Exception {
        String str;
        // Create a WebClient
        WebClient webClient = new WebClient();
        // HtmlUnit's CSS and JavaScript support is weak, so turn both off
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        // Fetch the page. Only "/" survives in this post; getPage needs a full URL (http://...) here
        HtmlPage page = webClient.getPage("/");
        // Get the page TITLE
        str = page.getTitleText();
        System.out.println(str);
        // Get the page as XML
        str = page.asXml();
        System.out.println(str);
        // Get the page text (newer HtmlUnit versions call this asNormalizedText())
        str = page.asText();
        System.out.println(str);
        // Close the WebClient (newer HtmlUnit versions use webClient.close() instead)
        webClient.closeAllWindows();
    }
}
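If you go with HttpClient instead, OP can search Baidu for tutorials. There is also a book, 《自己動手寫網絡爬蟲》 (roughly, "Write Your Own Web Crawler"), which teaches the subject in Java and is worth a look for anyone new to crawlers.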
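For comparison, here is a rough sketch of the same "fetch the page source" step done with a plain HTTP client. Note the assumptions: this uses the JDK's built-in java.net.http.HttpClient (Java 11+), not the Apache HttpClient library that the answer most likely has in mind, and the class name HelloHttpClient, the helper fetchSource, and the example.com URL are placeholders made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class HelloHttpClient {

    // Hypothetical helper: download a URL and return the response body as a String.
    static String fetchSource(String url) throws IOException, InterruptedException {
        // Build a client that follows ordinary redirects and gives up on slow connects
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();

        // Plain GET request for the page
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();

        // Send the request and read the whole body into memory as text
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Treat anything other than 200 OK as a failure
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass the page you actually want as the first argument
        String url = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(fetchSource(url));
    }
}
```

Unlike HtmlUnit, this only returns the raw HTML string. There is no parsed page object, so things like getTitleText() would need a separate HTML parser.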