How do I scrape all the page links under a given URL?

The question was posted under the programming languages category but does not say which language to use, so I chose Java.

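In Java, HttpURLConnection can open a connection to the URL, and an InputStreamReader can then read the page content as text. Parsing that text with a regular expression to find every <a> tag gives the list of links.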
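To make the regex step concrete, here is a minimal sketch that pulls the href values out of an HTML string already held in memory. The class name LinkExtractor, the method extractHrefs, and the pattern itself are my own illustration, not part of any library. The pattern only handles quoted href attributes and can be fooled by unusual markup, so treat it as a starting point rather than a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class LinkExtractor {

    // Matches <a ... href="..."> or <a ... href='...'>.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns every quoted href value found in the given HTML text, in order. */
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(2));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/a.html\">A</a> <A class='x' HREF='b.html'>B</A></p>";
        System.out.println(extractHrefs(html)); // prints [/a.html, b.html]
    }
}
```

Because the matcher loops with find(), several links on the same line are all collected, and the lazy quantifiers keep each match from running on into the next tag.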

The full code is below:

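import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlParser {

    /** Matches one <a ... href=...>...</a> tag; lazy so each match stops at the first </a>. */
    private static final Pattern LINK_PATTERN =
            Pattern.compile("<a\\s[^>]*?href=.*?</a>", Pattern.CASE_INSENSITIVE);

    /** Matches the charset parameter of a Content-Type header. */
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset=\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    /** The page to analyze. */
    String htmlUrl;

    /** The links found on the page. */
    ArrayList<String> hrefList = new ArrayList<>();

    /** The character encoding of the page. */
    String charSet;

    public HtmlParser(String htmlUrl) {
        this.htmlUrl = htmlUrl;
    }

    /**
     * Fetches the page and returns the links found on it.
     *
     * @throws IOException if the page cannot be fetched or read
     */
    public ArrayList<String> getHrefList() throws IOException {
        parser();
        return hrefList;
    }

    /**
     * Downloads the page and collects its links line by line.
     * Links whose markup spans more than one line are not detected.
     *
     * @throws IOException if the page cannot be fetched or read
     */
    private void parser() throws IOException {
        hrefList.clear(); // avoid duplicates if getHrefList() is called again
        URL url = new URL(htmlUrl);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        // A plain GET is enough here; setDoOutput(true) would turn the request into a POST.
        String contentType = connection.getContentType();
        charSet = getCharset(contentType);
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), charSet))) {
            String str;
            while ((str = br.readLine()) != null) {
                hrefList.addAll(getHref(str));
            }
        } finally {
            connection.disconnect();
        }
    }

    /**
     * Gets the page encoding from the Content-Type header.
     *
     * @param str the Content-Type header value, possibly null
     * @return the declared charset, or UTF-8 when none is declared
     */
    private String getCharset(String str) {
        if (str != null) {
            Matcher matcher = CHARSET_PATTERN.matcher(str);
            if (matcher.find())
                return matcher.group(1);
        }
        // No charset declared: fall back to UTF-8 (an assumption, not a guarantee).
        return "UTF-8";
    }

    /**
     * Reads the links from one line of text.
     *
     * @return every <a> tag found on the line, possibly none
     */
    private ArrayList<String> getHref(String str) {
        ArrayList<String> found = new ArrayList<>();
        Matcher matcher = LINK_PATTERN.matcher(str);
        while (matcher.find())
            found.add(matcher.group(0));
        return found;
    }

    public static void main(String[] arg) throws IOException {
        // The URL in the original post was reduced to "/", which is not a valid
        // absolute URL; pass the page to scan as the first command-line argument.
        HtmlParser a = new HtmlParser(arg.length > 0 ? arg[0] : "/");
        ArrayList<String> hrefList = a.getHrefList();
        for (String href : hrefList)
            System.out.println(href);
    }
}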
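The program above prints whole <a> tags from a single page. If what you want is the links that stay under the same site, the href values usually have to be resolved against the page URL first, since many of them are relative. The following is a rough sketch of that step using java.net.URI. The class LinkResolver, the method resolveSameHost, and the example.com addresses in the comments are assumptions for illustration; it expects the href values to have been extracted already.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only.
final class LinkResolver {

    /**
     * Resolves each href against baseUrl and keeps the ones on the same host,
     * without duplicates and in first-seen order.
     */
    static List<String> resolveSameHost(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        Set<String> result = new LinkedHashSet<>();
        for (String href : hrefs) {
            URI resolved;
            try {
                // e.g. "b.html" against http://example.com/docs/index.html
                // becomes http://example.com/docs/b.html
                resolved = base.resolve(href.trim());
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URIs
            }
            String host = resolved.getHost();
            // mailto: and similar opaque URIs have no host and are dropped here
            if (host != null && host.equalsIgnoreCase(base.getHost())) {
                result.add(resolved.toString());
            }
        }
        return new ArrayList<>(result);
    }
}
```

Feeding each resolved link back into the fetch step, while remembering which URLs have already been visited, is one way to walk a whole site instead of a single page.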
