Handling connection errors with JSoup

2022-01-24 00:00:00 connection java jsoup

I'm trying to create an application to scrape content off of multiple pages on a site. I am using JSoup to connect. This is my code:

for (String locale : langList) {
    sitemapPath = sitemapDomain + "/" + locale + "/" + sitemapName;
    try {
        Document doc = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .get();

        Elements element = doc.select("loc");
        for (Element urls : element) {
            System.out.println(urls.text());
        }
    } catch (IOException e) {
        System.out.println(e);
    }
}

Everything works perfectly most of the time. However, there are a few things I want to be able to do.

First off, a request will sometimes return a 404 or 500 status, or maybe a 301. With my code above, it just prints the error and moves on to the next URL. What I would like to do is report the status for every URL: if the page connects, print 200; if not, print the relevant status code.

Secondly, I sometimes catch the error "java.net.SocketTimeoutException: Read timed out". I could increase my timeout, but I would prefer to try to connect 3 times. If the third attempt fails, I want to add the URL to a "failed" array so I can retry the failed connections later.

Can someone with more knowledge than me help me out?

Recommended answer

For your first question, you can do your connection/read in two steps, stopping to ask for the status code in the middle like so:

Connection.Response response = Jsoup.connect(sitemapPath)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .timeout(10000)
        .ignoreHttpErrors(true) // without this, Jsoup throws HttpStatusException on 4xx/5xx
        .execute();

int statusCode = response.statusCode();
if (statusCode == 200) {
    // Parse the body that execute() already fetched instead of requesting it again
    Document doc = response.parse();
    Elements element = doc.select("loc");
    for (Element urls : element) {
        System.out.println(urls.text());
    }
} else {
    System.out.println("received error code : " + statusCode);
}

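Note that the execute() method will fail with an IOException if it's unable to connect to the server, if the response is malformed HTTP, etc., so you'll need to handle that. By default Jsoup also throws an HttpStatusException (a subclass of IOException) for 4xx and 5xx responses, so call ignoreHttpErrors(true) on the connection if you want to read those status codes yourself. As long as the server said something that made sense, you'll then be able to read the status code and continue. Also, Jsoup follows redirects by default, so you won't see 30x response codes because Jsoup sets the status code from the final page fetched; call followRedirects(false) if you need to see them.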

As for your second question, all you need is a loop around the code sample I just gave you, with the body of the loop wrapped in a try/catch block that catches SocketTimeoutException. When you catch the exception, the loop should continue to the next attempt. If you're able to get data, then return or break. If every attempt fails, add the URL to your "failed" list after the loop. Shout if you need more help with it!
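The retry loop described above could be sketched roughly as follows. This is only an illustration, not the one right way to do it: the class name RetryingFetcher, the Fetch callback, and the getFailed() accessor are all made up for this example, and the actual request is passed in as a callback so the retry logic stands on its own without any Jsoup-specific code.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: retries a fetch up to 3 times on read timeouts.
class RetryingFetcher {

    // Hypothetical callback type; it stands in for the real request code.
    @FunctionalInterface
    interface Fetch<T> {
        T fetch(String url) throws IOException;
    }

    private static final int MAX_ATTEMPTS = 3;

    // URLs that timed out on every attempt, kept so they can be retried later.
    private final List<String> failed = new ArrayList<>();

    // Returns the fetched value, or null if the URL could not be fetched.
    <T> T fetchWithRetry(String url, Fetch<T> fetcher) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return fetcher.fetch(url); // success: stop retrying
            } catch (SocketTimeoutException e) {
                // Timeout: log it and let the loop try again
                System.out.println("Timeout on attempt " + attempt + " for " + url);
            } catch (IOException e) {
                // Any other I/O problem is not retried in this sketch
                System.out.println(e);
                return null;
            }
        }
        // All attempts timed out: remember the URL for a later retry
        failed.add(url);
        return null;
    }

    List<String> getFailed() {
        return failed;
    }
}
```

With Jsoup, the callback could be something like url -> Jsoup.connect(url).timeout(10000).execute(), and the returned Connection.Response can then be checked for its status code as shown earlier. SocketTimeoutException has to be caught before IOException because it is a subclass of it.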
