How can I programmatically check for valid (non-dead) links using PHP?

2022-01-03 00:00:00 cron url php

Given a list of URLs, I would like to check that each URL:

  • Returns a 200 OK status code
  • Returns a response within X amount of time

The end goal is a system that is capable of flagging URLs as potentially broken so that an administrator can review them.

The script will be written in PHP and will most likely run on a daily basis via cron.

The script will process approximately 1000 URLs per run.

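The question has two parts:

  • Are there any major gotchas with an operation like this? What problems have you run into?
  • What is the best method for checking the status of a URL in PHP, considering both accuracy and performance?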

Recommended answer

Use the PHP cURL extension. Unlike fopen(), it can also make HTTP HEAD requests, which are sufficient to check the availability of a URL and save a lot of bandwidth, since you don't have to download the entire body of the page.

As a starting point, you could use a function like this:

function is_available($url, $timeout = 30) {
    $ch = curl_init(); // get cURL handle

    // set cURL options
    $opts = array(CURLOPT_RETURNTRANSFER => true, // do not output to browser
                  CURLOPT_URL => $url,            // set URL
                  CURLOPT_NOBODY => true,         // do a HEAD request only
                  CURLOPT_TIMEOUT => $timeout);   // set timeout
    curl_setopt_array($ch, $opts); 

    curl_exec($ch); // do it!

    $retval = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200; // check if HTTP OK

    curl_close($ch); // close handle

    return $retval;
}

However, there are many possible optimizations: you might want to re-use the cURL handle and, if you are checking more than one URL per host, even re-use the connection.
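The handle re-use mentioned above might look roughly like the sketch below. The names check_urls() and is_potentially_broken(), the result array layout, and the $maxSeconds threshold are hypothetical and not part of the original answer. It relies on cURL keeping connections alive between transfers on the same handle, so consecutive URLs on one host can share a connection, and it records the total transfer time so that slow responses can be flagged as well.

```php
<?php
// Hypothetical helper: check many URLs with a single, re-used cURL handle.
function check_urls(array $urls, int $timeout = 30): array
{
    $results = [];
    if ($urls === []) {
        return $results; // nothing to check, no handle needed
    }

    $ch = curl_init(); // one handle for the whole batch
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,     // do not echo the response
        CURLOPT_NOBODY         => true,     // HEAD request only
        CURLOPT_TIMEOUT        => $timeout, // overall timeout in seconds
    ]);

    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url); // only the URL changes per request
        curl_exec($ch);
        $results[$url] = [
            'code'    => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),    // 0 if no response
            'seconds' => (float) curl_getinfo($ch, CURLINFO_TOTAL_TIME), // transfer time
            'error'   => curl_error($ch),                                // '' on success
        ];
    }

    // The handle is released when $ch goes out of scope.
    return $results;
}

// Hypothetical rule: flag anything that is not a 200 or took too long.
function is_potentially_broken(array $result, float $maxSeconds): bool
{
    return $result['code'] !== 200 || $result['seconds'] > $maxSeconds;
}
```

A daily cron script could call check_urls() on the full list and pass every entry for which is_potentially_broken() returns true on to the administrator for review.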

Note that this code checks strictly for HTTP response code 200. It does not follow redirects (such as 302), but there is a cURL option for that: CURLOPT_FOLLOWLOCATION.
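If redirects should count as working links, a variant along these lines could be used. This is a sketch only: the function names and the default limit of 5 redirects are assumptions, not part of the original answer. With CURLOPT_FOLLOWLOCATION enabled, CURLINFO_HTTP_CODE reports the status of the last response in the redirect chain.

```php
<?php
// Hypothetical helper: build the option set for a redirect-following HEAD check.
function redirect_check_options(string $url, int $timeout = 30, int $maxRedirects = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // do not echo the response
        CURLOPT_NOBODY         => true,          // HEAD request only
        CURLOPT_TIMEOUT        => $timeout,      // overall timeout in seconds
        CURLOPT_FOLLOWLOCATION => true,          // follow Location: headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // guard against redirect loops
    ];
}

// Hypothetical variant of is_available() that accepts redirected URLs.
function is_available_following_redirects(string $url, int $timeout = 30, int $maxRedirects = 5): bool
{
    $ch = curl_init();
    curl_setopt_array($ch, redirect_check_options($url, $timeout, $maxRedirects));
    curl_exec($ch);

    // Status code of the final response after all redirects were followed.
    return curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
}
```

Capping the number of redirects keeps a misconfigured site that redirects in a loop from stalling the whole batch until the timeout expires.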
