How can I use XMLHttpRequest to download an HTML page in the background and extract a text element from it?

2022-01-15 00:00:00 xmlhttprequest javascript cross-domain greasemonkey tampermonkey

I want to write a Greasemonkey script that, while you are on URL_1, parses the whole HTML page of URL_2 in the background in order to extract a text element from it.

To be specific, I want to download the whole HTML of a page (a Rotten Tomatoes page) in the background, store it in a variable, and then use getElementsByClassName(...)[0] to extract the text I want from the element with the class name "critic_consensus".

I found this on MDN: HTML in XMLHttpRequest. Based on it, I ended up with this code, which unfortunately does not work:

var xhr = new XMLHttpRequest();
xhr.onload = function() {
    // The class name must be passed as a string.
    alert(this.responseXML.getElementsByClassName("critic_consensus")[0].innerHTML);
};
xhr.open("GET", "http://www.rottentomatoes.com/m/godfather/", true);
xhr.responseType = "document";
xhr.send();

When I run it in Firefox Scratchpad, it shows this error message:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://www.rottentomatoes.com/m/godfather/. This can be fixed by moving the resource to the same domain or enabling CORS.

P.S. The reason I am not using the Rotten Tomatoes API is that they have removed the critics consensus from it.

Recommended answer

For cross-origin requests where the fetched site has not helpfully set a permissive CORS policy, Greasemonkey provides the GM_xmlhttpRequest() function. (Most other userscript engines also provide this function.)

GM_xmlhttpRequest is expressly designed to allow cross-origin requests.
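As a rough illustration of how such a request might be packaged, here is a minimal sketch that wraps the callback-style call in a Promise. The helper name gmGetText and its `request` parameter are hypothetical and not part of any userscript API; the sketch assumes the script declares `// @grant GM_xmlhttpRequest` and that the response object carries `status` and `responseText`.

```javascript
// Sketch only: gmGetText is a hypothetical helper, not a userscript API.
// `request` defaults to the GM_xmlhttpRequest global, which is available only
// when the script's metadata block contains `// @grant GM_xmlhttpRequest`.
function gmGetText(url, request = GM_xmlhttpRequest) {
    return new Promise(function (resolve, reject) {
        request({
            method: "GET",
            url: url,
            onload: function (response) {
                // onload also fires for HTTP error pages, so check the status.
                if (response.status >= 200 && response.status < 300) {
                    resolve(response.responseText);
                } else {
                    reject(new Error("HTTP " + response.status + " for " + url));
                }
            },
            onerror: function () { reject(new Error("request failed: " + url)); },
            onabort: function () { reject(new Error("request aborted: " + url)); },
            ontimeout: function () { reject(new Error("request timed out: " + url)); }
        });
    });
}
```

Inside a userscript this could be called as gmGetText("http://www.rottentomatoes.com/m/godfather/").then(function (html) { ... }), leaving the parsing step to the caller.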

To get your target information, parse the response with a DOMParser. Do not use jQuery methods for this, because they cause extraneous images, scripts, and objects to load, which slows things down or can crash the page.
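The parsing step on its own might look like the following sketch. The function name extractFirstText is hypothetical, and the `parser` parameter is an assumption added so the logic can be exercised outside a browser; in a userscript the default DOMParser would be used. It also returns null when no element matches, instead of throwing.

```javascript
// Sketch only: extractFirstText is a hypothetical helper name.
// `parser` defaults to the browser's DOMParser; any object with a compatible
// parseFromString(html, mimeType) method can be passed instead.
function extractFirstText(html, className, parser = new DOMParser()) {
    // Parse into a separate document, as the answer recommends over jQuery.
    var doc = parser.parseFromString(html, "text/html");
    var node = doc.getElementsByClassName(className)[0];
    // Return null when the element is missing, so the caller can handle it.
    return node ? node.textContent.trim() : null;
}
```

With the response from the request, this would be called as extractFirstText(response.responseText, "critic_consensus").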

Here's a complete script that illustrates the process:

// ==UserScript==
// @name        _Parse Ajax Response for specific nodes
// @include     http://stackoverflow.com/questions/*
// @require     http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js
// @grant       GM_xmlhttpRequest
// ==/UserScript==

GM_xmlhttpRequest ( {
    method:     "GET",
    url:        "http://www.rottentomatoes.com/m/godfather/",
    onload:     function (response) {
        var parser      = new DOMParser ();
        /* IMPORTANT!
            1) For Chrome, see
            https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#DOMParser_HTML_extension_for_other_browsers
            for a work-around.

            2) jQuery.parseHTML() and similar are bad because they cause
            images, etc., to be loaded.
        */
        var doc         = parser.parseFromString (response.responseText, "text/html");
        var criticTxt   = doc.getElementsByClassName ("critic_consensus")[0].textContent;

        $("body").prepend ('<h1>' + criticTxt + '</h1>');
    },
    onerror:    function (e) { console.error ('**** error ', e); },
    onabort:    function (e) { console.error ('**** abort ', e); },
    ontimeout:  function (e) { console.error ('**** timeout ', e); }
} );
