Getting products from an e-shop using Simple HTML DOM parser and pagination

2022-01-04 00:00:00 parsing php simple-html-dom pagination

I want to parse the links, names and prices of some products. Here's my code. I'm having trouble with the parsing because I don't know how to get the product links and names. The price is fine, I can get it. The pagination is not working either.

 <h2>Telefonai Pigu</h2>
</br>
<?php
  include_once('simple_html_dom.php'); 
  $url = "http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/";
  // Start from the main page
  $nextLink = $url;

// Loop on each next link as long as it exists
while ($nextLink) {
echo "<hr>nextLink: $nextLink<br>";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a url
$html->load_file($nextLink);


$phones = $html->find('div#productList span.product');

foreach($phones as $phone) {
    // Get the link
    $linkas = $phone->href;

    // Get the name
    $pavadinimas = $phone->find('a[alt]', 0)->plaintext;

    // Get the price and extract the useful part using regex
    $kaina = $phone->find('strong[class=nw]', 0)->plaintext;
    // This captures the integer part of decimal numbers: in "123,45" it will capture "123"... Use @([\d,]+),?@ to capture the decimal part too

    echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

  //$query = "insert into telefonai (pavadinimas,kaina,linkas) VALUES (?,?,?)";
//  $this->db->query($query, array($pavadinimas,$kaina, $linkas));
}


// Extract the next link, if not found return NULL
$nextLink = ( ($temp = $html->find('div.pagination a[="rel"]', 0)) ? "https://www.pigu.lt".$temp->href : NULL );

// Clear DOM object
$html->clear();
unset($html);
}
?>

Output:

nextLink: http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/
A PHP Error was encountered
Severity: Notice
Message: Trying to get property of non-object
Filename: views/pigu_view.php
Line Number: 26
#----# 999,00 Lt #----#
A PHP Error was encountered
Severity: Notice
Message: Trying to get property of non-object
Filename: views/pigu_view.php
Line Number: 26

Solution

Please inspect carefully the source code of the page you're working on; based on that, you can retrieve the nodes you want. It's normal that code written for another website won't work here, since the two websites don't have the same source code/structure!

Let's proceed, again, step by step...

$phones = $html->find('div#productList span.product'); will give you all the "phone containers", or what I called "blocks". Each block has the following structure:

<span class="product">
   <div class="fakeProductContainer">
      <p class="productPhoto">
         <span class="">
         <span class="flags flag-disc-value" title="Akcija"><strong>500<br><span class="currencySymbol">Lt</span></strong></span>
         <span class="flags freeShipping" title="Nemokamas prekių atsiemimas į POST24 paštomatus. Pasiūlymas galioja iki sausio 31 d."></span>
         </span>
         <a href="/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595" title="Telefonas Sony Xperia acro S" class="photo-medium nobr"><img src="http://lt1.pigugroup.eu//colours/48355/16/4835516/c503caf69ad97d889842a5fd5b3ff372_medium.jpg" title="Telefonas Sony Xperia acro S" alt="Telefonas Sony Xperia acro S"></a>
      </p>
      <div class="price">
         <strong class="nw">999,00 Lt</strong>
         <del class="nw">1.499,00 Lt *</del>
      </div>
      <h3><a href="/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595" title="Telefonas Sony Xperia acro S">Sony Xperia acro S</a></h3>
      <p class="descFields">
         3G: <em>HSDPA 14.4 Mbps, HSUPA 5.76 Mbps</em><br>
         GPS: <em>Taip</em><br>
         NFC: <em>Taip</em><br>
         Operacinė sistema: <em>Android OS</em><br>
      </p>
   </div>
</span>

The anchor containing the product link is inside <p class="productPhoto">, and it is the only anchor in there, so to retrieve it simply use $linkas = $phone->find('p.productPhoto a', 0)->href; (then complete it, since it's only a relative link)
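Completing the relative link can be as simple as prepending the host, but a small helper makes the intent explicit. The sketch below is only an illustration: toAbsoluteUrl is a hypothetical name (it is not part of simple_html_dom), and it assumes hrefs are either root-relative, as in the markup above, or already absolute.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): turn a root-relative
// href such as "/foto_gsm_mp3/..." into an absolute URL.
function toAbsoluteUrl(string $base, string $href): string
{
    // Leave hrefs that already carry a scheme untouched.
    if (preg_match('@^https?://@i', $href)) {
        return $href;
    }
    // Join base and path with exactly one slash between them.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

echo toAbsoluteUrl(
    'http://pigu.lt',
    '/foto_gsm_mp3/mobilieji_telefonai/telefonas_sony_xperia_acro_s?id=4522595'
), "\n";
```

Relative paths without a leading slash (resolved against the current page rather than the host) are deliberately out of scope here.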

The product name is located inside the <h3> tag; again, we simply use $pavadinimas = $phone->find('h3 a', 0)->plaintext; to retrieve it

The price is inside <div class="price"><strong>, and again we use $kaina = $phone->find('div[class=price] strong', 0)->plaintext to retrieve it

However, not all phones have their price displayed, so we must check whether the price was actually retrieved
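Once the price text has been retrieved, it still has to be turned into a number. The following is a minimal sketch, assuming the format seen in the sample block ("999,00 Lt", "1.499,00 Lt *"), where "." is a thousands separator and "," the decimal separator. parsePrice is a hypothetical helper name, and returning null for a missing price is a design choice of this sketch.

```php
<?php
// Hypothetical helper: convert a price label such as "1.499,00 Lt" into a float.
// Assumes "." is a thousands separator and "," the decimal separator,
// as in the sample markup; returns null when no number is present.
function parsePrice(?string $text): ?float
{
    if ($text === null || !preg_match('@\d[\d.]*(?:,\d+)?@', $text, $m)) {
        return null;
    }
    // Drop thousands separators, then switch the decimal comma to a dot.
    $normalized = str_replace(',', '.', str_replace('.', '', $m[0]));
    return (float) $normalized;
}

var_dump(parsePrice('999,00 Lt'));      // float(999)
var_dump(parsePrice('1.499,00 Lt *'));  // float(1499)
var_dump(parsePrice(null));             // NULL
```

Passing null straight through lets the caller hand over the plaintext of a node that may not exist and decide afterwards how to display a missing price.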

And finally, the HTML code containing the next link is the following:

<div id="ListFootPannel">
   <div class="pages-list">
      <strong>1</strong>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=2">2</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=3">3</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=4">4</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=5">5</a>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=6">6</a>
      <a rel="next" href="/foto_gsm_mp3/mobilieji_telefonai?page=2">Toliau</a>      
   </div>
   <div class="pages-info">
      Prekių 
   </div>
</div>

So, we are interested in the <a rel="next"> tag, which can be retrieved using $html->find('div#ListFootPannel a[rel="next"]', 0)
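To check that selection logic without fetching the live page, the same lookup can be tried against a trimmed copy of the markup. Note that this sketch swaps in PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, purely so it runs without the third-party file. findNextHref is a hypothetical helper name.

```php
<?php
// Sketch using PHP's built-in DOM extension instead of simple_html_dom.
// The markup is a trimmed copy of the pagination block shown above.
$markup = <<<HTML
<div id="ListFootPannel">
   <div class="pages-list">
      <strong>1</strong>
      <a href="/foto_gsm_mp3/mobilieji_telefonai?page=2">2</a>
      <a rel="next" href="/foto_gsm_mp3/mobilieji_telefonai?page=2">Toliau</a>
   </div>
</div>
HTML;

// Hypothetical helper: return the href of the rel="next" anchor, or null.
function findNextHref(string $markup): ?string
{
    $doc = new DOMDocument();
    // Suppress the warnings that real-world, non-valid HTML tends to trigger.
    @$doc->loadHTML($markup);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="ListFootPannel"]//a[@rel="next"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

echo findNextHref($markup), "\n"; // /foto_gsm_mp3/mobilieji_telefonai?page=2
```

The XPath expression mirrors the CSS selector div#ListFootPannel a[rel="next"]; when no such anchor is present, the helper returns null, which is what ends the pagination loop.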

So, if we apply these modifications to your original code, we get:

<?php
include_once('simple_html_dom.php');

$url = "http://pigu.lt/foto_gsm_mp3/mobilieji_telefonai/";

// Start from the main page
$nextLink = $url;

// Loop on each next link as long as it exists
while ($nextLink) {
    echo "nextLink: $nextLink<br>";
    //Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);

    ////////////////////////////////////////////////
    /// Get phone blocks and extract useful info ///
    ////////////////////////////////////////////////
    $phones = $html->find('div#productList span.product');

    foreach($phones as $phone) {
        // Get the link
        $linkas = "http://pigu.lt" . $phone->find('p.productPhoto a', 0)->href;

        // Get the name
        $pavadinimas = $phone->find('h3 a', 0)->plaintext;

        // find() with an index returns null when nothing matches; fall back to "000"
        if ( $tempPrice = $phone->find('div[class=price] strong', 0) ) {
            // Get the price text and extract the useful part using regex
            $kaina = $tempPrice->plaintext;
            // This captures the integer part of decimal numbers: in "123,45" it will capture "123"... Use @([\d,]+),?@ to capture the decimal part too
            // Note: a thousands separator, as in "1.499,00", stops the match at "1"
            $kaina = preg_match('@(\d+),?@', $kaina, $matches) ? $matches[1] : "000";
        }
        else
            $kaina = "000";


        echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

    }
    ////////////////////////////////////////////////
    ////////////////////////////////////////////////

    // Extract the next link, if not found return NULL
    $nextLink = ( ($temp = $html->find('div#ListFootPannel a[rel="next"]', 0)) ? "http://pigu.lt".$temp->href : NULL );

    // Clear DOM object
    $html->clear();
    unset($html);

    echo "<hr>";
}
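
The loop above stops only when no rel="next" anchor is found. If the site ever links back to an earlier page, it would run forever, so a guard can help. This is a sketch under stated assumptions: crawlPages and its $maxPages cap are hypothetical, and $fetchNext stands for any callable that takes a URL and returns the next URL or null (for example, a wrapper around the find() call used above).

```php
<?php
// Hypothetical driver: follow "next" links until none is left, a page
// repeats, or a page cap is reached.
function crawlPages(string $startUrl, callable $fetchNext, int $maxPages = 50): array
{
    $visited = [];
    $url = $startUrl;
    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;      // remember the page so a cycle is detected
        $url = $fetchNext($url);    // null means there is no next page
    }
    return array_keys($visited);
}

// Stand-in for real HTTP requests: page 3 wrongly points back to page 1.
$links = ['p1' => 'p2', 'p2' => 'p3', 'p3' => 'p1'];
$pages = crawlPages('p1', function (string $url) use ($links): ?string {
    return $links[$url] ?? null;
});
print_r($pages); // p1, p2, p3, then the loop stops instead of cycling
```

Keeping the fetching logic behind a callable also makes the pagination testable without touching the network.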

