The robots protocol

2023-01-31 00:01:58 protocol robots

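<div id="cnblogs_post_body" class="blogpost-body"><h3><strong>What is robots.txt?</strong></h3>
<p>robots.txt is a plain-text file, usually located in a site's root directory, and it is the first file a crawler should check when it visits the site. It defines the limits that apply to crawling the site: which parts a crawler may fetch and which it may not. Compliance is voluntary, so the file keeps well-behaved crawlers out but cannot stop one that ignores it.</p>
<p>For more information about the robots.txt protocol, see www.robotstxt.org</p>
<p>Checking robots.txt before crawling a site reduces the risk of the crawler being banned.</p>
<p>Below is part of Baidu's robots.txt: https://www.baidu.com/robots.txt</p>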
<div class="cnblogs_code">
<pre>User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: *
Disallow: /</pre>
</div>
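<p>As a rough sketch of checking robots.txt before crawling, Python's standard-library urllib.robotparser can parse the rules and report whether a given user agent may fetch a URL. The shortened excerpt, the MyCrawler agent name, and the build_parser helper are assumptions made for illustration; a real crawler would load the live file rather than a hard-coded string.</p>

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the rules shown above (assumption: trimmed for brevity).
ROBOTS_EXCERPT = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /home/news/data/

User-agent: *
Disallow: /
"""


def build_parser(robots_text):
    """Hypothetical helper: build a parser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_EXCERPT)

# Baiduspider has its own group, so only the listed prefixes are blocked.
print(parser.can_fetch("Baiduspider", "https://www.baidu.com/index.html"))
print(parser.can_fetch("Baiduspider", "https://www.baidu.com/baidu"))

# MyCrawler (a made-up name) has no group of its own, so it falls back
# to "User-agent: *", which disallows everything.
print(parser.can_fetch("MyCrawler", "https://www.baidu.com/index.html"))
```

<p>To check a live site instead, the same parser offers set_url() followed by read(), which downloads the file before can_fetch() is called.</p>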
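<p><span style="font-size: 15px;"><strong>Meaning of the directives in robots.txt:</strong></span></p>
<p>1. User-agent: names the search engine spider that the rules apply to. If robots.txt contains several User-agent records, several robots are bound by the protocol, so the file must contain at least one User-agent record. If the value is * (a wildcard), the rules apply to every search engine robot. A robots.txt file may contain only one "User-agent: *" record.</p>
<p>2. Disallow: a path that crawlers must not access</p>
<p>For example, Disallow: /home/news/data/ means a crawler may not access any URL under /home/news/data/, but may still access /home/news/data123.</p>
<p>Disallow: /home/news/data means a crawler may not access /home/news/data123, /home/news/datadasf, or any other URL whose path begins with /home/news/data.</p>
<p>Both rules are prefix matches: the trailing slash limits the first to that one directory (exact blocking), while the second blocks every path that begins with the prefix (relative blocking).</p>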
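<p>The effect of the trailing slash can be tried out with a small sketch, again using Python's urllib.robotparser. The blocked_paths helper, the ExampleBot agent name, and the /home/news/data/a.html path are made up for illustration.</p>

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(disallow_rule, paths, agent="ExampleBot"):
    """Hypothetical helper: return the paths that a single Disallow rule blocks."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_rule}"])
    return [p for p in paths if not parser.can_fetch(agent, p)]


PATHS = [
    "/home/news/data/a.html",
    "/home/news/data123",
    "/home/news/datadasf",
    "/home/news/",
]

# With the trailing slash, only URLs inside the data/ directory are blocked.
print(blocked_paths("/home/news/data/", PATHS))

# Without it, every path starting with /home/news/data is blocked.
print(blocked_paths("/home/news/data", PATHS))
```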
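<p>3. Allow: a path that crawlers are allowed to access</p>
<p>For example, suppose /home/ contains several paths such as news, video, and image, and the file specifies Disallow: /home/.</p>
<p>Adding Allow: /home/news then means that everything under /home/ is off limits except paths that begin with /home/news.</p>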
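<p>How an Allow rule overrides a broader Disallow depends on the parser. The sketch below assumes the longest-match precedence described in RFC 9309 (the most specific matching prefix wins, and Allow wins a tie); some parsers instead use the first matching rule in file order, in which case the Allow line has to come before the Disallow line. The is_allowed function and the sample paths are hypothetical, and wildcards are not handled.</p>

```python
def is_allowed(rules, path):
    """Decide access by longest matching prefix (assumed RFC 9309 precedence).

    rules is a list of (directive, prefix) tuples, e.g. ("Disallow", "/home/").
    """
    best_len = -1
    allowed = True  # a path matched by no rule is allowed
    for directive, prefix in rules:
        # An empty prefix matches nothing; skip rules that do not apply.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [("Disallow", "/home/"), ("Allow", "/home/news")]

print(is_allowed(RULES, "/home/news/today.html"))  # matched by the longer Allow rule
print(is_allowed(RULES, "/home/video/1.mp4"))      # matched only by Disallow
print(is_allowed(RULES, "/about"))                 # matched by no rule
```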
</div>
