:mod:`urllib.robotparser` --- Parser for robots.txt
====================================================

.. module:: urllib.robotparser
   :synopsis: Load a robots.txt file and answer questions about
              fetchability of other URLs.

.. sectionauthor:: Skip Montanaro <skip@pobox.com>

**Source code:** :source:`Lib/urllib/robotparser.py`

.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

--------------

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file. For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


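As a quick orientation, the following minimal sketch feeds a small in-memory
rule set to :class:`RobotFileParser` instead of downloading one. The rules, the
``ExampleBot`` user agent and the ``www.example.com`` URLs are assumptions made
up for illustration; they do not come from a real site.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the file's content as a sequence of lines.
rp.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is an invented user agent name; it falls under the "*" rules.
blocked = rp.can_fetch("ExampleBot", "https://www.example.com/private/report.html")
allowed = rp.can_fetch("ExampleBot", "https://www.example.com/index.html")
print(blocked, allowed)
```

In the CPython implementation the first rule whose path is a prefix of the
requested path decides the answer, which is why this sketch lists the more
specific ``Disallow`` line before the general ``Allow`` line.
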
.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses *lines*, an iterable of lines from a :file:`robots.txt` file.

   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.

   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched. This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.

   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

   .. method:: crawl_delay(useragent)

      Returns the value of the ``Crawl-delay`` parameter from ``robots.txt``
      for the *useragent* in question. Returns ``None`` if there is no such
      parameter, if it does not apply to the *useragent* specified, or if the
      ``robots.txt`` entry for this parameter has invalid syntax.

      .. versionadded:: 3.6

   .. method:: request_rate(useragent)

      Returns the contents of the ``Request-rate`` parameter from
      ``robots.txt`` as a :term:`named tuple` ``RequestRate(requests, seconds)``.
      Returns ``None`` if there is no such parameter, if it does not apply to
      the *useragent* specified, or if the ``robots.txt`` entry for this
      parameter has invalid syntax.

      .. versionadded:: 3.6

   .. method:: site_maps()

      Returns the contents of the ``Sitemap`` parameter from
      ``robots.txt`` in the form of a :func:`list`. Returns ``None`` if there
      is no such parameter or if the ``robots.txt`` entry for this parameter
      has invalid syntax.

      .. versionadded:: 3.8


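The rate-limiting and sitemap accessors described above can be tried without
network access. The sketch below is illustrative only: the rule text, the
``ExampleBot`` and ``OtherBot`` user agents and the sitemap URL are invented
for this example, and :meth:`site_maps` requires Python 3.8 or later.

```python
import urllib.robotparser

# Hypothetical robots.txt content with one named entry and one sitemap.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 6
Request-rate: 3/20
Disallow: /cgi-bin/

Sitemap: https://www.example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

delay = rp.crawl_delay("ExampleBot")      # integer number of seconds
rate = rp.request_rate("ExampleBot")      # named tuple (requests, seconds)
sitemaps = rp.site_maps()                 # list of sitemap URLs, or None
missing = rp.crawl_delay("OtherBot")      # no entry applies to this agent

print(delay, rate.requests, rate.seconds, sitemaps, missing)
```

Because each accessor may return ``None``, a crawler would normally check the
result before using it, for example by falling back to its own default delay.
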
The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   >>> import urllib.robotparser
   >>> rp = urllib.robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rrate = rp.request_rate("*")
   >>> rrate.requests
   3
   >>> rrate.seconds
   20
   >>> rp.crawl_delay("*")
   6
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True
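
For the long-running spiders mentioned under :meth:`mtime`, one possible way to
decide when to fetch ``robots.txt`` again is sketched below. The ``is_stale``
helper and the ``MAX_AGE_SECONDS`` interval are hypothetical names introduced
for this example, not part of the module. The sketch also assumes that
:meth:`mtime` returns ``0`` before anything has been loaded, which is how the
CPython implementation behaves.

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 3600  # hypothetical refresh interval, not part of the module


def is_stale(rp, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the rules were never loaded or are older than max_age."""
    if now is None:
        now = time.time()
    last_fetched = rp.mtime()
    # Assumption: a parser that has not loaded anything reports 0 (CPython).
    return last_fetched == 0 or now - last_fetched > max_age


rp = urllib.robotparser.RobotFileParser()
never_loaded = is_stale(rp)  # nothing has been processed yet

# Stand-in for rp.read(), so that the sketch runs without network access.
rp.parse(["User-agent: *", "Disallow:"])
rp.modified()  # record the current time as the last fetch time
just_loaded = is_stale(rp)

# Pretend that more than MAX_AGE_SECONDS have passed since the fetch.
much_later = is_stale(rp, now=rp.mtime() + MAX_AGE_SECONDS + 1)
print(never_loaded, just_loaded, much_later)
```

In a real crawler the ``parse()`` call would be replaced by ``rp.read()``,
repeated whenever ``is_stale(rp)`` reports that the cached rules are too old.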