:mod:`urllib.robotparser` --- Parser for robots.txt
====================================================

.. module:: urllib.robotparser
   :synopsis: Load a robots.txt file and answer questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses *lines*, an iterable of the lines of a :file:`robots.txt` file.

   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.

   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.

   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

   .. method:: crawl_delay(useragent)

      Returns the value of the ``Crawl-delay`` parameter from ``robots.txt``
      for the *useragent* in question.  Returns ``None`` if there is no such
      parameter, if it doesn't apply to the *useragent* specified, or if the
      ``robots.txt`` entry for this parameter has invalid syntax.

      .. versionadded:: 3.6

   .. method:: request_rate(useragent)

      Returns the contents of the ``Request-rate`` parameter from
      ``robots.txt`` in the form of a :func:`~collections.namedtuple`
      ``(requests, seconds)``.  Returns ``None`` if there is no such
      parameter, if it doesn't apply to the *useragent* specified, or if the
      ``robots.txt`` entry for this parameter has invalid syntax.

      .. versionadded:: 3.6


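The methods above can also be exercised without any network access by passing
the rules straight to ``parse()``. The following is a minimal sketch, not part
of the original documentation: the rules in ``ROBOTS_TXT`` and the
``www.example.com`` URLs are made up for illustration, and the results assume a
current CPython release in which ``parse()`` records the fetch time and the
``*`` entry is consulted as a fallback.

```python
import urllib.robotparser

# Hypothetical rules, supplied inline so that no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 6
Request-rate: 3/20
Disallow: /cgi-bin/
"""

rp = urllib.robotparser.RobotFileParser()
# parse() takes an iterable of lines rather than one large string.
rp.parse(ROBOTS_TXT.splitlines())

# The site root is not covered by any Disallow rule.
allowed_home = rp.can_fetch("*", "http://www.example.com/")
# Paths under /cgi-bin/ are disallowed for every user agent.
allowed_search = rp.can_fetch("*", "http://www.example.com/cgi-bin/search")

# Both values come from the "*" entry; either may be None in general.
delay = rp.crawl_delay("*")
rate = rp.request_rate("*")
```

With these rules, ``allowed_home`` is true and ``allowed_search`` is false, and
``rate`` exposes the ``requests`` and ``seconds`` fields described above. Code
that relies on ``crawl_delay()`` or ``request_rate()`` should still handle a
``None`` result, since many sites do not set either parameter.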
The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   >>> import urllib.robotparser
   >>> rp = urllib.robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rrate = rp.request_rate("*")
   >>> rrate.requests
   3
   >>> rrate.seconds
   20
   >>> rp.crawl_delay("*")
   6
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True
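
For a long-running spider, ``mtime()`` can be used to decide when the rules are
due for a refresh. The sketch below is one possible approach rather than a
prescribed pattern: the helper ``is_stale`` and the interval ``MAX_AGE`` are
hypothetical names introduced here, and the check relies on ``mtime()``
returning ``0`` until the rules have been loaded, which is how current CPython
behaves.

```python
import time
import urllib.robotparser

MAX_AGE = 3600  # hypothetical refresh interval, in seconds


def is_stale(rp, max_age=MAX_AGE, now=None):
    """Return True if rp was never loaded or was loaded over max_age seconds ago."""
    now = time.time() if now is None else now
    last = rp.mtime()
    return last == 0 or now - last > max_age


rp = urllib.robotparser.RobotFileParser()
never_loaded = is_stale(rp)  # nothing has been fetched or parsed yet

rp.modified()                # record "fetched just now" by hand
fresh = is_stale(rp)         # the timestamp is current, so no refresh is due
```

In a real crawler the ``rp.modified()`` call would normally be replaced by
``rp.read()``, invoked again whenever ``is_stale(rp)`` is true.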