:mod:`robotparser` --- Parser for robots.txt
=============================================

.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

.. note::
   The :mod:`robotparser` module has been renamed :mod:`urllib.robotparser` in
   Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file. For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.

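As a rough illustration of the kind of question the parser answers, the
following sketch loads rules from an in-memory string instead of fetching
them over the network. The rules, the ``build_parser`` helper and the
``www.example.com`` URLs are hypothetical and chosen only for illustration.
The import falls back to the Python 3 module name so the sketch runs on
either version.

```python
try:
    import robotparser  # Python 2 module name
except ImportError:
    from urllib import robotparser  # renamed in Python 3

# Hypothetical rules for an imaginary site; not taken from a real robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def build_parser(text):
    """Return a RobotFileParser loaded from robots.txt text (no network access)."""
    rp = robotparser.RobotFileParser()
    # parse() expects the file contents as a sequence of lines.
    rp.parse(text.splitlines())
    return rp


rp = build_parser(SAMPLE_RULES)
# The site root matches no Disallow rule, so fetching it is allowed.
allowed_home = rp.can_fetch("*", "http://www.example.com/")
# Paths under /cgi-bin/ are disallowed for every user agent.
allowed_cgi = rp.can_fetch("*", "http://www.example.com/cgi-bin/search")
```

Parsing from a string in this way is mainly useful for experiments and tests.
A real crawler would normally call ``set_url()`` followed by ``read()``.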

.. class:: RobotFileParser()

   This class provides a set of methods to read, parse and answer questions
   about a single :file:`robots.txt` file.


   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.


   .. method:: read()

      Fetches the :file:`robots.txt` URL set with :meth:`set_url` and feeds
      its contents to the parser.


   .. method:: parse(lines)

      Parses *lines*, a list of lines from a :file:`robots.txt` file.


   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.


   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched. This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.


   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

The following example demonstrates basic use of the :class:`RobotFileParser`
class. ::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True

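A long-running spider might combine ``mtime()`` and ``modified()`` to decide
when its copy of the rules is due for a refresh. The sketch below is one
possible approach, not part of the module: the ``is_stale`` helper, the
one-hour ``MAX_AGE_SECONDS`` interval and the sample rules are all
assumptions made for illustration.

```python
import time

try:
    import robotparser  # Python 2 module name
except ImportError:
    from urllib import robotparser  # renamed in Python 3

MAX_AGE_SECONDS = 3600  # hypothetical refresh interval, not a module default


def is_stale(rp, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the rules held by rp were recorded more than max_age seconds ago."""
    if now is None:
        now = time.time()
    return now - rp.mtime() > max_age


rp = robotparser.RobotFileParser()
# Hypothetical rules supplied directly, so no network access is needed.
rp.parse(["User-agent: *", "Disallow: /private/"])
rp.modified()  # record the current time as the fetch time

# Immediately after modified(), the rules are not yet stale.
fresh = not is_stale(rp)
# Two intervals later (simulated via the now argument) they are.
stale_later = is_stale(rp, now=rp.mtime() + 2 * MAX_AGE_SECONDS)
```

When ``is_stale`` returns true, a spider would typically call ``read()``
again and then ``modified()`` to record the new fetch time.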