:mod:`urllib.robotparser` ---  Parser for robots.txt
====================================================

.. module:: urllib.robotparser
   :synopsis: Load a robots.txt file and answer questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser()

   This class provides a set of methods to read, parse and answer questions
   about a single :file:`robots.txt` file.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses *lines*, an iterable of lines taken from a :file:`robots.txt`
      file.

   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.

   .. method:: mtime()

      Returns the time the :file:`robots.txt` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      :file:`robots.txt` files periodically.

   .. method:: modified()

      Sets the time the :file:`robots.txt` file was last fetched to the current
      time.


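:meth:`mtime` and :meth:`modified` can be combined to decide when a cached
:file:`robots.txt` should be fetched again.  The following is a minimal sketch,
not part of the module: the helper ``needs_refresh`` and its ``max_age``
parameter are hypothetical names chosen for illustration, and it assumes that
:meth:`mtime` returns ``0`` before any fetch time has been recorded, as it does
in CPython's implementation.

```python
import time
import urllib.robotparser


def needs_refresh(rp, max_age, now=None):
    """Hypothetical helper: True if rp's rules are older than max_age seconds."""
    if now is None:
        now = time.time()
    # mtime() is the timestamp recorded by the last call to modified().
    return now - rp.mtime() > max_age


rp = urllib.robotparser.RobotFileParser()

# Assumption: no fetch time has been recorded yet, so mtime() is 0.
stale_before = needs_refresh(rp, max_age=3600)

# Record "fetched just now" without touching the network.
rp.modified()
stale_after = needs_refresh(rp, max_age=3600)

print(stale_before, stale_after)
```

A long-running spider might call :meth:`read` again whenever such a helper
returns ``True``, followed by :meth:`modified` if the fetch time is not
recorded automatically.
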
The following example demonstrates basic use of the :class:`RobotFileParser`
class:

   >>> import urllib.robotparser
   >>> rp = urllib.robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True

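Because :meth:`parse` takes the lines directly, rules can also be checked
without any network access.  The sketch below assumes a made-up
:file:`robots.txt` body and the placeholder host ``www.example.com``; neither
comes from a real site.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
# Feed the rules line by line instead of calling set_url() and read().
rp.parse(ROBOTS_TXT.splitlines())

allowed_public = rp.can_fetch("*", "http://www.example.com/index.html")
allowed_private = rp.can_fetch("*", "http://www.example.com/private/data.html")
print(allowed_public, allowed_private)
```

This can be convenient in tests, or when the file has already been downloaded
by other means.
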