:mod:`robotparser` --- Parser for robots.txt
=============================================

.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser()

   This class provides a set of methods to read, parse and answer questions
   about a single :file:`robots.txt` file.


   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.


   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.


   .. method:: parse(lines)

      Parses the *lines* argument, an iterable of lines from a
      :file:`robots.txt` file.


   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.


   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.


   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

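The :meth:`parse` and :meth:`can_fetch` methods can also be tried without
network access. The sketch below is only an illustration: the rules and the
``www.example.com`` host are made up, the fallback import assumes that on
Python 3 the class is available from ``urllib.robotparser``, and the explicit
:meth:`modified` call is a precaution for versions whose :meth:`parse` may not
record a fetch time itself.

```python
try:
    import robotparser  # module name documented here (Python 2)
except ImportError:
    # Assumption: on Python 3 the same class lives in urllib.robotparser.
    from urllib import robotparser

# Hypothetical rules for an imaginary site, not taken from a real robots.txt.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(RULES)   # feed the lines directly instead of calling read()
rp.modified()     # record a fetch time in case parse() did not

# A URL under the disallowed prefix, and one outside it.
blocked = rp.can_fetch("*", "http://www.example.com/cgi-bin/search?city=San+Francisco")
allowed = rp.can_fetch("*", "http://www.example.com/")
print("%s %s" % (blocked, allowed))
```

With these rules, the first query should be refused and the second allowed.
Passing lines to :meth:`parse` this way is convenient for unit tests, since
no request is made.
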
The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True

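For a long-running spider, :meth:`mtime` and :meth:`modified` can be combined
to decide when to fetch :file:`robots.txt` again. The following is a minimal
sketch, not a prescribed pattern: the helpers ``is_stale`` and
``refresh_if_stale`` and the ``MAX_AGE`` interval are hypothetical names
introduced for illustration, and the fallback import assumes the class is
available from ``urllib.robotparser`` on Python 3.

```python
import time

try:
    import robotparser  # module name documented here (Python 2)
except ImportError:
    # Assumption: on Python 3 the same class lives in urllib.robotparser.
    from urllib import robotparser

MAX_AGE = 3600  # hypothetical refresh interval in seconds


def is_stale(rp, max_age=MAX_AGE, now=None):
    """Return True if the rules were last fetched more than max_age seconds ago."""
    if now is None:
        now = time.time()
    return now - rp.mtime() > max_age


def refresh_if_stale(rp, max_age=MAX_AGE):
    """Re-read robots.txt over the network when the cached copy is too old."""
    if is_stale(rp, max_age):
        rp.read()       # fetches the URL previously given to set_url()
        rp.modified()   # record the fetch time explicitly
    return rp


rp = robotparser.RobotFileParser()
fresh_before = not is_stale(rp)   # nothing has been fetched yet
rp.modified()                     # pretend a fetch just happened
fresh_after = not is_stale(rp)
```

Only ``refresh_if_stale`` touches the network, so the staleness check itself
can be tested offline by passing an explicit ``now`` value.
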