:mod:`urllib.robotparser` --- Parser for robots.txt
====================================================

.. module:: urllib.robotparser
   :synopsis: Load a robots.txt file and answer questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses the lines argument.

   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.

   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.

   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.


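The :meth:`RobotFileParser.mtime` and :meth:`RobotFileParser.modified` methods
can support a simple freshness check in a long-running spider.  The following
is a minimal sketch, not part of the module: the ``needs_refresh`` helper, its
``max_age`` and ``now`` parameters, and the one-hour limit are assumptions made
for illustration.  It also assumes that :meth:`RobotFileParser.mtime` returns
``0`` before anything has been fetched, as the CPython implementation does.

.. code-block:: python

   import time
   import urllib.robotparser


   def needs_refresh(rp, max_age, now=None):
       """Hypothetical helper: report whether *rp* should be re-read.

       Returns True if nothing has been fetched yet (mtime() is 0) or if
       the last fetch happened more than *max_age* seconds before *now*.
       """
       if now is None:
           now = time.time()
       last_fetched = rp.mtime()
       return last_fetched == 0 or (now - last_fetched) > max_age


   rp = urllib.robotparser.RobotFileParser()

   # Nothing has been fetched yet, so a refresh is due.
   never_fetched = needs_refresh(rp, max_age=3600)

   # Record "fetched just now"; the parser is then considered fresh.
   rp.modified()
   fresh = needs_refresh(rp, max_age=3600)

   # Pretend two hours have passed since the recorded fetch time.
   stale = needs_refresh(rp, max_age=3600, now=rp.mtime() + 7200)

In a real spider, a ``True`` result would typically be followed by a call to
:meth:`RobotFileParser.read`; calling :meth:`RobotFileParser.modified`
afterwards records the new fetch time.
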
The following example demonstrates basic use of the RobotFileParser class.

   >>> import urllib.robotparser
   >>> rp = urllib.robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True

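The example above needs network access.  Because
:meth:`RobotFileParser.parse` accepts the lines of a :file:`robots.txt` file
directly, the same kind of check can be sketched offline.  The rules, the
``BadBot`` agent name, and the ``example.com`` URLs below are hypothetical and
serve only as an illustration.

.. code-block:: python

   import urllib.robotparser

   # Hypothetical robots.txt content: one group for a specific agent and
   # one default group for every other agent.
   ROBOTS_TXT = """\
   User-agent: BadBot
   Disallow: /

   User-agent: *
   Disallow: /cgi-bin/
   """

   rp = urllib.robotparser.RobotFileParser()
   # parse() expects an iterable of lines, not a single string.
   rp.parse(ROBOTS_TXT.splitlines())

   # The default group only blocks paths under /cgi-bin/.
   home_allowed = rp.can_fetch("*", "http://www.example.com/index.html")
   cgi_allowed = rp.can_fetch(
       "*", "http://www.example.com/cgi-bin/search?city=San+Francisco")

   # The BadBot group disallows everything on the site.
   badbot_allowed = rp.can_fetch("BadBot", "http://www.example.com/index.html")

   print(home_allowed, cgi_allowed, badbot_allowed)  # True False False

Parsing from a list of lines in this way can also be convenient in tests, where
fetching a live :file:`robots.txt` file is undesirable.
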