\section{\module{robotparser} ---
         Parser for robots.txt}

\declaremodule{standard}{robotparser}
\modulesynopsis{Accepts as input a list of lines or URL that refers to a
                robots.txt file, parses the file, then builds a
                set of rules from that list and answers questions
                about fetchability of other URLs.}
\sectionauthor{Skip Montanaro}{skip@mojam.com}

\index{WWW}
\index{World Wide Web}
\index{URL}
\index{robots.txt}

This module provides a single class, \class{RobotFileParser}, which answers
questions about whether or not a particular user agent can fetch a URL on
the Web site that published the \file{robots.txt} file.  For more details on
the structure of \file{robots.txt} files, see
\url{http://info.webcrawler.com/mak/projects/robots/norobots.html}.

\begin{classdesc}{RobotFileParser}{}

This class provides a set of methods to read, parse and answer questions
about a single \file{robots.txt} file.

\begin{methoddesc}{set_url}{url}
Sets the URL referring to a \file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{read}{}
Reads the \file{robots.txt} URL and feeds it to the parser.
\end{methoddesc}

\begin{methoddesc}{parse}{lines}
Parses the \var{lines} argument, a list of lines from a
\file{robots.txt} file, and builds the access rules from them.
\end{methoddesc}
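When the \file{robots.txt} text is already in hand (fetched by some
other means), its lines can be fed to \method{parse()} directly instead
of calling \method{read()}.  A minimal sketch; the rules and the
\code{example.com} URLs are placeholders chosen for illustration:

```python
# Sketch: building access rules from in-memory lines with parse().
# (In Python 3 the same class lives in urllib.robotparser.)
try:
    import robotparser               # Python 2 module name
except ImportError:
    from urllib import robotparser   # Python 3 module name

rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)                      # build rules; no network access

# /cgi-bin/ is disallowed for every agent, everything else is allowed
print(rp.can_fetch("*", "http://example.com/cgi-bin/search"))
print(rp.can_fetch("*", "http://example.com/index.html"))
```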

\begin{methoddesc}{can_fetch}{useragent, url}
Returns true if the \var{useragent} is allowed to fetch the \var{url}
according to the rules contained in the parsed \file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{mtime}{}
Returns the time the \file{robots.txt} file was last fetched.  This is
useful for long-running web spiders that need to check for new
\file{robots.txt} files periodically.
\end{methoddesc}

\begin{methoddesc}{modified}{}
Sets the time the \file{robots.txt} file was last fetched to the current
time.
\end{methoddesc}

\end{classdesc}

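A long-running spider can combine \method{modified()} and
\method{mtime()} to decide when its cached rules have grown stale and
should be re-fetched.  A hedged sketch; the one-hour threshold and the
helper function are arbitrary choices for illustration:

```python
# Sketch: deciding when cached robots.txt rules should be refreshed.
# (In Python 3 the same class lives in urllib.robotparser.)
import time
try:
    import robotparser               # Python 2 module name
except ImportError:
    from urllib import robotparser   # Python 3 module name

MAX_AGE = 3600  # illustrative threshold: refresh after one hour

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
rp.modified()                        # record "last fetched" as now


def rules_are_stale(parser):
    # mtime() returns the timestamp recorded by modified()
    return time.time() - parser.mtime() > MAX_AGE


print(rules_are_stale(rp))           # freshly recorded, so not stale
```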
The following example demonstrates basic use of the
\class{RobotFileParser} class.

\begin{verbatim}
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
0
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
1
\end{verbatim}