:mod:`urllib.robotparser` --- Parser for robots.txt
====================================================

.. module:: urllib.robotparser
   :synopsis: Load a robots.txt file and answer questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses the *lines* argument, an iterable of lines from a
      :file:`robots.txt` file.

   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.

   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.

   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

   .. method:: crawl_delay(useragent)

      Returns the value of the ``Crawl-delay`` parameter from ``robots.txt``
      for the *useragent* in question.  If there is no such parameter, it
      doesn't apply to the *useragent* specified, or the ``robots.txt`` entry
      for this parameter has invalid syntax, returns ``None``.

      .. versionadded:: 3.6

   .. method:: request_rate(useragent)

      Returns the contents of the ``Request-rate`` parameter from
      ``robots.txt`` in the form of a :func:`~collections.namedtuple`
      ``(requests, seconds)``.  If there is no such parameter, it doesn't
      apply to the *useragent* specified, or the ``robots.txt`` entry for this
      parameter has invalid syntax, returns ``None``.

      .. versionadded:: 3.6


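The methods described above can also be tried without network access by
passing rules directly to :meth:`parse`.  The following is a minimal sketch
under stated assumptions: the ``ROBOTS_TXT`` content and the
``www.example.com`` URLs are made up for illustration and do not correspond
to any real site.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied in memory instead of being fetched.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 6
Request-rate: 3/20
Disallow: /cgi-bin/
"""

rp = urllib.robotparser.RobotFileParser()
# parse() takes an iterable of lines, so split the text first.
rp.parse(ROBOTS_TXT.splitlines())

# The site root is not covered by any Disallow rule.
allowed_home = rp.can_fetch("*", "http://www.example.com/")

# Paths under /cgi-bin/ are disallowed for every user agent.
allowed_cgi = rp.can_fetch(
    "*", "http://www.example.com/cgi-bin/search?city=San+Francisco")

# Crawl-delay is returned as an int, Request-rate as a named tuple.
delay = rp.crawl_delay("*")
rate = rp.request_rate("*")

print(allowed_home, allowed_cgi, delay, rate.requests, rate.seconds)
```

Parsing from a list of lines is convenient in tests, or when the
``robots.txt`` content has already been obtained by other means.
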
The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   >>> import urllib.robotparser
   >>> rp = urllib.robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rrate = rp.request_rate("*")
   >>> rrate.requests
   3
   >>> rrate.seconds
   20
   >>> rp.crawl_delay("*")
   6
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True
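
For a long-running spider, :meth:`mtime` and :meth:`modified` can be combined
to decide when the rules should be loaded again.  The sketch below is only
one possible approach: the ``is_stale`` helper and the ``MAX_AGE_SECONDS``
interval are hypothetical names introduced here and are not part of the
module.  Rules are parsed from memory so that the example runs offline; a
real crawler would typically call :meth:`read` instead.

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 3600  # hypothetical refresh interval, chosen for illustration


def is_stale(parser, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if no fetch time is recorded or it is older than max_age."""
    now = time.time() if now is None else now
    last_checked = parser.mtime()
    return not last_checked or (now - last_checked) > max_age


rp = urllib.robotparser.RobotFileParser()
stale_before = is_stale(rp)  # nothing has been loaded yet

# Load some made-up rules, then record the current time explicitly.
rp.parse(["User-agent: *", "Disallow: /private/"])
rp.modified()
stale_after = is_stale(rp)

# Simulate a check that happens two refresh intervals later.
stale_later = is_stale(rp, now=rp.mtime() + 2 * MAX_AGE_SECONDS)

print(stale_before, stale_after, stale_later)
```

When ``is_stale`` reports ``True``, the spider would reload the rules before
asking :meth:`can_fetch` again.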