.. Georg Brandl 116aa62 2007-08-15 14:28:22 +0000

:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.
.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>


The :class:`shlex` class makes it easy to write lexical analyzers for simple
syntaxes resembling that of the Unix shell. This will often be useful for
writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

.. note::

   The :mod:`shlex` module currently does not support Unicode input.

The :mod:`shlex` module defines the following function:


.. function:: split(s[, comments[, posix]])

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`commenters` member of the :class:`shlex` instance to the
   empty string). This function operates in POSIX mode by default, but uses
   non-POSIX mode if the *posix* argument is false.

   .. note::

      Since the :func:`split` function instantiates a :class:`shlex` instance, passing
      ``None`` for *s* will read the string to split from standard input.

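As a rough illustration of :func:`split`, the sketch below splits one made-up
command line three ways: with the defaults, with comment parsing enabled, and
in non-POSIX mode. The sample strings are invented for the example, and the
snippet assumes a Python 3 interpreter.

```python
import shlex

command = 'cp "my file.txt" backup/ # keep a copy'

# Defaults: POSIX mode, comment parsing disabled, so '#' is an ordinary token.
default_tokens = shlex.split(command)

# With comments enabled, everything from '#' to the end of the line is dropped.
comment_tokens = shlex.split(command, comments=True)

# In non-POSIX mode the quote characters stay attached to the token.
compat_tokens = shlex.split('say "hello world"', posix=False)

print(default_tokens)  # ['cp', 'my file.txt', 'backup/', '#', 'keep', 'a', 'copy']
print(comment_tokens)  # ['cp', 'my file.txt', 'backup/']
print(compat_tokens)   # ['say', '"hello world"']
```
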
The :mod:`shlex` module defines the following class:


.. class:: shlex([instream[, infile[, posix]]])

   A :class:`shlex` instance or subclass instance is a lexical analyzer object.
   The initialization argument, if present, specifies where to read characters
   from. It must be a file-/stream-like object with :meth:`read` and
   :meth:`readline` methods, or a string (strings are accepted since Python 2.3).
   If no argument is given, input will be taken from ``sys.stdin``. The second
   optional argument is a filename string, which sets the initial value of the
   :attr:`infile` member. If the *instream* argument is omitted or equal to
   ``sys.stdin``, this second argument defaults to "stdin". The *posix* argument
   was introduced in Python 2.3, and defines the operational mode. When *posix* is
   not true (default), the :class:`shlex` instance will operate in compatibility
   mode. When operating in POSIX mode, :class:`shlex` will try to be as close as
   possible to the POSIX shell parsing rules.

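A minimal sketch of the two operational modes, assuming a Python 3 interpreter
and an invented input string. Tokens are collected by calling
:meth:`get_token` until it returns :attr:`eof`; the two-argument form of
``iter()`` is just a compact way to write that loop.

```python
import shlex

text = 'title = "Simple lexical analysis"'

# Compatibility (non-POSIX) mode is the default for the class.
compat_lexer = shlex.shlex(text)
compat_tokens = list(iter(compat_lexer.get_token, compat_lexer.eof))

# POSIX mode strips the quotes from the quoted string.
posix_lexer = shlex.shlex(text, posix=True)
posix_tokens = list(iter(posix_lexer.get_token, posix_lexer.eof))

print(compat_tokens)  # ['title', '=', '"Simple lexical analysis"']
print(posix_tokens)   # ['title', '=', 'Simple lexical analysis']
```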

.. seealso::

   Module :mod:`ConfigParser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token. If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack. Otherwise, read one from the input stream. If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.

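The interplay of :meth:`get_token` and :meth:`push_token` can be sketched as
follows. The input string and the pushed tokens are invented for the example;
the point is that pushed tokens come back in last-in, first-out order before
any further input is read.

```python
import shlex

lexer = shlex.shlex("alpha beta")

first = lexer.get_token()    # read 'alpha' from the input
lexer.push_token(first)      # put it back on the token stack
again = lexer.get_token()    # the pushed token is returned first

# Tokens pushed later are popped earlier (a stack).
lexer.push_token("x")
lexer.push_token("y")
popped = [lexer.get_token(), lexer.get_token()]

rest = lexer.get_token()     # the stack is empty, so read from the input again

print(first, again)  # alpha alpha
print(popped)        # ['y', 'x']
print(rest)          # beta
```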

.. method:: shlex.read_token()

   Read a raw token. Ignore the pushback stack, and do not interpret source
   requests. (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`shlex` detects a source request (see :attr:`source` below) this
   method is given the following token as argument, and expected to return a tuple
   consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument. If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone. Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the :meth:`close`
   method of the sourced input stream when it returns EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.

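One possible use of :meth:`sourcehook` is to redirect inclusion requests away
from the filesystem. The sketch below is only an illustration under stated
assumptions: ``MemoryLexer``, ``FAKE_FILES``, the ``include`` keyword and the
file name ``common.rc`` are all hypothetical names invented for the example,
and a Python 3 interpreter is assumed.

```python
import io
import shlex

# Hypothetical in-memory "files" standing in for the real filesystem.
FAKE_FILES = {"common.rc": "color blue"}


class MemoryLexer(shlex.shlex):
    """Lexer that resolves source requests against FAKE_FILES."""

    def sourcehook(self, filename):
        # In non-POSIX mode the token still carries its quotes, so strip them.
        name = filename.strip('"')
        # Return (filename, open file-like object), as the hook is expected to.
        return name, io.StringIO(FAKE_FILES[name])


lexer = MemoryLexer('include "common.rc" size 10')
lexer.source = "include"   # treat 'include' as the source-request keyword

tokens = list(iter(lexer.get_token, lexer.eof))
print(tokens)  # ['color', 'blue', 'size', '10']
```

The included text is tokenized in place of the request, after which the lexer
returns to the original input.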

.. method:: shlex.push_source(stream[, filename])

   Push an input source stream onto the input stack. If the filename argument is
   specified it will later be available for use in error messages. This is the
   same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.

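A small sketch of explicit source stacking, assuming a Python 3 interpreter.
The stream contents and the names ``outer-file`` and ``inner-file`` are made
up for the illustration. Note that the pushed stream is popped automatically
once it is exhausted.

```python
import io
import shlex

lexer = shlex.shlex("outer1 outer2", "outer-file")

first = lexer.get_token()                        # 'outer1'

# Temporarily switch to another input stream.
lexer.push_source(io.StringIO("inner"), "inner-file")
name_while_pushed = lexer.infile                 # 'inner-file'
second = lexer.get_token()                       # 'inner'

# The inner stream hits EOF, is popped, and reading resumes on the outer one.
third = lexer.get_token()                        # 'outer2'
name_after_pop = lexer.infile                    # 'outer-file'

print(first, second, third)
print(name_while_pushed, name_after_pop)
```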

.. method:: shlex.error_leader([file[, line]])

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

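For illustration, the following sketch builds two error messages with
:meth:`error_leader`. The file names ``settings.rc`` and ``other.rc`` and the
message text are invented for the example.

```python
import shlex

# "settings.rc" is a made-up file name, used only in the message.
lexer = shlex.shlex("first\nsecond oops", "settings.rc")
lexer.get_token()   # 'first'
lexer.get_token()   # 'second', which sits on line 2 of the input

current = lexer.error_leader()                # uses infile and lineno
override = lexer.error_leader("other.rc", 7)  # both values overridden

print(current + "unexpected token")   # "settings.rc", line 2: unexpected token
print(override + "unexpected token")  # "other.rc", line 7: unexpected token
```
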
Instances of :class:`shlex` subclasses have some public instance variables which
either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens. By
   default, includes all ASCII alphanumerics and underscore.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped. Whitespace bounds
   tokens. By default, includes space, tab, linefeed and carriage-return.

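The effect of :attr:`wordchars` and :attr:`commenters` can be sketched as
below. The input strings and the choice of ``;`` as a comment character are
assumptions made for the example, not defaults of the module.

```python
import shlex

# With the default wordchars, '-' and '.' each become one-character tokens.
plain = shlex.shlex("example-1.test")
plain_tokens = list(iter(plain.get_token, plain.eof))

# Extend wordchars and use ';' instead of '#' to start comments.
custom = shlex.shlex("host=example-1.test ; trailing note")
custom.wordchars += "-."
custom.commenters = ";"
custom_tokens = list(iter(custom.get_token, custom.eof))

print(plain_tokens)   # ['example', '-', '1', '.', 'test']
print(custom_tokens)  # ['host', '=', 'example-1.test']
```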

.. attribute:: shlex.escape

   Characters that will be considered as escape. This will be only used in POSIX
   mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes. The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell). By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`. This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`shlex`, getting tokens in a
   similar way to shell arguments.

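A short sketch of what :attr:`whitespace_split` changes, using an invented
command line and the default (non-POSIX) mode:

```python
import shlex

command = "ls -l /tmp/my-dir"

# Default: characters outside wordchars become one-character tokens.
fine = shlex.shlex(command)
fine_tokens = list(iter(fine.get_token, fine.eof))

# whitespace_split: only whitespace separates tokens, as with shell arguments.
coarse = shlex.shlex(command)
coarse.whitespace_split = True
coarse_tokens = list(iter(coarse.get_token, coarse.eof))

print(fine_tokens)    # ['ls', '-', 'l', '/', 'tmp', '/', 'my', '-', 'dir']
print(coarse_tokens)  # ['ls', '-l', '/tmp/my-dir']
```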

.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests. It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`shlex` instance is reading characters.


.. attribute:: shlex.source

   This member is ``None`` by default. If you assign a string to it, that string
   will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells. That is, the immediately following token
   will be opened as a filename and input taken from that stream until EOF, at which
   point the :meth:`close` method of that stream will be called and the input
   source will again become the original input stream. Source requests may be
   stacked any number of levels deep.


.. attribute:: shlex.debug

   If this member is numeric and ``1`` or more, a :class:`shlex` instance will
   print verbose progress output on its behavior. If you need to use this, you can
   read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer. It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``) in non-POSIX mode, and to ``None`` in POSIX mode.

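To make the two end-of-file markers concrete, this sketch reads from an empty
input in each mode and compares the result with :attr:`eof`:

```python
import shlex

compat_lexer = shlex.shlex("")              # non-POSIX mode
posix_lexer = shlex.shlex("", posix=True)   # POSIX mode

compat_end = compat_lexer.get_token()
posix_end = posix_lexer.get_token()

print(repr(compat_lexer.eof), repr(compat_end))  # '' ''
print(repr(posix_lexer.eof), repr(posix_end))    # None None
```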

.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`whitespace_split` is ``False``, any character not declared to be a
  word character, whitespace, or a quote will be returned as a single-character
  token. If it is ``True``, :class:`shlex` will only split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

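The non-POSIX rules above can be checked with a few invented inputs. The
helper ``compat_tokens`` is a hypothetical convenience function written for
this sketch, not part of the module.

```python
import shlex


def compat_tokens(text):
    """Tokenize text in non-POSIX mode (hypothetical helper)."""
    lexer = shlex.shlex(text)
    return list(iter(lexer.get_token, lexer.eof))


inside_word = compat_tokens('Do"Not"Separate')  # quotes inside a word are kept
closing = compat_tokens('"Do"Separate')         # a closing quote ends the word
backslash = compat_tokens(r'a\b')               # the backslash is not an escape
punctuation = compat_tokens('x+y')              # '+' is a one-character token
empty = compat_tokens("''")                     # the quotes themselves come back

print(inside_word)   # ['Do"Not"Separate']
print(closing)       # ['"Do"', 'Separate']
print(backslash)     # ['a', '\\', 'b']
print(punctuation)   # ['x', '+', 'y']
print(empty)         # ["''"]
```
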
When operating in POSIX mode, :class:`shlex` will try to obey the following
parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of :attr:`escapedquotes`
  (e.g. ``"'"``) preserve the literal value of all characters within the quotes;

* Enclosing characters in quotes which are part of :attr:`escapedquotes` (e.g.
  ``'"'``) preserve the literal value of all characters within the quotes, with
  the exception of the characters mentioned in :attr:`escape`. The escape
  characters retain their special meaning only when followed by the quote in use, or
  the escape character itself. Otherwise the escape character will be considered a
  normal character;

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

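The POSIX rules above can be sketched the same way. The helper
``posix_tokens`` is a hypothetical convenience function for this example, and
raw string literals are used so that the backslashes reach the lexer
unchanged.

```python
import shlex


def posix_tokens(text):
    """Tokenize text in POSIX mode (hypothetical helper)."""
    lexer = shlex.shlex(text, posix=True)
    return list(iter(lexer.get_token, lexer.eof))


joined = posix_tokens('"Do"Not"Separate"')   # quotes stripped, one word
escaped = posix_tokens(r'one\ word')         # unquoted escape keeps the space
single = posix_tokens(r"'a\b'")              # single quotes: all literal
double = posix_tokens(r'"say \"hi\""')       # escape works before the quote in use
literal = posix_tokens(r'"a\b"')             # otherwise the backslash is kept
empty = posix_tokens("''")                   # a quoted empty string is a token

print(joined)   # ['DoNotSeparate']
print(escaped)  # ['one word']
print(single)   # ['a\\b']
print(double)   # ['say "hi"']
print(literal)  # ['a\\b']
print(empty)    # ['']
```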