:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.
.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>


The :class:`shlex` class makes it easy to write lexical analyzers for simple
syntaxes resembling that of the Unix shell. This will often be useful for
writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

.. note::

   The :mod:`shlex` module currently does not support Unicode input.

The :mod:`shlex` module defines the following functions:


.. function:: split(s[, comments[, posix]])

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`commenters` member of the :class:`shlex` instance to the
   empty string). This function operates in POSIX mode by default, but uses
   non-POSIX mode if the *posix* argument is false.

   .. note::

      Since the :func:`split` function instantiates a :class:`shlex` instance, passing
      ``None`` for *s* will read the string to split from standard input.

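As a rough illustration of the *comments* and *posix* arguments of
:func:`split`, the following sketch splits a few command strings. The sample
strings are made up for this example.

```python
import shlex

# POSIX mode (the default): quotes group words and are then removed.
posix_tokens = shlex.split('cp "my file.txt" /tmp/backup')

# With comments enabled, everything from '#' to the end of the line is dropped.
commented = shlex.split('cp a b  # make a copy', comments=True)

# Non-POSIX mode keeps the quote characters as part of the token.
compat_tokens = shlex.split('say "hello world"', posix=False)

print(posix_tokens)   # ['cp', 'my file.txt', '/tmp/backup']
print(commented)      # ['cp', 'a', 'b']
print(compat_tokens)  # ['say', '"hello world"']
```
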
The :mod:`shlex` module defines the following class:


.. class:: shlex([instream[, infile[, posix]]])

   A :class:`shlex` instance or subclass instance is a lexical analyzer object.
   The initialization argument, if present, specifies where to read characters
   from. It must be a file-/stream-like object with :meth:`read` and
   :meth:`readline` methods, or a string. If no argument is given, input will
   be taken from ``sys.stdin``. The second optional argument is a filename
   string, which sets the initial value of the :attr:`infile` member. If the
   *instream* argument is omitted or equal to ``sys.stdin``, this second
   argument defaults to "stdin". The *posix* argument defines the operational
   mode: when *posix* is not true (default), the :class:`shlex` instance will
   operate in compatibility mode. When operating in POSIX mode, :class:`shlex`
   will try to be as close as possible to the POSIX shell parsing rules.


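A minimal sketch of constructing a lexer from a string and collecting its
tokens by iteration. The input text and the file name ``example.conf`` are
assumptions used only for illustration.

```python
import shlex

text = 'name = "Ada Lovelace"'

# POSIX mode: the quotes are stripped from the quoted token.
posix_lexer = shlex.shlex(text, infile='example.conf', posix=True)
source_name = posix_lexer.infile
posix_tokens = list(posix_lexer)

# Compatibility (non-POSIX) mode is the default: quotes stay in the token.
compat_tokens = list(shlex.shlex(text))

print(source_name)    # example.conf
print(posix_tokens)   # ['name', '=', 'Ada Lovelace']
print(compat_tokens)  # ['name', '=', '"Ada Lovelace"']
```
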
.. seealso::

   Module :mod:`ConfigParser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token. If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack. Otherwise, read one from the input stream. If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.


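The interplay of :meth:`get_token` and :meth:`push_token` can be sketched as
follows: a token is read, pushed back, and then read again before the lexer
moves on. The input string is an arbitrary example.

```python
import shlex

lexer = shlex.shlex('alpha beta', posix=True)

first = lexer.get_token()    # read from the input stream
lexer.push_token(first)      # put it back on the token stack
again = lexer.get_token()    # popped from the stack, not re-read
second = lexer.get_token()   # next token from the input stream
at_end = lexer.get_token()   # end of input: returns lexer.eof

print(first, again, second, at_end)  # alpha alpha beta None
```
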
.. method:: shlex.read_token()

   Read a raw token. Ignore the pushback stack, and do not interpret source
   requests. (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`shlex` detects a source request (see :attr:`source` below), this
   method is given the following token as argument, and is expected to return a
   tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument. If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone. Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the :meth:`close`
   method of the sourced input stream when it returns EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.


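One possible use of :meth:`sourcehook` is to resolve inclusion requests
against something other than the file system. The sketch below overrides the
hook in a subclass so that names are looked up in a dictionary of in-memory
texts. The class name ``InMemoryLexer``, the ``files`` mapping, and the
``include`` keyword are all hypothetical names chosen for this example.

```python
import io
import shlex


class InMemoryLexer(shlex.shlex):
    """Hypothetical lexer that resolves inclusions from a dict of texts."""

    def __init__(self, text, files):
        super().__init__(text, posix=True)
        self.files = files
        self.source = 'include'  # keyword that triggers a source request

    def sourcehook(self, filename):
        # Return (name, open stream), as the default hook does.
        return filename, io.StringIO(self.files[filename])


lexer = InMemoryLexer('a include extra b', {'extra': 'x y'})
tokens = list(lexer)
print(tokens)  # ['a', 'x', 'y', 'b']
```

The tokens of the included text appear in place of the ``include extra``
request, and lexing resumes with the outer input once the inner stream is
exhausted.
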
.. method:: shlex.push_source(stream[, filename])

   Push an input source stream onto the input stack. If the filename argument is
   specified it will later be available for use in error messages. This is the
   same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.


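A small sketch of explicit source stacking with :meth:`push_source`. An
in-memory stream is pushed after the first token has been read; the lexer
drains it and then returns to the original input. The stream contents and the
name ``inner.txt`` are assumptions for illustration.

```python
import io
import shlex

lexer = shlex.shlex('outer1 outer2', posix=True)
tokens = [lexer.get_token()]           # 'outer1'

# Stack a second input source on top of the current one.
lexer.push_source(io.StringIO('inner1 inner2'), 'inner.txt')
name_while_stacked = lexer.infile      # name given to push_source

# Iteration drains the stacked stream, pops it, then resumes the outer input.
tokens.extend(lexer)

print(name_while_stacked)  # inner.txt
print(tokens)              # ['outer1', 'inner1', 'inner2', 'outer2']
```
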
.. method:: shlex.error_leader([file[, line]])

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

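For illustration, the leader produced by :meth:`error_leader` for a freshly
created lexer, with and without the optional overrides. The file names used
here are made up.

```python
import shlex

lexer = shlex.shlex('first second', infile='demo.conf')

# Uses the lexer's current file name and line number.
default_leader = lexer.error_leader()

# Both values can be overridden explicitly.
custom_leader = lexer.error_leader('other.conf', 7)

print(default_leader + 'unexpected token')  # "demo.conf", line 1: unexpected token
print(custom_leader + 'unexpected token')   # "other.conf", line 7: unexpected token
```
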
Instances of :class:`shlex` subclasses have some public instance variables which
either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens. By
   default, includes all ASCII alphanumerics and underscore.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped. Whitespace bounds
   tokens. By default, includes space, tab, linefeed and carriage-return.


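The effect of :attr:`wordchars` and :attr:`commenters` can be sketched by
adjusting them on an instance before reading any tokens. The option string and
the choice of ``';'`` as a comment character are assumptions for this example.

```python
import shlex

text = '--log-level=debug'

# Default wordchars: '-' and '=' are not word characters, so each one
# comes back as its own single-character token.
default_tokens = list(shlex.shlex(text))

# Adding '-' to wordchars lets it accumulate into longer tokens.
lexer = shlex.shlex(text)
lexer.wordchars += '-'
extended_tokens = list(lexer)

# Replacing commenters changes which character starts a comment.
ini_like = shlex.shlex('key value ; trailing remark')
ini_like.commenters = ';'
comment_tokens = list(ini_like)

print(default_tokens)   # ['-', '-', 'log', '-', 'level', '=', 'debug']
print(extended_tokens)  # ['--log-level', '=', 'debug']
print(comment_tokens)   # ['key', 'value']
```
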
.. attribute:: shlex.escape

   Characters that will be considered escape characters. This is only used in
   POSIX mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes. The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell). By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`. This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`shlex`, getting tokens in a
   similar way to shell arguments.


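A brief sketch contrasting the default tokenization with
:attr:`whitespace_split` enabled. The command line is an arbitrary example.

```python
import shlex

command = 'ls -l /usr/local/bin'

# Default: '-' and '/' are returned as separate single-character tokens.
fine_grained = list(shlex.shlex(command, posix=True))

# With whitespace_split, only whitespace separates tokens.
lexer = shlex.shlex(command, posix=True)
lexer.whitespace_split = True
shell_like = list(lexer)

print(fine_grained)  # ['ls', '-', 'l', '/', 'usr', '/', 'local', '/', 'bin']
print(shell_like)    # ['ls', '-l', '/usr/local/bin']
```
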
.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests. It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`shlex` instance is reading characters.


.. attribute:: shlex.source

   This member is ``None`` by default. If you assign a string to it, that string
   will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells. That is, the immediately following token
   will be opened as a filename and input taken from that stream until EOF, at
   which point the :meth:`close` method of that stream will be called and the
   input source will again become the original input stream. Source requests may
   be stacked any number of levels deep.


.. attribute:: shlex.debug

   If this member is numeric and ``1`` or more, a :class:`shlex` instance will
   print verbose progress output on its behavior. If you need to use this, you can
   read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer. It may be useful to examine this when catching exceptions.


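As a sketch of how :attr:`token` and :attr:`lineno` might be examined when
catching exceptions, the following example feeds the lexer an unterminated
quoted string in POSIX mode and builds an error message from the lexer state.
The input and the file name ``cmd.txt`` are assumptions for illustration.

```python
import shlex

lexer = shlex.shlex('ok "unterminated', infile='cmd.txt', posix=True)

message = None
partial = None
try:
    list(lexer)
except ValueError as exc:
    # error_leader() uses the lexer's current infile and lineno.
    message = lexer.error_leader() + str(exc)
    # The token buffer holds what was accumulated before the failure.
    partial = lexer.token

print(message)
print(partial)
```
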
.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``) in non-POSIX mode, and to ``None`` in POSIX mode.


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`whitespace_split` is ``False``, any character not declared to be a
  word character, whitespace, or a quote will be returned as a single-character
  token. If it is ``True``, :class:`shlex` will only split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

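The non-POSIX rules listed above can be checked with a few small inputs. This
is only a sketch; the sample strings follow the examples in the list, and
``a+b`` is an extra made-up input.

```python
import shlex

# Quotes inside a word are kept and do not split it.
inside_word = shlex.split('Do"Not"Separate', posix=False)

# A closing quote ends the token; the quotes stay in the result.
closing_quote = shlex.split('"Do"Separate', posix=False)

# Backslash is not an escape character in non-POSIX mode.
no_escape = shlex.split(r'a\ b', posix=False)

# With whitespace_split left False, '+' is a single-character token.
single_char = list(shlex.shlex('a+b'))

# End of input is signaled with the empty string.
eof_token = shlex.shlex('').get_token()

print(inside_word)    # ['Do"Not"Separate']
print(closing_quote)  # ['"Do"', 'Separate']
print(no_escape)      # ['a\\', 'b']
print(single_char)    # ['a', '+', 'b']
print(repr(eof_token))  # ''
```
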
When operating in POSIX mode, :class:`shlex` will try to obey the following
parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of :attr:`escapedquotes`
  (e.g. ``"'"``) preserve the literal value of all characters within the quotes;

* Enclosing characters in quotes which are part of :attr:`escapedquotes` (e.g.
  ``'"'``) preserve the literal value of all characters within the quotes, with
  the exception of the characters mentioned in :attr:`escape`. The escape
  characters retain their special meaning only when followed by the quote in
  use, or the escape character itself. Otherwise the escape character will be
  considered a normal character;

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

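The POSIX-mode rules listed above can likewise be checked with a few small
inputs. This is a sketch using made-up strings; raw string literals are used so
that the backslashes reach :func:`split` unchanged.

```python
import shlex

# Quotes are stripped and do not separate words.
joined = shlex.split('"Do"Not"Separate"')

# An unquoted backslash preserves the next character literally.
escaped_space = shlex.split(r'a\ b')

# Single quotes are not in escapedquotes: the backslash stays literal.
single_quoted = shlex.split(r"'a\b'")

# Inside double quotes, backslash is special only before '"' or '\'.
escaped_quote = shlex.split(r'"a\"b"')
escaped_backslash = shlex.split(r'"a\\b"')
plain_backslash = shlex.split(r'"a\nb"')

# Quoted empty strings produce an empty token.
with_empty = shlex.split("a '' b")

print(joined)             # ['DoNotSeparate']
print(escaped_space)      # ['a b']
print(single_quoted)      # ['a\\b']
print(escaped_quote)      # ['a"b']
print(escaped_backslash)  # ['a\\b']
print(plain_backslash)    # ['a\\nb']
print(with_empty)         # ['a', '', 'b']
```
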