:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.

.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>

**Source code:** :source:`Lib/shlex.py`

--------------

The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell. This will often be useful
for writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

The :mod:`shlex` module defines the following functions:


.. function:: split(s, comments=False, posix=True)

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string). This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.

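   As a minimal sketch of the options described above, the following compares
   the default call with ``comments=True`` and with ``posix=False``. The
   command strings are made up for illustration:

   .. code-block:: python

      import shlex

      line = "cp 'my file.txt' /tmp  # copy it"

      # Default: comment parsing is disabled, so '#' is an ordinary character.
      default_tokens = shlex.split(line)

      # With comments=True, everything from '#' to the end of the line is dropped.
      no_comment_tokens = shlex.split(line, comments=True)

      # In non-POSIX mode the quote characters stay part of the token.
      non_posix_tokens = shlex.split('say "hello world"', posix=False)

      print(default_tokens)
      print(no_comment_tokens)
      print(non_posix_tokens)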

.. function:: join(split_command)

   Concatenate the tokens of the list *split_command* and return a string.
   This function is the inverse of :func:`split`.

      >>> from shlex import join
      >>> print(join(['echo', '-n', 'Multiple words']))
      echo -n 'Multiple words'

   The returned value is shell-escaped to protect against injection
   vulnerabilities (see :func:`quote`).

   .. versionadded:: 3.8

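   The round trip below is an illustrative check of the inverse relationship
   with :func:`split`. The argument list is invented for the example, and it
   assumes Python 3.8 or later, where :func:`join` is available:

   .. code-block:: python

      import shlex

      # Hypothetical arguments containing a space and an embedded single quote.
      args = ["grep", "-r", "it's here", "my dir/"]

      command = shlex.join(args)         # one shell-escaped string
      round_trip = shlex.split(command)  # back to the original list

      print(command)
      print(round_trip == args)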

.. function:: quote(s)

   Return a shell-escaped version of the string *s*. The returned value is a
   string that can safely be used as one token in a shell command line, for
   cases where you cannot use a list.

   This idiom would be unsafe:

      >>> filename = 'somefile; rm -rf ~'
      >>> command = 'ls -l {}'.format(filename)
      >>> print(command)  # executed by a shell: boom!
      ls -l somefile; rm -rf ~

   :func:`quote` lets you plug the security hole:

      >>> from shlex import quote
      >>> command = 'ls -l {}'.format(quote(filename))
      >>> print(command)
      ls -l 'somefile; rm -rf ~'
      >>> remote_command = 'ssh home {}'.format(quote(command))
      >>> print(remote_command)
      ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

   The quoting is compatible with UNIX shells and with :func:`split`:

      >>> from shlex import split
      >>> remote_command = split(remote_command)
      >>> remote_command
      ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
      >>> command = split(remote_command[-1])
      >>> command
      ['ls', '-l', 'somefile; rm -rf ~']

   .. versionadded:: 3.3

The :mod:`shlex` module defines the following class:


.. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object. The initialization argument, if present, specifies where to read
   characters from. It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string. If no argument is given, input will be taken from ``sys.stdin``.
   The second optional argument is a filename string, which sets the initial
   value of the :attr:`~shlex.infile` attribute. If the *instream*
   argument is omitted or equal to ``sys.stdin``, this second argument
   defaults to "stdin". The *posix* argument defines the operational mode:
   when *posix* is not true (default), the :class:`~shlex.shlex` instance will
   operate in compatibility mode. When operating in POSIX mode,
   :class:`~shlex.shlex` will try to be as close as possible to the POSIX shell
   parsing rules. The *punctuation_chars* argument provides a way to make the
   behaviour even closer to how real shells parse. This can take a number of
   values: the default value, ``False``, preserves the behaviour seen under
   Python 3.5 and earlier. If set to ``True``, then parsing of the characters
   ``();<>|&`` is changed: any run of these characters (considered punctuation
   characters) is returned as a single token. If set to a non-empty string of
   characters, those characters will be used as the punctuation characters. Any
   characters in the :attr:`wordchars` attribute that appear in
   *punctuation_chars* will be removed from :attr:`wordchars`. See
   :ref:`improved-shell-compatibility` for more information.

   .. versionchanged:: 3.6
      The *punctuation_chars* parameter was added.

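   The sketch below is one way to see the two operational modes side by side.
   It passes a string as *instream* and iterates over the lexer; the input
   text is an invented example:

   .. code-block:: python

      import shlex

      text = "name = 'Ada Lovelace' # trailing comment"

      # Compatibility (non-POSIX) mode is the default: quotes are kept.
      compat_tokens = list(shlex.shlex(text))

      # POSIX mode strips the quotes from the quoted token.
      posix_tokens = list(shlex.shlex(text, posix=True))

      print(compat_tokens)
      print(posix_tokens)
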
.. seealso::

   Module :mod:`configparser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token. If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack. Otherwise, read one from the input stream. If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.

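   A short illustration of how :meth:`get_token` and :meth:`push_token`
   interact, assuming POSIX mode and a made-up two-word input:

   .. code-block:: python

      import shlex

      lexer = shlex.shlex("alpha beta", posix=True)

      first = lexer.get_token()   # read from the input stream
      lexer.push_token(first)     # put it back on the token stack
      again = lexer.get_token()   # popped from the stack, not re-read
      second = lexer.get_token()  # the next token from the stream
      at_end = lexer.get_token()  # end of input: the eof value (None in POSIX mode)

      print(first, again, second, at_end)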

.. method:: shlex.read_token()

   Read a raw token. Ignore the pushback stack, and do not interpret source
   requests. (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below), this method is given the following token as argument, and is expected
   to return a tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument. If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone. Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles
   ``#include "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.


.. method:: shlex.push_source(newstream, newfile=None)

   Push an input source stream onto the input stack. If the filename argument is
   specified it will later be available for use in error messages. This is the
   same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.

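   The following sketch stacks an in-memory stream with :meth:`push_source`
   and lets the lexer pop it again at EOF. The stream contents and the
   ``"inner.txt"`` label are hypothetical:

   .. code-block:: python

      import io
      import shlex

      lexer = shlex.shlex("outer1 outer2", posix=True)
      tokens = [lexer.get_token()]  # 'outer1' comes from the original input

      # Stack a second stream; "inner.txt" is only a label for error messages.
      lexer.push_source(io.StringIO("inner1 inner2"), "inner.txt")
      name_while_stacked = lexer.infile

      # Iteration drains the stacked stream first; at its EOF the lexer pops
      # it and resumes reading the original input.
      tokens.extend(lexer)

      print(tokens)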

.. method:: shlex.error_leader(infile=None, lineno=None)

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

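   For illustration, the leader can be produced before any token is read, or
   with both values overridden. The file names ``settings.rc`` and
   ``other.rc`` are placeholders, not files the module knows about:

   .. code-block:: python

      import shlex

      lexer = shlex.shlex("key = value", infile="settings.rc")

      # Uses the lexer's own infile and lineno (nothing has been read yet).
      leader = lexer.error_leader()

      # Both values can be overridden explicitly.
      custom = lexer.error_leader("other.rc", 7)

      print(leader + "unexpected token")
      print(custom + "unexpected token")
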
Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens. By
   default, includes all ASCII alphanumerics and underscore. In POSIX mode, the
   accented characters in the Latin-1 set are also included. If
   :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
   appear in filename specifications and command line parameters, will also be
   included in this attribute, and any characters which appear in
   ``punctuation_chars`` will be removed from ``wordchars`` if they are present
   there. If :attr:`whitespace_split` is set to ``True``, this will have no
   effect.

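   As a rough sketch of tuning :attr:`commenters` and :attr:`wordchars` for an
   INI-like minilanguage (the input line is invented), one might write:

   .. code-block:: python

      import shlex

      line = "log-level = debug ; ini-style comment"

      # With the defaults, '-' and ';' are returned as single-character tokens.
      default_tokens = list(shlex.shlex(line))

      lexer = shlex.shlex(line)
      lexer.commenters = ";"  # treat ';' as the comment character instead of '#'
      lexer.wordchars += "-"  # let '-' accumulate into words
      tuned_tokens = list(lexer)

      print(default_tokens)
      print(tuned_tokens)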

.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped. Whitespace bounds
   tokens. By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be considered as escape. This is used only in POSIX
   mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes. The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell). By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`. This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments. When used in combination with
   :attr:`punctuation_chars`, tokens will be split on whitespace in addition to
   those characters.

   .. versionchanged:: 3.8
      The :attr:`punctuation_chars` attribute was made compatible with the
      :attr:`whitespace_split` attribute.

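   The effect is easiest to see on a command line that contains characters
   outside :attr:`wordchars`. This is a small sketch with a made-up command:

   .. code-block:: python

      import shlex

      line = "--color=auto ./run.sh"

      # Default: '-', '=', '.' and '/' are not word characters, so each one
      # is returned as a separate single-character token.
      default_tokens = list(shlex.shlex(line, posix=True))

      lexer = shlex.shlex(line, posix=True)
      lexer.whitespace_split = True  # split on whitespace only
      split_tokens = list(lexer)

      print(default_tokens)
      print(split_tokens)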

.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests. It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default. If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells. That is, the immediately following token
   will be opened as a filename and input will be taken from that stream until
   EOF, at which point the :meth:`~io.IOBase.close` method of that stream will be
   called and the input source will again become the original input stream. Source
   requests may be stacked any number of levels deep.

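   To try this without touching the file system, a subclass can override
   :meth:`sourcehook` to serve includes from memory. The ``IncludeLexer``
   class, the ``FILES`` mapping and the ``include`` keyword below are all
   hypothetical names chosen for this sketch:

   .. code-block:: python

      import io
      import shlex

      # Hypothetical in-memory "files", keyed by name.
      FILES = {"extra.rc": "x y"}

      class IncludeLexer(shlex.shlex):
          def sourcehook(self, newfile):
              # Return (filename, open stream), as the default hook does,
              # but read from FILES instead of calling open().
              name = newfile.strip('"')
              return name, io.StringIO(FILES[name])

      lexer = IncludeLexer('a include "extra.rc" b', posix=True)
      lexer.source = "include"  # the token that triggers a source request
      tokens = list(lexer)

      print(tokens)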

.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior. If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer. It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``) in non-POSIX mode, and to ``None`` in POSIX mode.


.. attribute:: shlex.punctuation_chars

   Characters that will be considered punctuation. Runs of punctuation
   characters will be returned as a single token. However, note that no
   semantic validity checking will be performed: for example, '>>>' could be
   returned as a token, even though it may not be recognised as such by shells.

   .. versionadded:: 3.6

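   A minimal demonstration of the lack of validity checking, using an invented
   command line in which ``>>>`` is not a real shell operator:

   .. code-block:: python

      import shlex

      lexer = shlex.shlex("echo hi >>> out.txt", posix=True,
                          punctuation_chars=True)

      # Passing True selects the default punctuation set.
      default_set = lexer.punctuation_chars

      # The run '>>>' comes back as one token even though shells reject it.
      tokens = list(lexer)

      print(default_set)
      print(tokens)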

.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token. If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

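Several of the non-POSIX rules above can be checked directly. The inputs in
this sketch are the examples from the list plus one invented backslash case:

.. code-block:: python

   import shlex

   # Quote characters inside a word do not split it.
   inside_word = list(shlex.shlex('Do"Not"Separate'))

   # A closing quote ends the token; the quotes themselves are kept.
   after_quote = list(shlex.shlex('"Do"Separate'))

   # The backslash is not an escape here; it comes back as its own token.
   backslash = list(shlex.shlex(r'a\b'))

   print(inside_word, after_quote, backslash)
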
When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`. The escape characters retain their
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character.

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

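The POSIX rules above can be exercised the same way. Apart from the first
input, which comes from the list, the strings in this sketch are invented:

.. code-block:: python

   import shlex

   def lex(text):
       """Tokenize *text* with a POSIX-mode lexer (helper for this sketch)."""
       return list(shlex.shlex(text, posix=True))

   stripped = lex('"Do"Not"Separate"')  # quotes removed, still one word
   escaped_space = lex(r'a\ b')         # backslash keeps the space literal
   single_quoted = lex(r"'a\b'")        # no escapes inside single quotes
   double_quoted = lex(r'"a\"b\c"')     # only \" and \\ are escapes here
   empty_string = lex("a '' b")         # a quoted empty string is a token

   print(stripped, escaped_space, single_quoted, double_quoted, empty_string)
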
.. _improved-shell-compatibility:

Improved Compatibility with Shells
----------------------------------

.. versionadded:: 3.6

The :class:`shlex` class provides compatibility with the parsing performed by
common Unix shells like ``bash``, ``dash``, and ``sh``. To take advantage of
this compatibility, specify the ``punctuation_chars`` argument in the
constructor. This defaults to ``False``, which preserves pre-3.6 behaviour.
However, if it is set to ``True``, then parsing of the characters ``();<>|&``
is changed: any run of these characters is returned as a single token. While
this is short of a full parser for shells (which would be out of scope for the
standard library, given the multiplicity of shells out there), it does allow
you to perform processing of command lines more easily than you could
otherwise. To illustrate, you can see the difference in the following snippet:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> import shlex
   >>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
   >>> s = shlex.shlex(text, posix=True)
   >>> s.whitespace_split = True
   >>> list(s)
   ['a', '&&', 'b;', 'c', '&&', 'd', '||', 'e;', 'f', '>abc;', '(def', 'ghi)']
   >>> s = shlex.shlex(text, posix=True, punctuation_chars=True)
   >>> s.whitespace_split = True
   >>> list(s)
   ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', 'abc', ';',
   '(', 'def', 'ghi', ')']

Of course, tokens will be returned which are not valid for shells, and you'll
need to implement your own error checks on the returned tokens.

Instead of passing ``True`` as the value for the *punctuation_chars* parameter,
you can pass a string with specific characters, which will be used to determine
which characters constitute punctuation. For example::

    >>> import shlex
    >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
    >>> list(s)
    ['a', '&', '&', 'b', '||', 'c']

.. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
   attribute is augmented with the characters ``~-./*?=``. That is because these
   characters can appear in file names (including wildcards) and command-line
   arguments (e.g. ``--color=auto``). Hence::

      >>> import shlex
      >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
      ...                 punctuation_chars=True)
      >>> list(s)
      ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']

   However, to match the shell as closely as possible, it is recommended to
   always use ``posix`` and :attr:`~shlex.whitespace_split` when using
   :attr:`~shlex.punctuation_chars`, which will negate
   :attr:`~shlex.wordchars` entirely.

For best effect, ``punctuation_chars`` should be set in conjunction with
``posix=True``. (Note that ``posix=False`` is the default for
:class:`~shlex.shlex`.)