:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.

.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>

**Source code:** :source:`Lib/shlex.py`

--------------

The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell. This will often be useful
for writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

The :mod:`shlex` module defines the following functions:


.. function:: split(s, comments=False, posix=True)

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string). This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.

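As a minimal sketch of how the *comments* and *posix* arguments of
:func:`split` interact, the following compares three calls on the same input.
The command line itself is an invented example.

```python
import shlex

line = 'cp "My Notes.txt" backup/  # nightly copy'

# Default: POSIX rules with comment parsing disabled, so '#' is an
# ordinary token and the quotes are stripped.
default_tokens = shlex.split(line)

# comments=True discards everything from '#' to the end of the line.
no_comment_tokens = shlex.split(line, comments=True)

# posix=False keeps the quote characters as part of the token.
compat_tokens = shlex.split(line, comments=True, posix=False)

print(default_tokens)     # ['cp', 'My Notes.txt', 'backup/', '#', 'nightly', 'copy']
print(no_comment_tokens)  # ['cp', 'My Notes.txt', 'backup/']
print(compat_tokens)      # ['cp', '"My Notes.txt"', 'backup/']
```
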

.. function:: join(split_command)

   Concatenate the tokens of the list *split_command* and return a string.
   This function is the inverse of :func:`split`.

      >>> from shlex import join
      >>> print(join(['echo', '-n', 'Multiple words']))
      echo -n 'Multiple words'

   The returned value is shell-escaped to protect against injection
   vulnerabilities (see :func:`quote`).

   .. versionadded:: 3.8

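To illustrate that :func:`join` is the inverse of :func:`split`, here is a
small round-trip sketch. The argument list is an invented example, chosen to
include a space and an apostrophe.

```python
import shlex

args = ['grep', '-e', "it's here", 'my file.txt']

command = shlex.join(args)            # one shell-escaped string
round_tripped = shlex.split(command)  # back to the original list

print(command)
print(round_tripped == args)          # True
```
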

.. function:: quote(s)

   Return a shell-escaped version of the string *s*. The returned value is a
   string that can safely be used as one token in a shell command line, for
   cases where you cannot use a list.

   This idiom would be unsafe:

      >>> filename = 'somefile; rm -rf ~'
      >>> command = 'ls -l {}'.format(filename)
      >>> print(command)  # executed by a shell: boom!
      ls -l somefile; rm -rf ~

   :func:`quote` lets you plug the security hole:

      >>> from shlex import quote
      >>> command = 'ls -l {}'.format(quote(filename))
      >>> print(command)
      ls -l 'somefile; rm -rf ~'
      >>> remote_command = 'ssh home {}'.format(quote(command))
      >>> print(remote_command)
      ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

   The quoting is compatible with UNIX shells and with :func:`split`:

      >>> from shlex import split
      >>> remote_command = split(remote_command)
      >>> remote_command
      ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
      >>> command = split(remote_command[-1])
      >>> command
      ['ls', '-l', 'somefile; rm -rf ~']

   .. versionadded:: 3.3

The :mod:`shlex` module defines the following class:


.. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object. The initialization argument, if present, specifies where to read
   characters from. It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string. If no argument is given, input will be taken from ``sys.stdin``.
   The second optional argument is a filename string, which sets the initial
   value of the :attr:`~shlex.infile` attribute. If the *instream*
   argument is omitted or equal to ``sys.stdin``, this second argument
   defaults to "stdin". The *posix* argument defines the operational mode:
   when *posix* is not true (default), the :class:`~shlex.shlex` instance will
   operate in compatibility mode. When operating in POSIX mode,
   :class:`~shlex.shlex` will try to be as close as possible to the POSIX shell
   parsing rules. The *punctuation_chars* argument provides a way to make the
   behaviour even closer to how real shells parse. This can take a number of
   values: the default value, ``False``, preserves the behaviour seen under
   Python 3.5 and earlier. If set to ``True``, then parsing of the characters
   ``();<>|&`` is changed: any run of these characters (considered punctuation
   characters) is returned as a single token. If set to a non-empty string of
   characters, those characters will be used as the punctuation characters. Any
   characters in the :attr:`wordchars` attribute that appear in
   *punctuation_chars* will be removed from :attr:`wordchars`. See
   :ref:`improved-shell-compatibility` for more information.

   .. versionchanged:: 3.6
      The *punctuation_chars* parameter was added.

.. seealso::

   Module :mod:`configparser`
      Parser for configuration files similar to the Windows :file:`.ini` files.

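As a rough sketch of constructing a :class:`~shlex.shlex` instance directly,
the following tokenizes a string in POSIX mode. The input text and the file
name ``app.rc`` are invented; the name only labels the input and no file is
opened.

```python
import shlex

# "app.rc" is a hypothetical name used to label the input for error messages.
lexer = shlex.shlex("retries = 3  # attempts", infile="app.rc", posix=True)

# Iterating over the lexer calls get_token() until end of input.
tokens = list(lexer)

print(tokens)        # ['retries', '=', '3']
print(lexer.infile)  # app.rc
```

The ``=`` sign comes back as its own token because it is neither a word
character nor whitespace, and the trailing comment is dropped because ``#`` is
in the default :attr:`~shlex.commenters`.
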

.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token. If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack. Otherwise, read one from the input stream. If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.

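The interplay of :meth:`get_token` and :meth:`push_token` can be sketched as
a one-token lookahead. The input strings here are invented examples.

```python
import shlex

lexer = shlex.shlex("alpha beta", posix=True)

first = lexer.get_token()     # read from the input stream
lexer.push_token(first)       # put it back on the token stack
again = lexer.get_token()     # popped from the stack, not re-read
second = lexer.get_token()
at_end = lexer.get_token()    # lexer.eof, which is None in POSIX mode

# In non-POSIX mode (the class default) end of input is the empty string.
compat_end = shlex.shlex("").get_token()

print(first, again, second, at_end, repr(compat_end))
```
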

.. method:: shlex.read_token()

   Read a raw token. Ignore the pushback stack, and do not interpret source
   requests. (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below) this method is given the following token as argument, and expected
   to return a tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument. If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone. Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.

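One way to use the :meth:`sourcehook` extension point is to override it in a
subclass. The sketch below resolves inclusion requests against an in-memory
dictionary instead of the file system. The class name ``MemoryLexer``, the
``INCLUDES`` mapping and the ``source`` keyword are all assumptions made for
this illustration.

```python
import io
import shlex

# Hypothetical in-memory "files", standing in for a real search path.
INCLUDES = {"extra.rc": "beta gamma"}


class MemoryLexer(shlex.shlex):
    def sourcehook(self, newfile):
        # In non-POSIX mode the token still carries its quotes.
        name = newfile.strip('"')
        # Return (filename, open file-like object), as the hook requires.
        return name, io.StringIO(INCLUDES[name])


lexer = MemoryLexer('alpha source "extra.rc" delta')
lexer.source = "source"   # enable inclusion requests

tokens = list(lexer)
print(tokens)             # ['alpha', 'beta', 'gamma', 'delta']
```
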

.. method:: shlex.push_source(newstream, newfile=None)

   Push an input source stream onto the input stack. If the filename argument is
   specified it will later be available for use in error messages. This is the
   same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.

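As a minimal sketch of explicit source stacking, the following pushes a second
stream part-way through the first one. The lexer drains the pushed stream and
then resumes the original input. The strings and the ``<memory>`` label are
invented for the example.

```python
import io
import shlex

lexer = shlex.shlex("one two", posix=True)
first = lexer.get_token()

# Stack a new source; "<memory>" becomes lexer.infile while it is active.
lexer.push_source(io.StringIO("inserted words"), "<memory>")
name_while_stacked = lexer.infile

# At EOF on the stacked stream the lexer pops it and resumes "one two".
rest = list(lexer)
name_after = lexer.infile   # restored to the original value (None here)

print(first, rest)          # one ['inserted', 'words', 'two']
```
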

.. method:: shlex.error_leader(infile=None, lineno=None)

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

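A short sketch of :meth:`error_leader` in use: the lexer reports the position
of an offending token. The input, the file name ``demo.rc`` and the choice of
``oops`` as the bad token are assumptions for illustration.

```python
import shlex

lexer = shlex.shlex("first\nsecond oops", infile="demo.rc", posix=True)

for token in lexer:
    if token == "oops":
        # Uses the lexer's current infile and lineno by default.
        message = lexer.error_leader() + "unexpected token %r" % token
        break

# Both values can be overridden explicitly.
custom = lexer.error_leader("other.rc", 7)

print(message)   # "demo.rc", line 2: unexpected token 'oops'
print(custom)    # "other.rc", line 7: 
```
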
Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens. By
   default, includes all ASCII alphanumerics and underscore. In POSIX mode, the
   accented characters in the Latin-1 set are also included. If
   :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
   appear in filename specifications and command line parameters, will also be
   included in this attribute, and any characters which appear in
   ``punctuation_chars`` will be removed from ``wordchars`` if they are present
   there.

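The effect of :attr:`~shlex.commenters` and :attr:`~shlex.wordchars` can be
sketched by tokenizing the same text twice, once with defaults and once after
adjusting both attributes. The configuration line is an invented example.

```python
import shlex

text = "log-level = 2 ; set verbosity"

# Defaults: '-' and ';' are neither word characters nor comment starters,
# so each is returned as a single-character token.
default_tokens = list(shlex.shlex(text))

lexer = shlex.shlex(text)
lexer.wordchars += "-"   # let '-' accumulate into words
lexer.commenters = ";"   # ';' starts a comment instead of '#'
custom_tokens = list(lexer)

print(default_tokens)  # ['log', '-', 'level', '=', '2', ';', 'set', 'verbosity']
print(custom_tokens)   # ['log-level', '=', '2']
```
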

.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped. Whitespace bounds
   tokens. By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be considered as escape. This will be only used in POSIX
   mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes. The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell.) By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`. This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments. If this attribute is ``True``,
   :attr:`punctuation_chars` will have no effect, and splitting will happen
   only on whitespace. When using :attr:`punctuation_chars`, which is
   intended to provide parsing closer to that implemented by shells, it is
   advisable to leave ``whitespace_split`` as ``False`` (the default value).

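A brief sketch of what :attr:`~shlex.whitespace_split` changes: with the
default setting, every character that is not a word character becomes its own
token, while setting the attribute to ``True`` yields shell-style arguments.
The command line is an invented example.

```python
import shlex

text = "--mode=fast ./run.sh"

# Default (whitespace_split is False): punctuation splits tokens.
default_tokens = list(shlex.shlex(text, posix=True))

# Split on whitespace only, as shlex.split() does internally.
lexer = shlex.shlex(text, posix=True)
lexer.whitespace_split = True
argument_tokens = list(lexer)

print(default_tokens)
# ['-', '-', 'mode', '=', 'fast', '.', '/', 'run', '.', 'sh']
print(argument_tokens)
# ['--mode=fast', './run.sh']
```
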

.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests. It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default. If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells. That is, the immediately following token
   will be opened as a filename and input will be taken from that stream until
   EOF, at which point the :meth:`~io.IOBase.close` method of that stream will be
   called and the input source will again become the original input stream. Source
   requests may be stacked any number of levels deep.

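The following sketch exercises the :attr:`~shlex.source` attribute with the
default :meth:`sourcehook`, which opens the named file. The ``include``
keyword and the temporary file ``extra.rc`` are assumptions made so the
example is self-contained.

```python
import os
import shlex
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical include file created just for this demonstration.
    path = os.path.join(tmp, "extra.rc")
    with open(path, "w") as f:
        f.write("beta gamma\n")

    # The token after the keyword is treated as the file name to read.
    lexer = shlex.shlex('alpha include "%s" delta' % path)
    lexer.source = "include"

    # The included stream is closed by the lexer when it reaches EOF.
    tokens = list(lexer)

print(tokens)   # ['alpha', 'beta', 'gamma', 'delta']
```
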

.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior. If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer. It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``), in non-POSIX mode, and to ``None`` in POSIX mode.


.. attribute:: shlex.punctuation_chars

   Characters that will be considered punctuation. Runs of punctuation
   characters will be returned as a single token. However, note that no
   semantic validity checking will be performed: for example, '>>>' could be
   returned as a token, even though it may not be recognised as such by shells.

   .. versionadded:: 3.6


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token. If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

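The non-POSIX rules above can be checked with a few short inputs. This is
only an illustrative sketch; the helper name ``tokens`` is invented here.

```python
import shlex


def tokens(text):
    # posix=False is the default for the shlex class.
    return list(shlex.shlex(text))


inner_quotes = tokens('Do"Not"Separate')   # quotes inside a word are kept
closing_quote = tokens('"Do"Separate')     # a closing quote ends the token
backslash = tokens(r'a\ b')                # backslash is an ordinary character
empty_quoted = tokens("a '' b")            # the quotes themselves are returned

print(inner_quotes)    # ['Do"Not"Separate']
print(closing_quote)   # ['"Do"', 'Separate']
print(backslash)       # ['a', '\\', 'b']
print(empty_quoted)    # ['a', "''", 'b']
```
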
When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`. The escape characters retain their
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character.

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

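The POSIX-mode rules above can be illustrated in the same way. Again this is
a sketch, and the helper name ``tokens`` is invented for the example.

```python
import shlex


def tokens(text):
    return list(shlex.shlex(text, posix=True))


quotes_stripped = tokens('"Do"Not"Separate"')  # quotes removed, one word
escaped_space = tokens(r'a\ b')                # backslash protects the space
single_quoted = tokens(r"'a\"b'")              # no escapes inside '...'
double_quoted = tokens(r'"a\"b\nc"')           # only \" and \\ are special
empty_quoted = tokens("a '' b")                # empty string is a real token

print(quotes_stripped)  # ['DoNotSeparate']
print(escaped_space)    # ['a b']
print(single_quoted)    # ['a\\"b']
print(double_quoted)    # ['a"b\\nc']
print(empty_quoted)     # ['a', '', 'b']
```
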
.. _improved-shell-compatibility:

Improved Compatibility with Shells
----------------------------------

.. versionadded:: 3.6

The :class:`shlex` class provides compatibility with the parsing performed by
common Unix shells like ``bash``, ``dash``, and ``sh``. To take advantage of
this compatibility, specify the ``punctuation_chars`` argument in the
constructor. This defaults to ``False``, which preserves pre-3.6 behaviour.
However, if it is set to ``True``, then parsing of the characters ``();<>|&``
is changed: any run of these characters is returned as a single token. While
this is short of a full parser for shells (which would be out of scope for the
standard library, given the multiplicity of shells out there), it does allow
you to perform processing of command lines more easily than you could
otherwise. To illustrate, you can see the difference in the following snippet:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

    >>> import shlex
    >>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
    >>> list(shlex.shlex(text))
    ['a', '&', '&', 'b', ';', 'c', '&', '&', 'd', '|', '|', 'e', ';', 'f', '>',
    "'abc'", ';', '(', 'def', '"ghi"', ')']
    >>> list(shlex.shlex(text, punctuation_chars=True))
    ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', "'abc'",
    ';', '(', 'def', '"ghi"', ')']

Of course, tokens will be returned which are not valid for shells, and you'll
need to implement your own error checks on the returned tokens.

Instead of passing ``True`` as the value for the ``punctuation_chars`` parameter,
you can pass a string with specific characters, which will be used to determine
which characters constitute punctuation. For example::

    >>> import shlex
    >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
    >>> list(s)
    ['a', '&', '&', 'b', '||', 'c']

.. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
   attribute is augmented with the characters ``~-./*?=``. That is because these
   characters can appear in file names (including wildcards) and command-line
   arguments (e.g. ``--color=auto``). Hence::

      >>> import shlex
      >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
      ...                 punctuation_chars=True)
      >>> list(s)
      ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']

For best effect, ``punctuation_chars`` should be set in conjunction with
``posix=True``. (Note that ``posix=False`` is the default for
:class:`~shlex.shlex`.)
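
To show why combining ``punctuation_chars`` with ``posix=True`` is
recommended, this sketch tokenizes one invented command line in both modes.
Only the handling of the quoted argument differs.

```python
import shlex

text = 'echo "hello world" >> out.txt && ls'

# POSIX mode: quotes are stripped, operators come back as single tokens.
posix_tokens = list(shlex.shlex(text, posix=True, punctuation_chars=True))

# Non-POSIX mode (the class default): the quotes stay in the token.
compat_tokens = list(shlex.shlex(text, punctuation_chars=True))

print(posix_tokens)
# ['echo', 'hello world', '>>', 'out.txt', '&&', 'ls']
print(compat_tokens)
# ['echo', '"hello world"', '>>', 'out.txt', '&&', 'ls']
```
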