Lokasi ngalangkungan proxy:   [ UP ]  
[Ngawartoskeun bug]   [Panyetelan cookie]                
Skip to content

Git.execute doc is inaccurate on Windows and recommends shlex too broadly #2146

@EliahKagan

Description

@EliahKagan

The behavior of Git.execute when a single string is passed has long been unclear from the documentation, and remains unclear after #2144. I am inclined to consider that PR to have been a move in the right direction--but it also introduced some significant inaccuracies, with possible security implications.

This is a documentation bug only. It can be seen as a sequel to #2016, reflecting the situation as it stands since #2144 (which closed #2016). I hope to fix this soon by building on #2144 to combine the advantages of the wording both before and after that change while at least mostly addressing the disadvantages of both. But I may not be able to do so immediately--and also maybe someone else has a better idea of how to do it than I do--so I'm opening this to track it.

The documentation bug

Since #2144, the Git.execute docstring claims:

A string is accepted, but with shell set to False it is passed as a single executable name to subprocess.Popen. For example, "git log -n 1" looks for an executable literally named git log -n 1 and will fail with GitCommandNotFound.

While that is accurate on Unix-like systems, it is inaccurate on Windows. For example:

(GitPython) E:\repos\GitPython [master]> python
Python 3.12.13 (main, May  4 2026, 21:19:41) [MSC v.1944 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import git
>>> g = git.Git()
>>> g.execute(['git', 'version'])
'git version 2.54.0.windows.1'
>>> g.execute('git version')
'git version 2.54.0.windows.1'
>>> g.execute(['git', 'version'], shell=False)
'git version 2.54.0.windows.1'
>>> g.execute('git version', shell=False)
'git version 2.54.0.windows.1'

shell=False is the default and thus makes no difference there, but I included it to make extra clear that no shell is involved in parsing the command. And to use the example from the documentation:

>>> g.execute('git log -n 1')
'commit 7b83f7a99af4313264650339e08601834b3ca968\nMerge: c7648c0c d172e541\nAuthor: Sebastian Thiel <sebastian.thiel@icloud.com>\nDate:   Thu May 7 18:50:11 2026 +0800\n\n    Merge pull request #2144 from mvanhorn/osc/2016-clarify-execute-string-arg\n    \n    docs(cmd): clarify Git.execute() string vs list command argument'
>>> g.execute('git log -n 1', shell=False)
'commit 7b83f7a99af4313264650339e08601834b3ca968\nMerge: c7648c0c d172e541\nAuthor: Sebastian Thiel <sebastian.thiel@icloud.com>\nDate:   Thu May 7 18:50:11 2026 +0800\n\n    Merge pull request #2144 from mvanhorn/osc/2016-clarify-execute-string-arg\n    \n    docs(cmd): clarify Git.execute() string vs list command argument'

What's going on

On Unix-like systems, a command's arguments are an array of separate strings. If a command is written as a string, a shell--or something else implementing shell-like syntax--has to parse that string into separate arguments, and they are passed separately. The program actually receives a C-style argv array. This is generally intuitive to programmers, and it is Python quirks that are unintuitive: that subprocess.Popen, and thus Git.execute, convert a single string into a singleton sequence is confusing; I think this motivated #2144, and the clarity problem that it has made progress toward addressing is indeed real.

The opposite happens on Windows, where programs do not really receive separate arguments, but are instead handed a command-line string, which the program is then itself responsible for parsing. Different programs can, and sometimes do, even choose different ways to parse them! Fortunately, most Windows programs use the same rules to do so, and the operating system provides a facility to do this easily (which C programs that use 2- or 3-argument signatures for main call automatically before main runs). Unfortunately, these quoting rules are different from those used in any other operating system or any shell, and they are somewhat confusing. The command-line string furthermore is also parsed before the program runs, but only to determine what part of it identifies the program that is to be run (unless that is specified in another way, which is uncommon).

So on Windows, passing separate arguments actually requires them to be combined using Windows-specific whitespace and quoting/escaping syntax before the operating system is asked to run the program. Git.execute delegates to subprocess.Popen, which takes care of combining arguments into a command-line string. Currently it does so by calling its list2cmdline helper, but that is an implementation detail on which we are not supposed to rely. Having obtained a single string, Popen then uses CreateProcessW in the Windows API to run the program. So long as the program being run recognizes the same syntax list2cmdline uses when combining arguments--which is the syntax Microsoft suggests, though some programs use custom parsing rules--running a program using a sequence of separate argument strings works.

The linked issue

#2144 closed #2016, which was an issue mainly about this confusion. But the confusion there, as least as I understand it based on #2016 (comment), was that the Windows behavior was expected on macOS.

The documentation is now clearer about what happens on Unix-like systems. But I think this is not going to help people who think they can pass multiple arguments as a string (without a shell) because they can on the system they are using, and whose actual problem is that they need to make their code portable.

See also #2016 (comment) on the Windows-specific behavior of how Popen, and thus Git.execute, treat a single string.

We should be careful about shlex.split

Another change in #2144 is the recommendation to use shlex.split:

To split a command string into argv tokens, pass shlex.split(...) as a sequence or set shell to True (see the warning below).

shlex.split is indeed much safer than using a shell (#1896). It nonetheless often does not do the right thing, and it is not always safe. We should not recommend it as a default practice. At the same time, it may be right to mention it, as a form of harm reduction--it is usually a far better way to turn a string into separate arguments than passing them through a shell.

shlex.split is safe when the string is written in the syntax of a POSIX shell and fully trusted. If developers are able to write an argument sequence like ['foo', 'bar baz', 'quux'] and prefer to write it as foo 'bar baz' quux and pass that to shlex.split, that's fine.

Otherwise, it's not fine.

Sometimes someone may have a command that is quoted according to the syntax expected for a single command-line string on Windows, in which case this will sometimes be misparsed by shlex.split because shlex.split implements POSIX-shell-style syntax.

The bigger issue is security. If the input string to shlex.split contains attacker-controlled parts, they can still inject whitespace and quoting characters. It may feel more covenient and readable to write something like f"git diff {ref1} {ref2} -- {path}" or f"git diff '{ref1}' '{ref2}' -- '{path}'", neither of which is safe. Especially with the current blanket recommendation to use shlex.split, I think developers may be misled into thinking those forms (and others) are okay provided that they have verified that untrusted arguments don't begin with the well-known -. If ref1 or ref2 is something like v3.1.49 --output=target for the first case, or v3.1.49' --output=target' for the second case, then the arbitrary path target has been written.

Of course, programs should use much greater sanitation, at least whenever using GitPython facilities that do not explicitly document that a particular form of sanitization is performed (and sometimes maybe even then, for purposes of defense in depth). Even when passing a list, one should sanitize untrusted input carefully. But once one has done this, one is thinking in terms of specific arguments--in which case, passing an explicitly constructed list is then easier to write and to read than using shlex.split, as well as being easier to reason about.

I think, though I am not sure, that the intention behind the current shlex.split recommendation is that it only be followed for hard-coded commands that were never built by interpolating anything. I don't think that would be clear to anyone who does not already know about shlex.split.

The fix

Obviously it would be bad to write all that, or even a tenth of it, in the docstring. My first few attempts to write something that is acceptably concise, clear, and accurate have not been successful. I have also tried directing Claude Code to write something, and I have not been happy with that either; if that approach does work, then I'll of course follow the recently merged AI disclosure policy. (To clarify, no part of this issue is AI-generated.)

I hope to figure out a good wording in the next couple of days, unless someone else gets there first. For now, I'm opening this issue. (The process of writing it has hopefully gotten me closer to figuring out the fix.)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions