
Fix DecompressMemMapReader.read returning b'' before EOF (#120)#135

Merged
Byron merged 2 commits into gitpython-developers:master from mvanhorn:fix/120-empty-read-before-eof
May 9, 2026

Conversation

@mvanhorn (Contributor) commented May 8, 2026

Summary

Closes #120

DecompressMemMapReader.read(N) could return b'' before _br == _s when N was small enough that the underlying zlib.decompress(indata, N) consumed input bytes without producing output (e.g. while ingesting the zlib header / dictionary frames). Callers using the standard while chunk := stream.read(N) idiom terminated at the first empty chunk -- before reaching the actual end of the uncompressed object. Reproduced with chunk sizes 1, 4, 16 against a 13 KB stream of compressible data.
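The underlying zlib behavior can be seen without gitdb at all; a minimal sketch using only a plain stdlib `zlib.decompressobj` (not the reader's internals):

```python
import zlib

# Minimal demonstration (plain zlib, not gitdb internals): a
# decompressor can consume input bytes while producing no output,
# e.g. while it is still ingesting the 2-byte zlib stream header.
data = b"hello world! " * 1000
zdata = zlib.compress(data)

d = zlib.decompressobj()
out = d.decompress(zdata[:2])  # feed only the stream header
assert out == b""              # input was consumed, but no output yet
assert not d.eof               # and the stream is nowhere near EOF
```

A read() that treats that b'' as end-of-stream stops before a single payload byte is delivered.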

Why the original guard existed

The previous if dcompdat and ... guard at the recursion site was added so that compressed_bytes_read() could drive read() in a scrub loop that intentionally resets _br to 0 while the inner zlib object is already past EOF. Without that guard, the scrub loop would recurse forever, because _br < _s stays true.

Fix

Convert the recursive "refill to size" branch into an iterative loop around the existing decompress block. The loop terminates when:

  1. len(dcompdat) >= size (caller's request satisfied), OR
  2. _br >= _s (uncompressed object fully read), OR
  3. Inner zlib produced an empty chunk AND (zlib has hit EOF, OR the input window has run off self._m, OR no compressed bytes were consumed on this turn).

Condition 3 preserves the compressed_bytes_read() scrub safety: once the inner zip is at EOF, the loop breaks instead of looping forever. It also bounds the truncated-stream case (zip.eof never becomes true) by the input-exhaustion guard, and bounds the crafted "many empty deflate blocks" attack -- previously this could blow Python's recursion depth at ~1500+ stored-block headers because each recursion only consumed a handful of input bytes; the iterative form walks the stream forward without growing the stack.
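The loop shape can be sketched independently of gitdb's internals. The helper name chunked_read and the fixed input window are invented for illustration; condition 2 (the known uncompressed size _s) is omitted because plain zlib does not track it, but it bounds the loop in the reader the same way:

```python
import zlib

def chunked_read(zdata, size, window=8):
    """Hypothetical sketch of the iterative refill loop (not gitdb's
    actual code): decompress in small input windows and stop only on
    the termination conditions listed above."""
    d = zlib.decompressobj()
    pos, out = 0, b""
    while len(out) < size:                   # 1. request not yet satisfied
        indata = d.unconsumed_tail + zdata[pos:pos + window]
        pos += window
        chunk = d.decompress(indata, size - len(out))
        out += chunk
        # 3. empty chunk AND (zlib at EOF OR the input is exhausted):
        if not chunk and (d.eof or pos >= len(zdata)):
            break
    return out
```

Empty chunks mid-stream simply drive another iteration (pos keeps advancing), so the stack never grows, unlike the recursive form.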

Tests

  • New: test_decompress_reader_chunked_read_does_not_terminate_early reads a 13 KB highly-compressible stream with chunk sizes 1, 4, 16, 64 and asserts the full payload is returned and _br == _s.
  • Existing: gitdb/test/ (24 tests) all pass, including test_decompress_reader_special_case and test_pack -- both exercise the compressed_bytes_read scrub path.
$ pytest gitdb/test
======================== 24 passed, 1 skipped in 7.10s ========================

Manual smoke

import zlib
from gitdb import DecompressMemMapReader

data = b"hello world! " * 1000
zdata = zlib.compress(data)
for chunk_size in (1, 4, 16, 64, 4096):
    r = DecompressMemMapReader(zdata, close_on_deletion=False, size=len(data))
    out = b""
    while chunk := r.read(chunk_size):
        out += chunk
    assert out == data, chunk_size

(Pre-fix this fails for chunk_size in (1, 4, 16).)

Truncated input is handled the same as before: read() returns whatever was decoded so far, then b'', instead of looping or raising. The crafted recursion-depth attack (~5000 empty deflate blocks ahead of one valid block) now decodes correctly instead of raising RecursionError.
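The truncated-stream outcome maps onto plain zlib as well; a sketch (stdlib zlib only, with the trailer cut off to simulate truncation):

```python
import zlib

# Sketch (plain zlib, not gitdb internals): a truncated stream yields
# whatever was decodable so far, then empty reads; `eof` never fires.
data = b"hello world! " * 1000
zdata = zlib.compress(data)

d = zlib.decompressobj()
decoded = d.decompress(zdata[:-4])   # cut off the 4-byte adler32 trailer
assert decoded == data               # all payload bytes were decodable
assert d.decompress(b"") == b""      # further reads just return b''
assert not d.eof                     # EOF is never reached on truncation
```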

Closes gitpython-developers#120

DecompressMemMapReader.read(N) could return b'' mid-stream when zlib
consumed input without producing output on a single decompress call
(small N, header / dictionary frames in flight). The original
`if dcompdat and ...` guard at the recursion site skipped the
"refill to size" recursion in that case, so a caller using the
standard idiom

    while chunk := stream.read(4096):
        yield chunk

terminated at the first empty chunk -- before _br == _s.

The guard exists for compressed_bytes_read(), which manipulates
_br=0 and then drains the inner zip past its EOF. Recursing there
would loop forever because the inner zip is already done.

The fix uses zlib's own `eof` attribute (available on standard
zlib.Decompress objects since Python 3.6) to distinguish:

  - dcompdat empty AND zip not at EOF -> still digesting, recurse
  - dcompdat empty AND zip at EOF      -> compressed_bytes_read
                                         scrub or genuine EOF; do
                                         not recurse.

`getattr(_zip, 'eof', False)` keeps the conservative behavior
when running against a custom zlib object that does not expose
the attribute.
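The fallback described here is the plain getattr pattern; a sketch with a hypothetical stand-in for a custom decompressor:

```python
import zlib

class CustomZip:
    """Hypothetical custom decompressor that lacks the `eof` attribute."""
    def decompress(self, data, max_length=0):
        return b""

# A real zlib.Decompress object exposes `eof` (False until stream end);
# the getattr fallback reports False for objects that lack the attribute
# entirely, keeping the conservative "not at EOF" behavior either way.
assert getattr(zlib.decompressobj(), "eof", False) is False
assert getattr(CustomZip(), "eof", False) is False
```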

Adds a regression test that reads with chunk_size in
{1, 4, 16, 64} from a 13 KB highly-compressible stream. With the
old guard, the chunk_size <= 16 cases stopped at byte 0; the new
test asserts they read all 13000 bytes.

The full existing test suite (24 tests) still passes, including
test_decompress_reader_special_case and test_pack which exercise
the compressed_bytes_read scrub path that the original guard
existed to protect.
@Byron (Member) commented May 9, 2026

Thanks a lot.

This is interesting because nobody should be using this pure Python object database. But I guess for those who are, having this fixed is valuable.

Let's see if Copilot finds anything, but if not I think this should be merged, even though I only did an armchair review and admittedly don't understand a thing, trusting the test-suite fully here.
The test of course makes sense, and I agree that it should be working.

Copilot AI (Contributor) left a comment


Pull request overview

Fixes DecompressMemMapReader.read(N) returning b'' before true EOF for small N, which could prematurely terminate standard chunked read loops (issue #120). The change replaces a recursive refill strategy with an iterative decompression loop that continues as long as zlib is making forward progress, while still preserving the compressed_bytes_read() “scrub” behavior and avoiding recursion-depth failures on pathological inputs.

Changes:

  • Refactor DecompressMemMapReader.read() from recursion to an iterative loop that tolerates zlib consuming input without producing output.
  • Add a regression test ensuring chunked reads (1/4/16/64 bytes) return the full payload and reach _br == _s.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
gitdb/stream.py Reworks decompression refill logic to avoid empty reads before EOF and removes recursion to prevent stack overflows on adversarial streams.
gitdb/test/test_stream.py Adds a regression test covering chunked reads that previously could terminate early.


Comment thread gitdb/stream.py Outdated
Comment thread gitdb/stream.py Outdated
Comment thread gitdb/test/test_stream.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@Byron Byron merged commit 0a019a2 into gitpython-developers:master May 9, 2026
8 checks passed

Development

Successfully merging this pull request may close these issues.

Empty read from gitdb.OStream.read() before EOF

3 participants