Fix DecompressMemMapReader.read returning b'' before EOF (#120) #135
Conversation
Closes gitpython-developers#120

`DecompressMemMapReader.read(N)` could return `b''` mid-stream when zlib consumed input without producing output on a single decompress call (small `N`, header / dictionary frames in flight). The original `if dcompdat and ...` guard at the recursion site skipped the "refill to size" recursion in that case, so a caller using the standard idiom `while chunk := stream.read(4096): yield chunk` terminated at the first empty chunk -- before `_br == _s`.

The guard exists for `compressed_bytes_read()`, which manipulates `_br = 0` and then drains the inner zip past its EOF. Recursing there would loop forever because the inner zip is already done.

The fix uses zlib's own `eof` attribute (available on standard `zlib.Decompress` objects since Python 3.6) to distinguish:

- `dcompdat` empty AND zip not at EOF -> still digesting, recurse
- `dcompdat` empty AND zip at EOF -> `compressed_bytes_read` scrub or genuine EOF; do not recurse

`getattr(_zip, 'eof', False)` keeps the conservative behavior when running against a custom zlib object that does not expose the attribute.

Adds a regression test that reads with `chunk_size in {1, 4, 16, 64}` from a 13 KB highly-compressible stream. With the old guard, the `chunk_size <= 16` cases stopped at byte 0; the new test asserts they read all 13000 bytes. The full existing test suite (24 tests) still passes, including `test_decompress_reader_special_case` and `test_pack`, which exercise the `compressed_bytes_read` scrub path that the original guard existed to protect.
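For illustration, the underlying zlib behavior can be reproduced directly with the standard library, independent of gitdb (a standalone sketch):

```python
import zlib

payload = b"x" * 13000
comp = zlib.compress(payload)

d = zlib.decompressobj()
# Feed only the first two bytes (the zlib header): the input is
# consumed, but no uncompressed output can be produced yet.
chunk = d.decompress(comp[:2], 64)

print(chunk)   # empty bytes -- yet the stream is not finished
print(d.eof)   # False -- `eof` is how the fix tells these cases apart
```

An empty return with `eof == False` means "keep feeding input", which is exactly the case the old guard conflated with true EOF.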
Thanks a lot. This is interesting because nobody should be using this pure Python object database. But I guess for those who are, having this fixed is valuable. Let's see if Copilot finds anything, but if not I think this should be merged, even though I only did an armchair review and admittedly don't understand a thing, trusting the test-suite fully here.
Pull request overview
Fixes DecompressMemMapReader.read(N) returning b'' before true EOF for small N, which could prematurely terminate standard chunked read loops (issue #120). The change replaces a recursive refill strategy with an iterative decompression loop that continues as long as zlib is making forward progress, while still preserving the compressed_bytes_read() “scrub” behavior and avoiding recursion-depth failures on pathological inputs.
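A minimal model of that iterative shape (a hypothetical, simplified sketch; names like `_m`, `_br`, `_s`, `_zip` mirror gitdb's internals, but this is not the actual diff):

```python
import zlib

class ChunkedDecompressor:
    """Illustrative model of the iterative refill loop; not gitdb's code."""

    def __init__(self, compressed: bytes, uncompressed_size: int):
        self._m = compressed                # compressed input ("memory map")
        self._cws = 0                       # read cursor into self._m
        self._zip = zlib.decompressobj()
        self._br = 0                        # uncompressed bytes returned so far
        self._s = uncompressed_size         # total uncompressed size

    def read(self, size: int) -> bytes:
        dcompdat = b""
        while True:
            # Refill: retry the unconsumed tail first, then fresh input.
            indata = self._zip.unconsumed_tail + self._m[self._cws:self._cws + 128]
            self._cws = min(self._cws + 128, len(self._m))
            dcompdat += self._zip.decompress(indata, size - len(dcompdat))
            if len(dcompdat) >= size:                 # request satisfied
                break
            if self._br + len(dcompdat) >= self._s:   # object fully read
                break
            if getattr(self._zip, "eof", False):      # inner zip done: stop
                break
            if not indata:                            # input exhausted (truncated)
                break
        self._br += len(dcompdat)
        return dcompdat
```

With a valid complete stream, `read(n)` only returns `b''` once the whole object has been produced; the loop keeps walking the input forward without growing the stack.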
Changes:
- Refactor `DecompressMemMapReader.read()` from recursion to an iterative loop that tolerates zlib consuming input without producing output.
- Add a regression test ensuring chunked reads (1/4/16/64 bytes) return the full payload and reach `_br == _s`.
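The contract that regression test pins down is the standard chunked-read idiom: an empty chunk must mean true EOF. Sketched generically below (`drain_in_chunks` is a hypothetical helper, and `io.BytesIO` stands in for gitdb's reader):

```python
import io

def drain_in_chunks(stream, chunk_size: int) -> bytes:
    """Read until the stream reports EOF via an empty chunk."""
    out = b""
    while chunk := stream.read(chunk_size):
        out += chunk
    return out

payload = b"x" * 13000  # 13 KB, mirroring the regression test's stream size
for chunk_size in (1, 4, 16, 64):
    # io.BytesIO honors the contract; pre-fix, small chunk sizes against
    # DecompressMemMapReader did not.
    assert drain_in_chunks(io.BytesIO(payload), chunk_size) == payload
```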
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `gitdb/stream.py` | Reworks decompression refill logic to avoid empty reads before EOF and removes recursion to prevent stack overflows on adversarial streams. |
| `gitdb/test/test_stream.py` | Adds a regression test covering chunked reads that previously could terminate early. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary
Closes #120
`DecompressMemMapReader.read(N)` could return `b''` before `_br == _s` when `N` was small enough that the underlying `zlib.decompress(indata, N)` consumed input bytes without producing output (e.g. while ingesting the zlib header / dictionary frames). Callers using the standard `while chunk := stream.read(N)` idiom terminated at the first empty chunk -- before reaching the actual end of the uncompressed object. Reproduced with chunk sizes 1, 4, 16 against a 13 KB stream of compressible data.

Why the original guard existed
The previous `if dcompdat and ...` guard at the recursion site was added so that `compressed_bytes_read()` could drive `read()` in a scrub-loop that intentionally manipulates `_br = 0` while the inner zlib object is already past EOF. Without that guard, the scrub loop would recurse forever because `_br < _s` stays true.

Fix
Convert the recursive "refill to size" branch into an iterative loop around the existing decompress block. The loop terminates when:

- `len(dcompdat) >= size` (caller's request satisfied), OR
- `_br >= _s` (uncompressed object fully read), OR
- the inner zip is at EOF, OR
- no forward progress is possible (no input left in `self._m`, or no compressed bytes were consumed on this turn).

Condition 3 preserves the `compressed_bytes_read()` scrub safety: once the inner zip is at EOF, the loop breaks instead of looping forever. It also bounds the truncated-stream case (`zip.eof` never becomes true) by the input-exhaustion guard, and bounds the crafted "many empty deflate blocks" attack -- previously this could blow Python's recursion depth at ~1500+ stored-block headers because each recursion only consumed a handful of input bytes; the iterative form walks the stream forward without growing the stack.

Tests
- `test_decompress_reader_chunked_read_does_not_terminate_early` reads a 13 KB highly-compressible stream with chunk sizes 1, 4, 16, 64 and asserts the full payload is returned and `_br == _s`.
- The existing tests in `gitdb/test/` (24 tests) all pass, including `test_decompress_reader_special_case` and `test_pack` -- both exercise the `compressed_bytes_read` scrub path.

Manual smoke
(Pre-fix this fails for `chunk_size in (1, 4, 16)`.)

Truncated input is handled the same as before: `read()` returns whatever was decoded so far, then `b''`, instead of looping or raising. The crafted recursion-depth attack (~5000 empty deflate blocks ahead of one valid block) now decodes correctly instead of raising `RecursionError`.
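The truncated-input behavior can be sanity-checked against plain zlib as well (a standalone sketch, not gitdb code; compression level 0 is used only to make the compressed byte layout predictable, so cutting the stream in half is guaranteed to leave some decodable data):

```python
import zlib

payload = b"y" * 13000
comp = zlib.compress(payload, 0)   # level 0: stored blocks, ~len(payload) + 11 bytes

d = zlib.decompressobj()
partial = d.decompress(comp[: len(comp) // 2])   # feed only half the stream

assert 0 < len(partial) < len(payload)   # whatever was decodable comes out
assert d.decompress(b"") == b""          # then empty bytes: no exception, no loop
assert not d.eof                         # the stream never reached its end
```

This mirrors the documented outcome above: partial data, then `b''`, with `eof` still false because the end-of-stream marker was never seen.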