Fix DecompressMemMapReader.read returning b'' before EOF (#120) #135
Conversation
Closes gitpython-developers#120

`DecompressMemMapReader.read(N)` could return `b''` mid-stream when zlib consumed input without producing output on a single decompress call (small `N`, header / dictionary frames in flight). The original `if dcompdat and ...` guard at the recursion site skipped the "refill to size" recursion in that case, so a caller using the standard idiom `while chunk := stream.read(4096): yield chunk` terminated at the first empty chunk -- before `_br == _s`.

The guard exists for `compressed_bytes_read()`, which manipulates `_br = 0` and then drains the inner zip past its EOF. Recursing there would loop forever because the inner zip is already done.

The fix uses zlib's own `eof` attribute (available on standard `zlib.Decompress` objects since Python 3.6) to distinguish:

- `dcompdat` empty AND zip not at EOF -> still digesting, recurse
- `dcompdat` empty AND zip at EOF -> `compressed_bytes_read` scrub or genuine EOF; do not recurse

`getattr(_zip, 'eof', False)` keeps the conservative behavior when running against a custom zlib object that does not expose the attribute.

Adds a regression test that reads with `chunk_size in {1, 4, 16, 64}` from a 13 KB highly-compressible stream. With the old guard, the `chunk_size <= 16` cases stopped at byte 0; the new test asserts they read all 13000 bytes. The full existing test suite (24 tests) still passes, including `test_decompress_reader_special_case` and `test_pack`, which exercise the `compressed_bytes_read` scrub path that the original guard existed to protect.
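For illustration, the underlying zlib behavior can be reproduced directly with the standard library, independent of gitdb (a standalone sketch):

```python
import zlib

payload = b"x" * 13000
comp = zlib.compress(payload)

d = zlib.decompressobj()
# Feed only the first two bytes (the zlib header): the input is
# consumed, but no uncompressed output can be produced yet.
chunk = d.decompress(comp[:2], 64)

print(chunk)   # empty bytes -- yet the stream is not finished
print(d.eof)   # False -- `eof` is how the fix tells these cases apart
```

An empty return with `eof == False` means "keep feeding input", which is exactly the case the old guard conflated with true EOF.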
Thanks a lot. This is interesting because nobody should be using this pure Python object database. But I guess for those who are, having this fixed is valuable. Let's see if Copilot finds anything, but if not I think this should be merged, even though I only did an armchair review and admittedly don't understand a thing, trusting the test-suite fully here.
Pull request overview
Fixes DecompressMemMapReader.read(N) returning b'' before true EOF for small N, which could prematurely terminate standard chunked read loops (issue #120). The change replaces a recursive refill strategy with an iterative decompression loop that continues as long as zlib is making forward progress, while still preserving the compressed_bytes_read() “scrub” behavior and avoiding recursion-depth failures on pathological inputs.
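A minimal model of that iterative shape (a hypothetical, simplified sketch; names like `_m`, `_br`, `_s`, `_zip` mirror gitdb's internals, but this is not the actual diff):

```python
import zlib

class ChunkedDecompressor:
    """Illustrative model of the iterative refill loop; not gitdb's code."""

    def __init__(self, compressed: bytes, uncompressed_size: int):
        self._m = compressed                # compressed input ("memory map")
        self._cws = 0                       # read cursor into self._m
        self._zip = zlib.decompressobj()
        self._br = 0                        # uncompressed bytes returned so far
        self._s = uncompressed_size         # total uncompressed size

    def read(self, size: int) -> bytes:
        dcompdat = b""
        while True:
            # Refill: retry the unconsumed tail first, then fresh input.
            indata = self._zip.unconsumed_tail + self._m[self._cws:self._cws + 128]
            self._cws = min(self._cws + 128, len(self._m))
            dcompdat += self._zip.decompress(indata, size - len(dcompdat))
            if len(dcompdat) >= size:                 # request satisfied
                break
            if self._br + len(dcompdat) >= self._s:   # object fully read
                break
            if getattr(self._zip, "eof", False):      # inner zip done: stop
                break
            if not indata:                            # input exhausted (truncated)
                break
        self._br += len(dcompdat)
        return dcompdat
```

With a valid complete stream, `read(n)` only returns `b''` once the whole object has been produced; the loop keeps walking the input forward without growing the stack.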
Changes:
- Refactor `DecompressMemMapReader.read()` from recursion to an iterative loop that tolerates zlib consuming input without producing output.
- Add a regression test ensuring chunked reads (1/4/16/64 bytes) return the full payload and reach `_br == _s`.
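The contract that regression test pins down is the standard chunked-read idiom: an empty chunk must mean true EOF. Sketched generically below (`drain_in_chunks` is a hypothetical helper, and `io.BytesIO` stands in for gitdb's reader):

```python
import io

def drain_in_chunks(stream, chunk_size: int) -> bytes:
    """Read until the stream reports EOF via an empty chunk."""
    out = b""
    while chunk := stream.read(chunk_size):
        out += chunk
    return out

payload = b"x" * 13000  # 13 KB, mirroring the regression test's stream size
for chunk_size in (1, 4, 16, 64):
    # io.BytesIO honors the contract; pre-fix, small chunk sizes against
    # DecompressMemMapReader did not.
    assert drain_in_chunks(io.BytesIO(payload), chunk_size) == payload
```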
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `gitdb/stream.py` | Reworks decompression refill logic to avoid empty reads before EOF and removes recursion to prevent stack overflows on adversarial streams. |
| `gitdb/test/test_stream.py` | Adds a regression test covering chunked reads that previously could terminate early. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary
Closes #120
`DecompressMemMapReader.read(N)` could return `b''` before `_br == _s` when `N` was small enough that the underlying `zlib.decompress(indata, N)` consumed input bytes without producing output (e.g. while ingesting the zlib header / dictionary frames). Callers using the standard `while chunk := stream.read(N)` idiom terminated at the first empty chunk -- before reaching the actual end of the uncompressed object. Reproduced with chunk sizes 1, 4, 16 against a 13 KB stream of compressible data.

Why the original guard existed
The previous `if dcompdat and ...` guard at the recursion site was added so that `compressed_bytes_read()` could drive `read()` in a scrub-loop that intentionally manipulates `_br = 0` while the inner zlib object is already past EOF. Without that guard, the scrub loop would recurse forever because `_br < _s` stays true.

Fix
Convert the recursive "refill to size" branch into an iterative loop around the existing decompress block. The loop terminates when:

- `len(dcompdat) >= size` (caller's request satisfied), OR
- `_br >= _s` (uncompressed object fully read), OR
- the inner zip is at EOF, OR
- no forward progress is possible (no input left in `self._m`, or no compressed bytes were consumed on this turn).

Condition 3 preserves the `compressed_bytes_read()` scrub safety: once the inner zip is at EOF, the loop breaks instead of looping forever. It also bounds the truncated-stream case (`zip.eof` never becomes true) by the input-exhaustion guard, and bounds the crafted "many empty deflate blocks" attack -- previously this could blow Python's recursion depth at ~1500+ stored-block headers because each recursion only consumed a handful of input bytes; the iterative form walks the stream forward without growing the stack.

Tests
- `test_decompress_reader_chunked_read_does_not_terminate_early` reads a 13 KB highly-compressible stream with chunk sizes 1, 4, 16, 64 and asserts the full payload is returned and `_br == _s`.
- The existing tests in `gitdb/test/` (24 tests) all pass, including `test_decompress_reader_special_case` and `test_pack` -- both exercise the `compressed_bytes_read` scrub path.

Manual smoke
(Pre-fix this fails for `chunk_size in (1, 4, 16)`.)

Truncated input is handled the same as before: `read()` returns whatever was decoded so far, then `b''`, instead of looping or raising. The crafted recursion-depth attack (~5000 empty deflate blocks ahead of one valid block) now decodes correctly instead of raising `RecursionError`.
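The truncated-input behavior can be sanity-checked against plain zlib as well (a standalone sketch, not gitdb code; compression level 0 is used only to make the compressed byte layout predictable, so cutting the stream in half is guaranteed to leave some decodable data):

```python
import zlib

payload = b"y" * 13000
comp = zlib.compress(payload, 0)   # level 0: stored blocks, ~len(payload) + 11 bytes

d = zlib.decompressobj()
partial = d.decompress(comp[: len(comp) // 2])   # feed only half the stream

assert 0 < len(partial) < len(payload)   # whatever was decodable comes out
assert d.decompress(b"") == b""          # then empty bytes: no exception, no loop
assert not d.eof                         # the stream never reached its end
```

This mirrors the documented outcome above: partial data, then `b''`, with `eof` still false because the end-of-stream marker was never seen.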