fix(#1469): codec-driven GC reference discovery + file-level orphan matching #1479

Open
dimitri-yatsenko wants to merge 1 commit into master from fix/1469-codec-referenced-paths

@dimitri-yatsenko dimitri-yatsenko commented Jul 2, 2026

Member

Fixes #1469 — garbage collection deleting external-store files that belong to existing rows. Implements the plan from #1469 / #1478.

Two root causes, both fixed

1. Custom-codec blindness (the reported bug). gc.scan recognized schema-addressed columns by hardcoded codec name (object/npy), so a custom SchemaCodec subclass (the reporter's NetCDF codec) was never scanned — its live files were reported orphaned and collect() deleted them. Recognition is now by type (isinstance(codec, SchemaCodec)), and reference extraction moves to a codec-owned hook:

class Codec:
    def referenced_paths(self, stored) -> list[tuple[str, str | None]]:
        # Default recognizes standard {path, store} metadata; override for custom shapes.
        ...

Custom SchemaCodec subclasses inherit correct behavior for free; fully custom codecs can override. scan calls attr.codec.referenced_paths(value) per column instead of switching on name.
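To illustrate the difference between name-based and type-based recognition, here is a minimal, self-contained sketch. The classes below are hypothetical stand-ins, not the library's actual implementation: the `name` attribute, the `NetCDFCodec` class, the `(codec, value)` column pairs, and the `scan_by_name` / `scan_by_type` helpers are all assumptions made for the example. Only `referenced_paths`, `Codec`, and `SchemaCodec` are names taken from this PR.

```python
from __future__ import annotations


class Codec:
    """Hypothetical stand-in for the base codec contract."""

    name = "codec"

    def referenced_paths(self, stored) -> list[tuple[str, str | None]]:
        # Assumed default: recognize standard {path, store} metadata,
        # and report no references for anything else.
        if isinstance(stored, dict) and "path" in stored:
            return [(stored["path"], stored.get("store"))]
        return []


class SchemaCodec(Codec):
    """Hypothetical stand-in for schema-addressed codecs."""


class NetCDFCodec(SchemaCodec):
    """A user-defined codec whose name is in no hardcoded list."""

    name = "netcdf"


def scan_by_name(columns):
    """Pre-fix style: only codecs with a known name are scanned."""
    refs = set()
    for codec, value in columns:
        if codec.name in {"object", "npy"}:
            refs.update(codec.referenced_paths(value))
    return refs


def scan_by_type(columns):
    """Post-fix style: any SchemaCodec subclass is scanned via its hook."""
    refs = set()
    for codec, value in columns:
        if isinstance(codec, SchemaCodec):
            refs.update(codec.referenced_paths(value))
    return refs


columns = [(NetCDFCodec(), {"path": "lab/scan/1/data_ab12", "store": "main"})]
missed = scan_by_name(columns)  # the custom codec is skipped entirely
found = scan_by_type(columns)   # the custom codec's live file is discovered
```

In this sketch the name-based scan finds no references for the custom codec, so its file would look orphaned, while the type-based scan reports the `(path, store)` pair through the inherited hook.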

2. Path-format mismatch (latent; also hit built-in <object@>/<npy@>). A row's stored metadata references an object file ({schema}/{table}/{pk}/{field}_{token}), but list_schema_paths enumerated the enclosing directory, so the referenced and stored path sets never matched and every live schema-addressed object was flagged orphaned. Existing tests hid this because they only asserted referenced >= 1, never orphaned == 0. list_schema_paths now enumerates files (matching referenced paths, with per-token granularity), and delete_schema_path removes the single orphaned file and prunes empty parent directories.
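The mismatch can be reproduced on a local temporary directory. This is a rough sketch under stated assumptions, not the PR's code: `list_dirs`, `list_files`, and `delete_file_and_prune` are hypothetical helpers that only mimic the described before and after behavior of list_schema_paths and delete_schema_path, and the example paths are made up.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def list_dirs(root: Path) -> set[str]:
    """Pre-fix style: one entry per directory that holds object files."""
    return {p.parent.relative_to(root).as_posix()
            for p in root.rglob("*") if p.is_file()}


def list_files(root: Path) -> set[str]:
    """Post-fix style: one entry per stored file."""
    return {p.relative_to(root).as_posix()
            for p in root.rglob("*") if p.is_file()}


def delete_file_and_prune(root: Path, rel_path: str) -> None:
    """Remove a single file, then remove parent directories left empty."""
    target = root / rel_path
    target.unlink()
    parent = target.parent
    while parent != root and not any(parent.iterdir()):
        parent.rmdir()
        parent = parent.parent


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    live = "lab/scan/1/data_aaa"  # still referenced by a row
    dead = "lab/scan/2/data_bbb"  # its row was deleted
    for rel in (live, dead):
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).write_bytes(b"x")

    referenced = {live}  # what the row metadata points at (file paths)

    # Directory entries never equal file paths, so everything looks orphaned.
    orphans_by_dir = list_dirs(root) - referenced
    # File entries line up with the referenced paths.
    orphans_by_file = list_files(root) - referenced

    for rel in orphans_by_file:
        delete_file_and_prune(root, rel)

    remaining = list_files(root)
    dead_dir_pruned = not (root / "lab/scan/2").exists()
    live_dir_kept = (root / "lab/scan/1").is_dir()
```

With directory-level listing, both directories are reported as orphaned even though one holds a live file. With file-level listing, only the deleted row's file is reclaimed, and its now-empty directory is pruned while the sibling directory stays.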

⚠️ Reviewers: cause (2) above means pre-fix collect() could delete live <object@>/<npy@> files, not just custom-codec files. This PR corrects that for all schema-addressed storage.

Not changed

Delete-then-GC semantics are unchanged — external files still survive row delete by design; GC reclaims. No delete-time eager cleanup (unsafe for hash dedup).
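One way to see why eager cleanup at delete time would be unsafe under hash deduplication is a toy content-addressed store. Everything here is an assumption for illustration (the in-memory `store` and `rows` dicts, and the `insert`, `delete_row`, and `collect` helpers); it models only the general idea that two rows with identical content can share one stored file.

```python
import hashlib

store = {}  # path -> bytes; hypothetical in-memory external store
rows = {}   # primary key -> path referenced by that row


def insert(pk, payload: bytes) -> str:
    """Store payload under its content hash; identical payloads share a path."""
    path = hashlib.sha256(payload).hexdigest()
    store[path] = payload
    rows[pk] = path
    return path


def delete_row(pk) -> None:
    """Delete only the row; the external file survives by design."""
    rows.pop(pk)


def collect() -> set:
    """Reclaim files that no remaining row references."""
    referenced = set(rows.values())
    orphans = set(store) - referenced
    for path in orphans:
        del store[path]
    return orphans


shared = insert(1, b"same bytes")
insert(2, b"same bytes")            # deduplicated onto the same path

delete_row(1)
first_pass = collect()              # nothing reclaimed: row 2 still needs it
still_present = shared in store

delete_row(2)
second_pass = collect()             # now the file is a true orphan
```

Had `delete_row(1)` removed the file eagerly, row 2 would have been left pointing at missing data. Deferring reclamation to a pass that sees all remaining references avoids that.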

Tests

  • Recognition of a custom SchemaCodec subclass (_uses_schema_storage).
  • End-to-end guard: after deleting one row and running collect(), the surviving row's file remains and only the deleted row's file is reclaimed — asserted by exact path (robust to the shared test store). Both new tests fail on pre-fix code.
  • Full test_gc.py (32) + object/npy/hash/codec/adapter suites (237) pass locally on MySQL & PostgreSQL.

Discovery layer of the trustworthy-GC work (#1478); #1445 (two-phase quarantine/grace/purge) builds on the now-trustworthy scan.

…rphan matching

Garbage collection deleted external-store files belonging to LIVE rows in two
ways, both fixed here:

1. Custom-codec blindness. gc.scan recognized schema-addressed columns by
   hardcoded codec name (object/npy), so a custom SchemaCodec subclass (e.g. a
   NetCDF codec, #1469) was never scanned — its live files were reported
   orphaned and collect() deleted them. Recognition is now by type
   (isinstance(codec, SchemaCodec)), and reference extraction moves to a
   codec-owned hook, Codec.referenced_paths(stored). Custom SchemaCodec
   subclasses inherit correct behavior for free; fully custom codecs override.

2. Path-format mismatch (also hit built-in <object@>/<npy@>). A row's stored
   metadata references an object FILE ({schema}/{table}/{pk}/{field}_{token}),
   but list_schema_paths enumerated the enclosing DIRECTORY, so the referenced
   and stored path sets never matched and live objects were flagged orphaned.
   list_schema_paths now enumerates files (matching the referenced paths, with
   per-token granularity) and delete_schema_path removes the single orphaned
   file and prunes empty parent dirs.

Delete-then-GC semantics unchanged (files survive row delete by design; GC
reclaims). Adds Codec.referenced_paths to the base contract (default recognizes
standard {path, store} metadata).

Tests: recognition of custom SchemaCodec subclasses; end-to-end guard that
collect() keeps a surviving row's file and reclaims only the deleted row's file
(both fail on pre-fix code). Existing object/npy/hash suites still pass.

Discovery layer of the trustworthy-GC work (#1478); #1445 builds on it.
Development

Successfully merging this pull request may close these issues.

Garbage collection deletes external-store files belonging to existing rows (custom codecs)