Umbrella tracking the invariant: garbage collection must never delete an object that is still referenced. Today it can, via two independent failure modes that look similar but have different root causes:

Incomplete discovery (#1469, "Garbage collection deletes external-store files belonging to existing rows (custom codecs)"): dj.gc.scan hardcodes the built-in codec names (hash/blob/attach, object/npy), so files referenced by a custom codec are never enumerated. The store scan then reports live files as orphans and collect() deletes them. (Row delete also fails to remove custom-codec files.)

They are complementary layers of one goal, not duplicates, and must be fixed in order:
1. #1469: referenced_paths hook; scan and delete-cleanup become codec-driven instead of name-hardcoded. Closes active data loss for real pipelines (aeon_mecha).
2. #1445: two-phase quarantine -> grace -> purge; _trash/ prefix for state.

Why the sequence matters

#1445's purge() re-check reruns the scan. On a codec-blind scan (pre-#1469) it would still classify a live custom-codec file as an orphan and purge it after the grace window, so the grace window can't save you from a scan that never sees the reference. Complete discovery (#1469) is the precondition for safe deletion (#1445).

Documentation (datajoint-docs), to land with the implementation

Add
reference/specs/garbage-collection.md (new normative spec; none exists today). It should cover the orphan-determination model, the referenced_paths codec contract, the two-phase quarantine/grace/purge state machine, config keys (gc.grace_seconds), re-check/concurrency semantics, backend atomic-move requirements, and restore. (#1445, "Two-phase, transaction-safe garbage collection (quarantine -> grace -> purge)", explicitly asks for a written spec.)

Update
how-to/garbage-collection.md ("Clean Up Object Storage"): document the two-phase workflow (quarantine / purge / restore, grace_seconds); note that custom-codec external files are now handled; revise the "single-pass, best-effort" admonition added in "fix issue #186" (#189) once two-phase lands.
reference/specs/codec-api.md: document referenced_paths as part of the Codec contract (required for any codec that owns external artifacts).
how-to/create-custom-codec.md, explanation/custom-codecs.md, how-to/use-plugin-codecs.md: author guidance. If your codec writes external files, implement referenced_paths so that delete + GC see them (otherwise files leak or, worse, get misclassified as orphans and deleted).
reference/specs/provenance.md: the GC concurrency wording references single-pass semantics; align once two-phase ships (minor).
Optionally, a short explainer (e.g. in explanation/object-storage-overview.md) on how DataJoint tracks external references: codecs own their paths; delete and GC consult them.
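To make the incomplete-discovery failure mode concrete, here is a minimal sketch contrasting name-hardcoded discovery with codec-driven discovery. Everything in it is hypothetical: the Codec stand-in, the referenced_paths signature, the ZarrChunksCodec example, and the row/value shapes are assumptions for illustration, not DataJoint's actual classes or the hook design proposed in #1469.

```python
# Hypothetical sketch: none of these classes are DataJoint's real API.


class Codec:
    """Stand-in base class (assumed); a real codec exposes much more."""
    name = "base"

    def referenced_paths(self, stored_value):
        """Return the store paths kept alive by one stored value (assumed signature)."""
        return set()


class HashCodec(Codec):
    """Plays the role of a built-in codec with one file per value."""
    name = "hash"

    def referenced_paths(self, stored_value):
        return {stored_value["path"]}


class ZarrChunksCodec(Codec):
    """Plays the role of a custom codec that owns several files per row."""
    name = "zarr_chunks"

    def referenced_paths(self, stored_value):
        return set(stored_value["chunks"])


def scan_by_name(rows, builtin=("hash",)):
    """Name-hardcoded discovery: values from unknown codecs are skipped."""
    live = set()
    for codec, value in rows:
        if codec.name in builtin:
            live.add(value["path"])
    return live


def scan_by_codec(rows):
    """Codec-driven discovery: every codec reports its own paths."""
    live = set()
    for codec, value in rows:
        live |= codec.referenced_paths(value)
    return live


rows = [
    (HashCodec(), {"path": "store/ab/cd1"}),
    (ZarrChunksCodec(), {"chunks": ["store/z/0.0", "store/z/0.1"]}),
]
store_listing = {"store/ab/cd1", "store/z/0.0", "store/z/0.1", "store/old/orphan"}

# Orphans = files in the store that no scan result claims.
orphans_blind = store_listing - scan_by_name(rows)
orphans_complete = store_listing - scan_by_codec(rows)

print(sorted(orphans_blind))     # includes the live custom-codec chunks
print(sorted(orphans_complete))  # only the genuinely unreferenced file
```

In this toy setup the name-hardcoded scan reports both live chunk files as orphans, which is the misclassification the issue describes; the codec-driven scan leaves only the unreferenced file.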
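The ordering argument under "Why the sequence matters" can also be sketched with a toy in-memory store. This is an illustration under stated assumptions, not the design in #1445: the Store class, its method signatures, the layout under _trash/, and the choice to have purge() restore a file that the re-check finds referenced are all hypothetical.

```python
# Hypothetical sketch: an in-memory store, not DataJoint's gc module.
TRASH = "_trash/"  # prefix named in the issue; the layout beneath it is assumed


class Store:
    def __init__(self, paths):
        self.objects = {p: b"data" for p in paths}
        self.quarantined_at = {}  # trash path -> time it was quarantined

    def quarantine(self, live_paths, now):
        """Phase 1: move apparent orphans under _trash/ instead of deleting them."""
        for path in list(self.objects):
            if path.startswith(TRASH) or path in live_paths:
                continue
            trash_path = TRASH + path
            self.objects[trash_path] = self.objects.pop(path)
            self.quarantined_at[trash_path] = now

    def purge(self, rescan, now, grace_seconds):
        """Phase 2: re-run the scan, restore referenced files, delete the rest after grace."""
        live_paths = rescan()
        for trash_path, t0 in list(self.quarantined_at.items()):
            original = trash_path[len(TRASH):]
            if original in live_paths:
                # The re-check found a reference: put the file back (assumed behavior).
                self.objects[original] = self.objects.pop(trash_path)
                del self.quarantined_at[trash_path]
            elif now - t0 >= grace_seconds:
                del self.objects[trash_path]
                del self.quarantined_at[trash_path]


def run(scan, grace_seconds=3600):
    """Quarantine at t=0, then purge just after the grace window, using the same scan."""
    store = Store(["ab/cd1", "z/0.0", "old/orphan"])
    store.quarantine(scan(), now=0)
    store.purge(scan, now=grace_seconds + 1, grace_seconds=grace_seconds)
    return sorted(store.objects)


def blind_scan():
    return {"ab/cd1"}  # never sees the custom codec's file z/0.0


def complete_scan():
    return {"ab/cd1", "z/0.0"}


after_blind = run(blind_scan)        # z/0.0 is live but gets purged anyway
after_complete = run(complete_scan)  # z/0.0 survives; only old/orphan is purged

# The re-check does help when the scan can see the reference:
store = Store(["new/file"])
store.quarantine(set(), now=0)  # the scan ran before the row was inserted
store.purge(lambda: {"new/file"}, now=10, grace_seconds=3600)
recheck_restored = sorted(store.objects)
```

With the codec-blind scan, the grace window only delays the loss of z/0.0, because the re-check repeats the same blind spot. With complete discovery, the same two-phase flow keeps the live file and can rescue a file whose reference appears between quarantine and purge.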