feat(#1425): strict_provenance config flag for runtime enforcement#1474
Conversation
Implements T2.2.c of the provenance trinity, completing the trio (Diagram.trace → self.upstream → strict_provenance). When dj.config["strict_provenance"] = True, runtime gates enforce the upstream-only convention inside make(): - Reads must target a table in the active trace's allowed set (declared ancestors + self + self's Parts). - Writes must target self or self's Parts. - Inserted rows' PK columns that overlap with the current key must equal the key's values (key-consistency rule). Default is False. Existing make() bodies are unaffected. Branch stacked on feat/1424-self-upstream (#1473) → feat/1423-diagram-trace (#1471) → fix/1429-cascade-part-part-renamed-fk (#1468). Will rebase onto master after the chain merges. What's added: - src/datajoint/provenance.py (new): the runtime context module. - `_active_strict_make` ContextVar holding (target, allowed_tables, key) for the currently-executing make() invocation. ContextVar chosen over threading.local to propagate correctly across contextvars-aware concurrency boundaries. - `push_strict_make_context` / `pop_strict_make_context` — context lifecycle managed by `_populate_one`'s try/finally. - `assert_read_allowed(query_expression)` — read gate. Recursively discovers base tables via the QueryExpression's `_support` chain and checks each against the allowed set. - `assert_write_allowed(target_table, rows)` — write gate. Verifies the target is self or one of self's Part tables, and checks the key-consistency rule on each dict row. - src/datajoint/settings.py: new `strict_provenance: bool` field on Config (default False), env-var `DJ_STRICT_PROVENANCE`, ENV_VAR_MAPPING entry. - src/datajoint/autopopulate.py: in `_populate_one`, push the strict context (when the flag is on) just before the make() invocation block. The allowed table set = trace's ancestor nodes ∪ {self.full_table_name} ∪ {self's Parts}. Pop in the existing `finally` block. - src/datajoint/expression.py: `QueryExpression.cursor` now calls `assert_read_allowed(self)` before issuing SQL. No-op outside make(). - src/datajoint/table.py: `Table.insert` calls `assert_write_allowed(self, rows)` after the existing `_allow_insert` check. No-op outside make(). Part-table detection uses class `__dict__` traversal (filtered to Part subclasses) instead of `dir/getattr` to avoid triggering the `_JobsDescriptor` (which would lazy-declare ~~table inside the populate transaction — caught by the first test iteration). Documented limitation (deferred): the read gate does not distinguish reads that came through `self.upstream` from reads of the same ancestor via a direct expression. Both are allowed if the table is in the allowed set. The intent is to catch reads from *undeclared* dependencies; tightening the "must come through self.upstream" path requires propagating an attribution marker through QueryExpression composition and is left for a follow-up release. Tests in tests/integration/test_strict_provenance.py (6 new): - test_strict_compliant_make_passes — make() reading via self.upstream and writing self.insert1 with matching key runs cleanly under strict. - test_strict_blocks_read_from_undeclared_table — read from an unrelated table raises with "strict_provenance ... undeclared" message. - test_strict_blocks_write_to_other_table — insert into a non-self, non-Part target raises "not permitted". - test_strict_blocks_write_with_mismatched_key — row PK that disagrees with the current key raises "does not match the current make() key". - test_strict_writes_to_part_table_pass — self.PartName.insert(...) works. - test_strict_off_by_default_no_change — default-off regression check; the canonical "direct (Ancestor & key).fetch1()" pattern still works when strict_provenance is unset. Regression: 17/17 autopopulate tests pass with strict_provenance unset (default). 6/6 new strict tests pass with strict_provenance=True. 8/8 trace tests + 9/9 cascade tests unaffected. Slated for DataJoint 2.3.
9aa7784 to
f60495b
Compare
86b1ae7 to
7e4130f
Compare
A note on adoption: two patterns strict mode will break, and what that tells us we're missingI've spent time with this PR and the spec, and I want to say up front: the provenance trinity is a real evolution for DataJoint. What I want to do here is think through adoption. There are a couple of patterns that are fairly commonly used in practice today that this feature will break. Both of them are, honestly, bad patterns — they're provenance bugs, and strict mode is right to catch them. But they're bad patterns that a lot of working pipelines lean on, and I think if we turn this on without a story for how those pipelines migrate, we take DataJoint to the next level and leave a chunk of our users behind. So this is really an argument for the next thing we should build, using these two patterns to show why. Pattern 1 — reading "sideways" to a table that isn't a declared ancestorThe common case looks like this: a computation needs a piece of reference information that lives off to the side of its own lineage. For example, class SpikeDetection(dj.Computed):
definition = "-> Recording"
def make(self, key):
signal = (Recording & key).fetch1("signal") # declared ancestor — fine
region = (BrainRegion & (Surgery & key)).fetch1("region") # NOT an ancestor
baseline = BASELINES[region]
...
Worth noting the boundary: the very common "parameter table" case ( Pattern 2 — one
|
| Durable domain model | Records where the data came from | Stable when ingestion changes | Keeps DataJoint orchestration | |
|---|---|---|---|---|
| A. Landing table (file is upstream) | ✗ input-driven | ~ but as a fake dependency | ✗ new source ⇒ schema change | ✓ |
| B. Roots + external orchestration | ✓ | ~ but outside DataJoint | ✓ | ✗ |
C. Fat make() (what strict mode blocks) |
✓ | ✗ severed | ✓ | ✓ |
Every row loses at least one thing that matters.
Why the "landing table" option is worse than it first looks
Option A is the one people will reach for, so it's worth being clear about its deepest problem: it bakes a transient operational fact into the durable data model.
How an entity physically arrives is not durable. A Session might come from a file today, an API call next quarter, a direct database sync the year after. But Session belongs to Subject — that's the real, invariant truth of the model, and it holds no matter how the data got there. These two facts live on completely different timescales: the domain relationships change on the scale of scientific redesign (years), while ingestion mechanics change on the scale of infrastructure (months).
The foreign-key graph is supposed to encode the durable, invariant relationships. The moment you put "came from this file" into it, you've married the slow-moving graph to the fast-moving one. Every time ingestion changes, you're doing a schema migration; and your historical rows permanently carry whatever ingestion ancestor happened to exist when they were created. So the principled strict-mode option is actually the least stable one over the life of a pipeline — enforcing provenance this way makes the schema churn with your infrastructure. That's a genuinely counterintuitive result, and I think it's the crux.
Off versus on, stated carefully
Here's the asymmetry that worries me. With strict_provenance off, a workable option exists — the fat make() — that keeps a clean, durable domain model and keeps DataJoint orchestration. Its only casualty is provenance, which the framework wasn't guaranteeing anyway. With it on, that option is gone, and there's no remaining option that records the file→entity relationship truthfully — as "how it arrived," rather than as a fabricated (and non-durable) dependency in the FK graph — while keeping orchestration. So turning strict mode on doesn't just close a loophole; for this class of pipeline it takes away the one option whose only cost was something already soft, and leaves you choosing between corrupting the durable model (A) or giving up the framework's orchestration (B).
What this actually points to — two things worth building
I don't think the takeaway is "don't ship strict mode." I think it's that strict mode is the feature that makes two follow-on capabilities necessary — it's the alarm, and these are the fire exits:
-
Provenance-first schema evolution. The reason both patterns are so painful to fix is that the remediation is always "restructure your schema," and we have no tooling for that — so people fall back on drop-and-repopulate, which destroys history. A proper schema-evolution tool would use the trinity's own primitives —
traceto find what's upstream of a change,cascadeto find what it invalidates — to change dependency structure and recompute safely. It's not alembic; it's evolution that manages the invalidation cascade. And it's built from exactly what this PR ships, which is what makes it a natural next step rather than a detour. -
A first-class notion of where data came from, kept separate from the FK graph. Everything above keeps circling one gap: DataJoint has one graph, and it uses it for both "what is derived from what" (durable, invariant) and, when forced, "how this row arrived" (transient, operational). Those need to be different things. Ingestion lineage should be recorded — append-only, allowing many sources over time — so a
Sessioncan carry a "came from file X" record today and an "came from API Y" record later, with no schema change at all. The FK graph structurally can't do that, and forcing it to (option A) is exactly what corrupts the model.
Neither of these is a 2.3 concern, and I'm not suggesting they gate this release. But I'd love for us to decide the sequencing deliberately, so we're not shipping the alarm well ahead of the exits. Happy to write these up properly as their own proposals if that's useful.
There was a problem hiding this comment.
I did a close read of the enforcement path and want to flag three correctness issues before this ships, so I'm requesting changes. To be clear up front: the scaffolding is right — the ContextVar push/pop with token reset, the exception-path cleanup, transaction rollback, Part detection via __dict__ (nicely avoiding the _JobsDescriptor), and the flag-off no-op are all correct. The problems are in the two gate functions themselves. None of this is about the broader strict-mode design conversation; it's purely implementation correctness.
1. [blocking] Write gate can silently drop all rows — data loss on compliant code
assert_write_allowed(self, rows) runs in Table.insert (table.py:804) before insert materializes rows, and when a strict context is active its non-dict branch does for row in rows. For a one-shot iterable (generator, iter(...), map(...)), that exhausts it; the post-gate consumer — _insert_rows → list(… for row in rows) at table.py:864, or the chunked iter(rows) path at :836 — then sees an empty iterator and inserts zero rows, with no error:
def make(self, key):
self.insert(dict(**key, x=i) for i in range(n)) # strict on → inserts nothing, silentlyTwo precisions, because they make this a genuine blocker rather than an edge case:
- It's scoped to one-shot iterables. A
list/tupleof dicts is safe — it's re-iterable, so the gate walks it andinsertre-walks it fine. Soself.insert([...])is unaffected; onlyself.insert(<generator>)breaks. But passing a generator toinsert()is a documented, idiomatic pattern (the chunked-insert path is literally built arounditer(rows)), so this isn't exotic input. - It strikes compliant code. The target check passes for a normal
self.insert(<generator>)into the table being populated — the gate then consumes the generator anyway. So a correct, strict-compliantmake()that inserts many computed rows (very common for Part tables) silently loses all of them. This isn't "you did something forbidden and got a confusing failure"; it's "you wrote correct code and it discarded your data."
Fix direction: don't consume rows in the gate — do the target-only check first (it doesn't need the rows), and defer the key-consistency check to where insert has already materialized rows into a concrete list.
Separately, insert(some_query_expression) isn't data loss but is double-executed: the gate iterates it (running the source SELECT client-side and firing the read gate on it), then insert re-runs it as INSERT … SELECT. Worth fixing in the same pass.
2. [blocking] Read gate misses len() / bool() / in
Only QueryExpression.cursor calls assert_read_allowed. But __len__ (expression.py:1149), __bool__ (:1178), and __contains__ (:1195, via bool(self & item)) each issue connection.query(...) directly and never touch cursor. So inside make():
len(Unrelated & key) # ungated
bool(Unrelated & cond) # ungated
key in Unrelated # ungated (→ __bool__)all read a forbidden table with no check. Aggregation.__len__/__bool__ and Union.__len__/__bool__ have the same direct-query bypass.
3. [blocking] Read gate is blind to restriction-by-table
_base_tables (provenance.py:56) only walks full_table_name and _support. A semijoin restriction Ancestor & Unrelated renders Unrelated as an IN (subquery) in the WHERE clause — it never enters _support — so _base_tables returns {Ancestor} and the read of Unrelated passes. (Joins are caught, because join extends _support; restrictions are not, and restrict-by-table is at least as common.)
Net on 2 + 3: the gate catches UndeclaredTable.fetch() but not the everyday idioms (restrict-by, existence check, count), so an undeclared dependency stays readable through several common paths. Since the whole value here is a guarantee, I think these need to close before merge — a gate that blocks the naive case while passing the common ones invites false confidence, which is worse than an honest "best-effort."
Tests
The 6 tests exercise the naive paths and would all pass with the three bugs above present. Worth adding: join-with-a-disallowed-operand (the one read path that actually works — untested, so a regression would be invisible), restriction-by-table, len/in/bool, insert-from-query, generator rows, reentrant make() context restore, and exception-path context cleanup.
|
Thanks for the close read, Thinh (@ttngu207) — flagging how these three land after the spec reframing in datajoint-docs#183, since it changes the triage for two of them. #1 (write gate consumes one-shot iterables → silent data loss): stands as a must-fix. This is a genuine correctness bug independent of any enforcement-model framing — compliant #2 ( That still leaves a real choice, and I'll make the call rather than leave it implicit:
I'll decide between those and follow up here so the code and the spec stay in lockstep. Either way #1 is a blocker and should land first. Appreciate the tests you listed — the join-with-disallowed-operand and generator-rows cases in particular would have caught #1 and the one read path that does work today. |
The strict-provenance write gate ran assert_write_allowed(self, rows) before insert materialized rows, and its key-consistency check did 'for row in rows'. For a one-shot iterable (generator, map, iter), that exhausted it, so the downstream insert saw an empty iterator and wrote zero rows with no error — silent data loss on compliant self.insert(<generator>) code (common for Part inserts). insert(<query_expression>) was also double-executed. Split the gate: assert_write_allowed(target) now does the target-only check (needs no rows) in Table.insert; a new per-row assert_row_key_allowed(row) runs in Table._insert_rows as each row is materialized — the single point reached by both the chunked and single-batch paths, so streaming/chunking is preserved and the caller's iterable is never consumed early. The QueryExpression (INSERT ... SELECT) path is no longer iterated by the gate, fixing the double-execution; per-row key consistency does not apply there (rows never materialize client-side), governed by the target check only. Adds regression tests: a generator insert into a Part must land all rows, and the per-row key check still fires for generator-sourced rows. Fixes the blocking bug flagged by @ttngu207 in review. Read-gate coverage (len/bool/in, restriction-by-table) is tracked separately pending the best-effort-vs-close decision.
|
Pushed What changed. Split the write gate so it never consumes
Tests added: a generator insert into a Part must land all rows (the direct regression — fails on the old code with 0 rows), and the per-row key check still fires for generator-sourced rows. Ran the split-gate logic directly (generator survives the gate; target + key checks still fire; no-op outside Bugs #2 and #3 (read-gate |
Per the decision to document rather than close the two read-gate coverage gaps Thinh flagged on datajoint/datajoint-python#1474: - Existence/count idioms (len/bool/in) issue COUNT/EXISTS SQL directly, not through the gated fetch path, so reads of undeclared tables via those are not caught. - Restriction-by-table (Ancestor & Undeclared) renders as an IN (subquery) the gate's base-table analysis does not inspect. Both are now listed under 'Enforcement model and its limits' as best-effort client-side gaps. Comprehensive enforcement is the platform's job: its code-deployment CI/CD combines these runtime provenance checks with agentic review of the make() source to cover the paths the runtime guardrail cannot.
|
Decision on your three findings, Thinh: #1 (write gate consumed one-shot iterables → data loss) — fixed in #2 ( On the platform, comprehensive enforcement is handled by combining these runtime provenance checks with agentic review of the So: #1 is resolved in code; #2/#3 are resolved-by-documentation. If that lands right with you, this should be clear to re-review — the only code change from your changes-requested pass is the bug-#1 fix. |
ttngu207
left a comment
There was a problem hiding this comment.
Re-reviewed after d0e8a80 — all resolved as discussed. Bug #1 (generator/one-shot iterable data loss + INSERT … SELECT double-exec) is genuinely fixed via the split gate, and the two new regression tests would fail on the old code. #2/#3 are legitimately settled by the best-effort-guardrail reframing in datajoint-docs#183, which now documents exactly those in-client gaps. LGTM — approving, assuming #183 lands in lockstep. Thanks for the fast turnaround.
…venance (#183) * docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance Detailed normative specification for the canonical provenance trinity landing in DataJoint 2.3: - Diagram.trace(table_expr) — upstream mirror of Diagram.cascade(), walks the FK graph from a restricted seed to every ancestor with OR convergence. Reuses the upward propagation rules (U1/U2/U3) defined in the Cascade Specification. Supports trace[TableClass] returning a pre-restricted QueryExpression, and trace[str] returning a FreeTable. - self.upstream — property on AutoPopulate that the framework sets to Diagram.trace(self & key) before each make() invocation. Pre-restricted ancestor access for ergonomic, provenance-safe reads. - dj.config["strict_provenance"] — runtime flag (default False). When True, gates make() so reads must go through self.upstream and writes must target self or self's Parts with key-consistent primary keys. Spec structure: - Why this exists — the convention/enforcement gap, why three pieces. - Concepts — trace as cascade's mirror, OR convergence, the make() provenance boundary, why upward rules are shared with cascade. - §1 Diagram.trace — signature, behavior, indexing (class + string), counts() / heading() / iter, worked examples. - §2 self.upstream — construction lifecycle, allowed table set, per-key behavior, examples. - §3 strict_provenance — config key, read enforcement table, write enforcement table, key-consistency rule, opt-in rationale, examples of compliant + violating make() bodies. - Integration — end-to-end strict-mode example, properties guaranteed. - Migration path — staged adoption from on-default-off to enforcement. - Out of scope — static analysis, default flip, downstream provenance metadata persistence. Examples use core DataJoint types (int32, float64, longblob) per project convention. Cross-links to cascade.md (shared upward rules), autopopulate.md, diagram.md, entity-integrity.md. Nav entry under Reference > Specifications > Data Operations between AutoPopulate and Job Metadata. Slated for DataJoint 2.3. T2.2.a (#1423), T2.2.b (#1424), T2.2.c (#1425) implementations land against this spec. * docs(provenance): reframe strict_provenance as best-effort guardrail; align spec with shipped code Reviewer reconciliation (ttngu207, MilagrosMarin) against the three feature branches, plus a reframing of the enforcement model: - strict_provenance: describe the runtime check honestly as a best-effort, client-side development guardrail, not an airtight boundary. Raw SQL, other clients, and direct DB access bypass it; comprehensive enforcement is a code-inspection problem (static where sound, agentic on the platform) and is positioned as the roadmap. Softened 'runtime guarantee' / 'provenance-safe by construction' language throughout. - Read enforcement: corrected the table + violating example — a direct read of a declared ancestor is Allowed in 2.3 (only undeclared tables are blocked), matching assert_read_allowed. Added a note on what the check does/doesn't distinguish. Error messages now match provenance.py. - trace.counts(): keyed by full table name (dict[str,int]), includes the seed. - Dropped trace.heading() (not implemented). - iter(trace): parents/roots first (topological), not 'deepest first'. - Concurrency: contextvars.ContextVar, not thread-local; removed the dangling 'see Implementation notes' reference. - Noted load_all_upstream is new; glossed make_kwargs; documented self.upstream non-caching; narrowed the ancestor-Parts allowed-set wording. * docs(provenance): attribute code inspection to platform code-deployment CI/CD Draw the boundary plainly: the open-source core ships only the runtime guardrail; comprehensive code inspection (static + agentic) is performed by the DataJoint platform's code-deployment CI/CD process, not the OSS framework. Applied consistently across the intro, the enforcement-model subsection, the integration properties, the migration path, and the out-of-scope list. * docs(provenance): document read-gate best-effort limits (#2/#3) Per the decision to document rather than close the two read-gate coverage gaps Thinh flagged on datajoint/datajoint-python#1474: - Existence/count idioms (len/bool/in) issue COUNT/EXISTS SQL directly, not through the gated fetch path, so reads of undeclared tables via those are not caught. - Restriction-by-table (Ancestor & Undeclared) renders as an IN (subquery) the gate's base-table analysis does not inspect. Both are now listed under 'Enforcement model and its limits' as best-effort client-side gaps. Comprehensive enforcement is the platform's job: its code-deployment CI/CD combines these runtime provenance checks with agentic review of the make() source to cover the paths the runtime guardrail cannot. * docs(provenance): cohesion/completeness pass now that the trinity has shipped - References: 'to land in 2.3' -> shipped; link the merged PRs (#1471/#1473/#1474) and the #1468 cascade rules they build on. - trace[T]: correct the return type — both the class-form and string-form return a FreeTable (a QueryExpression); the class form does NOT return a restricted class instance (matches shipped __getitem__). - Intro: drop the misleading 'checked default' phrasing (the flag defaults to off / opt-in), reword to 'opt-in runtime check'. - Key consistency: note the check applies to dict-style rows; positional/numpy rows carry no names to check and pass unchecked (matches _check_row_key).
Summary
T2.2.c of the provenance trinity — completes the trio.
When `dj.config["strict_provenance"] = True`, runtime gates enforce the upstream-only convention inside `make()`:
Default is False. Existing `make()` bodies are unaffected.
Closes #1425. Slated for DataJoint 2.3.
Branch stack
Will rebase onto master after the chain merges in order.
What's added
Implementation notes
Documented limitation (deferred)
The read gate does not distinguish reads that came through `self.upstream` from reads of the same ancestor via a direct expression. Both are allowed if the table is in the allowed set. The intent is to catch reads from undeclared dependencies; tightening the "must come through self.upstream" path requires propagating an attribution marker through QueryExpression composition and is left for a follow-up.
Tests
Test plan