feat(#1425): strict_provenance config flag for runtime enforcement by dimitri-yatsenko · Pull Request #1474 · datajoint/datajoint-python

dimitri-yatsenko · 2026-06-23T13:04:33Z

Summary

T2.2.c of the provenance trinity — completes the trio.

When `dj.config["strict_provenance"] = True`, runtime gates enforce the upstream-only convention inside `make()`:

Reads must target a table in the active trace's allowed set (declared ancestors + self + self's Parts).
Writes must target self or self's Parts.
Inserted rows' PK columns that overlap with the current key must equal the key's values.

Default is False. Existing `make()` bodies are unaffected.

Closes #1425. Slated for DataJoint 2.3.

Branch stack

Layer	PR	Branch
T1.4 cascade fix	#1468	`fix/1429-cascade-part-part-renamed-fk`
T2.2.a `Diagram.trace()`	#1471	`feat/1423-diagram-trace`
T2.2.b `self.upstream`	#1473	`feat/1424-self-upstream`
T2.2.c `strict_provenance` (this)	this PR	`feat/1425-strict-provenance`

Will rebase onto master after the chain merges in order.

What's added

Component	File
Runtime context module — `ContextVar` + push/pop helpers + read/write gates	`src/datajoint/provenance.py` (new)
Config flag `strict_provenance: bool` (default False) + env-var `DJ_STRICT_PROVENANCE`	`src/datajoint/settings.py`
Push/pop context in `_populate_one` around the `make()` invocation	`src/datajoint/autopopulate.py`
Read gate in `QueryExpression.cursor`	`src/datajoint/expression.py`
Write gate in `Table.insert` (after existing `_allow_insert`)	`src/datajoint/table.py`
6 integration tests	`tests/integration/test_strict_provenance.py` (new)

Implementation notes

ContextVar (not threading.local) so the active-make context propagates correctly across `contextvars`-aware boundaries (asyncio, multiprocessing-with-context-propagation).
Part-table detection uses class `dict` traversal filtered to `Part` subclasses — `dir/getattr` would trigger the `_JobsDescriptor` and try to lazy-declare `~~table` inside the populate transaction (caught on the first test).
Read gate is no-op outside make() — checks `ContextVar.get()` returns None.
Write gate is no-op outside make() — same.

Documented limitation (deferred)

The read gate does not distinguish reads that came through `self.upstream` from reads of the same ancestor via a direct expression. Both are allowed if the table is in the allowed set. The intent is to catch reads from undeclared dependencies; tightening the "must come through self.upstream" path requires propagating an attribution marker through QueryExpression composition and is left for a follow-up.

Tests

6/6 new strict-mode tests pass.
17/17 existing autopopulate tests pass with strict default-off — confirms no regression.
8/8 trace + 9/9 cascade tests unaffected.

Test plan

6/6 new tests
17/17 default-off regression
CI green after rebase onto master
Manual: enable in a staging pipeline and verify legitimate `make()`s still run while violations raise with clear messages

Implements T2.2.c of the provenance trinity, completing the trio (Diagram.trace → self.upstream → strict_provenance). When dj.config["strict_provenance"] = True, runtime gates enforce the upstream-only convention inside make(): - Reads must target a table in the active trace's allowed set (declared ancestors + self + self's Parts). - Writes must target self or self's Parts. - Inserted rows' PK columns that overlap with the current key must equal the key's values (key-consistency rule). Default is False. Existing make() bodies are unaffected. Branch stacked on feat/1424-self-upstream (#1473) → feat/1423-diagram-trace (#1471) → fix/1429-cascade-part-part-renamed-fk (#1468). Will rebase onto master after the chain merges. What's added: - src/datajoint/provenance.py (new): the runtime context module. - `_active_strict_make` ContextVar holding (target, allowed_tables, key) for the currently-executing make() invocation. ContextVar chosen over threading.local to propagate correctly across contextvars-aware concurrency boundaries. - `push_strict_make_context` / `pop_strict_make_context` — context lifecycle managed by `_populate_one`'s try/finally. - `assert_read_allowed(query_expression)` — read gate. Recursively discovers base tables via the QueryExpression's `_support` chain and checks each against the allowed set. - `assert_write_allowed(target_table, rows)` — write gate. Verifies the target is self or one of self's Part tables, and checks the key-consistency rule on each dict row. - src/datajoint/settings.py: new `strict_provenance: bool` field on Config (default False), env-var `DJ_STRICT_PROVENANCE`, ENV_VAR_MAPPING entry. - src/datajoint/autopopulate.py: in `_populate_one`, push the strict context (when the flag is on) just before the make() invocation block. The allowed table set = trace's ancestor nodes ∪ {self.full_table_name} ∪ {self's Parts}. Pop in the existing `finally` block. - src/datajoint/expression.py: `QueryExpression.cursor` now calls `assert_read_allowed(self)` before issuing SQL. No-op outside make(). - src/datajoint/table.py: `Table.insert` calls `assert_write_allowed(self, rows)` after the existing `_allow_insert` check. No-op outside make(). Part-table detection uses class `__dict__` traversal (filtered to Part subclasses) instead of `dir/getattr` to avoid triggering the `_JobsDescriptor` (which would lazy-declare ~~table inside the populate transaction — caught by the first test iteration). Documented limitation (deferred): the read gate does not distinguish reads that came through `self.upstream` from reads of the same ancestor via a direct expression. Both are allowed if the table is in the allowed set. The intent is to catch reads from *undeclared* dependencies; tightening the "must come through self.upstream" path requires propagating an attribution marker through QueryExpression composition and is left for a follow-up release. Tests in tests/integration/test_strict_provenance.py (6 new): - test_strict_compliant_make_passes — make() reading via self.upstream and writing self.insert1 with matching key runs cleanly under strict. - test_strict_blocks_read_from_undeclared_table — read from an unrelated table raises with "strict_provenance ... undeclared" message. - test_strict_blocks_write_to_other_table — insert into a non-self, non-Part target raises "not permitted". - test_strict_blocks_write_with_mismatched_key — row PK that disagrees with the current key raises "does not match the current make() key". - test_strict_writes_to_part_table_pass — self.PartName.insert(...) works. - test_strict_off_by_default_no_change — default-off regression check; the canonical "direct (Ancestor & key).fetch1()" pattern still works when strict_provenance is unset. Regression: 17/17 autopopulate tests pass with strict_provenance unset (default). 6/6 new strict tests pass with strict_provenance=True. 8/8 trace tests + 9/9 cascade tests unaffected. Slated for DataJoint 2.3.

ttngu207 · 2026-07-01T02:41:25Z

A note on adoption: two patterns strict mode will break, and what that tells us we're missing

I've spent time with this PR and the spec, and I want to say up front: the provenance trinity is a real evolution for DataJoint. strict_provenance is exactly the right mechanism to turn "reads should come from declared ancestors" from a convention we hope people follow into a guarantee the framework actually enforces. None of what follows is a blocker, and none of it is a criticism of the design.

What I want to do here is think through adoption. There are a couple of patterns that are fairly commonly used in practice today that this feature will break. Both of them are, honestly, bad patterns — they're provenance bugs, and strict mode is right to catch them. But they're bad patterns that a lot of working pipelines lean on, and I think if we turn this on without a story for how those pipelines migrate, we take DataJoint to the next level and leave a chunk of our users behind. So this is really an argument for the next thing we should build, using these two patterns to show why.

Pattern 1 — reading "sideways" to a table that isn't a declared ancestor

The common case looks like this: a computation needs a piece of reference information that lives off to the side of its own lineage. For example, SpikeDetection depends on Recording, but to set a baseline it wants the brain region — which it gets by walking up to Subject and then back down a different branch to Surgery → BrainRegion:

class SpikeDetection(dj.Computed):
    definition = "-> Recording"
    def make(self, key):
        signal = (Recording & key).fetch1("signal")          # declared ancestor — fine
        region = (BrainRegion & (Surgery & key)).fetch1("region")  # NOT an ancestor
        baseline = BASELINES[region]
        ...

BrainRegion isn't upstream of SpikeDetection — it's a cousin. Under strict mode this read is blocked, and that's correct: if the region's baseline feeds the result, then BrainRegion genuinely is a dependency, and today nothing records that. Edit a BrainRegion row and every SpikeDetection derived from it is silently stale, with no way to know. So strict mode is doing exactly its job — turning a silent, invisible dependency into a loud, declared one.

Worth noting the boundary: the very common "parameter table" case (-> AnalysisParams) is already a declared ancestor, so it keeps working. Only genuinely undeclared sideways reads break. This one has a reasonably clear fix — make the dependency real by declaring it — so I'll spend most of the space on the second pattern, which is harder.

Pattern 2 — one `make()` that ingests a file and writes to many tables

This one is about using DataJoint's populate machinery (job reservation, error tracking, triggering on new data) as an orchestration layer. A very common shape: a table triggers when a new file appears, and its make() parses that file — which usually contains nested, hierarchical information — and fans the results out into a bunch of normalized entity tables:

class FileIngestion(dj.Computed):
    definition = "-> FileManifest"
    def make(self, key):
        data = parse(fetch_file(key))
        Session.insert(data["sessions"])   # writing to tables that aren't self
        Block.insert(data["blocks"])
        Trial.insert(data["trials"])
        Event.insert(data["events"])

Strict mode blocks this, because a make() may only write to self and its Parts. And again it's right to: those Session/Trial/Event rows land with no foreign-key edge back to the file they came from, so there's no way to trace an entity to its source. The traceability we care about is genuinely broken here.

The obvious suggestion is "make those tables Parts of FileIngestion." That doesn't work, and the reason is instructive: Session, Trial, and Event are top-level entities that other pipeline branches depend on (Recording -> Session, SpikeSorting -> Trial, and so on). Turning them into Parts of an ingestion table would force all those downstream branches to reference a Part from outside its master — the exact thing the master/Part convention tells us not to do. So Parts are structurally the wrong tool here, not just semantically awkward.

The hard part: for pattern 2 there's no clean replacement, and here's why

When I work through the alternatives for orchestrated ingestion, every strict-compliant option gives up something real. There are three:

Option A — the "landing-table" approach. Restructure so the ingestion is a real upstream table: ParsedFile becomes an ancestor, and Session, Trial, etc. each depend on it and get their own single-target make() that reads the already-parsed data and writes only to itself. This is strict-compliant and provenance is intact. But it means your entities are now downstream of a file. The pipeline reads as "everything begins with a data source" rather than "everything begins with a subject and a session." It's input-driven design, and I don't think it reflects the truth of the model.

Option B — keep the entities as roots, orchestrate outside DataJoint. Session etc. stay top-level tables, and some external system (a script, Airflow, whatever) does the file-watching and inserts them. The domain model stays clean — but you lose the DataJoint populate machinery you wanted in the first place, and the file→entity link now lives entirely outside DataJoint, invisible to it.

Option C — the fat make() above. Clean domain model, keeps orchestration — but it's the pattern strict mode exists to forbid, precisely because it severs provenance.

Laid out together:

	Durable domain model	Records where the data came from	Stable when ingestion changes	Keeps DataJoint orchestration
A. Landing table (file is upstream)	✗ input-driven	~ but as a fake dependency	✗ new source ⇒ schema change	✓
B. Roots + external orchestration	✓	~ but outside DataJoint	✓	✗
C. Fat `make()` (what strict mode blocks)	✓	✗ severed	✓	✓

Every row loses at least one thing that matters.

Why the "landing table" option is worse than it first looks

Option A is the one people will reach for, so it's worth being clear about its deepest problem: it bakes a transient operational fact into the durable data model.

How an entity physically arrives is not durable. A Session might come from a file today, an API call next quarter, a direct database sync the year after. But Session belongs to Subject — that's the real, invariant truth of the model, and it holds no matter how the data got there. These two facts live on completely different timescales: the domain relationships change on the scale of scientific redesign (years), while ingestion mechanics change on the scale of infrastructure (months).

The foreign-key graph is supposed to encode the durable, invariant relationships. The moment you put "came from this file" into it, you've married the slow-moving graph to the fast-moving one. Every time ingestion changes, you're doing a schema migration; and your historical rows permanently carry whatever ingestion ancestor happened to exist when they were created. So the principled strict-mode option is actually the least stable one over the life of a pipeline — enforcing provenance this way makes the schema churn with your infrastructure. That's a genuinely counterintuitive result, and I think it's the crux.

Off versus on, stated carefully

Here's the asymmetry that worries me. With strict_provenance off, a workable option exists — the fat make() — that keeps a clean, durable domain model and keeps DataJoint orchestration. Its only casualty is provenance, which the framework wasn't guaranteeing anyway. With it on, that option is gone, and there's no remaining option that records the file→entity relationship truthfully — as "how it arrived," rather than as a fabricated (and non-durable) dependency in the FK graph — while keeping orchestration. So turning strict mode on doesn't just close a loophole; for this class of pipeline it takes away the one option whose only cost was something already soft, and leaves you choosing between corrupting the durable model (A) or giving up the framework's orchestration (B).

What this actually points to — two things worth building

I don't think the takeaway is "don't ship strict mode." I think it's that strict mode is the feature that makes two follow-on capabilities necessary — it's the alarm, and these are the fire exits:

Provenance-first schema evolution. The reason both patterns are so painful to fix is that the remediation is always "restructure your schema," and we have no tooling for that — so people fall back on drop-and-repopulate, which destroys history. A proper schema-evolution tool would use the trinity's own primitives — trace to find what's upstream of a change, cascade to find what it invalidates — to change dependency structure and recompute safely. It's not alembic; it's evolution that manages the invalidation cascade. And it's built from exactly what this PR ships, which is what makes it a natural next step rather than a detour.
A first-class notion of where data came from, kept separate from the FK graph. Everything above keeps circling one gap: DataJoint has one graph, and it uses it for both "what is derived from what" (durable, invariant) and, when forced, "how this row arrived" (transient, operational). Those need to be different things. Ingestion lineage should be recorded — append-only, allowing many sources over time — so a Session can carry a "came from file X" record today and an "came from API Y" record later, with no schema change at all. The FK graph structurally can't do that, and forcing it to (option A) is exactly what corrupts the model.

Neither of these is a 2.3 concern, and I'm not suggesting they gate this release. But I'd love for us to decide the sequencing deliberately, so we're not shipping the alarm well ahead of the exits. Happy to write these up properly as their own proposals if that's useful.

ttngu207

I did a close read of the enforcement path and want to flag three correctness issues before this ships, so I'm requesting changes. To be clear up front: the scaffolding is right — the ContextVar push/pop with token reset, the exception-path cleanup, transaction rollback, Part detection via __dict__ (nicely avoiding the _JobsDescriptor), and the flag-off no-op are all correct. The problems are in the two gate functions themselves. None of this is about the broader strict-mode design conversation; it's purely implementation correctness.

1. [blocking] Write gate can silently drop all rows — data loss on compliant code

assert_write_allowed(self, rows) runs in Table.insert (table.py:804) before insert materializes rows, and when a strict context is active its non-dict branch does for row in rows. For a one-shot iterable (generator, iter(...), map(...)), that exhausts it; the post-gate consumer — _insert_rows → list(… for row in rows) at table.py:864, or the chunked iter(rows) path at :836 — then sees an empty iterator and inserts zero rows, with no error:

def make(self, key):
    self.insert(dict(**key, x=i) for i in range(n))   # strict on → inserts nothing, silently

Two precisions, because they make this a genuine blocker rather than an edge case:

It's scoped to one-shot iterables. A list/tuple of dicts is safe — it's re-iterable, so the gate walks it and insert re-walks it fine. So self.insert([...]) is unaffected; only self.insert(<generator>) breaks. But passing a generator to insert() is a documented, idiomatic pattern (the chunked-insert path is literally built around iter(rows)), so this isn't exotic input.
It strikes compliant code. The target check passes for a normal self.insert(<generator>) into the table being populated — the gate then consumes the generator anyway. So a correct, strict-compliant make() that inserts many computed rows (very common for Part tables) silently loses all of them. This isn't "you did something forbidden and got a confusing failure"; it's "you wrote correct code and it discarded your data."

Fix direction: don't consume rows in the gate — do the target-only check first (it doesn't need the rows), and defer the key-consistency check to where insert has already materialized rows into a concrete list.

Separately, insert(some_query_expression) isn't data loss but is double-executed: the gate iterates it (running the source SELECT client-side and firing the read gate on it), then insert re-runs it as INSERT … SELECT. Worth fixing in the same pass.

2. [blocking] Read gate misses `len()` / `bool()` / `in`

Only QueryExpression.cursor calls assert_read_allowed. But __len__ (expression.py:1149), __bool__ (:1178), and __contains__ (:1195, via bool(self & item)) each issue connection.query(...) directly and never touch cursor. So inside make():

len(Unrelated & key)     # ungated
bool(Unrelated & cond)   # ungated
key in Unrelated         # ungated (→ __bool__)

all read a forbidden table with no check. Aggregation.__len__/__bool__ and Union.__len__/__bool__ have the same direct-query bypass.

3. [blocking] Read gate is blind to restriction-by-table

_base_tables (provenance.py:56) only walks full_table_name and _support. A semijoin restriction Ancestor & Unrelated renders Unrelated as an IN (subquery) in the WHERE clause — it never enters _support — so _base_tables returns {Ancestor} and the read of Unrelated passes. (Joins are caught, because join extends _support; restrictions are not, and restrict-by-table is at least as common.)

Net on 2 + 3: the gate catches UndeclaredTable.fetch() but not the everyday idioms (restrict-by, existence check, count), so an undeclared dependency stays readable through several common paths. Since the whole value here is a guarantee, I think these need to close before merge — a gate that blocks the naive case while passing the common ones invites false confidence, which is worse than an honest "best-effort."

Tests

The 6 tests exercise the naive paths and would all pass with the three bugs above present. Worth adding: join-with-a-disallowed-operand (the one read path that actually works — untested, so a regression would be invisible), restriction-by-table, len/in/bool, insert-from-query, generator rows, reentrant make() context restore, and exception-path context cleanup.

dimitri-yatsenko · 2026-07-01T16:43:14Z

Thanks for the close read, Thinh (@ttngu207) — flagging how these three land after the spec reframing in datajoint-docs#183, since it changes the triage for two of them.

#1 (write gate consumes one-shot iterables → silent data loss): stands as a must-fix. This is a genuine correctness bug independent of any enforcement-model framing — compliant self.insert(<generator>) silently discarding all rows is unacceptable. Your fix direction is right: do the target-only check first (it doesn't need rows), and defer the key-consistency check to after insert has materialized rows into a concrete list, so the gate never consumes the caller's iterable. Fold the insert(query_expression) double-execution into the same pass.

#2 (len/bool/in bypass) and #3 (restriction-by-table bypass): reclassified by the reframing, not dropped. Your objection — "a gate that blocks the naive case while passing the common ones invites false confidence, which is worse than an honest best-effort" — was exactly right against the framing you were reviewing, where strict_provenance was billed as a runtime guarantee. The revised spec demotes it to an explicit best-effort development guardrail, with comprehensive enforcement delegated to code inspection in the platform's code-deployment CI/CD — not the open-source runtime gate. Under that framing, #2 and #3 are known limitations of the guardrail rather than correctness bugs, and the "false confidence" risk is addressed by the spec now saying plainly what the gate does and doesn't catch.

That still leaves a real choice, and I'll make the call rather than leave it implicit:

Close the common idioms anyway — implement fetch, including blobs with MATLAB arrays #3 especially is cheap and catches a very common pattern (restrict-by-table), which would materially raise the guardrail's value for surfacing accidental deps in staging; or
Accept them as best-effort limits and document the specific in-client gaps (restrict-by-table, len/in/bool) in the spec — which currently only notes that raw SQL and other clients bypass, not these Python-client paths.

I'll decide between those and follow up here so the code and the spec stay in lockstep. Either way #1 is a blocker and should land first. Appreciate the tests you listed — the join-with-disallowed-operand and generator-rows cases in particular would have caught #1 and the one read path that does work today.

@ttngu207

The strict-provenance write gate ran assert_write_allowed(self, rows) before insert materialized rows, and its key-consistency check did 'for row in rows'. For a one-shot iterable (generator, map, iter), that exhausted it, so the downstream insert saw an empty iterator and wrote zero rows with no error — silent data loss on compliant self.insert(<generator>) code (common for Part inserts). insert(<query_expression>) was also double-executed. Split the gate: assert_write_allowed(target) now does the target-only check (needs no rows) in Table.insert; a new per-row assert_row_key_allowed(row) runs in Table._insert_rows as each row is materialized — the single point reached by both the chunked and single-batch paths, so streaming/chunking is preserved and the caller's iterable is never consumed early. The QueryExpression (INSERT ... SELECT) path is no longer iterated by the gate, fixing the double-execution; per-row key consistency does not apply there (rows never materialize client-side), governed by the target check only. Adds regression tests: a generator insert into a Part must land all rows, and the per-row key check still fires for generator-sourced rows. Fixes the blocking bug flagged by @ttngu207 in review. Read-gate coverage (len/bool/in, restriction-by-table) is tracked separately pending the best-effort-vs-close decision.

dimitri-yatsenko · 2026-07-02T14:51:40Z

Pushed d0e8a80 fixing bug #1 (the blocker), along your suggested line, Thinh.

What changed. Split the write gate so it never consumes rows:

assert_write_allowed(target) now does the target-only check (no rows param) in Table.insert, before any branch.
A new per-row assert_row_key_allowed(row) runs in Table._insert_rows as each row is materialized — the single point reached by both the chunked and single-batch paths, so chunking/streaming is preserved and a one-shot generator is never exhausted early.
The QueryExpression (INSERT … SELECT) path is no longer iterated by the gate, which also fixes the double-execution. Per-row key consistency doesn't apply there (rows never materialize client-side) — the target check still governs which table is written.

Tests added: a generator insert into a Part must land all rows (the direct regression — fails on the old code with 0 rows), and the per-row key check still fires for generator-sourced rows.

Ran the split-gate logic directly (generator survives the gate; target + key checks still fire; no-op outside make()); the full integration suite is testcontainers/Docker-based so CI will exercise it here.

Bugs #2 and #3 (read-gate len/bool/in and restriction-by-table) are intentionally not in this commit. As discussed on the spec side, those hinge on the best-effort-guardrail-vs-close-the-gaps decision — I'll settle that separately and either close them or document the in-client limits. Bug #1 was the unambiguous must-fix regardless, so it goes first.

Per the decision to document rather than close the two read-gate coverage gaps Thinh flagged on datajoint/datajoint-python#1474: - Existence/count idioms (len/bool/in) issue COUNT/EXISTS SQL directly, not through the gated fetch path, so reads of undeclared tables via those are not caught. - Restriction-by-table (Ancestor & Undeclared) renders as an IN (subquery) the gate's base-table analysis does not inspect. Both are now listed under 'Enforcement model and its limits' as best-effort client-side gaps. Comprehensive enforcement is the platform's job: its code-deployment CI/CD combines these runtime provenance checks with agentic review of the make() source to cover the paths the runtime guardrail cannot.

dimitri-yatsenko · 2026-07-02T17:10:34Z

Decision on your three findings, Thinh:

#1 (write gate consumed one-shot iterables → data loss) — fixed in d0e8a80, validated locally on MySQL + PostgreSQL (strict suite 8/8 incl. new generator regressions; full insert/autopopulate/relation regression 83 passed, no collateral from the shared _insert_rows change).

#2 (len/bool/in bypass) and #3 (restriction-by-table bypass) — we're documenting these as best-effort limits rather than closing them in the runtime gate. Rationale: the revised spec (datajoint-docs#183) reframes strict_provenance as a best-effort development guardrail, not a guarantee — so a leak is a documented limit, not a correctness bug. Your "false confidence" concern is addressed by stating plainly what the gate does and doesn't catch. Closing #3 cleanly is also disproportionate (capture-at-restrict threading through every expression-copy path) and still can't catch raw-SQL restrictions.

On the platform, comprehensive enforcement is handled by combining these runtime provenance checks with agentic review of the make() source in the code-deployment CI/CD — which covers exactly the paths the client-side gate can't (existence/count idioms, restriction-by-table, and access outside the Python client). Both #2 and #3 are now listed explicitly under "Enforcement model and its limits" in the spec.

So: #1 is resolved in code; #2/#3 are resolved-by-documentation. If that lands right with you, this should be clear to re-review — the only code change from your changes-requested pass is the bug-#1 fix.

ttngu207

Re-reviewed after d0e8a80 — all resolved as discussed. Bug #1 (generator/one-shot iterable data loss + INSERT … SELECT double-exec) is genuinely fixed via the split gate, and the two new regression tests would fail on the old code. #2/#3 are legitimately settled by the best-effort-guardrail reframing in datajoint-docs#183, which now documents exactly those in-client gaps. LGTM — approving, assuming #183 lands in lockstep. Thanks for the fast turnaround.

…venance (#183) * docs(provenance): spec for Diagram.trace + self.upstream + strict_provenance Detailed normative specification for the canonical provenance trinity landing in DataJoint 2.3: - Diagram.trace(table_expr) — upstream mirror of Diagram.cascade(), walks the FK graph from a restricted seed to every ancestor with OR convergence. Reuses the upward propagation rules (U1/U2/U3) defined in the Cascade Specification. Supports trace[TableClass] returning a pre-restricted QueryExpression, and trace[str] returning a FreeTable. - self.upstream — property on AutoPopulate that the framework sets to Diagram.trace(self & key) before each make() invocation. Pre-restricted ancestor access for ergonomic, provenance-safe reads. - dj.config["strict_provenance"] — runtime flag (default False). When True, gates make() so reads must go through self.upstream and writes must target self or self's Parts with key-consistent primary keys. Spec structure: - Why this exists — the convention/enforcement gap, why three pieces. - Concepts — trace as cascade's mirror, OR convergence, the make() provenance boundary, why upward rules are shared with cascade. - §1 Diagram.trace — signature, behavior, indexing (class + string), counts() / heading() / iter, worked examples. - §2 self.upstream — construction lifecycle, allowed table set, per-key behavior, examples. - §3 strict_provenance — config key, read enforcement table, write enforcement table, key-consistency rule, opt-in rationale, examples of compliant + violating make() bodies. - Integration — end-to-end strict-mode example, properties guaranteed. - Migration path — staged adoption from on-default-off to enforcement. - Out of scope — static analysis, default flip, downstream provenance metadata persistence. Examples use core DataJoint types (int32, float64, longblob) per project convention. Cross-links to cascade.md (shared upward rules), autopopulate.md, diagram.md, entity-integrity.md. Nav entry under Reference > Specifications > Data Operations between AutoPopulate and Job Metadata. Slated for DataJoint 2.3. T2.2.a (#1423), T2.2.b (#1424), T2.2.c (#1425) implementations land against this spec. * docs(provenance): reframe strict_provenance as best-effort guardrail; align spec with shipped code Reviewer reconciliation (ttngu207, MilagrosMarin) against the three feature branches, plus a reframing of the enforcement model: - strict_provenance: describe the runtime check honestly as a best-effort, client-side development guardrail, not an airtight boundary. Raw SQL, other clients, and direct DB access bypass it; comprehensive enforcement is a code-inspection problem (static where sound, agentic on the platform) and is positioned as the roadmap. Softened 'runtime guarantee' / 'provenance-safe by construction' language throughout. - Read enforcement: corrected the table + violating example — a direct read of a declared ancestor is Allowed in 2.3 (only undeclared tables are blocked), matching assert_read_allowed. Added a note on what the check does/doesn't distinguish. Error messages now match provenance.py. - trace.counts(): keyed by full table name (dict[str,int]), includes the seed. - Dropped trace.heading() (not implemented). - iter(trace): parents/roots first (topological), not 'deepest first'. - Concurrency: contextvars.ContextVar, not thread-local; removed the dangling 'see Implementation notes' reference. - Noted load_all_upstream is new; glossed make_kwargs; documented self.upstream non-caching; narrowed the ancestor-Parts allowed-set wording. * docs(provenance): attribute code inspection to platform code-deployment CI/CD Draw the boundary plainly: the open-source core ships only the runtime guardrail; comprehensive code inspection (static + agentic) is performed by the DataJoint platform's code-deployment CI/CD process, not the OSS framework. Applied consistently across the intro, the enforcement-model subsection, the integration properties, the migration path, and the out-of-scope list. * docs(provenance): document read-gate best-effort limits (#2/#3) Per the decision to document rather than close the two read-gate coverage gaps Thinh flagged on datajoint/datajoint-python#1474: - Existence/count idioms (len/bool/in) issue COUNT/EXISTS SQL directly, not through the gated fetch path, so reads of undeclared tables via those are not caught. - Restriction-by-table (Ancestor & Undeclared) renders as an IN (subquery) the gate's base-table analysis does not inspect. Both are now listed under 'Enforcement model and its limits' as best-effort client-side gaps. Comprehensive enforcement is the platform's job: its code-deployment CI/CD combines these runtime provenance checks with agentic review of the make() source to cover the paths the runtime guardrail cannot. * docs(provenance): cohesion/completeness pass now that the trinity has shipped - References: 'to land in 2.3' -> shipped; link the merged PRs (#1471/#1473/#1474) and the #1468 cascade rules they build on. - trace[T]: correct the return type — both the class-form and string-form return a FreeTable (a QueryExpression); the class form does NOT return a restricted class instance (matches shipped __getitem__). - Intro: drop the misleading 'checked default' phrasing (the flag defaults to off / opt-in), reword to 'opt-in runtime check'. - Key consistency: note the check applies to dict-style rows; positional/numpy rows carry no names to check and pass unchecked (matches _check_row_key).

dimitri-yatsenko requested review from MilagrosMarin, esutlie and ttngu207 June 23, 2026 13:04

dimitri-yatsenko force-pushed the feat/1425-strict-provenance branch from 9aa7784 to f60495b Compare June 23, 2026 13:23

dimitri-yatsenko force-pushed the feat/1424-self-upstream branch from 86b1ae7 to 7e4130f Compare June 23, 2026 13:23

dimitri-yatsenko marked this pull request as ready for review June 23, 2026 13:39

dimitri-yatsenko requested a review from mweitzel June 26, 2026 17:04

ttngu207 requested changes Jul 1, 2026

View reviewed changes

Base automatically changed from feat/1424-self-upstream to master July 2, 2026 17:08

dimitri-yatsenko requested a review from ttngu207 July 2, 2026 17:50

ttngu207 approved these changes Jul 2, 2026

View reviewed changes

dimitri-yatsenko merged commit 44c86e6 into master Jul 2, 2026
7 checks passed

dimitri-yatsenko deleted the feat/1425-strict-provenance branch July 2, 2026 18:18

dimitri-yatsenko added the feature Indicates new features label Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(#1425): strict_provenance config flag for runtime enforcement#1474

feat(#1425): strict_provenance config flag for runtime enforcement#1474
dimitri-yatsenko merged 2 commits into
masterfrom
feat/1425-strict-provenance

dimitri-yatsenko commented Jun 23, 2026

Uh oh!

ttngu207 commented Jul 1, 2026

Uh oh!

ttngu207 left a comment •

edited

Loading

Uh oh!

dimitri-yatsenko commented Jul 1, 2026

Uh oh!

dimitri-yatsenko commented Jul 2, 2026

Uh oh!

dimitri-yatsenko commented Jul 2, 2026

Uh oh!

ttngu207 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

dimitri-yatsenko commented Jun 23, 2026

Summary

Branch stack

What's added

Implementation notes

Documented limitation (deferred)

Tests

Test plan

Uh oh!

ttngu207 commented Jul 1, 2026

A note on adoption: two patterns strict mode will break, and what that tells us we're missing

Pattern 1 — reading "sideways" to a table that isn't a declared ancestor

Pattern 2 — one make() that ingests a file and writes to many tables

The hard part: for pattern 2 there's no clean replacement, and here's why

Why the "landing table" option is worse than it first looks

Off versus on, stated carefully

What this actually points to — two things worth building

Uh oh!

ttngu207 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

1. [blocking] Write gate can silently drop all rows — data loss on compliant code

2. [blocking] Read gate misses len() / bool() / in

3. [blocking] Read gate is blind to restriction-by-table

Tests

Uh oh!

dimitri-yatsenko commented Jul 1, 2026

Uh oh!

dimitri-yatsenko commented Jul 2, 2026

Uh oh!

dimitri-yatsenko commented Jul 2, 2026

Uh oh!

ttngu207 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Pattern 2 — one `make()` that ingests a file and writes to many tables

ttngu207 left a comment •

edited

Loading

2. [blocking] Read gate misses `len()` / `bool()` / `in`