Lokasi ngalangkungan proxy:   [ UP ]  
[Ngawartoskeun bug]   [Panyetelan cookie]                
Skip to content

fix(cloud): serialize auth refresh writes#1128

Merged
khaliqgant merged 1 commit into
mainfrom
fix/ar-252-cloud-auth-lock
Jun 14, 2026
Merged

fix(cloud): serialize auth refresh writes#1128
khaliqgant merged 1 commit into
mainfrom
fix/ar-252-cloud-auth-lock

Conversation

@khaliqgant

@khaliqgant khaliqgant commented Jun 14, 2026

Copy link
Copy Markdown
Member

Summary

  • write cloud auth through a pid-scoped sibling temp file, chmod 0600, then atomic rename
  • serialize file-backed read-refresh-write with a stale-aware cross-process lock and re-read/re-check stored auth under the lock
  • route file-backed CloudApiClient refreshes, including 401-forced refreshes, through the locked refresh path

Tests

  • npx vitest run packages/cloud/src
  • npm run build:config && npm --prefix packages/cloud run build

Review in cubic

@khaliqgant khaliqgant requested a review from willwashburn as a code owner June 14, 2026 10:25
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 12864d26-b444-4c67-aff4-4098f7d2d185

📥 Commits

Reviewing files that changed from the base of the PR and between 87c05f1 and 2e9ed77.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • packages/cloud/src/api-client.ts
  • packages/cloud/src/auth.test.ts
  • packages/cloud/src/auth.ts

📝 Walkthrough

Walkthrough

Adds atomic on-disk auth persistence and cross-process refresh locking to @agent-relay/cloud. writeStoredAuth now uses a pid-scoped temp file with chmod+rename. A withStoredAuthLock helper serializes file-backed refreshes via mkdir-based locking. CloudApiClient gains a refreshAuth hook, and createCloudSession/authorizedApiFetch are rewired accordingly.

Changes

Cloud Auth Atomic Write and Cross-Process Locking

Layer / File(s) Summary
CloudApiClient refreshAuth hook and snapshot plumbing
packages/cloud/src/api-client.ts
Adds optional refreshAuth callback to CloudApiClientOptions; refactors refresh() through doRefresh() dispatching to refreshAuth or requestRefresh; introduces applySnapshot() to centralize token field updates from a CloudApiClientSnapshot.
Atomic writeStoredAuth and file-lock helpers
packages/cloud/src/auth.ts
Adds lock-path constants and stale-lock thresholds; extracts readCanonicalStoredAuth; rewrites writeStoredAuth with pid-scoped temp file + chmod + rename; adds removeStaleStoredAuthLock, acquireStoredAuthLock, and withStoredAuthLock with retry, timeout, and abort support.
Refresh orchestration wiring
packages/cloud/src/auth.ts
Rewrites refreshStoredAuth to re-read canonical auth inside withStoredAuthLock; removes side-effect writes from requestStoredAuthRefresh; replaces onRefresh with refreshAuth in createCloudSession for non-env-backed auth; passes force: true in authorizedApiFetch's 401 path.
Tests for atomic writes, locking, and concurrency
packages/cloud/src/auth.test.ts, CHANGELOG.md
Extends fs mock setup; adds writeStoredAuth atomic-write and rollback tests; updates legacy-migration assertions; adds ensureCloudSession 401+lock test; adds refreshStoredAuth concurrency serialization and lock-release-on-failure tests; adds changelog entry.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(173, 216, 230, 0.5)
    note over Process1,Process2: Concurrent file-backed refresh
  end
  participant Process1
  participant Process2
  participant withStoredAuthLock
  participant mkdir as fs.mkdir (lock dir)
  participant readCanonicalStoredAuth
  participant requestStoredAuthRefresh
  participant writeStoredAuth
  participant stat as fs.stat (poll lock)

  Process1->>mkdir: mkdir(AUTH_LOCK_PATH)
  mkdir-->>Process1: success (lock acquired)
  Process2->>mkdir: mkdir(AUTH_LOCK_PATH)
  mkdir-->>Process2: EEXIST (contention)
  Process2->>stat: poll until lock released

  Process1->>readCanonicalStoredAuth: read latest auth
  Process1->>requestStoredAuthRefresh: refresh tokens
  requestStoredAuthRefresh-->>Process1: rotated credentials
  Process1->>writeStoredAuth: atomic write rotated auth
  Process1->>withStoredAuthLock: rm lock dir (release)

  Process2->>readCanonicalStoredAuth: read updated canonical auth
  Process2-->>Process2: token fresh → skip refresh, return rotated creds
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • AgentWorkforce/relay#1119: Modifies the same refreshStoredAuth, createCloudSession, and CloudApiClient refresh-session code paths that this PR builds on and rewires.

Poem

🐇 Hop hop, no more race!
Two bunnies once fought for the same auth place,
Now one holds the lock with a mkdir tight,
The other waits, polls, then reads what's right.
Atomic renames keep credentials sound —
No rotated tokens lost on the ground! 🔐

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix(cloud): serialize auth refresh writes' accurately captures the main change—implementing serialization for authentication refresh writes in cloud operations.
Description check ✅ Passed The PR description covers the key changes (secure file writing, cross-process locking, unified refresh path) and testing instructions, but does not follow the exact template structure with Summary and Test Plan sections.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/ar-252-cloud-auth-lock

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install failed due to a network error.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1128 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements atomic credential writes and serializes file-backed token refreshes across processes using a file-based locking mechanism in @agent-relay/cloud. It introduces a refreshAuth callback in CloudApiClient and updates the storage logic to prevent concurrent refreshes from clobbering rotated credentials. The review feedback identifies a critical Time-of-Check to Time-of-Use (TOCTOU) race condition in the lock removal logic that could defeat mutual exclusion, and suggests an optimization to avoid redundant refreshes when multiple processes concurrently wait on the lock.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +168 to +183
async function removeStaleStoredAuthLock(): Promise<boolean> {
try {
const lockStats = await fs.stat(AUTH_LOCK_PATH);
if (Date.now() - lockStats.mtimeMs < AUTH_LOCK_STALE_MS) {
return false;
}
} catch (error) {
if (isNodeErrorWithCode(error, 'ENOENT')) {
return true;
}
throw error;
}

await fs.rm(AUTH_LOCK_PATH, { recursive: true, force: true });
return true;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a Time-of-Check to Time-of-Use (TOCTOU) race condition in removeStaleStoredAuthLock. If multiple concurrent processes attempt to acquire the lock, they may both determine that the lock is stale. One process will delete the lock and recreate it, while the other process, executing its delete call slightly later, will delete the newly created lock of the first process. This allows both processes to proceed as if they hold the lock, defeating mutual exclusion.

To prevent this, we can atomically rename the lock directory to a temporary path before inspecting and deleting it. If we accidentally rename a fresh lock, we can safely restore it.

async function removeStaleStoredAuthLock(): Promise<boolean> {
  try {
    const lockStats = await fs.stat(AUTH_LOCK_PATH);
    if (Date.now() - lockStats.mtimeMs < AUTH_LOCK_STALE_MS) {
      return false;
    }
  } catch (error) {
    if (isNodeErrorWithCode(error, 'ENOENT')) {
      return true;
    }
    throw error;
  }

  const tempTrashPath = `${AUTH_LOCK_PATH}.trash.${randomUUID()}`;
  try {
    await fs.rename(AUTH_LOCK_PATH, tempTrashPath);
  } catch (error) {
    if (isNodeErrorWithCode(error, 'ENOENT')) {
      return true;
    }
    throw error;
  }

  try {
    const trashStats = await fs.stat(tempTrashPath);
    if (Date.now() - trashStats.mtimeMs >= AUTH_LOCK_STALE_MS) {
      await fs.rm(tempTrashPath, { recursive: true, force: true });
      return true;
    } else {
      try {
        await fs.rename(tempTrashPath, AUTH_LOCK_PATH);
      } catch {
        await fs.rm(tempTrashPath, { recursive: true, force: true });
      }
      return false;
    }
  } catch (error) {
    await fs.rm(tempTrashPath, { recursive: true, force: true });
    if (isNodeErrorWithCode(error, 'ENOENT')) {
      return true;
    }
    throw error;
  }
}

Comment on lines +481 to +487
if (
!options.force &&
latestAuth?.apiUrl === auth.apiUrl &&
!shouldRefresh(latestAuth.accessTokenExpiresAt)
) {
return latestAuth;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If multiple processes concurrently trigger a forced refresh (e.g., due to a 401 response), they will all wait for the lock. The first process will successfully refresh the token and write it to disk. When the second process acquires the lock, it will still perform a redundant refresh because options.force is true, even though the stored token has already been updated to a fresh one by the first process.

We can optimize this by checking if the stored token's accessToken has changed since the current process initiated its refresh request. If it has changed, we can safely reuse the newly updated token and skip the redundant refresh.

Suggested change
if (
!options.force &&
latestAuth?.apiUrl === auth.apiUrl &&
!shouldRefresh(latestAuth.accessTokenExpiresAt)
) {
return latestAuth;
}
if (
latestAuth?.apiUrl === auth.apiUrl &&
(latestAuth.accessToken !== auth.accessToken ||
(!options.force && !shouldRefresh(latestAuth.accessTokenExpiresAt)))
) {
return latestAuth;
}

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2e9ed77728

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const AUTH_DIR_PATH = path.dirname(AUTH_FILE_PATH);
const AUTH_LOCK_PATH = `${AUTH_FILE_PATH}.lock`;
const AUTH_LOCK_RETRY_DELAY_MS = 50;
const AUTH_LOCK_STALE_MS = 30_000;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid expiring live refresh locks

When a caller sets refreshTimeoutMs above 30s (the CLI exposes --refresh-timeout and forwards it into refreshStoredAuth), a healthy process can still be inside requestStoredAuthRefresh while this fixed stale window makes other processes remove its lock as stale. That reintroduces concurrent refresh/write races and can clobber rotated credentials; the stale threshold needs to account for the configured refresh timeout or be kept alive by the lock holder.

Useful? React with 👍 / 👎.

throw error;
}

await fs.rm(AUTH_LOCK_PATH, { recursive: true, force: true });

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid unlinking a replacement lock

If multiple waiters encounter an old stale lock, one waiter can stat the stale directory, another can remove it and acquire a fresh lock, and then this waiter still removes AUTH_LOCK_PATH by name. That deletes the new holder's lock and allows overlapping refreshes, so stale cleanup needs an ownership token or an identity check before deleting the lock path.

Useful? React with 👍 / 👎.

@khaliqgant

Copy link
Copy Markdown
Member Author

AR-252 review verdict: APPROVE (posting as a comment — GitHub blocks a formal approval from the same bot identity that opened the PR).

Reviewed against the daemon-vs-reaper concurrent-refresh failure this targets. Verified independently by checking out the branch: vitest packages/cloud/src111 passed (106 baseline + 5 new), tsc build clean with dist/ emitting the lock + atomic rename. CI all green (E2E + Devin still running, no failures).

Correct on the point that matters. The cross-process lock spans the full read→refresh→write critical section in refreshStoredAuth, not just writeStoredAuth — with a re-read + shouldRefresh re-check under the lock so the race loser returns the freshly-rotated token instead of replaying an invalidated single-use refresh token. The serializes concurrent file-backed refreshes test nails it: both callers resolve to the same rotated token and fetch is asserted to fire exactly once — precisely the assertion that catches a write-only-lock regression.

Other strengths: atomic pid/time/uuid temp + chmod 0600 + rename (temp cleaned in finally); env-backed auth bypasses the lock; native fs.mkdir EEXIST lock (no new dep); lock released on throw with a dedicated test; 401-forced refresh routed through the locked path; in-process refreshPromise still dedups (no self-deadlock); onRefresh dropped so no double-write; stale window 30s ≫ 10s refresh timeout. Preserves pear#323's atomic write; unblocks pear#324.

Non-blocking (optional, do not hold merge): (1) authorizedApiFetch force: true always network-refreshes on 401 even if a peer just rotated — correct/safe, minor optimization to retry with the re-read token first; (2) writeFile 0600 + explicit chmod 0600 is redundant but fine as umask defense; (3) stale-lock removal has an inherent TOCTOU, acceptable for advisory locking given the timing margin.

Cleared to merge at the maintainer's discretion (per repo policy I'm not merging).

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1128 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

1 similar comment
@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1128 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1128 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

2 similar comments
@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1128 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1128 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@khaliqgant khaliqgant merged commit 569742e into main Jun 14, 2026
42 checks passed
@khaliqgant khaliqgant deleted the fix/ar-252-cloud-auth-lock branch June 14, 2026 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant