fix(cloud): serialize auth refresh writes#1128
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds atomic on-disk auth persistence and cross-process refresh locking to ChangesCloud Auth Atomic Write and Cross-Process Locking
Sequence Diagram(s)sequenceDiagram
rect rgba(173, 216, 230, 0.5)
note over Process1,Process2: Concurrent file-backed refresh
end
participant Process1
participant Process2
participant withStoredAuthLock
participant mkdir as fs.mkdir (lock dir)
participant readCanonicalStoredAuth
participant requestStoredAuthRefresh
participant writeStoredAuth
participant stat as fs.stat (poll lock)
Process1->>mkdir: mkdir(AUTH_LOCK_PATH)
mkdir-->>Process1: success (lock acquired)
Process2->>mkdir: mkdir(AUTH_LOCK_PATH)
mkdir-->>Process2: EEXIST (contention)
Process2->>stat: poll until lock released
Process1->>readCanonicalStoredAuth: read latest auth
Process1->>requestStoredAuthRefresh: refresh tokens
requestStoredAuthRefresh-->>Process1: rotated credentials
Process1->>writeStoredAuth: atomic write rotated auth
Process1->>withStoredAuthLock: rm lock dir (release)
Process2->>readCanonicalStoredAuth: read updated canonical auth
Process2-->>Process2: token fresh → skip refresh, return rotated creds
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint install failed due to a network error. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
pr-reviewer could not complete review for #1128 in AgentWorkforce/relay. |
There was a problem hiding this comment.
Code Review
This pull request implements atomic credential writes and serializes file-backed token refreshes across processes using a file-based locking mechanism in @agent-relay/cloud. It introduces a refreshAuth callback in CloudApiClient and updates the storage logic to prevent concurrent refreshes from clobbering rotated credentials. The review feedback identifies a critical Time-of-Check to Time-of-Use (TOCTOU) race condition in the lock removal logic that could defeat mutual exclusion, and suggests an optimization to avoid redundant refreshes when multiple processes concurrently wait on the lock.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| async function removeStaleStoredAuthLock(): Promise<boolean> { | ||
| try { | ||
| const lockStats = await fs.stat(AUTH_LOCK_PATH); | ||
| if (Date.now() - lockStats.mtimeMs < AUTH_LOCK_STALE_MS) { | ||
| return false; | ||
| } | ||
| } catch (error) { | ||
| if (isNodeErrorWithCode(error, 'ENOENT')) { | ||
| return true; | ||
| } | ||
| throw error; | ||
| } | ||
|
|
||
| await fs.rm(AUTH_LOCK_PATH, { recursive: true, force: true }); | ||
| return true; | ||
| } |
There was a problem hiding this comment.
There is a Time-of-Check to Time-of-Use (TOCTOU) race condition in removeStaleStoredAuthLock. If multiple concurrent processes attempt to acquire the lock, they may both determine that the lock is stale. One process will delete the lock and recreate it, while the other process, executing its delete call slightly later, will delete the newly created lock of the first process. This allows both processes to proceed as if they hold the lock, defeating mutual exclusion.
To prevent this, we can atomically rename the lock directory to a temporary path before inspecting and deleting it. If we accidentally rename a fresh lock, we can safely restore it.
async function removeStaleStoredAuthLock(): Promise<boolean> {
try {
const lockStats = await fs.stat(AUTH_LOCK_PATH);
if (Date.now() - lockStats.mtimeMs < AUTH_LOCK_STALE_MS) {
return false;
}
} catch (error) {
if (isNodeErrorWithCode(error, 'ENOENT')) {
return true;
}
throw error;
}
const tempTrashPath = `${AUTH_LOCK_PATH}.trash.${randomUUID()}`;
try {
await fs.rename(AUTH_LOCK_PATH, tempTrashPath);
} catch (error) {
if (isNodeErrorWithCode(error, 'ENOENT')) {
return true;
}
throw error;
}
try {
const trashStats = await fs.stat(tempTrashPath);
if (Date.now() - trashStats.mtimeMs >= AUTH_LOCK_STALE_MS) {
await fs.rm(tempTrashPath, { recursive: true, force: true });
return true;
} else {
try {
await fs.rename(tempTrashPath, AUTH_LOCK_PATH);
} catch {
await fs.rm(tempTrashPath, { recursive: true, force: true });
}
return false;
}
} catch (error) {
await fs.rm(tempTrashPath, { recursive: true, force: true });
if (isNodeErrorWithCode(error, 'ENOENT')) {
return true;
}
throw error;
}
}| if ( | ||
| !options.force && | ||
| latestAuth?.apiUrl === auth.apiUrl && | ||
| !shouldRefresh(latestAuth.accessTokenExpiresAt) | ||
| ) { | ||
| return latestAuth; | ||
| } |
There was a problem hiding this comment.
If multiple processes concurrently trigger a forced refresh (e.g., due to a 401 response), they will all wait for the lock. The first process will successfully refresh the token and write it to disk. When the second process acquires the lock, it will still perform a redundant refresh because options.force is true, even though the stored token has already been updated to a fresh one by the first process.
We can optimize this by checking if the stored token's accessToken has changed since the current process initiated its refresh request. If it has changed, we can safely reuse the newly updated token and skip the redundant refresh.
| if ( | |
| !options.force && | |
| latestAuth?.apiUrl === auth.apiUrl && | |
| !shouldRefresh(latestAuth.accessTokenExpiresAt) | |
| ) { | |
| return latestAuth; | |
| } | |
| if ( | |
| latestAuth?.apiUrl === auth.apiUrl && | |
| (latestAuth.accessToken !== auth.accessToken || | |
| (!options.force && !shouldRefresh(latestAuth.accessTokenExpiresAt))) | |
| ) { | |
| return latestAuth; | |
| } |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e9ed77728
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const AUTH_DIR_PATH = path.dirname(AUTH_FILE_PATH); | ||
| const AUTH_LOCK_PATH = `${AUTH_FILE_PATH}.lock`; | ||
| const AUTH_LOCK_RETRY_DELAY_MS = 50; | ||
| const AUTH_LOCK_STALE_MS = 30_000; |
There was a problem hiding this comment.
Avoid expiring live refresh locks
When a caller sets refreshTimeoutMs above 30s (the CLI exposes --refresh-timeout and forwards it into refreshStoredAuth), a healthy process can still be inside requestStoredAuthRefresh while this fixed stale window makes other processes remove its lock as stale. That reintroduces concurrent refresh/write races and can clobber rotated credentials; the stale threshold needs to account for the configured refresh timeout or be kept alive by the lock holder.
Useful? React with 👍 / 👎.
| throw error; | ||
| } | ||
|
|
||
| await fs.rm(AUTH_LOCK_PATH, { recursive: true, force: true }); |
There was a problem hiding this comment.
Avoid unlinking a replacement lock
If multiple waiters encounter an old stale lock, one waiter can stat the stale directory, another can remove it and acquire a fresh lock, and then this waiter still removes AUTH_LOCK_PATH by name. That deletes the new holder's lock and allows overlapping refreshes, so stale cleanup needs an ownership token or an identity check before deleting the lock path.
Useful? React with 👍 / 👎.
|
✅ AR-252 review verdict: APPROVE (posting as a comment — GitHub blocks a formal approval from the same bot identity that opened the PR). Reviewed against the daemon-vs-reaper concurrent-refresh failure this targets. Verified independently by checking out the branch: Correct on the point that matters. The cross-process lock spans the full read→refresh→write critical section in Other strengths: atomic pid/time/uuid temp + chmod 0600 + rename (temp cleaned in finally); env-backed auth bypasses the lock; native Non-blocking (optional, do not hold merge): (1) Cleared to merge at the maintainer's discretion (per repo policy I'm not merging). |
|
pr-reviewer could not complete review for #1128 in AgentWorkforce/relay. |
1 similar comment
|
pr-reviewer could not complete review for #1128 in AgentWorkforce/relay. |
|
pr-reviewer could not complete review for #1128 in AgentWorkforce/relay. |
2 similar comments
|
pr-reviewer could not complete review for #1128 in AgentWorkforce/relay. |
|
pr-reviewer could not complete review for #1128 in AgentWorkforce/relay. |
Summary
Tests