From b5323ff8b4d9eab0acbc55a2d8900c4532ad4e88 Mon Sep 17 00:00:00 2001 From: Arunavo Ray Date: Wed, 22 Oct 2025 08:44:22 +0530 Subject: [PATCH] chore: remove duplicate fix summary --- DUPLICATE_FIX_SUMMARY.md | 348 --------------------------------------- 1 file changed, 348 deletions(-) delete mode 100644 DUPLICATE_FIX_SUMMARY.md diff --git a/DUPLICATE_FIX_SUMMARY.md b/DUPLICATE_FIX_SUMMARY.md deleted file mode 100644 index cdae0be..0000000 --- a/DUPLICATE_FIX_SUMMARY.md +++ /dev/null @@ -1,348 +0,0 @@ -# Fix for Issue #115: Duplicate Timestamped Repositories - -## Problem Summary - -Users were experiencing hundreds/thousands of duplicate repositories with timestamps in their names (e.g., `repo-owner-1729123456789`). The issue was caused by a combination of: - -1. **Timestamp fallback** creating infinite unique names -2. **Race conditions** in concurrent mirror operations -3. **Retry logic** compounding the duplication -4. **No idempotency** checks to prevent duplicate operations - -## Root Cause Analysis - -### Primary Issue: Timestamp Fallback (Line 736-738) -```typescript -// OLD CODE (REMOVED) -const timestamp = Date.now(); -return `${baseName}-${githubOwner}-${timestamp}`; -``` - -When the `generateUniqueRepoName` function exhausted all 10 naming attempts: -- It fell back to using `Date.now()` timestamp -- Each retry got a different timestamp → unique name -- Retry logic (2 retries + exponential backoff) multiplied the problem -- Result: 100s-1000s of timestamped duplicates per repository - -### Contributing Factors - -1. **Concurrent Processing**: 3-10 repositories mirrored in parallel -2. **Periodic Sync**: Users running sync while previous sync still in progress -3. **Long-Running Mirrors**: Large repos timing out, then retrying -4. **No Race Protection**: Multiple processes could start mirroring same repo - -## Solution Implemented - -### 1. ✅ Removed Timestamp Fallback -**Location**: `src/lib/gitea.ts:817-826` - -```typescript -// NEW CODE -throw new Error( - `Unable to generate unique repository name for "${baseName}". ` + - `All ${maxAttempts} naming attempts resulted in conflicts. ` + - `Please manually resolve the naming conflict or adjust your duplicate strategy.` -); -``` - -**Impact**: -- Prevents infinite duplicate creation -- Forces manual resolution of genuine conflicts -- Provides clear error message to user - -### 2. ✅ Added Database Idempotency Check -**Location**: `src/lib/gitea.ts:207-291` - -```typescript -export const isRepoCurrentlyMirroring = async ({ - config, - repoName, - expectedLocation, -}: { - config: Partial; - repoName: string; - expectedLocation?: string; -}): Promise -``` - -**Features**: -- Checks database for repos in "mirroring" or "syncing" status -- Validates expected location to catch same-name conflicts -- **Stale operation detection**: Ignores operations stuck >2 hours -- Safe error handling: Returns `false` on errors (allows operation to proceed) - -### 3. ✅ Dual Idempotency Guards with Early Location Assignment -**Locations**: -- `mirrorGithubRepoToGitea`: Lines 363-378 & 447-473 -- `mirrorGitHubRepoToGiteaOrg`: Lines 876-891 & 984-1010 - -```typescript -// FIRST CHECK: Before starting mirror operation -const expectedLocation = `${repoOwner}/${targetRepoName}`; -const isCurrentlyMirroring = await isRepoCurrentlyMirroring({ - config, - repoName: targetRepoName, - expectedLocation, -}); - -if (isCurrentlyMirroring) { - console.log(`[Idempotency] Skipping - already being mirrored`); - return; -} - -// ... check if repo exists in Gitea ... - -// SECOND CHECK: Right before setting "mirroring" status -const finalCheck = await isRepoCurrentlyMirroring({ - config, - repoName: targetRepoName, - expectedLocation, -}); - -if (finalCheck) { - console.log(`[Idempotency] Race condition detected. Skipping.`); - return; -} - -// CRITICAL FIX: Set mirroredLocation IMMEDIATELY when setting status -// This allows idempotency checks to detect concurrent operations on first mirror -await db.update(repositories).set({ - status: "mirroring", - mirroredLocation: expectedLocation, // ← Set target location NOW - updatedAt: new Date(), -}); -``` - -**Why This Works**: -- **First check**: Catches obvious duplicates early (saves resources) -- **Second check**: Catches race conditions in the tiny window between first check and status update -- **Early location set**: Makes `mirroredLocation` the "target location" so concurrent checks can match it -- **Together**: Provides bulletproof protection against concurrent operations - -**Critical Fix Explained**: -Previously, `mirroredLocation` was only set AFTER successful mirror completion. This meant: -- Process A sets status="mirroring", mirroredLocation="" -- Process B checks and finds status="mirroring" BUT mirroredLocation="" ≠ expectedLocation -- Process B proceeds anyway, creating duplicate! - -Now we set `mirroredLocation` immediately, making it the "intent" field that blocks concurrent operations. - -### 4. ✅ Stale Operation Recovery -**Location**: `src/lib/gitea.ts:240-264` - -```typescript -const TWO_HOURS_MS = 2 * 60 * 60 * 1000; -const activeRepos = inProgressRepos.filter((repo) => { - if (!repo.updatedAt) return true; - const updatedTime = new Date(repo.updatedAt).getTime(); - const isStale = (now - updatedTime) > TWO_HOURS_MS; - - if (isStale) { - console.warn(`Repository has been stuck for over 2 hours. Allowing retry.`); - } - - return !isStale; -}); -``` - -**Prevents**: -- Repos permanently stuck in "mirroring" status after crashes -- Blocking legitimate retries for genuinely failed operations -- Manual database cleanup requirement - -## Testing - -### Regression Testing -✅ All 11 existing tests in `gitea.test.ts` pass -- No breaking changes to existing functionality -- Backward compatible implementation - -### Edge Cases Handled -1. ✅ Missing `userId` in config → Returns `false` (safe) -2. ✅ Database query failure → Caught and returns `false` (safe) -3. ✅ Stale operations (>2h) → Allows retry -4. ✅ Race conditions → Dual-check system prevents duplicates -5. ✅ Concurrent operations → Database queries are atomic -6. ✅ Error in `generateUniqueRepoName` → Propagates up, repo marked "failed" - -## Expected Behavior After Fix - -### Scenario 1: First Mirror Attempt -1. Check if repo is being mirrored → NO -2. Check if repo exists in Gitea → NO -3. Double-check before status update → NO -4. Set status to "mirroring" -5. ✅ Mirror proceeds normally - -### Scenario 2: Concurrent Mirror Attempts (Race Condition) -1. **Process A**: First check → NO (not being mirrored) -2. **Process B**: First check → NO (not being mirrored yet) -3. **Process A**: Double-check → NO, sets status to "mirroring" -4. **Process B**: Double-check → **YES** (Process A set status) -5. **Process B**: Skips with log message -6. ✅ Only one mirror operation proceeds - -### Scenario 3: Retry After Failure -1. Previous attempt failed: - - Status changed from "mirroring" → "failed" - - mirroredLocation remains set (e.g., "org/repo") -2. User clicks retry -3. Check if currently being mirrored: - - Query filters by status="mirroring" OR status="syncing" - - Repo has status="failed" → NOT matched - - Returns FALSE -4. ✅ Retry proceeds normally -5. Sets status="mirroring", mirroredLocation="org/repo" -6. ✅ Now protected from concurrent retries - -### Scenario 4: Stale Operation Recovery -1. Previous mirror attempt crashed 3 hours ago (still status="mirroring") -2. New sync attempt starts -3. Check detects "mirroring" but sees it's > 2 hours old -4. Considers it stale and returns `false` -5. ✅ New mirror proceeds (recovers stuck operation) - -### Scenario 5: Naming Conflict (10 Attempts Exhausted) -1. Tries: `repo-owner`, `repo-owner-1`, ..., `repo-owner-9` -2. All 10 attempts conflict with existing repos -3. ❌ Throws error with clear message -4. User sees: "Unable to generate unique repository name..." -5. ✅ No timestamp duplicates created - -## Upgrade Path - -### For Users With Existing Duplicates -1. The fix prevents NEW duplicates from being created -2. Existing timestamped duplicates will remain -3. Recommended cleanup process: - ```bash - # Identify timestamped repos - # Pattern: *-1234567890* (ends with timestamp) - - # Keep only the latest timestamp version - # Delete older duplicates manually via Gitea UI or API - ``` - -### Configuration Recommendations -- **Mirror Strategy**: Review `starredDuplicateStrategy` setting -- **Concurrency**: Reduce if still experiencing conflicts -- **Sync Frequency**: Ensure previous sync completes before starting new one - -## Logging & Monitoring - -### New Log Messages -``` -[Idempotency] Repository X is already being mirrored at Y -[Idempotency] Skipping X - already being mirrored to Y -[Idempotency] Race condition detected - X is now being mirrored by another process -[Idempotency] Repository X has been in "mirroring" status for over 2 hours -[Idempotency] All in-progress operations for X are stale (>2h). Allowing retry. -``` - -### Error Messages -``` -Unable to generate unique repository name for "X". -All 10 naming attempts resulted in conflicts. -Please manually resolve the naming conflict or adjust your duplicate strategy. -``` - -## Performance Impact - -- **Minimal overhead**: 2 additional database queries per mirror operation -- **Reduced load**: Prevents hundreds of unnecessary Gitea API calls -- **Improved reliability**: Fewer failed operations due to conflicts -- **Faster completion**: Skips duplicate work immediately - -## Breaking Changes - -**None** - This is a backward-compatible fix that only changes error behavior: -- Before: Created timestamped duplicates on conflict -- After: Throws error on genuine conflicts - -## Code Quality - -- ✅ Follows existing code patterns in the project -- ✅ Consistent error handling with `createSecureErrorResponse` -- ✅ Clear, descriptive logging for debugging -- ✅ Comprehensive inline documentation -- ✅ No hardcoded values (timeout configurable via constant) - -## Critical Review Finding & Fix - -### Issue Discovered During Code Review - -During thorough code review, a critical flaw was identified in the initial idempotency implementation: - -**The Problem:** -- Idempotency check relied on matching `mirroredLocation === expectedLocation` -- But `mirroredLocation` was only set AFTER successful mirror completion -- On first mirror attempt, `mirroredLocation` was `""` (empty) -- Concurrent processes could both see status="mirroring" but fail the location match -- Result: Race condition still existed! - -**The Timeline:** -``` -t=0ms: Process A checks → mirroredLocation="" → No match → Proceeds -t=5ms: Process A sets status="mirroring", mirroredLocation="" (still empty!) -t=10ms: Process B checks → Finds status="mirroring" -t=11ms: Process B checks mirroredLocation: "" === "org/repo"? → FALSE -t=12ms: Process B proceeds anyway → DUPLICATE CREATED! -``` - -**The Fix:** -Set `mirroredLocation = expectedLocation` immediately when setting status to "mirroring": - -```typescript -await db.update(repositories).set({ - status: "mirroring", - mirroredLocation: expectedLocation, // Critical addition - updatedAt: new Date(), -}); -``` - -**Semantic Change:** -- **Before**: `mirroredLocation` = confirmed location (set after success) -- **After**: `mirroredLocation` = target/intent location (set at start) - -This is acceptable because: -1. Status field still indicates success/failure -2. Failed mirrors have status="failed" making the location irrelevant -3. Actually improves debugging (can see WHERE mirror was attempted) -4. Required for idempotency checks to work correctly - -### Files Modified -- `src/lib/gitea.ts:466-473` - mirrorGithubRepoToGitea -- `src/lib/gitea.ts:1003-1010` - mirrorGitHubRepoToGiteaOrg - -## Related Issues - -- Fixes #115: "Large quantity of timestamped duplicate repos" -- Prevents future occurrences of similar race conditions -- Improves overall system reliability and data integrity - -## Future Improvements (Optional) - -1. **Database-level unique constraint**: Add unique index on (userId, name, mirroredLocation) -2. **Configurable timeout**: Make 2-hour stale timeout user-configurable -3. **Automatic cleanup**: Background job to detect and merge/delete duplicates -4. **Metrics**: Track duplicate prevention statistics -5. **Admin dashboard**: Show currently in-progress mirrors - -## Verification Steps - -To verify the fix is working: - -1. **Check logs** for `[Idempotency]` messages -2. **Monitor database** for repos stuck in "mirroring" status -3. **Run concurrent syncs** - should see skip messages -4. **Trigger naming conflicts** - should see error messages (not timestamps) -5. **Check Gitea** - no new timestamped repositories created - ---- - -**Branch**: `fix/duplicate-repos-issue-115` -**Files Changed**: `src/lib/gitea.ts` -**Lines Added**: ~140 -**Lines Removed**: ~3 -**Tests**: 11/11 passing -**Status**: ✅ Ready for review