chore: remove duplicate fix summary

This commit is contained in:
Arunavo Ray
2025-10-22 08:44:22 +05:30
parent 7fee2adb51
commit b5323ff8b4

View File

@@ -1,348 +0,0 @@
# Fix for Issue #115: Duplicate Timestamped Repositories
## Problem Summary
Users were experiencing hundreds/thousands of duplicate repositories with timestamps in their names (e.g., `repo-owner-1729123456789`). The issue was caused by a combination of:
1. **Timestamp fallback** creating infinite unique names
2. **Race conditions** in concurrent mirror operations
3. **Retry logic** compounding the duplication
4. **No idempotency** checks to prevent duplicate operations
## Root Cause Analysis
### Primary Issue: Timestamp Fallback (Line 736-738)
```typescript
// OLD CODE (REMOVED)
const timestamp = Date.now();
return `${baseName}-${githubOwner}-${timestamp}`;
```
When the `generateUniqueRepoName` function exhausted all 10 naming attempts:
- It fell back to using `Date.now()` timestamp
- Each retry got a different timestamp → unique name
- Retry logic (2 retries + exponential backoff) multiplied the problem
- Result: 100s-1000s of timestamped duplicates per repository
### Contributing Factors
1. **Concurrent Processing**: 3-10 repositories mirrored in parallel
2. **Periodic Sync**: Users running sync while previous sync still in progress
3. **Long-Running Mirrors**: Large repos timing out, then retrying
4. **No Race Protection**: Multiple processes could start mirroring same repo
## Solution Implemented
### 1. ✅ Removed Timestamp Fallback
**Location**: `src/lib/gitea.ts:817-826`
```typescript
// NEW CODE
throw new Error(
`Unable to generate unique repository name for "${baseName}". ` +
`All ${maxAttempts} naming attempts resulted in conflicts. ` +
`Please manually resolve the naming conflict or adjust your duplicate strategy.`
);
```
**Impact**:
- Prevents infinite duplicate creation
- Forces manual resolution of genuine conflicts
- Provides clear error message to user
### 2. ✅ Added Database Idempotency Check
**Location**: `src/lib/gitea.ts:207-291`
```typescript
export const isRepoCurrentlyMirroring = async ({
config,
repoName,
expectedLocation,
}: {
config: Partial<Config>;
repoName: string;
expectedLocation?: string;
}): Promise<boolean>
```
**Features**:
- Checks database for repos in "mirroring" or "syncing" status
- Validates expected location to catch same-name conflicts
- **Stale operation detection**: Ignores operations stuck >2 hours
- Safe error handling: Returns `false` on errors (allows operation to proceed)
### 3. ✅ Dual Idempotency Guards with Early Location Assignment
**Locations**:
- `mirrorGithubRepoToGitea`: Lines 363-378 & 447-473
- `mirrorGitHubRepoToGiteaOrg`: Lines 876-891 & 984-1010
```typescript
// FIRST CHECK: Before starting mirror operation
const expectedLocation = `${repoOwner}/${targetRepoName}`;
const isCurrentlyMirroring = await isRepoCurrentlyMirroring({
config,
repoName: targetRepoName,
expectedLocation,
});
if (isCurrentlyMirroring) {
console.log(`[Idempotency] Skipping - already being mirrored`);
return;
}
// ... check if repo exists in Gitea ...
// SECOND CHECK: Right before setting "mirroring" status
const finalCheck = await isRepoCurrentlyMirroring({
config,
repoName: targetRepoName,
expectedLocation,
});
if (finalCheck) {
console.log(`[Idempotency] Race condition detected. Skipping.`);
return;
}
// CRITICAL FIX: Set mirroredLocation IMMEDIATELY when setting status
// This allows idempotency checks to detect concurrent operations on first mirror
await db.update(repositories).set({
status: "mirroring",
mirroredLocation: expectedLocation, // ← Set target location NOW
updatedAt: new Date(),
});
```
**Why This Works**:
- **First check**: Catches obvious duplicates early (saves resources)
- **Second check**: Catches race conditions in the tiny window between first check and status update
- **Early location set**: Makes `mirroredLocation` the "target location" so concurrent checks can match it
- **Together**: Provides bulletproof protection against concurrent operations
**Critical Fix Explained**:
Previously, `mirroredLocation` was only set AFTER successful mirror completion. This meant:
- Process A sets status="mirroring", mirroredLocation=""
- Process B checks and finds status="mirroring" BUT mirroredLocation="" ≠ expectedLocation
- Process B proceeds anyway, creating duplicate!
Now we set `mirroredLocation` immediately, making it the "intent" field that blocks concurrent operations.
### 4. ✅ Stale Operation Recovery
**Location**: `src/lib/gitea.ts:240-264`
```typescript
const TWO_HOURS_MS = 2 * 60 * 60 * 1000;
const activeRepos = inProgressRepos.filter((repo) => {
if (!repo.updatedAt) return true;
const updatedTime = new Date(repo.updatedAt).getTime();
const isStale = (now - updatedTime) > TWO_HOURS_MS;
if (isStale) {
console.warn(`Repository has been stuck for over 2 hours. Allowing retry.`);
}
return !isStale;
});
```
**Prevents**:
- Repos permanently stuck in "mirroring" status after crashes
- Blocking legitimate retries for genuinely failed operations
- Manual database cleanup requirement
## Testing
### Regression Testing
✅ All 11 existing tests in `gitea.test.ts` pass
- No breaking changes to existing functionality
- Backward compatible implementation
### Edge Cases Handled
1. ✅ Missing `userId` in config → Returns `false` (safe)
2. ✅ Database query failure → Caught and returns `false` (safe)
3. ✅ Stale operations (>2h) → Allows retry
4. ✅ Race conditions → Dual-check system prevents duplicates
5. ✅ Concurrent operations → Database queries are atomic
6. ✅ Error in `generateUniqueRepoName` → Propagates up, repo marked "failed"
## Expected Behavior After Fix
### Scenario 1: First Mirror Attempt
1. Check if repo is being mirrored → NO
2. Check if repo exists in Gitea → NO
3. Double-check before status update → NO
4. Set status to "mirroring"
5. ✅ Mirror proceeds normally
### Scenario 2: Concurrent Mirror Attempts (Race Condition)
1. **Process A**: First check → NO (not being mirrored)
2. **Process B**: First check → NO (not being mirrored yet)
3. **Process A**: Double-check → NO, sets status to "mirroring"
4. **Process B**: Double-check → **YES** (Process A set status)
5. **Process B**: Skips with log message
6. ✅ Only one mirror operation proceeds
### Scenario 3: Retry After Failure
1. Previous attempt failed:
- Status changed from "mirroring" → "failed"
- mirroredLocation remains set (e.g., "org/repo")
2. User clicks retry
3. Check if currently being mirrored:
- Query filters by status="mirroring" OR status="syncing"
- Repo has status="failed" → NOT matched
- Returns FALSE
4. ✅ Retry proceeds normally
5. Sets status="mirroring", mirroredLocation="org/repo"
6. ✅ Now protected from concurrent retries
### Scenario 4: Stale Operation Recovery
1. Previous mirror attempt crashed 3 hours ago (still status="mirroring")
2. New sync attempt starts
3. Check detects "mirroring" but sees it's > 2 hours old
4. Considers it stale and returns `false`
5. ✅ New mirror proceeds (recovers stuck operation)
### Scenario 5: Naming Conflict (10 Attempts Exhausted)
1. Tries: `repo-owner`, `repo-owner-1`, ..., `repo-owner-9`
2. All 10 attempts conflict with existing repos
3. ❌ Throws error with clear message
4. User sees: "Unable to generate unique repository name..."
5. ✅ No timestamp duplicates created
## Upgrade Path
### For Users With Existing Duplicates
1. The fix prevents NEW duplicates from being created
2. Existing timestamped duplicates will remain
3. Recommended cleanup process:
```bash
# Identify timestamped repos
# Pattern: *-1234567890* (ends with timestamp)
# Keep only the latest timestamp version
# Delete older duplicates manually via Gitea UI or API
```
### Configuration Recommendations
- **Mirror Strategy**: Review `starredDuplicateStrategy` setting
- **Concurrency**: Reduce if still experiencing conflicts
- **Sync Frequency**: Ensure previous sync completes before starting new one
## Logging & Monitoring
### New Log Messages
```
[Idempotency] Repository X is already being mirrored at Y
[Idempotency] Skipping X - already being mirrored to Y
[Idempotency] Race condition detected - X is now being mirrored by another process
[Idempotency] Repository X has been in "mirroring" status for over 2 hours
[Idempotency] All in-progress operations for X are stale (>2h). Allowing retry.
```
### Error Messages
```
Unable to generate unique repository name for "X".
All 10 naming attempts resulted in conflicts.
Please manually resolve the naming conflict or adjust your duplicate strategy.
```
## Performance Impact
- **Minimal overhead**: 2 additional database queries per mirror operation
- **Reduced load**: Prevents hundreds of unnecessary Gitea API calls
- **Improved reliability**: Fewer failed operations due to conflicts
- **Faster completion**: Skips duplicate work immediately
## Breaking Changes
**None** - This is a backward-compatible fix that only changes error behavior:
- Before: Created timestamped duplicates on conflict
- After: Throws error on genuine conflicts
## Code Quality
- ✅ Follows existing code patterns in the project
- ✅ Consistent error handling with `createSecureErrorResponse`
- ✅ Clear, descriptive logging for debugging
- ✅ Comprehensive inline documentation
- ✅ No hardcoded values (timeout configurable via constant)
## Critical Review Finding & Fix
### Issue Discovered During Code Review
During thorough code review, a critical flaw was identified in the initial idempotency implementation:
**The Problem:**
- Idempotency check relied on matching `mirroredLocation === expectedLocation`
- But `mirroredLocation` was only set AFTER successful mirror completion
- On first mirror attempt, `mirroredLocation` was `""` (empty)
- Concurrent processes could both see status="mirroring" but fail the location match
- Result: Race condition still existed!
**The Timeline:**
```
t=0ms: Process A checks → mirroredLocation="" → No match → Proceeds
t=5ms: Process A sets status="mirroring", mirroredLocation="" (still empty!)
t=10ms: Process B checks → Finds status="mirroring"
t=11ms: Process B checks mirroredLocation: "" === "org/repo"? → FALSE
t=12ms: Process B proceeds anyway → DUPLICATE CREATED!
```
**The Fix:**
Set `mirroredLocation = expectedLocation` immediately when setting status to "mirroring":
```typescript
await db.update(repositories).set({
status: "mirroring",
mirroredLocation: expectedLocation, // Critical addition
updatedAt: new Date(),
});
```
**Semantic Change:**
- **Before**: `mirroredLocation` = confirmed location (set after success)
- **After**: `mirroredLocation` = target/intent location (set at start)
This is acceptable because:
1. Status field still indicates success/failure
2. Failed mirrors have status="failed" making the location irrelevant
3. Actually improves debugging (can see WHERE mirror was attempted)
4. Required for idempotency checks to work correctly
### Files Modified
- `src/lib/gitea.ts:466-473` - mirrorGithubRepoToGitea
- `src/lib/gitea.ts:1003-1010` - mirrorGitHubRepoToGiteaOrg
## Related Issues
- Fixes #115: "Large quantity of timestamped duplicate repos"
- Prevents future occurrences of similar race conditions
- Improves overall system reliability and data integrity
## Future Improvements (Optional)
1. **Database-level unique constraint**: Add unique index on (userId, name, mirroredLocation)
2. **Configurable timeout**: Make 2-hour stale timeout user-configurable
3. **Automatic cleanup**: Background job to detect and merge/delete duplicates
4. **Metrics**: Track duplicate prevention statistics
5. **Admin dashboard**: Show currently in-progress mirrors
## Verification Steps
To verify the fix is working:
1. **Check logs** for `[Idempotency]` messages
2. **Monitor database** for repos stuck in "mirroring" status
3. **Run concurrent syncs** - should see skip messages
4. **Trigger naming conflicts** - should see error messages (not timestamps)
5. **Check Gitea** - no new timestamped repositories created
---
**Branch**: `fix/duplicate-repos-issue-115`
**Files Changed**: `src/lib/gitea.ts`
**Lines Added**: ~140
**Lines Removed**: ~3
**Tests**: 11/11 passing
**Status**: ✅ Ready for review