feat: Implement comprehensive job recovery and resume process improvements

- Added a startup recovery script to handle interrupted jobs before application startup.
- Enhanced recovery system with database connection validation and stale job cleanup.
- Improved middleware to check for recovery needs and handle recovery during requests.
- Updated health check endpoint to include recovery system status and metrics.
- Introduced test scripts for verifying recovery functionality and job state management.
- Enhanced logging and error handling throughout the recovery process.
Author: Arunavo Ray
Date: 2025-05-24 13:45:25 +05:30
Parent: 98610482ae
Commit: a988be1028
10 changed files with 1006 additions and 111 deletions

docker-entrypoint.sh

@@ -250,6 +250,30 @@ else
echo "Consider setting up external scheduled tasks to run cleanup scripts." echo "Consider setting up external scheduled tasks to run cleanup scripts."
fi fi
# Run startup recovery to handle any interrupted jobs
echo "Running startup recovery..."
if [ -f "dist/scripts/startup-recovery.js" ]; then
echo "Running startup recovery using compiled script..."
bun dist/scripts/startup-recovery.js --timeout=30000
RECOVERY_EXIT_CODE=$?
elif [ -f "scripts/startup-recovery.ts" ]; then
echo "Running startup recovery using TypeScript script..."
bun scripts/startup-recovery.ts --timeout=30000
RECOVERY_EXIT_CODE=$?
else
echo "Warning: Startup recovery script not found. Skipping recovery."
RECOVERY_EXIT_CODE=0
fi
# Log recovery result
if [ $RECOVERY_EXIT_CODE -eq 0 ]; then
echo "✅ Startup recovery completed successfully"
elif [ $RECOVERY_EXIT_CODE -eq 1 ]; then
echo "⚠️ Startup recovery completed with warnings"
else
echo "❌ Startup recovery failed with exit code $RECOVERY_EXIT_CODE"
fi
# Start the application
echo "Starting Gitea Mirror..."
exec bun ./dist/server/entry.mjs


@@ -0,0 +1,170 @@
# Job Recovery and Resume Process Improvements
This document outlines the comprehensive improvements made to the job recovery and resume process to make it more robust to application restarts, container restarts, and application crashes.
## Problems Addressed
The original recovery system had several critical issues:
1. **Middleware-based initialization**: Recovery only ran when the first request came in
2. **Database connection issues**: No validation of database connectivity before recovery attempts
3. **Limited error handling**: Insufficient error handling for various failure scenarios
4. **No startup recovery**: No mechanism to handle recovery before serving requests
5. **Incomplete job state management**: Jobs could remain in inconsistent states
6. **No retry mechanisms**: Single-attempt recovery with no fallback strategies
## Improvements Implemented
### 1. Enhanced Recovery System (`src/lib/recovery.ts`)
#### New Features:
- **Database connection validation** before attempting recovery
- **Stale job cleanup** for jobs older than 24 hours
- **Retry mechanisms** with configurable attempts and delays
- **Individual job error handling** to prevent one failed job from stopping recovery
- **Recovery state tracking** to prevent concurrent recovery attempts
- **Enhanced logging** with detailed job information
#### Key Functions:
- `initializeRecovery()` - Main recovery function with enhanced error handling
- `validateDatabaseConnection()` - Ensures database is accessible
- `cleanupStaleJobs()` - Removes jobs that are too old to recover
- `getRecoveryStatus()` - Returns current recovery system status
- `forceRecovery()` - Bypasses recent attempt checks
- `hasJobsNeedingRecovery()` - Checks if recovery is needed (a combined usage sketch follows this list)
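For example, these helpers can be combined to trigger a forced run only when it is safe to do so. A minimal sketch with an illustrative helper name, assuming the same `@/lib/recovery` module path used in the examples later in this document:
```typescript
import { getRecoveryStatus, hasJobsNeedingRecovery, forceRecovery } from '@/lib/recovery';

// Illustrative helper: trigger a forced recovery run only when jobs are pending
// and no other run is already active.
async function recoverIfIdle(): Promise<boolean> {
  const status = getRecoveryStatus();
  if (status.inProgress) return false;                 // another attempt is already running
  if (!(await hasJobsNeedingRecovery())) return true;  // nothing to recover
  return forceRecovery();                              // bypasses the recent-attempt check
}
```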
### 2. Startup Recovery Script (`scripts/startup-recovery.ts`)
A dedicated script that runs recovery before the application starts serving requests:
#### Features:
- **Timeout protection** (default: 30 seconds)
- **Force recovery option** to bypass recent attempt checks
- **Graceful signal handling** (SIGINT, SIGTERM)
- **Detailed logging** with progress indicators
- **Exit codes** for different scenarios (success, warnings, errors)
#### Usage:
```bash
bun scripts/startup-recovery.ts [--force] [--timeout=30000]
```
### 3. Improved Middleware (`src/middleware.ts`)
The middleware now serves as a fallback recovery mechanism:
#### Changes:
- **Checks if recovery is needed** before attempting
- **Shorter timeout** (15 seconds) for request-time recovery
- **Better error handling** with status logging
- **Prevents multiple attempts** with proper state tracking (the overall pattern is sketched below)
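In essence the fallback is a check-then-race pattern: only attempt recovery when it is actually needed, and never let it block a request for more than 15 seconds. A simplified sketch of the approach with an illustrative helper name (not the full middleware):
```typescript
import { initializeRecovery, hasJobsNeedingRecovery } from '@/lib/recovery';

// Request-time fallback: skip if nothing is pending, otherwise bound the attempt to 15 seconds.
async function recoverOnFirstRequest(): Promise<void> {
  if (!(await hasJobsNeedingRecovery())) return;

  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('Middleware recovery timeout')), 15000)
  );

  await Promise.race([
    initializeRecovery({ skipIfRecentAttempt: true, maxRetries: 2, retryDelay: 3000 }),
    timeout,
  ]);
}
```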
### 4. Enhanced Database Queries (`src/lib/helpers.ts`)
#### Improvements:
- **Proper Drizzle ORM syntax** for all database queries
- **Enhanced interrupted job detection** with multiple criteria:
- Jobs with no recent checkpoint (10+ minutes)
- Jobs running too long (2+ hours)
- **Detailed logging** of found interrupted jobs
- **Better error handling** for database operations (the detection query is sketched below)
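The two criteria translate into a single Drizzle query. A condensed sketch, assuming the `mirrorJobs` table exposes `inProgress`, `lastCheckpoint`, and `startedAt` columns as in this commit:
```typescript
import { and, eq, isNull, lt, or } from 'drizzle-orm';
import { db, mirrorJobs } from '@/lib/db';

// A job is considered interrupted if it is still marked in progress and either
// has no checkpoint in the last 10 minutes or started more than 2 hours ago.
async function findInterrupted() {
  const checkpointCutoff = new Date(Date.now() - 10 * 60 * 1000);
  const staleCutoff = new Date(Date.now() - 2 * 60 * 60 * 1000);

  return db
    .select()
    .from(mirrorJobs)
    .where(
      and(
        eq(mirrorJobs.inProgress, true),
        or(
          isNull(mirrorJobs.lastCheckpoint),
          lt(mirrorJobs.lastCheckpoint, checkpointCutoff),
          lt(mirrorJobs.startedAt, staleCutoff)
        )
      )
    );
}
```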
### 5. Docker Integration (`docker-entrypoint.sh`)
#### Changes:
- **Automatic startup recovery** runs before application start
- **Exit code handling** with appropriate logging
- **Fallback mechanisms** if recovery script is not found
- **Non-blocking execution** - application starts even if recovery fails
### 6. Health Check Integration (`src/pages/api/health.ts`)
#### New Features:
- **Recovery system status** in health endpoint
- **Job recovery metrics** (jobs needing recovery, recovery in progress)
- **Overall health status** considers recovery state
- **Detailed recovery information** for monitoring (an example payload is shown below)
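The recovery block added to the health payload has roughly the following shape (field names follow the handler changes later in this commit; the values are illustrative):
```typescript
// Shape of the `recovery` field returned by GET /api/health.
interface RecoveryHealth {
  status: 'healthy' | 'jobs-pending' | 'error';
  inProgress: boolean;
  lastAttempt: string | null;   // ISO timestamp of the last recovery attempt, or null
  jobsNeedingRecovery: number;  // -1 when the status check itself failed
  message: string;
}

const example: RecoveryHealth = {
  status: 'jobs-pending',
  inProgress: false,
  lastAttempt: '2025-05-24T08:15:25.000Z',
  jobsNeedingRecovery: 1,
  message: 'Jobs found that need recovery',
};
```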
### 7. Testing Infrastructure (`scripts/test-recovery.ts`)
A comprehensive test script to verify recovery functionality:
#### Features:
- **Creates test interrupted jobs** with realistic scenarios
- **Verifies recovery detection** and execution
- **Checks final job states** after recovery
- **Cleanup functionality** for test data
- **Comprehensive logging** of test progress
## Configuration Options
### Recovery System Options:
- `maxRetries`: Number of recovery attempts (default: 3)
- `retryDelay`: Delay between attempts in ms (default: 5000)
- `skipIfRecentAttempt`: Skip if recent attempt made (default: true)
### Startup Recovery Options:
- `--force`: Force recovery even if recent attempt was made
- `--timeout`: Maximum time to wait for recovery (default: 30000ms)
## Usage Examples
### Manual Recovery:
```bash
# Run startup recovery
bun run startup-recovery
# Force recovery
bun run startup-recovery-force
# Test recovery system
bun run test-recovery
# Clean up test data
bun run test-recovery-cleanup
```
### Programmatic Usage:
```typescript
import { initializeRecovery, hasJobsNeedingRecovery } from '@/lib/recovery';
// Check if recovery is needed
const needsRecovery = await hasJobsNeedingRecovery();
// Run recovery with custom options
const success = await initializeRecovery({
maxRetries: 5,
retryDelay: 3000,
skipIfRecentAttempt: false
});
```
## Monitoring and Observability
### Health Check Endpoint:
- **URL**: `/api/health`
- **Recovery Status**: Included in response
- **Monitoring**: Can be used with external monitoring systems (see the probe example below)
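For example, an external monitor could poll the endpoint and alert on anything other than an `ok` status. A minimal probe sketch; the base URL is an assumption and should point at your deployment:
```typescript
// Minimal health probe: exits non-zero unless the overall status is "ok".
// HEALTH_URL is an assumption — substitute your deployment's address and port.
const baseUrl = process.env.HEALTH_URL ?? 'http://localhost:4321';

const res = await fetch(`${baseUrl}/api/health`);
const health = await res.json();

if (health.status !== 'ok') {
  console.error(`gitea-mirror health is ${health.status}: ${health.recovery?.message ?? ''}`);
  process.exit(1);
}
```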
### Log Messages:
- **Startup**: Clear indicators of recovery attempts and results
- **Progress**: Detailed logging of recovery steps
- **Errors**: Comprehensive error information for debugging
## Benefits
1. **Reliability**: Jobs are automatically recovered after application restarts
2. **Resilience**: Multiple retry mechanisms and fallback strategies
3. **Observability**: Comprehensive logging and health check integration
4. **Performance**: Efficient detection and processing of interrupted jobs
5. **Maintainability**: Clear separation of concerns and modular design
6. **Testing**: Built-in testing infrastructure for verification
## Migration Notes
- **Backward Compatible**: All existing functionality is preserved
- **Automatic**: Recovery runs automatically on startup
- **Configurable**: All timeouts and retry counts can be adjusted
- **Monitoring**: Health checks now include recovery status
This comprehensive improvement ensures that the gitea-mirror application can reliably handle job recovery in all deployment scenarios, from development to production container environments.

package.json

@@ -20,6 +20,10 @@
"cleanup-events": "bun scripts/cleanup-events.ts", "cleanup-events": "bun scripts/cleanup-events.ts",
"cleanup-jobs": "bun scripts/cleanup-mirror-jobs.ts", "cleanup-jobs": "bun scripts/cleanup-mirror-jobs.ts",
"cleanup-all": "bun scripts/cleanup-events.ts && bun scripts/cleanup-mirror-jobs.ts", "cleanup-all": "bun scripts/cleanup-events.ts && bun scripts/cleanup-mirror-jobs.ts",
"startup-recovery": "bun scripts/startup-recovery.ts",
"startup-recovery-force": "bun scripts/startup-recovery.ts --force",
"test-recovery": "bun scripts/test-recovery.ts",
"test-recovery-cleanup": "bun scripts/test-recovery.ts --cleanup",
"preview": "bunx --bun astro preview", "preview": "bunx --bun astro preview",
"start": "bun dist/server/entry.mjs", "start": "bun dist/server/entry.mjs",
"start:fresh": "bun run cleanup-db && bun run manage-db init && bun run update-db && bun dist/server/entry.mjs", "start:fresh": "bun run cleanup-db && bun run manage-db init && bun run update-db && bun dist/server/entry.mjs",


@@ -118,6 +118,27 @@ bun scripts/fix-interrupted-jobs.ts <userId>
Use this script if you're having trouble cleaning up activities due to "interrupted" jobs that won't delete.
### Startup Recovery (startup-recovery.ts)
Runs job recovery during application startup to handle any interrupted jobs from previous runs.
```bash
# Run startup recovery (normal mode)
bun scripts/startup-recovery.ts
# Force recovery even if recent attempt was made
bun scripts/startup-recovery.ts --force
# Set custom timeout (default: 30000ms)
bun scripts/startup-recovery.ts --timeout=60000
# Using npm scripts
bun run startup-recovery
bun run startup-recovery-force
```
This script is automatically run by the Docker entrypoint during container startup. It ensures that any jobs interrupted by container restarts or application crashes are properly recovered or marked as failed.
## Deployment Scripts
### Docker Deployment

scripts/startup-recovery.ts (new file, 113 lines)

@@ -0,0 +1,113 @@
#!/usr/bin/env bun
/**
* Startup recovery script
* This script runs job recovery before the application starts serving requests
* It ensures that any interrupted jobs from previous runs are properly handled
*
* Usage:
* bun scripts/startup-recovery.ts [--force] [--timeout=30000]
*
* Options:
* --force: Force recovery even if a recent attempt was made
* --timeout: Maximum time to wait for recovery (in milliseconds, default: 30000)
*/
import { initializeRecovery, hasJobsNeedingRecovery, getRecoveryStatus } from "../src/lib/recovery";
// Parse command line arguments
const args = process.argv.slice(2);
const forceRecovery = args.includes('--force');
const timeoutArg = args.find(arg => arg.startsWith('--timeout='));
const timeout = timeoutArg ? parseInt(timeoutArg.split('=')[1], 10) : 30000;
if (isNaN(timeout) || timeout < 1000) {
console.error("Error: Timeout must be at least 1000ms");
process.exit(1);
}
async function runStartupRecovery() {
console.log('=== Gitea Mirror Startup Recovery ===');
console.log(`Timeout: ${timeout}ms`);
console.log(`Force recovery: ${forceRecovery}`);
console.log('');
const startTime = Date.now();
try {
// Set up timeout
const timeoutPromise = new Promise<never>((_, reject) => {
setTimeout(() => {
reject(new Error(`Recovery timeout after ${timeout}ms`));
}, timeout);
});
// Check if recovery is needed first
console.log('Checking if recovery is needed...');
const needsRecovery = await hasJobsNeedingRecovery();
if (!needsRecovery) {
console.log('✅ No jobs need recovery. Startup can proceed.');
process.exit(0);
}
console.log('⚠️ Jobs found that need recovery. Starting recovery process...');
// Run recovery with timeout
const recoveryPromise = initializeRecovery({
skipIfRecentAttempt: !forceRecovery,
maxRetries: 3,
retryDelay: 5000,
});
const recoveryResult = await Promise.race([recoveryPromise, timeoutPromise]);
const endTime = Date.now();
const duration = endTime - startTime;
if (recoveryResult) {
console.log(`✅ Recovery completed successfully in ${duration}ms`);
console.log('Application startup can proceed.');
process.exit(0);
} else {
console.log(`⚠️ Recovery completed with some failures in ${duration}ms`);
console.log('Application startup can proceed, but some jobs may have failed.');
process.exit(0);
}
} catch (error) {
const endTime = Date.now();
const duration = endTime - startTime;
if (error instanceof Error && error.message.includes('timeout')) {
console.error(`❌ Recovery timed out after ${duration}ms`);
console.error('Application will start anyway, but some jobs may remain interrupted.');
// Get current recovery status
const status = getRecoveryStatus();
console.log('Recovery status:', status);
// Exit with warning code but allow startup to continue
process.exit(1);
} else {
console.error(`❌ Recovery failed after ${duration}ms:`, error);
console.error('Application will start anyway, but recovery was unsuccessful.');
// Exit with error code but allow startup to continue
process.exit(1);
}
}
}
// Handle process signals gracefully
process.on('SIGINT', () => {
console.log('\n⚠ Recovery interrupted by SIGINT');
process.exit(130);
});
process.on('SIGTERM', () => {
console.log('\n⚠ Recovery interrupted by SIGTERM');
process.exit(143);
});
// Run the startup recovery
runStartupRecovery();

scripts/test-recovery.ts (new file, 183 lines)

@@ -0,0 +1,183 @@
#!/usr/bin/env bun
/**
* Test script for the recovery system
* This script creates test jobs and verifies that the recovery system can handle them
*
* Usage:
* bun scripts/test-recovery.ts [--cleanup]
*
* Options:
* --cleanup: Clean up test jobs after testing
*/
import { db, mirrorJobs } from "../src/lib/db";
import { createMirrorJob } from "../src/lib/helpers";
import { initializeRecovery, hasJobsNeedingRecovery, getRecoveryStatus } from "../src/lib/recovery";
import { eq } from "drizzle-orm";
import { v4 as uuidv4 } from "uuid";
// Parse command line arguments
const args = process.argv.slice(2);
const cleanup = args.includes('--cleanup');
// Test configuration
const TEST_USER_ID = "test-user-recovery";
const TEST_BATCH_ID = "test-batch-recovery";
async function runRecoveryTest() {
console.log('=== Recovery System Test ===');
console.log(`Cleanup mode: ${cleanup}`);
console.log('');
try {
if (cleanup) {
await cleanupTestJobs();
return;
}
// Step 1: Create test jobs that simulate interrupted state
console.log('Step 1: Creating test interrupted jobs...');
await createTestInterruptedJobs();
// Step 2: Check if recovery system detects them
console.log('Step 2: Checking if recovery system detects interrupted jobs...');
const needsRecovery = await hasJobsNeedingRecovery();
console.log(`Jobs needing recovery: ${needsRecovery}`);
if (!needsRecovery) {
console.log('❌ Recovery system did not detect interrupted jobs');
return;
}
// Step 3: Get recovery status
console.log('Step 3: Getting recovery status...');
const status = getRecoveryStatus();
console.log('Recovery status:', status);
// Step 4: Run recovery
console.log('Step 4: Running recovery...');
const recoveryResult = await initializeRecovery({
skipIfRecentAttempt: false,
maxRetries: 2,
retryDelay: 2000,
});
console.log(`Recovery result: ${recoveryResult}`);
// Step 5: Verify recovery completed
console.log('Step 5: Verifying recovery completed...');
const stillNeedsRecovery = await hasJobsNeedingRecovery();
console.log(`Jobs still needing recovery: ${stillNeedsRecovery}`);
// Step 6: Check final job states
console.log('Step 6: Checking final job states...');
await checkTestJobStates();
console.log('');
console.log('✅ Recovery test completed successfully!');
console.log('Run with --cleanup to remove test jobs');
} catch (error) {
console.error('❌ Recovery test failed:', error);
process.exit(1);
}
}
/**
* Create test jobs that simulate interrupted state
*/
async function createTestInterruptedJobs() {
const testJobs = [
{
repositoryId: uuidv4(),
repositoryName: "test-repo-1",
message: "Test mirror job 1",
status: "mirroring" as const,
jobType: "mirror" as const,
},
{
repositoryId: uuidv4(),
repositoryName: "test-repo-2",
message: "Test sync job 2",
status: "syncing" as const,
jobType: "sync" as const,
},
];
for (const job of testJobs) {
const jobId = await createMirrorJob({
userId: TEST_USER_ID,
repositoryId: job.repositoryId,
repositoryName: job.repositoryName,
message: job.message,
status: job.status,
jobType: job.jobType,
batchId: TEST_BATCH_ID,
totalItems: 5,
itemIds: [job.repositoryId, uuidv4(), uuidv4(), uuidv4(), uuidv4()],
inProgress: true,
skipDuplicateEvent: true,
});
// Manually set the job to look interrupted (old timestamp)
const oldTimestamp = new Date();
oldTimestamp.setMinutes(oldTimestamp.getMinutes() - 15); // 15 minutes ago
await db
.update(mirrorJobs)
.set({
startedAt: oldTimestamp,
lastCheckpoint: oldTimestamp,
})
.where(eq(mirrorJobs.id, jobId));
console.log(`Created test job: ${jobId} (${job.repositoryName})`);
}
}
/**
* Check the final states of test jobs
*/
async function checkTestJobStates() {
const testJobs = await db
.select()
.from(mirrorJobs)
.where(eq(mirrorJobs.userId, TEST_USER_ID));
console.log(`Found ${testJobs.length} test jobs:`);
for (const job of testJobs) {
console.log(`- Job ${job.id}: ${job.status} (inProgress: ${job.inProgress})`);
console.log(` Message: ${job.message}`);
console.log(` Started: ${job.startedAt ? new Date(job.startedAt).toISOString() : 'never'}`);
console.log(` Completed: ${job.completedAt ? new Date(job.completedAt).toISOString() : 'never'}`);
console.log('');
}
}
/**
* Clean up test jobs
*/
async function cleanupTestJobs() {
console.log('Cleaning up test jobs...');
const result = await db
.delete(mirrorJobs)
.where(eq(mirrorJobs.userId, TEST_USER_ID));
console.log('✅ Test jobs cleaned up successfully');
}
// Handle process signals gracefully
process.on('SIGINT', () => {
console.log('\n⚠ Test interrupted by SIGINT');
process.exit(130);
});
process.on('SIGTERM', () => {
console.log('\n⚠ Test interrupted by SIGTERM');
process.exit(143);
});
// Run the test
runRecoveryTest();

src/lib/helpers.ts

@@ -1,5 +1,6 @@
import type { RepoStatus } from "@/types/Repository";
import { db, mirrorJobs } from "./db";
import { eq, and, or, lt, isNull } from "drizzle-orm";
import { v4 as uuidv4 } from "uuid";
import { publishEvent } from "./events";
@@ -120,7 +121,7 @@ export async function updateMirrorJobProgress({
const [job] = await db
.select()
.from(mirrorJobs)
.where(eq(mirrorJobs.id, jobId));
if (!job) {
throw new Error(`Mirror job with ID ${jobId} not found`);
@@ -170,7 +171,7 @@ export async function updateMirrorJobProgress({
await db
.update(mirrorJobs)
.set(updates)
.where(eq(mirrorJobs.id, jobId));
// Publish the event with deduplication
const updatedJob = {
@@ -203,7 +204,7 @@ export async function updateMirrorJobProgress({
}
/**
* Finds interrupted jobs that need to be resumed with enhanced criteria
*/
export async function findInterruptedJobs() {
try {
@@ -211,15 +212,35 @@ export async function findInterruptedJobs() {
const cutoffTime = new Date();
cutoffTime.setMinutes(cutoffTime.getMinutes() - 10); // Consider jobs inactive after 10 minutes without updates
// Also check for jobs that have been running for too long (over 2 hours)
const staleCutoffTime = new Date();
staleCutoffTime.setHours(staleCutoffTime.getHours() - 2);
const interruptedJobs = await db
.select()
.from(mirrorJobs)
.where(
and(
eq(mirrorJobs.inProgress, true),
or(
// Jobs with no recent checkpoint
or(isNull(mirrorJobs.lastCheckpoint), lt(mirrorJobs.lastCheckpoint, cutoffTime)),
// Jobs that started too long ago (likely stale)
lt(mirrorJobs.startedAt, staleCutoffTime)
)
)
);
// Log details about found jobs for debugging
if (interruptedJobs.length > 0) {
console.log(`Found ${interruptedJobs.length} interrupted jobs:`);
interruptedJobs.forEach(job => {
const lastCheckpoint = job.lastCheckpoint ? new Date(job.lastCheckpoint).toISOString() : 'never';
const startedAt = job.startedAt ? new Date(job.startedAt).toISOString() : 'unknown';
console.log(`- Job ${job.id}: ${job.jobType} (started: ${startedAt}, last checkpoint: ${lastCheckpoint})`);
});
}
return interruptedJobs;
} catch (error) {
console.error("Error finding interrupted jobs:", error);

src/lib/recovery.ts

@@ -4,101 +4,274 @@
*/
import { findInterruptedJobs, resumeInterruptedJob } from './helpers';
import { db, repositories, organizations, mirrorJobs } from './db';
import { eq, and, lt } from 'drizzle-orm';
import { mirrorGithubRepoToGitea, mirrorGitHubOrgRepoToGiteaOrg, syncGiteaRepo } from './gitea';
import { createGitHubClient } from './github';
import { processWithResilience } from './utils/concurrency';
import { repositoryVisibilityEnum, repoStatusEnum } from '@/types/Repository';
import type { Repository } from './db/schema';
// Recovery state tracking
let recoveryInProgress = false;
let lastRecoveryAttempt: Date | null = null;
/**
* Validates database connection before attempting recovery
*/
async function validateDatabaseConnection(): Promise<boolean> {
try {
// Simple query to test database connectivity
await db.select().from(mirrorJobs).limit(1);
return true;
} catch (error) {
console.error('Database connection validation failed:', error);
return false;
}
}
/**
* Cleans up stale jobs that are too old to recover
*/
async function cleanupStaleJobs(): Promise<void> {
try {
const staleThreshold = new Date();
staleThreshold.setHours(staleThreshold.getHours() - 24); // Jobs older than 24 hours
const staleJobs = await db
.select()
.from(mirrorJobs)
.where(
and(
eq(mirrorJobs.inProgress, true),
lt(mirrorJobs.startedAt, staleThreshold)
)
);
if (staleJobs.length > 0) {
console.log(`Found ${staleJobs.length} stale jobs to clean up`);
// Mark stale jobs as failed
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: "Job marked as failed due to being stale (older than 24 hours)"
})
.where(
and(
eq(mirrorJobs.inProgress, true),
lt(mirrorJobs.startedAt, staleThreshold)
)
);
console.log(`Cleaned up ${staleJobs.length} stale jobs`);
}
} catch (error) {
console.error('Error cleaning up stale jobs:', error);
}
}
/**
* Initialize the recovery system with enhanced error handling and resilience
* This should be called when the application starts
*/
export async function initializeRecovery(options: {
maxRetries?: number;
retryDelay?: number;
skipIfRecentAttempt?: boolean;
} = {}): Promise<boolean> {
const { maxRetries = 3, retryDelay = 5000, skipIfRecentAttempt = true } = options;
// Prevent concurrent recovery attempts
if (recoveryInProgress) {
console.log('Recovery already in progress, skipping...');
return false;
}
// Skip if recent attempt (within last 5 minutes) unless forced
if (skipIfRecentAttempt && lastRecoveryAttempt) {
const timeSinceLastAttempt = Date.now() - lastRecoveryAttempt.getTime();
if (timeSinceLastAttempt < 5 * 60 * 1000) {
console.log('Recent recovery attempt detected, skipping...');
return false;
}
}
recoveryInProgress = true;
lastRecoveryAttempt = new Date();
console.log('Initializing recovery system...');
let attempt = 0;
while (attempt < maxRetries) {
try {
attempt++;
console.log(`Recovery attempt ${attempt}/${maxRetries}`);
// Validate database connection first
const dbConnected = await validateDatabaseConnection();
if (!dbConnected) {
throw new Error('Database connection validation failed');
}
// Clean up stale jobs first
await cleanupStaleJobs();
// Find interrupted jobs
const interruptedJobs = await findInterruptedJobs();
if (interruptedJobs.length === 0) {
console.log('No interrupted jobs found.');
recoveryInProgress = false;
return true;
}
console.log(`Found ${interruptedJobs.length} interrupted jobs. Starting recovery...`);
// Process each interrupted job with individual error handling
let successCount = 0;
let failureCount = 0;
for (const job of interruptedJobs) {
try {
const resumeData = await resumeInterruptedJob(job);
if (!resumeData) {
console.log(`Job ${job.id} could not be resumed.`);
failureCount++;
continue;
}
const { job: updatedJob, remainingItemIds } = resumeData;
// Handle different job types
switch (updatedJob.jobType) {
case 'mirror':
await recoverMirrorJob(updatedJob, remainingItemIds);
break;
case 'sync':
await recoverSyncJob(updatedJob, remainingItemIds);
break;
case 'retry':
await recoverRetryJob(updatedJob, remainingItemIds);
break;
default:
console.log(`Unknown job type: ${updatedJob.jobType}`);
failureCount++;
continue;
}
successCount++;
} catch (jobError) {
console.error(`Error recovering individual job ${job.id}:`, jobError);
failureCount++;
// Mark the job as failed if recovery fails
try {
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: `Job recovery failed: ${jobError instanceof Error ? jobError.message : String(jobError)}`
})
.where(eq(mirrorJobs.id, job.id));
} catch (updateError) {
console.error(`Failed to mark job ${job.id} as failed:`, updateError);
}
}
}
console.log(`Recovery process completed. Success: ${successCount}, Failures: ${failureCount}`);
recoveryInProgress = false;
return true;
} catch (error) {
console.error(`Recovery attempt ${attempt} failed:`, error);
if (attempt < maxRetries) {
console.log(`Retrying in ${retryDelay}ms...`);
await new Promise(resolve => setTimeout(resolve, retryDelay));
} else {
console.error('All recovery attempts failed');
recoveryInProgress = false;
return false;
}
}
}
recoveryInProgress = false;
return false;
}
/**
* Recover a mirror job with enhanced error handling
*/
async function recoverMirrorJob(job: any, remainingItemIds: string[]) {
console.log(`Recovering mirror job ${job.id} with ${remainingItemIds.length} remaining items`);
try {
// Get the config for this user with better error handling
const configs = await db
.select()
.from(repositories)
.where(eq(repositories.userId, job.userId))
.limit(1);
if (configs.length === 0) {
throw new Error(`No configuration found for user ${job.userId}`);
}
const config = configs[0];
if (!config.configId) {
throw new Error(`Configuration missing configId for user ${job.userId}`);
}
// Get repositories to process with validation
const repos = await db
.select()
.from(repositories)
.where(eq(repositories.id, remainingItemIds));
if (repos.length === 0) {
console.warn(`No repositories found for remaining item IDs: ${remainingItemIds.join(', ')}`);
// Mark job as completed since there's nothing to process
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "mirrored",
message: "Job completed - no repositories found to process"
})
.where(eq(mirrorJobs.id, job.id));
return;
}
console.log(`Found ${repos.length} repositories to process for recovery`);
// Validate GitHub configuration before creating client
if (!config.githubConfig?.token) {
throw new Error('GitHub token not found in configuration');
}
// Create GitHub client with error handling
let octokit;
try {
octokit = createGitHubClient(config.githubConfig.token);
} catch (error) {
throw new Error(`Failed to create GitHub client: ${error instanceof Error ? error.message : String(error)}`);
}
// Process repositories with resilience and reduced concurrency for recovery
await processWithResilience(
repos,
async (repo) => {
// Prepare repository data with validation
const repoData = {
...repo,
status: repoStatusEnum.parse("imported"),
@@ -106,10 +279,10 @@ async function recoverMirrorJob(job: any, remainingItemIds: string[]) {
lastMirrored: repo.lastMirrored ?? undefined,
errorMessage: repo.errorMessage ?? undefined,
forkedFrom: repo.forkedFrom ?? undefined,
visibility: repositoryVisibilityEnum.parse(repo.visibility || "public"),
mirroredLocation: repo.mirroredLocation || "",
};
// Mirror the repository based on whether it's in an organization
if (repo.organization && config.githubConfig.preserveOrgStructure) {
await mirrorGitHubOrgRepoToGiteaOrg({
@@ -125,7 +298,7 @@ async function recoverMirrorJob(job: any, remainingItemIds: string[]) {
config,
});
}
return repo;
},
{
@@ -134,66 +307,102 @@ async function recoverMirrorJob(job: any, remainingItemIds: string[]) {
getItemId: (repo) => repo.id,
getItemName: (repo) => repo.name,
resumeFromJobId: job.id,
concurrencyLimit: 2, // Reduced concurrency for recovery to be more stable
maxRetries: 3, // Increased retries for recovery
retryDelay: 3000, // Longer delay for recovery
}
);
console.log(`Successfully recovered mirror job ${job.id}`);
} catch (error) {
console.error(`Error recovering mirror job ${job.id}:`, error);
// Mark the job as failed
try {
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: `Mirror job recovery failed: ${error instanceof Error ? error.message : String(error)}`
})
.where(eq(mirrorJobs.id, job.id));
} catch (updateError) {
console.error(`Failed to mark mirror job ${job.id} as failed:`, updateError);
}
throw error; // Re-throw to be handled by the caller
}
}
/**
* Recover a sync job with enhanced error handling
*/
async function recoverSyncJob(job: any, remainingItemIds: string[]) {
console.log(`Recovering sync job ${job.id} with ${remainingItemIds.length} remaining items`);
try {
// Get the config for this user with better error handling
const configs = await db
.select()
.from(repositories)
.where(eq(repositories.userId, job.userId))
.limit(1);
if (configs.length === 0) {
throw new Error(`No configuration found for user ${job.userId}`);
}
const config = configs[0];
if (!config.configId) {
throw new Error(`Configuration missing configId for user ${job.userId}`);
}
// Get repositories to process with validation
const repos = await db
.select()
.from(repositories)
.where(eq(repositories.id, remainingItemIds));
if (repos.length === 0) {
console.warn(`No repositories found for remaining item IDs: ${remainingItemIds.join(', ')}`);
// Mark job as completed since there's nothing to process
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "mirrored",
message: "Job completed - no repositories found to process"
})
.where(eq(mirrorJobs.id, job.id));
return;
}
console.log(`Found ${repos.length} repositories to process for sync recovery`);
// Process repositories with resilience and reduced concurrency for recovery
await processWithResilience(
repos,
async (repo) => {
// Prepare repository data with validation
const repoData = {
...repo,
status: repoStatusEnum.parse(repo.status || "imported"),
organization: repo.organization ?? undefined,
lastMirrored: repo.lastMirrored ?? undefined,
errorMessage: repo.errorMessage ?? undefined,
forkedFrom: repo.forkedFrom ?? undefined,
visibility: repositoryVisibilityEnum.parse(repo.visibility || "public"),
};
// Sync the repository
await syncGiteaRepo({
config,
repository: repoData,
});
return repo;
},
{
@@ -202,23 +411,94 @@ async function recoverSyncJob(job: any, remainingItemIds: string[]) {
getItemId: (repo) => repo.id,
getItemName: (repo) => repo.name,
resumeFromJobId: job.id,
concurrencyLimit: 3, // Reduced concurrency for recovery
maxRetries: 3, // Increased retries for recovery
retryDelay: 3000, // Longer delay for recovery
}
);
console.log(`Successfully recovered sync job ${job.id}`);
} catch (error) {
console.error(`Error recovering sync job ${job.id}:`, error);
// Mark the job as failed
try {
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: `Sync job recovery failed: ${error instanceof Error ? error.message : String(error)}`
})
.where(eq(mirrorJobs.id, job.id));
} catch (updateError) {
console.error(`Failed to mark sync job ${job.id} as failed:`, updateError);
}
throw error; // Re-throw to be handled by the caller
}
}
/**
* Recover a retry job with enhanced error handling
*/
async function recoverRetryJob(job: any, remainingItemIds: string[]) {
console.log(`Recovering retry job ${job.id} with ${remainingItemIds.length} remaining items`);
try {
// For now, retry jobs are treated similarly to mirror jobs
// In the future, this could have specific retry logic
await recoverMirrorJob(job, remainingItemIds);
console.log(`Successfully recovered retry job ${job.id}`);
} catch (error) {
console.error(`Error recovering retry job ${job.id}:`, error);
// Mark the job as failed
try {
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: `Retry job recovery failed: ${error instanceof Error ? error.message : String(error)}`
})
.where(eq(mirrorJobs.id, job.id));
} catch (updateError) {
console.error(`Failed to mark retry job ${job.id} as failed:`, updateError);
}
throw error; // Re-throw to be handled by the caller
}
}
/**
* Get recovery system status
*/
export function getRecoveryStatus() {
return {
inProgress: recoveryInProgress,
lastAttempt: lastRecoveryAttempt,
};
}
/**
* Force recovery to run (bypassing recent attempt check)
*/
export async function forceRecovery(): Promise<boolean> {
return initializeRecovery({ skipIfRecentAttempt: false });
}
/**
* Check if there are any jobs that need recovery
*/
export async function hasJobsNeedingRecovery(): Promise<boolean> {
try {
const interruptedJobs = await findInterruptedJobs();
return interruptedJobs.length > 0;
} catch (error) {
console.error('Error checking for jobs needing recovery:', error);
return false;
}
}

src/middleware.ts

@@ -1,22 +1,58 @@
import { defineMiddleware } from 'astro:middleware';
import { initializeRecovery, hasJobsNeedingRecovery, getRecoveryStatus } from './lib/recovery';
// Flag to track if recovery has been initialized
let recoveryInitialized = false;
let recoveryAttempted = false;
export const onRequest = defineMiddleware(async (context, next) => {
// Initialize recovery system only once when the server starts
// This is a fallback in case the startup script didn't run
if (!recoveryInitialized && !recoveryAttempted) {
recoveryAttempted = true;
try {
// Check if recovery is actually needed before attempting
const needsRecovery = await hasJobsNeedingRecovery();
if (needsRecovery) {
console.log('⚠️ Middleware detected jobs needing recovery (startup script may not have run)');
console.log('Attempting recovery from middleware...');
// Run recovery with a shorter timeout since this is during request handling
const recoveryResult = await Promise.race([
initializeRecovery({
skipIfRecentAttempt: true,
maxRetries: 2,
retryDelay: 3000,
}),
new Promise<boolean>((_, reject) => {
setTimeout(() => reject(new Error('Middleware recovery timeout')), 15000);
})
]);
if (recoveryResult) {
console.log('✅ Middleware recovery completed successfully');
} else {
console.log('⚠️ Middleware recovery completed with some issues');
}
} else {
console.log('✅ No recovery needed (startup script likely handled it)');
}
recoveryInitialized = true;
} catch (error) {
console.error('⚠️ Middleware recovery failed or timed out:', error);
console.log('Application will continue, but some jobs may remain interrupted');
// Log recovery status for debugging
const status = getRecoveryStatus();
console.log('Recovery status:', status);
recoveryInitialized = true; // Mark as attempted to avoid retries
}
}
// Continue with the request
return next();
});

src/pages/api/health.ts

@@ -2,6 +2,7 @@ import type { APIRoute } from "astro";
import { jsonResponse } from "@/lib/utils";
import { db } from "@/lib/db";
import { ENV } from "@/lib/config";
import { getRecoveryStatus, hasJobsNeedingRecovery } from "@/lib/recovery";
import os from "os";
import axios from "axios";
@@ -38,9 +39,20 @@ export const GET: APIRoute = async () => {
const currentVersion = process.env.npm_package_version || "unknown";
const latestVersion = await checkLatestVersion();
// Get recovery system status
const recoveryStatus = await getRecoverySystemStatus();
// Determine overall health status
let overallStatus = "ok";
if (!dbStatus.connected) {
overallStatus = "error";
} else if (recoveryStatus.jobsNeedingRecovery > 0 && !recoveryStatus.inProgress) {
overallStatus = "degraded";
}
// Build response
const healthData = {
status: overallStatus,
timestamp: new Date().toISOString(),
version: currentVersion,
latestVersion: latestVersion,
@@ -48,6 +60,7 @@ export const GET: APIRoute = async () => {
currentVersion !== "unknown" &&
latestVersion !== currentVersion,
database: dbStatus,
recovery: recoveryStatus,
system: systemInfo,
};
@@ -94,6 +107,36 @@ async function checkDatabaseConnection() {
}
}
/**
* Get recovery system status
*/
async function getRecoverySystemStatus() {
try {
const recoveryStatus = getRecoveryStatus();
const needsRecovery = await hasJobsNeedingRecovery();
return {
status: needsRecovery ? 'jobs-pending' : 'healthy',
inProgress: recoveryStatus.inProgress,
lastAttempt: recoveryStatus.lastAttempt?.toISOString() || null,
jobsNeedingRecovery: needsRecovery ? 1 : 0, // Simplified count for health check
message: needsRecovery
? 'Jobs found that need recovery'
: 'No jobs need recovery',
};
} catch (error) {
console.error('Recovery system status check failed:', error);
return {
status: 'error',
inProgress: false,
lastAttempt: null,
jobsNeedingRecovery: -1,
message: error instanceof Error ? error.message : 'Recovery status check failed',
};
}
}
/**
* Get server uptime information
*/