feat: Implement comprehensive job recovery and resume process improvements

- Added a startup recovery script to handle interrupted jobs before application startup.
- Enhanced recovery system with database connection validation and stale job cleanup.
- Improved middleware to check for recovery needs and handle recovery during requests.
- Updated health check endpoint to include recovery system status and metrics.
- Introduced test scripts for verifying recovery functionality and job state management.
- Enhanced logging and error handling throughout the recovery process.
Author: Arunavo Ray
Date: 2025-05-24 13:45:25 +05:30
Parent: 98610482ae
Commit: a988be1028
10 changed files with 1006 additions and 111 deletions

docker-entrypoint.sh

@@ -250,6 +250,30 @@ else
echo "Consider setting up external scheduled tasks to run cleanup scripts." echo "Consider setting up external scheduled tasks to run cleanup scripts."
fi fi
# Run startup recovery to handle any interrupted jobs
echo "Running startup recovery..."
if [ -f "dist/scripts/startup-recovery.js" ]; then
echo "Running startup recovery using compiled script..."
bun dist/scripts/startup-recovery.js --timeout=30000
RECOVERY_EXIT_CODE=$?
elif [ -f "scripts/startup-recovery.ts" ]; then
echo "Running startup recovery using TypeScript script..."
bun scripts/startup-recovery.ts --timeout=30000
RECOVERY_EXIT_CODE=$?
else
echo "Warning: Startup recovery script not found. Skipping recovery."
RECOVERY_EXIT_CODE=0
fi
# Log recovery result
if [ $RECOVERY_EXIT_CODE -eq 0 ]; then
echo "✅ Startup recovery completed successfully"
elif [ $RECOVERY_EXIT_CODE -eq 1 ]; then
echo "⚠️ Startup recovery completed with warnings"
else
echo "❌ Startup recovery failed with exit code $RECOVERY_EXIT_CODE"
fi
# Start the application
echo "Starting Gitea Mirror..."
exec bun ./dist/server/entry.mjs


@@ -0,0 +1,170 @@
# Job Recovery and Resume Process Improvements
This document outlines the comprehensive improvements made to the job recovery and resume process to make it more robust to application restarts, container restarts, and application crashes.
## Problems Addressed
The original recovery system had several critical issues:
1. **Middleware-based initialization**: Recovery only ran when the first request came in
2. **Database connection issues**: No validation of database connectivity before recovery attempts
3. **Limited error handling**: Insufficient error handling for various failure scenarios
4. **No startup recovery**: No mechanism to handle recovery before serving requests
5. **Incomplete job state management**: Jobs could remain in inconsistent states
6. **No retry mechanisms**: Single-attempt recovery with no fallback strategies
## Improvements Implemented
### 1. Enhanced Recovery System (`src/lib/recovery.ts`)
#### New Features:
- **Database connection validation** before attempting recovery
- **Stale job cleanup** for jobs older than 24 hours
- **Retry mechanisms** with configurable attempts and delays
- **Individual job error handling** to prevent one failed job from stopping recovery
- **Recovery state tracking** to prevent concurrent recovery attempts
- **Enhanced logging** with detailed job information
#### Key Functions:
- `initializeRecovery()` - Main recovery function with enhanced error handling
- `validateDatabaseConnection()` - Ensures database is accessible
- `cleanupStaleJobs()` - Removes jobs that are too old to recover
- `getRecoveryStatus()` - Returns current recovery system status
- `forceRecovery()` - Bypasses recent attempt checks
- `hasJobsNeedingRecovery()` - Checks if recovery is needed (a combined usage sketch follows this list)
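For example, these helpers can be combined to trigger a forced run only when it is safe to do so. A minimal sketch with an illustrative helper name, assuming the same `@/lib/recovery` module path used in the examples later in this document:
```typescript
import { getRecoveryStatus, hasJobsNeedingRecovery, forceRecovery } from '@/lib/recovery';

// Illustrative helper: trigger a forced recovery run only when jobs are pending
// and no other run is already active.
async function recoverIfIdle(): Promise<boolean> {
  const status = getRecoveryStatus();
  if (status.inProgress) return false;                 // another attempt is already running
  if (!(await hasJobsNeedingRecovery())) return true;  // nothing to recover
  return forceRecovery();                              // bypasses the recent-attempt check
}
```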
### 2. Startup Recovery Script (`scripts/startup-recovery.ts`)
A dedicated script that runs recovery before the application starts serving requests:
#### Features:
- **Timeout protection** (default: 30 seconds)
- **Force recovery option** to bypass recent attempt checks
- **Graceful signal handling** (SIGINT, SIGTERM)
- **Detailed logging** with progress indicators
- **Exit codes** for different scenarios (success, warnings, errors)
#### Usage:
```bash
bun scripts/startup-recovery.ts [--force] [--timeout=30000]
```
### 3. Improved Middleware (`src/middleware.ts`)
The middleware now serves as a fallback recovery mechanism:
#### Changes:
- **Checks if recovery is needed** before attempting
- **Shorter timeout** (15 seconds) for request-time recovery
- **Better error handling** with status logging
- **Prevents multiple attempts** with proper state tracking (the overall pattern is sketched below)
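In essence the fallback is a check-then-race pattern: only attempt recovery when it is actually needed, and never let it block a request for more than 15 seconds. A simplified sketch of the approach with an illustrative helper name (not the full middleware):
```typescript
import { initializeRecovery, hasJobsNeedingRecovery } from '@/lib/recovery';

// Request-time fallback: skip if nothing is pending, otherwise bound the attempt to 15 seconds.
async function recoverOnFirstRequest(): Promise<void> {
  if (!(await hasJobsNeedingRecovery())) return;

  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('Middleware recovery timeout')), 15000)
  );

  await Promise.race([
    initializeRecovery({ skipIfRecentAttempt: true, maxRetries: 2, retryDelay: 3000 }),
    timeout,
  ]);
}
```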
### 4. Enhanced Database Queries (`src/lib/helpers.ts`)
#### Improvements:
- **Proper Drizzle ORM syntax** for all database queries
- **Enhanced interrupted job detection** with multiple criteria:
- Jobs with no recent checkpoint (10+ minutes)
- Jobs running too long (2+ hours)
- **Detailed logging** of found interrupted jobs
- **Better error handling** for database operations (the detection query is sketched below)
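The two criteria translate into a single Drizzle query. A condensed sketch, assuming the `mirrorJobs` table exposes `inProgress`, `lastCheckpoint`, and `startedAt` columns as in this commit:
```typescript
import { and, eq, isNull, lt, or } from 'drizzle-orm';
import { db, mirrorJobs } from '@/lib/db';

// A job is considered interrupted if it is still marked in progress and either
// has no checkpoint in the last 10 minutes or started more than 2 hours ago.
async function findInterrupted() {
  const checkpointCutoff = new Date(Date.now() - 10 * 60 * 1000);
  const staleCutoff = new Date(Date.now() - 2 * 60 * 60 * 1000);

  return db
    .select()
    .from(mirrorJobs)
    .where(
      and(
        eq(mirrorJobs.inProgress, true),
        or(
          isNull(mirrorJobs.lastCheckpoint),
          lt(mirrorJobs.lastCheckpoint, checkpointCutoff),
          lt(mirrorJobs.startedAt, staleCutoff)
        )
      )
    );
}
```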
### 5. Docker Integration (`docker-entrypoint.sh`)
#### Changes:
- **Automatic startup recovery** runs before application start
- **Exit code handling** with appropriate logging
- **Fallback mechanisms** if recovery script is not found
- **Non-blocking execution** - application starts even if recovery fails
### 6. Health Check Integration (`src/pages/api/health.ts`)
#### New Features:
- **Recovery system status** in health endpoint
- **Job recovery metrics** (jobs needing recovery, recovery in progress)
- **Overall health status** considers recovery state
- **Detailed recovery information** for monitoring (an example payload is shown below)
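The recovery block added to the health payload has roughly the following shape (field names follow the handler changes later in this commit; the values are illustrative):
```typescript
// Shape of the `recovery` field returned by GET /api/health.
interface RecoveryHealth {
  status: 'healthy' | 'jobs-pending' | 'error';
  inProgress: boolean;
  lastAttempt: string | null;   // ISO timestamp of the last recovery attempt, or null
  jobsNeedingRecovery: number;  // -1 when the status check itself failed
  message: string;
}

const example: RecoveryHealth = {
  status: 'jobs-pending',
  inProgress: false,
  lastAttempt: '2025-05-24T08:15:25.000Z',
  jobsNeedingRecovery: 1,
  message: 'Jobs found that need recovery',
};
```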
### 7. Testing Infrastructure (`scripts/test-recovery.ts`)
A comprehensive test script to verify recovery functionality:
#### Features:
- **Creates test interrupted jobs** with realistic scenarios
- **Verifies recovery detection** and execution
- **Checks final job states** after recovery
- **Cleanup functionality** for test data
- **Comprehensive logging** of test progress
## Configuration Options
### Recovery System Options:
- `maxRetries`: Number of recovery attempts (default: 3)
- `retryDelay`: Delay between attempts in ms (default: 5000)
- `skipIfRecentAttempt`: Skip if recent attempt made (default: true)
### Startup Recovery Options:
- `--force`: Force recovery even if recent attempt was made
- `--timeout`: Maximum time to wait for recovery (default: 30000ms)
## Usage Examples
### Manual Recovery:
```bash
# Run startup recovery
bun run startup-recovery
# Force recovery
bun run startup-recovery-force
# Test recovery system
bun run test-recovery
# Clean up test data
bun run test-recovery-cleanup
```
### Programmatic Usage:
```typescript
import { initializeRecovery, hasJobsNeedingRecovery } from '@/lib/recovery';
// Check if recovery is needed
const needsRecovery = await hasJobsNeedingRecovery();
// Run recovery with custom options
const success = await initializeRecovery({
maxRetries: 5,
retryDelay: 3000,
skipIfRecentAttempt: false
});
```
## Monitoring and Observability
### Health Check Endpoint:
- **URL**: `/api/health`
- **Recovery Status**: Included in response
- **Monitoring**: Can be used with external monitoring systems (see the probe example below)
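For example, an external monitor could poll the endpoint and alert on anything other than an `ok` status. A minimal probe sketch; the base URL is an assumption and should point at your deployment:
```typescript
// Minimal health probe: exits non-zero unless the overall status is "ok".
// HEALTH_URL is an assumption — substitute your deployment's address and port.
const baseUrl = process.env.HEALTH_URL ?? 'http://localhost:4321';

const res = await fetch(`${baseUrl}/api/health`);
const health = await res.json();

if (health.status !== 'ok') {
  console.error(`gitea-mirror health is ${health.status}: ${health.recovery?.message ?? ''}`);
  process.exit(1);
}
```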
### Log Messages:
- **Startup**: Clear indicators of recovery attempts and results
- **Progress**: Detailed logging of recovery steps
- **Errors**: Comprehensive error information for debugging
## Benefits
1. **Reliability**: Jobs are automatically recovered after application restarts
2. **Resilience**: Multiple retry mechanisms and fallback strategies
3. **Observability**: Comprehensive logging and health check integration
4. **Performance**: Efficient detection and processing of interrupted jobs
5. **Maintainability**: Clear separation of concerns and modular design
6. **Testing**: Built-in testing infrastructure for verification
## Migration Notes
- **Backward Compatible**: All existing functionality is preserved
- **Automatic**: Recovery runs automatically on startup
- **Configurable**: All timeouts and retry counts can be adjusted
- **Monitoring**: Health checks now include recovery status
This comprehensive improvement ensures that the gitea-mirror application can reliably handle job recovery in all deployment scenarios, from development to production container environments.

package.json

@@ -20,6 +20,10 @@
"cleanup-events": "bun scripts/cleanup-events.ts", "cleanup-events": "bun scripts/cleanup-events.ts",
"cleanup-jobs": "bun scripts/cleanup-mirror-jobs.ts", "cleanup-jobs": "bun scripts/cleanup-mirror-jobs.ts",
"cleanup-all": "bun scripts/cleanup-events.ts && bun scripts/cleanup-mirror-jobs.ts", "cleanup-all": "bun scripts/cleanup-events.ts && bun scripts/cleanup-mirror-jobs.ts",
"startup-recovery": "bun scripts/startup-recovery.ts",
"startup-recovery-force": "bun scripts/startup-recovery.ts --force",
"test-recovery": "bun scripts/test-recovery.ts",
"test-recovery-cleanup": "bun scripts/test-recovery.ts --cleanup",
"preview": "bunx --bun astro preview", "preview": "bunx --bun astro preview",
"start": "bun dist/server/entry.mjs", "start": "bun dist/server/entry.mjs",
"start:fresh": "bun run cleanup-db && bun run manage-db init && bun run update-db && bun dist/server/entry.mjs", "start:fresh": "bun run cleanup-db && bun run manage-db init && bun run update-db && bun dist/server/entry.mjs",


@@ -118,6 +118,27 @@ bun scripts/fix-interrupted-jobs.ts <userId>
Use this script if you're having trouble cleaning up activities due to "interrupted" jobs that won't delete.
### Startup Recovery (startup-recovery.ts)
Runs job recovery during application startup to handle any interrupted jobs from previous runs.
```bash
# Run startup recovery (normal mode)
bun scripts/startup-recovery.ts
# Force recovery even if recent attempt was made
bun scripts/startup-recovery.ts --force
# Set custom timeout (default: 30000ms)
bun scripts/startup-recovery.ts --timeout=60000
# Using npm scripts
bun run startup-recovery
bun run startup-recovery-force
```
This script is automatically run by the Docker entrypoint during container startup. It ensures that any jobs interrupted by container restarts or application crashes are properly recovered or marked as failed.
## Deployment Scripts
### Docker Deployment

scripts/startup-recovery.ts (new file, 113 lines)

@@ -0,0 +1,113 @@
#!/usr/bin/env bun
/**
* Startup recovery script
* This script runs job recovery before the application starts serving requests
* It ensures that any interrupted jobs from previous runs are properly handled
*
* Usage:
* bun scripts/startup-recovery.ts [--force] [--timeout=30000]
*
* Options:
* --force: Force recovery even if a recent attempt was made
* --timeout: Maximum time to wait for recovery (in milliseconds, default: 30000)
*/
import { initializeRecovery, hasJobsNeedingRecovery, getRecoveryStatus } from "../src/lib/recovery";
// Parse command line arguments
const args = process.argv.slice(2);
const forceRecovery = args.includes('--force');
const timeoutArg = args.find(arg => arg.startsWith('--timeout='));
const timeout = timeoutArg ? parseInt(timeoutArg.split('=')[1], 10) : 30000;
if (isNaN(timeout) || timeout < 1000) {
console.error("Error: Timeout must be at least 1000ms");
process.exit(1);
}
async function runStartupRecovery() {
console.log('=== Gitea Mirror Startup Recovery ===');
console.log(`Timeout: ${timeout}ms`);
console.log(`Force recovery: ${forceRecovery}`);
console.log('');
const startTime = Date.now();
try {
// Set up timeout
const timeoutPromise = new Promise<never>((_, reject) => {
setTimeout(() => {
reject(new Error(`Recovery timeout after ${timeout}ms`));
}, timeout);
});
// Check if recovery is needed first
console.log('Checking if recovery is needed...');
const needsRecovery = await hasJobsNeedingRecovery();
if (!needsRecovery) {
console.log('✅ No jobs need recovery. Startup can proceed.');
process.exit(0);
}
console.log('⚠️ Jobs found that need recovery. Starting recovery process...');
// Run recovery with timeout
const recoveryPromise = initializeRecovery({
skipIfRecentAttempt: !forceRecovery,
maxRetries: 3,
retryDelay: 5000,
});
const recoveryResult = await Promise.race([recoveryPromise, timeoutPromise]);
const endTime = Date.now();
const duration = endTime - startTime;
if (recoveryResult) {
console.log(`✅ Recovery completed successfully in ${duration}ms`);
console.log('Application startup can proceed.');
process.exit(0);
} else {
console.log(`⚠️ Recovery completed with some failures in ${duration}ms`);
console.log('Application startup can proceed, but some jobs may have failed.');
process.exit(0);
}
} catch (error) {
const endTime = Date.now();
const duration = endTime - startTime;
if (error instanceof Error && error.message.includes('timeout')) {
console.error(`❌ Recovery timed out after ${duration}ms`);
console.error('Application will start anyway, but some jobs may remain interrupted.');
// Get current recovery status
const status = getRecoveryStatus();
console.log('Recovery status:', status);
// Exit with warning code but allow startup to continue
process.exit(1);
} else {
console.error(`❌ Recovery failed after ${duration}ms:`, error);
console.error('Application will start anyway, but recovery was unsuccessful.');
// Exit with error code but allow startup to continue
process.exit(1);
}
}
}
// Handle process signals gracefully
process.on('SIGINT', () => {
console.log('\n⚠ Recovery interrupted by SIGINT');
process.exit(130);
});
process.on('SIGTERM', () => {
console.log('\n⚠ Recovery interrupted by SIGTERM');
process.exit(143);
});
// Run the startup recovery
runStartupRecovery();

scripts/test-recovery.ts (new file, 183 lines)

@@ -0,0 +1,183 @@
#!/usr/bin/env bun
/**
* Test script for the recovery system
* This script creates test jobs and verifies that the recovery system can handle them
*
* Usage:
* bun scripts/test-recovery.ts [--cleanup]
*
* Options:
* --cleanup: Clean up test jobs after testing
*/
import { db, mirrorJobs } from "../src/lib/db";
import { createMirrorJob } from "../src/lib/helpers";
import { initializeRecovery, hasJobsNeedingRecovery, getRecoveryStatus } from "../src/lib/recovery";
import { eq } from "drizzle-orm";
import { v4 as uuidv4 } from "uuid";
// Parse command line arguments
const args = process.argv.slice(2);
const cleanup = args.includes('--cleanup');
// Test configuration
const TEST_USER_ID = "test-user-recovery";
const TEST_BATCH_ID = "test-batch-recovery";
async function runRecoveryTest() {
console.log('=== Recovery System Test ===');
console.log(`Cleanup mode: ${cleanup}`);
console.log('');
try {
if (cleanup) {
await cleanupTestJobs();
return;
}
// Step 1: Create test jobs that simulate interrupted state
console.log('Step 1: Creating test interrupted jobs...');
await createTestInterruptedJobs();
// Step 2: Check if recovery system detects them
console.log('Step 2: Checking if recovery system detects interrupted jobs...');
const needsRecovery = await hasJobsNeedingRecovery();
console.log(`Jobs needing recovery: ${needsRecovery}`);
if (!needsRecovery) {
console.log('❌ Recovery system did not detect interrupted jobs');
return;
}
// Step 3: Get recovery status
console.log('Step 3: Getting recovery status...');
const status = getRecoveryStatus();
console.log('Recovery status:', status);
// Step 4: Run recovery
console.log('Step 4: Running recovery...');
const recoveryResult = await initializeRecovery({
skipIfRecentAttempt: false,
maxRetries: 2,
retryDelay: 2000,
});
console.log(`Recovery result: ${recoveryResult}`);
// Step 5: Verify recovery completed
console.log('Step 5: Verifying recovery completed...');
const stillNeedsRecovery = await hasJobsNeedingRecovery();
console.log(`Jobs still needing recovery: ${stillNeedsRecovery}`);
// Step 6: Check final job states
console.log('Step 6: Checking final job states...');
await checkTestJobStates();
console.log('');
console.log('✅ Recovery test completed successfully!');
console.log('Run with --cleanup to remove test jobs');
} catch (error) {
console.error('❌ Recovery test failed:', error);
process.exit(1);
}
}
/**
* Create test jobs that simulate interrupted state
*/
async function createTestInterruptedJobs() {
const testJobs = [
{
repositoryId: uuidv4(),
repositoryName: "test-repo-1",
message: "Test mirror job 1",
status: "mirroring" as const,
jobType: "mirror" as const,
},
{
repositoryId: uuidv4(),
repositoryName: "test-repo-2",
message: "Test sync job 2",
status: "syncing" as const,
jobType: "sync" as const,
},
];
for (const job of testJobs) {
const jobId = await createMirrorJob({
userId: TEST_USER_ID,
repositoryId: job.repositoryId,
repositoryName: job.repositoryName,
message: job.message,
status: job.status,
jobType: job.jobType,
batchId: TEST_BATCH_ID,
totalItems: 5,
itemIds: [job.repositoryId, uuidv4(), uuidv4(), uuidv4(), uuidv4()],
inProgress: true,
skipDuplicateEvent: true,
});
// Manually set the job to look interrupted (old timestamp)
const oldTimestamp = new Date();
oldTimestamp.setMinutes(oldTimestamp.getMinutes() - 15); // 15 minutes ago
await db
.update(mirrorJobs)
.set({
startedAt: oldTimestamp,
lastCheckpoint: oldTimestamp,
})
.where(eq(mirrorJobs.id, jobId));
console.log(`Created test job: ${jobId} (${job.repositoryName})`);
}
}
/**
* Check the final states of test jobs
*/
async function checkTestJobStates() {
const testJobs = await db
.select()
.from(mirrorJobs)
.where(eq(mirrorJobs.userId, TEST_USER_ID));
console.log(`Found ${testJobs.length} test jobs:`);
for (const job of testJobs) {
console.log(`- Job ${job.id}: ${job.status} (inProgress: ${job.inProgress})`);
console.log(` Message: ${job.message}`);
console.log(` Started: ${job.startedAt ? new Date(job.startedAt).toISOString() : 'never'}`);
console.log(` Completed: ${job.completedAt ? new Date(job.completedAt).toISOString() : 'never'}`);
console.log('');
}
}
/**
* Clean up test jobs
*/
async function cleanupTestJobs() {
console.log('Cleaning up test jobs...');
const result = await db
.delete(mirrorJobs)
.where(eq(mirrorJobs.userId, TEST_USER_ID));
console.log('✅ Test jobs cleaned up successfully');
}
// Handle process signals gracefully
process.on('SIGINT', () => {
console.log('\n⚠ Test interrupted by SIGINT');
process.exit(130);
});
process.on('SIGTERM', () => {
console.log('\n⚠ Test interrupted by SIGTERM');
process.exit(143);
});
// Run the test
runRecoveryTest();

src/lib/helpers.ts

@@ -1,5 +1,6 @@
import type { RepoStatus } from "@/types/Repository";
import { db, mirrorJobs } from "./db";
import { eq, and, or, lt, isNull } from "drizzle-orm";
import { v4 as uuidv4 } from "uuid";
import { publishEvent } from "./events";
@@ -120,7 +121,7 @@ export async function updateMirrorJobProgress({
const [job] = await db
.select()
.from(mirrorJobs)
.where(eq(mirrorJobs.id, jobId));
if (!job) {
throw new Error(`Mirror job with ID ${jobId} not found`);
@@ -170,7 +171,7 @@ export async function updateMirrorJobProgress({
await db
.update(mirrorJobs)
.set(updates)
.where(eq(mirrorJobs.id, jobId));
// Publish the event with deduplication
const updatedJob = {
@@ -203,7 +204,7 @@ export async function updateMirrorJobProgress({
}
/**
* Finds interrupted jobs that need to be resumed with enhanced criteria
*/
export async function findInterruptedJobs() {
try {
@@ -211,15 +212,35 @@ export async function findInterruptedJobs() {
const cutoffTime = new Date();
cutoffTime.setMinutes(cutoffTime.getMinutes() - 10); // Consider jobs inactive after 10 minutes without updates
// Also check for jobs that have been running for too long (over 2 hours)
const staleCutoffTime = new Date();
staleCutoffTime.setHours(staleCutoffTime.getHours() - 2);
const interruptedJobs = await db
.select()
.from(mirrorJobs)
.where(
and(
eq(mirrorJobs.inProgress, true),
or(
// Jobs with no recent checkpoint
or(isNull(mirrorJobs.lastCheckpoint), lt(mirrorJobs.lastCheckpoint, cutoffTime)),
// Jobs that started too long ago (likely stale)
lt(mirrorJobs.startedAt, staleCutoffTime)
)
)
);
// Log details about found jobs for debugging
if (interruptedJobs.length > 0) {
console.log(`Found ${interruptedJobs.length} interrupted jobs:`);
interruptedJobs.forEach(job => {
const lastCheckpoint = job.lastCheckpoint ? new Date(job.lastCheckpoint).toISOString() : 'never';
const startedAt = job.startedAt ? new Date(job.startedAt).toISOString() : 'unknown';
console.log(`- Job ${job.id}: ${job.jobType} (started: ${startedAt}, last checkpoint: ${lastCheckpoint})`);
});
}
return interruptedJobs;
} catch (error) {
console.error("Error finding interrupted jobs:", error);

src/lib/recovery.ts

@@ -4,101 +4,274 @@
*/
import { findInterruptedJobs, resumeInterruptedJob } from './helpers';
import { db, repositories, organizations, mirrorJobs } from './db';
import { eq, and, lt } from 'drizzle-orm';
import { mirrorGithubRepoToGitea, mirrorGitHubOrgRepoToGiteaOrg, syncGiteaRepo } from './gitea';
import { createGitHubClient } from './github';
import { processWithResilience } from './utils/concurrency';
import { repositoryVisibilityEnum, repoStatusEnum } from '@/types/Repository';
import type { Repository } from './db/schema';
// Recovery state tracking
let recoveryInProgress = false;
let lastRecoveryAttempt: Date | null = null;
/**
* Validates database connection before attempting recovery
*/
async function validateDatabaseConnection(): Promise<boolean> {
try {
// Simple query to test database connectivity
await db.select().from(mirrorJobs).limit(1);
return true;
} catch (error) {
console.error('Database connection validation failed:', error);
return false;
}
}
/**
* Cleans up stale jobs that are too old to recover
*/
async function cleanupStaleJobs(): Promise<void> {
try {
const staleThreshold = new Date();
staleThreshold.setHours(staleThreshold.getHours() - 24); // Jobs older than 24 hours
const staleJobs = await db
.select()
.from(mirrorJobs)
.where(
and(
eq(mirrorJobs.inProgress, true),
lt(mirrorJobs.startedAt, staleThreshold)
)
);
if (staleJobs.length > 0) {
console.log(`Found ${staleJobs.length} stale jobs to clean up`);
// Mark stale jobs as failed
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: "Job marked as failed due to being stale (older than 24 hours)"
})
.where(
and(
eq(mirrorJobs.inProgress, true),
lt(mirrorJobs.startedAt, staleThreshold)
)
);
console.log(`Cleaned up ${staleJobs.length} stale jobs`);
}
} catch (error) {
console.error('Error cleaning up stale jobs:', error);
}
}
/**
* Initialize the recovery system with enhanced error handling and resilience
* This should be called when the application starts
*/
export async function initializeRecovery(options: {
maxRetries?: number;
retryDelay?: number;
skipIfRecentAttempt?: boolean;
} = {}): Promise<boolean> {
const { maxRetries = 3, retryDelay = 5000, skipIfRecentAttempt = true } = options;
// Prevent concurrent recovery attempts
if (recoveryInProgress) {
console.log('Recovery already in progress, skipping...');
return false;
}
// Skip if recent attempt (within last 5 minutes) unless forced
if (skipIfRecentAttempt && lastRecoveryAttempt) {
const timeSinceLastAttempt = Date.now() - lastRecoveryAttempt.getTime();
if (timeSinceLastAttempt < 5 * 60 * 1000) {
console.log('Recent recovery attempt detected, skipping...');
return false;
}
}
recoveryInProgress = true;
lastRecoveryAttempt = new Date();
console.log('Initializing recovery system...');
let attempt = 0;
while (attempt < maxRetries) {
try {
attempt++;
console.log(`Recovery attempt ${attempt}/${maxRetries}`);
// Validate database connection first
const dbConnected = await validateDatabaseConnection();
if (!dbConnected) {
throw new Error('Database connection validation failed');
}
// Clean up stale jobs first
await cleanupStaleJobs();
// Find interrupted jobs
const interruptedJobs = await findInterruptedJobs();
if (interruptedJobs.length === 0) {
console.log('No interrupted jobs found.');
recoveryInProgress = false;
return true;
}
console.log(`Found ${interruptedJobs.length} interrupted jobs. Starting recovery...`);
// Process each interrupted job with individual error handling
let successCount = 0;
let failureCount = 0;
for (const job of interruptedJobs) {
try {
const resumeData = await resumeInterruptedJob(job);
if (!resumeData) {
console.log(`Job ${job.id} could not be resumed.`);
failureCount++;
continue;
}
const { job: updatedJob, remainingItemIds } = resumeData;
// Handle different job types
switch (updatedJob.jobType) {
case 'mirror':
await recoverMirrorJob(updatedJob, remainingItemIds);
break;
case 'sync':
await recoverSyncJob(updatedJob, remainingItemIds);
break;
case 'retry':
await recoverRetryJob(updatedJob, remainingItemIds);
break;
default:
console.log(`Unknown job type: ${updatedJob.jobType}`);
failureCount++;
continue;
}
successCount++;
} catch (jobError) {
console.error(`Error recovering individual job ${job.id}:`, jobError);
failureCount++;
// Mark the job as failed if recovery fails
try {
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: `Job recovery failed: ${jobError instanceof Error ? jobError.message : String(jobError)}`
})
.where(eq(mirrorJobs.id, job.id));
} catch (updateError) {
console.error(`Failed to mark job ${job.id} as failed:`, updateError);
}
}
}
console.log(`Recovery process completed. Success: ${successCount}, Failures: ${failureCount}`);
recoveryInProgress = false;
return true;
} catch (error) {
console.error(`Recovery attempt ${attempt} failed:`, error);
if (attempt < maxRetries) {
console.log(`Retrying in ${retryDelay}ms...`);
await new Promise(resolve => setTimeout(resolve, retryDelay));
} else {
console.error('All recovery attempts failed');
recoveryInProgress = false;
return false;
}
}
}
recoveryInProgress = false;
return false;
}
/**
* Recover a mirror job with enhanced error handling
*/
async function recoverMirrorJob(job: any, remainingItemIds: string[]) {
console.log(`Recovering mirror job ${job.id} with ${remainingItemIds.length} remaining items`);
try {
// Get the config for this user with better error handling
const configs = await db
.select()
.from(repositories)
.where(eq(repositories.userId, job.userId))
.limit(1);
if (configs.length === 0) {
throw new Error(`No configuration found for user ${job.userId}`);
}
const config = configs[0];
if (!config.configId) {
throw new Error(`Configuration missing configId for user ${job.userId}`);
}
// Get repositories to process with validation
const repos = await db
.select()
.from(repositories)
.where(eq(repositories.id, remainingItemIds));
if (repos.length === 0) {
console.warn(`No repositories found for remaining item IDs: ${remainingItemIds.join(', ')}`);
// Mark job as completed since there's nothing to process
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "mirrored",
message: "Job completed - no repositories found to process"
})
.where(eq(mirrorJobs.id, job.id));
return;
}
console.log(`Found ${repos.length} repositories to process for recovery`);
// Validate GitHub configuration before creating client
if (!config.githubConfig?.token) {
throw new Error('GitHub token not found in configuration');
}
// Create GitHub client with error handling
let octokit;
try {
octokit = createGitHubClient(config.githubConfig.token);
} catch (error) {
throw new Error(`Failed to create GitHub client: ${error instanceof Error ? error.message : String(error)}`);
}
// Process repositories with resilience and reduced concurrency for recovery
await processWithResilience(
repos,
async (repo) => {
// Prepare repository data with validation
const repoData = {
...repo,
status: repoStatusEnum.parse("imported"),
@@ -106,10 +279,10 @@ async function recoverMirrorJob(job: any, remainingItemIds: string[]) {
lastMirrored: repo.lastMirrored ?? undefined,
errorMessage: repo.errorMessage ?? undefined,
forkedFrom: repo.forkedFrom ?? undefined,
visibility: repositoryVisibilityEnum.parse(repo.visibility || "public"),
mirroredLocation: repo.mirroredLocation || "",
};
// Mirror the repository based on whether it's in an organization
if (repo.organization && config.githubConfig.preserveOrgStructure) {
await mirrorGitHubOrgRepoToGiteaOrg({
@@ -125,7 +298,7 @@ async function recoverMirrorJob(job: any, remainingItemIds: string[]) {
config,
});
}
return repo;
},
{
@@ -134,66 +307,102 @@ async function recoverMirrorJob(job: any, remainingItemIds: string[]) {
getItemId: (repo) => repo.id,
getItemName: (repo) => repo.name,
resumeFromJobId: job.id,
concurrencyLimit: 2, // Reduced concurrency for recovery to be more stable
maxRetries: 3, // Increased retries for recovery
retryDelay: 3000, // Longer delay for recovery
}
);
console.log(`Successfully recovered mirror job ${job.id}`);
} catch (error) {
console.error(`Error recovering mirror job ${job.id}:`, error);
// Mark the job as failed
try {
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: `Mirror job recovery failed: ${error instanceof Error ? error.message : String(error)}`
})
.where(eq(mirrorJobs.id, job.id));
} catch (updateError) {
console.error(`Failed to mark mirror job ${job.id} as failed:`, updateError);
}
throw error; // Re-throw to be handled by the caller
}
}
/**
* Recover a sync job with enhanced error handling
*/
async function recoverSyncJob(job: any, remainingItemIds: string[]) {
console.log(`Recovering sync job ${job.id} with ${remainingItemIds.length} remaining items`);
try {
// Get the config for this user with better error handling
const configs = await db
.select()
.from(repositories)
.where(eq(repositories.userId, job.userId))
.limit(1);
if (configs.length === 0) {
throw new Error(`No configuration found for user ${job.userId}`);
}
const config = configs[0];
if (!config.configId) {
throw new Error(`Configuration missing configId for user ${job.userId}`);
}
// Get repositories to process with validation
const repos = await db
.select()
.from(repositories)
.where(eq(repositories.id, remainingItemIds));
if (repos.length === 0) {
console.warn(`No repositories found for remaining item IDs: ${remainingItemIds.join(', ')}`);
// Mark job as completed since there's nothing to process
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "mirrored",
message: "Job completed - no repositories found to process"
})
.where(eq(mirrorJobs.id, job.id));
return;
}
console.log(`Found ${repos.length} repositories to process for sync recovery`);
// Process repositories with resilience and reduced concurrency for recovery
await processWithResilience(
repos,
async (repo) => {
// Prepare repository data with validation
const repoData = {
...repo,
status: repoStatusEnum.parse(repo.status || "imported"),
organization: repo.organization ?? undefined,
lastMirrored: repo.lastMirrored ?? undefined,
errorMessage: repo.errorMessage ?? undefined,
forkedFrom: repo.forkedFrom ?? undefined,
visibility: repositoryVisibilityEnum.parse(repo.visibility || "public"),
};
// Sync the repository
await syncGiteaRepo({
config,
repository: repoData,
});
return repo;
},
{
@@ -202,23 +411,94 @@ async function recoverSyncJob(job: any, remainingItemIds: string[]) {
getItemId: (repo) => repo.id,
getItemName: (repo) => repo.name,
resumeFromJobId: job.id,
concurrencyLimit: 3, // Reduced concurrency for recovery
maxRetries: 3, // Increased retries for recovery
retryDelay: 3000, // Longer delay for recovery
}
);
console.log(`Successfully recovered sync job ${job.id}`);
} catch (error) {
console.error(`Error recovering sync job ${job.id}:`, error);
// Mark the job as failed
try {
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: `Sync job recovery failed: ${error instanceof Error ? error.message : String(error)}`
})
.where(eq(mirrorJobs.id, job.id));
} catch (updateError) {
console.error(`Failed to mark sync job ${job.id} as failed:`, updateError);
}
throw error; // Re-throw to be handled by the caller
}
}
/**
* Recover a retry job with enhanced error handling
*/
async function recoverRetryJob(job: any, remainingItemIds: string[]) {
console.log(`Recovering retry job ${job.id} with ${remainingItemIds.length} remaining items`);
try {
// For now, retry jobs are treated similarly to mirror jobs
// In the future, this could have specific retry logic
await recoverMirrorJob(job, remainingItemIds);
console.log(`Successfully recovered retry job ${job.id}`);
} catch (error) {
console.error(`Error recovering retry job ${job.id}:`, error);
// Mark the job as failed
try {
await db
.update(mirrorJobs)
.set({
inProgress: false,
completedAt: new Date(),
status: "failed",
message: `Retry job recovery failed: ${error instanceof Error ? error.message : String(error)}`
})
.where(eq(mirrorJobs.id, job.id));
} catch (updateError) {
console.error(`Failed to mark retry job ${job.id} as failed:`, updateError);
}
throw error; // Re-throw to be handled by the caller
}
}
/**
* Get recovery system status
*/
export function getRecoveryStatus() {
return {
inProgress: recoveryInProgress,
lastAttempt: lastRecoveryAttempt,
};
}
/**
* Force recovery to run (bypassing recent attempt check)
*/
export async function forceRecovery(): Promise<boolean> {
return initializeRecovery({ skipIfRecentAttempt: false });
}
/**
* Check if there are any jobs that need recovery
*/
export async function hasJobsNeedingRecovery(): Promise<boolean> {
try {
const interruptedJobs = await findInterruptedJobs();
return interruptedJobs.length > 0;
} catch (error) {
console.error('Error checking for jobs needing recovery:', error);
return false;
}
}

src/middleware.ts

@@ -1,22 +1,58 @@
import { defineMiddleware } from 'astro:middleware';
import { initializeRecovery, hasJobsNeedingRecovery, getRecoveryStatus } from './lib/recovery';
// Flag to track if recovery has been initialized
let recoveryInitialized = false;
let recoveryAttempted = false;
export const onRequest = defineMiddleware(async (context, next) => {
// Initialize recovery system only once when the server starts
// This is a fallback in case the startup script didn't run
if (!recoveryInitialized && !recoveryAttempted) {
recoveryAttempted = true;
try {
// Check if recovery is actually needed before attempting
const needsRecovery = await hasJobsNeedingRecovery();
if (needsRecovery) {
console.log('⚠️ Middleware detected jobs needing recovery (startup script may not have run)');
console.log('Attempting recovery from middleware...');
// Run recovery with a shorter timeout since this is during request handling
const recoveryResult = await Promise.race([
initializeRecovery({
skipIfRecentAttempt: true,
maxRetries: 2,
retryDelay: 3000,
}),
new Promise<boolean>((_, reject) => {
setTimeout(() => reject(new Error('Middleware recovery timeout')), 15000);
})
]);
if (recoveryResult) {
console.log('✅ Middleware recovery completed successfully');
} else {
console.log('⚠️ Middleware recovery completed with some issues');
}
} else {
console.log('✅ No recovery needed (startup script likely handled it)');
}
recoveryInitialized = true;
} catch (error) {
console.error('⚠️ Middleware recovery failed or timed out:', error);
console.log('Application will continue, but some jobs may remain interrupted');
// Log recovery status for debugging
const status = getRecoveryStatus();
console.log('Recovery status:', status);
recoveryInitialized = true; // Mark as attempted to avoid retries
}
}
// Continue with the request
return next();
});

src/pages/api/health.ts

@@ -2,6 +2,7 @@ import type { APIRoute } from "astro";
import { jsonResponse } from "@/lib/utils";
import { db } from "@/lib/db";
import { ENV } from "@/lib/config";
import { getRecoveryStatus, hasJobsNeedingRecovery } from "@/lib/recovery";
import os from "os";
import axios from "axios";
@@ -38,9 +39,20 @@ export const GET: APIRoute = async () => {
const currentVersion = process.env.npm_package_version || "unknown";
const latestVersion = await checkLatestVersion();
// Get recovery system status
const recoveryStatus = await getRecoverySystemStatus();
// Determine overall health status
let overallStatus = "ok";
if (!dbStatus.connected) {
overallStatus = "error";
} else if (recoveryStatus.jobsNeedingRecovery > 0 && !recoveryStatus.inProgress) {
overallStatus = "degraded";
}
// Build response
const healthData = {
status: overallStatus,
timestamp: new Date().toISOString(),
version: currentVersion,
latestVersion: latestVersion,
@@ -48,6 +60,7 @@ export const GET: APIRoute = async () => {
currentVersion !== "unknown" &&
latestVersion !== currentVersion,
database: dbStatus,
recovery: recoveryStatus,
system: systemInfo,
};
@@ -94,6 +107,36 @@ async function checkDatabaseConnection() {
}
}
/**
* Get recovery system status
*/
async function getRecoverySystemStatus() {
try {
const recoveryStatus = getRecoveryStatus();
const needsRecovery = await hasJobsNeedingRecovery();
return {
status: needsRecovery ? 'jobs-pending' : 'healthy',
inProgress: recoveryStatus.inProgress,
lastAttempt: recoveryStatus.lastAttempt?.toISOString() || null,
jobsNeedingRecovery: needsRecovery ? 1 : 0, // Simplified count for health check
message: needsRecovery
? 'Jobs found that need recovery'
: 'No jobs need recovery',
};
} catch (error) {
console.error('Recovery system status check failed:', error);
return {
status: 'error',
inProgress: false,
lastAttempt: null,
jobsNeedingRecovery: -1,
message: error instanceof Error ? error.message : 'Recovery status check failed',
};
}
}
/**
* Get server uptime information
*/