Graceful Shutdown Process
This document details how the gitea-mirror application handles graceful shutdown during active mirroring operations, with specific focus on job interruption and recovery.
Overview
The graceful shutdown system is designed for fast, clean termination without waiting for long-running jobs to complete. It prioritizes quick shutdown times (under 30 seconds) while preserving all progress for seamless recovery.
Key Principle
The application does NOT wait for jobs to finish before shutting down. Instead, it saves the current state and resumes after restart.
Shutdown Scenario Example
Initial State
- Job: Mirror 500 repositories
- Progress: 200 repositories completed
- Remaining: 300 repositories pending
- Action: User initiates shutdown (SIGTERM, Ctrl+C, Docker stop)
Shutdown Process (Under 30 seconds)
Step 1: Signal Detection (Immediate)
📡 Received SIGTERM signal
🛑 Graceful shutdown initiated by signal: SIGTERM
📊 Shutdown status: 1 active jobs, 2 callbacks
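How these signals are wired up is application code; the following is a minimal TypeScript sketch, assuming a hypothetical gracefulShutdown function and a module-level shuttingDown flag (the names are illustrative, not the project's actual API):
// Illustrative sketch of signal handling - the real shutdown manager may differ
let shuttingDown = false;

export function isShuttingDown(): boolean {
  return shuttingDown;
}

export function installSignalHandlers(gracefulShutdown: (signal: string) => Promise<void>) {
  for (const signal of ["SIGTERM", "SIGINT", "SIGHUP"] as const) {
    process.on(signal, () => {
      if (shuttingDown) return; // ignore repeated signals during shutdown
      shuttingDown = true;
      console.log(`📡 Received ${signal} signal`);
      console.log(`🛑 Graceful shutdown initiated by signal: ${signal}`);
      gracefulShutdown(signal)
        .then(() => process.exit(0))
        .catch(() => process.exit(1));
    });
  }
}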
Step 2: Job State Saving (1-10 seconds)
📝 Step 1: Saving active job states...
Saving state for job abc-123...
✅ Saved state for job abc-123
What gets saved:
- inProgress: false - Mark job as not currently running
- completedItems: 200 - Number of repos successfully mirrored
- totalItems: 500 - Total repos in the job
- completedItemIds: [repo1, repo2, ..., repo200] - List of completed repos
- itemIds: [repo1, repo2, ..., repo500] - Full list of repos
- lastCheckpoint: 2025-05-24T17:30:00Z - Exact shutdown time
- message: "Job interrupted by application shutdown - will resume on restart"
- status: "imported" - Keeps status as resumable (not "failed")
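A minimal sketch of what that save could look like against the mirror_jobs schema shown later in this document, assuming direct access through bun:sqlite (the project's actual data layer may use a different abstraction, and the database path is illustrative):
import { Database } from "bun:sqlite";

const db = new Database("data/gitea-mirror.db"); // illustrative path

// Persist an interrupted job's progress so it can be resumed after restart
export function saveJobState(jobId: string, completedItemIds: string[]): void {
  db.query(
    `UPDATE mirror_jobs
     SET in_progress = 0,
         completed_items = $completed,
         completed_item_ids = $completedIds,
         last_checkpoint = CURRENT_TIMESTAMP,
         message = 'Job interrupted by application shutdown - will resume on restart',
         status = 'imported'
     WHERE id = $id`
  ).run({
    $completed: completedItemIds.length,
    $completedIds: JSON.stringify(completedItemIds),
    $id: jobId,
  });
}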
Step 3: Service Cleanup (1-5 seconds)
🔧 Step 2: Executing shutdown callbacks...
🛑 Shutting down cleanup service...
✅ Cleanup service stopped
✅ Shutdown callback 1 completed
Step 4: Clean Exit (Immediate)
💾 Step 3: Closing database connections...
✅ Graceful shutdown completed successfully
Total shutdown time: ~15 seconds (well under the 30-second limit)
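The callbacks run in Step 3 are registered ahead of time by long-lived services, for example when the middleware initializes the cleanup service. A minimal sketch, where registerShutdownCallback and cleanupService.stop() are hypothetical names:
// Illustrative sketch - the real shutdown manager and cleanup service APIs may differ
type ShutdownCallback = () => Promise<void>;
const shutdownCallbacks: ShutdownCallback[] = [];

export function registerShutdownCallback(callback: ShutdownCallback): void {
  shutdownCallbacks.push(callback);
}

export async function runShutdownCallbacks(): Promise<void> {
  for (const [index, callback] of shutdownCallbacks.entries()) {
    await callback();
    console.log(`✅ Shutdown callback ${index + 1} completed`);
  }
}

// Example registration by the cleanup service (stop() is a hypothetical method)
declare const cleanupService: { stop(): Promise<void> };
registerShutdownCallback(async () => {
  console.log("🛑 Shutting down cleanup service...");
  await cleanupService.stop();
  console.log("✅ Cleanup service stopped");
});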
What Happens to the Remaining 300 Repos?
During Shutdown
- NOT processed - The remaining 300 repos are not mirrored
- NOT lost - Their IDs are preserved in the job state
- NOT marked as failed - Job status remains "imported" for recovery
After Restart
The recovery system automatically (see the code sketch after this list):
- Detects interrupted job during startup
- Calculates remaining work: 500 - 200 = 300 repos
- Extracts remaining repo IDs: repos 201-500 from the original list
- Resumes processing from exactly where it left off
- Continues until completion of all 500 repos
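A minimal sketch of the remaining-work calculation, assuming the item_ids and completed_item_ids columns are read back as JSON text (the row shape and function name are illustrative):
// Given a recovered mirror_jobs row, work out which repositories still need mirroring
export function getRemainingItemIds(job: {
  item_ids: string | null;
  completed_item_ids: string | null;
}): string[] {
  const allIds: string[] = JSON.parse(job.item_ids ?? "[]");
  const completed = new Set<string>(JSON.parse(job.completed_item_ids ?? "[]"));
  // e.g. 500 total - 200 completed => 300 remaining
  return allIds.filter((id) => !completed.has(id));
}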
Timeout Configuration
Shutdown Timeouts
const SHUTDOWN_TIMEOUT = 30000; // 30 seconds max shutdown time
const JOB_SAVE_TIMEOUT = 10000; // 10 seconds to save job state
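A minimal sketch of how the hard limit could be enforced around the shutdown sequence using the SHUTDOWN_TIMEOUT constant above (illustrative; the project's actual implementation may differ):
// Force-exit if the graceful path exceeds SHUTDOWN_TIMEOUT
export async function shutdownWithTimeout(gracefulShutdown: () => Promise<void>): Promise<void> {
  const timer = setTimeout(() => {
    console.error("⏰ Shutdown timeout exceeded - forcing exit");
    process.exit(1);
  }, SHUTDOWN_TIMEOUT);

  try {
    await gracefulShutdown();
    clearTimeout(timer);
    process.exit(0);
  } catch (error) {
    console.error("Graceful shutdown failed:", error);
    process.exit(1);
  }
}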
Timeout Behavior
- Normal case: Shutdown completes in 10-20 seconds
- Slow database: Up to 30 seconds allowed
- Timeout exceeded: Force exit with code 1
- Container kill: Orchestrator should allow 45+ seconds grace period
Job State Persistence
Database Schema
The mirror_jobs table stores complete job state:
-- Job identification
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL,
job_type TEXT NOT NULL DEFAULT 'mirror',
-- Progress tracking
total_items INTEGER,
completed_items INTEGER DEFAULT 0,
item_ids TEXT, -- JSON array of all repo IDs
completed_item_ids TEXT DEFAULT '[]', -- JSON array of completed repo IDs
-- State management
in_progress INTEGER NOT NULL DEFAULT 0, -- Boolean: currently running
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_checkpoint TIMESTAMP, -- Last progress save
-- Status and messaging
status TEXT NOT NULL DEFAULT 'imported',
message TEXT NOT NULL
Recovery Query
The recovery system finds interrupted jobs:
SELECT * FROM mirror_jobs
WHERE in_progress = 0
AND status = 'imported'
AND completed_at IS NULL
AND total_items > completed_items;
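A minimal sketch of a startup routine that runs this query and resumes each interrupted job, assuming bun:sqlite and a hypothetical resumeJob entry point:
import { Database } from "bun:sqlite";

const db = new Database("data/gitea-mirror.db"); // illustrative path

declare function resumeJob(jobId: string): Promise<void>; // hypothetical resume entry point

interface InterruptedJob {
  id: string;
  total_items: number;
  completed_items: number;
}

export async function recoverInterruptedJobs(): Promise<void> {
  const interrupted = db.query(
    `SELECT * FROM mirror_jobs
     WHERE in_progress = 0
       AND status = 'imported'
       AND completed_at IS NULL
       AND total_items > completed_items`
  ).all() as InterruptedJob[];

  for (const job of interrupted) {
    const remaining = job.total_items - job.completed_items;
    console.log(`Resuming job ${job.id} with ${remaining} remaining items...`);
    await resumeJob(job.id);
  }
}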
Shutdown-Aware Processing
Concurrency Check
During job execution, each repo processing checks for shutdown:
// Before processing each repository
if (isShuttingDown()) {
  throw new Error('Processing interrupted by application shutdown');
}
Checkpoint Intervals
Jobs save progress periodically (every 10 repos by default):
checkpointInterval: 10, // Save progress every 10 repositories
This ensures minimal work loss even if shutdown occurs between checkpoints.
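A minimal sketch of a processing loop that honours both the checkpoint interval and the shutdown flag (mirrorRepository, saveCheckpoint, and isShuttingDown are stand-ins for the application's real helpers):
// Stand-in declarations for the application's real helpers
declare function isShuttingDown(): boolean;
declare function mirrorRepository(repoId: string): Promise<void>;
declare function saveCheckpoint(jobId: string, completedItemIds: string[]): Promise<void>;

export async function processJob(jobId: string, remainingIds: string[], completedIds: string[]): Promise<void> {
  for (const repoId of remainingIds) {
    // Stop immediately if a shutdown signal has been received
    if (isShuttingDown()) {
      throw new Error("Processing interrupted by application shutdown");
    }

    await mirrorRepository(repoId);
    completedIds.push(repoId);

    // Persist a checkpoint every 10 repositories so little work is lost
    if (completedIds.length % 10 === 0) {
      await saveCheckpoint(jobId, completedIds);
    }
  }

  await saveCheckpoint(jobId, completedIds); // final checkpoint on completion
}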
Container Integration
Docker Entrypoint
The Docker entrypoint properly forwards signals:
# Set up signal handlers
trap 'shutdown_handler' TERM INT HUP
# Start application in background
bun ./dist/server/entry.mjs &
APP_PID=$!
# Wait for application to finish
wait "$APP_PID"
Kubernetes Configuration
Recommended pod configuration:
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 45  # Allow time for graceful shutdown
  containers:
    - name: gitea-mirror
      # ... other configuration
Monitoring and Logging
Shutdown Logs
🛑 Graceful shutdown initiated by signal: SIGTERM
📊 Shutdown status: 1 active jobs, 2 callbacks
📝 Step 1: Saving active job states...
Saving state for 1 active jobs...
✅ Completed saving all active jobs
🔧 Step 2: Executing shutdown callbacks...
✅ Completed all shutdown callbacks
💾 Step 3: Closing database connections...
✅ Graceful shutdown completed successfully
Recovery Logs
⚠️ Jobs found that need recovery. Starting recovery process...
Resuming job abc-123 with 300 remaining items...
✅ Recovery completed successfully
Best Practices
For Operations
- Monitor shutdown times - Should complete under 30 seconds
- Check recovery logs - Verify jobs resume correctly after restart
- Set appropriate grace periods - Allow 45+ seconds in orchestrators
- Plan maintenance windows - Jobs will resume but may take time to complete
For Development
- Test shutdown scenarios - Use bun run test-shutdown
- Monitor job progress - Check checkpoint frequency and timing
- Verify recovery - Ensure interrupted jobs resume correctly
- Handle edge cases - Test shutdown during different job phases
Troubleshooting
Shutdown Takes Too Long
- Check: Database performance during job state saving
- Solution: Increase the SHUTDOWN_TIMEOUT environment variable
- Monitor: Job complexity and checkpoint frequency
Jobs Don't Resume
- Check: Recovery logs for errors during startup
- Verify: Database contains interrupted jobs with correct status
- Test: Run bun run startup-recovery manually
Container Force-Killed
- Check: Container orchestrator termination grace period
- Increase: Grace period to 45+ seconds
- Monitor: Application shutdown completion time
This design ensures production-ready graceful shutdown with zero data loss and fast recovery times suitable for modern containerized deployments.