Graceful Shutdown Process
This document details how the gitea-mirror application handles graceful shutdown during active mirroring operations, with specific focus on job interruption and recovery.
Overview
The graceful shutdown system is designed for fast, clean termination without waiting for long-running jobs to complete. It prioritizes quick shutdown times (under 30 seconds) while preserving all progress for seamless recovery.
Key Principle
The application does NOT wait for jobs to finish before shutting down. Instead, it saves the current state and resumes after restart.
Shutdown Scenario Example
Initial State
- Job: Mirror 500 repositories
- Progress: 200 repositories completed
- Remaining: 300 repositories pending
- Action: User initiates shutdown (SIGTERM, Ctrl+C, Docker stop)
Shutdown Process (Under 30 seconds)
Step 1: Signal Detection (Immediate)
📡 Received SIGTERM signal
🛑 Graceful shutdown initiated by signal: SIGTERM
📊 Shutdown status: 1 active jobs, 2 callbacks
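How these signals are wired up is application code; the following is a minimal TypeScript sketch, assuming a hypothetical gracefulShutdown function and a module-level shuttingDown flag (the names are illustrative, not the project's actual API):
// Illustrative sketch of signal handling - the real shutdown manager may differ
let shuttingDown = false;

export function isShuttingDown(): boolean {
  return shuttingDown;
}

export function installSignalHandlers(gracefulShutdown: (signal: string) => Promise<void>) {
  for (const signal of ["SIGTERM", "SIGINT", "SIGHUP"] as const) {
    process.on(signal, () => {
      if (shuttingDown) return; // ignore repeated signals during shutdown
      shuttingDown = true;
      console.log(`📡 Received ${signal} signal`);
      console.log(`🛑 Graceful shutdown initiated by signal: ${signal}`);
      gracefulShutdown(signal)
        .then(() => process.exit(0))
        .catch(() => process.exit(1));
    });
  }
}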
Step 2: Job State Saving (1-10 seconds)
📝 Step 1: Saving active job states...
Saving state for job abc-123...
✅ Saved state for job abc-123
What gets saved:
- inProgress: false - Mark job as not currently running
- completedItems: 200 - Number of repos successfully mirrored
- totalItems: 500 - Total repos in the job
- completedItemIds: [repo1, repo2, ..., repo200] - List of completed repos
- itemIds: [repo1, repo2, ..., repo500] - Full list of repos
- lastCheckpoint: 2025-05-24T17:30:00Z - Exact shutdown time
- message: "Job interrupted by application shutdown - will resume on restart"
- status: "imported" - Keeps status as resumable (not "failed")
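A minimal sketch of what that save could look like against the mirror_jobs schema shown later in this document, assuming direct access through bun:sqlite (the project's actual data layer may use a different abstraction, and the database path is illustrative):
import { Database } from "bun:sqlite";

const db = new Database("data/gitea-mirror.db"); // illustrative path

// Persist an interrupted job's progress so it can be resumed after restart
export function saveJobState(jobId: string, completedItemIds: string[]): void {
  db.query(
    `UPDATE mirror_jobs
     SET in_progress = 0,
         completed_items = $completed,
         completed_item_ids = $completedIds,
         last_checkpoint = CURRENT_TIMESTAMP,
         message = 'Job interrupted by application shutdown - will resume on restart',
         status = 'imported'
     WHERE id = $id`
  ).run({
    $completed: completedItemIds.length,
    $completedIds: JSON.stringify(completedItemIds),
    $id: jobId,
  });
}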
Step 3: Service Cleanup (1-5 seconds)
🔧 Step 2: Executing shutdown callbacks...
🛑 Shutting down cleanup service...
✅ Cleanup service stopped
✅ Shutdown callback 1 completed
Step 4: Clean Exit (Immediate)
💾 Step 3: Closing database connections...
✅ Graceful shutdown completed successfully
Total shutdown time: ~15 seconds (well under the 30-second limit)
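The callbacks run in Step 3 are registered ahead of time by long-lived services, for example when the middleware initializes the cleanup service. A minimal sketch, where registerShutdownCallback and cleanupService.stop() are hypothetical names:
// Illustrative sketch - the real shutdown manager and cleanup service APIs may differ
type ShutdownCallback = () => Promise<void>;
const shutdownCallbacks: ShutdownCallback[] = [];

export function registerShutdownCallback(callback: ShutdownCallback): void {
  shutdownCallbacks.push(callback);
}

export async function runShutdownCallbacks(): Promise<void> {
  for (const [index, callback] of shutdownCallbacks.entries()) {
    await callback();
    console.log(`✅ Shutdown callback ${index + 1} completed`);
  }
}

// Example registration by the cleanup service (stop() is a hypothetical method)
declare const cleanupService: { stop(): Promise<void> };
registerShutdownCallback(async () => {
  console.log("🛑 Shutting down cleanup service...");
  await cleanupService.stop();
  console.log("✅ Cleanup service stopped");
});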
What Happens to the Remaining 300 Repos?
During Shutdown
- NOT processed - The remaining 300 repos are not mirrored
- NOT lost - Their IDs are preserved in the job state
- NOT marked as failed - Job status remains "imported" for recovery
After Restart
The recovery system automatically (see the code sketch after this list):
- Detects interrupted job during startup
- Calculates remaining work: 500 - 200 = 300 repos
- Extracts remaining repo IDs: repos 201-500 from the original list
- Resumes processing from exactly where it left off
- Continues until completion of all 500 repos
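A minimal sketch of the remaining-work calculation, assuming the item_ids and completed_item_ids columns are read back as JSON text (the row shape and function name are illustrative):
// Given a recovered mirror_jobs row, work out which repositories still need mirroring
export function getRemainingItemIds(job: {
  item_ids: string | null;
  completed_item_ids: string | null;
}): string[] {
  const allIds: string[] = JSON.parse(job.item_ids ?? "[]");
  const completed = new Set<string>(JSON.parse(job.completed_item_ids ?? "[]"));
  // e.g. 500 total - 200 completed => 300 remaining
  return allIds.filter((id) => !completed.has(id));
}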
Timeout Configuration
Shutdown Timeouts
const SHUTDOWN_TIMEOUT = 30000; // 30 seconds max shutdown time
const JOB_SAVE_TIMEOUT = 10000; // 10 seconds to save job state
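A minimal sketch of how the hard limit could be enforced around the shutdown sequence using the SHUTDOWN_TIMEOUT constant above (illustrative; the project's actual implementation may differ):
// Force-exit if the graceful path exceeds SHUTDOWN_TIMEOUT
export async function shutdownWithTimeout(gracefulShutdown: () => Promise<void>): Promise<void> {
  const timer = setTimeout(() => {
    console.error("⏰ Shutdown timeout exceeded - forcing exit");
    process.exit(1);
  }, SHUTDOWN_TIMEOUT);

  try {
    await gracefulShutdown();
    clearTimeout(timer);
    process.exit(0);
  } catch (error) {
    console.error("Graceful shutdown failed:", error);
    process.exit(1);
  }
}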
Timeout Behavior
- Normal case: Shutdown completes in 10-20 seconds
- Slow database: Up to 30 seconds allowed
- Timeout exceeded: Force exit with code 1
- Container kill: Orchestrator should allow 45+ seconds grace period
Job State Persistence
Database Schema
The mirror_jobs table stores complete job state:
-- Job identification
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL,
job_type TEXT NOT NULL DEFAULT 'mirror',
-- Progress tracking
total_items INTEGER,
completed_items INTEGER DEFAULT 0,
item_ids TEXT, -- JSON array of all repo IDs
completed_item_ids TEXT DEFAULT '[]', -- JSON array of completed repo IDs
-- State management
in_progress INTEGER NOT NULL DEFAULT 0, -- Boolean: currently running
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_checkpoint TIMESTAMP, -- Last progress save
-- Status and messaging
status TEXT NOT NULL DEFAULT 'imported',
message TEXT NOT NULL
Recovery Query
The recovery system finds interrupted jobs:
SELECT * FROM mirror_jobs
WHERE in_progress = 0
AND status = 'imported'
AND completed_at IS NULL
AND total_items > completed_items;
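A minimal sketch of a startup routine that runs this query and resumes each interrupted job, assuming bun:sqlite and a hypothetical resumeJob entry point:
import { Database } from "bun:sqlite";

const db = new Database("data/gitea-mirror.db"); // illustrative path

declare function resumeJob(jobId: string): Promise<void>; // hypothetical resume entry point

interface InterruptedJob {
  id: string;
  total_items: number;
  completed_items: number;
}

export async function recoverInterruptedJobs(): Promise<void> {
  const interrupted = db.query(
    `SELECT * FROM mirror_jobs
     WHERE in_progress = 0
       AND status = 'imported'
       AND completed_at IS NULL
       AND total_items > completed_items`
  ).all() as InterruptedJob[];

  for (const job of interrupted) {
    const remaining = job.total_items - job.completed_items;
    console.log(`Resuming job ${job.id} with ${remaining} remaining items...`);
    await resumeJob(job.id);
  }
}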
Shutdown-Aware Processing
Concurrency Check
During job execution, each repo processing checks for shutdown:
// Before processing each repository
if (isShuttingDown()) {
  throw new Error('Processing interrupted by application shutdown');
}
Checkpoint Intervals
Jobs save progress periodically (every 10 repos by default):
checkpointInterval: 10, // Save progress every 10 repositories
This ensures minimal work loss even if shutdown occurs between checkpoints.
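A minimal sketch of a processing loop that honours both the checkpoint interval and the shutdown flag (mirrorRepository, saveCheckpoint, and isShuttingDown are stand-ins for the application's real helpers):
// Stand-in declarations for the application's real helpers
declare function isShuttingDown(): boolean;
declare function mirrorRepository(repoId: string): Promise<void>;
declare function saveCheckpoint(jobId: string, completedItemIds: string[]): Promise<void>;

export async function processJob(jobId: string, remainingIds: string[], completedIds: string[]): Promise<void> {
  for (const repoId of remainingIds) {
    // Stop immediately if a shutdown signal has been received
    if (isShuttingDown()) {
      throw new Error("Processing interrupted by application shutdown");
    }

    await mirrorRepository(repoId);
    completedIds.push(repoId);

    // Persist a checkpoint every 10 repositories so little work is lost
    if (completedIds.length % 10 === 0) {
      await saveCheckpoint(jobId, completedIds);
    }
  }

  await saveCheckpoint(jobId, completedIds); // final checkpoint on completion
}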
Container Integration
Docker Entrypoint
The Docker entrypoint properly forwards signals:
# Set up signal handlers
trap 'shutdown_handler' TERM INT HUP
# Start application in background
bun ./dist/server/entry.mjs &
APP_PID=$!
# Wait for application to finish
wait "$APP_PID"
Kubernetes Configuration
Recommended pod configuration:
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 45  # Allow time for graceful shutdown
  containers:
    - name: gitea-mirror
      # ... other configuration
Monitoring and Logging
Shutdown Logs
🛑 Graceful shutdown initiated by signal: SIGTERM
📊 Shutdown status: 1 active jobs, 2 callbacks
📝 Step 1: Saving active job states...
Saving state for 1 active jobs...
✅ Completed saving all active jobs
🔧 Step 2: Executing shutdown callbacks...
✅ Completed all shutdown callbacks
💾 Step 3: Closing database connections...
✅ Graceful shutdown completed successfully
Recovery Logs
⚠️ Jobs found that need recovery. Starting recovery process...
Resuming job abc-123 with 300 remaining items...
✅ Recovery completed successfully
Best Practices
For Operations
- Monitor shutdown times - Should complete under 30 seconds
- Check recovery logs - Verify jobs resume correctly after restart
- Set appropriate grace periods - Allow 45+ seconds in orchestrators
- Plan maintenance windows - Jobs will resume but may take time to complete
For Development
- Test shutdown scenarios - Use bun run test-shutdown
- Monitor job progress - Check checkpoint frequency and timing
- Verify recovery - Ensure interrupted jobs resume correctly
- Handle edge cases - Test shutdown during different job phases
Troubleshooting
Shutdown Takes Too Long
- Check: Database performance during job state saving
- Solution: Increase the SHUTDOWN_TIMEOUT environment variable
- Monitor: Job complexity and checkpoint frequency
Jobs Don't Resume
- Check: Recovery logs for errors during startup
- Verify: Database contains interrupted jobs with correct status
- Test: Run bun run startup-recovery manually
Container Force-Killed
- Check: Container orchestrator termination grace period
- Increase: Grace period to 45+ seconds
- Monitor: Application shutdown completion time
This design ensures production-ready graceful shutdown with zero data loss and fast recovery times suitable for modern containerized deployments.