feat: Implement graceful shutdown and enhanced job recovery

- Added shutdown handler in docker-entrypoint.sh to manage application termination signals.
- Introduced shutdown manager to track active jobs and ensure state persistence during shutdown.
- Enhanced cleanup service to support stopping and status retrieval.
- Integrated signal handlers for proper response to termination signals (SIGTERM, SIGINT, SIGHUP).
- Updated middleware to initialize shutdown manager and cleanup service.
- Created integration tests for graceful shutdown functionality, verifying job state preservation and recovery.
- Documented graceful shutdown process and configuration in GRACEFUL_SHUTDOWN.md and SHUTDOWN_PROCESS.md.
- Added new scripts for testing shutdown behavior and cleanup.
2026-03-13 22:12:54 +03:00 · 2025-05-24 23:06:28 +05:30
parent 4404af7d40
commit daf4ab6a93
10 changed files with 1243 additions and 12 deletions
--- a/docs/GRACEFUL_SHUTDOWN.md
+++ b/docs/GRACEFUL_SHUTDOWN.md
@@ -0,0 +1,249 @@
+# Graceful Shutdown and Enhanced Job Recovery
+
+This document describes the graceful shutdown and enhanced job recovery capabilities implemented in gitea-mirror v2.8.0+.
+
+## Overview
+
+The gitea-mirror application now includes comprehensive graceful shutdown handling and enhanced job recovery mechanisms designed specifically for containerized environments. These features ensure:
+
+- **No data loss** during container restarts or shutdowns
+- **Automatic job resumption** after application restarts
+- **Clean termination** of all active processes and connections
+- **Container-aware design** optimized for Docker/LXC deployments
+
+## Features
+
+### 1. Graceful Shutdown Manager
+
+The shutdown manager (`src/lib/shutdown-manager.ts`) provides centralized coordination of application termination:
+
+#### Key Capabilities:
+- **Active Job Tracking**: Monitors all running mirroring/sync jobs
+- **State Persistence**: Saves job progress to database before shutdown
+- **Callback System**: Allows services to register cleanup functions
+- **Timeout Protection**: Prevents hanging shutdowns with configurable timeouts
+- **Signal Coordination**: Works with signal handlers for proper container lifecycle
+
+#### Configuration:
+- **Shutdown Timeout**: 30 seconds maximum (configurable)
+- **Job Save Timeout**: 10 seconds per job (configurable)
+
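As a rough illustration of the job tracking and callback system listed above, the following minimal sketch keeps a registry of active jobs and cleanup callbacks. The names `ShutdownRegistry`, `registerJob`, `registerCallback`, and `getStatus` are assumptions made for this example and are not taken from `src/lib/shutdown-manager.ts`.

```typescript
// Hypothetical sketch: class and method names are assumptions, not the project's API.
type ShutdownCallback = () => Promise<void> | void;

interface ShutdownStatus {
  isShuttingDown: boolean;
  activeJobs: string[];
  callbackCount: number;
}

class ShutdownRegistry {
  private readonly activeJobs = new Set<string>();
  private readonly callbacks: ShutdownCallback[] = [];
  private shuttingDown = false;

  // Track a job so its state can be saved if shutdown starts while it runs.
  registerJob(jobId: string): void {
    this.activeJobs.add(jobId);
  }

  // Stop tracking a job once it has finished normally.
  unregisterJob(jobId: string): void {
    this.activeJobs.delete(jobId);
  }

  // Let a background service register its own cleanup function.
  registerCallback(callback: ShutdownCallback): void {
    this.callbacks.push(callback);
  }

  beginShutdown(): void {
    this.shuttingDown = true;
  }

  getStatus(): ShutdownStatus {
    return {
      isShuttingDown: this.shuttingDown,
      activeJobs: Array.from(this.activeJobs),
      callbackCount: this.callbacks.length,
    };
  }
}

const registry = new ShutdownRegistry();
registry.registerJob("abc-123");
registry.registerJob("def-456");
registry.registerCallback(() => console.log("cleanup service stopped"));

const registryStatus = registry.getStatus();
console.log(
  `Shutdown status: ${registryStatus.activeJobs.length} active jobs, ${registryStatus.callbackCount} callbacks`,
);
```

Keeping the registry free of I/O makes it easy to unit test; saving job state and running the callbacks would be layered on top of it.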
+### 2. Signal Handlers
+
+The signal handler system (`src/lib/signal-handlers.ts`) ensures proper response to container lifecycle events:
+
+#### Supported Signals:
+- **SIGTERM**: Docker stop, Kubernetes pod termination
+- **SIGINT**: Ctrl+C, manual interruption
+- **SIGHUP**: Terminal hangup, service reload
+- **Uncaught Exceptions**: Emergency shutdown on critical errors
+- **Unhandled Rejections**: Graceful handling of promise failures
+
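One way the three termination signals could share a single shutdown path is sketched below. `SignalSource`, `installSignalHandlers`, and `FakeSignalSource` are hypothetical names; `SignalSource` stands in for the runtime's `process` object so that the example runs without sending real signals.

```typescript
// Hypothetical sketch: none of these names come from src/lib/signal-handlers.ts.
type TerminationSignal = "SIGTERM" | "SIGINT" | "SIGHUP";

interface SignalSource {
  on(signal: TerminationSignal, listener: () => void): void;
}

function installSignalHandlers(
  source: SignalSource,
  onShutdown: (signal: TerminationSignal) => void,
): void {
  let triggered = false; // ignore repeated signals once shutdown has started
  const signals: TerminationSignal[] = ["SIGTERM", "SIGINT", "SIGHUP"];
  for (const signal of signals) {
    source.on(signal, () => {
      if (triggered) return;
      triggered = true;
      onShutdown(signal);
    });
  }
}

// A tiny in-memory source used to exercise the wiring without real signals.
class FakeSignalSource implements SignalSource {
  private readonly listeners = new Map<TerminationSignal, Array<() => void>>();

  on(signal: TerminationSignal, listener: () => void): void {
    const existing = this.listeners.get(signal) ?? [];
    existing.push(listener);
    this.listeners.set(signal, existing);
  }

  emit(signal: TerminationSignal): void {
    for (const listener of this.listeners.get(signal) ?? []) listener();
  }
}

const fakeSource = new FakeSignalSource();
const receivedSignals: TerminationSignal[] = [];
installSignalHandlers(fakeSource, (signal) => receivedSignals.push(signal));

fakeSource.emit("SIGTERM");
fakeSource.emit("SIGINT"); // ignored: shutdown is already in progress
console.log(receivedSignals);
```

In a running application the source would typically be the `process` object, which exposes `on(signal, listener)` in both Node.js and Bun. The `triggered` guard keeps a second signal from starting a second shutdown sequence.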
+### 3. Enhanced Job Recovery
+
+Building on the existing recovery system, new enhancements include:
+
+#### Shutdown-Aware Processing:
+- Jobs check for shutdown signals during execution
+- Automatic state saving when shutdown is detected
+- Proper job status management (interrupted vs failed)
+
+#### Container Integration:
+- Docker entrypoint script forwards signals correctly
+- Startup recovery runs before main application
+- Recovery timeouts prevent startup delays
+
+## Usage
+
+### Basic Operation
+
+The graceful shutdown system is automatically initialized when the application starts. No manual configuration is required for basic operation.
+
+### Testing
+
+Test the graceful shutdown functionality:
+
+```bash
+# Run the integration test
+bun run test-shutdown
+
+# Clean up test data
+bun run test-shutdown-cleanup
+
+# Run unit tests
+bun test src/lib/shutdown-manager.test.ts
+bun test src/lib/signal-handlers.test.ts
+```
+
+### Manual Testing
+
+1. **Start the application**:
+   ```bash
+   bun run dev
+   # or in production
+   bun run start
+   ```
+
+2. **Start a mirroring job** through the web interface
+
+3. **Send shutdown signal**:
+   ```bash
+   # Send SIGTERM (recommended)
+   kill -TERM <process_id>
+   
+   # Or use Ctrl+C for SIGINT
+   ```
+
+4. **Verify** that the job state is saved and that the job resumes on restart
+
+### Container Testing
+
+Test with Docker:
+
+```bash
+# Build and run container
+docker build -t gitea-mirror .
+docker run -d --name test-shutdown gitea-mirror
+
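+# Start a job, then stop the container. docker stop waits only 10 seconds by
+# default before sending SIGKILL, so allow more than the 30-second shutdown timeout.
+docker stop -t 45 test-shutdown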
+
+# Restart and verify recovery
+docker start test-shutdown
+docker logs test-shutdown
+```
+
+## Implementation Details
+
+### Shutdown Flow
+
+1. **Signal Reception**: Signal handlers detect termination request
+2. **Shutdown Initiation**: Shutdown manager begins graceful termination
+3. **Job State Saving**: All active jobs save current progress to database
+4. **Service Cleanup**: Registered callbacks stop background services
+5. **Connection Cleanup**: Database connections and resources are released
+6. **Process Termination**: Application exits with appropriate code
+
+### Job State Management
+
+During shutdown, active jobs are updated with:
+- `inProgress: false` - Mark as not currently running
+- `lastCheckpoint: <timestamp>` - Record shutdown time
+- `message: "Job interrupted by application shutdown - will resume on restart"`
+- Status remains `"imported"` (not `"failed"`) to enable recovery
+
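The update described above can be sketched as a pure function. The `MirrorJobState` shape and the `markInterrupted` name are assumptions for illustration; only the field names and the message text come from this document.

```typescript
// Hypothetical job shape built from the fields listed above; the real schema may differ.
interface MirrorJobState {
  id: string;
  status: string;
  inProgress: boolean;
  lastCheckpoint: Date | null;
  message: string;
}

const INTERRUPTED_MESSAGE =
  "Job interrupted by application shutdown - will resume on restart";

// Returns a copy of the job marked as interrupted. `status` is deliberately
// left untouched so the job stays resumable instead of becoming "failed".
function markInterrupted(job: MirrorJobState, now: Date): MirrorJobState {
  return {
    ...job,
    inProgress: false,
    lastCheckpoint: now,
    message: INTERRUPTED_MESSAGE,
  };
}

const runningJob: MirrorJobState = {
  id: "abc-123",
  status: "imported",
  inProgress: true,
  lastCheckpoint: null,
  message: "Mirroring repositories",
};

const interruptedJob = markInterrupted(runningJob, new Date("2025-05-24T17:30:00Z"));
console.log(interruptedJob.inProgress, interruptedJob.status);
```

Returning a copy rather than mutating the input keeps the original record available for comparison if the database write fails.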
+### Recovery Integration
+
+The existing recovery system automatically detects and resumes interrupted jobs:
+- Jobs with `inProgress: false`, status `"imported"`, no completion time, and fewer completed items than total items are candidates for recovery
+- Recovery runs during application startup (before serving requests)
+- Jobs resume from their last checkpoint with remaining items
+
+## Configuration
+
+### Environment Variables
+
+```bash
+# Optional: Adjust shutdown timeout (default: 30000ms)
+SHUTDOWN_TIMEOUT=30000
+
+# Optional: Adjust job save timeout (default: 10000ms)
+JOB_SAVE_TIMEOUT=10000
+```
+
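How the application parses these variables is not shown here. The sketch below is one plausible approach: it falls back to the documented default whenever a value is missing or invalid. `readTimeoutMs` and `EnvMap` are hypothetical names.

```typescript
// Hypothetical helper: the application's actual parsing logic may differ.
type EnvMap = Record<string, string | undefined>;

function readTimeoutMs(env: EnvMap, key: string, fallbackMs: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallbackMs;
  const parsed = Number(raw);
  // Reject NaN, Infinity, zero, and negative values instead of silently using them.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

const exampleEnv: EnvMap = { SHUTDOWN_TIMEOUT: "45000", JOB_SAVE_TIMEOUT: "abc" };
const shutdownTimeoutMs = readTimeoutMs(exampleEnv, "SHUTDOWN_TIMEOUT", 30000);
const jobSaveTimeoutMs = readTimeoutMs(exampleEnv, "JOB_SAVE_TIMEOUT", 10000);
console.log(shutdownTimeoutMs, jobSaveTimeoutMs); // prints: 45000 10000
```

Passing the environment in as a plain object, rather than reading `process.env` inside the helper, keeps the function testable.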
+### Docker Configuration
+
+The Docker entrypoint script includes proper signal handling:
+
+```dockerfile
+# Signals are forwarded to the application process
+# SIGTERM is handled gracefully with 30-second timeout
+# Container stops cleanly without force-killing processes
+```
+
+### Kubernetes Configuration
+
+For Kubernetes deployments, configure an appropriate termination grace period:
+
+```yaml
+apiVersion: v1
+kind: Pod
+spec:
+  terminationGracePeriodSeconds: 45  # Allow time for graceful shutdown
+  containers:
+  - name: gitea-mirror
+    # ... other configuration
+```
+
+## Monitoring and Debugging
+
+### Logs
+
+The application provides detailed logging during shutdown:
+
+```
+🛑 Graceful shutdown initiated by signal: SIGTERM
+📊 Shutdown status: 2 active jobs, 1 callbacks
+📝 Step 1: Saving active job states...
+Saving state for job abc-123...
+✅ Saved state for job abc-123
+🔧 Step 2: Executing shutdown callbacks...
+✅ Shutdown callback 1 completed
+💾 Step 3: Closing database connections...
+✅ Graceful shutdown completed successfully
+```
+
+### Status Endpoints
+
+Check shutdown manager status via API:
+
+```bash
+# Get current status (if application is running)
+curl http://localhost:4321/api/health
+```
+
+### Troubleshooting
+
+**Problem**: Jobs not resuming after restart
+- **Check**: Startup recovery logs for errors
+- **Verify**: Database contains interrupted jobs with correct status
+- **Test**: Run `bun run startup-recovery` manually
+
+**Problem**: Shutdown timeout reached
+- **Check**: Job complexity and database performance
+- **Adjust**: Increase `SHUTDOWN_TIMEOUT` environment variable
+- **Monitor**: Database connection and disk I/O during shutdown
+
+**Problem**: Container force-killed
+- **Check**: Container orchestrator termination grace period
+- **Adjust**: Increase grace period to allow shutdown completion
+- **Monitor**: Application shutdown logs for timing issues
+
+## Best Practices
+
+### Development
+- Always test graceful shutdown during development
+- Use the provided test scripts to verify functionality
+- Monitor logs for shutdown timing and job state persistence
+
+### Production
+- Set appropriate container termination grace periods
+- Monitor shutdown logs for performance issues
+- Use health checks to verify application readiness after restart
+- Consider job complexity when planning maintenance windows
+
+### Monitoring
+- Track job recovery success rates
+- Monitor shutdown duration metrics
+- Alert on forced terminations or recovery failures
+- Analyze logs to identify and optimize shutdown patterns
+
+## Future Enhancements
+
+Planned improvements for future versions:
+
+1. **Configurable Timeouts**: Environment variable configuration for all timeouts
+2. **Shutdown Metrics**: Prometheus metrics for shutdown performance
+3. **Progressive Shutdown**: Graceful degradation of service capabilities
+4. **Job Prioritization**: Priority-based job saving during shutdown
+5. **Health Check Integration**: Readiness probes during shutdown process
--- a/docs/SHUTDOWN_PROCESS.md
+++ b/docs/SHUTDOWN_PROCESS.md
@@ -0,0 +1,236 @@
+# Graceful Shutdown Process
+
+This document details how the gitea-mirror application handles graceful shutdown during active mirroring operations, with specific focus on job interruption and recovery.
+
+## Overview
+
+The graceful shutdown system is designed for **fast, clean termination** without waiting for long-running jobs to complete. It prioritizes **quick shutdown times** (under 30 seconds) while **preserving all progress** for seamless recovery.
+
+## Key Principle
+
+**The application does NOT wait for jobs to finish before shutting down.** Instead, it saves each job's current state and resumes the job after restart.
+
+## Shutdown Scenario Example
+
+### Initial State
+- **Job**: Mirror 500 repositories
+- **Progress**: 200 repositories completed
+- **Remaining**: 300 repositories pending
+- **Action**: User initiates shutdown (SIGTERM, Ctrl+C, Docker stop)
+
+### Shutdown Process (Under 30 seconds)
+
+#### Step 1: Signal Detection (Immediate)
+```
+📡 Received SIGTERM signal
+🛑 Graceful shutdown initiated by signal: SIGTERM
+📊 Shutdown status: 1 active jobs, 2 callbacks
+```
+
+#### Step 2: Job State Saving (1-10 seconds)
+```
+📝 Step 1: Saving active job states...
+Saving state for job abc-123...
+✅ Saved state for job abc-123
+```
+
+**What gets saved:**
+- `inProgress: false` - Mark job as not currently running
+- `completedItems: 200` - Number of repos successfully mirrored
+- `totalItems: 500` - Total repos in the job
+- `completedItemIds: [repo1, repo2, ..., repo200]` - List of completed repos
+- `itemIds: [repo1, repo2, ..., repo500]` - Full list of repos
+- `lastCheckpoint: 2025-05-24T17:30:00Z` - Exact shutdown time
+- `message: "Job interrupted by application shutdown - will resume on restart"`
+- `status: "imported"` - Keeps status as resumable (not "failed")
+
+#### Step 3: Service Cleanup (1-5 seconds)
+```
+🔧 Step 2: Executing shutdown callbacks...
+🛑 Shutting down cleanup service...
+✅ Cleanup service stopped
+✅ Shutdown callback 1 completed
+```
+
+#### Step 4: Clean Exit (Immediate)
+```
+💾 Step 3: Closing database connections...
+✅ Graceful shutdown completed successfully
+```
+
+**Total shutdown time: ~15 seconds** (well under the 30-second limit)
+
+## What Happens to the Remaining 300 Repos?
+
+### During Shutdown
+- **NOT processed** - The remaining 300 repos are not mirrored
+- **NOT lost** - Their IDs are preserved in the job state
+- **NOT marked as failed** - Job status remains "imported" for recovery
+
+### After Restart
+The recovery system automatically:
+
+1. **Detects interrupted job** during startup
+2. **Calculates remaining work**: 500 - 200 = 300 repos
+3. **Extracts remaining repo IDs**: repos 201-500 from the original list
+4. **Resumes processing** from exactly where it left off
+5. **Continues until completion** of all 500 repos
+
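The remaining-work calculation in steps 2 and 3 can be sketched as a set difference between the full item list and the completed list. `remainingItemIds` is a hypothetical helper, and the repo IDs below only rebuild the 500-repository scenario from this document.

```typescript
// Hypothetical helper: computes which items still need processing after a restart.
function remainingItemIds(itemIds: string[], completedItemIds: string[]): string[] {
  const done = new Set(completedItemIds);
  // Preserve the original order while dropping everything already completed.
  return itemIds.filter((id) => !done.has(id));
}

// Rebuild the scenario: 500 repos in total, the first 200 already mirrored.
const allRepoIds = Array.from({ length: 500 }, (_, i) => `repo${i + 1}`);
const completedRepoIds = allRepoIds.slice(0, 200);

const pendingRepoIds = remainingItemIds(allRepoIds, completedRepoIds);
console.log(pendingRepoIds.length, pendingRepoIds[0], pendingRepoIds[299]);
// prints: 300 repo201 repo500
```

Filtering by ID instead of slicing by count stays correct even when repositories finish out of order, for example under concurrent processing.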
+## Timeout Configuration
+
+### Shutdown Timeouts
+```typescript
+const SHUTDOWN_TIMEOUT = 30000; // 30 seconds max shutdown time
+const JOB_SAVE_TIMEOUT = 10000; // 10 seconds to save job state
+```
+
+### Timeout Behavior
+- **Normal case**: Shutdown completes in 10-20 seconds
+- **Slow database**: Up to 30 seconds allowed
+- **Timeout exceeded**: Force exit with code 1
+- **Container kill**: The orchestrator should allow a grace period of 45+ seconds
+
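A common way to enforce limits like these is to race the work against a timer. The sketch below shows that general technique; `withTimeout` is a hypothetical helper, not necessarily how the shutdown manager applies `SHUTDOWN_TIMEOUT` or `JOB_SAVE_TIMEOUT`.

```typescript
// Hypothetical helper: rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever promise settles first wins; the timer is cleared either way
  // so that it cannot keep the process alive after the work has finished.
  return Promise.race([work, timeout]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (error) => {
      clearTimeout(timer);
      throw error;
    },
  );
}

// Simulate a job save that takes 50 ms against a 10 ms limit.
const slowJobSave = new Promise<string>((resolve) => {
  setTimeout(() => resolve("saved"), 50);
});

withTimeout(slowJobSave, 10, "job save").catch((error: Error) => {
  console.log(error.message); // prints: job save timed out after 10ms
});
```

Note that a timeout of this kind only stops waiting; it does not cancel the underlying work, which is why a forced exit is still needed when the overall limit is exceeded.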
+## Job State Persistence
+
+### Database Schema
+The `mirror_jobs` table stores complete job state:
+
+```sql
+-- Job identification
+id TEXT PRIMARY KEY,
+user_id TEXT NOT NULL,
+job_type TEXT NOT NULL DEFAULT 'mirror',
+
+-- Progress tracking  
+total_items INTEGER,
+completed_items INTEGER DEFAULT 0,
+item_ids TEXT, -- JSON array of all repo IDs
+completed_item_ids TEXT DEFAULT '[]', -- JSON array of completed repo IDs
+
+-- State management
+in_progress INTEGER NOT NULL DEFAULT 0, -- Boolean: currently running
+started_at TIMESTAMP,
+completed_at TIMESTAMP,
+last_checkpoint TIMESTAMP, -- Last progress save
+
+-- Status and messaging
+status TEXT NOT NULL DEFAULT 'imported',
+message TEXT NOT NULL
+```
+
+### Recovery Query
+The recovery system finds interrupted jobs:
+
+```sql
+SELECT * FROM mirror_jobs 
+WHERE in_progress = 0 
+  AND status = 'imported' 
+  AND completed_at IS NULL
+  AND total_items > completed_items;
+```
+
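The same filter can be expressed as an in-memory predicate, which may be useful when unit testing recovery logic without a database. `MirrorJobRow` and `isRecoveryCandidate` are hypothetical names; the column names mirror the schema above.

```typescript
// Hypothetical row type that mirrors the columns used by the recovery query.
interface MirrorJobRow {
  in_progress: number; // integer-encoded boolean: 0 or 1
  status: string;
  completed_at: string | null;
  total_items: number | null;
  completed_items: number;
}

// Applies the same conditions as the WHERE clause. A NULL total_items never
// satisfies `total_items > completed_items` in SQL, so it is excluded here too.
function isRecoveryCandidate(row: MirrorJobRow): boolean {
  return (
    row.in_progress === 0 &&
    row.status === "imported" &&
    row.completed_at === null &&
    row.total_items !== null &&
    row.total_items > row.completed_items
  );
}

const interruptedRow: MirrorJobRow = {
  in_progress: 0,
  status: "imported",
  completed_at: null,
  total_items: 500,
  completed_items: 200,
};

const finishedRow: MirrorJobRow = {
  in_progress: 0,
  status: "imported",
  completed_at: "2025-05-24T18:00:00Z",
  total_items: 500,
  completed_items: 500,
};

console.log(isRecoveryCandidate(interruptedRow), isRecoveryCandidate(finishedRow));
// prints: true false
```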
+## Shutdown-Aware Processing
+
+### Concurrency Check
+During job execution, the processor checks for shutdown before handling each repository:
+
+```typescript
+// Before processing each repository
+if (isShuttingDown()) {
+  throw new Error('Processing interrupted by application shutdown');
+}
+```
+
+### Checkpoint Intervals
+Jobs save progress periodically (every 10 repos by default):
+
+```typescript
+checkpointInterval: 10, // Save progress every 10 repositories
+```
+
+This ensures minimal work loss even if shutdown occurs between checkpoints.
+
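To show how the shutdown check and the checkpoint interval interact, here is a simplified, synchronous simulation. `processWithCheckpoints` is a hypothetical function; unlike the snippet above, it returns an `interrupted` flag instead of throwing, purely to keep the result easy to inspect.

```typescript
// Hypothetical simulation: real processing would be asynchronous and would
// write checkpoints to the database instead of recording them in an array.
interface ProcessResult {
  processed: string[];
  checkpoints: number[]; // completed-item counts at which progress was saved
  interrupted: boolean;
}

function processWithCheckpoints(
  repoIds: string[],
  isShuttingDown: () => boolean,
  checkpointInterval: number,
): ProcessResult {
  const processed: string[] = [];
  const checkpoints: number[] = [];
  for (const repoId of repoIds) {
    // Check before starting each repository, never in the middle of one.
    if (isShuttingDown()) {
      return { processed, checkpoints, interrupted: true };
    }
    processed.push(repoId); // stand-in for the real mirroring work
    if (processed.length % checkpointInterval === 0) {
      checkpoints.push(processed.length);
    }
  }
  return { processed, checkpoints, interrupted: false };
}

// Simulate a shutdown request that arrives after 27 repositories.
const simulatedRepos = Array.from({ length: 50 }, (_, i) => `repo${i + 1}`);
let checksSoFar = 0;
const simulatedRun = processWithCheckpoints(simulatedRepos, () => ++checksSoFar > 27, 10);

console.log(simulatedRun.processed.length, simulatedRun.checkpoints, simulatedRun.interrupted);
```

In this run, 27 repositories are processed and checkpoints exist at 10 and 20. Repositories 21 through 27 fall between checkpoints, so they are the ones that depend on the state save performed during shutdown.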
+## Container Integration
+
+### Docker Entrypoint
+The Docker entrypoint properly forwards signals:
+
+```bash
+# Set up signal handlers
+trap 'shutdown_handler' TERM INT HUP
+
+# Start application in background
+bun ./dist/server/entry.mjs &
+APP_PID=$!
+
+# Wait for application to finish
+wait "$APP_PID"
+```
+
+### Kubernetes Configuration
+Recommended pod configuration:
+
+```yaml
+apiVersion: v1
+kind: Pod
+spec:
+  terminationGracePeriodSeconds: 45  # Allow time for graceful shutdown
+  containers:
+  - name: gitea-mirror
+    # ... other configuration
+```
+
+## Monitoring and Logging
+
+### Shutdown Logs
+```
+🛑 Graceful shutdown initiated by signal: SIGTERM
+📊 Shutdown status: 1 active jobs, 2 callbacks
+📝 Step 1: Saving active job states...
+Saving state for 1 active jobs...
+✅ Completed saving all active jobs
+🔧 Step 2: Executing shutdown callbacks...
+✅ Completed all shutdown callbacks  
+💾 Step 3: Closing database connections...
+✅ Graceful shutdown completed successfully
+```
+
+### Recovery Logs
+```
+⚠️  Jobs found that need recovery. Starting recovery process...
+Resuming job abc-123 with 300 remaining items...
+✅ Recovery completed successfully
+```
+
+## Best Practices
+
+### For Operations
+1. **Monitor shutdown times** - Shutdown should complete in under 30 seconds
+2. **Check recovery logs** - Verify jobs resume correctly after restart
+3. **Set appropriate grace periods** - Allow 45+ seconds in orchestrators
+4. **Plan maintenance windows** - Jobs will resume but may take time to complete
+
+### For Development
+1. **Test shutdown scenarios** - Use `bun run test-shutdown`
+2. **Monitor job progress** - Check checkpoint frequency and timing
+3. **Verify recovery** - Ensure interrupted jobs resume correctly
+4. **Handle edge cases** - Test shutdown during different job phases
+
+## Troubleshooting
+
+### Shutdown Takes Too Long
+- **Check**: Database performance during job state saving
+- **Solution**: Increase `SHUTDOWN_TIMEOUT` environment variable
+- **Monitor**: Job complexity and checkpoint frequency
+
+### Jobs Don't Resume
+- **Check**: Recovery logs for errors during startup
+- **Verify**: Database contains interrupted jobs with correct status
+- **Test**: Run `bun run startup-recovery` manually
+
+### Container Force-Killed
+- **Check**: Container orchestrator termination grace period
+- **Increase**: Grace period to 45+ seconds
+- **Monitor**: Application shutdown completion time
+
+This design provides **graceful shutdown** that **preserves saved job progress** and allows **fast recovery**, which makes it suitable for containerized deployments.