mirror of
https://github.com/RayLabsHQ/gitea-mirror.git
synced 2026-01-27 12:50:54 +03:00
feat: Implement graceful shutdown and enhanced job recovery
- Added shutdown handler in docker-entrypoint.sh to manage application termination signals. - Introduced shutdown manager to track active jobs and ensure state persistence during shutdown. - Enhanced cleanup service to support stopping and status retrieval. - Integrated signal handlers for proper response to termination signals (SIGTERM, SIGINT, SIGHUP). - Updated middleware to initialize shutdown manager and cleanup service. - Created integration tests for graceful shutdown functionality, verifying job state preservation and recovery. - Documented graceful shutdown process and configuration in GRACEFUL_SHUTDOWN.md and SHUTDOWN_PROCESS.md. - Added new scripts for testing shutdown behavior and cleanup.
This commit is contained in:
249
docs/GRACEFUL_SHUTDOWN.md
Normal file
249
docs/GRACEFUL_SHUTDOWN.md
Normal file
@@ -0,0 +1,249 @@
|
||||
# Graceful Shutdown and Enhanced Job Recovery
|
||||
|
||||
This document describes the graceful shutdown and enhanced job recovery capabilities implemented in gitea-mirror v2.8.0+.
|
||||
|
||||
## Overview
|
||||
|
||||
The gitea-mirror application now includes comprehensive graceful shutdown handling and enhanced job recovery mechanisms designed specifically for containerized environments. These features ensure:
|
||||
|
||||
- **No data loss** during container restarts or shutdowns
|
||||
- **Automatic job resumption** after application restarts
|
||||
- **Clean termination** of all active processes and connections
|
||||
- **Container-aware design** optimized for Docker/LXC deployments
|
||||
|
||||
## Features
|
||||
|
||||
### 1. Graceful Shutdown Manager
|
||||
|
||||
The shutdown manager (`src/lib/shutdown-manager.ts`) provides centralized coordination of application termination:
|
||||
|
||||
#### Key Capabilities:
|
||||
- **Active Job Tracking**: Monitors all running mirroring/sync jobs
|
||||
- **State Persistence**: Saves job progress to database before shutdown
|
||||
- **Callback System**: Allows services to register cleanup functions
|
||||
- **Timeout Protection**: Prevents hanging shutdowns with configurable timeouts
|
||||
- **Signal Coordination**: Works with signal handlers for proper container lifecycle
|
||||
|
||||
#### Configuration:
|
||||
- **Shutdown Timeout**: 30 seconds maximum (configurable)
|
||||
- **Job Save Timeout**: 10 seconds per job (configurable)
|
||||
|
||||
### 2. Signal Handlers
|
||||
|
||||
The signal handler system (`src/lib/signal-handlers.ts`) ensures proper response to container lifecycle events:
|
||||
|
||||
#### Supported Signals:
|
||||
- **SIGTERM**: Docker stop, Kubernetes pod termination
|
||||
- **SIGINT**: Ctrl+C, manual interruption
|
||||
- **SIGHUP**: Terminal hangup, service reload
|
||||
- **Uncaught Exceptions**: Emergency shutdown on critical errors
|
||||
- **Unhandled Rejections**: Graceful handling of promise failures
|
||||
|
||||
### 3. Enhanced Job Recovery
|
||||
|
||||
Building on the existing recovery system, new enhancements include:
|
||||
|
||||
#### Shutdown-Aware Processing:
|
||||
- Jobs check for shutdown signals during execution
|
||||
- Automatic state saving when shutdown is detected
|
||||
- Proper job status management (interrupted vs failed)
|
||||
|
||||
#### Container Integration:
|
||||
- Docker entrypoint script forwards signals correctly
|
||||
- Startup recovery runs before main application
|
||||
- Recovery timeouts prevent startup delays
|
||||
|
||||
## Usage
|
||||
|
||||
### Basic Operation
|
||||
|
||||
The graceful shutdown system is automatically initialized when the application starts. No manual configuration is required for basic operation.
|
||||
|
||||
### Testing
|
||||
|
||||
Test the graceful shutdown functionality:
|
||||
|
||||
```bash
|
||||
# Run the integration test
|
||||
bun run test-shutdown
|
||||
|
||||
# Clean up test data
|
||||
bun run test-shutdown-cleanup
|
||||
|
||||
# Run unit tests
|
||||
bun test src/lib/shutdown-manager.test.ts
|
||||
bun test src/lib/signal-handlers.test.ts
|
||||
```
|
||||
|
||||
### Manual Testing
|
||||
|
||||
1. **Start the application**:
|
||||
```bash
|
||||
bun run dev
|
||||
# or in production
|
||||
bun run start
|
||||
```
|
||||
|
||||
2. **Start a mirroring job** through the web interface
|
||||
|
||||
3. **Send shutdown signal**:
|
||||
```bash
|
||||
# Send SIGTERM (recommended)
|
||||
kill -TERM <process_id>
|
||||
|
||||
# Or use Ctrl+C for SIGINT
|
||||
```
|
||||
|
||||
4. **Verify job state** is saved and can be resumed on restart
|
||||
|
||||
### Container Testing
|
||||
|
||||
Test with Docker:
|
||||
|
||||
```bash
|
||||
# Build and run container
|
||||
docker build -t gitea-mirror .
|
||||
docker run -d --name test-shutdown gitea-mirror
|
||||
|
||||
# Start a job, then stop container
|
||||
docker stop test-shutdown
|
||||
|
||||
# Restart and verify recovery
|
||||
docker start test-shutdown
|
||||
docker logs test-shutdown
|
||||
```
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### Shutdown Flow
|
||||
|
||||
1. **Signal Reception**: Signal handlers detect termination request
|
||||
2. **Shutdown Initiation**: Shutdown manager begins graceful termination
|
||||
3. **Job State Saving**: All active jobs save current progress to database
|
||||
4. **Service Cleanup**: Registered callbacks stop background services
|
||||
5. **Connection Cleanup**: Database connections and resources are released
|
||||
6. **Process Termination**: Application exits with appropriate code
|
||||
|
||||
### Job State Management
|
||||
|
||||
During shutdown, active jobs are updated with:
|
||||
- `inProgress: false` - Mark as not currently running
|
||||
- `lastCheckpoint: <timestamp>` - Record shutdown time
|
||||
- `message: "Job interrupted by application shutdown - will resume on restart"`
|
||||
- Status remains as `"imported"` (not `"failed"`) to enable recovery
|
||||
|
||||
### Recovery Integration
|
||||
|
||||
The existing recovery system automatically detects and resumes interrupted jobs:
|
||||
- Jobs with `inProgress: false` and incomplete status are candidates for recovery
|
||||
- Recovery runs during application startup (before serving requests)
|
||||
- Jobs resume from their last checkpoint with remaining items
|
||||
|
||||
## Configuration
|
||||
|
||||
### Environment Variables
|
||||
|
||||
```bash
|
||||
# Optional: Adjust shutdown timeout (default: 30000ms)
|
||||
SHUTDOWN_TIMEOUT=30000
|
||||
|
||||
# Optional: Adjust job save timeout (default: 10000ms)
|
||||
JOB_SAVE_TIMEOUT=10000
|
||||
```
|
||||
|
||||
### Docker Configuration
|
||||
|
||||
The Docker entrypoint script includes proper signal handling:
|
||||
|
||||
```dockerfile
|
||||
# Signals are forwarded to the application process
|
||||
# SIGTERM is handled gracefully with 30-second timeout
|
||||
# Container stops cleanly without force-killing processes
|
||||
```
|
||||
|
||||
### Kubernetes Configuration
|
||||
|
||||
For Kubernetes deployments, configure appropriate termination grace period:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 45 # Allow time for graceful shutdown
|
||||
containers:
|
||||
- name: gitea-mirror
|
||||
# ... other configuration
|
||||
```
|
||||
|
||||
## Monitoring and Debugging
|
||||
|
||||
### Logs
|
||||
|
||||
The application provides detailed logging during shutdown:
|
||||
|
||||
```
|
||||
🛑 Graceful shutdown initiated by signal: SIGTERM
|
||||
📊 Shutdown status: 2 active jobs, 1 callbacks
|
||||
📝 Step 1: Saving active job states...
|
||||
Saving state for job abc-123...
|
||||
✅ Saved state for job abc-123
|
||||
🔧 Step 2: Executing shutdown callbacks...
|
||||
✅ Shutdown callback 1 completed
|
||||
💾 Step 3: Closing database connections...
|
||||
✅ Graceful shutdown completed successfully
|
||||
```
|
||||
|
||||
### Status Endpoints
|
||||
|
||||
Check shutdown manager status via API:
|
||||
|
||||
```bash
|
||||
# Get current status (if application is running)
|
||||
curl http://localhost:4321/api/health
|
||||
```
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
**Problem**: Jobs not resuming after restart
|
||||
- **Check**: Startup recovery logs for errors
|
||||
- **Verify**: Database contains interrupted jobs with correct status
|
||||
- **Test**: Run `bun run startup-recovery` manually
|
||||
|
||||
**Problem**: Shutdown timeout reached
|
||||
- **Check**: Job complexity and database performance
|
||||
- **Adjust**: Increase `SHUTDOWN_TIMEOUT` environment variable
|
||||
- **Monitor**: Database connection and disk I/O during shutdown
|
||||
|
||||
**Problem**: Container force-killed
|
||||
- **Check**: Container orchestrator termination grace period
|
||||
- **Adjust**: Increase grace period to allow shutdown completion
|
||||
- **Monitor**: Application shutdown logs for timing issues
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Development
|
||||
- Always test graceful shutdown during development
|
||||
- Use the provided test scripts to verify functionality
|
||||
- Monitor logs for shutdown timing and job state persistence
|
||||
|
||||
### Production
|
||||
- Set appropriate container termination grace periods
|
||||
- Monitor shutdown logs for performance issues
|
||||
- Use health checks to verify application readiness after restart
|
||||
- Consider job complexity when planning maintenance windows
|
||||
|
||||
### Monitoring
|
||||
- Track job recovery success rates
|
||||
- Monitor shutdown duration metrics
|
||||
- Alert on forced terminations or recovery failures
|
||||
- Log analysis for shutdown pattern optimization
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
Planned improvements for future versions:
|
||||
|
||||
1. **Configurable Timeouts**: Environment variable configuration for all timeouts
|
||||
2. **Shutdown Metrics**: Prometheus metrics for shutdown performance
|
||||
3. **Progressive Shutdown**: Graceful degradation of service capabilities
|
||||
4. **Job Prioritization**: Priority-based job saving during shutdown
|
||||
5. **Health Check Integration**: Readiness probes during shutdown process
|
||||
236
docs/SHUTDOWN_PROCESS.md
Normal file
236
docs/SHUTDOWN_PROCESS.md
Normal file
@@ -0,0 +1,236 @@
|
||||
# Graceful Shutdown Process
|
||||
|
||||
This document details how the gitea-mirror application handles graceful shutdown during active mirroring operations, with specific focus on job interruption and recovery.
|
||||
|
||||
## Overview
|
||||
|
||||
The graceful shutdown system is designed for **fast, clean termination** without waiting for long-running jobs to complete. It prioritizes **quick shutdown times** (under 30 seconds) while **preserving all progress** for seamless recovery.
|
||||
|
||||
## Key Principle
|
||||
|
||||
**The application does NOT wait for jobs to finish before shutting down.** Instead, it saves the current state and resumes after restart.
|
||||
|
||||
## Shutdown Scenario Example
|
||||
|
||||
### Initial State
|
||||
- **Job**: Mirror 500 repositories
|
||||
- **Progress**: 200 repositories completed
|
||||
- **Remaining**: 300 repositories pending
|
||||
- **Action**: User initiates shutdown (SIGTERM, Ctrl+C, Docker stop)
|
||||
|
||||
### Shutdown Process (Under 30 seconds)
|
||||
|
||||
#### Step 1: Signal Detection (Immediate)
|
||||
```
|
||||
📡 Received SIGTERM signal
|
||||
🛑 Graceful shutdown initiated by signal: SIGTERM
|
||||
📊 Shutdown status: 1 active jobs, 2 callbacks
|
||||
```
|
||||
|
||||
#### Step 2: Job State Saving (1-10 seconds)
|
||||
```
|
||||
📝 Step 1: Saving active job states...
|
||||
Saving state for job abc-123...
|
||||
✅ Saved state for job abc-123
|
||||
```
|
||||
|
||||
**What gets saved:**
|
||||
- `inProgress: false` - Mark job as not currently running
|
||||
- `completedItems: 200` - Number of repos successfully mirrored
|
||||
- `totalItems: 500` - Total repos in the job
|
||||
- `completedItemIds: [repo1, repo2, ..., repo200]` - List of completed repos
|
||||
- `itemIds: [repo1, repo2, ..., repo500]` - Full list of repos
|
||||
- `lastCheckpoint: 2025-05-24T17:30:00Z` - Exact shutdown time
|
||||
- `message: "Job interrupted by application shutdown - will resume on restart"`
|
||||
- `status: "imported"` - Keeps status as resumable (not "failed")
|
||||
|
||||
#### Step 3: Service Cleanup (1-5 seconds)
|
||||
```
|
||||
🔧 Step 2: Executing shutdown callbacks...
|
||||
🛑 Shutting down cleanup service...
|
||||
✅ Cleanup service stopped
|
||||
✅ Shutdown callback 1 completed
|
||||
```
|
||||
|
||||
#### Step 4: Clean Exit (Immediate)
|
||||
```
|
||||
💾 Step 3: Closing database connections...
|
||||
✅ Graceful shutdown completed successfully
|
||||
```
|
||||
|
||||
**Total shutdown time: ~15 seconds** (well under the 30-second limit)
|
||||
|
||||
## What Happens to the Remaining 300 Repos?
|
||||
|
||||
### During Shutdown
|
||||
- **NOT processed** - The remaining 300 repos are not mirrored
|
||||
- **NOT lost** - Their IDs are preserved in the job state
|
||||
- **NOT marked as failed** - Job status remains "imported" for recovery
|
||||
|
||||
### After Restart
|
||||
The recovery system automatically:
|
||||
|
||||
1. **Detects interrupted job** during startup
|
||||
2. **Calculates remaining work**: 500 - 200 = 300 repos
|
||||
3. **Extracts remaining repo IDs**: repos 201-500 from the original list
|
||||
4. **Resumes processing** from exactly where it left off
|
||||
5. **Continues until completion** of all 500 repos
|
||||
|
||||
## Timeout Configuration
|
||||
|
||||
### Shutdown Timeouts
|
||||
```typescript
|
||||
const SHUTDOWN_TIMEOUT = 30000; // 30 seconds max shutdown time
|
||||
const JOB_SAVE_TIMEOUT = 10000; // 10 seconds to save job state
|
||||
```
|
||||
|
||||
### Timeout Behavior
|
||||
- **Normal case**: Shutdown completes in 10-20 seconds
|
||||
- **Slow database**: Up to 30 seconds allowed
|
||||
- **Timeout exceeded**: Force exit with code 1
|
||||
- **Container kill**: Orchestrator should allow 45+ seconds grace period
|
||||
|
||||
## Job State Persistence
|
||||
|
||||
### Database Schema
|
||||
The `mirror_jobs` table stores complete job state:
|
||||
|
||||
```sql
|
||||
-- Job identification
|
||||
id TEXT PRIMARY KEY,
|
||||
user_id TEXT NOT NULL,
|
||||
job_type TEXT NOT NULL DEFAULT 'mirror',
|
||||
|
||||
-- Progress tracking
|
||||
total_items INTEGER,
|
||||
completed_items INTEGER DEFAULT 0,
|
||||
item_ids TEXT, -- JSON array of all repo IDs
|
||||
completed_item_ids TEXT DEFAULT '[]', -- JSON array of completed repo IDs
|
||||
|
||||
-- State management
|
||||
in_progress INTEGER NOT NULL DEFAULT 0, -- Boolean: currently running
|
||||
started_at TIMESTAMP,
|
||||
completed_at TIMESTAMP,
|
||||
last_checkpoint TIMESTAMP, -- Last progress save
|
||||
|
||||
-- Status and messaging
|
||||
status TEXT NOT NULL DEFAULT 'imported',
|
||||
message TEXT NOT NULL
|
||||
```
|
||||
|
||||
### Recovery Query
|
||||
The recovery system finds interrupted jobs:
|
||||
|
||||
```sql
|
||||
SELECT * FROM mirror_jobs
|
||||
WHERE in_progress = 0
|
||||
AND status = 'imported'
|
||||
AND completed_at IS NULL
|
||||
AND total_items > completed_items;
|
||||
```
|
||||
|
||||
## Shutdown-Aware Processing
|
||||
|
||||
### Concurrency Check
|
||||
During job execution, each repo processing checks for shutdown:
|
||||
|
||||
```typescript
|
||||
// Before processing each repository
|
||||
if (isShuttingDown()) {
|
||||
throw new Error('Processing interrupted by application shutdown');
|
||||
}
|
||||
```
|
||||
|
||||
### Checkpoint Intervals
|
||||
Jobs save progress periodically (every 10 repos by default):
|
||||
|
||||
```typescript
|
||||
checkpointInterval: 10, // Save progress every 10 repositories
|
||||
```
|
||||
|
||||
This ensures minimal work loss even if shutdown occurs between checkpoints.
|
||||
|
||||
## Container Integration
|
||||
|
||||
### Docker Entrypoint
|
||||
The Docker entrypoint properly forwards signals:
|
||||
|
||||
```bash
|
||||
# Set up signal handlers
|
||||
trap 'shutdown_handler' TERM INT HUP
|
||||
|
||||
# Start application in background
|
||||
bun ./dist/server/entry.mjs &
|
||||
APP_PID=$!
|
||||
|
||||
# Wait for application to finish
|
||||
wait "$APP_PID"
|
||||
```
|
||||
|
||||
### Kubernetes Configuration
|
||||
Recommended pod configuration:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 45 # Allow time for graceful shutdown
|
||||
containers:
|
||||
- name: gitea-mirror
|
||||
# ... other configuration
|
||||
```
|
||||
|
||||
## Monitoring and Logging
|
||||
|
||||
### Shutdown Logs
|
||||
```
|
||||
🛑 Graceful shutdown initiated by signal: SIGTERM
|
||||
📊 Shutdown status: 1 active jobs, 2 callbacks
|
||||
📝 Step 1: Saving active job states...
|
||||
Saving state for 1 active jobs...
|
||||
✅ Completed saving all active jobs
|
||||
🔧 Step 2: Executing shutdown callbacks...
|
||||
✅ Completed all shutdown callbacks
|
||||
💾 Step 3: Closing database connections...
|
||||
✅ Graceful shutdown completed successfully
|
||||
```
|
||||
|
||||
### Recovery Logs
|
||||
```
|
||||
⚠️ Jobs found that need recovery. Starting recovery process...
|
||||
Resuming job abc-123 with 300 remaining items...
|
||||
✅ Recovery completed successfully
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### For Operations
|
||||
1. **Monitor shutdown times** - Should complete under 30 seconds
|
||||
2. **Check recovery logs** - Verify jobs resume correctly after restart
|
||||
3. **Set appropriate grace periods** - Allow 45+ seconds in orchestrators
|
||||
4. **Plan maintenance windows** - Jobs will resume but may take time to complete
|
||||
|
||||
### For Development
|
||||
1. **Test shutdown scenarios** - Use `bun run test-shutdown`
|
||||
2. **Monitor job progress** - Check checkpoint frequency and timing
|
||||
3. **Verify recovery** - Ensure interrupted jobs resume correctly
|
||||
4. **Handle edge cases** - Test shutdown during different job phases
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Shutdown Takes Too Long
|
||||
- **Check**: Database performance during job state saving
|
||||
- **Solution**: Increase `SHUTDOWN_TIMEOUT` environment variable
|
||||
- **Monitor**: Job complexity and checkpoint frequency
|
||||
|
||||
### Jobs Don't Resume
|
||||
- **Check**: Recovery logs for errors during startup
|
||||
- **Verify**: Database contains interrupted jobs with correct status
|
||||
- **Test**: Run `bun run startup-recovery` manually
|
||||
|
||||
### Container Force-Killed
|
||||
- **Check**: Container orchestrator termination grace period
|
||||
- **Increase**: Grace period to 45+ seconds
|
||||
- **Monitor**: Application shutdown completion time
|
||||
|
||||
This design ensures **production-ready graceful shutdown** with **zero data loss** and **fast recovery times** suitable for modern containerized deployments.
|
||||
Reference in New Issue
Block a user