mirror of
https://github.com/RayLabsHQ/gitea-mirror.git
synced 2025-12-06 11:36:44 +03:00
- Added a startup recovery script to handle interrupted jobs before application startup. - Enhanced recovery system with database connection validation and stale job cleanup. - Improved middleware to check for recovery needs and handle recovery during requests. - Updated health check endpoint to include recovery system status and metrics. - Introduced test scripts for verifying recovery functionality and job state management. - Enhanced logging and error handling throughout the recovery process.
171 lines
6.2 KiB
Markdown
171 lines
6.2 KiB
Markdown
# Job Recovery and Resume Process Improvements
|
|
|
|
This document outlines the comprehensive improvements made to the job recovery and resume process to make it more robust to application restarts, container restarts, and application crashes.
|
|
|
|
## Problems Addressed
|
|
|
|
The original recovery system had several critical issues:
|
|
|
|
1. **Middleware-based initialization**: Recovery only ran when the first request came in
|
|
2. **Database connection issues**: No validation of database connectivity before recovery attempts
|
|
3. **Limited error handling**: Insufficient error handling for various failure scenarios
|
|
4. **No startup recovery**: No mechanism to handle recovery before serving requests
|
|
5. **Incomplete job state management**: Jobs could remain in inconsistent states
|
|
6. **No retry mechanisms**: Single-attempt recovery with no fallback strategies
|
|
|
|
## Improvements Implemented
|
|
|
|
### 1. Enhanced Recovery System (`src/lib/recovery.ts`)
|
|
|
|
#### New Features:
|
|
- **Database connection validation** before attempting recovery
|
|
- **Stale job cleanup** for jobs older than 24 hours
|
|
- **Retry mechanisms** with configurable attempts and delays
|
|
- **Individual job error handling** to prevent one failed job from stopping recovery
|
|
- **Recovery state tracking** to prevent concurrent recovery attempts
|
|
- **Enhanced logging** with detailed job information
|
|
|
|
#### Key Functions:
|
|
- `initializeRecovery()` - Main recovery function with enhanced error handling
|
|
- `validateDatabaseConnection()` - Ensures database is accessible
|
|
- `cleanupStaleJobs()` - Removes jobs that are too old to recover
|
|
- `getRecoveryStatus()` - Returns current recovery system status
|
|
- `forceRecovery()` - Bypasses recent attempt checks
|
|
- `hasJobsNeedingRecovery()` - Checks if recovery is needed
|
|
|
|
### 2. Startup Recovery Script (`scripts/startup-recovery.ts`)
|
|
|
|
A dedicated script that runs recovery before the application starts serving requests:
|
|
|
|
#### Features:
|
|
- **Timeout protection** (default: 30 seconds)
|
|
- **Force recovery option** to bypass recent attempt checks
|
|
- **Graceful signal handling** (SIGINT, SIGTERM)
|
|
- **Detailed logging** with progress indicators
|
|
- **Exit codes** for different scenarios (success, warnings, errors)
|
|
|
|
#### Usage:
|
|
```bash
|
|
bun scripts/startup-recovery.ts [--force] [--timeout=30000]
|
|
```
|
|
|
|
### 3. Improved Middleware (`src/middleware.ts`)
|
|
|
|
The middleware now serves as a fallback recovery mechanism:
|
|
|
|
#### Changes:
|
|
- **Checks if recovery is needed** before attempting
|
|
- **Shorter timeout** (15 seconds) for request-time recovery
|
|
- **Better error handling** with status logging
|
|
- **Prevents multiple attempts** with proper state tracking
|
|
|
|
### 4. Enhanced Database Queries (`src/lib/helpers.ts`)
|
|
|
|
#### Improvements:
|
|
- **Proper Drizzle ORM syntax** for all database queries
|
|
- **Enhanced interrupted job detection** with multiple criteria:
|
|
- Jobs with no recent checkpoint (10+ minutes)
|
|
- Jobs running too long (2+ hours)
|
|
- **Detailed logging** of found interrupted jobs
|
|
- **Better error handling** for database operations
|
|
|
|
### 5. Docker Integration (`docker-entrypoint.sh`)
|
|
|
|
#### Changes:
|
|
- **Automatic startup recovery** runs before application start
|
|
- **Exit code handling** with appropriate logging
|
|
- **Fallback mechanisms** if recovery script is not found
|
|
- **Non-blocking execution** - application starts even if recovery fails
|
|
|
|
### 6. Health Check Integration (`src/pages/api/health.ts`)
|
|
|
|
#### New Features:
|
|
- **Recovery system status** in health endpoint
|
|
- **Job recovery metrics** (jobs needing recovery, recovery in progress)
|
|
- **Overall health status** considers recovery state
|
|
- **Detailed recovery information** for monitoring
|
|
|
|
### 7. Testing Infrastructure (`scripts/test-recovery.ts`)
|
|
|
|
A comprehensive test script to verify recovery functionality:
|
|
|
|
#### Features:
|
|
- **Creates test interrupted jobs** with realistic scenarios
|
|
- **Verifies recovery detection** and execution
|
|
- **Checks final job states** after recovery
|
|
- **Cleanup functionality** for test data
|
|
- **Comprehensive logging** of test progress
|
|
|
|
## Configuration Options
|
|
|
|
### Recovery System Options:
|
|
- `maxRetries`: Number of recovery attempts (default: 3)
|
|
- `retryDelay`: Delay between attempts in ms (default: 5000)
|
|
- `skipIfRecentAttempt`: Skip if recent attempt made (default: true)
|
|
|
|
### Startup Recovery Options:
|
|
- `--force`: Force recovery even if recent attempt was made
|
|
- `--timeout`: Maximum time to wait for recovery (default: 30000ms)
|
|
|
|
## Usage Examples
|
|
|
|
### Manual Recovery:
|
|
```bash
|
|
# Run startup recovery
|
|
bun run startup-recovery
|
|
|
|
# Force recovery
|
|
bun run startup-recovery-force
|
|
|
|
# Test recovery system
|
|
bun run test-recovery
|
|
|
|
# Clean up test data
|
|
bun run test-recovery-cleanup
|
|
```
|
|
|
|
### Programmatic Usage:
|
|
```typescript
|
|
import { initializeRecovery, hasJobsNeedingRecovery } from '@/lib/recovery';
|
|
|
|
// Check if recovery is needed
|
|
const needsRecovery = await hasJobsNeedingRecovery();
|
|
|
|
// Run recovery with custom options
|
|
const success = await initializeRecovery({
|
|
maxRetries: 5,
|
|
retryDelay: 3000,
|
|
skipIfRecentAttempt: false
|
|
});
|
|
```
|
|
|
|
## Monitoring and Observability
|
|
|
|
### Health Check Endpoint:
|
|
- **URL**: `/api/health`
|
|
- **Recovery Status**: Included in response
|
|
- **Monitoring**: Can be used with external monitoring systems
|
|
|
|
### Log Messages:
|
|
- **Startup**: Clear indicators of recovery attempts and results
|
|
- **Progress**: Detailed logging of recovery steps
|
|
- **Errors**: Comprehensive error information for debugging
|
|
|
|
## Benefits
|
|
|
|
1. **Reliability**: Jobs are automatically recovered after application restarts
|
|
2. **Resilience**: Multiple retry mechanisms and fallback strategies
|
|
3. **Observability**: Comprehensive logging and health check integration
|
|
4. **Performance**: Efficient detection and processing of interrupted jobs
|
|
5. **Maintainability**: Clear separation of concerns and modular design
|
|
6. **Testing**: Built-in testing infrastructure for verification
|
|
|
|
## Migration Notes
|
|
|
|
- **Backward Compatible**: All existing functionality is preserved
|
|
- **Automatic**: Recovery runs automatically on startup
|
|
- **Configurable**: All timeouts and retry counts can be adjusted
|
|
- **Monitoring**: Health checks now include recovery status
|
|
|
|
This comprehensive improvement ensures that the gitea-mirror application can reliably handle job recovery in all deployment scenarios, from development to production container environments.
|