# Graceful Shutdown Process
This document details how the gitea-mirror application handles graceful shutdown during active mirroring operations, with specific focus on job interruption and recovery.
## Overview
The graceful shutdown system is designed for **fast, clean termination** without waiting for long-running jobs to complete. It prioritizes **quick shutdown times** (under 30 seconds) while **preserving all progress** for seamless recovery.
## Key Principle
**The application does NOT wait for jobs to finish before shutting down.** Instead, it saves the current state and resumes after restart.
## Shutdown Scenario Example
### Initial State
- **Job**: Mirror 500 repositories
- **Progress**: 200 repositories completed
- **Remaining**: 300 repositories pending
- **Action**: User initiates shutdown (SIGTERM, Ctrl+C, Docker stop)
### Shutdown Process (Under 30 seconds)
#### Step 1: Signal Detection (Immediate)
```
📡 Received SIGTERM signal
🛑 Graceful shutdown initiated by signal: SIGTERM
📊 Shutdown status: 1 active jobs, 2 callbacks
```
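As a rough sketch (not the project's exact implementation), signal handlers like these can be registered with `process.on`, flipping a module-level flag that the rest of the application can poll:
```typescript
// Hypothetical sketch: wire SIGTERM/SIGINT/SIGHUP to a shutdown routine.
let shuttingDown = false;

export function isShuttingDown(): boolean {
  return shuttingDown;
}

async function initiateShutdown(signal: string): Promise<void> {
  if (shuttingDown) return; // ignore repeated signals
  shuttingDown = true;
  console.log(`🛑 Graceful shutdown initiated by signal: ${signal}`);
  // ...save job states, run callbacks, close the database (steps below)...
  process.exit(0);
}

for (const signal of ["SIGTERM", "SIGINT", "SIGHUP"] as const) {
  process.on(signal, () => void initiateShutdown(signal));
}
```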
#### Step 2: Job State Saving (1-10 seconds)
```
📝 Step 1: Saving active job states...
Saving state for job abc-123...
✅ Saved state for job abc-123
```
**What gets saved:**
- `inProgress: false` - Mark job as not currently running
- `completedItems: 200` - Number of repos successfully mirrored
- `totalItems: 500` - Total repos in the job
- `completedItemIds: [repo1, repo2, ..., repo200]` - List of completed repos
- `itemIds: [repo1, repo2, ..., repo500]` - Full list of repos
- `lastCheckpoint: 2025-05-24T17:30:00Z` - Exact shutdown time
- `message: "Job interrupted by application shutdown - will resume on restart"`
- `status: "imported"` - Keeps status as resumable (not "failed")
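For the example job above, the persisted update would look roughly like this (field names mirror the list; `updateMirrorJob`, `completedIds`, and `allIds` are illustrative, not the project's actual identifiers):
```typescript
// Hypothetical sketch of the state written for an interrupted job.
await updateMirrorJob(job.id, {
  inProgress: false,              // no longer running
  completedItems: 200,            // repos already mirrored
  totalItems: 500,                // total repos in the job
  completedItemIds: completedIds, // ["repo1", ..., "repo200"]
  itemIds: allIds,                // full list of 500 repo IDs
  lastCheckpoint: new Date(),     // exact shutdown time
  message: "Job interrupted by application shutdown - will resume on restart",
  status: "imported",             // resumable, not "failed"
});
```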
#### Step 3: Service Cleanup (1-5 seconds)
```
🔧 Step 2: Executing shutdown callbacks...
🛑 Shutting down cleanup service...
✅ Cleanup service stopped
✅ Shutdown callback 1 completed
```
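One way such callbacks can be managed is a small registry that services add themselves to at startup; the names below (`onShutdown`, `cleanupService.stop`) are assumptions for illustration:
```typescript
// Hypothetical shutdown-callback registry.
type ShutdownCallback = () => Promise<void> | void;
const callbacks: ShutdownCallback[] = [];

export function onShutdown(cb: ShutdownCallback): void {
  callbacks.push(cb);
}

async function runShutdownCallbacks(): Promise<void> {
  for (const [index, cb] of callbacks.entries()) {
    await cb();
    console.log(`✅ Shutdown callback ${index + 1} completed`);
  }
}

// Example: the cleanup service registers its own stop routine.
declare const cleanupService: { stop(): Promise<void> }; // stub for the sketch
onShutdown(async () => {
  console.log("🛑 Shutting down cleanup service...");
  await cleanupService.stop();
  console.log("✅ Cleanup service stopped");
});
```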
#### Step 4: Clean Exit (Immediate)
```
💾 Step 3: Closing database connections...
✅ Graceful shutdown completed successfully
```
**Total shutdown time: ~15 seconds** (well under the 30-second limit)
## What Happens to the Remaining 300 Repos?
### During Shutdown
- **NOT processed** - The remaining 300 repos are not mirrored
- **NOT lost** - Their IDs are preserved in the job state
- **NOT marked as failed** - Job status remains "imported" for recovery
### After Restart
The recovery system automatically:
1. **Detects interrupted job** during startup
2. **Calculates remaining work**: 500 - 200 = 300 repos
3. **Extracts remaining repo IDs**: repos 201-500 from the original list
4. **Resumes processing** from exactly where it left off
5. **Continues until completion** of all 500 repos
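A hedged sketch of the remaining-work calculation, assuming the JSON columns shown in the schema below, a `job` row returned by the recovery query, and an illustrative `resumeJob` helper:
```typescript
// Hypothetical recovery sketch: derive the repos that still need mirroring.
const allIds: string[] = JSON.parse(job.itemIds);                  // all 500 repo IDs
const doneIds = new Set<string>(JSON.parse(job.completedItemIds)); // 200 completed IDs
const remainingIds = allIds.filter((id) => !doneIds.has(id));      // the other 300

console.log(`Resuming job ${job.id} with ${remainingIds.length} remaining items...`);
await resumeJob(job, remainingIds); // continue mirroring from where the job stopped
```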
## Timeout Configuration
### Shutdown Timeouts
```typescript
const SHUTDOWN_TIMEOUT = 30000; // 30 seconds max shutdown time
const JOB_SAVE_TIMEOUT = 10000; // 10 seconds to save job state
```
### Timeout Behavior
- **Normal case**: Shutdown completes in 10-20 seconds
- **Slow database**: Up to 30 seconds allowed
- **Timeout exceeded**: Force exit with code 1
- **Container kill**: Orchestrator should allow 45+ seconds grace period
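A common pattern for enforcing the hard limit is to race the shutdown work against a timer and force-exit on overrun; a minimal sketch, assuming the shutdown steps are exposed as a single async function:
```typescript
// Hypothetical sketch: bound total shutdown time and force-exit on overrun.
const SHUTDOWN_TIMEOUT = Number(process.env.SHUTDOWN_TIMEOUT ?? 30000);

async function shutdownWithTimeout(doShutdown: () => Promise<void>): Promise<void> {
  const timer = new Promise<"timeout">((resolve) =>
    setTimeout(() => resolve("timeout"), SHUTDOWN_TIMEOUT)
  );
  const result = await Promise.race([doShutdown().then(() => "done" as const), timer]);

  if (result === "timeout") {
    console.error("⏰ Shutdown timeout exceeded - forcing exit");
    process.exit(1); // force exit with code 1
  }
  console.log("✅ Graceful shutdown completed successfully");
  process.exit(0);
}
```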
## Job State Persistence
### Database Schema
The `mirror_jobs` table stores complete job state:
```sql
-- Job identification
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL,
job_type TEXT NOT NULL DEFAULT 'mirror',
-- Progress tracking
total_items INTEGER,
completed_items INTEGER DEFAULT 0,
item_ids TEXT, -- JSON array of all repo IDs
completed_item_ids TEXT DEFAULT '[]', -- JSON array of completed repo IDs
-- State management
in_progress INTEGER NOT NULL DEFAULT 0, -- Boolean: currently running
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_checkpoint TIMESTAMP, -- Last progress save
-- Status and messaging
status TEXT NOT NULL DEFAULT 'imported',
message TEXT NOT NULL
```
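Viewed from application code, a row of this table maps to roughly the following shape (an assumed TypeScript projection of the columns above, not the project's actual model):
```typescript
// Assumed TypeScript view of a mirror_jobs row.
interface MirrorJobRow {
  id: string;
  userId: string;
  jobType: string;            // 'mirror' by default
  totalItems: number | null;
  completedItems: number;     // 0 by default
  itemIds: string | null;     // JSON array of all repo IDs
  completedItemIds: string;   // JSON array of completed repo IDs, '[]' by default
  inProgress: boolean;        // stored as INTEGER 0/1
  startedAt: Date | null;
  completedAt: Date | null;
  lastCheckpoint: Date | null; // last progress save
  status: string;              // e.g. 'imported'
  message: string;
}
```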
### Recovery Query
The recovery system finds interrupted jobs:
```sql
SELECT * FROM mirror_jobs
WHERE in_progress = 0
  AND status = 'imported'
  AND completed_at IS NULL
  AND total_items > completed_items;
```
## Shutdown-Aware Processing
### Concurrency Check
During job execution, the shutdown flag is checked before each repository is processed:
```typescript
// Before processing each repository
if (isShuttingDown()) {
  throw new Error('Processing interrupted by application shutdown');
}
```
### Checkpoint Intervals
Jobs save progress periodically (every 10 repos by default):
```typescript
checkpointInterval: 10, // Save progress every 10 repositories
```
This ensures minimal work loss even if shutdown occurs between checkpoints.
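Putting the shutdown check and the checkpoint interval together, a processing loop can look roughly like this (the per-repo and persistence helpers are illustrative, not the project's actual API):
```typescript
// Hypothetical processing loop with shutdown checks and periodic checkpoints.
const checkpointInterval = 10; // save progress every 10 repositories

for (const repoId of remainingIds) {
  if (isShuttingDown()) {
    throw new Error("Processing interrupted by application shutdown");
  }

  await mirrorRepository(repoId); // illustrative per-repo mirror call
  completedIds.push(repoId);

  // Persist progress every `checkpointInterval` repos so little work is lost.
  if (completedIds.length % checkpointInterval === 0) {
    await saveCheckpoint(job.id, completedIds); // illustrative helper
  }
}
```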
## Container Integration
### Docker Entrypoint
The Docker entrypoint properly forwards signals:
```bash
# Set up signal handlers
trap 'shutdown_handler' TERM INT HUP
# Start application in background
bun ./dist/server/entry.mjs &
APP_PID=$!
# Wait for application to finish
wait "$APP_PID"
```
### Kubernetes Configuration
Recommended pod configuration:
```yaml
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 45  # Allow time for graceful shutdown
  containers:
    - name: gitea-mirror
      # ... other configuration
```
## Monitoring and Logging
### Shutdown Logs
```
🛑 Graceful shutdown initiated by signal: SIGTERM
📊 Shutdown status: 1 active jobs, 2 callbacks
📝 Step 1: Saving active job states...
Saving state for 1 active jobs...
✅ Completed saving all active jobs
🔧 Step 2: Executing shutdown callbacks...
✅ Completed all shutdown callbacks
💾 Step 3: Closing database connections...
✅ Graceful shutdown completed successfully
```
### Recovery Logs
```
⚠️ Jobs found that need recovery. Starting recovery process...
Resuming job abc-123 with 300 remaining items...
✅ Recovery completed successfully
```
## Best Practices
### For Operations
1. **Monitor shutdown times** - Should complete under 30 seconds
2. **Check recovery logs** - Verify jobs resume correctly after restart
3. **Set appropriate grace periods** - Allow 45+ seconds in orchestrators
4. **Plan maintenance windows** - Jobs will resume but may take time to complete
### For Development
1. **Test shutdown scenarios** - Use `bun run test-shutdown`
2. **Monitor job progress** - Check checkpoint frequency and timing
3. **Verify recovery** - Ensure interrupted jobs resume correctly
4. **Handle edge cases** - Test shutdown during different job phases
## Troubleshooting
### Shutdown Takes Too Long
- **Check**: Database performance during job state saving
- **Solution**: Increase the `SHUTDOWN_TIMEOUT` environment variable
- **Monitor**: Job complexity and checkpoint frequency
### Jobs Don't Resume
- **Check**: Recovery logs for errors during startup
- **Verify**: Database contains interrupted jobs with correct status
- **Test**: Run `bun run startup-recovery` manually
### Container Force-Killed
- **Check**: Container orchestrator termination grace period
- **Increase**: Grace period to 45+ seconds
- **Monitor**: Application shutdown completion time
This design ensures **production-ready graceful shutdown** with **zero data loss** and **fast recovery times** suitable for modern containerized deployments.