diff --git a/.github/SELF-HEALING-README.md b/.github/SELF-HEALING-README.md new file mode 100644 index 000000000..fe66f1979 --- /dev/null +++ b/.github/SELF-HEALING-README.md @@ -0,0 +1,208 @@ +# Self-Healing CI/CD Pipeline + +This repository includes a self-healing CI/CD pipeline that automatically monitors, analyzes, and responds to workflow failures. + +## Overview + +The self-healing agent: + +1. **Monitors** all workflow runs for failures +2. **Analyzes** failure logs to determine root cause +3. **Classifies** failures into categories (code, workflow, infrastructure, quality gate) +4. **Diagnoses** issues with specific error details and recommendations +5. **Takes action** based on configuration (assist mode or auto-fix mode) + +## Files + +``` +.github/ +├── self-healing-config.yml # Configuration file +├── workflows/ +│ └── self-healing.yml # Main workflow that monitors failures +└── scripts/ + ├── analyze-failure.js # Failure analysis script + └── apply-fix.js # Auto-fix script +``` + +## Configuration + +Edit `.github/self-healing-config.yml` to customize behavior: + +### Operating Modes + +- **`assist`** (default): Creates issues/comments with diagnosis and proposed fixes. Does not commit changes automatically. +- **`auto-fix`**: Automatically creates PRs with fixes for certain failure types. 
+ +```yaml +mode: "assist" # or "auto-fix" +``` + +### Failure Classifications + +The agent classifies failures into four categories: + +| Classification | Description | Auto-retry | Auto-fix | +|---------------|-------------|------------|----------| +| `code` | Test failures, compilation errors, linting issues | No | No | +| `workflow` | YAML issues, action versions, permissions | No | Yes* | +| `infrastructure` | Timeouts, rate limits, network issues | Yes | No | +| `quality_gate` | SonarQube, coverage thresholds | No | No | + +\* Only in `auto-fix` mode + +### Retry Configuration + +```yaml +retry: + enabled: true + max_attempts: 2 + delay_minutes: 1 + auto_retry_types: + - "infrastructure" +``` + +### Guardrails + +```yaml +guardrails: + max_prs_per_day: 5 + max_issues_per_day: 10 + require_approval: true + protected_files: + - ".github/self-healing-config.yml" + - "CODEOWNERS" + max_lines_changed: 50 +``` + +## How It Works + +### 1. Trigger + +The self-healing workflow triggers on `workflow_run` events when any workflow completes with a failure: + +```yaml +on: + workflow_run: + workflows: ["*"] + types: [completed] +``` + +### 2. Analysis + +When a failure is detected: + +1. Downloads the failed job logs via GitHub CLI +2. Extracts error messages and failed steps +3. Matches error patterns to classify the failure +4. Generates a diagnosis with recommendations + +### 3. Actions + +Based on classification and configuration: + +| Classification | Assist Mode | Auto-Fix Mode | +|---------------|-------------|---------------| +| Code | Create issue | Create issue | +| Workflow | Create issue | Create PR with fix | +| Infrastructure | Retry + Issue if persists | Retry + Issue | +| Quality Gate | Create issue | Create issue | + +### 4. Issue Creation + +Issues are created with: +- Failure classification +- Failed job/step details +- Key error lines from logs +- Specific recommendations +- Links to workflow run + +### 5. 
PR Creation (Auto-Fix Mode) + +For workflow issues, the agent can automatically: +- Update deprecated action versions +- Add missing permissions +- Remove invalid inputs + +## Supported Auto-Fixes + +| Issue | Fix Applied | +|-------|------------| +| Deprecated `actions/checkout@v2/v3` | Update to `v4` | +| Deprecated `actions/setup-node@v2/v3` | Update to `v4` | +| Deprecated `actions/setup-java@v2/v3` | Update to `v4` | +| Missing permissions | Add permissions block | +| Unexpected action inputs | Remove invalid inputs | + +## Labels + +The agent uses these labels: +- `ci-failure` - All CI failure issues +- `self-healing` - Issues created by self-healing agent +- `auto-fix` - PRs created automatically + +## Customization + +### Adding New Failure Patterns + +Edit `self-healing-config.yml`: + +```yaml +classification: + code: + patterns: + - "your-custom-pattern" + - "another-pattern" +``` + +### Adding New Auto-Fixes + +Edit `.github/scripts/apply-fix.js` to add new fix patterns: + +```javascript +const fixPatterns = { + 'my-fix': { + patterns: [/my error pattern/i], + apply: myFixFunction + } +}; +``` + +## Permissions Required + +The self-healing workflow requires: + +```yaml +permissions: + contents: write # For creating branches/commits + pull-requests: write # For creating PRs + issues: write # For creating issues + actions: read # For reading workflow logs +``` + +## Troubleshooting + +### Workflow not triggering + +- Ensure the workflow file is in the default branch +- Check that `workflow_run` permissions are enabled +- Verify no syntax errors in workflow file + +### Issues not being created + +- Check `GITHUB_TOKEN` permissions +- Verify label existence or creation permissions +- Check rate limits + +### Auto-fixes not applying + +- Ensure mode is set to `auto-fix` +- Verify the failure type has `auto_fix_enabled: true` +- Check that files aren't in `protected_files` list + +## Security Considerations + +- The agent runs with repository permissions +- 
Auto-fixes are limited to workflow files by default +- Protected files cannot be modified +- All PRs require human approval before merge +- Secrets are never exposed in logs or issues diff --git a/.github/scripts/analyze-failure.js b/.github/scripts/analyze-failure.js new file mode 100644 index 000000000..e1d3dca9b --- /dev/null +++ b/.github/scripts/analyze-failure.js @@ -0,0 +1,377 @@ +#!/usr/bin/env node + +/** + * Self-Healing CI - Failure Analysis Script + * + * This script analyzes workflow failure logs and classifies the failure type, + * generates a diagnosis, and recommends actions. + */ + +const fs = require('fs'); +const path = require('path'); +const yaml = require('yaml'); + +// Parse command line arguments +const args = process.argv.slice(2); +const getArg = (name) => { + const index = args.indexOf(`--${name}`); + return index !== -1 ? args[index + 1] : null; +}; + +const logsPath = getArg('logs'); +const runDetailsPath = getArg('run-details'); +const jobsPath = getArg('jobs'); +const configPath = getArg('config'); + +// Load configuration +let config = { + mode: 'assist', + classification: { + code: { + patterns: [ + 'BUILD FAILURE', 'COMPILATION ERROR', 'Test.*failed', 'Tests run:.*Failures:', + 'error: cannot find symbol', 'error: incompatible types', 'SyntaxError', + 'TypeError', 'eslint.*error', 'checkstyle.*ERROR', 'spotless' + ], + auto_fix_enabled: false + }, + workflow: { + patterns: [ + 'Invalid workflow file', 'unexpected value', 'action.*not found', + 'uses:.*@.*not found', 'permission.*denied', 'Required secret.*not found', + 'matrix.*invalid', 'Unexpected input' + ], + auto_fix_enabled: true + }, + infrastructure: { + patterns: [ + 'timeout', 'ETIMEDOUT', 'ECONNREFUSED', 'rate limit', '503 Service', + '502 Bad Gateway', 'Could not resolve host', 'TLS handshake timeout', + 'connection reset', 'No space left on device' + ], + auto_fix_enabled: true, + retry_on_infra_failure: true + }, + quality_gate: { + patterns: [ + 'Quality 
Gate.*FAILED', 'coverage.*below', 'Quality gate status', + 'does not meet.*threshold' + ], + auto_fix_enabled: false, + create_issue: true + } + }, + retry: { + enabled: true, + max_attempts: 2, + auto_retry_types: ['infrastructure'] + } +}; + +if (configPath && fs.existsSync(configPath)) { + try { + const configContent = fs.readFileSync(configPath, 'utf8'); + config = { ...config, ...yaml.parse(configContent) }; + } catch (e) { + console.error(`Warning: Could not load config from ${configPath}: ${e.message}`); + } +} + +// Read logs +let logs = ''; +if (logsPath && fs.existsSync(logsPath)) { + logs = fs.readFileSync(logsPath, 'utf8'); +} + +// Read run details +let runDetails = {}; +if (runDetailsPath && fs.existsSync(runDetailsPath)) { + try { + runDetails = JSON.parse(fs.readFileSync(runDetailsPath, 'utf8')); + } catch (e) { + console.error(`Warning: Could not parse run details: ${e.message}`); + } +} + +// Read jobs +let jobs = { jobs: [] }; +if (jobsPath && fs.existsSync(jobsPath)) { + try { + jobs = JSON.parse(fs.readFileSync(jobsPath, 'utf8')); + } catch (e) { + console.error(`Warning: Could not parse jobs: ${e.message}`); + } +} + +/** + * Classify the failure based on log patterns + * Uses weighted scoring to determine the most likely root cause + */ +function classifyFailure(logs) { + const classifications = []; + + // Extract the last 100 lines where the actual error usually is + const logLines = logs.split('\n'); + const lastSection = logLines.slice(-100).join('\n'); + + for (const [type, settings] of Object.entries(config.classification)) { + let score = 0; + let matchedPattern = null; + + for (const pattern of settings.patterns || []) { + const regex = new RegExp(pattern, 'gi'); + const matches = logs.match(regex) || []; + const lastSectionMatches = lastSection.match(regex) || []; + + if (matches.length > 0) { + // Patterns in the last section (near the error) are more important + score += lastSectionMatches.length * 3; + score += matches.length; + if 
(!matchedPattern) matchedPattern = pattern; + } + } + + if (score > 0) { + classifications.push({ + type, + pattern: matchedPattern, + settings, + score + }); + } + } + + // Sort by score plus a per-type specificity bonus (highest first) + // Quality gate and workflow are more specific than generic code/infra + const specificity = { quality_gate: 10, workflow: 8, code: 5, infrastructure: 3 }; + classifications.sort((a, b) => { + const scoreA = a.score + (specificity[a.type] || 0); + const scoreB = b.score + (specificity[b.type] || 0); + return scoreB - scoreA; + }); + + if (classifications.length > 0) { + return classifications[0]; + } + + return { type: 'unknown', pattern: null, settings: {} }; +} + +/** + * Extract key error lines from logs + */ +function extractKeyErrors(logs, maxLines = 20) { + // No 'g' flag here: test() on a global regex keeps lastIndex between calls, + // which would cause matching lines to be skipped when the same pattern is reused. + const errorPatterns = [ + /##\[error\].*/i, + /Error:.*/i, + /Exception:.*/i, + /FAILURE:.*/i, + /FAILED.*/i, + /fatal:.*/i, + /error:.*/i + ]; + + const errorLines = []; + const lines = logs.split('\n'); + + for (const line of lines) { + for (const pattern of errorPatterns) { + if (pattern.test(line)) { + const cleanLine = line.replace(/^\s*\S+\s+UNKNOWN STEP\s+\S+\s*/, '').trim(); + if (cleanLine && !errorLines.includes(cleanLine)) { + errorLines.push(cleanLine); + } + break; + } + } + if (errorLines.length >= maxLines) break; + } + + return errorLines; +} + +/** + * Find failed jobs and steps + */ +function findFailedComponents(jobs) { + const failed = []; + + for (const job of jobs.jobs || []) { + if (job.conclusion === 'failure') { + const failedSteps = (job.steps || []) + .filter(step => step.conclusion === 'failure') + .map(step => ({ + name: step.name, + number: step.number + })); + + failed.push({ + jobName: job.name, + jobId: job.id, + failedSteps + }); + } + } + + return failed; +} + +/** + * Generate diagnosis and recommendations + */ +function generateDiagnosis(classification, errorLines, failedComponents, logs) { + let diagnosis = []; + let recommendations = []; + 
+ // Summary based on classification + switch (classification.type) { + case 'code': + diagnosis.push('**Type:** Code Failure (tests, compilation, or linting)'); + recommendations.push('Review the failing tests or compilation errors'); + recommendations.push('Check recent code changes for regressions'); + break; + + case 'quality_gate': + diagnosis.push('**Type:** Quality Gate Failure'); + + // Check if it's SonarQube + if (/sonar/i.test(logs)) { + diagnosis.push('The SonarQube Quality Gate check failed.'); + recommendations.push('Review the SonarQube dashboard for detailed issues'); + recommendations.push('Check for new code smells, bugs, or security vulnerabilities'); + recommendations.push('Ensure test coverage meets the threshold'); + + // Extract SonarQube dashboard URL if present + const dashboardMatch = logs.match(/dashboard\?id=[^\s&]+(&[^\s]+)?/); + if (dashboardMatch) { + diagnosis.push(`\n**SonarQube Dashboard:** Check the dashboard for details`); + } + } + break; + + case 'workflow': + diagnosis.push('**Type:** Workflow Configuration Issue'); + recommendations.push('Check the workflow YAML file for syntax errors'); + recommendations.push('Verify action versions are correct and available'); + recommendations.push('Ensure all required secrets are configured'); + break; + + case 'infrastructure': + diagnosis.push('**Type:** Infrastructure/Transient Failure'); + recommendations.push('This may be a transient issue - retry may resolve it'); + recommendations.push('Check external service status if issue persists'); + recommendations.push('Consider adding retry logic for flaky steps'); + break; + + default: + diagnosis.push('**Type:** Unknown Failure Type'); + recommendations.push('Manual investigation required'); + } + + // Add failed components + if (failedComponents.length > 0) { + diagnosis.push('\n**Failed Jobs/Steps:**'); + for (const comp of failedComponents) { + diagnosis.push(`- Job: \`${comp.jobName}\``); + for (const step of comp.failedSteps) { + 
diagnosis.push(` - Step ${step.number}: \`${step.name}\``); + } + } + } + + // Add key error lines + if (errorLines.length > 0) { + diagnosis.push('\n**Key Error Lines:**'); + diagnosis.push('```'); + diagnosis.push(errorLines.slice(0, 10).join('\n')); + diagnosis.push('```'); + } + + // Add matched pattern + if (classification.pattern) { + diagnosis.push(`\n**Matched Pattern:** \`${classification.pattern}\``); + } + + // Add recommendations + diagnosis.push('\n**Recommendations:**'); + for (const rec of recommendations) { + diagnosis.push(`- ${rec}`); + } + + return diagnosis.join('\n'); +} + +/** + * Determine actions to take + */ +function determineActions(classification, config) { + const actions = { + shouldRetry: false, + shouldCreateIssue: false, + shouldCreatePR: false + }; + + const mode = config.mode || 'assist'; + const settings = classification.settings || {}; + + // Retry for infrastructure failures + if (classification.type === 'infrastructure' && config.retry?.enabled) { + if (config.retry.auto_retry_types?.includes('infrastructure')) { + actions.shouldRetry = true; + } + } + + // Create issue for quality gate and code failures + if (['quality_gate', 'code', 'unknown'].includes(classification.type)) { + actions.shouldCreateIssue = true; + } + + // Create PR for workflow issues in auto-fix mode + if (mode === 'auto-fix' && classification.type === 'workflow' && settings.auto_fix_enabled) { + actions.shouldCreatePR = true; + } + + // Always create issue in assist mode (except for retried infra failures) + if (mode === 'assist' && !actions.shouldRetry) { + actions.shouldCreateIssue = true; + } + + return actions; +} + +// Main execution +const classification = classifyFailure(logs); +const errorLines = extractKeyErrors(logs); +const failedComponents = findFailedComponents(jobs); +const diagnosis = generateDiagnosis(classification, errorLines, failedComponents, logs); +const actions = determineActions(classification, config); + +// Output results for 
GitHub Actions (using GITHUB_OUTPUT file, not deprecated set-output) +const setOutput = (name, value) => { + const outputFile = process.env.GITHUB_OUTPUT; + if (outputFile) { + // Handle multiline values + if (value.includes('\n')) { + const delimiter = `EOF_${Date.now()}`; + fs.appendFileSync(outputFile, `${name}<<${delimiter}\n${value}\n${delimiter}\n`); + } else { + fs.appendFileSync(outputFile, `${name}=${value}\n`); + } + } + // Note: Not using deprecated ::set-output command +}; + +setOutput('classification', classification.type); +setOutput('should_retry', actions.shouldRetry.toString()); +setOutput('should_create_issue', actions.shouldCreateIssue.toString()); +setOutput('should_create_pr', actions.shouldCreatePR.toString()); +setOutput('diagnosis', diagnosis); + +// Log summary +console.log('\n=== Self-Healing CI Analysis ===\n'); +console.log(`Classification: ${classification.type}`); +console.log(`Should Retry: ${actions.shouldRetry}`); +console.log(`Should Create Issue: ${actions.shouldCreateIssue}`); +console.log(`Should Create PR: ${actions.shouldCreatePR}`); +console.log('\n--- Diagnosis ---\n'); +console.log(diagnosis); diff --git a/.github/scripts/apply-fix.js b/.github/scripts/apply-fix.js new file mode 100644 index 000000000..d107ea0c8 --- /dev/null +++ b/.github/scripts/apply-fix.js @@ -0,0 +1,288 @@ +#!/usr/bin/env node + +/** + * Self-Healing CI - Auto-Fix Script + * + * This script applies automated fixes for known failure patterns. + * It's designed to make minimal, targeted changes that are safe to apply. 
+ */ + +const fs = require('fs'); +const path = require('path'); +const yaml = require('yaml'); + +// Get environment variables +const diagnosis = process.env.DIAGNOSIS || ''; +const classification = process.env.CLASSIFICATION || 'unknown'; + +// Known fix patterns and their implementations +const fixPatterns = { + // Update deprecated action versions + 'deprecated-actions': { + patterns: [ + /Node\.js 12 actions are deprecated/i, + /set-output command is deprecated/i, + /save-state command is deprecated/i + ], + apply: updateDeprecatedActions + }, + + // Fix permission issues + 'permissions': { + patterns: [ + /Resource not accessible by integration/i, + /permission.*denied/i, + /pull_requests:.*read/i + ], + apply: addMissingPermissions + }, + + // Fix unexpected inputs warning + 'unexpected-inputs': { + patterns: [ + /Unexpected input\(s\)/i + ], + apply: removeUnexpectedInputs + }, + + // Update action versions + 'action-versions': { + patterns: [ + /actions\/checkout@v[123]/i, + /actions\/setup-java@v[123]/i, + /actions\/setup-node@v[123]/i + ], + apply: updateActionVersions + } +}; + +/** + * Update deprecated GitHub Actions to latest versions + */ +function updateDeprecatedActions(workflowPath) { + let content = fs.readFileSync(workflowPath, 'utf8'); + let changes = []; + + // Map of actions to update + const actionUpdates = { + 'actions/checkout@v2': 'actions/checkout@v4', + 'actions/checkout@v3': 'actions/checkout@v4', + 'actions/setup-node@v2': 'actions/setup-node@v4', + 'actions/setup-node@v3': 'actions/setup-node@v4', + 'actions/setup-java@v2': 'actions/setup-java@v4', + 'actions/setup-java@v3': 'actions/setup-java@v4', + 'actions/upload-artifact@v2': 'actions/upload-artifact@v4', + 'actions/upload-artifact@v3': 'actions/upload-artifact@v4', + 'actions/download-artifact@v2': 'actions/download-artifact@v4', + 'actions/download-artifact@v3': 'actions/download-artifact@v4', + 'actions/cache@v2': 'actions/cache@v4', + 'actions/cache@v3': 'actions/cache@v4' 
+ }; + + for (const [oldAction, newAction] of Object.entries(actionUpdates)) { + if (content.includes(oldAction)) { + content = content.replace(new RegExp(escapeRegex(oldAction), 'g'), newAction); + changes.push(`Updated ${oldAction} → ${newAction}`); + } + } + + if (changes.length > 0) { + fs.writeFileSync(workflowPath, content); + } + + return changes; +} + +/** + * Add missing permissions to workflow + */ +function addMissingPermissions(workflowPath) { + let content = fs.readFileSync(workflowPath, 'utf8'); + let changes = []; + + try { + const workflow = yaml.parse(content); + + // Check if permissions block exists + if (!workflow.permissions) { + // Add basic permissions after 'on:' block + const onMatch = content.match(/^on:\s*\n([\s\S]*?)(?=\n\w)/m); + if (onMatch) { + const insertPoint = onMatch.index + onMatch[0].length; + const permissionsBlock = `\npermissions:\n contents: read\n pull-requests: write\n issues: write\n`; + content = content.slice(0, insertPoint) + permissionsBlock + content.slice(insertPoint); + changes.push('Added permissions block with contents:read, pull-requests:write, issues:write'); + fs.writeFileSync(workflowPath, content); + } + } + } catch (e) { + console.error(`Error parsing workflow: ${e.message}`); + } + + return changes; +} + +/** + * Remove or fix unexpected inputs + */ +function removeUnexpectedInputs(workflowPath) { + let content = fs.readFileSync(workflowPath, 'utf8'); + let changes = []; + + // Common unexpected inputs that can be safely removed + const unexpectedInputs = { + 'sonarqube-quality-gate-action': ['sonar_host_url'] + }; + + try { + const workflow = yaml.parse(content); + + // Find steps with unexpected inputs + for (const jobName in workflow.jobs || {}) { + const job = workflow.jobs[jobName]; + for (const step of job.steps || []) { + if (step.uses) { + for (const [actionPattern, inputs] of Object.entries(unexpectedInputs)) { + if (step.uses.includes(actionPattern)) { + for (const input of inputs) { + if 
(step.with && step.with[input]) { + // Remove the input + delete step.with[input]; + changes.push(`Removed unexpected input '${input}' from ${step.uses}`); + } + } + } + } + } + } + } + + if (changes.length > 0) { + fs.writeFileSync(workflowPath, yaml.stringify(workflow, { lineWidth: 0 })); + } + } catch (e) { + console.error(`Error processing workflow: ${e.message}`); + } + + return changes; +} + +/** + * Update action versions to latest + */ +function updateActionVersions(workflowPath) { + let content = fs.readFileSync(workflowPath, 'utf8'); + let changes = []; + + const versionUpdates = [ + { pattern: /actions\/checkout@v[12]/g, replacement: 'actions/checkout@v4', desc: 'checkout → v4' }, + { pattern: /actions\/checkout@v3/g, replacement: 'actions/checkout@v4', desc: 'checkout v3 → v4' }, + { pattern: /actions\/setup-java@v[123]/g, replacement: 'actions/setup-java@v4', desc: 'setup-java → v4' }, + { pattern: /actions\/setup-node@v[123]/g, replacement: 'actions/setup-node@v4', desc: 'setup-node → v4' } + ]; + + for (const update of versionUpdates) { + if (update.pattern.test(content)) { + content = content.replace(update.pattern, update.replacement); + changes.push(`Updated ${update.desc}`); + } + } + + if (changes.length > 0) { + fs.writeFileSync(workflowPath, content); + } + + return changes; +} + +/** + * Escape special regex characters + */ +function escapeRegex(string) { + return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); +} + +/** + * Find all workflow files + */ +function findWorkflowFiles() { + const workflowDir = '.github/workflows'; + const files = []; + + if (fs.existsSync(workflowDir)) { + for (const file of fs.readdirSync(workflowDir)) { + if (file.endsWith('.yml') || file.endsWith('.yaml')) { + // Skip the self-healing workflow itself + if (file !== 'self-healing.yml') { + files.push(path.join(workflowDir, file)); + } + } + } + } + + return files; +} + +/** + * Main execution + */ +function main() { + console.log('=== Self-Healing CI - Auto-Fix 
===\n'); + console.log(`Classification: ${classification}`); + console.log(`Diagnosis:\n${diagnosis}\n`); + + const allChanges = []; + const workflowFiles = findWorkflowFiles(); + + // Determine which fixes to apply based on diagnosis + for (const [fixName, fix] of Object.entries(fixPatterns)) { + const shouldApply = fix.patterns.some(pattern => pattern.test(diagnosis)); + + if (shouldApply) { + console.log(`\nApplying fix: ${fixName}`); + + for (const workflowPath of workflowFiles) { + console.log(` Processing: ${workflowPath}`); + try { + const changes = fix.apply(workflowPath); + if (changes.length > 0) { + allChanges.push({ + file: workflowPath, + fix: fixName, + changes + }); + console.log(` ${changes.length} change(s) applied`); + } + } catch (e) { + console.error(` Error: ${e.message}`); + } + } + } + } + + // Output results for GitHub Actions + const outputFile = process.env.GITHUB_OUTPUT; + const changesMade = allChanges.length > 0; + + let changesDescription = 'No automatic fixes were applied.'; + if (changesMade) { + changesDescription = allChanges.map(c => { + return `**${c.file}** (${c.fix}):\n${c.changes.map(ch => `- ${ch}`).join('\n')}`; + }).join('\n\n'); + } + + if (outputFile) { + fs.appendFileSync(outputFile, `changes_made=${changesMade}\n`); + const delimiter = `EOF_${Date.now()}`; + fs.appendFileSync(outputFile, `changes_description<<${delimiter}\n${changesDescription}\n${delimiter}\n`); + } + + console.log('\n=== Summary ==='); + console.log(`Changes made: ${changesMade}`); + if (changesMade) { + console.log('\nChanges:'); + console.log(changesDescription); + } +} + +main(); diff --git a/.github/self-healing-config.yml b/.github/self-healing-config.yml new file mode 100644 index 000000000..56d723362 --- /dev/null +++ b/.github/self-healing-config.yml @@ -0,0 +1,137 @@ +# Self-Healing Pipeline Configuration +# This file controls the behavior of the self-healing CI/CD agent + +# Operating mode: +# - "assist": Creates issues/comments with 
diagnosis and proposed fixes (default, safer) +# - "auto-fix": Automatically creates PRs with fixes for certain failure types +mode: "assist" + +# Workflows to monitor (empty means all workflows) +# Add workflow filenames to restrict monitoring +monitored_workflows: [] + +# Excluded workflows - these will never be auto-fixed +excluded_workflows: + - "self-healing.yml" # Never self-heal the self-healer + +# Failure classification rules +classification: + # Code failures - test failures, compilation errors, lint issues + code: + patterns: + - "BUILD FAILURE" + - "COMPILATION ERROR" + - "Tests run:.*Failures: [1-9]" # Only match when there are actual failures + - "Tests run:.*Errors: [1-9]" # Only match when there are actual errors + - "error: cannot find symbol" + - "error: incompatible types" + - "SyntaxError" + - "TypeError" + - "eslint.*error" + - "checkstyle.*ERROR" + - "spotless.*failed" + auto_fix_enabled: false # Code fixes need human review + + # Workflow failures - YAML issues, action versions, permissions + workflow: + patterns: + - "Invalid workflow file" + - "unexpected value" + - "action.*not found" + - "uses:.*@.*not found" + - "permission.*denied" + - "Required secret.*not found" + - "matrix.*invalid" + - "Unexpected input" + auto_fix_enabled: true + + # Infrastructure failures - flaky services, timeouts, rate limits + infrastructure: + patterns: + - "timeout" + - "ETIMEDOUT" + - "ECONNREFUSED" + - "rate limit" + - "503 Service" + - "502 Bad Gateway" + - "Could not resolve host" + - "TLS handshake timeout" + - "connection reset" + - "No space left on device" + auto_fix_enabled: true + retry_on_infra_failure: true + max_retries: 2 + + # Quality gate failures - SonarQube, code coverage thresholds + quality_gate: + patterns: + - "Quality Gate.*FAILED" + - "coverage.*below" + - "Quality gate status" + - "does not meet.*threshold" + auto_fix_enabled: false + create_issue: true + +# Auto-fix rules for specific patterns +auto_fix_rules: + # Retry workflow on 
transient infrastructure failures + - name: "retry-on-transient-failure" + match: "ETIMEDOUT|ECONNREFUSED|rate limit|503 Service|502 Bad Gateway" + action: "retry" + max_retries: 2 + delay_seconds: 60 + + # Update deprecated action versions + - name: "update-deprecated-actions" + match: "Node.js 12 actions are deprecated|set-output command is deprecated" + action: "create-pr" + fix_type: "update-action-version" + + # Fix missing permissions + - name: "fix-permissions" + match: "Resource not accessible by integration|permission.*denied" + action: "suggest-fix" + fix_type: "add-permissions" + +# Notification settings +notifications: + # Create GitHub issue for failures + create_issue: true + issue_labels: + - "ci-failure" + - "self-healing" + + # Add comment to PR if failure is on a PR + comment_on_pr: true + + # Assign issues to specific users (GitHub usernames) + assignees: [] + +# Retry configuration +retry: + enabled: true + max_attempts: 2 + delay_minutes: 1 + # Only retry these failure types automatically + auto_retry_types: + - "infrastructure" + +# Limits and guardrails +guardrails: + # Maximum PRs to create per day + max_prs_per_day: 5 + + # Maximum issues to create per day + max_issues_per_day: 10 + + # Require approval for auto-fix PRs + require_approval: true + + # Never modify these files automatically + protected_files: + - ".github/self-healing-config.yml" + - "CODEOWNERS" + - ".github/workflows/main-build-and-deploy.yml" + + # Maximum lines of code to change in auto-fix + max_lines_changed: 50 diff --git a/.github/workflows/self-healing.yml b/.github/workflows/self-healing.yml new file mode 100644 index 000000000..3869bbab7 --- /dev/null +++ b/.github/workflows/self-healing.yml @@ -0,0 +1,342 @@ +name: Self-Healing CI Pipeline + +on: + workflow_run: + workflows: ["*"] # Monitor all workflows + types: + - completed + +permissions: + contents: write + pull-requests: write + issues: write + actions: read + +jobs: + analyze-failure: + name: Analyze 
Workflow Failure + if: ${{ github.event.workflow_run.conclusion == 'failure' }} + runs-on: ubuntu-latest + + outputs: + classification: ${{ steps.analyze.outputs.classification }} + should_retry: ${{ steps.analyze.outputs.should_retry }} + should_create_issue: ${{ steps.analyze.outputs.should_create_issue }} + should_create_pr: ${{ steps.analyze.outputs.should_create_pr }} + diagnosis: ${{ steps.analyze.outputs.diagnosis }} + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Setup Node.js + uses: actions/setup-node@v4 + with: + node-version: '20' + + - name: Install dependencies + run: | + npm install yaml @octokit/rest + + - name: Download workflow logs + id: download-logs + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + echo "Downloading logs for run ${{ github.event.workflow_run.id }}" + + # Get the failed run logs + gh run view ${{ github.event.workflow_run.id }} --log-failed > /tmp/failed_logs.txt 2>&1 || true + + # Get workflow run details + gh api /repos/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }} > /tmp/run_details.json + + # Get jobs for the run + gh api /repos/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}/jobs > /tmp/jobs.json + + echo "logs_path=/tmp/failed_logs.txt" >> $GITHUB_OUTPUT + echo "run_details_path=/tmp/run_details.json" >> $GITHUB_OUTPUT + echo "jobs_path=/tmp/jobs.json" >> $GITHUB_OUTPUT + + - name: Analyze failure + id: analyze + env: + FAILED_WORKFLOW: ${{ github.event.workflow_run.name }} + FAILED_WORKFLOW_ID: ${{ github.event.workflow_run.id }} + FAILED_RUN_URL: ${{ github.event.workflow_run.html_url }} + HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }} + HEAD_SHA: ${{ github.event.workflow_run.head_sha }} + run: | + node .github/scripts/analyze-failure.js \ + --logs "${{ steps.download-logs.outputs.logs_path }}" \ + --run-details "${{ steps.download-logs.outputs.run_details_path }}" \ + --jobs "${{ 
steps.download-logs.outputs.jobs_path }}" \ + --config ".github/self-healing-config.yml" + + retry-workflow: + name: Retry Failed Workflow + needs: analyze-failure + if: ${{ needs.analyze-failure.outputs.should_retry == 'true' }} + runs-on: ubuntu-latest + + steps: + - name: Check retry count + id: check-retry + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + # Check if this workflow run has been retried too many times. + # Re-running creates a new attempt of the same run rather than a new run, + # so the attempt number is the retry counter. + RUN_ATTEMPT="${{ github.event.workflow_run.run_attempt }}" + + echo "Current run attempt: $RUN_ATTEMPT" + + if [ "$RUN_ATTEMPT" -ge 3 ]; then + echo "Max retries reached, will not retry" + echo "should_proceed=false" >> $GITHUB_OUTPUT + else + echo "Will proceed with retry" + echo "should_proceed=true" >> $GITHUB_OUTPUT + fi + + - name: Wait before retry + if: steps.check-retry.outputs.should_proceed == 'true' + run: sleep 60 + + - name: Rerun failed workflow + if: steps.check-retry.outputs.should_proceed == 'true' + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + echo "Rerunning workflow ${{ github.event.workflow_run.id }}" + gh run rerun ${{ github.event.workflow_run.id }} --failed + + create-issue: + name: Create Failure Issue + needs: analyze-failure + if: ${{ needs.analyze-failure.outputs.should_create_issue == 'true' }} + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Check for existing issue + id: check-issue + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + # Check if an issue already exists for this failure + WORKFLOW_NAME="${{ github.event.workflow_run.name }}" + HEAD_BRANCH="${{ github.event.workflow_run.head_branch }}" + + EXISTING_ISSUE=$(gh issue list \ + --label "ci-failure" \ + 
--label "self-healing" \ + --state open \ + --search "in:title ${WORKFLOW_NAME} ${HEAD_BRANCH}" \ + --json number \ + --jq '.[0].number // empty') + + if [ -n "$EXISTING_ISSUE" ]; then + echo "Existing issue found: #${EXISTING_ISSUE}" + echo "exists=true" >> "$GITHUB_OUTPUT" + echo "issue_number=${EXISTING_ISSUE}" >> "$GITHUB_OUTPUT" + else + echo "No existing issue found" + echo "exists=false" >> "$GITHUB_OUTPUT" + fi + + - name: Create or update issue + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + DIAGNOSIS: ${{ needs.analyze-failure.outputs.diagnosis }} + CLASSIFICATION: ${{ needs.analyze-failure.outputs.classification }} + # Passed through env instead of being interpolated into the script, + # because workflow and branch names can be controlled by a PR author + WORKFLOW_NAME: ${{ github.event.workflow_run.name }} + RUN_URL: ${{ github.event.workflow_run.html_url }} + HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }} + HEAD_SHA: ${{ github.event.workflow_run.head_sha }} + run: | + ISSUE_BODY=$(cat <<EOF + ℹ️ About Self-Healing CI + + This repository has a self-healing CI/CD pipeline that: + - Monitors all workflow failures + - Analyzes logs to determine root cause + - Classifies failures (code, workflow, infrastructure, quality) + - Suggests or auto-applies fixes when safe + + Configuration: \`.github/self-healing-config.yml\` + + EOF + ) + + if [ "${{ steps.check-issue.outputs.exists }}" == "true" ]; then + # Update existing issue with a comment + gh issue comment "${{ steps.check-issue.outputs.issue_number }}" \ + --body "## 🔄 New Failure Detected + + **Run URL:** ${RUN_URL} + **Commit:** ${HEAD_SHA} + + ### Updated Diagnosis + + ${DIAGNOSIS}" + else + # Create new issue + gh issue create \ + --title "🔧 CI Failure: ${WORKFLOW_NAME} on ${HEAD_BRANCH}" \ + --body "${ISSUE_BODY}" \ + --label "ci-failure" \ + --label "self-healing" + fi + + create-fix-pr: + name: Create Fix Pull Request + needs: analyze-failure + if: ${{ needs.analyze-failure.outputs.should_create_pr == 'true' }} + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 0 
+ + - name: Setup Node.js + uses: actions/setup-node@v4 + with: + node-version: '20' + + - name: Install dependencies + run: npm install yaml + + - name: Apply fix + id: apply-fix + env: + DIAGNOSIS: ${{ needs.analyze-failure.outputs.diagnosis }} + CLASSIFICATION: ${{ needs.analyze-failure.outputs.classification }} + run: | + node .github/scripts/apply-fix.js + + - name: Create Pull Request + if: steps.apply-fix.outputs.changes_made == 'true' + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # Passed through env instead of being interpolated into the script, + # because the workflow name can be controlled by a PR author + WORKFLOW_NAME: ${{ github.event.workflow_run.name }} + RUN_URL: ${{ github.event.workflow_run.html_url }} + CLASSIFICATION: ${{ needs.analyze-failure.outputs.classification }} + run: | + BRANCH_NAME="self-healing/fix-${{ github.event.workflow_run.id }}" + + git config user.name "github-actions[bot]" + git config user.email "github-actions[bot]@users.noreply.github.com" + + git checkout -b "${BRANCH_NAME}" + git add -A + git commit -m "fix: Auto-fix CI failure in ${WORKFLOW_NAME} + + This fix was automatically generated by the self-healing CI agent. + + Workflow: ${WORKFLOW_NAME} + Original run: ${RUN_URL} + Classification: ${CLASSIFICATION}" + + git push origin "${BRANCH_NAME}" + + gh pr create \ + --title "🤖 Auto-fix: CI failure in ${WORKFLOW_NAME}" \ + --body "## 🤖 Self-Healing CI Auto-Fix + + This PR was automatically generated by the self-healing CI agent. 
+ + **Original Failure:** ${{ github.event.workflow_run.html_url }} + **Classification:** ${{ needs.analyze-failure.outputs.classification }} + + ### Diagnosis + + ${{ needs.analyze-failure.outputs.diagnosis }} + + ### Changes Made + + ${{ steps.apply-fix.outputs.changes_description }} + + --- + + ⚠️ **Please review these changes carefully before merging.**" \ + --label "self-healing" \ + --label "auto-fix" + + comment-on-pr: + name: Comment on PR + needs: analyze-failure + if: ${{ github.event.workflow_run.event == 'pull_request' }} + runs-on: ubuntu-latest + + steps: + - name: Get PR number + id: get-pr + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + # Get the PR number from the workflow run + PR_NUMBER=$(gh api "/repos/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}" \ + --jq '.pull_requests[0].number // empty') + + if [ -n "$PR_NUMBER" ]; then + echo "pr_number=${PR_NUMBER}" >> "$GITHUB_OUTPUT" + echo "has_pr=true" >> "$GITHUB_OUTPUT" + else + echo "has_pr=false" >> "$GITHUB_OUTPUT" + fi + + - name: Add comment to PR + if: steps.get-pr.outputs.has_pr == 'true' + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # This job has no checkout step, so gh needs the repository named explicitly + GH_REPO: ${{ github.repository }} + DIAGNOSIS: ${{ needs.analyze-failure.outputs.diagnosis }} + CLASSIFICATION: ${{ needs.analyze-failure.outputs.classification }} + # Passed through env instead of being interpolated into the script, + # because the workflow name can be controlled by a PR author + WORKFLOW_NAME: ${{ github.event.workflow_run.name }} + RUN_URL: ${{ github.event.workflow_run.html_url }} + run: | + gh pr comment "${{ steps.get-pr.outputs.pr_number }}" \ + --body "## 🔧 CI Failure Analysis + + **Workflow:** ${WORKFLOW_NAME} + **Classification:** ${CLASSIFICATION} + **Run:** ${RUN_URL} + + ### 📋 Diagnosis + + ${DIAGNOSIS} + + --- + 🤖 Generated by Self-Healing CI Agent"
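
As a rough illustration of the retry gate in the `retry-workflow` job, the decision can be written as a small pure function. This is a sketch under stated assumptions, not code from the repository: `shouldRetry` and the shape of its arguments are hypothetical, the config fields mirror the `retry` block of `.github/self-healing-config.yml`, and the caller is assumed to know the attempt number of the failed run.

```javascript
// Hypothetical helper, not part of the workflow or scripts in this PR.
// Decides whether a failed run should be rerun, given its classification
// and attempt number (attempt 1 is the original run).
function shouldRetry({ classification, runAttempt }, retryConfig) {
  if (!retryConfig.enabled) return false;
  // Only classifications listed as auto-retryable are rerun.
  if (!retryConfig.auto_retry_types.includes(classification)) return false;
  // With max_attempts = 2, attempts 1 and 2 may be rerun; attempt 3 may not.
  return runAttempt <= retryConfig.max_attempts;
}

// Values taken from the retry configuration shown in the README.
const retryConfig = {
  enabled: true,
  max_attempts: 2,
  auto_retry_types: ["infrastructure"],
};

console.log(shouldRetry({ classification: "infrastructure", runAttempt: 1 }, retryConfig)); // true
console.log(shouldRetry({ classification: "infrastructure", runAttempt: 3 }, retryConfig)); // false
console.log(shouldRetry({ classification: "code", runAttempt: 1 }, retryConfig)); // false
```

Keeping the decision in a pure function like this would make it testable without calling the GitHub API; the workflow step would then only need to supply the classification and the attempt number.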