
K3s Cluster Maintenance


Enterprise-grade automated OS patching and system maintenance for K3s cluster nodes with zero-downtime operations. This tool safely applies operating system updates, security patches, and package upgrades to your K3s nodes using a modular Ansible role architecture.

🎯 Quick Start

Run maintenance operations with simple commands:

# Update all worker nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_workers

# Update all master nodes  
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_masters

# Update specific node
ansible-playbook -i hosts.yml maintenance.yml --limit node-01

# Update entire cluster
ansible-playbook -i hosts.yml maintenance.yml

πŸ—οΈ Enterprise Architecture

This tool uses a modular Ansible role-based architecture for production deployments:

Role Structure

roles/
  k3s_node_maintenance/
    ├── tasks/
    │   ├── main.yml              # Main task orchestration
    │   ├── prerequisites.yml     # Pre-flight checks
    │   ├── package_checks.yml    # Update detection
    │   ├── cluster_preparation.yml # Node draining
    │   ├── package_updates.yml   # OS updates
    │   ├── debian_updates.yml    # Debian/Ubuntu specific
    │   ├── redhat_updates.yml    # RHEL/CentOS specific
    │   ├── reboot_handling.yml   # Reboot coordination
    │   └── cluster_restoration.yml # Node restoration
    ├── defaults/
    │   └── main.yml              # Default variables
    ├── handlers/
    │   └── main.yml              # Event handlers
    └── meta/
        └── main.yml              # Role metadata
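The role is driven by maintenance.yml one node at a time. A minimal playbook applying this structure might look like the sketch below; the `serial` and `any_errors_fatal` values are illustrative assumptions, so consult the repository's maintenance.yml for the authoritative version:

```yaml
# Sketch of a playbook applying the role sequentially (illustrative only).
- name: K3s node maintenance
  hosts: k3s_cluster
  serial: 1               # process one node at a time for zero-downtime updates
  any_errors_fatal: true  # stop on first failure to avoid cascading issues
  roles:
    - k3s_node_maintenance
```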

Group Variables

group_vars/
  ├── k3s_masters/main.yml      # Master-specific settings
  ├── k3s_workers/main.yml      # Worker-specific settings
  ├── os_debian/main.yml        # Debian/Ubuntu settings
  └── os_redhat/main.yml        # RHEL/CentOS settings

🚀 Features

  • 🔄 Automated OS Patching: System updates, security patches, and package upgrades
  • ⚡ Zero-Downtime Operations: Sequential node processing preserves cluster availability
  • 🔍 Intelligent Detection: Automatically skips maintenance when no updates are available
  • 🛡️ Health Monitoring: Comprehensive cluster and storage validation
  • 🔄 Storage Recovery: Automatic wait for degraded Longhorn volumes to recover
  • 🎛️ Control Plane Safety: Master node handling with quorum protection
  • 💾 Storage Integration: Native Longhorn support with volume health verification
  • 🔄 Reboot Management: Smart reboot handling that adapts to node boot speeds
  • 🏗️ Enterprise Ready: Modular role architecture for scalability and customisation

📦 Repository Contents

File               Description
maintenance.yml    Main playbook using enterprise role architecture
hosts.yml.example  Example inventory with group structure
ansible.cfg        Ansible configuration
roles/             Modular role architecture
group_vars/        Node type and OS-specific variables
requirements.txt   Python dependencies

📋 Prerequisites

  • K3s cluster (single or multi-node)
  • Ansible (>= 2.9, tested with 2.14.x)
  • kubectl configured for your cluster
  • SSH access to all nodes with key-based authentication
  • kubernetes.core collection for native Kubernetes operations
  • Python Kubernetes client for API operations

Optional Components

  • Longhorn storage system (health checks included)

πŸ› οΈ Installation

1. Clone Repository

git clone https://github.com/sudo-kraken/k3s-cluster-maintenance.git
cd k3s-cluster-maintenance

2. Install Dependencies

pip install -r requirements.txt

3. Install Ansible Collections

# Install required Kubernetes collection
ansible-galaxy collection install kubernetes.core

# Or install all collections from requirements
ansible-galaxy collection install -r collections/requirements.yml

Required Collections:

  • kubernetes.core (>= 2.4.0) - For native Kubernetes API operations

4. Configure Inventory

cp hosts.yml.example hosts.yml
# Edit hosts.yml with your cluster details

5. Test Connectivity

ansible all -i hosts.yml -m ping

📚 Usage Examples

Basic Maintenance Operations

# Update all worker nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_workers

# Update all master nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_masters

# Update specific node
ansible-playbook -i hosts.yml maintenance.yml --limit worker-01

# Update all Debian/Ubuntu nodes
ansible-playbook -i hosts.yml maintenance.yml --limit os_debian

# Update entire cluster
ansible-playbook -i hosts.yml maintenance.yml

Advanced Operations

# Dry run (check mode)
ansible-playbook -i hosts.yml maintenance.yml --check

# Update with custom variables
ansible-playbook -i hosts.yml maintenance.yml -e "k3s_node_maintenance_wait_timeout=1200"

# Verbose output for debugging
ansible-playbook -i hosts.yml maintenance.yml -v

# Update specific nodes by pattern
ansible-playbook -i hosts.yml maintenance.yml --limit "*master*"

Tagged Operations

Use tags to run specific phases of maintenance:

# Run only prerequisite checks
ansible-playbook -i hosts.yml maintenance.yml --tags prerequisites

# Check for available updates only
ansible-playbook -i hosts.yml maintenance.yml --tags packages,check_updates

# Run only cluster preparation (cordon/drain)
ansible-playbook -i hosts.yml maintenance.yml --tags cluster,prepare

# Run only package updates
ansible-playbook -i hosts.yml maintenance.yml --tags packages,updates

# Run only reboot handling
ansible-playbook -i hosts.yml maintenance.yml --tags reboot

# Run only cluster restoration (uncordon)
ansible-playbook -i hosts.yml maintenance.yml --tags restore

# Resume after manual reboot or failure
ansible-playbook -i hosts.yml maintenance.yml --tags resume

# OS-specific operations
ansible-playbook -i hosts.yml maintenance.yml --tags debian    # Debian/Ubuntu only
ansible-playbook -i hosts.yml maintenance.yml --tags redhat   # RHEL/CentOS only

# Longhorn-specific operations
ansible-playbook -i hosts.yml maintenance.yml --tags longhorn

Recovery Operations

# Resume maintenance after reboot failure
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags resume

# Manual uncordon after successful maintenance
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags uncordon

# Re-enable Longhorn scheduling only
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags longhorn

βš™οΈ Configuration

Role Variables

Customise behaviour through group variables:

Master Nodes (group_vars/k3s_masters/main.yml)

k3s_node_maintenance_drain_timeout: 600
k3s_node_maintenance_wait_timeout: 1800
k3s_node_maintenance_skip_drain: true  # Masters are not drained

Worker Nodes (group_vars/k3s_workers/main.yml)

k3s_node_maintenance_drain_timeout: 300
k3s_node_maintenance_wait_timeout: 600
k3s_node_maintenance_skip_drain: false

OS-Specific Settings

# Debian/Ubuntu (group_vars/os_debian/main.yml)
k3s_node_maintenance_package_manager: apt
k3s_node_maintenance_cache_valid_time: 3600

# RHEL/CentOS (group_vars/os_redhat/main.yml)
k3s_node_maintenance_package_manager: dnf
k3s_node_maintenance_needs_restarting_available: true
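These OS-specific variables pair with the per-OS task files in the role. As a sketch of how the dispatch might work (illustrative; the role's actual conditions may differ, though the file names follow the role layout shown earlier):

```yaml
# Sketch: dispatching to OS-specific task files based on Ansible facts.
- name: Apply Debian/Ubuntu updates
  ansible.builtin.include_tasks: debian_updates.yml
  when: ansible_facts['os_family'] == 'Debian'

- name: Apply RHEL/CentOS updates
  ansible.builtin.include_tasks: redhat_updates.yml
  when: ansible_facts['os_family'] == 'RedHat'
```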

Inventory Structure

Define your cluster in hosts.yml:

all:
  children:
    k3s_cluster:
      children:
        k3s_masters:
          hosts:
            master-01:
              ansible_host: 10.0.0.100
            master-02:
              ansible_host: 10.0.0.101
            master-03:
              ansible_host: 10.0.0.102
        k3s_workers:
          hosts:
            worker-01:
              ansible_host: 10.0.0.150
            worker-02:
              ansible_host: 10.0.0.151
        os_debian:
          hosts:
            master-01:
            worker-01:
        os_redhat:
          hosts:
            master-02:
            master-03:
            worker-02:

πŸ›‘οΈ Safety Features

Intelligent Detection

  • Early Exit: Automatically skips maintenance when no updates are available
  • Update Assessment: Checks for available packages before cluster operations
  • Resource Preservation: Prevents unnecessary downtime and resource usage

Health Validation

  • Pre-flight Checks: Validates prerequisites and cluster health
  • Node Readiness: Ensures nodes are healthy before/after maintenance
  • Control Plane: Validates API server and etcd health for masters
  • Storage Integration: Checks Longhorn volume health (when available)
  • Volume Recovery: Waits for degraded Longhorn volumes to recover before proceeding
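As an illustration of the volume-recovery wait, a task like the following could poll Longhorn's Volume resources until none report degraded robustness. This is a sketch, not the role's actual implementation, and the retry/delay values are placeholder assumptions:

```yaml
# Sketch: wait until no Longhorn volume reports degraded robustness
# (illustrative; retry counts and delay are placeholders).
- name: Wait for Longhorn volumes to recover
  kubernetes.core.k8s_info:
    api_version: longhorn.io/v1beta2
    kind: Volume
    namespace: longhorn-system
  register: longhorn_volumes
  until: >-
    longhorn_volumes.resources
    | selectattr('status.robustness', 'equalto', 'degraded')
    | list | length == 0
  retries: 30
  delay: 10
  delegate_to: localhost
```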

Operational Safety

  • Sequential Processing: Maintains only one node at a time
  • Drain Protection: Workers are properly drained; masters are never drained
  • Smart Reboot Handling: Adaptive monitoring that waits for actual state changes
  • Rollback Support: Stops on first failure to prevent cascade issues
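For workers, the cordon/drain step can be expressed with the kubernetes.core.k8s_drain module. The sketch below is illustrative only: the option values are assumptions, and masters would bypass it via the k3s_node_maintenance_skip_drain variable shown in the configuration section:

```yaml
# Sketch: drain a worker before patching (illustrative option values).
- name: Drain node before maintenance
  kubernetes.core.k8s_drain:
    state: drain
    name: "{{ inventory_hostname }}"
    delete_options:
      ignore_daemonsets: true
      delete_emptydir_data: true
      terminate_grace_period: 60
  when: not k3s_node_maintenance_skip_drain
  delegate_to: localhost
```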

🔧 Troubleshooting

Check Maintenance Status

# Verify what updates are available
ansible all -i hosts.yml -m package_facts

# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces

# Verify Longhorn status (if applicable)
kubectl get pods -n longhorn-system

Common Issues

"No updates needed"

This is normal behaviour - the role intelligently skips maintenance when no packages need updating.

Node not ready after maintenance

# Check node status
kubectl get nodes

# Manual uncordon if needed
kubectl uncordon <node-name>

Ansible connection issues

# Test connectivity
ansible all -i hosts.yml -m ping

# Check SSH access
ssh user@node-ip

Debug Mode

# Run with maximum verbosity
ansible-playbook -i hosts.yml maintenance.yml -vvv

# List all available tags
ansible-playbook -i hosts.yml maintenance.yml --list-tags

# Check specific task without running
ansible-playbook -i hosts.yml maintenance.yml --tags check_updates --check

# Resume from specific point
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags resume

🏷️ Tag Reference

Tag            Description               Use Case
prerequisites  Pre-flight checks         Validate environment setup
check_updates  Package update detection  See what updates are available
prepare        Cluster preparation       Cordon/drain nodes only
packages       All package operations    Package management only
updates        Package installation      Install updates only
reboot         Reboot coordination       Reboot handling only
restore        Cluster restoration       Uncordon and restore scheduling
resume         Manual recovery           Resume after failures (includes restore)
uncordon       Node uncordoning          Restore node scheduling only
debian         Debian/Ubuntu only        OS-specific operations
redhat         RHEL/CentOS only          OS-specific operations
longhorn       Longhorn operations       Storage-specific tasks

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

⚠️ Disclaimer

This tool performs maintenance operations on your Kubernetes cluster. Always:

  • Test in a non-production environment first
  • Ensure you have recent backups
  • Review the role tasks before deployment
  • Monitor the process during execution

Use at your own risk. The authors are not responsible for any damage or data loss.


Enterprise-grade K3s maintenance made simple
