Enterprise-grade automated OS patching and system maintenance for K3s cluster nodes with zero-downtime operations. This tool safely applies operating system updates, security patches, and package upgrades to your K3s nodes using a modular Ansible role architecture.
Run maintenance operations with simple commands:
```bash
# Update all worker nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_workers

# Update all master nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_masters

# Update specific node
ansible-playbook -i hosts.yml maintenance.yml --limit node-01

# Update entire cluster
ansible-playbook -i hosts.yml maintenance.yml
```
This tool uses a modular Ansible role-based architecture for production deployments:
```
roles/
└── k3s_node_maintenance/
    ├── tasks/
    │   ├── main.yml                 # Main task orchestration
    │   ├── prerequisites.yml        # Pre-flight checks
    │   ├── package_checks.yml       # Update detection
    │   ├── cluster_preparation.yml  # Node draining
    │   ├── package_updates.yml      # OS updates
    │   ├── debian_updates.yml       # Debian/Ubuntu specific
    │   ├── redhat_updates.yml       # RHEL/CentOS specific
    │   ├── reboot_handling.yml      # Reboot coordination
    │   └── cluster_restoration.yml  # Node restoration
    ├── defaults/
    │   └── main.yml                 # Default variables
    ├── handlers/
    │   └── main.yml                 # Event handlers
    └── meta/
        └── main.yml                 # Role metadata
group_vars/
├── k3s_masters/main.yml  # Master-specific settings
├── k3s_workers/main.yml  # Worker-specific settings
├── os_debian/main.yml    # Debian/Ubuntu settings
└── os_redhat/main.yml    # RHEL/CentOS settings
```
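For orientation, here is a minimal sketch of what a top-level `maintenance.yml` looks like under this layout (illustrative, not the verbatim playbook):

```yaml
# Minimal sketch of maintenance.yml (illustrative): serial: 1 processes one
# node at a time, which is what preserves cluster availability during updates.
- name: K3s node maintenance
  hosts: all
  become: true
  serial: 1
  roles:
    - k3s_node_maintenance
```

`serial: 1` is the piece that delivers the zero-downtime behaviour: Ansible finishes each node completely before starting the next, and `--limit` narrows the run to the groups or hosts shown above.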
- **Automated OS Patching**: System updates, security patches, and package upgrades
- **Zero-Downtime Operations**: Sequential node processing preserves cluster availability
- **Intelligent Detection**: Automatically skips maintenance when no updates are available
- **Health Monitoring**: Comprehensive cluster and storage validation
- **Storage Recovery**: Automatic wait for degraded Longhorn volumes to recover
- **Control Plane Safety**: Master node handling with quorum protection
- **Storage Integration**: Native Longhorn support with volume health verification
- **Reboot Management**: Smart reboot handling that adapts to node boot speeds (see the sketch after this list)
- **Enterprise Ready**: Modular role architecture for scalability and customisation
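For illustration, adaptive reboot handling can be built on `ansible.builtin.reboot`, which polls the node until it is reachable again rather than sleeping for a fixed interval. The task below is a sketch, and the `reboot_required` fact is a hypothetical name:

```yaml
# Sketch, not the role's exact task: wait for the node to actually come back.
- name: Reboot node and wait for it to return
  ansible.builtin.reboot:
    reboot_timeout: "{{ k3s_node_maintenance_wait_timeout }}"  # role variable, see group_vars
    test_command: uptime  # the node counts as up once this command succeeds
  when: reboot_required | default(false)  # hypothetical fact set by the update tasks
```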
| File | Description |
|---|---|
| `maintenance.yml` | Main playbook using enterprise role architecture |
| `hosts.yml.example` | Example inventory with group structure |
| `ansible.cfg` | Ansible configuration |
| `roles/` | Modular role architecture |
| `group_vars/` | Node type and OS-specific variables |
| `requirements.txt` | Python dependencies |
- K3s cluster (single or multi-node)
- Ansible (>= 2.9, tested with 2.14.x)
- kubectl configured for your cluster
- SSH access to all nodes with key-based authentication
- kubernetes.core collection for native Kubernetes operations
- Python Kubernetes client for API operations
- Longhorn storage system (health checks included)
```bash
git clone https://github.com/sudo-kraken/k3s-cluster-maintenance.git
cd k3s-cluster-maintenance
pip install -r requirements.txt

# Install required Kubernetes collection
ansible-galaxy collection install kubernetes.core

# Or install all collections from requirements
ansible-galaxy collection install -r collections/requirements.yml
```
Required Collections:

- `kubernetes.core` (>= 2.4.0) - For native Kubernetes API operations
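For reference, a `collections/requirements.yml` matching that constraint would look like this (a sketch; check the repository's own file for the authoritative contents):

```yaml
# Illustrative collections/requirements.yml pinning the documented minimum.
collections:
  - name: kubernetes.core
    version: ">=2.4.0"
```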
```bash
cp hosts.yml.example hosts.yml
# Edit hosts.yml with your cluster details
ansible all -i hosts.yml -m ping
```
```bash
# Update all worker nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_workers

# Update all master nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_masters

# Update specific node
ansible-playbook -i hosts.yml maintenance.yml --limit worker-01

# Update all Debian/Ubuntu nodes
ansible-playbook -i hosts.yml maintenance.yml --limit os_debian

# Update entire cluster
ansible-playbook -i hosts.yml maintenance.yml

# Dry run (check mode)
ansible-playbook -i hosts.yml maintenance.yml --check

# Update with custom variables
ansible-playbook -i hosts.yml maintenance.yml -e "k3s_node_maintenance_wait_timeout=1200"

# Verbose output for debugging
ansible-playbook -i hosts.yml maintenance.yml -v

# Update specific nodes by pattern
ansible-playbook -i hosts.yml maintenance.yml --limit "*master*"
```
Use tags to run specific phases of maintenance:
```bash
# Run only prerequisite checks
ansible-playbook -i hosts.yml maintenance.yml --tags prerequisites

# Check for available updates only
ansible-playbook -i hosts.yml maintenance.yml --tags packages,check_updates

# Run only cluster preparation (cordon/drain)
ansible-playbook -i hosts.yml maintenance.yml --tags cluster,prepare

# Run only package updates
ansible-playbook -i hosts.yml maintenance.yml --tags packages,updates

# Run only reboot handling
ansible-playbook -i hosts.yml maintenance.yml --tags reboot

# Run only cluster restoration (uncordon)
ansible-playbook -i hosts.yml maintenance.yml --tags restore

# Resume after manual reboot or failure
ansible-playbook -i hosts.yml maintenance.yml --tags resume

# OS-specific operations
ansible-playbook -i hosts.yml maintenance.yml --tags debian   # Debian/Ubuntu only
ansible-playbook -i hosts.yml maintenance.yml --tags redhat   # RHEL/CentOS only

# Longhorn-specific operations
ansible-playbook -i hosts.yml maintenance.yml --tags longhorn
```
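Tag filtering works because the role attaches tags to each included phase. A hedged sketch of how `tasks/main.yml` might wire the first two phases (illustrative, not the verbatim file):

```yaml
# Illustrative excerpt of tasks/main.yml: each phase is included with its own
# tags, so e.g. --tags prerequisites runs only that phase.
- name: Run pre-flight checks
  ansible.builtin.include_tasks:
    file: prerequisites.yml
    apply:
      tags: [prerequisites]
  tags: [prerequisites]

- name: Detect available updates
  ansible.builtin.include_tasks:
    file: package_checks.yml
    apply:
      tags: [packages, check_updates]
  tags: [packages, check_updates]
```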
```bash
# Resume maintenance after reboot failure
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags resume

# Manual uncordon after successful maintenance
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags uncordon

# Re-enable Longhorn scheduling only
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags longhorn
```
Customise behaviour through group variables:
```yaml
# Masters (group_vars/k3s_masters/main.yml)
k3s_node_maintenance_drain_timeout: 600
k3s_node_maintenance_wait_timeout: 1800
k3s_node_maintenance_skip_drain: true    # Masters are not drained

# Workers (group_vars/k3s_workers/main.yml)
k3s_node_maintenance_drain_timeout: 300
k3s_node_maintenance_wait_timeout: 600
k3s_node_maintenance_skip_drain: false
```
```yaml
# Debian/Ubuntu (group_vars/os_debian/main.yml)
k3s_node_maintenance_package_manager: apt
k3s_node_maintenance_cache_valid_time: 3600

# RHEL/CentOS (group_vars/os_redhat/main.yml)
k3s_node_maintenance_package_manager: dnf
k3s_node_maintenance_needs_restarting_available: true
```
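The `needs_restarting_available` flag hints at how reboot detection differs by OS family. A hedged sketch of what such detection can look like (task names and registered variables are illustrative, not the role's exact tasks):

```yaml
# Debian/Ubuntu: apt writes a marker file when a reboot is needed.
- name: Check for reboot marker (Debian/Ubuntu)
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: reboot_marker
  when: ansible_os_family == "Debian"

# RHEL/CentOS: needs-restarting -r exits 1 when a reboot is required.
- name: Check if reboot is required (RHEL/CentOS)
  ansible.builtin.command: needs-restarting -r
  register: needs_restarting
  changed_when: false
  failed_when: needs_restarting.rc not in [0, 1]
  when:
    - ansible_os_family == "RedHat"
    - k3s_node_maintenance_needs_restarting_available | bool
```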
Define your cluster in `hosts.yml`. Each node belongs to both a node-type group (`k3s_masters`/`k3s_workers`) and an OS group (`os_debian`/`os_redhat`), so the matching `group_vars` apply automatically:
```yaml
all:
  children:
    k3s_cluster:
      children:
        k3s_masters:
          hosts:
            master-01:
              ansible_host: 10.0.0.100
            master-02:
              ansible_host: 10.0.0.101
            master-03:
              ansible_host: 10.0.0.102
        k3s_workers:
          hosts:
            worker-01:
              ansible_host: 10.0.0.150
            worker-02:
              ansible_host: 10.0.0.151
    os_debian:
      hosts:
        master-01:
        worker-01:
    os_redhat:
      hosts:
        master-02:
        master-03:
        worker-02:
```
- Early Exit: Automatically skips maintenance when no updates are available (see the sketch after this list)
- Update Assessment: Checks for available packages before cluster operations
- Resource Preservation: Prevents unnecessary downtime and resource usage
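A minimal sketch of how such an early exit can be expressed in the role (the fact name `k3s_node_maintenance_updates_available` is hypothetical):

```yaml
# Sketch: if the update check found nothing to install, stop processing this
# host entirely before any cordon/drain happens.
- name: End play for this host when nothing needs updating
  ansible.builtin.meta: end_host
  when: not (k3s_node_maintenance_updates_available | default(false))
```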
- Pre-flight Checks: Validates prerequisites and cluster health
- Node Readiness: Ensures nodes are healthy before/after maintenance
- Control Plane: Validates API server and etcd health for masters
- Storage Integration: Checks Longhorn volume health (when available)
- Volume Recovery: Waits for degraded Longhorn volumes to recover before proceeding (sketched below)
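The Longhorn wait can be expressed with `kubernetes.core.k8s_info` against Longhorn's `Volume` CRD. A sketch under that assumption (retry counts are illustrative):

```yaml
# Sketch: poll Longhorn's Volume CRD until no volume reports a degraded
# robustness state. longhorn.io/v1beta2 is Longhorn's public API group.
- name: Wait for degraded Longhorn volumes to recover
  kubernetes.core.k8s_info:
    api_version: longhorn.io/v1beta2
    kind: Volume
    namespace: longhorn-system
  register: longhorn_volumes
  until: >-
    longhorn_volumes.resources
    | selectattr('status.robustness', 'defined')
    | selectattr('status.robustness', 'equalto', 'degraded')
    | list | length == 0
  retries: 30   # up to 10 minutes at 20s intervals
  delay: 20
  delegate_to: localhost
```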
- Sequential Processing: Maintains only one node at a time
- Drain Protection: Workers are properly drained; masters are never drained (see the drain sketch after this list)
- Smart Reboot Handling: Adaptive monitoring that waits for actual state changes
- Rollback Support: Stops on first failure to prevent cascade issues
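Draining itself maps naturally onto `kubernetes.core.k8s_drain`. A sketch using this role's documented variables (not necessarily the role's verbatim task):

```yaml
# Sketch: evict workloads from a worker before maintenance; masters skip this
# via k3s_node_maintenance_skip_drain and are only cordoned.
- name: Drain worker node before maintenance
  kubernetes.core.k8s_drain:
    state: drain
    name: "{{ inventory_hostname }}"
    delete_options:
      ignore_daemonsets: true       # DaemonSet pods cannot be evicted
      delete_emptydir_data: true    # allow eviction of pods using emptyDir
      wait_timeout: "{{ k3s_node_maintenance_drain_timeout }}"
  delegate_to: localhost            # talks to the API server via kubeconfig
  when: not k3s_node_maintenance_skip_drain
```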
Before running maintenance, verify the cluster state:

```bash
# Verify what updates are available
ansible all -i hosts.yml -m package_facts

# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces

# Verify Longhorn status (if applicable)
kubectl get pods -n longhorn-system
```
If a run ends early reporting that no updates are available, this is normal behaviour: the role intelligently skips maintenance when no packages need updating.
If a node stays cordoned after a run:

```bash
# Check node status
kubectl get nodes

# Manual uncordon if needed
kubectl uncordon <node-name>
```

If hosts are unreachable:

```bash
# Test connectivity
ansible all -i hosts.yml -m ping

# Check SSH access
ssh user@node-ip
```

For deeper debugging:

```bash
# Run with maximum verbosity
ansible-playbook -i hosts.yml maintenance.yml -vvv

# List all available tags
ansible-playbook -i hosts.yml maintenance.yml --list-tags

# Check specific task without running
ansible-playbook -i hosts.yml maintenance.yml --tags check_updates --check

# Resume from specific point
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags resume
```
| Tag | Description | Use Case |
|---|---|---|
| `prerequisites` | Pre-flight checks | Validate environment setup |
| `check_updates` | Package update detection | See what updates are available |
| `prepare` | Cluster preparation | Cordon/drain nodes only |
| `packages` | All package operations | Package management only |
| `updates` | Package installation | Install updates only |
| `reboot` | Reboot coordination | Reboot handling only |
| `restore` | Cluster restoration | Uncordon and restore scheduling |
| `resume` | Manual recovery | Resume after failures (includes restore) |
| `uncordon` | Node uncordoning | Restore node scheduling only |
| `debian` | Debian/Ubuntu only | OS-specific operations |
| `redhat` | RHEL/CentOS only | OS-specific operations |
| `longhorn` | Longhorn operations | Storage-specific tasks |
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This tool performs maintenance operations on your Kubernetes cluster. Always:
- Test in a non-production environment first
- Ensure you have recent backups
- Review the role tasks before deployment
- Monitor the process during execution
Use at your own risk. The authors are not responsible for any damage or data loss.
Enterprise-grade K3s maintenance made simple