
K3s Cluster Maintenance


Enterprise-grade automated OS patching and system maintenance for K3s cluster nodes with zero-downtime operations. This tool safely applies operating system updates, security patches, and package upgrades to your K3s nodes using a modular Ansible role architecture.

🎯 Quick Start

Run maintenance operations with simple commands:

# Update all worker nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_workers

# Update all master nodes  
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_masters

# Update specific node
ansible-playbook -i hosts.yml maintenance.yml --limit node-01

# Update entire cluster
ansible-playbook -i hosts.yml maintenance.yml

πŸ—οΈ Enterprise Architecture

This tool uses a modular Ansible role-based architecture for production deployments:

Role Structure

roles/
  k3s_node_maintenance/
    ├── tasks/
    │   ├── main.yml              # Main task orchestration
    │   ├── prerequisites.yml     # Pre-flight checks
    │   ├── package_checks.yml    # Update detection
    │   ├── cluster_preparation.yml # Node draining
    │   ├── package_updates.yml   # OS updates
    │   ├── debian_updates.yml    # Debian/Ubuntu specific
    │   ├── redhat_updates.yml    # RHEL/CentOS specific
    │   ├── reboot_handling.yml   # Reboot coordination
    │   └── cluster_restoration.yml # Node restoration
    ├── defaults/
    │   └── main.yml              # Default variables
    ├── handlers/
    │   └── main.yml              # Event handlers
    └── meta/
        └── main.yml              # Role metadata
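The role is driven by maintenance.yml one node at a time. A minimal playbook applying this structure might look like the sketch below; the `serial` and `any_errors_fatal` values are illustrative assumptions, so consult the repository's maintenance.yml for the authoritative version:

```yaml
# Sketch of a playbook applying the role sequentially (illustrative only).
- name: K3s node maintenance
  hosts: k3s_cluster
  serial: 1               # process one node at a time for zero-downtime updates
  any_errors_fatal: true  # stop on first failure to avoid cascading issues
  roles:
    - k3s_node_maintenance
```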

Group Variables

group_vars/
  ├── k3s_masters/main.yml      # Master-specific settings
  ├── k3s_workers/main.yml      # Worker-specific settings
  ├── os_debian/main.yml        # Debian/Ubuntu settings
  └── os_redhat/main.yml        # RHEL/CentOS settings

🚀 Features

  • 🔄 Automated OS Patching: System updates, security patches, and package upgrades
  • ⚡ Zero-Downtime Operations: Sequential node processing preserves cluster availability
  • 🔍 Intelligent Detection: Automatically skips maintenance when no updates are available
  • 🛡️ Health Monitoring: Comprehensive cluster and storage validation
  • 🔄 Storage Recovery: Automatic wait for degraded Longhorn volumes to recover
  • 🎛️ Control Plane Safety: Master node handling with quorum protection
  • 💾 Storage Integration: Native Longhorn support with volume health verification
  • 🔄 Reboot Management: Smart reboot handling that adapts to node boot speeds
  • 🏗️ Enterprise Ready: Modular role architecture for scalability and customisation

📦 Repository Contents

File               Description
maintenance.yml    Main playbook using enterprise role architecture
hosts.yml.example  Example inventory with group structure
ansible.cfg        Ansible configuration
roles/             Modular role architecture
group_vars/        Node type and OS-specific variables
requirements.txt   Python dependencies

📋 Prerequisites

  • K3s cluster (single or multi-node)
  • Ansible (>= 2.9, tested with 2.14.x)
  • kubectl configured for your cluster
  • SSH access to all nodes with key-based authentication
  • kubernetes.core collection for native Kubernetes operations
  • Python Kubernetes client for API operations

Optional Components

  • Longhorn storage system (health checks included)

πŸ› οΈ Installation

1. Clone Repository

git clone https://github.com/sudo-kraken/k3s-cluster-maintenance.git
cd k3s-cluster-maintenance

2. Install Dependencies

pip install -r requirements.txt

3. Install Ansible Collections

# Install required Kubernetes collection
ansible-galaxy collection install kubernetes.core

# Or install all collections from requirements
ansible-galaxy collection install -r collections/requirements.yml

Required Collections:

  • kubernetes.core (>= 2.4.0) - For native Kubernetes API operations

4. Configure Inventory

cp hosts.yml.example hosts.yml
# Edit hosts.yml with your cluster details

5. Test Connectivity

ansible all -i hosts.yml -m ping

📚 Usage Examples

Basic Maintenance Operations

# Update all worker nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_workers

# Update all master nodes
ansible-playbook -i hosts.yml maintenance.yml --limit k3s_masters

# Update specific node
ansible-playbook -i hosts.yml maintenance.yml --limit worker-01

# Update all Debian/Ubuntu nodes
ansible-playbook -i hosts.yml maintenance.yml --limit os_debian

# Update entire cluster
ansible-playbook -i hosts.yml maintenance.yml

Advanced Operations

# Dry run (check mode)
ansible-playbook -i hosts.yml maintenance.yml --check

# Update with custom variables
ansible-playbook -i hosts.yml maintenance.yml -e "k3s_node_maintenance_wait_timeout=1200"

# Verbose output for debugging
ansible-playbook -i hosts.yml maintenance.yml -v

# Update specific nodes by pattern
ansible-playbook -i hosts.yml maintenance.yml --limit "*master*"

Tagged Operations

Use tags to run specific phases of maintenance:

# Run only prerequisite checks
ansible-playbook -i hosts.yml maintenance.yml --tags prerequisites

# Check for available updates only
ansible-playbook -i hosts.yml maintenance.yml --tags packages,check_updates

# Run only cluster preparation (cordon/drain)
ansible-playbook -i hosts.yml maintenance.yml --tags cluster,prepare

# Run only package updates
ansible-playbook -i hosts.yml maintenance.yml --tags packages,updates

# Run only reboot handling
ansible-playbook -i hosts.yml maintenance.yml --tags reboot

# Run only cluster restoration (uncordon)
ansible-playbook -i hosts.yml maintenance.yml --tags restore

# Resume after manual reboot or failure
ansible-playbook -i hosts.yml maintenance.yml --tags resume

# OS-specific operations
ansible-playbook -i hosts.yml maintenance.yml --tags debian    # Debian/Ubuntu only
ansible-playbook -i hosts.yml maintenance.yml --tags redhat   # RHEL/CentOS only

# Longhorn-specific operations
ansible-playbook -i hosts.yml maintenance.yml --tags longhorn

Recovery Operations

# Resume maintenance after reboot failure
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags resume

# Manual uncordon after successful maintenance
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags uncordon

# Re-enable Longhorn scheduling only
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags longhorn

βš™οΈ Configuration

Role Variables

Customise behaviour through group variables:

Master Nodes (group_vars/k3s_masters/main.yml)

k3s_node_maintenance_drain_timeout: 600
k3s_node_maintenance_wait_timeout: 1800
k3s_node_maintenance_skip_drain: true  # Masters are not drained

Worker Nodes (group_vars/k3s_workers/main.yml)

k3s_node_maintenance_drain_timeout: 300
k3s_node_maintenance_wait_timeout: 600
k3s_node_maintenance_skip_drain: false

OS-Specific Settings

# Debian/Ubuntu (group_vars/os_debian/main.yml)
k3s_node_maintenance_package_manager: apt
k3s_node_maintenance_cache_valid_time: 3600

# RHEL/CentOS (group_vars/os_redhat/main.yml)
k3s_node_maintenance_package_manager: dnf
k3s_node_maintenance_needs_restarting_available: true
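These OS-specific variables pair with the per-OS task files in the role. As a sketch of how the dispatch might work (illustrative; the role's actual conditions may differ, though the file names follow the role layout shown earlier):

```yaml
# Sketch: dispatching to OS-specific task files based on Ansible facts.
- name: Apply Debian/Ubuntu updates
  ansible.builtin.include_tasks: debian_updates.yml
  when: ansible_facts['os_family'] == 'Debian'

- name: Apply RHEL/CentOS updates
  ansible.builtin.include_tasks: redhat_updates.yml
  when: ansible_facts['os_family'] == 'RedHat'
```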

Inventory Structure

Define your cluster in hosts.yml:

all:
  children:
    k3s_cluster:
      children:
        k3s_masters:
          hosts:
            master-01:
              ansible_host: 10.0.0.100
            master-02:
              ansible_host: 10.0.0.101
            master-03:
              ansible_host: 10.0.0.102
        k3s_workers:
          hosts:
            worker-01:
              ansible_host: 10.0.0.150
            worker-02:
              ansible_host: 10.0.0.151
        os_debian:
          hosts:
            master-01:
            worker-01:
        os_redhat:
          hosts:
            master-02:
            master-03:
            worker-02:

πŸ›‘οΈ Safety Features

Intelligent Detection

  • Early Exit: Automatically skips maintenance when no updates are available
  • Update Assessment: Checks for available packages before cluster operations
  • Resource Preservation: Prevents unnecessary downtime and resource usage

Health Validation

  • Pre-flight Checks: Validates prerequisites and cluster health
  • Node Readiness: Ensures nodes are healthy before/after maintenance
  • Control Plane: Validates API server and etcd health for masters
  • Storage Integration: Checks Longhorn volume health (when available)
  • Volume Recovery: Waits for degraded Longhorn volumes to recover before proceeding
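As an illustration of the volume-recovery wait, a task like the following could poll Longhorn's Volume resources until none report degraded robustness. This is a sketch, not the role's actual implementation, and the retry/delay values are placeholder assumptions:

```yaml
# Sketch: wait until no Longhorn volume reports degraded robustness
# (illustrative; retry counts and delay are placeholders).
- name: Wait for Longhorn volumes to recover
  kubernetes.core.k8s_info:
    api_version: longhorn.io/v1beta2
    kind: Volume
    namespace: longhorn-system
  register: longhorn_volumes
  until: >-
    longhorn_volumes.resources
    | selectattr('status.robustness', 'equalto', 'degraded')
    | list | length == 0
  retries: 30
  delay: 10
  delegate_to: localhost
```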

Operational Safety

  • Sequential Processing: Maintains only one node at a time
  • Drain Protection: Workers are properly drained; masters are never drained
  • Smart Reboot Handling: Adaptive monitoring that waits for actual state changes
  • Rollback Support: Stops on first failure to prevent cascade issues
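For workers, the cordon/drain step can be expressed with the kubernetes.core.k8s_drain module. The sketch below is illustrative only: the option values are assumptions, and masters would bypass it via the k3s_node_maintenance_skip_drain variable shown in the configuration section:

```yaml
# Sketch: drain a worker before patching (illustrative option values).
- name: Drain node before maintenance
  kubernetes.core.k8s_drain:
    state: drain
    name: "{{ inventory_hostname }}"
    delete_options:
      ignore_daemonsets: true
      delete_emptydir_data: true
      terminate_grace_period: 60
  when: not k3s_node_maintenance_skip_drain
  delegate_to: localhost
```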

🔧 Troubleshooting

Check Maintenance Status

# Verify what updates are available
ansible all -i hosts.yml -m package_facts

# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces

# Verify Longhorn status (if applicable)
kubectl get pods -n longhorn-system

Common Issues

"No updates needed"

This is normal behaviour - the role intelligently skips maintenance when no packages need updating.

Node not ready after maintenance

# Check node status
kubectl get nodes

# Manual uncordon if needed
kubectl uncordon <node-name>

Ansible connection issues

# Test connectivity
ansible all -i hosts.yml -m ping

# Check SSH access
ssh user@node-ip

Debug Mode

# Run with maximum verbosity
ansible-playbook -i hosts.yml maintenance.yml -vvv

# List all available tags
ansible-playbook -i hosts.yml maintenance.yml --list-tags

# Check specific task without running
ansible-playbook -i hosts.yml maintenance.yml --tags check_updates --check

# Resume from specific point
ansible-playbook -i hosts.yml maintenance.yml --limit node-01 --tags resume

🏷️ Tag Reference

Tag            Description               Use Case
prerequisites  Pre-flight checks         Validate environment setup
check_updates  Package update detection  See what updates are available
prepare        Cluster preparation       Cordon/drain nodes only
packages       All package operations    Package management only
updates        Package installation      Install updates only
reboot         Reboot coordination       Reboot handling only
restore        Cluster restoration       Uncordon and restore scheduling
resume         Manual recovery           Resume after failures (includes restore)
uncordon       Node uncordoning          Restore node scheduling only
debian         Debian/Ubuntu only        OS-specific operations
redhat         RHEL/CentOS only          OS-specific operations
longhorn       Longhorn operations       Storage-specific tasks

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

⚠️ Disclaimer

This tool performs maintenance operations on your Kubernetes cluster. Always:

  • Test in a non-production environment first
  • Ensure you have recent backups
  • Review the role tasks before deployment
  • Monitor the process during execution

Use at your own risk. The authors are not responsible for any damage or data loss.


Enterprise-grade K3s maintenance made simple
