@Exairnous Exairnous commented Jun 5, 2025

What?

Adds scripts to back up the data from your Hubs instance to your local hard drive and to restore a backup to your instance.

Why?

This will allow you to keep one or more local copies of your data, restore your data to your instance if needed, and migrate all your data from one instance to another, e.g. when moving from one hosting company to another.

Examples

Backup folder structure

  • data_backups
    • data_backup_1749037375184
      • reticulum_storage_data
        • cached
        • expiring
        • owned
      • pg_dump.sql
    • data_backup_1749058020987
    • etc.

Note: additional folders may be present in the reticulum_storage_data folder and/or some may be omitted. This depends on the individual instance.

How to test

  1. Run the backup script.
  2. Make a change, e.g. create a small scene.
  3. Run the backup script again.
  4. Run the restore-backup script with the name of the first backup (see the readme instructions for details).
  5. Check that your change has been reverted.
  6. Run the restore-backup script with no backup specified (this will use the latest backup).
  7. See that your change is back.

Documentation of functionality

Instructions have been added to the readme. A PR for further documentation is planned for the Hubs docs repository.

Limitations

This requires the pods to be running, so in order to prevent people from using the instance while backing up/restoring, you'd need to remove the load balancer IP from your DNS A records (unless there's some other way to bar people from the instance while keeping the pods running that I'm unaware of). UPDATE: This has been addressed for the restore script by introducing a crude maintenance mode. Thanks to the review comments for pointing me toward this solution.

Alternatives considered

  • Using Node directly to transfer the data, but the current approach is easier and doesn't reinvent the wheel.
  • Using something like rclone, but that would require another dependency to be installed.

Open questions

  • What happens if someone is saving something on the instance while a backup is running? Will the backup be corrupted?
  • What happens if someone is saving something on the instance while a backup is being restored? Will the instance get corrupted? UPDATE: No longer applicable after updates from the review - a crude maintenance mode has been introduced to prevent people being connected to the instance during the restore.
  • Is there a way to bar people from the instance, while keeping the pods running, that doesn't require you to edit your DNS A records? UPDATE: Yes. See the update for the previous question.

Additional details or related context

This will be needed to migrate the data from the persistent volumes on your node, to persistent volumes that are completely separate from the node (PR #363).

@DougReeder

If no other app is using the load balancer used by Hubs, changing the DNS address is fine (though, due to DNS caching, it neither takes effect immediately nor can be reversed immediately).

A more focused approach is to add this annotation to the Ingress:

    haproxy.org/allow-list: 11.22.33.44

where 11.22.33.44 is the external IP address of your development machine. It takes effect immediately, can be reversed immediately, and allows you access to your Hubs instance while denying all others.
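For illustration, the annotation would sit in the Ingress metadata roughly like this (the resource name and namespace follow this repo's `hcce.yaml`; the IP is a placeholder):

```yaml
# Illustrative placement only -- add the annotation to the existing
# Ingress definitions in the template rather than copying this verbatim.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ret
  namespace: hcce
  annotations:
    haproxy.org/allow-list: "11.22.33.44"
```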

Editing the ingress can be done with the command

kub edit ingress reticulum -n hcce

and similarly for dialog and nearspark.

To restore access, it's probably best to re-apply hcce.yaml

@Exairnous Exairnous commented Jun 6, 2025

If no other app is using the load balancer used by Hubs...

True, I didn't think about other apps using the load balancer. So yes, disabling the DNS is not a great solution :(

A more focussed approach is to add this annotation to the Ingress:

I tried adding haproxy.org/allow-list: 66.66.66.66 to the annotations section of the three ingresses to test the blocking, but my instance still loaded fine for me.

These are the commands I used to edit the ingresses (I'm guessing kub is your alias for kubectl? Also, it appears the ingress name for reticulum is ret, which we should update to reticulum at some point):

EDITOR=kwrite kubectl edit ingress ret -n hcce
EDITOR=kwrite kubectl edit ingress dialog -n hcce
EDITOR=kwrite kubectl edit ingress nearspark -n hcce

Also, reapplying hcce.yaml didn't seem to remove the edits (running those three commands again showed the additions still there).

Am I misunderstanding the procedure you're suggesting or is there something more I need to do to apply the edits (I closed the editor and saw this in the terminal: ingress.networking.k8s.io/ret edited)?

@DougReeder

You're correct that reapplying the template after manually editing the ingress doesn't reset it. The correct procedure is to add the annotation to the template file and apply it. To revert, one then comments out the annotation and re-applies the template.

It's possible that annotation wasn't supported by the beta version of the HAProxy ingress controller that Hubs normally uses. (I'm using the latest.)

You might try

     haproxy.org/request-redirect: example.com

@DougReeder DougReeder commented Jun 6, 2025

As I'm using an external database, I can't fully test these. I have to comment out the pgsql bits, and I'm hesitant to restore only the reticulum files.

That said, the code looks fine and the backup script did create a copy of the reticulum files for me.

What: Adds a primitive maintenance mode to the restore-backup script. This is applied at the beginning: it configures HAProxy to redirect traffic to a non-existent maintenance-mode subdomain, then restarts the instance to disconnect anyone currently present and prevent anyone new from joining. The redirects are removed and the instance is returned to normal once the restore is finished.

Why: So that people can't interrupt/corrupt the restore by modifying data on the instance while the restore is happening.

Note: At present the maintenance mode isn't a real page, so it's not all that pretty, and you won't be redirected back to your previous page once the restore is finished (even if you reload), but it gets the job done. Ideally, these faults should be addressed at some point in the future.
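One way to sketch the annotate/un-annotate toggling is with `kubectl annotate` (a trailing `-` on an annotation key removes it). The ingress names below match this repo, but the subdomain and the command-building helper are hypothetical, and the real script also restarts pods to disconnect connected users:

```javascript
// Hypothetical sketch: build the kubectl commands that enable or disable
// the crude maintenance mode by setting/removing a redirect annotation.
const INGRESSES = ['ret', 'dialog', 'nearspark']; // ingress names in this repo
const NAMESPACE = 'hcce';
const REDIRECT = 'maintenance.example.com'; // placeholder subdomain

function maintenanceCommands(enable) {
  return INGRESSES.map((name) =>
    enable
      ? `kubectl annotate ingress ${name} -n ${NAMESPACE} ` +
        `haproxy.org/request-redirect=${REDIRECT} --overwrite`
      : `kubectl annotate ingress ${name} -n ${NAMESPACE} ` +
        `haproxy.org/request-redirect-` // trailing "-" removes the annotation
  );
}
```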

What: Prints headings for the general steps of the restore script and prints the command output to the terminal.

Why: This is a very involved and potentially long running script, and the additional output should help reduce confusion as to whether the script is running normally or has gotten stuck.

Note: This implements similar behavior to the apply script, but that, at present, will only work with the main configuration file and not a secondary, temporary one.  In the future, the code to apply a Kubernetes configuration and monitor the deployment status should potentially be further abstracted so it can work with any configuration files and only one version of the code is needed.
…g a backup

What: Explicitly specifies the Reticulum container in the Reticulum pod as the container to copy the data to instead of relying on it being automatically selected by default.

Why: Reduces ambiguity and prevents bugs from cropping up in the future if anything changes and the Reticulum container is no longer the first container.
@Exairnous

You might try
haproxy.org/request-redirect: example.com

Thanks. That looks like it'll work. I've updated the PR.

As I'm using an external database, I can't fully test these. I have to comment out the pgsql bits, and I'm hesitant to restore only the reticulum files.

Good point. If you think it would be an easy change to support backing up/restoring an external database as well, then it would be good to add that in. If not, then we should probably wait until we add official support for creating external database setups and add backup/restore support then.

@DougReeder

Let's leave external DB support for another PR - it won't be trivial to add.

@Exairnous

Let's leave external DB support for another PR - it won't be trivial to add.

Okay. Sounds good.

@Exairnous

@hobbs-Hobbler tested the restore-backup script on Windows with an old backup from a previous version of the scripts (but using the latest version of the restore-backup script), and everything appeared to work well. It was a non-standard setup, so the results should probably be interpreted with some reservation, but I'm optimistic that at least the restore-backup script should work correctly on Windows.

@DougReeder

restoring Reticulum '._cached' folder
restoring Reticulum '._expiring' folder
restoring Reticulum '._owned' folder
restoring Reticulum '._storage' folder

The restore-backup script should probably not copy back folders created by the OS, like these.

Why: An instance using an external database (https://hominidsoftware.com/tech-personal-growth/Hubs-Managed-Databse/Hubs-Managed-Database/) will not have a pgsql pod.
Also, a damaged instance might not be running the pgsql pod.
There is still value in backing up and/or restoring just the reticulum files.

Also handles empty blocks in `hcce.yaml`.
Also extracts the IP addresses of all load balancers, as a modern ingress controller might not be in the `hcce` namespace.

Open Question: backing up and restoring an external PostgreSQL database might or might not fit into these scripts.
@Exairnous

....
restoring Reticulum '._storage' folder
The restore-backup script should probably not copy back folders created by the OS, like these.

I'm not familiar with these folders. Are they a Mac thing? The reason I'm looping over them is that we can't know exactly which folders will be present for a backup/restore, but if they will never be needed and can be reliably detected by the leading period and underscore, then I could add a guard to filter them out.

@DougReeder

A key source of kubectl cp failures appears to be that, under the hood, it's using tar, which is not designed to deal gracefully with network failures. So, having the backup script retry kubectl cp is okay. I recommend not attempting more than a dozen retries at either level, though.

DougReeder and others added 5 commits August 13, 2025 11:09
Why: If there is more than one load balancer in the cluster, the user needs to select the appropriate one.
Backup and restore scripts now continue if pgsql pod is missing.
What: Uses the "junk" package to remove any OS helper files/folders that were created in the Reticulum storage data before restoring the backup.

Why: Various user actions can result in the user's OS generating helper files/folders that aren't needed by Hubs, which increases the upload size and clutters up the restored reticulum data back on the Kubernetes storage.

Note: This encloses the entire restore-backup script in an async function in order to allow loading the "junk" package, which doesn't support require/CommonJS modules.
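As a rough illustration of what such filtering does (this is a simplified stand-in, not the actual "junk" package, which covers many more patterns):

```javascript
// Simplified stand-in for the "junk" package's isJunk() check.
// Patterns here are common examples only; the real package is more thorough.
const OS_JUNK_PATTERNS = [
  /^\.DS_Store$/,    // macOS Finder metadata
  /^\._.+/,          // macOS AppleDouble resource forks, e.g. "._owned"
  /^Thumbs\.db$/i,   // Windows Explorer thumbnail cache
  /^desktop\.ini$/i  // Windows folder settings
];

function isOsJunk(name) {
  return OS_JUNK_PATTERNS.some((re) => re.test(name));
}

function withoutJunk(names) {
  return names.filter((name) => !isOsJunk(name));
}
```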
What: Passes an environment variable to the kubectl cp command to disable using websockets.

Why: Websockets are enabled by default in kubectl 1.30+ and this can cause transfers to fail and not retry.  Disabling websockets avoids the issue.

References:
Link to GitHub issue with the documented workaround:
kubernetes/kubernetes#60140 (comment)

Link to GitHub PR which introduced websockets as the default and the note that it affects kubectl cp:
kubernetes/kubernetes#123281 (comment)
What: Uses the "find" command in the Reticulum pod to remove the contents of the Reticulum storage directory on the Kubernetes cluster before restoring the contents of the local backup.

Why: To ensure a full restoration.  kubectl cp merges the source directory into the destination directory, so depending on what's in the Reticulum storage on the Kubernetes cluster, there may be stuff left over from before the backup was applied that will remain if the Reticulum storage isn't cleared first, which would cause the final result to be different from the backup.
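The clear-then-restore sequence described above could look roughly like this; the pod name, container name, and storage path are illustrative placeholders, not the script's actual values:

```javascript
// Hypothetical command strings for the clear-then-restore sequence.
const NAMESPACE = 'hcce';
const POD = 'reticulum-xxxx';    // actual pod name is looked up at runtime
const CONTAINER = 'reticulum';
const STORAGE_DIR = '/storage';  // placeholder path inside the container

// Delete everything under the storage dir without removing the dir itself.
const clearCmd =
  `kubectl exec ${POD} -n ${NAMESPACE} -c ${CONTAINER} -- ` +
  `find ${STORAGE_DIR} -mindepth 1 -delete`;

// Then copy the local backup in; -c names the reticulum container
// explicitly rather than relying on the default container selection.
const copyCmd =
  `kubectl cp ./reticulum_storage_data ` +
  `${NAMESPACE}/${POD}:${STORAGE_DIR} -c ${CONTAINER}`;
```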
@Exairnous

Updates:

I have updated this to integrate the "junk" npm package and remove all the OS files from the backup before uploading. I realized that the OS files could potentially be present in any of the subfolders as well as the main folder, so since kubectl cp doesn't have any option for ignoring files/folders I thought this was an easier solution than rebuilding the structure piece by piece on the pod storage.

I found that disabling the websockets did work reliably for me, so I think this is a much better solution than auto-retrying, and it should hopefully just automatically get phased out when kubectl gets fixed (@DougReeder you were right at the dev meetup last week that there was a better way). I still think we want to keep it on infinite retries for kubectl cp itself, though; so far there have been no issues reported with doing that and it should ensure that the backup will succeed no matter how large it is, even when there is an unstable network connection. But yes, kubectl using tar under the hood is very far from ideal; I hope they redo it to be much more reliable.

I have also updated the restore script to clear the Reticulum pod storage before uploading the backup in order to ensure a return to the exact state of the backup.

I think these updates should make this about ready to merge (assuming no one finds any issues when reviewing/testing and I didn't miss any review comments), but it would be good to try and get as many people to test as we can (I'll see if I can get this tested at the documentation meetup this week).

@Exairnous

Oh, and documentation for the Hubs docs for the backup/restore scripts has been written, but hasn't been put up as a PR yet.

@Exairnous

Also, sorry about the completely mangled diff for the Filter out OS created helper files/folders commit, it's because of the indentation change. If you're looking at it, you'll want to use GitHub's Hide whitespace option (which actually shows the changes really well).

@DougReeder DougReeder left a comment
working fine for me
