s3stor: A Deduplicating S3 Backup Utility

s3stor is a command-line tool for backing up files to S3-compatible storage (e.g., AWS S3, Wasabi) with block-based deduplication, point-in-time snapshots, and efficient file management. Designed for reliability and multi-writer safety, it supports syncing files, creating snapshots (with Volume Shadow Copy Service on Windows), listing/restoring files, and cleaning up unused data. Ideal for backup scenarios requiring data integrity and storage efficiency.

TL;DR (For Impatient Users)

Quick Setup

# Install Go (https://go.dev/doc/install)
git clone https://github.com/shvechkov/s3stor.git
cd s3stor
go build -o s3stor

# Configure Wasabi (or other S3-compatible storage)
export S3_PROVIDER=wasabi
export S3_BUCKET=your-bucket-name
export S3_REGION=us-east-1
export S3_ENDPOINT=https://s3.us-east-1.wasabisys.com
export AWS_ACCESS_KEY_ID=your-wasabi-access-key
export AWS_SECRET_ACCESS_KEY=your-wasabi-secret-key

Basic Usage

# Sync a file to S3
./s3stor sync test_out/file1.txt
# Output: Synced file1.txt (123 bytes)

# Create a snapshot
./s3stor snapshot test_out sn001 file1.txt
# Output: Snapshot sandow-sn001 created with 1 files

# List files in snapshot
./s3stor ls sandow-sn001
# Output: Files in snapshot sandow-sn001 (created 2025-07-27T22:50:00Z by sandow):
#         - file1.txt (123 bytes)

# Restore a file from snapshot
./s3stor get sandow-sn001 file1.txt ./restore
# Output: File reconstructed to: ./restore/file1.txt

# Delete a file from global catalog
./s3stor delete file1.txt
# Output: Deleted file: file1.txt
#         Block cleanup completed: 0 blocks deleted

Jump to Usage for more examples or Architecture for how it works.

Table of Contents

  • Features
  • Architecture
  • Installation
  • Configuration
  • Usage
  • Examples
  • S3 Bucket Structure
  • Locking Mechanism
  • Troubleshooting
  • Contributing
  • License

Features

  • Block-Based Deduplication: Splits files into blocks, stores unique blocks by SHA-256 hash, and reuses them across files and snapshots to save storage.
  • Point-in-Time Snapshots: Creates consistent backups using Volume Shadow Copy Service (VSS) on Windows, with independent file maps for each snapshot.
  • Multi-Writer Safety: Uses S3-based locking to prevent conflicts when multiple instances (e.g., on different machines) access the same bucket.
  • File Management:
    • sync: Upload files to S3 with deduplication, creating the global catalog if missing.
    • ls: List files in the global catalog or snapshots, creating the global catalog if missing.
    • get: Restore files from snapshots or global catalog.
    • map: Display block mappings for a file in global catalog or a snapshot.
    • snapshot: Create snapshots of specified files.
    • delete-snapshot: Remove snapshots and their metadata.
    • delete: Remove files from global catalog with safe block cleanup.
    • cleanup-blocks: Remove unreferenced blocks to reclaim storage.
  • S3 Compatibility: Works with AWS S3, Wasabi, and other S3-compatible providers.
  • Efficient Cleanup: Safely deletes unreferenced blocks only after checking all file maps (global and snapshot).

Architecture

s3stor organizes data in an S3 bucket using a structured layout, with separate catalogs for global files and snapshots, deduplicated block storage, and a locking mechanism for concurrency.

Components

  1. Global Catalog (catalog.json):

    • Stores metadata for files synced via sync.
    • Automatically created as an empty catalog ([]) on first sync or ls if not found.
    • Format: JSON array of entries:
      [
        {
          "file_name": "file1.txt",
          "file_size": 123,
          "map_key": "maps/file1.txt.json"
        },
        {
          "file_name": "d001/f005.txt",
          "file_size": 456,
          "map_key": "maps/d001/f005.txt.json"
        }
      ]
    • map_key points to a file map listing block hashes.
  2. File Maps (maps/<file_name>.json):

    • For each file in the global catalog, stores metadata and a list of SHA-256 block hashes (the Go struct sketch after this list mirrors these shapes):
      {
        "file_name": "file1.txt",
        "file_size": 123,
        "block_size": 1048576,
        "blocks": ["a1b2c3d4...", "e5f6g7h8..."]
      }
    • Blocks are stored in blocks/<hash>.
  3. Snapshot Catalog (<hostname>/snapshots/<snapshot_id>/catalog.json):

    • Created by snapshot, stores metadata for files in a snapshot (e.g., sandow/snapshots/sandow-sn001).
    • Format: JSON object:
      {
        "snapshot_id": "sandow-sn001",
        "timestamp": "2025-07-27T22:50:00Z",
        "computer_id": "sandow",
        "files": [
          {
            "file_name": "file1.txt",
            "file_size": 123,
            "map_key": "sandow/snapshots/sandow-sn001/maps/file1.txt.json"
          }
        ]
      }
    • Independent of global catalog, with separate file maps.
  4. Snapshot File Maps (<hostname>/snapshots/<snapshot_id>/maps/<file_name>.json):

    • Similar to global file maps, lists block hashes for snapshot files.
    • Ensures snapshots are self-contained, unaffected by global catalog changes.
  5. Block Storage (blocks/<hash>):

    • Stores unique file blocks, identified by SHA-256 hashes.
    • Deduplication ensures identical blocks are stored only once, referenced by multiple file maps.
  6. Locks (locks/global/<resource>.lock, locks/<hostname>/snapshots/<snapshot_id>/<resource>.lock):

    • S3 objects used for concurrency control (e.g., locks/global/catalog.lock, locks/global/file1.txt.lock).
    • Prevents race conditions in multi-writer scenarios (e.g., multiple s3stor instances).
    • Automatically expire via S3 lifecycle policy (1-day retention).
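
The JSON layouts above map naturally onto a few Go types. The sketch below is illustrative only; the struct, field, and package names mirror the JSON keys and are assumptions, not necessarily what main.go uses:

package s3stor

// CatalogEntry is one element of the global catalog.json and of the
// "files" array inside a snapshot catalog.
type CatalogEntry struct {
    FileName string `json:"file_name"`
    FileSize int64  `json:"file_size"`
    MapKey   string `json:"map_key"`
}

// SnapshotCatalog mirrors <hostname>/snapshots/<snapshot_id>/catalog.json.
type SnapshotCatalog struct {
    SnapshotID string         `json:"snapshot_id"`
    Timestamp  string         `json:"timestamp"`
    ComputerID string         `json:"computer_id"`
    Files      []CatalogEntry `json:"files"`
}

// FileMap mirrors maps/<file_name>.json: an ordered list of SHA-256 block
// hashes that reconstructs the file when the blocks are concatenated.
type FileMap struct {
    FileName  string   `json:"file_name"`
    FileSize  int64    `json:"file_size"`
    BlockSize int      `json:"block_size"`
    Blocks    []string `json:"blocks"`
}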

Data Flow

  • Sync:
    1. Read local file, split into blocks, compute SHA-256 hashes.
    2. Upload new blocks to blocks/<hash> if not already present.
    3. Create file map (maps/<file_name>.json) listing block hashes.
    4. Create or update catalog.json with file metadata.
  • Snapshot:
    1. Use VSS (Windows) for consistent file access.
    2. Create snapshot catalog (<hostname>/snapshots/<snapshot_id>/catalog.json).
    3. Copy or create file maps in <hostname>/snapshots/<snapshot_id>/maps/.
    4. Reuse existing blocks in blocks/<hash>.
  • Delete:
    1. Remove file from catalog.json and delete its file map.
    2. Clean up unreferenced blocks by checking all file maps (global and snapshot).
  • Get:
    1. Read file map (from global catalog or snapshot) to get block hashes.
    2. Download blocks from blocks/<hash>.
    3. Reconstruct file locally.
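
To make the Get flow concrete, here is a minimal Go sketch of those three steps: parse a file map, download each referenced block, and write the blocks out in order. fetchObject stands in for the tool's S3 download helper, and the other names are likewise assumptions:

package s3stor

import (
    "encoding/json"
    "os"
    "path/filepath"
)

// restoreFile reconstructs a file from its map by downloading every block
// stored under blocks/<hash> and concatenating the blocks in order.
func restoreFile(mapJSON []byte, outputDir string, fetchObject func(key string) ([]byte, error)) error {
    // Only the fields needed for reconstruction are decoded here.
    var fm struct {
        FileName string   `json:"file_name"`
        Blocks   []string `json:"blocks"`
    }
    if err := json.Unmarshal(mapJSON, &fm); err != nil {
        return err
    }
    outPath := filepath.Join(outputDir, filepath.FromSlash(fm.FileName))
    if err := os.MkdirAll(filepath.Dir(outPath), 0o755); err != nil {
        return err
    }
    out, err := os.Create(outPath)
    if err != nil {
        return err
    }
    defer out.Close()
    for _, hash := range fm.Blocks {
        block, err := fetchObject("blocks/" + hash)
        if err != nil {
            return err
        }
        if _, err := out.Write(block); err != nil {
            return err
        }
    }
    return nil
}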

Deduplication

  • Files are split into fixed-size blocks (default: 1 MiB, i.e., 1048576 bytes).
  • Each block’s SHA-256 hash is computed and stored in blocks/<hash>.
  • File maps reference these blocks, enabling deduplication across files and snapshots.
  • Example: If file1.txt and file2.txt share a block, it’s stored once in blocks/a1b2c3d4... and referenced by both file maps.
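
A minimal sketch of that split-and-hash step in Go. blockExists and putBlock stand in for the tool's S3 helpers (roughly a HeadObject check and a PutObject on blocks/<hash>); all names here are illustrative, not the actual implementation:

package s3stor

import (
    "crypto/sha256"
    "encoding/hex"
    "io"
    "os"
)

const defaultBlockSize = 1 << 20 // 1 MiB (1048576 bytes)

// syncBlocks splits a local file into fixed-size blocks, hashes each block
// with SHA-256, and uploads only blocks that are not already stored.
// It returns the ordered hash list that would go into the file map.
func syncBlocks(path string, blockExists func(hash string) (bool, error), putBlock func(hash string, data []byte) error) ([]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var hashes []string
    buf := make([]byte, defaultBlockSize)
    for {
        n, err := io.ReadFull(f, buf)
        if n > 0 {
            sum := sha256.Sum256(buf[:n])
            hash := hex.EncodeToString(sum[:])
            hashes = append(hashes, hash)
            exists, checkErr := blockExists(hash)
            if checkErr != nil {
                return nil, checkErr
            }
            if !exists {
                // Identical blocks are stored only once under blocks/<hash>.
                if putErr := putBlock(hash, buf[:n]); putErr != nil {
                    return nil, putErr
                }
            }
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            break // end of file (possibly after a final partial block)
        }
        if err != nil {
            return nil, err
        }
    }
    return hashes, nil
}

The ordered hash list returned here is what the file map's blocks array records.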

Installation

  1. Install Go:
    See https://go.dev/doc/install for your platform.
  2. Clone Repository:
    git clone https://github.com/shvechkov/s3stor.git
    cd s3stor
  3. Build:
    go build -o s3stor
  4. Verify:
    ./s3stor
    # Output: Usage: go run main.go <sync|ls|get|map|snapshot|delete-snapshot|cleanup-blocks|delete> [args...]

Configuration

s3stor uses environment variables for S3 configuration. Example for Wasabi:

export S3_PROVIDER=wasabi
export S3_BUCKET=your-bucket-name
export S3_REGION=us-east-1
export S3_ENDPOINT=https://s3.us-east-1.wasabisys.com
export AWS_ACCESS_KEY_ID=your-wasabi-access-key
export AWS_SECRET_ACCESS_KEY=your-wasabi-secret-key

Required S3 Permissions

Ensure your S3 credentials allow:

{
  "Effect": "Allow",
  "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket"],
  "Resource": ["arn:aws:s3:::your-bucket-name/*", "arn:aws:s3:::your-bucket-name"]
}

Lock Expiration

Set an S3 lifecycle policy to expire locks after 1 day:

aws s3api put-bucket-lifecycle-configuration --bucket your-bucket-name --lifecycle-configuration '{
  "Rules": [{
    "ID": "CleanLocks",
    "Status": "Enabled",
    "Filter": {"Prefix": "locks/"},
    "Expiration": {"Days": 1}
  }]
}'

Usage

s3stor <command> [args...]

Commands

  • sync <file_or_dir>:
    • Uploads files to S3 with deduplication, creating the global catalog if missing.
    • Example: ./s3stor sync test_out/file1.txt
  • ls [snapshot_id]:
    • Lists files in the global catalog (creates an empty catalog if missing) or a specific snapshot.
    • Example: ./s3stor ls or ./s3stor ls sandow-sn001
  • get [<snapshot_id>] <file_name> <output_dir>:
    • Restores a file from a snapshot (if snapshot_id provided) or global catalog.
    • Example: ./s3stor get sandow-sn001 file1.txt ./restore or ./s3stor get file1.txt ./restore
  • map [<snapshot_id>] <file_name>:
    • Displays block mappings for a file in the global catalog or a snapshot (if snapshot_id provided).
    • Example: ./s3stor map file1.txt or ./s3stor map sandow-sn001 file1.txt
  • snapshot <source_dir> <snapshot_id> [file_names...]:
    • Creates a snapshot of specified files using VSS (Windows).
    • Example: ./s3stor snapshot test_out sn001 file1.txt
  • delete-snapshot <snapshot_id>:
    • Deletes a snapshot and its metadata, with block cleanup.
    • Example: ./s3stor delete-snapshot sandow-sn001
  • cleanup-blocks:
    • Removes unreferenced blocks after checking all file maps.
    • Example: ./s3stor cleanup-blocks
  • delete <file_name>:
    • Removes a file from the global catalog, with block cleanup.
    • Example: ./s3stor delete file1.txt

Examples

1. Sync Files

Upload a file and a directory to S3:

./s3stor sync test_out/file1.txt
# Output: Synced file1.txt (123 bytes)

./s3stor sync test_out/d001
# Output: Synced d001/f005.txt (456 bytes)

2. Create a Snapshot

Create a snapshot of specific files:

./s3stor snapshot test_out sn001 file1.txt d001/f005.txt
# Output: Snapshot sandow-sn001 created with 2 files

3. List Files

List files in the global catalog (creates empty catalog if none exists):

./s3stor ls
# Output: Files in global catalog:
#         - file1.txt (123 bytes)
#         - d001/f005.txt (456 bytes)
# If no catalog exists:
# Output: Files in global catalog:
#         (none)

List files in a snapshot:

./s3stor ls sandow-sn001
# Output: Files in snapshot sandow-sn001 (created 2025-07-27T22:50:00Z by sandow):
#         - file1.txt (123 bytes)
#         - d001/f005.txt (456 bytes)

4. View File Map

View block mappings for a file in the global catalog:

./s3stor map file1.txt
# Output: File Map for file1.txt:
#         File Name: file1.txt
#         File Size: 123 bytes
#         Block Size: 1048576 bytes
#         Blocks:
#           1: a1b2c3d4...
#           2: e5f6g7h8...

View block mappings for a file in a snapshot:

./s3stor map sandow-sn001 file1.txt
# Output: File Map for file1.txt:
#         File Name: file1.txt
#         File Size: 123 bytes
#         Block Size: 1048576 bytes
#         Blocks:
#           1: a1b2c3d4...
#           2: e5f6g7h8...

5. Restore a File

Restore a file from a snapshot:

./s3stor get sandow-sn001 file1.txt ./restore
# Output: File reconstructed to: ./restore/file1.txt

Restore a file from the global catalog:

./s3stor get file1.txt ./restore
# Output: File reconstructed to: ./restore/file1.txt

6. Delete a File

Remove a file from the global catalog:

./s3stor delete file1.txt
# Output: Deleted file: file1.txt
#         Block cleanup completed: 0 blocks deleted

7. Delete a Snapshot

Remove a snapshot:

./s3stor delete-snapshot sandow-sn001
# Output: Snapshot sandow-sn001 deleted

8. Clean Up Blocks

Manually clean unreferenced blocks:

./s3stor cleanup-blocks
# Output: Block cleanup completed: 2 blocks deleted

S3 Bucket Structure

After running the commands above, your bucket (your-bucket-name) will contain:

your-bucket-name/
├── catalog.json
├── maps/
│   ├── file1.txt.json
│   ├── d001/f005.txt.json
├── blocks/
│   ├── a1b2c3d4...
│   ├── e5f6g7h8...
├── <hostname>/
│   ├── snapshots/
│   │   ├── sandow-sn001/
│   │   │   ├── catalog.json
│   │   │   ├── maps/
│   │   │   │   ├── file1.txt.json
│   │   │   │   ├── d001/f005.txt.json
├── locks/
│   ├── global/
│   │   ├── catalog.lock
│   │   ├── file1.txt.lock
│   │   ├── cleanup.lock
│   ├── <hostname>/
│   │   ├── snapshots/
│   │   │   ├── sandow-sn001/
│   │   │   │   ├── file1.txt.lock

Locking Mechanism

  • Purpose: Ensures safe concurrent access in multi-writer scenarios (e.g., multiple s3stor instances on sandow or other machines).
  • Implementation: S3 objects (locks/global/<resource>.lock, locks/<hostname>/snapshots/<snapshot_id>/<resource>.lock) act as mutexes.
    • Example: locks/global/catalog.lock for global catalog updates.
    • Example: locks/sandow/snapshots/sandow-sn001/file1.txt.lock for snapshot file operations.
  • Acquisition (sketched in Go below):
    • Attempts to write a lock object containing a unique owner ID (e.g., the hostname sandow).
    • Retries (default: 3 attempts) if the lock is held by another instance.
  • Expiration: Locks expire after 1 day via S3 lifecycle policy, preventing deadlocks.
  • Commands Using Locks:
    • sync, delete, snapshot, delete-snapshot, cleanup-blocks.
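
A minimal sketch of this check-then-write pattern, assuming the AWS SDK for Go v2; acquireLock, its back-off, and its error handling are illustrative, not the tool's exact code:

package s3stor

import (
    "bytes"
    "context"
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

const maxLockRetries = 3

// acquireLock tries to claim a lock object such as locks/global/catalog.lock.
// The check-then-put is not atomic, so stale or racing locks are ultimately
// cleared by the 1-day lifecycle expiration on the locks/ prefix.
func acquireLock(ctx context.Context, client *s3.Client, bucket, lockKey, owner string) error {
    for attempt := 0; attempt < maxLockRetries; attempt++ {
        // If the lock object already exists, another instance holds it.
        // (A real implementation would distinguish NotFound from other errors.)
        if _, err := client.HeadObject(ctx, &s3.HeadObjectInput{
            Bucket: aws.String(bucket),
            Key:    aws.String(lockKey),
        }); err == nil {
            time.Sleep(time.Second) // back off, then retry
            continue
        }
        // Write the lock object containing our owner ID (e.g., the hostname).
        _, err := client.PutObject(ctx, &s3.PutObjectInput{
            Bucket: aws.String(bucket),
            Key:    aws.String(lockKey),
            Body:   bytes.NewReader([]byte(owner)),
        })
        return err
    }
    return fmt.Errorf("could not acquire %s after %d attempts", lockKey, maxLockRetries)
}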

Troubleshooting

  • Snapshot Creates 0 Files:
    • Cause: Files not found in source_dir, VSS access denied, or lock conflicts.
    • Fix:
      • Verify files: ls test_out/file1.txt.
      • Check VSS permissions (Windows): Run as administrator.
      • List locks: aws s3 ls s3://your-bucket-name/locks/.
      • Remove stuck locks: aws s3 rm s3://your-bucket-name/locks/global/file1.txt.lock.
  • File Not Found in Catalog:
    • Cause: File not synced or deleted.
    • Fix: Run ./s3stor ls to check catalog, then sync the file.
  • Global Catalog Not Found:
    • Cause: No prior sync or ls commands executed.
    • Fix: Run ./s3stor ls or ./s3stor sync <file> to create an empty catalog.
  • Lock Acquisition Fails:
    • Cause: Another instance holds the lock.
    • Fix: Wait and retry, or increase maxLockRetries in code (default: 3).
  • S3 Permission Errors:
    • Cause: Insufficient IAM permissions.
    • Fix: Update policy with required actions (PutObject, GetObject, DeleteObject, ListBucket).
  • Blocks Not Cleaned Up:
    • Cause: Eventual consistency in S3 or recent snapshot creation.
    • Fix: Retry cleanup-blocks or add a short delay (e.g., time.Sleep(1 * time.Second) in deleteFile).

Contributing

  • Fork the repository and submit pull requests.
  • Report issues or suggest features via GitHub Issues.
  • Enhance features:
    • Add --dry-run for delete and cleanup-blocks.
    • Support multiple file deletions: ./s3stor delete file1.txt file2.txt.
    • Parallelize block cleanup for large buckets.
    • Add man page: man s3stor.

License

MIT License. See LICENSE for details.
