Added new documentation and fixed some typos #14

Merged
merged 4 commits into from
Apr 22, 2025
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "cc-downloader"
-version = "0.6.0"
+version = "0.6.1"
edition = "2024"
authors = ["Pedro Ortiz Suarez <pedro@commoncrawl.org>"]
description = "A polite and user-friendly downloader for Common Crawl data."
85 changes: 81 additions & 4 deletions README.md
@@ -1,6 +1,6 @@
# CC-Downloader

-This is an experimental polite downloader for Common Crawl data writter in `rust`. This tool is intended for use outside of AWS.
+This is an experimental polite downloader for Common Crawl data written in `rust`. This tool is intended for use outside of AWS.

## Todo

@@ -10,9 +10,86 @@

## Installation

-For now, the only supported way to install the tool is to use `cargo`. For this you need to have `rust` installed. You can install `rust` by following the instructions on the [official website](https://www.rust-lang.org/tools/install).
+You can install `cc-downloader` via our pre-built binaries, or by compiling it from source.

-After installing `rust`, ``cc-downloader`` can be installed with the following command:
### Pre-built binaries

You can find our pre-built binaries on our [GitHub releases page](https://github.com/commoncrawl/cc-downloader/releases). They are available for `Linux`, `macOS`, and `Windows`, in `x86_64` and `aarch64` architectures (Windows is only supported in `x86_64`). In order to use them please select and download the correct binary for your system.

```bash
wget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].[COMPRESSION-FORMAT]
```

After downloading it, please verify the checksum of the binary. You can find the checksum file in the same location as the binary. The checksum is generated using `sha512sum`. You can verify it by running the following command:

```bash
wget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].sha512
sha512sum -c cc-downloader-[VERSION]-[ARCH]-[OS].sha512
```
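
On systems without `sha512sum` (a stock macOS install typically ships `shasum` instead), the same check can be done with `shasum -a 512 -c`. The following is a minimal sketch, not part of `cc-downloader`; the function name `verify_sha512` is hypothetical and it only assumes that one of the two tools is on your `PATH`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify a SHA-512 checksum file with whichever
# tool is available. Returns non-zero if the checksum does not match.
verify_sha512() {
  local checksum_file="$1"
  if command -v sha512sum >/dev/null 2>&1; then
    # GNU coreutils (most Linux distributions)
    sha512sum -c "$checksum_file"
  elif command -v shasum >/dev/null 2>&1; then
    # Perl-based tool shipped with macOS
    shasum -a 512 -c "$checksum_file"
  else
    echo "no SHA-512 tool found" >&2
    return 127
  fi
}
```

Run it from the folder that contains both the downloaded archive and its `.sha512` file, and only extract the archive if the function succeeds.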

If the checksum is valid, which will be indicated by an `OK` message, you can proceed to extract the binary. For `tar.gz` files you can use the following command:

```bash
tar -xzf cc-downloader-[VERSION]-[ARCH]-[OS].tar.gz
```

For `zip` files you can use the following command:

```bash
unzip cc-downloader-[VERSION]-[ARCH]-[OS].zip
```

This will extract the binary, the licenses and the readme file **in the current folder**. After extracting the binary, you can run it by executing the following command:

```bash
./cc-downloader
```

If you want to use the binary from anywhere, you can move it to a folder in your `PATH`. For more information on how to do this, please refer to the documentation of your operating system. For example, on `Linux` and `macOS` you can move it to `~/.bin`:

```bash
mv cc-downloader ~/.bin
```

And then add the following line to your `~/.bashrc` or `~/.zshrc` file:

```bash
export PATH=$PATH:~/.bin
```

then run the following command to apply the changes:

```bash
source ~/.bashrc
```

or

```bash
source ~/.zshrc
```

Then, you can run the binary from anywhere. If you want to update the binary, you can repeat the process and download the new version. Make sure to replace the binary that is stored in the folder that you added to your `PATH`. If you want to remove the binary, you can simply delete it from this folder.
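
Note that `mv cc-downloader ~/.bin` renames the binary to a file called `~/.bin` if that folder does not exist yet. The steps above can be sketched as a small function that creates the folder first and only extends `PATH` when the folder is not already listed. This is an illustrative sketch; the function name `install_to_user_bin` is hypothetical and the default folder `~/.bin` is taken from the example above:

```bash
#!/usr/bin/env bash
# Hypothetical helper: move a binary into a per-user folder and make
# sure that folder is on PATH for the current shell session.
install_to_user_bin() {
  local binary="$1"
  local bin_dir="${2:-$HOME/.bin}"

  # Create the target folder so that mv cannot rename the binary to it.
  mkdir -p "$bin_dir" || return 1
  mv "$binary" "$bin_dir/" || return 1

  # Append the folder to PATH only if it is not already there.
  case ":$PATH:" in
    *":$bin_dir:"*) ;;
    *) export PATH="$PATH:$bin_dir" ;;
  esac
}
```

The function has to be defined in (or sourced into) your interactive shell for the `PATH` change to take effect, and the change only lasts for that session. The `export` line in `~/.bashrc` or `~/.zshrc` is still needed to make it permanent.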

### Compiling from source

For this you need to have `rust` installed. You can install `rust` by following the instructions on the [official website](https://www.rust-lang.org/tools/install).

Or by running the following command:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Even if you already have a system-wide installation of `rust`, we recommend the linked installation method. A system-wide installation and a user installation can co-exist without any problems.

When compiling from source, please make sure you have the latest version of `rust` installed by running the following command:

```bash
rustup update
```

Now you can install the `cc-downloader` tool by running the following command:

```bash
cargo install cc-downloader
@@ -71,4 +148,4 @@ Options:

## Number of threads

-The number of threads can be set using the `-t` flag. The default value is 10. It is advised to use the default value to avoid being blocked by the server. If you make too many requests in a short period of time, you will satrt receiving `403` errors which are unrecoverable and cannot be retried by the downloader.
+The number of threads can be set using the `-t` flag. The default value is 10. It is advised to use the default value to avoid being blocked by the server. If you make too many requests in a short period of time, you will start receiving `403` errors which are unrecoverable and cannot be retried by the downloader.