Add --precheck-connections option #6

Open
wants to merge 1 commit into
base: main

@Mr0grog Mr0grog commented Aug 18, 2025

Passing `--precheck-connections` when generating seeds will cause one URL from each hostname in the seed set to be requested and checked for connection errors before outputting seeds. URLs at hostnames that appear to be unreachable (right now, that means DNS resolution failures or connection timeouts) will be stripped from the seed list. A `precheck.log.json` file will be output along with the seeds; it records which hostnames were unreachable so other tools can later record that in our DB or wherever.
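
For reviewers who want a picture of the flow described above, here is a minimal sketch in Python. It is not the code in this PR: the function names (`check_url`, `precheck_seeds`, `write_precheck_log`), the error labels, and the shape of the JSON log are all assumptions made for illustration, and a plain TCP connect stands in for the actual request.

```python
import json
import socket
from urllib.parse import urlsplit


def check_url(url, timeout=10.0):
    """Return None if the host accepts a connection, else an error label.

    Hypothetical helper: only DNS failures and connection timeouts count
    as "unreachable", mirroring the two cases named in the description.
    """
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return None
    except socket.gaierror:
        return "dns_failure"
    except socket.timeout:
        return "connection_timeout"
    except OSError:
        # Other errors (e.g. connection refused) are left for the crawl itself.
        return None


def precheck_seeds(urls, check=check_url):
    """Check one URL per hostname and drop every URL on unreachable hosts."""
    first_url = {}
    for url in urls:
        host = urlsplit(url).hostname
        if host and host not in first_url:
            first_url[host] = url

    unreachable = {}
    for host, url in first_url.items():
        error = check(url)
        if error:
            unreachable[host] = {"url": url, "error": error}

    kept = [u for u in urls if urlsplit(u).hostname not in unreachable]
    return kept, unreachable


def write_precheck_log(unreachable, path="precheck.log.json"):
    """Write the unreachable hosts to a JSON file (assumed log layout)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"unreachable": unreachable}, f, indent=2)
```

Passing the checker in as a parameter keeps the grouping and filtering logic testable without touching the network; only one connection attempt is made per hostname no matter how many seed URLs share it.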

This is mainly meant to work around problems like the one we had this weekend, where a whole bunch of epa.gov subdomains went offline, causing our crawls to slow down so much that they got cancelled for running too long. This keeps such URLs out of the crawl in the first place. It also gives us a convenient place to start automatically recording connection failures in the DB (we added the ability to record them near the start of 2025, but have only been putting them in manually, based on logs, in exceptional situations).

Ideally, browsertrix-crawler would do something like this as it crawls (so we don't waste a lot of time double-checking so many URLs before we even start the crawl proper), but it's definitely an edge case for them. Not sure if they'd want to bake it in (see webrecorder/browsertrix-crawler#879).

@Mr0grog Mr0grog moved this from Inbox to In Progress in Web Monitoring Aug 18, 2025