Add `--precheck-connections` option #6

Mr0grog · 2025-08-18T18:52:00Z

Passing `--precheck-connections` when generating seeds will cause one URL from each hostname in the seed set to be requested and checked for connection errors before outputting seeds. URLs at hostnames that appear to be unreachable (right now, that means DNS resolution failures or connection timeouts) will be stripped from the seed list. A `precheck.log.json` file will be output along with the seeds. It records which hostnames were unreachable so other tools can later record that in our DB or wherever.
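To make the behavior described above concrete, here is a minimal Python sketch of the idea. It is not the code in this PR: the function names (`check_host`, `precheck_seeds`), the fixed port, and the returned error strings are all assumptions for illustration, and it only opens a TCP connection rather than making a full request for the URL.

```python
import socket
from urllib.parse import urlsplit


def check_host(hostname, port=443, timeout=10.0):
    """Return None if the host looks reachable, else a short error label.

    Hypothetical helper: only DNS failures and connection timeouts count
    as "unreachable", mirroring the two cases named in the description.
    """
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return None
    except socket.gaierror:
        return "dns_failure"
    except socket.timeout:
        return "connection_timeout"
    except OSError:
        # Other errors (e.g. connection refused) are left for the crawl itself.
        return None


def precheck_seeds(urls, check=check_host):
    """Check one URL per hostname; drop every URL at an unreachable hostname.

    Returns (kept_urls, unreachable), where unreachable maps hostname to
    the error label reported by `check`.
    """
    checked = set()
    unreachable = {}
    kept = []
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            # Not something we can check; pass it through untouched.
            kept.append(url)
            continue
        if host not in checked:
            checked.add(host)
            error = check(host)
            if error:
                unreachable[host] = error
        if host not in unreachable:
            kept.append(url)
    return kept, unreachable
```

The `unreachable` mapping is the kind of information that could be serialized with `json.dump` into a log file for other tools to consume. The actual schema of `precheck.log.json` is not shown in this description, so none is assumed here.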

This is mainly meant to work around problems like we had this weekend, where a whole bunch of epa.gov subdomains went offline, causing our crawls to slow down so much that they got cancelled for running too long. This avoids checking such URLs in the first place. It also gives us a convenient place to start automatically recording connection failures in the DB (we added the ability to record them near the start of 2025, but have been putting them in manually based on logs in exceptional situations).

Ideally browsertrix-crawler might do something like this as it crawls (so we don't waste a lot of time double-checking so many URLs before we even start the crawl proper), but it's definitely an edge case for them. Not sure if they'd want to bake it in (see webrecorder/browsertrix-crawler#879).

Mr0grog added this to Web Monitoring Aug 18, 2025

github-project-automation bot moved this to Inbox in Web Monitoring Aug 18, 2025

Mr0grog moved this from Inbox to In Progress in Web Monitoring Aug 18, 2025
