Add `--precheck-connections` option #6
Passing `--precheck-connections` when generating seeds will cause one URL from each hostname in the seed set to be requested and checked for connection errors before outputting seeds. URLs at hostnames that appear to be unreachable (right now, that means DNS resolution failures or timing out on the connection) will be stripped from the seed list. A `precheck.log.json` file will be output along with the seeds, containing information about which hostnames were unreachable, so other tools can later record it in our DB or wherever.

This is mainly meant to work around problems like we had this weekend, where a whole bunch of epa.gov subdomains went offline, slowing our crawls down so much that they got cancelled for running too long. Prechecking avoids crawling such URLs in the first place. It also gives us a convenient place to start automatically recording connection failures in the DB (we added the ability to record them near the start of 2025, but have been entering them manually based on logs in exceptional situations).
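For context, here's a minimal sketch of roughly what the precheck does, assuming a Python implementation. It simplifies the check to a DNS lookup plus a TCP-level connect per hostname rather than a full HTTP request, and the function names, timeout value, and `precheck.log.json` shape are all illustrative guesses, not the actual code:

```python
# Illustrative sketch only; names, timeout, and log format are assumptions.
import json
import socket
from urllib.parse import urlparse

CONNECT_TIMEOUT = 10  # seconds; assumed value, not specified in this PR


def check_hostname(hostname: str, port: int = 443) -> str | None:
    """Return an error label if the hostname looks unreachable, else None."""
    try:
        socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        return "dns_failure"
    try:
        # Port 443 assumes HTTPS seeds; a real implementation would derive
        # the port from the URL's scheme.
        with socket.create_connection((hostname, port), timeout=CONNECT_TIMEOUT):
            return None
    except socket.timeout:
        return "connection_timeout"
    except OSError:
        # Other connection errors; the PR only names DNS failures and
        # timeouts, so handling these is an assumption here.
        return "connection_error"


def precheck_seeds(urls: list[str]) -> tuple[list[str], dict[str, str]]:
    """Check one URL per hostname; return reachable URLs and failures."""
    failures: dict[str, str] = {}
    checked: set[str] = set()
    for url in urls:
        host = urlparse(url).hostname
        if host and host not in checked:
            checked.add(host)
            error = check_hostname(host)
            if error:
                failures[host] = error
    reachable = [u for u in urls if urlparse(u).hostname not in failures]
    return reachable, failures


if __name__ == "__main__":
    seeds = ["https://www.epa.gov/page", "https://nonexistent.example.invalid/x"]
    ok, failed = precheck_seeds(seeds)
    # A guess at what precheck.log.json might contain.
    with open("precheck.log.json", "w") as f:
        json.dump({"unreachable_hostnames": failed}, f, indent=2)
    print("\n".join(ok))
```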
Ideally browsertrix-crawler might do something like this as it crawls (so we don't waste a lot of time double-checking so many URLs before we even start the crawl proper), but it's definitely an edge case for them. Not sure if they'd want to bake it in (see webrecorder/browsertrix-crawler#879).