Add seed file support to Browsertrix backend (#2710) #2760
Closed
Fixes #2673
Changes in this PR:
- Add a `file_uploads.py` module and corresponding `/files` API prefix with methods/endpoints for uploading, GETing, and deleting seed files (can be extended to other types of files moving forward).
- Support `CrawlConfig.config.seedFileId` on POST and PATCH endpoints. This `seedFileId` is replaced by a presigned URL when passed to the crawler by the operator.
- Calculate `firstSeed` and `seedCount` for seed files and store them in the database. These values are copied into the workflow and crawl documents when they are created.
- Store `firstSeed` and `seedCount` for other workflows as well, with a migration added to backfill data. This maintains consistency and fixes some of the pymongo aggregations that previously assumed all workflows would have at least one `Seed` object in `CrawlConfig.seeds`.
- [...] `/jobs` API endpoints, but retrying this type of regularly scheduled background job is not supported, as we don't want to accidentally create multiple competing scheduled jobs.
- Add a `min_seed_file_crawler_image` value to the Helm chart that, if set, is checked before creating a crawl from a workflow. If a workflow cannot be run, return the detail of the exception in `CrawlConfigAddedResponse.errorDetail` so that we can display the reason in the frontend.
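
For illustration only, here is a minimal sketch of how `firstSeed` and `seedCount` could be derived from an uploaded seed file. It is not the code in this PR: the function name `summarize_seed_file` is hypothetical, and the sketch assumes a plain-text file with one URL per line where blank lines are ignored.

```python
from typing import Optional, Tuple


def summarize_seed_file(text: str) -> Tuple[Optional[str], int]:
    """Return (first_seed, seed_count) for a newline-delimited seed list.

    Assumption (not confirmed by the PR): one URL per line, and blank or
    whitespace-only lines do not count as seeds.
    """
    first_seed: Optional[str] = None
    seed_count = 0
    for line in text.splitlines():
        url = line.strip()
        if not url:
            # Skip empty lines so they do not inflate the count
            continue
        if first_seed is None:
            # Remember the first non-empty line as the first seed
            first_seed = url
        seed_count += 1
    return first_seed, seed_count


# Hypothetical seed file contents, used only to exercise the function
sample = "https://example.com/\n\nhttps://example.org/page\n"
first_seed, seed_count = summarize_seed_file(sample)
print(first_seed, seed_count)
```

Computing these two values once at upload time, rather than on every read, means workflow and crawl documents can copy them without the seed file being fetched from storage again.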