Create a url-frontier Frontier implementation #80

@anjackson

Description

Building on the experience with the Redis-based frontier, it should be possible to build a frontier based on url-frontier. The rough outline of the approach is in this discussion: crawler-commons/url-frontier#12 (reply in thread)

The main problem is that H3 relies on queue prioritization to make sure pre-requisites are crawled, in contrast to many other crawlers that handle things like DNS or robots.txt outside of the crawl frontier. When H3 finds a pre-requisite, it pushes the current URL back into the queue and enqueues the pre-requisite so that it will be dequeued first. This can be done with url-frontier, although I think it's taking advantage of a grey area in the API spec, and so it's not clear whether the behaviour would be immediately portable to other implementations.
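To make the prioritization trick concrete, here is a minimal, self-contained sketch of the dequeue/re-enqueue dance described above. All names here (`FrontierSketch`, `QueuedUrl`, `put`, `next`) are hypothetical stand-ins, not the url-frontier API; a `PriorityQueue` stands in for a per-host frontier queue:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A URL plus a priority value; lower value = dequeued first.
class QueuedUrl {
    final String url;
    final int priority;

    QueuedUrl(String url, int priority) {
        this.url = url;
        this.priority = priority;
    }
}

// Hypothetical stand-in for a single frontier queue. The real
// url-frontier service would sit behind gRPC PutURLs/GetURLs calls.
class FrontierSketch {
    private final PriorityQueue<QueuedUrl> queue =
            new PriorityQueue<>(Comparator.comparingInt(q -> q.priority));

    void put(String url, int priority) {
        queue.add(new QueuedUrl(url, priority));
    }

    String next() {
        QueuedUrl q = queue.poll();
        return q == null ? null : q.url;
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch();
        frontier.put("http://example.org/page", 100);

        // Dequeue the page, then discover it needs robots.txt first:
        String current = frontier.next();
        frontier.put(current, 100);                       // push the current URL back
        frontier.put("http://example.org/robots.txt", 0); // pre-requisite jumps the queue

        System.out.println(frontier.next()); // the pre-requisite comes out first
        System.out.println(frontier.next()); // then the original URL
    }
}
```

The grey area is whether a url-frontier implementation guarantees that a URL put with a higher priority (or earlier crawl date) is always returned before one that was already queued; the sketch above simply assumes it is.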

Notes:

  • I think it should be possible to code this so that the URL Frontier can be either embedded directly or accessed over gRPC (at least once Publish service jar on Maven crawler-commons/url-frontier#42 is implemented). Having a fully local option might aid uptake for institutions that don't like running this kind of thing as a service suite.
  • The new Crawl-ID field could be used to share the instance with multiple jobs, and even allow clients to shift URLs between crawlers.
  • As with the Redis implementation, to fully and transparently integrate into Heritrix as-is, it is necessary to store the (e.g. Kryo-)serialised CrawlURI in its entirety. This is pretty horrible and not really in the spirit of using an external frontier, but it is likely an unavoidable arrangement, at least for now.
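The "store the whole CrawlURI" point amounts to keeping an opaque byte blob alongside each queued URL, so the frontier can hand back a fully-hydrated object on dequeue. A rough sketch of the round-trip, with `CrawlUriStub` as a hypothetical stand-in for Heritrix's CrawlURI and plain JDK serialization standing in for Kryo:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for Heritrix's CrawlURI, which carries much more
// state (via-chain, fetch status, politeness data, ...) than just the URL.
class CrawlUriStub implements Serializable {
    final String uri;
    final int fetchAttempts;

    CrawlUriStub(String uri, int fetchAttempts) {
        this.uri = uri;
        this.fetchAttempts = fetchAttempts;
    }
}

// Serialise the whole object to a byte blob that the external frontier
// stores verbatim, and restore it on dequeue. The real code would use
// Kryo rather than JDK serialization.
class BlobCodec {
    static byte[] toBytes(CrawlUriStub c) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(c);
        }
        return bos.toByteArray();
    }

    static CrawlUriStub fromBytes(byte[] blob) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (CrawlUriStub) ois.readObject();
        }
    }
}
```

This is what makes the arrangement "not really in the spirit" of an external frontier: the frontier only sees an opaque blob it cannot inspect or reprioritise on its own.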

See also crawler-commons/url-frontier#45
