Create a url-frontier Frontier implementation #80

@anjackson

Description

Building on the experience with the Redis-based frontier, it should be possible to build a frontier based on url-frontier. The rough outline of the approach is in this discussion: crawler-commons/url-frontier#12 (reply in thread)

The main problem is that H3 relies on queue prioritization to make sure pre-requisites are crawled, in contrast to many other crawlers that handle things like DNS or robots.txt outside of the crawl frontier. When H3 finds a pre-requisite, it pushes the current URL back into the queue and enqueues the pre-requisite so that it will be dequeued first. This can be done with url-frontier, although I think it's taking advantage of a grey area in the API spec, and so it's not clear whether the behaviour would be immediately portable to other implementations.
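To make the prioritization trick concrete, here is a minimal, self-contained sketch of the dequeue/re-enqueue dance described above. All names here (`FrontierSketch`, `QueuedUrl`, `put`, `next`) are hypothetical stand-ins, not the url-frontier API; a `PriorityQueue` stands in for a per-host frontier queue:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A URL plus a priority value; lower value = dequeued first.
class QueuedUrl {
    final String url;
    final int priority;

    QueuedUrl(String url, int priority) {
        this.url = url;
        this.priority = priority;
    }
}

// Hypothetical stand-in for a single frontier queue. The real
// url-frontier service would sit behind gRPC PutURLs/GetURLs calls.
class FrontierSketch {
    private final PriorityQueue<QueuedUrl> queue =
            new PriorityQueue<>(Comparator.comparingInt(q -> q.priority));

    void put(String url, int priority) {
        queue.add(new QueuedUrl(url, priority));
    }

    String next() {
        QueuedUrl q = queue.poll();
        return q == null ? null : q.url;
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch();
        frontier.put("http://example.org/page", 100);

        // Dequeue the page, then discover it needs robots.txt first:
        String current = frontier.next();
        frontier.put(current, 100);                       // push the current URL back
        frontier.put("http://example.org/robots.txt", 0); // pre-requisite jumps the queue

        System.out.println(frontier.next()); // the pre-requisite comes out first
        System.out.println(frontier.next()); // then the original URL
    }
}
```

The grey area is whether a url-frontier implementation guarantees that a URL put with a higher priority (or earlier crawl date) is always returned before one that was already queued; the sketch above simply assumes it is.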

Notes:

  • I think it should be possible to code this so that the URL Frontier can be either embedded directly or accessed over gRPC (at least once Publish service jar on Maven crawler-commons/url-frontier#42 is implemented). Having a fully local option might aid uptake for institutions that don't like running this kind of thing as a service suite.
  • The new Crawl-ID field could be used to share the instance with multiple jobs, and even allow clients to shift URLs between crawlers.
  • As with the Redis implementation, to fully and transparently integrate into Heritrix as-is, it is necessary to store the (e.g. Kryo-)serialised CrawlURI in its entirety. This is pretty horrible and not really in the spirit of using an external frontier, but it is likely an unavoidable arrangement, at least for now.
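The "store the whole CrawlURI" point amounts to keeping an opaque byte blob alongside each queued URL, so the frontier can hand back a fully-hydrated object on dequeue. A rough sketch of the round-trip, with `CrawlUriStub` as a hypothetical stand-in for Heritrix's CrawlURI and plain JDK serialization standing in for Kryo:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for Heritrix's CrawlURI, which carries much more
// state (via-chain, fetch status, politeness data, ...) than just the URL.
class CrawlUriStub implements Serializable {
    final String uri;
    final int fetchAttempts;

    CrawlUriStub(String uri, int fetchAttempts) {
        this.uri = uri;
        this.fetchAttempts = fetchAttempts;
    }
}

// Serialise the whole object to a byte blob that the external frontier
// stores verbatim, and restore it on dequeue. The real code would use
// Kryo rather than JDK serialization.
class BlobCodec {
    static byte[] toBytes(CrawlUriStub c) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(c);
        }
        return bos.toByteArray();
    }

    static CrawlUriStub fromBytes(byte[] blob) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (CrawlUriStub) ois.readObject();
        }
    }
}
```

This is what makes the arrangement "not really in the spirit" of an external frontier: the frontier only sees an opaque blob it cannot inspect or reprioritise on its own.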

See also crawler-commons/url-frontier#45
