config defaults for url_norm and PDF validation

The field `url_norm` is essential for looking up URLs entered by humans, but it is disabled by default in `reference.conf`, and enabling it is buried as a side effect of enabling `warc.index.extract.linked.normalise`. This option should default to `true` and have a dedicated entry, such as `warc.index.extract.normalise_url`, which could also provide the default value for `warc.index.extract.linked.normalise`.

Besides plain search, graph queries are another case for enabling normalisation by default for the Solr fields `url_norm`, `links` and `links_images`. Normalising raises the number of false positives, but this should be weighed against the large number of false negatives without normalisation, caused by `http`/`https` and `www.foo.com`/`foo.com` differences.
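To illustrate the false-negative problem, here is a minimal Python sketch of a simplified normaliser. It is not the normalisation that webarchive-discovery actually performs; the function name `normalise_url` and its rules (drop the scheme, drop a leading `www.`, lowercase the host, strip the trailing slash, ignore the port) are assumptions chosen only to show why un-normalised links fail to match.

```python
from urllib.parse import urlsplit


def normalise_url(url: str) -> str:
    """Hypothetical, simplified normaliser (NOT the project's real rules).

    Drops the scheme and port, lowercases the host, removes a leading
    'www.' and strips trailing slashes from the path.
    """
    url = url.strip()
    # Human-entered URLs often lack a scheme; add one so urlsplit finds the host.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


# Four spellings of what a human would consider the same page.
variants = [
    "http://foo.com/page",
    "https://www.foo.com/page",
    "www.foo.com/page/",
    "HTTP://WWW.FOO.COM/page",
]

# Without normalisation an exact-match lookup sees four distinct values.
raw_keys = set(variants)
# With normalisation they collapse to a single key.
norm_keys = {normalise_url(u) for u in variants}

print(len(raw_keys), len(norm_keys))
```

In a link graph, each un-normalised spelling would become a separate node, so edges between pages are lost. The trade-off is the one described above: collapsing variants can also merge pages that genuinely differ between `http` and `https`.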

Another property is `warc.index.extract.content.extractApachePreflightErrors`, which validates PDFs and adds validation errors to the index. It is turned on by default. This is a heavy indexing step and was the primary cause of timeouts for web archive indexing at the Royal Danish Library until it was turned off. We recommend that it be turned off by default.

config defaults for url_norm and PDF validation #158