Examine field lengths #152

@tokee

Removing the (accidentally very low) field length limit of 200 characters in favour of no limit at all triggered an "immense term" exception for @ruebot when the field pdf_pdfa_errors exceeded 32KB. As discussed on the IIPC #webarchive-discovery Slack, a default maximum field length should be re-introduced, and a note (warning?) should be logged whenever a field reaches that limit during indexing.
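
To make the proposal concrete, here is a minimal sketch of truncating a field value and logging a warning when the limit is hit. `FieldLengthLimiter`, `limit` and `DEFAULT_MAX_CHARS` are hypothetical names for illustration, not existing webarchive-discovery code, and the 4096 default is only the best guess floated in this issue. The limit is counted in Java chars, not bytes.

```java
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of webarchive-discovery.
final class FieldLengthLimiter {
    private static final Logger LOG =
            Logger.getLogger(FieldLengthLimiter.class.getName());

    // Assumed default: the "4K?" best guess, counted in chars.
    static final int DEFAULT_MAX_CHARS = 4096;

    private FieldLengthLimiter() {}

    // Returns value unchanged if it fits, otherwise a truncated copy.
    // A warning naming the field is logged whenever truncation happens.
    static String limit(String fieldName, String value, int maxChars) {
        if (value == null || value.length() <= maxChars) {
            return value;
        }
        int end = maxChars;
        // Do not cut a surrogate pair in half at the truncation point.
        if (end > 0 && Character.isHighSurrogate(value.charAt(end - 1))) {
            end--;
        }
        LOG.warning(String.format(
                "Field '%s' truncated from %d to %d chars (limit %d)",
                fieldName, value.length(), end, maxChars));
        return value.substring(0, end);
    }
}
```

Calling `FieldLengthLimiter.limit("pdf_pdfa_errors", value, FieldLengthLimiter.DEFAULT_MAX_CHARS)` before adding the field to the document would keep the value well under 32KB, since a Java char encodes to at most 3 bytes in UTF-8.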

Ideally the limits should be fine-tuned for the different fields, but that is quite a lot of work. An easier process is to set a best-guess limit (4K? This was used several years back) and index a few thousand random WARC files with the previously mentioned logging turned on. Hopefully this will catch the fields that need to be adjusted.
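
For the trial run, a per-field tally may be easier to read than raw log lines. The sketch below is one way to collect it; `FieldLimitStats` and its methods are hypothetical names, and the class is deliberately not thread-safe, so a multi-threaded indexer would need synchronization around it.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of webarchive-discovery.
// Not thread-safe: guard with synchronization if indexing is multi-threaded.
final class FieldLimitStats {
    // field name -> { number of over-limit values, longest length seen }
    private final Map<String, long[]> stats = new TreeMap<>();

    // Record a value's length; values within the limit are ignored.
    void record(String fieldName, int observedLength, int maxChars) {
        if (observedLength <= maxChars) {
            return;
        }
        long[] s = stats.computeIfAbsent(fieldName, k -> new long[2]);
        s[0]++;
        s[1] = Math.max(s[1], observedLength);
    }

    // How many values of this field exceeded the limit.
    long hits(String fieldName) {
        long[] s = stats.get(fieldName);
        return s == null ? 0 : s[0];
    }

    // Longest over-limit length seen for this field, or 0 if none.
    long longest(String fieldName) {
        long[] s = stats.get(fieldName);
        return s == null ? 0 : s[1];
    }

    // One line per affected field, sorted by field name.
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, long[]> e : stats.entrySet()) {
            sb.append(e.getKey())
              .append(": hits=").append(e.getValue()[0])
              .append(", longest=").append(e.getValue()[1])
              .append('\n');
        }
        return sb.toString();
    }
}
```

After indexing the sample WARC files, fields that show many hits or very long values in `report()` are the candidates for an individually tuned limit.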
