Closed
Removing the (accidentally very low) field length limit of 200 characters in favour of no limit at all triggered an "immense term" exception for @ruebot on the field pdf_pdfa_errors when its content exceeded 32KB. As discussed on the IIPC #webarchive-discovery Slack, a default maximum field length should be re-introduced, and a note (warning?) should be logged whenever the limit is hit during indexing.
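To make the proposal concrete, here is a minimal sketch of truncate-and-warn, not the project's actual implementation. The class name `FieldLimiter`, the method `limit`, and the constant `DEFAULT_MAX_CHARS` are hypothetical names chosen for illustration, and `java.util.logging` stands in for whatever logging framework the indexer really uses. The limit is counted in Java characters; 4096 characters encode to at most 12288 bytes of UTF-8, which stays well below the 32KB at which the exception occurred.

```java
import java.util.logging.Logger;

// Hypothetical helper, for illustration only.
final class FieldLimiter {
    private static final Logger LOG = Logger.getLogger(FieldLimiter.class.getName());

    // Assumed default: the 4K best guess, counted in characters.
    static final int DEFAULT_MAX_CHARS = 4096;

    /** Returns value unchanged if it fits, otherwise a truncated copy, logging a warning. */
    static String limit(String fieldName, String value, int maxChars) {
        if (value == null || value.length() <= maxChars) {
            return value;
        }
        int end = maxChars;
        // Avoid cutting a surrogate pair in half, which would leave invalid text.
        if (end > 0 && Character.isHighSurrogate(value.charAt(end - 1))) {
            end--;
        }
        LOG.warning(String.format("Field '%s' truncated from %d to %d characters",
                fieldName, value.length(), end));
        return value.substring(0, end);
    }
}
```

A caller would pass every field value through `FieldLimiter.limit(name, value, FieldLimiter.DEFAULT_MAX_CHARS)` just before it is added to the document, so the warning identifies which field hit the limit.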
Ideally the limits should be fine-tuned for the different fields, but that is quite a lot of work. An easier approach is to set a best-guess limit (4K? This was used several years back) and index a few thousand random WARC files with the previously mentioned logging turned on. Hopefully this will catch the fields that need to be adjusted.
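The trial run described above could be summarised with a small tally instead of reading raw log lines. The following is a sketch under assumptions: the class `FieldLimitReport` and its methods are hypothetical, and it only records, per field, how many values exceeded the limit and the longest length seen, which is the information needed to decide which fields deserve their own limit.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical tally of over-limit fields, for illustration only.
final class FieldLimitReport {
    // field name -> { number of over-limit values, longest length seen }
    private final Map<String, long[]> stats = new TreeMap<>();
    private final int limit;

    FieldLimitReport(int limit) {
        this.limit = limit;
    }

    /** Record one field value's length; values within the limit are ignored. */
    void observe(String field, int length) {
        if (length <= limit) {
            return;
        }
        long[] s = stats.computeIfAbsent(field, k -> new long[2]);
        s[0]++;
        s[1] = Math.max(s[1], length);
    }

    long overLimitCount(String field) {
        long[] s = stats.get(field);
        return s == null ? 0 : s[0];
    }

    long longestSeen(String field) {
        long[] s = stats.get(field);
        return s == null ? 0 : s[1];
    }

    /** One line per offending field, sorted by field name. */
    String summary() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, long[]> e : stats.entrySet()) {
            sb.append(e.getKey()).append(": ")
              .append(e.getValue()[0]).append(" over limit, longest ")
              .append(e.getValue()[1]).append('\n');
        }
        return sb.toString();
    }
}
```

Fields that show up often in the summary, or with lengths far above the limit, are the candidates for a field-specific setting.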