Heuristic fix of charset issues

Servers mess up, and at the Royal Danish Library it is not uncommon to encounter pages where the charset is declared as one thing in the HTTP headers and another in the HTML, while the actual bytes encode non-ASCII characters in yet a third way. This is very visible for Danish pages, as the characters `æ ø å` are common in Danish spelling.

The most common problems we see are "UTF-8 read as ISO-8859-1" and the other way around. At least for Danish they are reasonably easy to guess, as the end result is character combinations that are "never" used in real text. I would be surprised if a solution wasn't already available somewhere on the net. We should perform such guessing during indexing and correct the problem.
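To make the idea concrete, here is a minimal Python sketch of such a guess. It is not an existing implementation. The function names `fix_utf8_read_as_latin1` and `decode_bytes` are hypothetical, and the marker list only covers `æ ø å` and their uppercase forms. The first function repairs already-decoded text. The second handles the opposite direction, which can only be done reliably while the raw bytes are still available, since decoding ISO-8859-1 bytes as UTF-8 with replacement loses the original characters.

```python
import re

# UTF-8 bytes for æ ø å Æ Ø Å all start with 0xC3, which ISO-8859-1 shows as "Ã".
# The second byte shows up as one of: ¦ (æ), ¸ (ø), ¥ (å), or the C1 control
# characters U+0086 (Æ), U+0098 (Ø), U+0085 (Å).
# Assumption: these pairs "never" occur in correctly decoded Danish text.
_MOJIBAKE = re.compile("\u00c3[\u00a6\u00b8\u00a5\u0086\u0098\u0085]")


def fix_utf8_read_as_latin1(text: str) -> str:
    """Repair text that was UTF-8 but got decoded as ISO-8859-1 (hypothetical helper)."""
    if not _MOJIBAKE.search(text):
        return text  # no suspicious pairs, leave the text alone
    try:
        # Undo the wrong decode, then decode the bytes as UTF-8.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Mixed or otherwise unexpected content: be conservative and change nothing.
        return text


def decode_bytes(raw: bytes) -> str:
    """Guess the charset of raw bytes: strict UTF-8 first, else ISO-8859-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")
```

The sketch deliberately leaves text untouched whenever the round trip fails, so a page that mixes correct and broken characters stays as it was instead of being damaged further. Pages where the bytes were read as windows-1252 rather than ISO-8859-1 would need a different marker list.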

We have two fields in Solr: `content`, which is the raw text content, and `text`, which is the catch-all search field and contains `content` along with the content of other fields. The fix could change this so that `content` keeps the raw content, faulty character encodings and all, while the corrected content is added to `text` for proper search. Alternatively, both fields could contain the corrected content. I don't know what's best.

Ping @thomasegense as I know he's getting complaints about this.

Heuristic fix of charset issues #301