Lists consisting of mostly links get removed #747

@Liamolucko

Expected Behavior

Postlight Parser should preserve all the actual content of the page.

Current Behavior

Postlight Parser removes bulleted and numbered lists that consist mostly of links.

Steps to Reproduce

Run Postlight Parser on https://faultlore.com/blah/defaults-affect-inference. The bulleted list shortly after the 'Some Wild Shit Swift Does' heading gets removed.

Picture of the list in question:

[Image: Screenshot 2023-08-08 at 9.10.09 pm]

Detailed Description

This is the code that causes the problem:

    const density = linkDensity($node);
    // Too high of link density, is probably a menu or
    // something similar.
    // console.log(weight, density, contentLength)
    if (weight < 25 && density > 0.2 && contentLength > 75) {
      $node.remove();
      return;
    }
    // Too high of a link density, despite the score being
    // high.
    if (weight >= 25 && density > 0.5) {
      // Don't remove the node if it's a list and the
      // previous sibling starts with a colon though. That
      // means it's probably content.
      const tagName = $node.get(0).tagName.toLowerCase();
      const nodeIsList = tagName === 'ol' || tagName === 'ul';
      if (nodeIsList) {
        const previousNode = $node.prev();
        if (
          previousNode &&
          normalizeSpaces(previousNode.text()).slice(-1) === ':'
        ) {
          return;
        }
      }
      $node.remove();
      return;
    }

The check is meant to strip menus and similar navigation elements.

Possible Solution

The easiest solution would be to apply the special case from the weight >= 25 branch of the code above to the weight < 25 branch as well, so that any list following a paragraph that ends in a colon is kept. (The lists that currently get removed fall into the weight < 25 branch, so the existing special case never gets a chance to apply to them.)
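To make that first option concrete, here is a minimal sketch of the decision logic with the colon special case applied to both branches. It works on plain values rather than a cheerio node so it can run on its own. The names isListAfterColon and shouldRemove are hypothetical and not part of Postlight Parser, and the whitespace handling only stands in for the parser's normalizeSpaces. It is not necessarily how the linked branch implements the fix.

```javascript
// Hypothetical helper (not part of Postlight Parser): true when the node is a
// list and the text of its previous sibling ends in a colon.
function isListAfterColon(tagName, previousText) {
  const tag = tagName.toLowerCase();
  const nodeIsList = tag === 'ol' || tag === 'ul';
  // Collapse whitespace runs and trim; an assumed stand-in for normalizeSpaces.
  const normalized = previousText.replace(/\s{2,}/g, ' ').trim();
  return nodeIsList && normalized.slice(-1) === ':';
}

// Hypothetical decision function mirroring the two link-density checks quoted
// above, with the list-after-colon exception applied to both of them.
function shouldRemove({ weight, density, contentLength, tagName, previousText }) {
  const protectedList = isListAfterColon(tagName, previousText);

  // Low score and fairly link-heavy: probably a menu, unless it is a
  // list introduced by a colon.
  if (weight < 25 && density > 0.2 && contentLength > 75) {
    return !protectedList;
  }

  // High score but very link-heavy: same exception as in the original code.
  if (weight >= 25 && density > 0.5) {
    return !protectedList;
  }

  return false;
}
```

With this shape, a low-weight ul whose previous sibling ends in ':' is kept, while a low-weight div or a list without a preceding colon is still removed.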

Another solution I thought of would be to look at either the average or the maximum length of the links in a list (or table / div / anything else the tag-cleaning code gets applied to), and keep the node if that length is above some threshold. In theory that should differentiate between the short links found in menus and the longer, sentence-length links found in content. Looking again at the example I provided, though, those links are actually quite short, so this might not work as well as I'd hoped.
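For illustration, the link-length idea could be sketched roughly as follows. The function names and the threshold of 40 characters are assumptions made up for this example, not values from Postlight Parser, and the input is just an array of link texts rather than a cheerio node.

```javascript
// Hypothetical helper: average and maximum length of the given link texts.
function linkLengthStats(linkTexts) {
  const lengths = linkTexts.map((text) => text.trim().length);
  if (lengths.length === 0) {
    return { average: 0, max: 0 };
  }
  const total = lengths.reduce((sum, length) => sum + length, 0);
  return { average: total / lengths.length, max: Math.max(...lengths) };
}

// Assumed threshold, chosen only for this sketch; a real value would need tuning.
const MIN_AVERAGE_LINK_LENGTH = 40;

// Hypothetical check: treat the links as content when they are long on average.
function looksLikeContentLinks(linkTexts, threshold = MIN_AVERAGE_LINK_LENGTH) {
  return linkLengthStats(linkTexts).average >= threshold;
}

// Menu-style links are short, so they fall below the threshold.
console.log(looksLikeContentLinks(['Home', 'About', 'Contact']));
// Sentence-length links clear it.
console.log(looksLikeContentLinks([
  'How default type parameters change what the compiler infers',
]));
```

A content list made of short links would also fall below the threshold, which is exactly the weakness noted above.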

So yeah, probably that first solution. I've already implemented it at https://github.com/Liamolucko/postlight-parser/tree/fix-link-lists and confirmed that it works.
