Description
Expected Behavior
Postlight Parser should preserve all the actual content of the page.
Current Behavior
Postlight Parser will get rid of any bulleted / numbered lists which consist mostly of links.
Steps to Reproduce
Run Postlight Parser on https://faultlore.com/blah/defaults-affect-inference. The bulleted list a bit after the 'Some Wild Shit Swift Does' heading gets removed.
Picture of the list in question:

Detailed Description
This is the code that causes the problem:
parser/src/utils/dom/clean-tags.js
Lines 43 to 73 in e8ba7ec
const density = linkDensity($node);
// Too high of link density, is probably a menu or
// something similar.
// console.log(weight, density, contentLength)
if (weight < 25 && density > 0.2 && contentLength > 75) {
  $node.remove();
  return;
}
// Too high of a link density, despite the score being
// high.
if (weight >= 25 && density > 0.5) {
  // Don't remove the node if it's a list and the
  // previous sibling starts with a colon though. That
  // means it's probably content.
  const tagName = $node.get(0).tagName.toLowerCase();
  const nodeIsList = tagName === 'ol' || tagName === 'ul';
  if (nodeIsList) {
    const previousNode = $node.prev();
    if (
      previousNode &&
      normalizeSpaces(previousNode.text()).slice(-1) === ':'
    ) {
      return;
    }
  }
  $node.remove();
  return;
}
It's trying to get rid of menus and similar navigation blocks, but it also catches link-heavy lists that are actual content.
Possible Solution
The easiest solution would be to also apply the special case from the weight >= 25 branch of the code above to the weight < 25 branch, so that any list that comes after a paragraph ending in a colon is kept. (The lists that currently get removed fall into the weight < 25 branch, which is why the existing special case doesn't already save them.)
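A minimal sketch of that idea, written as a pure function so it can run without cheerio. The names shouldRemove and isListAfterColon are hypothetical, the normalizeSpaces here is a simplified stand-in for the parser's helper, and only the two link-density checks from the excerpt above are modelled; the rest of clean-tags.js is not.

```javascript
// Simplified stand-in for the parser's normalizeSpaces helper (assumption):
// collapse runs of whitespace and trim the ends.
function normalizeSpaces(text) {
  return text.replace(/\s{2,}/g, ' ').trim();
}

// Hypothetical helper: true when the node is an <ol>/<ul> whose previous
// sibling's text ends with a colon, i.e. the existing special case.
function isListAfterColon(tagName, previousText) {
  const tag = tagName.toLowerCase();
  const nodeIsList = tag === 'ol' || tag === 'ul';
  return nodeIsList && normalizeSpaces(previousText || '').slice(-1) === ':';
}

// Hypothetical decision function mirroring the two density checks, with the
// colon special case applied to both branches instead of only weight >= 25.
function shouldRemove({ tagName, weight, density, contentLength, previousText }) {
  const keepAsContent = isListAfterColon(tagName, previousText);

  // Low score and fairly high link density: probably a menu.
  if (weight < 25 && density > 0.2 && contentLength > 75) {
    return !keepAsContent;
  }

  // High score but very high link density.
  if (weight >= 25 && density > 0.5) {
    return !keepAsContent;
  }

  return false;
}

// A low-weight, link-heavy list introduced by a paragraph ending in a colon
// is now kept; the same list without that introduction is still removed.
console.log(
  shouldRemove({
    tagName: 'ul',
    weight: 0,
    density: 0.9,
    contentLength: 200,
    previousText: 'Some examples:',
  })
);
```

In the real code the same check would presumably sit in front of each $node.remove() call, using $node.prev().text() as the previous sibling's text.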
Another solution I thought of would be to look at either the average or maximum length of the links in a list (or table / div / anything else that the tag-cleaning code gets applied to), and keep the node if that length is above some threshold. In theory that should differentiate between the shorter links in menus and the longer, sentence-length links in content; but looking again at the example I provided, those links are actually quite short, so that might not work as well as I'd hoped.
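For illustration, the link-length heuristic could look roughly like the sketch below. Everything in it is hypothetical: the function names, the choice of average rather than maximum, and the 30-character threshold are assumptions and have not been tuned against real pages.

```javascript
// Hypothetical helper: average and maximum length of a node's link texts.
// linkTexts is an array of strings, e.g. the text of each <a> in the node.
function linkLengthStats(linkTexts) {
  const lengths = linkTexts.map(text => text.trim().length);
  if (lengths.length === 0) {
    return { average: 0, maximum: 0 };
  }
  const total = lengths.reduce((sum, n) => sum + n, 0);
  return { average: total / lengths.length, maximum: Math.max(...lengths) };
}

// Hypothetical check: treat the node as content when its links are long on
// average. The default threshold is an arbitrary placeholder.
function looksLikeContentLinks(linkTexts, minAverageLength = 30) {
  return linkLengthStats(linkTexts).average >= minAverageLength;
}

// Short, menu-like links versus one sentence-length link.
console.log(looksLikeContentLinks(['Home', 'About', 'Contact']));
console.log(
  looksLikeContentLinks(['A long, sentence-length link describing an article'])
);
```

Content lists with short link text would fail this check, which is the weakness described above.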
So yeah, probably that first solution. I've already implemented it at https://github.com/Liamolucko/postlight-parser/tree/fix-link-lists and confirmed that it works.