Lists consisting of mostly links get removed

## Expected Behavior

Postlight Parser should preserve all the actual content of the page.

## Current Behavior

Postlight Parser removes any bulleted or numbered list that consists mostly of links, even when the list is part of the page's content.

## Steps to Reproduce

Run Postlight Parser on https://faultlore.com/blah/defaults-affect-inference. The bulleted list a bit after the ['Some Wild Shit Swift Does'](https://faultlore.com/blah/defaults-affect-inference/#some-wild-shit-swift-does) heading gets removed.

Picture of the list in question:

<img width="736" alt="Screenshot 2023-08-08 at 9 10 09 pm" src="https://github.com/postlight/parser/assets/43807659/cd9dca56-47b6-4e6a-adc9-0d76059c1c9a">

## Detailed Description

This is the code that causes the problem:

https://github.com/postlight/parser/blob/e8ba7ece291efa4d915d50dd4deeec17d54359f2/src/utils/dom/clean-tags.js#L43-L73

It's meant to strip out menus and similar link-heavy, non-content elements.

## Possible Solution

The easiest solution would be to take the special case from the `weight >= 25` branch of the code above, which keeps any list that comes after a paragraph ending in a colon, and apply it to the `weight < 25` branch as well. (The lists that get removed fall into the `weight < 25` branch, which is why the existing special case doesn't already save them.)
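To make that first option concrete, here is a minimal sketch of the combined check as a standalone function. It is not the code from the branch linked in this issue: `removedForLinkDensity`, `endsWithColon`, and the plain-object argument are hypothetical names made up for illustration, and `endsWithColon` only approximates what `normalizeSpaces(previousNode.text()).slice(-1) === ':'` does in the real code.

```javascript
// Hypothetical stand-in for the colon check in clean-tags.js:
// collapse whitespace, trim, and look at the last character.
function endsWithColon(text) {
  return text.replace(/\s+/g, ' ').trim().slice(-1) === ':';
}

// Hypothetical helper: returns true if the link-density checks would
// remove the node. `previousText` is the text of the node's previous
// sibling (empty string if there is none).
function removedForLinkDensity({ tagName, weight, density, contentLength, previousText }) {
  const tooLinkHeavy =
    (weight < 25 && density > 0.2 && contentLength > 75) ||
    (weight >= 25 && density > 0.5);
  if (!tooLinkHeavy) return false;

  // The special case now applies to both branches: a list that follows
  // text ending in a colon is probably content, so keep it.
  const nodeIsList = tagName === 'ol' || tagName === 'ul';
  if (nodeIsList && endsWithColon(previousText || '')) return false;

  return true;
}
```

With this shape, a low-weight `ul` that follows "Some examples:" is kept, while the same list after a sentence ending in a full stop is still removed.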

Another solution I thought of would be to look at the average or maximum length of the links in a list (or table, div, or anything else the tag-cleaning code gets applied to) and keep the node if that length exceeds some threshold. In theory that should differentiate between the short links found in menus and the longer, sentence-length links found in content. Looking at my example again, though, those links are actually quite short, so this might not work as well as I'd hoped.
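For comparison, the link-length idea could look roughly like the sketch below. Everything here is an assumption for illustration: `averageLinkLength`, `looksLikeContentLinks`, and the threshold of 30 characters are invented names and values, not anything that exists in Postlight Parser.

```javascript
// Hypothetical threshold: average link text length (in characters)
// above which links are treated as content rather than menu items.
const MIN_AVERAGE_LINK_LENGTH = 30;

// Average length of the trimmed link texts; 0 for an empty list.
function averageLinkLength(linkTexts) {
  if (linkTexts.length === 0) return 0;
  const total = linkTexts.reduce((sum, text) => sum + text.trim().length, 0);
  return total / linkTexts.length;
}

// Hypothetical heuristic: long links suggest content, short ones a menu.
function looksLikeContentLinks(linkTexts, threshold = MIN_AVERAGE_LINK_LENGTH) {
  return averageLinkLength(linkTexts) >= threshold;
}
```

The weakness is easy to see here: a content list made of short links would still be classified as a menu, whatever threshold is chosen.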

So yeah, probably that first solution. I've already implemented it at https://github.com/Liamolucko/postlight-parser/tree/fix-link-lists and confirmed that it works.

For reference, this is the relevant excerpt from the `clean-tags.js` permalink above:

    const density = linkDensity($node);

    // Too high of link density, is probably a menu or
    // something similar.
    // console.log(weight, density, contentLength)
    if (weight < 25 && density > 0.2 && contentLength > 75) {
      $node.remove();
      return;
    }

    // Too high of a link density, despite the score being
    // high.
    if (weight >= 25 && density > 0.5) {
      // Don't remove the node if it's a list and the
      // previous sibling starts with a colon though. That
      // means it's probably content.
      const tagName = $node.get(0).tagName.toLowerCase();
      const nodeIsList = tagName === 'ol' || tagName === 'ul';
      if (nodeIsList) {
        const previousNode = $node.prev();
        if (
          previousNode &&
          normalizeSpaces(previousNode.text()).slice(-1) === ':'
        ) {
          return;
        }
      }

      $node.remove();
      return;
    }
