Automatically fetch data-xxx attributes #74
src/autofetcher.ts
Outdated
@@ -327,5 +329,20 @@ export class AutoFetcher extends BackgroundBehavior {
    text.replace(STYLE_REGEX, urlExtractor).replace(IMPORT_REGEX, urlExtractor);
  }

  extractDataAttributes(document) {
    const allElements = document.querySelectorAll('*');
CSS selectors can't match on the start of an attribute name, but XPath can!
I think it might be more efficient to do this with XPath. For example, the following snippet can be pasted into the browser console:
function* xpathNodes(path, root) {
  root = root || document;
  let iter = document.evaluate(path, root, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE);
  let result = null;
  while ((result = iter.iterateNext()) !== null) {
    yield result;
  }
}

for (const res of xpathNodes("//@*[starts-with(name(), 'data-') and (starts-with(., 'http') or starts-with(., '/') or starts-with(., './') or starts-with(., '../'))]")) {
  console.log(res.value);
}
xpathNodes is already defined and can be imported from utils.
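To make the matching rules in that XPath predicate explicit, here is the same check written as a plain function. This is only a sketch: `isUrlLikeDataAttr` is a hypothetical name and is not defined in the PR or in utils.

```javascript
// Hypothetical helper mirroring the XPath predicate in the snippet above:
// the attribute name starts with "data-" and the value starts with
// "http", "/", "./" or "../".
function isUrlLikeDataAttr(name, value) {
  if (!name.startsWith("data-")) {
    return false;
  }
  return ["http", "/", "./", "../"].some((prefix) => value.startsWith(prefix));
}

console.log(isUrlLikeDataAttr("data-src", "/images/a.png")); // true
console.log(isUrlLikeDataAttr("data-id", "12345")); // false
console.log(isUrlLikeDataAttr("src", "https://example.com/a.png")); // false
```

Note that the prefix test is deliberately loose, so a value like "httpfoo" would also match. That is the same kind of false positive the speculative lookup can produce.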
Ended up implementing this to try it out, pushed to the PR!
This works, though on the sample page (https://solar.lowtechmagazine.com/2024/03/how-to-escape-from-the-iron-age/) I didn't see it loading any of those images even without this. Main concern would be crawling things that never actually get loaded, increasing size. Do those resources get loaded in different resolutions or under some other condition?
Ah ok, yep, I do see it now! Can confirm it worked with this change.
Yeah, this is a bit tricky: the speculative URL lookup can lead to false positives. See this issue for example: internetarchive/heritrix3#225. Heritrix does this much more aggressively, of course. There's also a link to some regex patterns that might make sense to match if we want to go beyond this. What we could also do is add a special header to these requests, which Browsertrix can then check for and not store them if they're 404s. But let's see if this will be an issue before implementing it, since data-* attributes are often used as URLs.
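The header idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the header name `x-speculative-fetch` and the functions `speculativeHeaders` and `shouldStore` are hypothetical and are not part of Browsertrix or this PR.

```javascript
// Hypothetical header name; nothing in the PR or in Browsertrix defines it.
const SPECULATIVE_HEADER = "x-speculative-fetch";

// Behavior side: mark a fetch as speculative by adding the header.
function speculativeHeaders(extra = {}) {
  return { ...extra, [SPECULATIVE_HEADER]: "1" };
}

// Crawler side: skip storing speculative requests that came back as 404.
// `requestHeaders` is assumed to be a plain object with lowercase keys.
function shouldStore(requestHeaders, status) {
  const speculative = requestHeaders[SPECULATIVE_HEADER] === "1";
  return !(speculative && status === 404);
}

console.log(shouldStore(speculativeHeaders(), 404)); // false
console.log(shouldStore(speculativeHeaders(), 200)); // true
console.log(shouldStore({}, 404)); // true
```

With this split, regular 404s from the page itself are still recorded, and only failed speculative lookups are dropped.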

Fix #72
Changes:
autofetch behavior is modified to also automatically fetch any URL found in any data-xxx attribute
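As a rough illustration of what fetching "any URL found in any data-xxx attribute" involves, the sketch below resolves candidate attribute values against the page URL and drops duplicates and non-HTTP(S) results. `resolveCandidates` is a hypothetical helper for illustration, not the code in src/autofetcher.ts.

```javascript
// Hypothetical helper: turn raw attribute values into absolute,
// de-duplicated http(s) URLs relative to the page they were found on.
function resolveCandidates(values, pageUrl) {
  const seen = new Set();
  for (const value of values) {
    let url;
    try {
      url = new URL(value, pageUrl);
    } catch (e) {
      continue; // not parseable as a URL, skip it
    }
    if (url.protocol === "http:" || url.protocol === "https:") {
      seen.add(url.href);
    }
  }
  return [...seen];
}

const found = resolveCandidates(
  ["/img/a.png", "../img/b.png", "https://example.com/img/a.png"],
  "https://example.com/posts/1/"
);
console.log(found);
// [ 'https://example.com/img/a.png', 'https://example.com/posts/img/b.png' ]
```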