Automatically fetch data-xxx attributes #74

benoit74 · 2024-07-02T09:07:35Z

Changes:

The autofetch behavior is modified to also automatically fetch any URL found in any data-xxx attribute.

benoit74 · 2024-07-02T09:08:53Z

@ikreymer @tw4l this is ready for review and open to discussion of course

ikreymer · 2024-07-11T17:39:29Z

src/autofetcher.ts

@@ -327,5 +329,20 @@ export class AutoFetcher extends BackgroundBehavior {

    text.replace(STYLE_REGEX, urlExtractor).replace(IMPORT_REGEX, urlExtractor);
  }
+
+  extractDataAttributes(document) {
+    const allElements = document.querySelectorAll('*');


CSS selectors can't match on the start of an attribute name, but XPath can!

I think it might be more efficient to do this with XPath. For example, the following snippet can be pasted into the browser console:

    function* xpathNodes(path, root) {
      root = root || document;
      let iter = document.evaluate(path, root, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE);
      let result = null;
      while ((result = iter.iterateNext()) !== null) {
        yield result;
      }
    }

    for (const res of xpathNodes("//@*[starts-with(name(), 'data-') and (starts-with(., 'http') or starts-with(., '/') or starts-with(., './') or starts-with(., '../'))]")) {
      console.log(res.value);
    }

xpathNodes is already defined and can be imported from utils.

Ended up implementing this to try it out, pushed to the PR!
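
For readers following along, the prefix test in the XPath predicate above can also be written as a plain function, which makes it easy to check outside the browser. This is only a sketch: `looksLikeUrl` and `resolveCandidates` are hypothetical names for illustration and are not taken from src/autofetcher.ts, and resolving against a base URL with de-duplication is an assumption about what a caller would want, not something this PR states.

```typescript
// Hypothetical helpers for illustration; not the code merged in this PR.
const URL_LIKE_PREFIXES = ["http", "/", "./", "../"];

// Mirrors the XPath predicate: the attribute value must start with
// one of the URL-like prefixes.
function looksLikeUrl(value: string): boolean {
  return URL_LIKE_PREFIXES.some((prefix) => value.startsWith(prefix));
}

// Resolve URL-like values against the page URL (assumed behavior),
// skipping values that fail to parse and dropping duplicates.
function resolveCandidates(values: string[], baseUrl: string): string[] {
  const seen = new Set<string>();
  for (const value of values) {
    if (!looksLikeUrl(value)) continue;
    try {
      seen.add(new URL(value, baseUrl).href);
    } catch (e) {
      // Not parseable as a URL; ignore it.
    }
  }
  return Array.from(seen);
}
```

Note that the bare `http` prefix is as loose here as it is in the XPath version: a value such as `httpfoo` passes the test and is resolved as a relative path.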

ikreymer · 2024-07-12T00:30:55Z

This works, though on the sample page (https://solar.lowtechmagazine.com/2024/03/how-to-escape-from-the-iron-age/) I didn't see the page load any of those images, even without this change. My main concern would be crawling things that never actually get loaded, which increases crawl size. Do those resources get loaded at different resolutions or under some other condition?

benoit74 · 2024-07-12T00:40:51Z

Thank you! On this particular website, these images are loaded when clicking the "View original image" button. In general, these are often images loaded at different resolutions.

The concern about crawling things that never actually get loaded is valid, however. I don't know whether we should have a switch to activate this "sub-behavior". Such a switch would probably be complex to implement and hard for users to understand (it takes significant HTML/JS expertise to know when it is needed).

ikreymer · 2024-07-12T02:41:13Z

> Thank you! On this particular website, these images are loaded when clicking the "View original image" button

Ah ok, yep, I do see it now! Can confirm it worked with this change.

> The concern about crawling things that never actually get loaded is valid, however. I don't know whether we should have a switch to activate this "sub-behavior". Such a switch would probably be complex to implement and hard for users to understand (it takes significant HTML/JS expertise to know when it is needed).

Yeah, this is a bit tricky: the speculative URL lookup can lead to false positives. See this issue, for example: internetarchive/heritrix3#225. Heritrix does this much more aggressively, of course. There's also a link to some regex patterns that might make sense to match if we want to go beyond ./.

What we could also do is add a special header to these requests, which Browsertrix can then check for so that it doesn't store the responses if they're 404s. But let's see whether this becomes an issue before implementing it; data-* attributes are often used to hold URLs.
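
To make the header idea concrete, here is a rough sketch of how the two sides could fit together. Everything named here is an assumption: the thread does not specify a header name, so `x-autofetch-speculative` is made up, and `speculativeInit` and `shouldStore` are hypothetical helpers, not existing Browsertrix or autofetcher APIs.

```typescript
// Hypothetical header name; the thread does not specify one.
const SPECULATIVE_HEADER = "x-autofetch-speculative";

// Behavior side: options to pass to fetch() for a speculative request,
// so that the request carries the marker header.
function speculativeInit(): { headers: Record<string, string> } {
  return { headers: { [SPECULATIVE_HEADER]: "1" } };
}

// Recorder side: keep every response except a 404 that answers a
// request carrying the marker header.
function shouldStore(
  requestHeaders: Record<string, string>,
  status: number,
): boolean {
  // Header names are case-insensitive, so compare in lowercase.
  const speculative = Object.keys(requestHeaders).some(
    (name) => name.toLowerCase() === SPECULATIVE_HEADER,
  );
  return !(speculative && status === 404);
}
```

With this shape, a speculative fetch would be issued as `fetch(url, speculativeInit())`. Ordinary page requests never carry the header, so their 404s would still be stored as before.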

Automatically fetch data-xxx attributes

ea1ff32

tw4l assigned ikreymer Jul 10, 2024

ikreymer reviewed Jul 11, 2024


ikreymer added 2 commits July 11, 2024 17:10

trying out xpath for data- attr findin

85dc32a

remove duplicate

9b9ed78

formatting tweaks

d34818f

ikreymer merged commit ecf3093 into webrecorder:main Jul 12, 2024
2 of 3 checks passed
