Skip to content

Automatically fetch data-xxx attributes #74

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 12, 2024

Conversation

benoit74
Copy link
Contributor

@benoit74 benoit74 commented Jul 2, 2024

Fix #72

Changes:

  • the autofetch behavior is modified to also automatically fetch any URL found in any data-xxx attribute

@benoit74
Copy link
Contributor Author

benoit74 commented Jul 2, 2024

@ikreymer @tw4l this is ready for review and open to discussion of course

@@ -327,5 +329,20 @@ export class AutoFetcher extends BackgroundBehavior {

text.replace(STYLE_REGEX, urlExtractor).replace(IMPORT_REGEX, urlExtractor);
}

extractDataAttributes(document) {
const allElements = document.querySelectorAll('*');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CSS selectors don't allow querying by attribute name start, but Xpath does!

I think it might be more efficient to do this with xpath, for example, the following snippet can be pasted into the browser.

function* xpathNodes(path, root) {
  root = root || document;
  let iter = document.evaluate(path, root, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE);
  let result = null;
  while ((result = iter.iterateNext()) !== null) {
    yield result;
  }
}

for (const res of xpathNodes("//@*[starts-with(name(), 'data-') and (starts-with(., 'http') or starts-with(., '/') or starts-with(., './') or starts-with(., '../'))]")) {
    console.log(res.value);
}

The xpathNodes is already defined and can be imported from utils

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up implementing this to try it out, pushed to the PR!

@ikreymer
Copy link
Member

This works, though on the sample page (https://solar.lowtechmagazine.com/2024/03/how-to-escape-from-the-iron-age/) I didn't see it loading any of those images even without this. Main concern would be crawling things that never actually get loaded, increasing size. Do those resources get loaded in different resolutions or some other condition?

@benoit74
Copy link
Contributor Author

Thank you! On this particular website, these images are loaded when clicking the "View original image" button:
image. In general they are often images loaded at different resolutions.

Concern about crawling things that never actually get loaded is however valid. I don't know if we should have a switch to activate this "sub-behavior". Such a switch is however probably complex to implement and complex for users to understand (it needs a significant expertise on HTML/JS to understand this is needed).

@ikreymer
Copy link
Member

Thank you! On this particular website, these images are loaded when clicking the "View original image" button

Ah ok, yep, I do see it now! Can confirm it worked with this change.

Concern about crawling things that never actually get loaded is however valid. I don't know if we should have a switch to activate this "sub-behavior". Such a switch is however probably complex to implement and complex for users to understand (it needs a significant expertise on HTML/JS to understand this is needed).

Yeah, this is a bit tricky, the speculative URL lookup can lead to false positives. See this issue for example: internetarchive/heritrix3#225. Heritrix does this much more aggressively of course. There's also a link to some regex patterns that might make sense to match if want to go beyond ./.

What we could also do is add a special header to these requests, which Browsertrix can then check for and not store if they're 404s. But let's see if this will be an issue before implementing this, data-* attributes are often used as URLs.

@ikreymer ikreymer merged commit ecf3093 into webrecorder:main Jul 12, 2024
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

New Behavior Request to retrieve automatically the assets present in a data-xxx tag
2 participants