-
-
Notifications
You must be signed in to change notification settings - Fork 420
un-indenting head-safe check to apply for cached files too #3232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3232 +/- ##
=======================================
Coverage 69.37% 69.37%
=======================================
Files 232 232
Lines 19684 19684
=======================================
Hits 13655 13655
Misses 6029 6029 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Hi again @eas342, thanks for this PR! This change is to a file in the main Astroquery directory and not part of any particular module, so it will impact more than just the MAST module. I would rather we change the MAST module directly so that it knows to continue incomplete file downloads. See my comment on the issue here: #3231 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other comment!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unindenting this block is not the correct behavior; it will overwrite the continuation
block above, which would result in the download continuation failing - it will download the whole file and append it to whatever was partly completed.
However, this same code block (the if head_safe: response = request(...)
block) needs to be repeated above, in the other location open_mode='wb'
Thank you @keflavich , Unfortunately, the change you made does not seem to pass #3231
WARNING: Found cached file jw01185103001_02102_00001-seg001_nrcalong_rate.fits with size 0 that is different from expected size 83520 [astroquery.query] |
Thanks @eas342. Indeed, there was another corner case we hadn't checked. The latest commit catches it. |
That works, super! Thank you @keflavich |
@eas342 @keflavich - I'm not sure how to test this in CI, etc. But if you are certain it covers all the cases, and add a changelog, it can still go in the next release ~tomorrow. |
let's add tests; i think i can turn @eas342 's mwe into a test easily enough. so delay merging |
@bsipocz this is ready for final review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments.
Please add back the multiple inheritances in the tests, and do a rebase for the changelog.
If you can squash out/consolidate a few of the flake8 and similar converging commits that would be good, too, but not a big deal if not.
you need to rebase, 0.4.10 is already out. |
continuation was set to False (which is the case - MAST hard-codes that), no download was being done - we were only retrieving the header, not the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, and I'll fixup the changelog prior merging.
OK, let's merge this now. I didn't see any problems trying it locally, but nevertheless it will be good to have it in for as long as possible before the next release. |
# This case doesn't appear reachable under normal circumstances | ||
# It is not covered by tests, and probably indicates a badly-behaved server | ||
raise ValueError(f"Found cached file {local_filepath}. Could not verify length.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block is causing some issues for our internal MAST test suite. We actually do have certain cases where the response returned by the download request doesn't contain a header for "Content-Length". I'm working to understand these edge cases better, but it may be that there's no way to get the content length, in which case trying to download a file that's already cached fails. This may have been why the continuation
parameter is set to False
in MAST download calls.
Would the best course of action be to re-download completely if no content length is found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to have the foresight of not including this in the release just yet hoping that this will get tested while in dev.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@snbianco can you tell us the answer to that question? What does it mean if there's no content-length? I'd argue that continuation in that scenario is probably not possible, but I don't understand the case here - why are we downloading a file without a specified length?
iirc, I couldn't test this because httpbin wouldn't serve anything without that header keyword
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the best of my knowledge, there are certain files that are generated and delivered by a different server than the MAST portal. For example, the catalog files for the Hubble Legacy Archive are generated by https://hla.stsci.edu/. So, content length may be unknown to the MAST portal, and the portal doesn't currently return it in the download response.
If it does turn out that we can't get the length of these files from the Portal, redownloading seems like the only way to go. The downside to that is the time lost redownloading when a complete cached file could have been used. Would it be possible to raise a warning instead of a ValueError
when the content length header is not found? Or add a parameter to control whether a warning or an error is raised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's safe to resume a partial download with no known endpoint. Maybe, but I feel uncomfortable about it.
Why is the MAST portal involved if the data are served from another portal? Shouldn't we just be downloading from the other server? Maybe you don't know and it's all weird server configuration stuff, but this feels super weird and difficult to handle robustly without understanding what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been talking about this issue with a few folks at MAST, and I know a bit more about why these particular files don't have a content length. They are generated on-the-fly by a service that queries our database and puts the results into the requested format. They are also streamed and not staged at all. There is a content length coming from the service, but I assume its being dropped somewhere between there and the portal API. There are a few people looking into it, and I'm hopeful that we can add the header to the files that come from that particular service.
That being said, we might have this issue with other files that are generated dynamically and then streamed from the database. Since this only applies in the case that the file is already cached and continuation is not possible (MAST sets continuation
to False
anyway), maybe the default behavior should be to log a warning and re-download the file entirely? Or log a warning and leave the cached file without re-downloading at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning toward 're-download and warn' as the most sensible default. If there are large files for which you really need continuation, we can consider adding a MAST-specific workaround
Un-indenting the head-safe boolean check to apply to both fixing incomplete cached files and new files. This appears to fix this issue:
#3231
However, I do not understand what this
head_safe
keyword does, so please check that this does not break the intended functionality or flow.