Skip to content

un-indenting head-safe check to apply for cached files too #3232

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 26, 2025

Conversation

eas342
Copy link
Contributor

@eas342 eas342 commented Feb 27, 2025

Un-indenting the head-safe boolean check to apply to both fixing incomplete cached files and new files. This appears to fix this issue:
#3231

However, I do not understand what this head_safe keyword does, so please check that this does not break the intended functionality or flow.

Copy link

codecov bot commented Feb 27, 2025

Codecov Report

Attention: Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Project coverage is 69.37%. Comparing base (0864cb6) to head (9dca39b).
Report is 283 commits behind head on main.

Files with missing lines Patch % Lines
astroquery/query.py 88.88% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3232   +/-   ##
=======================================
  Coverage   69.37%   69.37%           
=======================================
  Files         232      232           
  Lines       19684    19684           
=======================================
  Hits        13655    13655           
  Misses       6029     6029           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@snbianco
Copy link
Contributor

snbianco commented Mar 4, 2025

Hi again @eas342, thanks for this PR! This change is to a file in the main Astroquery directory and not part of any particular module, so it will impact more than just the MAST module. I would rather we change the MAST module directly so that it knows to continue incomplete file downloads. See my comment on the issue here: #3231 (comment)

Copy link
Contributor

@snbianco snbianco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment!

Copy link
Contributor

@keflavich keflavich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unindenting this block is not the correct behavior; it will overwrite the continuation block above, which would result in the download continuation failing - it will download the whole file and append it to whatever was partly completed.

However, this same code block (the if head_safe: response = request(...) block) needs to be repeated above, in the other location open_mode='wb'

@eas342
Copy link
Contributor Author

eas342 commented Mar 17, 2025

Thank you @keflavich ,

Unfortunately, the change you made does not seem to pass #3231

import os
from astropy.io import fits
from astroquery.mast import Observations
fileN = 'jw01185103001_02102_00001-seg001_nrcalong_rate.fits'
#os.remove(fileN)
Observations.download_file('mast:jwst/product/' + fileN)
with open(fileN, 'r+') as f:
    f.seek(1000)
    f.truncate()
Observations.download_file('mast:jwst/product/' + fileN)
tmpHDU = fits.open(fileN)

WARNING: Found cached file jw01185103001_02102_00001-seg001_nrcalong_rate.fits with size 0 that is different from expected size 83520 [astroquery.query]
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:jwst/product/jw01185103001_02102_00001-seg001_nrcalong_rate.fits to jw01185103001_02102_00001-seg001_nrcalong_rate.fits ...
|==========================================| 83k/ 83k (100.00%) 0s
WARNING: Found cached file jw01185103001_02102_00001-seg001_nrcalong_rate.fits with size 1000 that is different from expected size 83520 [astroquery.query]
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:jwst/product/jw01185103001_02102_00001-seg001_nrcalong_rate.fits to jw01185103001_02102_00001-seg001_nrcalong_rate.fits ...
|==========================================| 83k/ 83k (100.00%) 0s
OSError: Empty or corrupt FITS file

@keflavich
Copy link
Contributor

Thanks @eas342. Indeed, there was another corner case we hadn't checked. The latest commit catches it.

@bsipocz bsipocz added this to the v0.4.11 milestone Mar 17, 2025
@eas342
Copy link
Contributor Author

eas342 commented Mar 17, 2025

That works, super! Thank you @keflavich

@bsipocz
Copy link
Member

bsipocz commented Mar 17, 2025

@eas342 @keflavich - I'm not sure how to test this in CI, etc. But if you are certain it covers all the cases, and add a changelog, it can still go in the next release ~tomorrow.
But if there is no rush or there is any uncertainty, then it should sit the release out and get it merged after it's out to have it in main for a little longer.

@keflavich
Copy link
Contributor

let's add tests; i think i can turn @eas342 's mwe into a test easily enough. so delay merging

@keflavich keflavich requested a review from bsipocz March 18, 2025 15:10
@keflavich
Copy link
Contributor

@bsipocz this is ready for final review

@bsipocz bsipocz added infrastructure query astroquery internals and removed mast labels Mar 25, 2025
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.
Please add back the multiple inheritances in the tests, and do a rebase for the changelog.

If you can squash out/consolidate a few of the flake8 and similar converging commits that would be good, too, but not a big deal if not.

@bsipocz
Copy link
Member

bsipocz commented Mar 25, 2025

you need to rebase, 0.4.10 is already out.

eas342 and others added 3 commits March 25, 2025 18:40
continuation was set to False (which is the case - MAST hard-codes
that), no download was being done - we were only retrieving the header,
not the data
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, and I'll fixup the changelog prior merging.

@bsipocz
Copy link
Member

bsipocz commented Mar 26, 2025

OK, let's merge this now. I didn't see any problems trying it locally, but nevertheless it will be good to have it in for as long as possible before the next release.

@bsipocz bsipocz merged commit f3f4426 into astropy:main Mar 26, 2025
9 of 10 checks passed
Comment on lines +490 to +492
# This case doesn't appear reachable under normal circumstances
# It is not covered by tests, and probably indicates a badly-behaved server
raise ValueError(f"Found cached file {local_filepath}. Could not verify length.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is causing some issues for our internal MAST test suite. We actually do have certain cases where the response returned by the download request doesn't contain a header for "Content-Length". I'm working to understand these edge cases better, but it may be that there's no way to get the content length, in which case trying to download a file that's already cached fails. This may have been why the continuation parameter is set to False in MAST download calls.

Would the best course of action be to re-download completely if no content length is found?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to have the foresight of not including this in the release just yet hoping that this will get tested while in dev.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snbianco can you tell us the answer to that question? What does it mean if there's no content-length? I'd argue that continuation in that scenario is probably not possible, but I don't understand the case here - why are we downloading a file without a specified length?

iirc, I couldn't test this because httpbin wouldn't serve anything without that header keyword

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the best of my knowledge, there are certain files that are generated and delivered by a different server than the MAST portal. For example, the catalog files for the Hubble Legacy Archive are generated by https://hla.stsci.edu/. So, content length may be unknown to the MAST portal, and the portal doesn't currently return it in the download response.

If it does turn out that we can't get the length of these files from the Portal, redownloading seems like the only way to go. The downside to that is the time lost redownloading when a complete cached file could have been used. Would it be possible to raise a warning instead of a ValueError when the content length header is not found? Or add a parameter to control whether a warning or an error is raised?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's safe to resume a partial download with no known endpoint. Maybe, but I feel uncomfortable about it.

Why is the MAST portal involved if the data are served from another portal? Shouldn't we just be downloading from the other server? Maybe you don't know and it's all weird server configuration stuff, but this feels super weird and difficult to handle robustly without understanding what's going on.

Copy link
Contributor

@snbianco snbianco Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been talking about this issue with a few folks at MAST, and I know a bit more about why these particular files don't have a content length. They are generated on-the-fly by a service that queries our database and puts the results into the requested format. They are also streamed and not staged at all. There is a content length coming from the service, but I assume its being dropped somewhere between there and the portal API. There are a few people looking into it, and I'm hopeful that we can add the header to the files that come from that particular service.

That being said, we might have this issue with other files that are generated dynamically and then streamed from the database. Since this only applies in the case that the file is already cached and continuation is not possible (MAST sets continuation to False anyway), maybe the default behavior should be to log a warning and re-download the file entirely? Or log a warning and leave the cached file without re-downloading at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning toward 're-download and warn' as the most sensible default. If there are large files for which you really need continuation, we can consider adding a MAST-specific workaround

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug infrastructure query astroquery internals
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants