
un-indenting head-safe check to apply for cached files too #3232


Merged: 5 commits, Mar 26, 2025

9 changes: 9 additions & 0 deletions CHANGES.rst
@@ -23,6 +23,14 @@ heasarc
Infrastructure, Utility and Other Changes and Additions
-------------------------------------------------------

query.py
^^^^^^^^

- ``BaseQuery._download_file`` now returns the local file path in all cases.
Some corner cases where downloads were not properly continued have been
fixed. [#3232]



0.4.10 (2025-03-18)
===================
@@ -133,6 +141,7 @@ Infrastructure, Utility and Other Changes and Additions
``astroquery.test()`` functionality. [#3215]



0.4.9 (2025-01-24)
==================

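The changelog entry above states that ``BaseQuery._download_file`` now returns the local file path in all cases. As a rough illustration of what that enables for callers (a hypothetical sketch, not code from this PR; the class, URL, and filename are made up):

```python
# Hypothetical sketch, not code from this PR: a BaseQuery subclass relying on
# _download_file returning the local path in all cases, including when a
# complete cached copy is found and reused instead of re-downloaded.
from astroquery.query import BaseQuery


class ExampleQuery(BaseQuery):
    def get_product(self, url, filename):
        # The path comes back whether the file was freshly downloaded,
        # continued from a partial download, or served from the cache.
        local_path = self._download_file(url, filename, continuation=True)
        return local_path
```
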
47 changes: 30 additions & 17 deletions astroquery/query.py
@@ -412,7 +412,6 @@ def _download_file(self, url, local_filepath, timeout=None, auth=None,
verbose : bool
Whether to show download progress. Defaults to True.
"""

if head_safe:
response = self._session.request("HEAD", url,
timeout=timeout, stream=True,
@@ -426,23 +425,29 @@ def _download_file(self, url, local_filepath, timeout=None, auth=None,
if 'content-length' in response.headers:
length = int(response.headers['content-length'])
if length == 0:
log.warn('URL {0} has length=0'.format(url))
log.warning('URL {0} has length=0'.format(url))
else:
length = None

if ((os.path.exists(local_filepath)
and ('Accept-Ranges' in response.headers)
and length is not None
and continuation)):
open_mode = 'ab'

existing_file_length = os.stat(local_filepath).st_size
if length is not None and existing_file_length >= length:
# all done!
log.info("Found cached file {0} with expected size {1}."
.format(local_filepath, existing_file_length))
return
elif existing_file_length == 0:
if existing_file_length == 0:
log.info(f"Found existing {local_filepath} file with length 0. Overwriting.")
open_mode = 'wb'
if head_safe:
response = self._session.request(method, url,
timeout=timeout, stream=True,
auth=auth, **kwargs)
response.raise_for_status()
elif existing_file_length >= length:
# all done!
log.info(f"Found cached file {local_filepath} with size {existing_file_length} = {length}.")
return local_filepath
else:
log.info("Continuing download of file {0}, with {1} bytes to "
"go ({2}%)".format(local_filepath,
@@ -454,6 +459,7 @@ def _download_file(self, url, local_filepath, timeout=None, auth=None,
end = "{0}".format(length-1) if length is not None else ""
self._session.headers['Range'] = "bytes={0}-{1}".format(existing_file_length,
end)
log.debug(f"Continuing with range={self._session.headers['Range']}")

response = self._session.request(method, url,
timeout=timeout, stream=True,
@@ -466,17 +472,24 @@ def _download_file(self, url, local_filepath, timeout=None, auth=None,
statinfo = os.stat(local_filepath)
if statinfo.st_size != length:
log.warning(f"Found cached file {local_filepath} with size {statinfo.st_size} "
f"that is different from expected size {length}")
f"that is different from expected size {length}. ")
if continuation:
log.warning(
"Continuation was requested but is not possible because "
"'Accepts-Ranges' is not in the response headers.")
open_mode = 'wb'
response = self._session.request(method, url,
timeout=timeout, stream=True,
auth=auth, **kwargs)
response.raise_for_status()
else:
log.info("Found cached file {0} with expected size {1}."
.format(local_filepath, statinfo.st_size))
log.info(f"Found cached file {local_filepath} with expected size {statinfo.st_size}.")
response.close()
return
return local_filepath
else:
log.info("Found cached file {0}.".format(local_filepath))
response.close()
return
# This case doesn't appear reachable under normal circumstances
# It is not covered by tests, and probably indicates a badly-behaved server
raise ValueError(f"Found cached file {local_filepath}. Could not verify length.")
Comment on lines +490 to +492
Contributor

This block is causing some issues for our internal MAST test suite. We actually do have certain cases where the response returned by the download request doesn't contain a header for "Content-Length". I'm working to understand these edge cases better, but it may be that there's no way to get the content length, in which case trying to download a file that's already cached fails. This may have been why the continuation parameter is set to False in MAST download calls.

Would the best course of action be to re-download completely if no content length is found?
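For context, the two response headers this discussion hinges on can be inspected directly with a HEAD request. A minimal standalone sketch using plain requests (the URL is hypothetical; this is not astroquery code):

```python
# Standalone sketch, not astroquery code: probe a URL for the two headers the
# cached-file check depends on. Without 'Content-Length' the cached file's size
# cannot be compared against the expected size; without 'Accept-Ranges' a
# partial download cannot be resumed.
import requests

url = "https://example.org/some/product.fits"  # hypothetical URL
response = requests.head(url, allow_redirects=True, timeout=30)
response.raise_for_status()

length = response.headers.get("Content-Length")         # may be absent for streamed files
accept_ranges = response.headers.get("Accept-Ranges")    # e.g. 'bytes' when resumable

print(f"Content-Length: {length!r}, Accept-Ranges: {accept_ranges!r}")
```
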

Member

I'm glad we had the foresight not to include this in the release just yet, in the hope that it will get tested while in dev.

Contributor

@snbianco can you tell us the answer to that question? What does it mean if there's no content-length? I'd argue that continuation in that scenario is probably not possible, but I don't understand the case here - why are we downloading a file without a specified length?

IIRC, I couldn't test this because httpbin wouldn't serve anything without that header keyword.

Contributor

To the best of my knowledge, there are certain files that are generated and delivered by a different server than the MAST portal. For example, the catalog files for the Hubble Legacy Archive are generated by https://hla.stsci.edu/. So, content length may be unknown to the MAST portal, and the portal doesn't currently return it in the download response.

If it does turn out that we can't get the length of these files from the Portal, redownloading seems like the only way to go. The downside to that is the time lost redownloading when a complete cached file could have been used. Would it be possible to raise a warning instead of a ValueError when the content length header is not found? Or add a parameter to control whether a warning or an error is raised?

Contributor

I don't think it's safe to resume a partial download with no known endpoint. Maybe, but I feel uncomfortable about it.

Why is the MAST portal involved if the data are served from another portal? Shouldn't we just be downloading from the other server? Maybe you don't know and it's all weird server configuration stuff, but this feels super weird and difficult to handle robustly without understanding what's going on.

Contributor @snbianco (Apr 9, 2025)

I've been talking about this issue with a few folks at MAST, and I know a bit more about why these particular files don't have a content length. They are generated on-the-fly by a service that queries our database and puts the results into the requested format. They are also streamed and not staged at all. There is a content length coming from the service, but I assume it's being dropped somewhere between there and the portal API. There are a few people looking into it, and I'm hopeful that we can add the header to the files that come from that particular service.

That being said, we might have this issue with other files that are generated dynamically and then streamed from the database. Since this only applies in the case that the file is already cached and continuation is not possible (MAST sets continuation to False anyway), maybe the default behavior should be to log a warning and re-download the file entirely? Or log a warning and leave the cached file without re-downloading at all?

Contributor

I'm leaning toward 're-download and warn' as the most sensible default. If there are large files for which you really need continuation, we can consider adding a MAST-specific workaround.
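For reference, a minimal sketch of the 're-download and warn' behaviour discussed here (an assumption about a possible follow-up, not the code merged in this PR):

```python
# Sketch of the proposed fallback, not the merged code: if a cached file exists
# but the server reported no Content-Length, warn and re-download from scratch
# instead of raising ValueError.
import os
import warnings


def choose_open_mode(local_filepath, length):
    """Return 'wb' to (re)download fully, 'ab' to continue, or None to keep the cache."""
    if not os.path.exists(local_filepath):
        return 'wb'
    if length is None:
        warnings.warn(f"Found cached file {local_filepath} but the server did not "
                      "report a Content-Length; re-downloading the whole file.")
        return 'wb'
    existing = os.stat(local_filepath).st_size
    if existing >= length:
        return None   # cached copy is complete, nothing to do
    return 'ab'       # resume from the existing partial file
```
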

else:
open_mode = 'wb'
if head_safe:
@@ -488,7 +501,7 @@ def _download_file(self, url, local_filepath, timeout=None, auth=None,
blocksize = astropy.utils.data.conf.download_block_size

log.debug(f"Downloading URL {url} to {local_filepath} with size {length} "
f"by blocks of {blocksize}")
f"by blocks of {blocksize} with open_mode={open_mode}")

bytes_read = 0

Expand All @@ -514,7 +527,7 @@ def _download_file(self, url, local_filepath, timeout=None, auth=None,
f.write(response.content)

response.close()
return response
return local_filepath


@deprecated(since="v0.4.7", message=("The suspend_cache function is deprecated,"
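The continuation branch in the diff above resumes a partial download by sending an HTTP Range header. A standalone sketch of the same pattern with plain requests (hypothetical URL and filename; not the astroquery implementation):

```python
# Standalone sketch, not the astroquery implementation: resume a partial
# download with an HTTP Range request, appending to the existing file only
# when the server actually honours the range.
import os
import requests

url = "https://example.org/large_file.fits"   # hypothetical URL
local_filepath = "large_file.fits"            # hypothetical local file

existing = os.stat(local_filepath).st_size if os.path.exists(local_filepath) else 0
headers = {"Range": f"bytes={existing}-"} if existing else {}

with requests.get(url, headers=headers, stream=True, timeout=60) as response:
    response.raise_for_status()
    # 206 Partial Content means the server honoured the Range header;
    # a plain 200 means it ignored it and is resending the whole file.
    mode = "ab" if response.status_code == 206 else "wb"
    with open(local_filepath, mode) as f:
        for chunk in response.iter_content(chunk_size=65536):
            f.write(chunk)
```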