Add prefer_async param to TAPService constructor to control what mode uses #686

stvoutsin · 2025-06-27T05:55:00Z

Description

Adds an optional prefer_async parameter to TAPService that allows users to specify that they want search to default to async execution.
This change should be backwards-compatbile.

Problem: Currently, TAPService.search() always uses sync regardless. Some TAP service providers prefer asynchronous execution for better stability and resource management.

Solution: Implements approach 2 from issue #685 - an opt-in prefer_async parameter that:

Defaults to False (no behavior change for existing code)
When set to True, makes search() use run_async() automatically
Includes force_sync=True parameter to override async preference when needed

Usage:

# Default behavior (unchanged)
service = TAPService("http://example.com/tap")
results = service.search("SELECT * FROM table")  # Uses sync

# Opt-in to async-first behavior  
service = TAPService("http://example.com/tap", prefer_async=True)
results = service.search("SELECT * FROM table")  # Uses async
results = service.search("SELECT * FROM table", force_sync=True)  # Forces sync

Docs

Slightly modified/extended the dal/index.rst documentation with additional content describing the new parameter and how it can be used.
I've also included a small change/fix to UWS to make the pattern check more generic for the job list output.

Tests

Added a few unit tests to validate behaviour

Closes
#685

Is this a reasonable enhancement? Any thing I should change in the implementation or documentation?

codecov · 2025-06-27T06:06:56Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.01%. Comparing base (e2b5999) to head (654dad9).
Report is 7 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #686   +/-   ##
=======================================
  Coverage   84.00%   84.01%           
=======================================
  Files          80       80           
  Lines        8522     8526    +4     
=======================================
+ Hits         7159     7163    +4     
  Misses       1363     1363

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

… search uses

msdemlei

I consider the functionality reasonable and can't find flaw with the execution. I particularly like the explanation of sync vs. async in the docs. So... I'm all for merging, but since it's touching the API and people might want to point out that we should drop the force_sync argument to query and just tell people to use run_sync, I'd like to leave the actual merge to someone else.

Thanks!

bsipocz

The implementation looks good to me, and the docs is excellent, but I feel this is the type of thing that may lead to overcomplicated API.

I mean encouraging the explicit use of run_sync or run_async could be enough?

That being said, this is not a strong preference and is not a blocker for merging, but I pass on the ball to one more maintainer to chime in.

msdemlei · 2025-07-02T10:56:48Z

On Mon, Jun 30, 2025 at 03:47:04PM -0700, Brigitta Sipőcz wrote: That being said, this is not a strong preference and is not a blocker for merging, but I pass on the ball to one more maintainer to chime in.

I remain mildly in favour, so... @andamian or @tomdonaldson, if you're mildly in favour, too, would you merge?

andamian · 2025-07-08T17:05:50Z

I've followed the interesting discussion in the original issue and I, myself, had to grapple with the sync vs async issue in the past so I appreciate any effort that is being made to make it clearer to the users. However, I do think that this solution is not heading in the right direction for the following reasons:

I don't think we can say that a service prefers async mode, when the sync vs async mode is primarily a function of the query being executed. If a service provider prefers async mode, then I believe there are ways to implement the sync end point that way without the caller being aware of that.
The problem is the search method not allowing longer lasting queries to run. So this is what we should fix. And my guess is that the typical user of search doesn't want to know about the difference between sync and async so the extra flag could only say long_query or just be a timeout - how long to wait before timing out. There are other consequences for running async queries such as that the size of the result is limited (by the max size of temporary storage on the server side) and that the results cannot be streamed out before execution finishes but I don't know if it's worth mentioning them in the documentation.

stvoutsin · 2025-07-08T18:33:42Z

Thanks for the feedback @andamian!

I do see the point that many users of "search" shouldn't need to think about sync/async implementation details and generally agree that query characteristics matter for sync/async choice:
i.e. Simple queries don't need async overhead while complex queries benefit from async capabilities

However I'd argue that sync/async is not just a matter of the query.
Service providers may have different limits (runtime/result size) for these endpoints and this choice can affect whether the query succeeds or not.

In our current implementation we hide async execution behind the sync endpoint.
We internally execute sync queries as async jobs behind the scenes (submit job, wait/poll for completion, return result synchronously).
In theory, this should hide the async nature from the client. In practice we found that it has fundamental limitations:

HTTP connections must stay open for the entire query duration
Proxy/load balancer timeouts kill long-running connections
Client-side timeouts occur even when the service doesn't timeout
HTTP seems less ideal for very long synchronous requests

As a results we decided to implement a low sync timeout (1 minute) with much longer async limits despite wanting to hide this complexity from users.
Perhaps there are better approaches that we have not considered here?

In terms of the proposed long_query or timeout attributes, I'm wondering:

How does a user know what values to set? (for a query they haven't run before):
How should users determine these values?

results = service.search("SELECT * FROM table WHERE ..", timeout=???)
results = service.search("SELECT * FROM table WHERE ..", long_query=???)

Users probably will not know how long their query will actually take on this service, what the service's sync timeout limits are and whether their query will hit large tables or use slow indexes.

If the service has different async capabilities, how do these attributes determine execution method?

Does long_query=True automatically mean async?
Does timeout > 60 trigger async?
What happens when timeout=30 but the service sync limit is 10 seconds?
What if async has storage limitations that affect the query?

Also different services may have very different characteristics.

While our service has 1-minute sync limit, (very-long) async, other services: may have 10-second sync limits, or 5-minute limits and others may have limited temp storage.

So I don't see how a fixed timeout threshold or long_query attribute would help here, unless I've misunderstood the proposal.
I think adding this would be the decision back to the user but in a more confusing way, as they won't really know what it means or what the right values should be.

I generally lean towards an optional parameter that they don't have to set (explicitly choosing async or sync) which gives explicit control but doesn't confuse users who don't care about the sync/async implementations, but maybe I'm missing the bigger picture of how your proposal would work in practice.

andamian · 2025-07-08T19:29:15Z

I agree that the long_query|timeout is not an easy decision on user but I don't see how sync|async one is any better. The limits container info the @msdemlei is proposing would potentially aid in the decision but there are so many other variables as you've noted (table joins, indexes, etc).

I incline to think that search should be async by default to maximize the chances of success and recommend users to use run_sync for fast queries (like TAP schema and others) or queries that return large results but @msdemlei comment that there are a lot of services that implement this incorrectly makes me hesitate.

Another alternative is to hide all these under the hood without any changes to the API. Just assume that the user doesn't know (or doesn't want to know) the implementation details. Try the sync end first and if it times out automatically re-try the same query on the async end point. Call search the very basic, blocking method that is maximizing the chances of execution success. For more tuned, implementation specific cases there is run_sync and run_async. I think I would vote for that.

stvoutsin · 2025-07-08T21:27:16Z

I agree that the long_query|timeout is not an easy decision on user but I don't see how sync|async one is any better. The limits container info the @msdemlei is proposing would potentially aid in the decision but there are so many other variables as you've noted (table joins, indexes, etc).

I incline to think that search should be async by default to maximize the chances of success and recommend users to use run_sync for fast queries (like TAP schema and others) or queries that return large results but @msdemlei comment that there are a lot of services that implement this incorrectly makes me hesitate.

Another alternative is to hide all these under the hood without any changes to the API. Just assume that the user doesn't know (or doesn't want to know) the implementation details. Try the sync end first and if it times out automatically re-try the same query on the async end point. Call search the very basic, blocking method that is maximizing the chances of execution success. For more tuned, implementation specific cases there is run_sync and run_async. I think I would vote for that.

The way I'm looking at this I see the explicit async/sync choice better because it directly maps to reality and is more honest to what each option will do.

When a user explicitly sets the run method, i.e:
results = service.search(query, method='async')
They are probably a user who knows what async/sync is and they are explicitly saying they want the query run as async. For example our users would know to initialize the TAPService with async as the preferred method because of our runtime limitations for sync.

If a user has a query that would normally timeout for a given sync but would run within the limits of the async endpoint, I don't know how they would choose the right values for long_query/timeout without trial & error.

Then for users who don't care or want to know about TAP internals, they can just run search without the optional parameter.

I also feel that since pyvo already exposes an API for running sync & async (run_sync() & run_async()) it feels natural that these are options that they can specify in search as well. There is sort of an educational value here in teaching about TAP architecture and understand advantages/disadvantages of the two options, rather than trying to hide it from them.

I'm not against the idea of using sync and falling back to async, and had contemplating proposing this approach as well.
However there were a few factors that didn't feel quite right that I don't have good answers for, perhaps you have some thoughts on these:

Loss of control over performance expectations
So a query that would run in 5 minutes when run asynchronously, will now run in 6 if my service has a 1 minute sync timeout
Resource waste on the provider
For long queries, forcing a sync request feels like it may be wasting provider resources
Debugging
It becomes a bit trickier to debug / diagnose when things go wrong. Which query failed (async/sync?), which error message should be shown to the user?
False retry scenarios
How do we reliably distinguish "query too long for sync" from other errors? Different services return timeouts differently (HTTP 408, DAL errors, connection drops), and misclassifying errors could lead to inappropriate retries or confusing errors.

bsipocz · 2025-07-08T21:48:07Z

Another alternative is to hide all these under the hood without any changes to the API. Just assume that the user doesn't know (or doesn't want to know) the implementation details. Try the sync end first and if it times out automatically re-try the same query on the async end point. Call search the very basic, blocking method that is maximizing the chances of execution success. For more tuned, implementation specific cases there is run_sync and run_async. I think I would vote for that.

If we're going this route, then I would advocate for adding a warning after the timeout, something along the line to "rerunning query with async, you may consider changing your code to use run_async instead".

msdemlei · 2025-07-09T07:23:27Z

On Tue, Jul 08, 2025 at 02:48:29PM -0700, Brigitta Sipőcz wrote: If we're going this route, then I would advocate for adding a warning after the timeout, something along the line to "rerunning query with async, you may consider changing your code to use `run_async` instead".

The trouble with this is that a TAP client cannot yet reliably tell that a query timed out (rather than fail in some different way). You point makes me think we should change this. As it happens, DALI is right before RFC now, and I think we should advocate some language on making timeouts machine-readable in there. I've considered using QUERY_STATUS for that, in analogy to OVERFLOW. But extending the QUERY_STATUS "vocabulary" seems tricky, and as it happens, HTTP itself has 408 Request Timeout. I believe just amending DALI's 5.2 with something saying "services are urged to use a 408 HTTP status code if a request took more time to process than the server was willing to allot it; clients could use this information to recommend authentication, switching to async mode, or whatever" would prepare us for what Brigitta is proposing here (which I think does make sense). Are there any volunteers for writing a bug against DALI in this matter (or perhaps even a PR)?

stvoutsin · 2025-07-09T18:32:27Z

On Tue, Jul 08, 2025 at 02:48:29PM -0700, Brigitta Sipőcz wrote: If we're going this route, then I would advocate for adding a warning after the timeout, something along the line to "rerunning query with async, you may consider changing your code to use run_async instead".
The trouble with this is that a TAP client cannot yet reliably tell that a query timed out (rather than fail in some different way). You point makes me think we should change this. As it happens, DALI is right before RFC now, and I think we should advocate some language on making timeouts machine-readable in there. I've considered using QUERY_STATUS for that, in analogy to OVERFLOW. But extending the QUERY_STATUS "vocabulary" seems tricky, and as it happens, HTTP itself has 408 Request Timeout. I believe just amending DALI's 5.2 with something saying "services are urged to use a 408 HTTP status code if a request took more time to process than the server was willing to allot it; clients could use this information to recommend authentication, switching to async mode, or whatever" would prepare us for what Brigitta is proposing here (which I think does make sense). Are there any volunteers for writing a bug against DALI in this matter (or perhaps even a PR)?

I've started with an issue here and happy to work on the text if there is consensus:
ivoa-std/DALI#62

stvoutsin · 2025-07-15T22:44:07Z

Shall I close this PR in favour of addressing this at the level of better timeout messages once the plumbing is in place in DALI, or do is this still an approach we are considering?

msdemlei · 2025-07-16T07:58:23Z

On Tue, Jul 15, 2025 at 03:44:27PM -0700, stvoutsin wrote: stvoutsin left a comment (astropy/pyvo#686) Shall I close this PR in favour of addressing this at the level of better timeout messages once the plumbing is in place in DALI, or do is this still an approach we are considering?

I'm happy to have it closed given that there was some amount of pushback and there may really be not much advantage in telling people to run search(,prefer_async) against pointing them to run_async() in the first place.

stvoutsin force-pushed the issue-685 branch from 894052c to 79ac214 Compare June 27, 2025 05:57

Add prefer_async param to TAPService constructor to control what mode…

654dad9

… search uses

stvoutsin force-pushed the issue-685 branch from 5e0b321 to 654dad9 Compare June 27, 2025 06:21

stvoutsin marked this pull request as draft June 27, 2025 06:21

stvoutsin marked this pull request as ready for review June 27, 2025 06:27

msdemlei approved these changes Jun 27, 2025

View reviewed changes

bsipocz added the component: TAP label Jun 30, 2025

bsipocz added this to the v1.8 milestone Jun 30, 2025

bsipocz reviewed Jun 30, 2025

View reviewed changes

stvoutsin mentioned this pull request Jul 9, 2025

Standardizing Timeout Error Responses ivoa-std/DALI#62

Open

stvoutsin closed this Jul 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add prefer_async param to TAPService constructor to control what mode uses #686

Add prefer_async param to TAPService constructor to control what mode uses #686

Uh oh!

stvoutsin commented Jun 27, 2025 •

edited

Loading

Uh oh!

codecov bot commented Jun 27, 2025 •

edited

Loading

Uh oh!

msdemlei left a comment

Uh oh!

bsipocz left a comment

Uh oh!

msdemlei commented Jul 2, 2025 via email

Uh oh!

andamian commented Jul 8, 2025

Uh oh!

stvoutsin commented Jul 8, 2025

Uh oh!

andamian commented Jul 8, 2025

Uh oh!

stvoutsin commented Jul 8, 2025 •

edited

Loading

Uh oh!

bsipocz commented Jul 8, 2025

Uh oh!

msdemlei commented Jul 9, 2025 via email

Uh oh!

stvoutsin commented Jul 9, 2025

Uh oh!

stvoutsin commented Jul 15, 2025

Uh oh!

msdemlei commented Jul 16, 2025 via email

Uh oh!

Uh oh!

Uh oh!

Add prefer_async param to TAPService constructor to control what mode uses #686

Add prefer_async param to TAPService constructor to control what mode uses #686

Uh oh!

Conversation

stvoutsin commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Docs

Tests

Uh oh!

codecov bot commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

msdemlei left a comment

Choose a reason for hiding this comment

Uh oh!

bsipocz left a comment

Choose a reason for hiding this comment

Uh oh!

msdemlei commented Jul 2, 2025 via email

Uh oh!

andamian commented Jul 8, 2025

Uh oh!

stvoutsin commented Jul 8, 2025

Uh oh!

andamian commented Jul 8, 2025

Uh oh!

stvoutsin commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bsipocz commented Jul 8, 2025

Uh oh!

msdemlei commented Jul 9, 2025 via email

Uh oh!

stvoutsin commented Jul 9, 2025

Uh oh!

stvoutsin commented Jul 15, 2025

Uh oh!

msdemlei commented Jul 16, 2025 via email

Uh oh!

Uh oh!

stvoutsin commented Jun 27, 2025 •

edited

Loading

codecov bot commented Jun 27, 2025 •

edited

Loading

stvoutsin commented Jul 8, 2025 •

edited

Loading