Skip to content

Add prefer_async param to TAPService constructor to control what mode uses #686

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

stvoutsin
Copy link
Contributor

@stvoutsin stvoutsin commented Jun 27, 2025

Description

Adds an optional prefer_async parameter to TAPService that allows users to specify that they want search to default to async execution.
This change should be backwards-compatbile.

Problem: Currently, TAPService.search() always uses sync regardless. Some TAP service providers prefer asynchronous execution for better stability and resource management.

Solution: Implements approach 2 from issue #685 - an opt-in prefer_async parameter that:

  • Defaults to False (no behavior change for existing code)
  • When set to True, makes search() use run_async() automatically
  • Includes force_sync=True parameter to override async preference when needed

Usage:

# Default behavior (unchanged)
service = TAPService("http://example.com/tap")
results = service.search("SELECT * FROM table")  # Uses sync

# Opt-in to async-first behavior  
service = TAPService("http://example.com/tap", prefer_async=True)
results = service.search("SELECT * FROM table")  # Uses async
results = service.search("SELECT * FROM table", force_sync=True)  # Forces sync

Docs

Slightly modified/extended the dal/index.rst documentation with additional content describing the new parameter and how it can be used.
I've also included a small change/fix to UWS to make the pattern check more generic for the job list output.

Tests

Added a few unit tests to validate behaviour

Closes
#685

Is this a reasonable enhancement? Any thing I should change in the implementation or documentation?

Copy link

codecov bot commented Jun 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.01%. Comparing base (e2b5999) to head (654dad9).
Report is 7 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #686   +/-   ##
=======================================
  Coverage   84.00%   84.01%           
=======================================
  Files          80       80           
  Lines        8522     8526    +4     
=======================================
+ Hits         7159     7163    +4     
  Misses       1363     1363           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@stvoutsin stvoutsin marked this pull request as draft June 27, 2025 06:21
@stvoutsin stvoutsin marked this pull request as ready for review June 27, 2025 06:27
Copy link
Contributor

@msdemlei msdemlei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider the functionality reasonable and can't find flaw with the execution. I particularly like the explanation of sync vs. async in the docs. So... I'm all for merging, but since it's touching the API and people might want to point out that we should drop the force_sync argument to query and just tell people to use run_sync, I'd like to leave the actual merge to someone else.

Thanks!

@bsipocz bsipocz added this to the v1.8 milestone Jun 30, 2025
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me, and the docs is excellent, but I feel this is the type of thing that may lead to overcomplicated API.

I mean encouraging the explicit use of run_sync or run_async could be enough?

That being said, this is not a strong preference and is not a blocker for merging, but I pass on the ball to one more maintainer to chime in.

@msdemlei
Copy link
Contributor

msdemlei commented Jul 2, 2025 via email

@andamian
Copy link
Contributor

andamian commented Jul 8, 2025

I've followed the interesting discussion in the original issue and I, myself, had to grapple with the sync vs async issue in the past so I appreciate any effort that is being made to make it clearer to the users. However, I do think that this solution is not heading in the right direction for the following reasons:

  1. I don't think we can say that a service prefers async mode, when the sync vs async mode is primarily a function of the query being executed. If a service provider prefers async mode, then I believe there are ways to implement the sync end point that way without the caller being aware of that.
  2. The problem is the search method not allowing longer lasting queries to run. So this is what we should fix. And my guess is that the typical user of search doesn't want to know about the difference between sync and async so the extra flag could only say long_query or just be a timeout - how long to wait before timing out. There are other consequences for running async queries such as that the size of the result is limited (by the max size of temporary storage on the server side) and that the results cannot be streamed out before execution finishes but I don't know if it's worth mentioning them in the documentation.

@stvoutsin
Copy link
Contributor Author

Thanks for the feedback @andamian!

I do see the point that many users of "search" shouldn't need to think about sync/async implementation details and generally agree that query characteristics matter for sync/async choice:
i.e. Simple queries don't need async overhead while complex queries benefit from async capabilities

However I'd argue that sync/async is not just a matter of the query.
Service providers may have different limits (runtime/result size) for these endpoints and this choice can affect whether the query succeeds or not.

In our current implementation we hide async execution behind the sync endpoint.
We internally execute sync queries as async jobs behind the scenes (submit job, wait/poll for completion, return result synchronously).
In theory, this should hide the async nature from the client. In practice we found that it has fundamental limitations:

  • HTTP connections must stay open for the entire query duration
  • Proxy/load balancer timeouts kill long-running connections
  • Client-side timeouts occur even when the service doesn't timeout
  • HTTP seems less ideal for very long synchronous requests

As a results we decided to implement a low sync timeout (1 minute) with much longer async limits despite wanting to hide this complexity from users.
Perhaps there are better approaches that we have not considered here?

In terms of the proposed long_query or timeout attributes, I'm wondering:

How does a user know what values to set? (for a query they haven't run before):
How should users determine these values?

results = service.search("SELECT * FROM table WHERE ..", timeout=???)
results = service.search("SELECT * FROM table WHERE ..", long_query=???)

Users probably will not know how long their query will actually take on this service, what the service's sync timeout limits are and whether their query will hit large tables or use slow indexes.

If the service has different async capabilities, how do these attributes determine execution method?

Does long_query=True automatically mean async?
Does timeout > 60 trigger async?
What happens when timeout=30 but the service sync limit is 10 seconds?
What if async has storage limitations that affect the query?

Also different services may have very different characteristics.

While our service has 1-minute sync limit, (very-long) async, other services: may have 10-second sync limits, or 5-minute limits and others may have limited temp storage.

So I don't see how a fixed timeout threshold or long_query attribute would help here, unless I've misunderstood the proposal.
I think adding this would be the decision back to the user but in a more confusing way, as they won't really know what it means or what the right values should be.

I generally lean towards an optional parameter that they don't have to set (explicitly choosing async or sync) which gives explicit control but doesn't confuse users who don't care about the sync/async implementations, but maybe I'm missing the bigger picture of how your proposal would work in practice.

@andamian
Copy link
Contributor

andamian commented Jul 8, 2025

I agree that the long_query|timeout is not an easy decision on user but I don't see how sync|async one is any better. The limits container info the @msdemlei is proposing would potentially aid in the decision but there are so many other variables as you've noted (table joins, indexes, etc).

I incline to think that search should be async by default to maximize the chances of success and recommend users to use run_sync for fast queries (like TAP schema and others) or queries that return large results but @msdemlei comment that there are a lot of services that implement this incorrectly makes me hesitate.

Another alternative is to hide all these under the hood without any changes to the API. Just assume that the user doesn't know (or doesn't want to know) the implementation details. Try the sync end first and if it times out automatically re-try the same query on the async end point. Call search the very basic, blocking method that is maximizing the chances of execution success. For more tuned, implementation specific cases there is run_sync and run_async. I think I would vote for that.

@stvoutsin
Copy link
Contributor Author

stvoutsin commented Jul 8, 2025

I agree that the long_query|timeout is not an easy decision on user but I don't see how sync|async one is any better. The limits container info the @msdemlei is proposing would potentially aid in the decision but there are so many other variables as you've noted (table joins, indexes, etc).

I incline to think that search should be async by default to maximize the chances of success and recommend users to use run_sync for fast queries (like TAP schema and others) or queries that return large results but @msdemlei comment that there are a lot of services that implement this incorrectly makes me hesitate.

Another alternative is to hide all these under the hood without any changes to the API. Just assume that the user doesn't know (or doesn't want to know) the implementation details. Try the sync end first and if it times out automatically re-try the same query on the async end point. Call search the very basic, blocking method that is maximizing the chances of execution success. For more tuned, implementation specific cases there is run_sync and run_async. I think I would vote for that.

The way I'm looking at this I see the explicit async/sync choice better because it directly maps to reality and is more honest to what each option will do.

When a user explicitly sets the run method, i.e:
results = service.search(query, method='async')
They are probably a user who knows what async/sync is and they are explicitly saying they want the query run as async. For example our users would know to initialize the TAPService with async as the preferred method because of our runtime limitations for sync.

If a user has a query that would normally timeout for a given sync but would run within the limits of the async endpoint, I don't know how they would choose the right values for long_query/timeout without trial & error.

Then for users who don't care or want to know about TAP internals, they can just run search without the optional parameter.

I also feel that since pyvo already exposes an API for running sync & async (run_sync() & run_async()) it feels natural that these are options that they can specify in search as well. There is sort of an educational value here in teaching about TAP architecture and understand advantages/disadvantages of the two options, rather than trying to hide it from them.

I'm not against the idea of using sync and falling back to async, and had contemplating proposing this approach as well.
However there were a few factors that didn't feel quite right that I don't have good answers for, perhaps you have some thoughts on these:

  • Loss of control over performance expectations
    So a query that would run in 5 minutes when run asynchronously, will now run in 6 if my service has a 1 minute sync timeout

  • Resource waste on the provider
    For long queries, forcing a sync request feels like it may be wasting provider resources

  • Debugging
    It becomes a bit trickier to debug / diagnose when things go wrong. Which query failed (async/sync?), which error message should be shown to the user?

  • False retry scenarios
    How do we reliably distinguish "query too long for sync" from other errors? Different services return timeouts differently (HTTP 408, DAL errors, connection drops), and misclassifying errors could lead to inappropriate retries or confusing errors.

@bsipocz
Copy link
Member

bsipocz commented Jul 8, 2025

Another alternative is to hide all these under the hood without any changes to the API. Just assume that the user doesn't know (or doesn't want to know) the implementation details. Try the sync end first and if it times out automatically re-try the same query on the async end point. Call search the very basic, blocking method that is maximizing the chances of execution success. For more tuned, implementation specific cases there is run_sync and run_async. I think I would vote for that.

If we're going this route, then I would advocate for adding a warning after the timeout, something along the line to "rerunning query with async, you may consider changing your code to use run_async instead".

@msdemlei
Copy link
Contributor

msdemlei commented Jul 9, 2025 via email

@stvoutsin
Copy link
Contributor Author

On Tue, Jul 08, 2025 at 02:48:29PM -0700, Brigitta Sipőcz wrote: If we're going this route, then I would advocate for adding a warning after the timeout, something along the line to "rerunning query with async, you may consider changing your code to use run_async instead".
The trouble with this is that a TAP client cannot yet reliably tell that a query timed out (rather than fail in some different way). You point makes me think we should change this. As it happens, DALI is right before RFC now, and I think we should advocate some language on making timeouts machine-readable in there. I've considered using QUERY_STATUS for that, in analogy to OVERFLOW. But extending the QUERY_STATUS "vocabulary" seems tricky, and as it happens, HTTP itself has 408 Request Timeout. I believe just amending DALI's 5.2 with something saying "services are urged to use a 408 HTTP status code if a request took more time to process than the server was willing to allot it; clients could use this information to recommend authentication, switching to async mode, or whatever" would prepare us for what Brigitta is proposing here (which I think does make sense). Are there any volunteers for writing a bug against DALI in this matter (or perhaps even a PR)?

I've started with an issue here and happy to work on the text if there is consensus:
ivoa-std/DALI#62

@stvoutsin
Copy link
Contributor Author

Shall I close this PR in favour of addressing this at the level of better timeout messages once the plumbing is in place in DALI, or do is this still an approach we are considering?

@msdemlei
Copy link
Contributor

msdemlei commented Jul 16, 2025 via email

@stvoutsin stvoutsin closed this Jul 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants