-
-
Notifications
You must be signed in to change notification settings - Fork 362
check for nursery misnesting on task exit #3307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3307 +/- ##
===============================================
Coverage 100.00000% 100.00000%
===============================================
Files 125 125
Lines 19253 19317 +64
Branches 1304 1307 +3
===============================================
+ Hits 19253 19317 +64
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't currently see why this doesn't work but I think placing the code in the nursery methods for handling closed tasks might work?
What appears to happen in the failing test is that |
Sorry I mean I don't see why the test doesn't raise RuntimeError. I don't think we need to guarantee anything, simply cancelling the abandoned nursery is probably enough. As long as we raise an error noting our invariants were violated. |
well, we want to avoid the |
Well errors passing silently is probably a bad idea. Seperately maybe we can add the check into the codepaths used to notify a nursery when a child/nested task finishes? That way we aren't raising in an instrument (I assume based on method names?) |
…sks and handle nested scenarios (needs test)
Not quite sure how it would differ to move it.
async with _core.open_nursery() as outer_nursery:
inner_cm = _core.open_nursery()
await inner_cm.__aenter__() you'll get an error
I suppose there's no reason not to add a message to that |
Maybe we can push a runtime error into the nursery? Pretend like the task finished with an exception or something. I didn't realize We can probably change the assert to be a RuntimeError with whatever explanation of the problem we come up with? |
src/trio/_core/_run.py
Outdated
for nursery in task._child_nurseries: | ||
for child in nursery._children.copy(): | ||
if child in runner.runq: | ||
runner.runq.remove(child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool method to abort tasks but to me having nice behavior in this failure mode isn't important. If it's simple to just cancel the nursery then that's fine by me (if it isn't simple then oh well, this is fine. I just don't know what invariants this is relying on/wouldn't be able to spot bugs in this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is if I don't do this cleanup then a lot of assumptions are broken elsewhere and the resulting TrioInternalError
masks the RuntimeError
we want to surface. But yeah, this seems ripe for bugs.
I think it's when the child tasks are finished they want to yield back to their caller, which now doesn't exist. So we can't even rely on Cancellation semantics either.
There might be a cleaner solution where we postpone the exiting of the current task after requesting cancellation on the nursery... hmm gonna go try that
src/trio/_core/_run.py
Outdated
raise RuntimeError( | ||
"Nursery stack corrupted: nurseries spawned by " | ||
f"{task!r} was still live when the task exited\n{MISNESTING_ADVICE}", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be inclined to make this some kind of BaseException
, to make it really clear that this is not a recoverable error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cancel scope misnesting error is RuntimeError
, so I think this should be the same. With cleaning up aborted tasks I think this is technically recoverable, the runtime loop is not corrupted beyond redemption or anything.
So unless we want to do a thorough replacing of a lot of RuntimeError
I think we're better off staying consistent.
Shouldn't this done for cancel scopes, rather than nurseries? That's what AnyIO does. And since each nursery comes with its own cancel scopes, it should cover nurseries too. |
there is cancelscope misnesting checking, but it's not showing up in the cases with nurseries (see repro & output in #3298). I'm not completely sure why, but I think we're hitting internal errors before those checks get a chance to surface properly. edit: I no longer believe this is the reason. No clue why the current checks don't trigger. |
After trying to delay the closure of the outer nursery (which I failed) I stumbled upon a different approach which seems less dangerous. We cancel the misnested nursery, and then tell it not to reschedule its parent task when done (because that task does no longer exist). This needs some additional polish, but pushing for now to get any feedback on the new approach while I'm off for a festival weekend ✨
|
I assumed it was something like "cancel scope checks don't check this subtle interaction". Where the interaction in question is that we're starting a nursery in a new task and never exiting it. I guess maybe "check this task has no unclosed cancel scopes" would be nice... and actually we do have the state do that properly (check Edit: actually this might be because it's hard to know what to even do in this case. At least with nurseries we can do something as shown by this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! A couple points:
-
I would be interested in knowing how the error here works:
Lines 451 to 456 in 7f7bf5e
# CancelStatus. Note that it's possible for us to get here # without CancelScope._close() raising an error, if a # nursery's cancel scope is closed within the nursery's # nested child and no other cancel scopes are involved, # but in that case task_exited() will deal with raising # the error. -
it would be nice to double check this can't be done as a more general check about cancel scopes. I imagine this would look like peeking at cancel status's children on task exit? I think this won't work because things might run in a broken nursery, but it's worth a shot.
-
I think it would be very nice to not leak a Cancelled. I don't care about the other two items mentioned, but Cancelled derives from BaseException and people really shouldn't be suppressing it. So this would in theory lead to failure of the whole program which is a) bad and b) could be avoided since we're not even corrupting anything.
- maybe this looks like checking
if self._closed and not hasattr(self, "_pending_excs"):
and that we're not cancelled, and then suppressing Cancelled? Not very pretty. - alternatively maybe people want the whole program to fail if there's an invariant violated? after all the code is clearly buggy. I think the runtimeerror is enough though.
- maybe this looks like checking
I definitely prefer the whole program to crash if this invariant is violated; even if we can get the remaining runtime back into a valid state the program overall is buggy and needs to change. I won't block over this though, and we've already discussed If we can do the check at cancel-scope level that'd be lovely, if not a nursery-level check would also be fine. @jakkdl does this catch the repro I shared in the issue now? |
It should catch the repro, that's one of the test cases (it's the one that leaks a Cancelled actually) |
It looks like all the cancel scope misnesting error handling is from #1061, so it'd be great if @oremanj could pipe in with any further detail. It's possible that the exact error they're pointing out in that comment is what we're trying to fix now.
Peeking at I'm pretty sure I at the very least need Lines 2013 to 2017 in 06640eb
to stop everything from breaking down. |
Aha, it seems that the Lines 646 to 651 in 06640eb
but parent_cancellation_is_visible_to_us still returns True , so we don't clean up Cancelled .So the only thing we need to do is check if we're misnested when deciding whether to strip out Cancelled , and this seems like something that should be safe to do. Like, whether the parent cancellation is visible to us is not what should determine whether we pass through Cancelled , because that parent won't see the Cancelled that we're reraising! If anything we should step up the chain until we find the first scope that hasn't "abandoned" us and see if it is cancelled (and maybe check for shields along the way), but I'm not sure we care enough and I think we're guaranteeing that the original parent scope will raise an error - so it doesn't really matter that Cancelled is suppressed.
Even if that's what we want, this is not the way to do it. The But yeah, I'm sticking with doing the same as the other |
AnyIO indeed does cover the original problem in #3298, giving |
nevermind! I only tested it against the
from typing import Any, AsyncGenerator
from contextlib import AsyncExitStack, asynccontextmanager
import anyio
async def main() -> None:
async with AsyncExitStack() as stack, anyio.create_task_group() as nursery:
# The asynccontextmanager is going to create a nursery that outlives this nursery!
nursery.start_soon(
stack.enter_async_context,
asynccontextmanager_that_creates_a_nursery_internally(),
)
@asynccontextmanager
async def asynccontextmanager_that_creates_a_nursery_internally() -> AsyncGenerator[None]:
async with anyio.create_task_group() as nursery:
nursery.start_soon(
print_sleep_print, "task_in_asynccontextmanager_nursery", 2.0
)
yield
async def print_sleep_print(name: str, sleep_time: float) -> None:
await anyio.sleep(sleep_time)
if __name__ == "__main__":
anyio.run(main, backend='trio') If running the program with this PR you get the more helpful @agronholm you might want this test for anyio too |
edit: nvm I'm dumb |
fixes #3298 (hopefully, eventually)
This works for a minified repro (
test_nursery_misnest
), but not on the full repro posted by @Zac-HD (test_asyncexitstack_nursery_misnest
), so something is missing. I had a bunch of false starts of different ways of handling it that left the internal state messed up, but this one seems the most promising so far.