Skip to content

check for nursery misnesting on task exit #3307

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jakkdl
Copy link
Member

@jakkdl jakkdl commented Jul 28, 2025

fixes #3298 (hopefully, eventually)

This works for a minified repro (test_nursery_misnest), but not on the full repro posted by @Zac-HD (test_asyncexitstack_nursery_misnest), so something is missing. I had a bunch of false starts of different ways of handling it that left the internal state messed up, but this one seems the most promising so far.

Copy link

codecov bot commented Jul 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00000%. Comparing base (f8a51b6) to head (b7704d1).

Additional details and impacted files
@@               Coverage Diff               @@
##                 main        #3307   +/-   ##
===============================================
  Coverage   100.00000%   100.00000%           
===============================================
  Files             125          125           
  Lines           19253        19317   +64     
  Branches         1304         1307    +3     
===============================================
+ Hits            19253        19317   +64     
Files with missing lines Coverage Δ
src/trio/_core/_run.py 100.00000% <100.00000%> (ø)
src/trio/_core/_tests/test_run.py 100.00000% <100.00000%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@A5rocks A5rocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't currently see why this doesn't work but I think placing the code in the nursery methods for handling closed tasks might work?

@jakkdl
Copy link
Member Author

jakkdl commented Jul 28, 2025

What appears to happen in the failing test is that print_sleep_print has been scheduled in the inner nursery by the time we want to exit the outer nursery, and after exiting the outer it tries to run to spectacular failure. So I'm trying to find the right way to just .. fully abort and kill all child tasks in all nurseries in task._child_nurseries in task_exited. But I'm not quite sure how to do that correctly.

@A5rocks
Copy link
Contributor

A5rocks commented Jul 28, 2025

Sorry I mean I don't see why the test doesn't raise RuntimeError.

I don't think we need to guarantee anything, simply cancelling the abandoned nursery is probably enough. As long as we raise an error noting our invariants were violated.

@jakkdl
Copy link
Member Author

jakkdl commented Jul 28, 2025

Sorry I mean I don't see why the test doesn't raise RuntimeError.

I don't think we need to guarantee anything, simply cancelling the abandoned nursery is probably enough. As long as we raise an error noting our invariants were violated.

well, we want to avoid the TrioInternalError. I think the test doesn't show RuntimeError because it gets lost when all hell breaks loose in the scheduling

@A5rocks
Copy link
Contributor

A5rocks commented Jul 28, 2025

Well errors passing silently is probably a bad idea.

Seperately maybe we can add the check into the codepaths used to notify a nursery when a child/nested task finishes? That way we aren't raising in an instrument (I assume based on method names?)

@jakkdl jakkdl marked this pull request as ready for review July 29, 2025 13:52
…sks and handle nested scenarios (needs test)
@jakkdl
Copy link
Member Author

jakkdl commented Jul 29, 2025

Seperately maybe we can add the check into the codepaths used to notify a nursery when a child/nested task finishes? That way we aren't raising in an instrument (I assume based on method names?)

Not quite sure how it would differ to move it. Runner.task_exited directly calls Nursery._child_finished, except if it's the init task, so other than if we want to disable the check for the init task (which could also have bad nurseries, right?) I don't see any reason to have it in one or the other. And given that it directly messes with Runner.runq I think having it in Runner.task_exited might be better.

_nested_child_finished is a different beast, but I'm also not sure how you manage to get the nested child itself to have unclosed nurseries without doing very very dumb stuff. Currently if you do

    async with _core.open_nursery() as outer_nursery:
        inner_cm = _core.open_nursery()
        await inner_cm.__aenter__()

you'll get an error

========================================== FAILURES ===========================================
__________________________________ test_nursery_nested_child __________________________________

    async def test_nursery_nested_child() -> None:
>       async with _core.open_nursery() as outer_nursery:
                   ^^^^^^^^^^^^^^^^^^^^

src/trio/_core/_tests/test_run.py:871: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/trio/_core/_run.py:1109: in __aexit__
    new_exc = await self._nursery._nested_child_finished(exc)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <trio.Nursery object at 0x780dbe802fd0>, nested_child_exc = None

    async def _nested_child_finished(
        self,
        nested_child_exc: BaseException | None,
    ) -> BaseException | None:

...

        popped = self._parent_task._child_nurseries.pop()
>       assert popped is self
               ^^^^^^^^^^^^^^
E       AssertionError

src/trio/_core/_run.py:1324: AssertionError

I suppose there's no reason not to add a message to that AssertionError, but I'm not sure we should bother with putting more logic in there.

@A5rocks
Copy link
Contributor

A5rocks commented Jul 29, 2025

Maybe we can push a runtime error into the nursery? Pretend like the task finished with an exception or something.

I didn't realize _child_finished is called from an instrument... the init task runs user code in a nested task but the two methods aren't as useful as I thought they were (even if it would work, it would be duplicated).

We can probably change the assert to be a RuntimeError with whatever explanation of the problem we come up with?

for nursery in task._child_nurseries:
for child in nursery._children.copy():
if child in runner.runq:
runner.runq.remove(child)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool method to abort tasks but to me having nice behavior in this failure mode isn't important. If it's simple to just cancel the nursery then that's fine by me (if it isn't simple then oh well, this is fine. I just don't know what invariants this is relying on/wouldn't be able to spot bugs in this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is if I don't do this cleanup then a lot of assumptions are broken elsewhere and the resulting TrioInternalError masks the RuntimeError we want to surface. But yeah, this seems ripe for bugs.

I think it's when the child tasks are finished they want to yield back to their caller, which now doesn't exist. So we can't even rely on Cancellation semantics either.

There might be a cleaner solution where we postpone the exiting of the current task after requesting cancellation on the nursery... hmm gonna go try that

Comment on lines 2030 to 2033
raise RuntimeError(
"Nursery stack corrupted: nurseries spawned by "
f"{task!r} was still live when the task exited\n{MISNESTING_ADVICE}",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to make this some kind of BaseException, to make it really clear that this is not a recoverable error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cancel scope misnesting error is RuntimeError, so I think this should be the same. With cleaning up aborted tasks I think this is technically recoverable, the runtime loop is not corrupted beyond redemption or anything.

So unless we want to do a thorough replacing of a lot of RuntimeError I think we're better off staying consistent.

@agronholm
Copy link
Contributor

Shouldn't this done for cancel scopes, rather than nurseries? That's what AnyIO does. And since each nursery comes with its own cancel scopes, it should cover nurseries too.

@jakkdl
Copy link
Member Author

jakkdl commented Jul 30, 2025

Shouldn't this done for cancel scopes, rather than nurseries? That's what AnyIO does. And since each nursery comes with its own cancel scopes, it should cover nurseries too.

there is cancelscope misnesting checking, but it's not showing up in the cases with nurseries (see repro & output in #3298). I'm not completely sure why, but I think we're hitting internal errors before those checks get a chance to surface properly.

edit: I no longer believe this is the reason. No clue why the current checks don't trigger.

@jakkdl
Copy link
Member Author

jakkdl commented Jul 30, 2025

After trying to delay the closure of the outer nursery (which I failed) I stumbled upon a different approach which seems less dangerous. We cancel the misnested nursery, and then tell it not to reschedule its parent task when done (because that task does no longer exist).

This needs some additional polish, but pushing for now to get any feedback on the new approach while I'm off for a festival weekend ✨

  • Leaking Cancelled isn't great, but I also don't think it's a big downside - esp if I add a reason to it.
  • This solution will let the child tasks do cleanup in any finally, but exceptions may be lost. (should have a test demonstrating this).
  • We might want a new attribute on the Nursery instead of abusing _closed + not hasattr(self, 'pending_excs'), but also seems kinda niche.

@A5rocks
Copy link
Contributor

A5rocks commented Jul 30, 2025

No clue why the current checks don't trigger.

I assumed it was something like "cancel scope checks don't check this subtle interaction". Where the interaction in question is that we're starting a nursery in a new task and never exiting it.

I guess maybe "check this task has no unclosed cancel scopes" would be nice... and actually we do have the state do that properly (check task._cancel_status.children), but ATM I think errors are raised on exiting an cancel scope. Probably this is to have a nice stack trace, but this means we don't trigger this error as we never quite exit the abandoned cancel scope. (This is all hypothesis from reading code and probably wrong)

Edit: actually this might be because it's hard to know what to even do in this case. At least with nurseries we can do something as shown by this PR.

Copy link
Contributor

@A5rocks A5rocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! A couple points:

  • I would be interested in knowing how the error here works:

    trio/src/trio/_core/_run.py

    Lines 451 to 456 in 7f7bf5e

    # CancelStatus. Note that it's possible for us to get here
    # without CancelScope._close() raising an error, if a
    # nursery's cancel scope is closed within the nursery's
    # nested child and no other cancel scopes are involved,
    # but in that case task_exited() will deal with raising
    # the error.

  • it would be nice to double check this can't be done as a more general check about cancel scopes. I imagine this would look like peeking at cancel status's children on task exit? I think this won't work because things might run in a broken nursery, but it's worth a shot.

  • I think it would be very nice to not leak a Cancelled. I don't care about the other two items mentioned, but Cancelled derives from BaseException and people really shouldn't be suppressing it. So this would in theory lead to failure of the whole program which is a) bad and b) could be avoided since we're not even corrupting anything.

    • maybe this looks like checking if self._closed and not hasattr(self, "_pending_excs"): and that we're not cancelled, and then suppressing Cancelled? Not very pretty.
    • alternatively maybe people want the whole program to fail if there's an invariant violated? after all the code is clearly buggy. I think the runtimeerror is enough though.

@Zac-HD
Copy link
Member

Zac-HD commented Aug 7, 2025

I definitely prefer the whole program to crash if this invariant is violated; even if we can get the remaining runtime back into a valid state the program overall is buggy and needs to change. I won't block over this though, and we've already discussed RuntimeError above.

If we can do the check at cancel-scope level that'd be lovely, if not a nursery-level check would also be fine.

@jakkdl does this catch the repro I shared in the issue now?

@A5rocks
Copy link
Contributor

A5rocks commented Aug 7, 2025

It should catch the repro, that's one of the test cases (it's the one that leaks a Cancelled actually)

@jakkdl
Copy link
Member Author

jakkdl commented Aug 8, 2025

This looks good! A couple points:

  • I would be interested in knowing how the error here works:

    trio/src/trio/_core/_run.py

    Lines 451 to 456 in 7f7bf5e

    # CancelStatus. Note that it's possible for us to get here
    # without CancelScope._close() raising an error, if a
    # nursery's cancel scope is closed within the nursery's
    # nested child and no other cancel scopes are involved,
    # but in that case task_exited() will deal with raising
    # the error.

It looks like all the cancel scope misnesting error handling is from #1061, so it'd be great if @oremanj could pipe in with any further detail. It's possible that the exact error they're pointing out in that comment is what we're trying to fix now.

  • it would be nice to double check this can't be done as a more general check about cancel scopes. I imagine this would look like peeking at cancel status's children on task exit? I think this won't work because things might run in a broken nursery, but it's worth a shot.

Peeking at CancelStatus._children in task_exited it's only ever empty.. I think because at that point it's abandoned all its children?

I'm pretty sure I at the very least need

trio/src/trio/_core/_run.py

Lines 2013 to 2017 in 06640eb

if task._child_nurseries:
for nursery in task._child_nurseries:
nursery.cancel_scope.cancel() # TODO: add reason
nursery._parent_waiting_in_aexit = False
nursery._closed = True

to stop everything from breaking down.

@jakkdl
Copy link
Member Author

jakkdl commented Aug 8, 2025

  • I think it would be very nice to not leak a Cancelled. I don't care about the other two items mentioned, but Cancelled derives from BaseException and people really shouldn't be suppressing it. So this would in theory lead to failure of the whole program which is a) bad and b) could be avoided since we're not even corrupting anything.

    • maybe this looks like checking if self._closed and not hasattr(self, "_pending_excs"): and that we're not cancelled, and then suppressing Cancelled? Not very pretty.

Aha, it seems that the Cancelled leaking is when a CancelStatus has been abandoned_by_misnesting, i.e.

trio/src/trio/_core/_run.py

Lines 646 to 651 in 06640eb

if self._cancel_status.abandoned_by_misnesting:
# We are an inner cancel scope that was still active when
# some outer scope was closed. The closure of that outer
# scope threw an error, so we don't need to throw another
# one; it would just confuse the traceback.
pass

but parent_cancellation_is_visible_to_us still returns True, so we don't clean up Cancelled.
So the only thing we need to do is check if we're misnested when deciding whether to strip out Cancelled, and this seems like something that should be safe to do. Like, whether the parent cancellation is visible to us is not what should determine whether we pass through Cancelled, because that parent won't see the Cancelled that we're reraising! If anything we should step up the chain until we find the first scope that hasn't "abandoned" us and see if it is cancelled (and maybe check for shields along the way), but I'm not sure we care enough and I think we're guaranteeing that the original parent scope will raise an error - so it doesn't really matter that Cancelled is suppressed.

  • alternatively maybe people want the whole program to fail if there's an invariant violated? after all the code is clearly buggy. I think the runtimeerror is enough though.

Even if that's what we want, this is not the way to do it. The Cancelled would be caught by any CancelScope further up the chain that are cancelled.

But yeah, I'm sticking with doing the same as the other CancelScope stack violations - and if we want to make all of them+this raise a ReallyBadRuntimeError that inherits from BaseException that should be a separate PR.

@jakkdl
Copy link
Member Author

jakkdl commented Aug 8, 2025

Shouldn't this done for cancel scopes, rather than nurseries? That's what AnyIO does. And since each nursery comes with its own cancel scopes, it should cover nurseries too.

AnyIO indeed does cover the original problem in #3298, giving RuntimeError: Attempted to exit cancel scope in a different task than it was entered in. So maybe there is a variant out there that only makes use of CancelScope and doesn't mess around with nurseries. I have yet to compare the internals enough to see where they differ.

@jakkdl
Copy link
Member Author

jakkdl commented Aug 11, 2025

Shouldn't this done for cancel scopes, rather than nurseries? That's what AnyIO does. And since each nursery comes with its own cancel scopes, it should cover nurseries too.

AnyIO indeed does cover the original problem in #3298, giving RuntimeError: Attempted to exit cancel scope in a different task than it was entered in. So maybe there is a variant out there that only makes use of CancelScope and doesn't mess around with nurseries. I have yet to compare the internals enough to see where they differ.

nevermind! I only tested it against the asyncio backend. With trio as the backend we get a nasty crash with the KeyError. So the nursery-specific logic does seem necessary.

$ python foo.py 
Traceback (most recent call last):
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 2778, in unrolled_run
    runner.task_exited(task, final_outcome)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 1956, in task_exited
    self.tasks.remove(task)
    ~~~~~~~~~~~~~~~~~^^^^^^
KeyError: <Task 'contextlib.AsyncExitStack.enter_async_context' at 0x77a10622c470>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/anyio_nursery_corruption/foo.py", line 27, in <module>
    anyio.run(main, backend='trio')
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/anyio/_core/_eventloop.py", line 74, in run
    return async_backend.run(func, args, {}, backend_options)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/anyio/_backends/_trio.py", line 1001, in run
    return trio.run(func, *args)
           ~~~~~~~~^^^^^^^^^^^^^
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 2424, in run
    timeout = gen.send(next_send)
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 2834, in unrolled_run
    raise TrioInternalError("internal error in Trio - please file a bug!") from exc
trio.TrioInternalError: internal error in Trio - please file a bug!
Exception ignored in: <function Nursery.__del__ at 0x77a106492c00>
Traceback (most recent call last):
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 1407, in __del__
AssertionError: 
Exception ignored in: <coroutine object Runner.init at 0x77a106e73a70>
Traceback (most recent call last):
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 2054, in init
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 1057, in __aexit__
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 1223, in _nested_child_finished
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 1197, in _add_exc
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 897, in cancel
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 474, in recalculate
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 1581, in _attempt_delivery_of_any_pending_cancel
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 1563, in _attempt_abort
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_io_epoll.py", line 308, in abort
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_io_epoll.py", line 277, in _update_registrations
ValueError: I/O operation on closed epoll object
Exception ignored in: <function Nursery.__del__ at 0x77a106492c00>
Traceback (most recent call last):
  File "/tmp/anyio_nursery_corruption/.venv/lib/python3.13/site-packages/trio/_core/_run.py", line 1407, in __del__
AssertionError: 
from typing import Any, AsyncGenerator
from contextlib import AsyncExitStack, asynccontextmanager
import anyio

async def main() -> None:
    async with AsyncExitStack() as stack, anyio.create_task_group() as nursery:
        # The asynccontextmanager is going to create a nursery that outlives this nursery!
        nursery.start_soon(
            stack.enter_async_context,
            asynccontextmanager_that_creates_a_nursery_internally(),
        )

@asynccontextmanager
async def asynccontextmanager_that_creates_a_nursery_internally() -> AsyncGenerator[None]:
    async with anyio.create_task_group() as nursery:
        nursery.start_soon(
            print_sleep_print, "task_in_asynccontextmanager_nursery", 2.0
        )
        yield

async def print_sleep_print(name: str, sleep_time: float) -> None:
    await anyio.sleep(sleep_time)

if __name__ == "__main__":
    anyio.run(main, backend='trio')

If running the program with this PR you get the more helpful RuntimeError

@agronholm you might want this test for anyio too

@jakkdl
Copy link
Member Author

jakkdl commented Aug 11, 2025

edit: nvm I'm dumb

@jakkdl jakkdl requested a review from A5rocks August 11, 2025 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Internal error from nursery mis-nesting (due to bad use of AsyncExitStack)
4 participants