@ajakubowicz-canva ajakubowicz-canva commented Aug 23, 2025

This PR addresses multiple tasks tracked in #1165

Context

Recently we landed #1155, but it came with several caveats. It relied on suspending wide tile commands in order to pack rounds, and a major pain point was that the system could deadlock itself.

At renderer office hours we discussed pre-planning wide tiles into rounds and whether suspending was necessary. The consensus was that suspending had been chosen as the lower-cost option, but I felt I hadn't explored a non-suspending solution adequately.

Originally I was working towards solving the deadlock issue, which would require either removing the suspending concept or making suspending smart enough to avoid most deadlocks. Adding another system on top of the scheduler to "guess" resource usage and prevent deadlocks proved thorny and complex, so I also explored simply removing the complexity of suspending wide tile commands, and it worked!
This PR removes the entire concept of suspending.

Removing suspending lets us:

  • Delete many of the allocations I added.
  • Make deadlocks impossible, since they were coupled to the removed system.
  • Simplify the code and improve performance.

How

The most pertinent changes in this PR are the round calculations in PushBuf. I had miscalculated the rounds where we copy the destination slot contents back to the temporary texture. With those rounds fixed, the whole system "just works" without suspending, because work is scheduled into rounds correctly.

What is the insight and why could suspending be removed?

On main, suspending only happens when pushing a buffer: specifically, when the current top of stack (tos) has an invalid temporary slot that needs to be copied to the alternate texture so that a future blend operation will work.

There was a bug where rounds were calculated incorrectly in push buffer. Suspending papered over this by forcing enough rounds to pass that the temporary slots became valid.

The key insight is that the round tracked on each tile is the round in which that tile does its work. Tiles at different positions in the stack can work in parallel as long as they are mutually exclusive. It's only when blending or popping buffers that we take max(nos.round, tos.round) to synchronize the rounds and ensure that all work is done before blending.
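A minimal sketch of that synchronization rule (the type and field names here are illustrative stand-ins, not the actual vello_hybrid scheduler API):

```rust
/// Illustrative stand-in for a scheduler stack entry; the real vello_hybrid
/// types differ. This only models the `round` bookkeeping described above.
#[derive(Debug, Clone, Copy)]
struct StackEntry {
    /// The round in which this entry's tile work is scheduled.
    round: u32,
}

/// Tiles at different stack positions may do work in parallel rounds.
/// Only a blend (or buffer pop) forces synchronization: the blend cannot
/// be scheduled before both of its inputs have finished their work, so we
/// take the max of the two rounds.
fn sync_round(nos: StackEntry, tos: StackEntry) -> u32 {
    nos.round.max(tos.round)
}
```

Because synchronization only happens at these max() points, independent work keeps packing into earlier rounds, which is where the improved round utilization comes from.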

This insight is what lets us fix the round calculations for compositing and remove suspending, which was a workaround for the round calculation bug. It also explains why we get better round utilization with this PR.

That said, I also empirically verified that this change is worthwhile.

Empirical Tests

Round stats

With the following code in the Scheduler::flush method we can capture some insightful statistics:

```rust
#[cfg(debug_assertions)]
{
    let strips = [
        round.draws[0].0.len(),
        round.draws[1].0.len(),
        round.draws[2].0.len(),
    ];
    let total_strips: usize = strips.iter().sum();
    let slots_cleared = round.clear[0].len() + round.clear[1].len();
    let slots_freed = round.free[0].len() + round.free[1].len();

    eprintln!(
        "Round {}: {} strips (tex0:{}, tex1:{}, target:{}), {} cleared, {} freed",
        self.round,
        total_strips,
        strips[0],
        strips[1],
        strips[2],
        slots_cleared,
        slots_freed
    );
}
```

Command: `cargo nextest run complex_composed_layers_hybrid --locked --all-features --no-fail-fast --no-capture`

On main:

```
Round 0: 299 strips (tex0:59, tex1:240, target:0), 225 cleared, 0 freed
Round 1: 152 strips (tex0:60, tex1:92, target:0), 120 cleared, 40 freed
Round 2: 134 strips (tex0:66, tex1:68, target:0), 265 cleared, 60 freed
Round 3: 81 strips (tex0:35, tex1:46, target:0), 30 cleared, 95 freed
Round 4: 50 strips (tex0:25, tex1:25, target:0), 10 cleared, 100 freed
Round 5: 50 strips (tex0:25, tex1:25, target:0), 5 cleared, 90 freed
Round 6: 70 strips (tex0:25, tex1:25, target:20), 5 cleared, 135 freed
Round 7: 15 strips (tex0:5, tex1:5, target:5), 0 cleared, 30 freed
```

This PR:

```
Round 0: 496 strips (tex0:135, tex1:361, target:0), 550 cleared, 15 freed
Round 1: 120 strips (tex0:60, tex1:60, target:0), 25 cleared, 130 freed
Round 2: 100 strips (tex0:50, tex1:50, target:0), 50 cleared, 150 freed
Round 3: 50 strips (tex0:25, tex1:25, target:0), 30 cleared, 90 freed
Round 4: 70 strips (tex0:25, tex1:25, target:20), 5 cleared, 135 freed
Round 5: 15 strips (tex0:5, tex1:5, target:5), 0 cleared, 30 freed
```

This PR results in fewer rounds for drawing the same complex blend scene.
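As a sanity check on the logs above, the total strip count is identical across the two runs; the PR packs the same work into six rounds instead of eight:

```rust
/// Sum the per-round strip counts taken from the debug logs above.
fn total_strips(rounds: &[u32]) -> u32 {
    rounds.iter().sum()
}

fn main() {
    // Per-round strip totals copied from the `main` and PR logs above.
    let on_main = [299, 152, 134, 81, 50, 50, 70, 15];
    let this_pr = [496, 120, 100, 50, 70, 15];

    // Same amount of work either way: 851 strips total.
    assert_eq!(total_strips(&on_main), 851);
    assert_eq!(total_strips(&this_pr), 851);

    println!(
        "851 strips scheduled in {} rounds (PR) vs {} rounds (main)",
        this_pr.len(),
        on_main.len()
    );
}
```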

Deadlock tests

This was tested with the following diff:

```diff
diff --git a/sparse_strips/vello_hybrid/src/render/wgpu.rs b/sparse_strips/vello_hybrid/src/render/wgpu.rs
index f6998c02..1e3e3aef 100644
--- a/sparse_strips/vello_hybrid/src/render/wgpu.rs
+++ b/sparse_strips/vello_hybrid/src/render/wgpu.rs
@@ -71,7 +71,7 @@ impl Renderer {
 
         Self {
             programs: Programs::new(device, render_target_config, total_slots),
-            scheduler: Scheduler::new(total_slots),
+            scheduler: Scheduler::new(12),
             image_cache,
         }
     }
```

The command: `cargo nextest run --locked --all-features --no-fail-fast --release`, run with sparse_strips as the working directory.

With the extremely limited slots, this PR passes all tests. On main, the tests fail with a deadlock.

wgpu_webgl example (scalar)

(these numbers are rough, measured while dragging and zooming)

On main each rAF for the blend example is ~6.5ms.
This PR makes the rAF ~5ms.

Test plan

Tested manually with `cargo run_wasm -p wgpu_webgl --release` and `cargo run -p vello_hybrid_winit --release`, and by leaning on the existing test corpus, which includes hybrid compositing tests.

Risks

I don't think there are major risks introduced by this PR. This PR really polishes rough edges introduced by the initial blend layers PR and eliminates the deadlock hazard.

Comment on lines -403 to -404
```rust
#[cfg(debug_assertions)]
clear: [Vec::new(), Vec::new()],
```
I removed the clear debug assertion because it was not providing value. I couldn't get tests to fail or provide any strong signal with it.

Removing it cleans up a bunch of code.

@ajakubowicz-canva ajakubowicz-canva added the C-hybrid Applies to the vello_hybrid crate label Aug 23, 2025
@ajakubowicz-canva ajakubowicz-canva changed the title vello_hybrid: remove suspending from scheduler and add lots of performance vello_hybrid: remove suspending work from scheduler Aug 23, 2025
@taj-p taj-p left a comment

Thank you so much for explaining this so well in our call. LGTM - really makes sense! 🚀

@grebmeg grebmeg left a comment


Great to see that we can rely solely on round as the synchronization primitive! LGTM!

As a side note, it might be worth adding some tests to the scheduler, not just for verification, but also to clarify the expected states of texture slots.

@ajakubowicz-canva ajakubowicz-canva added this pull request to the merge queue Aug 25, 2025
Merged via the queue into main with commit 189431a Aug 26, 2025
17 checks passed
@ajakubowicz-canva ajakubowicz-canva deleted the ajakubowicz-composite-cleanup branch August 26, 2025 00:08