@ajakubowicz-canva ajakubowicz-canva commented Aug 23, 2025

This PR addresses multiple tasks tracked in #1165

Context

Recently we landed #1155, but it came with several caveats. It relied on suspending wide tile commands in order to pack rounds, and a major pain point was that the system could deadlock itself.

At renderer office hours we discussed pre-planning wide tiles into rounds and whether suspending was necessary. The consensus was that suspending had been chosen as the lower-cost option, but I felt I hadn't explored a non-suspending solution adequately.

Originally I was working towards solving the deadlock issue, which would require either removing the suspending concept or making suspending smart enough to avoid most deadlocks. Adding another system on top of the scheduler to "guess" resource usage and prevent deadlocks proved thorny and complex, so I also explored simply removing the complexity of suspending wide tile commands, and it worked!
This PR removes the entire concept of suspending.

Removing suspending lets us:

  • Delete many of the allocations I added.
  • Make deadlocks impossible, since they were coupled to the removed system.
  • Simplify the code and improve performance.

How

The most pertinent changes in this PR are the round calculations in PushBuf. I had miscalculated the rounds where we copy the destination slot contents back to the temporary texture. With those rounds fixed, the whole system "just works" without suspending, because work is scheduled into rounds correctly.

What is the insight and why could suspending be removed?

On main, suspending only happens when pushing a buffer: specifically, when the current top of stack (tos) has an invalid temporary slot that needs to be copied to the alternate texture so that a future blend operation will work.

There was a bug where rounds were calculated incorrectly in push buffer. Suspending papered over this by forcing enough rounds to pass that the temporary slots became valid.

The key insight is that the round tracked on each tile is the round in which that tile does its work. Tiles at different positions in the stack can work in parallel as long as they are mutually exclusive. It's only when blending or popping buffers that we take max(nos.round, tos.round) to synchronize the rounds and ensure that all work is done before blending.
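A minimal sketch of that synchronization rule (the type and field names here are illustrative stand-ins, not the actual vello_hybrid scheduler API):

```rust
/// Illustrative stand-in for a scheduler stack entry; the real vello_hybrid
/// types differ. This only models the `round` bookkeeping described above.
#[derive(Debug, Clone, Copy)]
struct StackEntry {
    /// The round in which this entry's tile work is scheduled.
    round: u32,
}

/// Tiles at different stack positions may do work in parallel rounds.
/// Only a blend (or buffer pop) forces synchronization: the blend cannot
/// be scheduled before both of its inputs have finished their work, so we
/// take the max of the two rounds.
fn sync_round(nos: StackEntry, tos: StackEntry) -> u32 {
    nos.round.max(tos.round)
}
```

Because synchronization only happens at these max() points, independent work keeps packing into earlier rounds, which is where the improved round utilization comes from.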

This insight is what lets us fix the round calculations for compositing and remove suspending, which was a workaround for the round calculation bug. It also explains why we get better round utilization with this PR.

That said, I also empirically verified that this change is worthwhile.

Empirical Tests

Round stats

With the following code in the Scheduler::flush method we can capture some insightful statistics:

```rust
#[cfg(debug_assertions)]
{
    let strips = [
        round.draws[0].0.len(),
        round.draws[1].0.len(),
        round.draws[2].0.len(),
    ];
    let total_strips: usize = strips.iter().sum();
    let slots_cleared = round.clear[0].len() + round.clear[1].len();
    let slots_freed = round.free[0].len() + round.free[1].len();

    eprintln!(
        "Round {}: {} strips (tex0:{}, tex1:{}, target:{}), {} cleared, {} freed",
        self.round,
        total_strips,
        strips[0],
        strips[1],
        strips[2],
        slots_cleared,
        slots_freed
    );
}
```

Command: `cargo nextest run complex_composed_layers_hybrid --locked --all-features --no-fail-fast --no-capture`

On main:

```
Round 0: 299 strips (tex0:59, tex1:240, target:0), 225 cleared, 0 freed
Round 1: 152 strips (tex0:60, tex1:92, target:0), 120 cleared, 40 freed
Round 2: 134 strips (tex0:66, tex1:68, target:0), 265 cleared, 60 freed
Round 3: 81 strips (tex0:35, tex1:46, target:0), 30 cleared, 95 freed
Round 4: 50 strips (tex0:25, tex1:25, target:0), 10 cleared, 100 freed
Round 5: 50 strips (tex0:25, tex1:25, target:0), 5 cleared, 90 freed
Round 6: 70 strips (tex0:25, tex1:25, target:20), 5 cleared, 135 freed
Round 7: 15 strips (tex0:5, tex1:5, target:5), 0 cleared, 30 freed
```

This PR:

```
Round 0: 496 strips (tex0:135, tex1:361, target:0), 550 cleared, 15 freed
Round 1: 120 strips (tex0:60, tex1:60, target:0), 25 cleared, 130 freed
Round 2: 100 strips (tex0:50, tex1:50, target:0), 50 cleared, 150 freed
Round 3: 50 strips (tex0:25, tex1:25, target:0), 30 cleared, 90 freed
Round 4: 70 strips (tex0:25, tex1:25, target:20), 5 cleared, 135 freed
Round 5: 15 strips (tex0:5, tex1:5, target:5), 0 cleared, 30 freed
```

This PR results in fewer rounds for drawing the same complex blend scene.
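As a sanity check on the logs above, the total strip count is identical across the two runs; the PR packs the same work into six rounds instead of eight:

```rust
/// Sum the per-round strip counts taken from the debug logs above.
fn total_strips(rounds: &[u32]) -> u32 {
    rounds.iter().sum()
}

fn main() {
    // Per-round strip totals copied from the `main` and PR logs above.
    let on_main = [299, 152, 134, 81, 50, 50, 70, 15];
    let this_pr = [496, 120, 100, 50, 70, 15];

    // Same amount of work either way: 851 strips total.
    assert_eq!(total_strips(&on_main), 851);
    assert_eq!(total_strips(&this_pr), 851);

    println!(
        "851 strips scheduled in {} rounds (PR) vs {} rounds (main)",
        this_pr.len(),
        on_main.len()
    );
}
```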

Deadlock tests

This was tested with the following diff:

```diff
diff --git a/sparse_strips/vello_hybrid/src/render/wgpu.rs b/sparse_strips/vello_hybrid/src/render/wgpu.rs
index f6998c02..1e3e3aef 100644
--- a/sparse_strips/vello_hybrid/src/render/wgpu.rs
+++ b/sparse_strips/vello_hybrid/src/render/wgpu.rs
@@ -71,7 +71,7 @@ impl Renderer {
 
         Self {
             programs: Programs::new(device, render_target_config, total_slots),
-            scheduler: Scheduler::new(total_slots),
+            scheduler: Scheduler::new(12),
             image_cache,
         }
     }
```

The command: `cargo nextest run --locked --all-features --no-fail-fast --release`, run with sparse_strips as the working directory.

With the extremely limited slots, this PR passes all tests. On main, the tests fail with a deadlock.

wgpu_webgl example (scalar)

(these numbers are rough, measured while dragging and zooming)

On main each rAF for the blend example is ~6.5ms.
This PR makes the rAF ~5ms.

Test plan

Tested manually with `cargo run_wasm -p wgpu_webgl --release` and `cargo run -p vello_hybrid_winit --release`, and by leaning on the existing test corpus, which includes hybrid compositing tests.

Risks

I don't think there are major risks introduced by this PR. This PR really polishes rough edges introduced by the initial blend layers PR and eliminates the deadlock hazard.

Comment on lines -403 to -404
```rust
#[cfg(debug_assertions)]
clear: [Vec::new(), Vec::new()],
```
I removed the clear debug assertion because it was not providing value. I couldn't get tests to fail or provide any strong signal with it.

Removing it cleans up a bunch of code.

@ajakubowicz-canva ajakubowicz-canva added the C-hybrid Applies to the vello_hybrid crate label Aug 23, 2025
@ajakubowicz-canva ajakubowicz-canva changed the title vello_hybrid: remove suspending from scheduler and add lots of performance vello_hybrid: remove suspending work from scheduler Aug 23, 2025
@taj-p taj-p left a comment

Thank you so much for explaining this so well in our call. LGTM - really makes sense! 🚀

@grebmeg grebmeg left a comment


Great to see that we can rely solely on round as the synchronization primitive! LGTM!

As a side note, it might be worth adding some tests to the scheduler, not just for verification, but also to clarify the expected states of texture slots.

@ajakubowicz-canva ajakubowicz-canva added this pull request to the merge queue Aug 25, 2025
Merged via the queue into main with commit 189431a Aug 26, 2025
17 checks passed
@ajakubowicz-canva ajakubowicz-canva deleted the ajakubowicz-composite-cleanup branch August 26, 2025 00:08