Skip to content

GPU Constraints #200

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 40 commits into from
Aug 13, 2025
Merged

GPU Constraints #200

merged 40 commits into from
Aug 13, 2025

Conversation

ejmeitz
Copy link
Collaborator

@ejmeitz ejmeitz commented Jun 13, 2025

This PR (WIP) implements SHAKE/RATTLE constraints on GPU using the M-SHAKE algorithm. It also has angle constraints!

The traditional SHAKE implementation is not possible to parallelize easily. M-SHAKE is parallelizable but comes at the cost of the types of constraints you can have. The "fast" implementation of M-SHAKE uses an analytical matrix inverse and is therefore limited to clusters of 3 constraints or less. I do not really see this as a problem. Typically you only constrain H-bonds or the bonds of the solvent molecules. Furthermore, bonds along the back bond are not constrainable due to the assumption of a "central" atom in my current implementation. This can be removed in the future, but I think implementing rigid body dynamics is the correct way to constraint a backbone and not SHAKE.

The types of clusters supported:

  • Single bond
  • 1 central atom bonded to 2 other atoms
  • 1 central atom bonded to 3 other atoms
  • 1 central atom bonded to 2 other atoms with the angle constrained

This leaves out things like CH4.

The current implementation is complete for all RATTLE variants and for SHAKE with a single bond as these are all analytical solutions and do not require iteration. The 3 remaining SHAKE kernels require iteration and I am not sure how best to handle that as load balancing is important for GPU performance. Open to suggestions as to how that should be done.

Some extra work will be required to actually support angle constraints. Maybe we leave this to another PR, and I just throw an error if I detect angle constraints.

  • Update things like n_dof which assume there are no angle constraints
  • Update build_clusters for angle constraints

@ejmeitz
Copy link
Collaborator Author

ejmeitz commented Jun 13, 2025

Also I tried to be clever and sort the atoms before hand to remove uncoalesced memory accesses, but the sorting seems to dwarf the benefit from the coalesced reads. This implementation is really not clever at all, but I also do not see room to be clever. Anyway the constraints should be a tiny part of the total cost, so its better they are on GPU to mitigate data moving from CPU to GPU during the simulation.

@jgreener64
Copy link
Collaborator

Great. I can't run the branch at the minute due to a missing import of StructArrays and a missing type parameter. Could you commit the working version?

@ejmeitz
Copy link
Collaborator Author

ejmeitz commented Jun 27, 2025

@jgreener64 The 3-atom constraints test works as well now. This means the 4-atom constraints (1 central + 3 bonds) are un-tested (but should work as the kernels are similar). There is also still some work during the setup to get angle constraints supported, but the shake kernel is done and should work.

@ejmeitz
Copy link
Collaborator Author

ejmeitz commented Jun 30, 2025

Ok all the code is written and seems to work. There are tests for diatomic, triatomic, 4 atom, angle, and angle + triatomic at the same time. None of these tests are on GPUs.

Things I don't love about the existing implementation:

  • To avoid mutating the input to the SHAKE_RATTLE constructor when both angle_constraints and distance_constraints are passed I make a copy of distance_constraints inside of build_clusters : dist_constraints = copy(dist_constraints)
  • The intermediate storage inside of each kernel has some weird types due to units and I have to do stuff like:
@uniform A_type = typeof(zero(L)*zero(L) / zero(M))
  • I am not sure if the way I allocate A,C, lambda inside the shake/rattle kernels is correct for thread local storage.
  • constraints_helper.jl is a freakin mess. It might be slow for systems with many many constraints. It only needs to run once though.

Things that would be worth checking:

  • Test a protein with explicit solvent. Constraining all H and all water
  • Does it work on GPU (it should if KA.jl doesnt have bugs)
  • Performance
    • How does it compare to prior implementation
    • How much faster in GPU vs. CPU
    • How many iterations to converge the various SHAKE kernels. Purportedly this is supposed to be as good as SETTLE for water.

Todo:

  • Docs, but maybe I do this in a separate PR to avoid making this one even more complex

@ejmeitz ejmeitz marked this pull request as ready for review June 30, 2025 21:46
@jgreener64
Copy link
Collaborator

Great, I'll try it this week. Do you have access to a GPU to test on?

@ejmeitz
Copy link
Collaborator Author

ejmeitz commented Jul 1, 2025

Yeah I have access to an AMD and NVIDIA GPU. Just never used Molly with GPU so was easier to test on CPU. Out of town this week, but can test code on another system if you'd like.

There's a good chance some atray isn't moved to the GPU that should be but hopefully not!

@jgreener64
Copy link
Collaborator

I wouldn't worry about performance for things that only run once.

I get the following error with CUDA:

ERROR: GPU compilation of MethodInstance for Molly.gpu_shake2_kernel!(::KernelAbstractions.CompilerMetadata{…}, ::Vector{…}, ::CuDeviceVector{…}, ::CuDeviceVector{…}, ::CuDeviceVector{…}, ::CubicBoundary{…}) failed
KernelError: passing non-bitstype argument

Argument 3 to your kernel function is of type Vector{Molly.ConstraintCluster}, which is not a bitstype:
  .ref is of type MemoryRef{Molly.ConstraintCluster} which is not isbits.
    .mem is of type Memory{Molly.ConstraintCluster} which is not isbits.


Only bitstypes, which are "plain data" types that are immutable
and contain no references to other values, can be used in GPU kernels.
For more information, see the `Base.isbitstype` function.

Stacktrace:
  [1] check_invocation(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/validation.jl:108
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:87 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/Tracy/GcShf/src/tracepoint.jl:158 [inlined]
  [4] compile_unhooked(output::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:85
  [5] compile_unhooked
    @ ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:80 [inlined]
  [6] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:67
  [7] compile
    @ ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:55 [inlined]
  [8] #1181
    @ ~/.julia/dev/CUDA/src/compiler/compilation.jl:250 [inlined]
  [9] JuliaContext(f::CUDA.var"#1181#1184"{GPUCompiler.CompilerJob{…}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:34
 [10] JuliaContext
    @ ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:25
 [11] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/dev/CUDA/src/compiler/compilation.jl:249
 [12] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/execution.jl:245
 [13] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/execution.jl:159
 [14] macro expansion
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:373 [inlined]
 [15] macro expansion
    @ ./lock.jl:273 [inlined]
 [16] cufunction(f::typeof(Molly.gpu_shake2_kernel!), tt::Type{…}; kwargs::@Kwargs{…})
    @ CUDA ~/.julia/dev/CUDA/src/compiler/execution.jl:368
 [17] macro expansion
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:112 [inlined]
 [18] (::KernelAbstractions.Kernel{…})(::Vector{…}, ::Vararg{…}; ndrange::Int64, workgroupsize::Nothing)
    @ CUDA.CUDAKernels ~/.julia/dev/CUDA/src/CUDAKernels.jl:122
 [19] apply_position_constraints!(sys::System{…}, ca::SHAKE_RATTLE{…}, r_pre_unconstrained_update::CuArray{…}; kwargs::@Kwargs{…})
    @ Molly ~/.julia/dev/Molly/src/constraints/shake_gpu.jl:678
 [20] apply_position_constraints!
    @ ~/.julia/dev/Molly/src/constraints/shake_gpu.jl:658 [inlined]
 [21] #apply_position_constraints!#252
    @ ~/.julia/dev/Molly/src/constraints/constraints_helper.jl:344 [inlined]
 [22] apply_position_constraints!
    @ ~/.julia/dev/Molly/src/constraints/constraints_helper.jl:342 [inlined]
 [23] #simulate!#259
    @ ~/.julia/dev/Molly/src/simulators.jl:91 [inlined]
 [24] simulate!(sys::System{…}, sim::SteepestDescentMinimizer{…})
    @ Molly ~/.julia/dev/Molly/src/simulators.jl:65
 [25] top-level scope
    @ REPL[99]:1

If you could try on GPU, and adapt the tests to also run on GPU where available (see for AT in array_list elsewhere in the file), that would be great.

I also needed CUDA.@allowscalar during constraints setup and system setup, you can move things to the CPU for these steps as they only happen once.

The branchless functions might be replaceable with cond ? a : b.

@ejmeitz
Copy link
Collaborator Author

ejmeitz commented Jul 5, 2025

I'll work on the first thing. I think I probably need to just use replace_storage from StructArrays.jl on the clusters stored inside of the SHAKE_RATTLE object.

Where did you need the @allowscalar by the way? I think the constraints you pass to the System constructor should be on CPU and then I'll move them to whatever backend the coordinates/velocities are on. The "setup" process completely mangles the user input and uses Graphs.jl so it definitely has to be CPU.

Wouldn't the ternary operator introduce warp divergence? I'm not sure what the ? lowers to inside a GPU kernel.

By the way it looks like you were using SteepestDescentMinimizer. I'm pretty sure SHAKE will not work with this assuming you could get past the setup phase. The derivation assumes there's some kind of dynamics. In LAMMPS they replace the constraints with a really strong harmonic bond I believe.

@jgreener64
Copy link
Collaborator

Wouldn't the ternary operator introduce warp divergence?

Ah I meant ifelse, my bad.

Fair enough about the constraints being on CPU. We could move them back to CPU for setup if they are on the GPU, so the user isn't confused by the error if they provide them on the GPU.

@ejmeitz
Copy link
Collaborator Author

ejmeitz commented Jul 8, 2025

Ok clusters are isbits now. Have not tried on GPU yet, AMDGPU was not playing nice with my computer 🫤.

I also want to note that if any clusters are not used the type is resolved to Any[]. I do not know if that is actually a problem or not. I've never found a pattern I liked for optional arguments in a struct. Union{Nothing, ...} is an option but that also feels wrong.

TODO:

  • Move clusters to backend of sys.coords
  • Launch all kernels simultaneously, since no single kernel will saturate the GPU (unclear if KA supports this)
  • Optimize default block_size
  • Create versions of check_position_constraints and check_velocity_constraints for GPU

POSSIBLE OPTIMIZATIONS:

  • Only move unique idxs, and distances for SHAKE, only unique idxs for rattle?
  • Do not move indices from GPU -> CPU -> GPU in shake_gpu

OPPORTUNITIES TO CLEAN CODE:

  • SHAKE+RATTLE could re-use the main part of the kernels instead of having separate functions for SHAKE & RATTLE

@jgreener64
Copy link
Collaborator

I doubt Any[] will be a problem.

I would profile before optimisation, since I am often surprised where the time is.

Copy link
Collaborator

@jgreener64 jgreener64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of work, well done for diving in. Here are some initial comments, I'll look more once these are addressed but overall it seems good.

Here are some Float32 timings for 1UBQ with no solvent (1,231 atoms):

CPU 1 thread no_constraints
4.860738971 ms/step
4.781459008 ms/step
4.842652095 ms/step

CPU 1 thread hydrogen_constraints
6.098673073 ms/step
6.717082253 ms/step
6.113765265 ms/step

CPU 16 thread no_constraints
3.844828712 ms/step
3.984333143 ms/step
3.91534444 ms/step

CPU 16 thread hydrogen_constraints
4.684524647 ms/step
4.88355524 ms/step
4.684905093 ms/step

CUDA no_constraints
0.313419713 ms/step
0.297092759 ms/step
0.282583017 ms/step

CUDA hydrogen_constraints
0.48397155 ms/step
0.486101783 ms/step
0.483098682 ms/step

And some Float32 timings for solvated 6MRR (15,954 atoms):

CPU 1 thread no_constraints
77.809855056 ms/step
78.834513861 ms/step
79.802540967 ms/step

CPU 1 thread hydrogen_constraints
88.181474596 ms/step
87.49463656 ms/step
88.946655989 ms/step

CPU 16 thread no_constraints
25.219622083 ms/step
24.691678186 ms/step
24.707366003 ms/step

CPU 16 thread hydrogen_constraints
29.048901858 ms/step
29.235323313 ms/step
29.40364796 ms/step

CUDA no_constraints
1.864682376 ms/step
1.89919547 ms/step
1.918378767 ms/step

CUDA hydrogen_constraints
2.199372642 ms/step
2.228519373 ms/step
2.307705759 ms/step

The GPU is an A6000 and here just bonds to hydrogen are constrained, not water angles. Overall I think performance is okay for now and may improve if some of the Float64 promotion is fixed.

@ejmeitz
Copy link
Collaborator Author

ejmeitz commented Aug 13, 2025

@jgreener64 This should be good to go.

I added kernels to check the velocity/distance constraints on CPU+GPU. I also re-wrote how the constraint data is stored. I thought it would make things faster, but on my system it did not really have much impact.

I still have inv_masses.

Copy link
Collaborator

@jgreener64 jgreener64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, there are a few changes I want to make here but I can do it later.

Just one query about the StructArrays dependency.

Also you will have to merge in the ewald branch changes I just pushed to master, sorry about that.

@jgreener64
Copy link
Collaborator

Okay, merge once master is merged in and CI passes.

Copy link

codecov bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 89.93994% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.38%. Comparing base (9feb5fa) to head (1a5dac2).
⚠️ Report is 41 commits behind head on master.

Files with missing lines Patch % Lines
src/constraints/constraints.jl 82.87% 31 Missing ⚠️
src/constraints/shake.jl 53.57% 26 Missing ⚠️
src/constraints/constraints_kernel_helper.jl 77.41% 7 Missing ⚠️
src/constraints/shake_kernels.jl 99.21% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #200      +/-   ##
==========================================
+ Coverage   68.83%   70.38%   +1.55%     
==========================================
  Files          39       41       +2     
  Lines        6237     6777     +540     
==========================================
+ Hits         4293     4770     +477     
- Misses       1944     2007      +63     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ejmeitz ejmeitz merged commit 5142a67 into JuliaMolSim:master Aug 13, 2025
5 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants