GPU Constraints #200

ejmeitz · 2025-06-13T16:17:33Z

This PR (WIP) implements SHAKE/RATTLE constraints on GPU using the M-SHAKE algorithm. It also has angle constraints!

The traditional SHAKE implementation is not possible to parallelize easily. M-SHAKE is parallelizable but comes at the cost of the types of constraints you can have. The "fast" implementation of M-SHAKE uses an analytical matrix inverse and is therefore limited to clusters of 3 constraints or less. I do not really see this as a problem. Typically you only constrain H-bonds or the bonds of the solvent molecules. Furthermore, bonds along the back bond are not constrainable due to the assumption of a "central" atom in my current implementation. This can be removed in the future, but I think implementing rigid body dynamics is the correct way to constraint a backbone and not SHAKE.

The types of clusters supported:

Single bond
1 central atom bonded to 2 other atoms
1 central atom bonded to 3 other atoms
1 central atom bonded to 2 other atoms with the angle constrained

This leaves out things like CH4.

The current implementation is complete for all RATTLE variants and for SHAKE with a single bond as these are all analytical solutions and do not require iteration. The 3 remaining SHAKE kernels require iteration and I am not sure how best to handle that as load balancing is important for GPU performance. Open to suggestions as to how that should be done.

Some extra work will be required to actually support angle constraints. Maybe we leave this to another PR, and I just throw an error if I detect angle constraints.

Update things like n_dof which assume there are no angle constraints
Update build_clusters for angle constraints

ejmeitz · 2025-06-13T16:21:20Z

Also I tried to be clever and sort the atoms before hand to remove uncoalesced memory accesses, but the sorting seems to dwarf the benefit from the coalesced reads. This implementation is really not clever at all, but I also do not see room to be clever. Anyway the constraints should be a tiny part of the total cost, so its better they are on GPU to mitigate data moving from CPU to GPU during the simulation.

jgreener64 · 2025-06-16T17:56:22Z

Great. I can't run the branch at the minute due to a missing import of StructArrays and a missing type parameter. Could you commit the working version?

ejmeitz · 2025-06-27T17:22:51Z

@jgreener64 The 3-atom constraints test works as well now. This means the 4-atom constraints (1 central + 3 bonds) are un-tested (but should work as the kernels are similar). There is also still some work during the setup to get angle constraints supported, but the shake kernel is done and should work.

ejmeitz · 2025-06-30T21:45:58Z

Ok all the code is written and seems to work. There are tests for diatomic, triatomic, 4 atom, angle, and angle + triatomic at the same time. None of these tests are on GPUs.

Things I don't love about the existing implementation:

To avoid mutating the input to the SHAKE_RATTLE constructor when both angle_constraints and distance_constraints are passed I make a copy of distance_constraints inside of build_clusters : dist_constraints = copy(dist_constraints)
The intermediate storage inside of each kernel has some weird types due to units and I have to do stuff like:

@uniform A_type = typeof(zero(L)*zero(L) / zero(M))

I am not sure if the way I allocate A,C, lambda inside the shake/rattle kernels is correct for thread local storage.
constraints_helper.jl is a freakin mess. It might be slow for systems with many many constraints. It only needs to run once though.

Things that would be worth checking:

Test a protein with explicit solvent. Constraining all H and all water
Does it work on GPU (it should if KA.jl doesnt have bugs)
Performance
- How does it compare to prior implementation
- How much faster in GPU vs. CPU
- How many iterations to converge the various SHAKE kernels. Purportedly this is supposed to be as good as SETTLE for water.

Todo:

Docs, but maybe I do this in a separate PR to avoid making this one even more complex

jgreener64 · 2025-07-01T14:07:15Z

Great, I'll try it this week. Do you have access to a GPU to test on?

ejmeitz · 2025-07-01T19:19:24Z

Yeah I have access to an AMD and NVIDIA GPU. Just never used Molly with GPU so was easier to test on CPU. Out of town this week, but can test code on another system if you'd like.

There's a good chance some atray isn't moved to the GPU that should be but hopefully not!

jgreener64 · 2025-07-02T19:17:25Z

I wouldn't worry about performance for things that only run once.

I get the following error with CUDA:

ERROR: GPU compilation of MethodInstance for Molly.gpu_shake2_kernel!(::KernelAbstractions.CompilerMetadata{…}, ::Vector{…}, ::CuDeviceVector{…}, ::CuDeviceVector{…}, ::CuDeviceVector{…}, ::CubicBoundary{…}) failed
KernelError: passing non-bitstype argument

Argument 3 to your kernel function is of type Vector{Molly.ConstraintCluster}, which is not a bitstype:
  .ref is of type MemoryRef{Molly.ConstraintCluster} which is not isbits.
    .mem is of type Memory{Molly.ConstraintCluster} which is not isbits.


Only bitstypes, which are "plain data" types that are immutable
and contain no references to other values, can be used in GPU kernels.
For more information, see the `Base.isbitstype` function.

Stacktrace:
  [1] check_invocation(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/validation.jl:108
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:87 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/Tracy/GcShf/src/tracepoint.jl:158 [inlined]
  [4] compile_unhooked(output::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:85
  [5] compile_unhooked
    @ ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:80 [inlined]
  [6] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:67
  [7] compile
    @ ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:55 [inlined]
  [8] #1181
    @ ~/.julia/dev/CUDA/src/compiler/compilation.jl:250 [inlined]
  [9] JuliaContext(f::CUDA.var"#1181#1184"{GPUCompiler.CompilerJob{…}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:34
 [10] JuliaContext
    @ ~/.julia/packages/GPUCompiler/aNr5U/src/driver.jl:25
 [11] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/dev/CUDA/src/compiler/compilation.jl:249
 [12] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/execution.jl:245
 [13] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/aNr5U/src/execution.jl:159
 [14] macro expansion
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:373 [inlined]
 [15] macro expansion
    @ ./lock.jl:273 [inlined]
 [16] cufunction(f::typeof(Molly.gpu_shake2_kernel!), tt::Type{…}; kwargs::@Kwargs{…})
    @ CUDA ~/.julia/dev/CUDA/src/compiler/execution.jl:368
 [17] macro expansion
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:112 [inlined]
 [18] (::KernelAbstractions.Kernel{…})(::Vector{…}, ::Vararg{…}; ndrange::Int64, workgroupsize::Nothing)
    @ CUDA.CUDAKernels ~/.julia/dev/CUDA/src/CUDAKernels.jl:122
 [19] apply_position_constraints!(sys::System{…}, ca::SHAKE_RATTLE{…}, r_pre_unconstrained_update::CuArray{…}; kwargs::@Kwargs{…})
    @ Molly ~/.julia/dev/Molly/src/constraints/shake_gpu.jl:678
 [20] apply_position_constraints!
    @ ~/.julia/dev/Molly/src/constraints/shake_gpu.jl:658 [inlined]
 [21] #apply_position_constraints!#252
    @ ~/.julia/dev/Molly/src/constraints/constraints_helper.jl:344 [inlined]
 [22] apply_position_constraints!
    @ ~/.julia/dev/Molly/src/constraints/constraints_helper.jl:342 [inlined]
 [23] #simulate!#259
    @ ~/.julia/dev/Molly/src/simulators.jl:91 [inlined]
 [24] simulate!(sys::System{…}, sim::SteepestDescentMinimizer{…})
    @ Molly ~/.julia/dev/Molly/src/simulators.jl:65
 [25] top-level scope
    @ REPL[99]:1

If you could try on GPU, and adapt the tests to also run on GPU where available (see for AT in array_list elsewhere in the file), that would be great.

I also needed CUDA.@allowscalar during constraints setup and system setup, you can move things to the CPU for these steps as they only happen once.

The branchless functions might be replaceable with cond ? a : b.

ejmeitz · 2025-07-05T01:16:50Z

I'll work on the first thing. I think I probably need to just use replace_storage from StructArrays.jl on the clusters stored inside of the SHAKE_RATTLE object.

Where did you need the @allowscalar by the way? I think the constraints you pass to the System constructor should be on CPU and then I'll move them to whatever backend the coordinates/velocities are on. The "setup" process completely mangles the user input and uses Graphs.jl so it definitely has to be CPU.

Wouldn't the ternary operator introduce warp divergence? I'm not sure what the ? lowers to inside a GPU kernel.

By the way it looks like you were using SteepestDescentMinimizer. I'm pretty sure SHAKE will not work with this assuming you could get past the setup phase. The derivation assumes there's some kind of dynamics. In LAMMPS they replace the constraints with a really strong harmonic bond I believe.

jgreener64 · 2025-07-05T17:00:17Z

Wouldn't the ternary operator introduce warp divergence?

Ah I meant ifelse, my bad.

Fair enough about the constraints being on CPU. We could move them back to CPU for setup if they are on the GPU, so the user isn't confused by the error if they provide them on the GPU.

ejmeitz · 2025-07-08T15:15:58Z

Ok clusters are isbits now. Have not tried on GPU yet, AMDGPU was not playing nice with my computer 🫤.

I also want to note that if any clusters are not used the type is resolved to Any[]. I do not know if that is actually a problem or not. I've never found a pattern I liked for optional arguments in a struct. Union{Nothing, ...} is an option but that also feels wrong.

TODO:

Move clusters to backend of sys.coords
Launch all kernels simultaneously, since no single kernel will saturate the GPU (unclear if KA supports this)
Optimize default block_size
Create versions of check_position_constraints and check_velocity_constraints for GPU

POSSIBLE OPTIMIZATIONS:

Only move unique idxs, and distances for SHAKE, only unique idxs for rattle?
Do not move indices from GPU -> CPU -> GPU in shake_gpu

OPPORTUNITIES TO CLEAN CODE:

SHAKE+RATTLE could re-use the main part of the kernels instead of having separate functions for SHAKE & RATTLE

jgreener64 · 2025-07-09T09:57:30Z

I doubt Any[] will be a problem.

I would profile before optimisation, since I am often surprised where the time is.

jgreener64

This is a lot of work, well done for diving in. Here are some initial comments, I'll look more once these are addressed but overall it seems good.

Here are some Float32 timings for 1UBQ with no solvent (1,231 atoms):

CPU 1 thread no_constraints
4.860738971 ms/step
4.781459008 ms/step
4.842652095 ms/step

CPU 1 thread hydrogen_constraints
6.098673073 ms/step
6.717082253 ms/step
6.113765265 ms/step

CPU 16 thread no_constraints
3.844828712 ms/step
3.984333143 ms/step
3.91534444 ms/step

CPU 16 thread hydrogen_constraints
4.684524647 ms/step
4.88355524 ms/step
4.684905093 ms/step

CUDA no_constraints
0.313419713 ms/step
0.297092759 ms/step
0.282583017 ms/step

CUDA hydrogen_constraints
0.48397155 ms/step
0.486101783 ms/step
0.483098682 ms/step

And some Float32 timings for solvated 6MRR (15,954 atoms):

CPU 1 thread no_constraints
77.809855056 ms/step
78.834513861 ms/step
79.802540967 ms/step

CPU 1 thread hydrogen_constraints
88.181474596 ms/step
87.49463656 ms/step
88.946655989 ms/step

CPU 16 thread no_constraints
25.219622083 ms/step
24.691678186 ms/step
24.707366003 ms/step

CPU 16 thread hydrogen_constraints
29.048901858 ms/step
29.235323313 ms/step
29.40364796 ms/step

CUDA no_constraints
1.864682376 ms/step
1.89919547 ms/step
1.918378767 ms/step

CUDA hydrogen_constraints
2.199372642 ms/step
2.228519373 ms/step
2.307705759 ms/step

The GPU is an A6000 and here just bonds to hydrogen are constrained, not water angles. Overall I think performance is okay for now and may improve if some of the Float64 promotion is fixed.

Project.toml

src/constraints/constraints_helper.jl

src/constraints/shake_gpu.jl

src/simulators.jl

src/types.jl

src/constraints/shake.jl

ustrip charge in SimpleCrystals constructor

AtomsBase interface requires this I think. At the very least, I tried using an EAM potential through ASEconvert and it calls this function....

add charge(system, ::Colon)

ejmeitz · 2025-08-13T15:50:25Z

@jgreener64 This should be good to go.

I added kernels to check the velocity/distance constraints on CPU+GPU. I also re-wrote how the constraint data is stored. I thought it would make things faster, but on my system it did not really have much impact.

I still have inv_masses.

jgreener64

Great, there are a few changes I want to make here but I can do it later.

Just one query about the StructArrays dependency.

Also you will have to merge in the ewald branch changes I just pushed to master, sorry about that.

src/Molly.jl

jgreener64 · 2025-08-13T17:56:29Z

Okay, merge once master is merged in and CI passes.

codecov · 2025-08-13T21:32:22Z

Codecov Report

❌ Patch coverage is 89.93994% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.38%. Comparing base (9feb5fa) to head (1a5dac2).
⚠️ Report is 41 commits behind head on master.

Files with missing lines	Patch %	Lines
src/constraints/constraints.jl	82.87%	31 Missing ⚠️
src/constraints/shake.jl	53.57%	26 Missing ⚠️
src/constraints/constraints_kernel_helper.jl	77.41%	7 Missing ⚠️
src/constraints/shake_kernels.jl	99.21%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #200      +/-   ##
==========================================
+ Coverage   68.83%   70.38%   +1.55%     
==========================================
  Files          39       41       +2     
  Lines        6237     6777     +540     
==========================================
+ Hits         4293     4770     +477     
- Misses       1944     2007      +63

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ejmeitz added 5 commits May 27, 2025 23:26

stuff

7ac3b3b

2-atom kernel done, re-ordering partially done

1ccfc42

start rattle kernels

bb7896d

rattle angle kernel

173f6c9

remove sorting

46fb013

ejmeitz added 12 commits June 18, 2025 15:18

molly compiles

572731b

should be close

8cd6bed

runs, but SHAKE wrong

0f76429

fixed shake+rattle 2 body

aa86aa1

working on shake 3-atom (two atom still works)

200a5a1

angle + 4 atom shake implemented

affa390

angle constraint constructor

c4ea96e

add iteration for shake3/4/angle

97bfbda

shake should be close

6e96a87

units shenanigans

39c5f04

shake 3-atom works!

180a35e

oops now it works...

41713ce

ejmeitz added 3 commits June 30, 2025 10:32

test for 4 atom, track iteration stats, support angle

b616984

angle constraints work!

289fac1

add test for angle + distance at same time

1fc3dd4

ejmeitz marked this pull request as ready for review June 30, 2025 21:46

ConstraintCluster isbits

2df3eb5

ejmeitz added 4 commits July 10, 2025 15:52

Dist constraints work on GPU!!

70a5929

remove prints + stupid thing in simulators

e3547d8

small optimizations

76994dc

working on struct array version

e17d159

jgreener64 reviewed Jul 15, 2025

View reviewed changes

ejmeitz and others added 10 commits July 21, 2025 10:20

2-atom constraints work in new framework

f258770

Merge pull request JuliaMolSim#207 from ejmeitz/strip-charge

4b16a13

ustrip charge in SimpleCrystals constructor

add charge(system, ::Colon)

b3609b8

AtomsBase interface requires this I think. At the very least, I tried using an EAM potential through ASEconvert and it calls this function....

Merge pull request JuliaMolSim#210 from JuliaMolSim/ejmeitz-patch-1

d1cacc6

add charge(system, ::Colon)

works but is slower...

2c43e7c

rm Online Stats

434dd7a

use ifelse

c7e837b

remove some allocs

5542cc2

check constraints on GPU

21e0104

Merge branch 'master' into gpu-constraints

a36008c

add inv_masses to Replica

a37df91

jgreener64 reviewed Aug 13, 2025

View reviewed changes

src/Molly.jl Show resolved Hide resolved

jgreener64 force-pushed the master branch from 2423320 to 9feb5fa Compare August 13, 2025 17:52

ejmeitz added 2 commits August 13, 2025 13:58

bug in SHAKE constructor when no dist constraints passed

88dc18c

Merge branch 'master' into gpu-constraints

7c2cc9f

ejmeitz mentioned this pull request Aug 13, 2025

Allowing SHAKE/RATTLE constraints on GPU simulation #175

Closed

ejmeitz added 2 commits August 13, 2025 14:22

missing some subtype thingies

a256416

fix angle constraint changes

1a5dac2

ejmeitz merged commit 5142a67 into JuliaMolSim:master Aug 13, 2025
5 of 9 checks passed

GPU Constraints #200

GPU Constraints #200

Uh oh!

Conversation

ejmeitz commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ejmeitz commented Jun 13, 2025

Uh oh!

jgreener64 commented Jun 16, 2025

Uh oh!

ejmeitz commented Jun 27, 2025

Uh oh!

ejmeitz commented Jun 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgreener64 commented Jul 1, 2025

Uh oh!

ejmeitz commented Jul 1, 2025

Uh oh!

jgreener64 commented Jul 2, 2025

Uh oh!

ejmeitz commented Jul 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgreener64 commented Jul 5, 2025

Uh oh!

ejmeitz commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgreener64 commented Jul 9, 2025

Uh oh!

jgreener64 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ejmeitz commented Aug 13, 2025

Uh oh!

jgreener64 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jgreener64 commented Aug 13, 2025

Uh oh!

codecov bot commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

ejmeitz commented Jun 13, 2025 •

edited

Loading

ejmeitz commented Jun 30, 2025 •

edited

Loading

ejmeitz commented Jul 5, 2025 •

edited

Loading

ejmeitz commented Jul 8, 2025 •

edited

Loading

codecov bot commented Aug 13, 2025 •

edited

Loading