-
Notifications
You must be signed in to change notification settings - Fork 58
GPU Constraints #200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU Constraints #200
Conversation
Also I tried to be clever and sort the atoms before hand to remove uncoalesced memory accesses, but the sorting seems to dwarf the benefit from the coalesced reads. This implementation is really not clever at all, but I also do not see room to be clever. Anyway the constraints should be a tiny part of the total cost, so its better they are on GPU to mitigate data moving from CPU to GPU during the simulation. |
Great. I can't run the branch at the minute due to a missing import of StructArrays and a missing type parameter. Could you commit the working version? |
@jgreener64 The 3-atom constraints test works as well now. This means the 4-atom constraints (1 central + 3 bonds) are un-tested (but should work as the kernels are similar). There is also still some work during the setup to get angle constraints supported, but the shake kernel is done and should work. |
Ok all the code is written and seems to work. There are tests for diatomic, triatomic, 4 atom, angle, and angle + triatomic at the same time. None of these tests are on GPUs. Things I don't love about the existing implementation:
Things that would be worth checking:
Todo:
|
Great, I'll try it this week. Do you have access to a GPU to test on? |
Yeah I have access to an AMD and NVIDIA GPU. Just never used Molly with GPU so was easier to test on CPU. Out of town this week, but can test code on another system if you'd like. There's a good chance some atray isn't moved to the GPU that should be but hopefully not! |
I wouldn't worry about performance for things that only run once. I get the following error with CUDA:
If you could try on GPU, and adapt the tests to also run on GPU where available (see I also needed The branchless functions might be replaceable with |
I'll work on the first thing. I think I probably need to just use Where did you need the Wouldn't the ternary operator introduce warp divergence? I'm not sure what the By the way it looks like you were using |
Ah I meant Fair enough about the constraints being on CPU. We could move them back to CPU for setup if they are on the GPU, so the user isn't confused by the error if they provide them on the GPU. |
Ok clusters are I also want to note that if any clusters are not used the type is resolved to TODO:
POSSIBLE OPTIMIZATIONS:
OPPORTUNITIES TO CLEAN CODE:
|
I doubt I would profile before optimisation, since I am often surprised where the time is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of work, well done for diving in. Here are some initial comments, I'll look more once these are addressed but overall it seems good.
Here are some Float32
timings for 1UBQ with no solvent (1,231 atoms):
CPU 1 thread no_constraints
4.860738971 ms/step
4.781459008 ms/step
4.842652095 ms/step
CPU 1 thread hydrogen_constraints
6.098673073 ms/step
6.717082253 ms/step
6.113765265 ms/step
CPU 16 thread no_constraints
3.844828712 ms/step
3.984333143 ms/step
3.91534444 ms/step
CPU 16 thread hydrogen_constraints
4.684524647 ms/step
4.88355524 ms/step
4.684905093 ms/step
CUDA no_constraints
0.313419713 ms/step
0.297092759 ms/step
0.282583017 ms/step
CUDA hydrogen_constraints
0.48397155 ms/step
0.486101783 ms/step
0.483098682 ms/step
And some Float32
timings for solvated 6MRR (15,954 atoms):
CPU 1 thread no_constraints
77.809855056 ms/step
78.834513861 ms/step
79.802540967 ms/step
CPU 1 thread hydrogen_constraints
88.181474596 ms/step
87.49463656 ms/step
88.946655989 ms/step
CPU 16 thread no_constraints
25.219622083 ms/step
24.691678186 ms/step
24.707366003 ms/step
CPU 16 thread hydrogen_constraints
29.048901858 ms/step
29.235323313 ms/step
29.40364796 ms/step
CUDA no_constraints
1.864682376 ms/step
1.89919547 ms/step
1.918378767 ms/step
CUDA hydrogen_constraints
2.199372642 ms/step
2.228519373 ms/step
2.307705759 ms/step
The GPU is an A6000 and here just bonds to hydrogen are constrained, not water angles. Overall I think performance is okay for now and may improve if some of the Float64
promotion is fixed.
ustrip charge in SimpleCrystals constructor
AtomsBase interface requires this I think. At the very least, I tried using an EAM potential through ASEconvert and it calls this function....
add charge(system, ::Colon)
@jgreener64 This should be good to go. I added kernels to check the velocity/distance constraints on CPU+GPU. I also re-wrote how the constraint data is stored. I thought it would make things faster, but on my system it did not really have much impact. I still have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, there are a few changes I want to make here but I can do it later.
Just one query about the StructArrays dependency.
Also you will have to merge in the ewald
branch changes I just pushed to master, sorry about that.
Okay, merge once |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #200 +/- ##
==========================================
+ Coverage 68.83% 70.38% +1.55%
==========================================
Files 39 41 +2
Lines 6237 6777 +540
==========================================
+ Hits 4293 4770 +477
- Misses 1944 2007 +63 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This PR (WIP) implements SHAKE/RATTLE constraints on GPU using the M-SHAKE algorithm. It also has angle constraints!
The traditional SHAKE implementation is not possible to parallelize easily. M-SHAKE is parallelizable but comes at the cost of the types of constraints you can have. The "fast" implementation of M-SHAKE uses an analytical matrix inverse and is therefore limited to clusters of 3 constraints or less. I do not really see this as a problem. Typically you only constrain H-bonds or the bonds of the solvent molecules. Furthermore, bonds along the back bond are not constrainable due to the assumption of a "central" atom in my current implementation. This can be removed in the future, but I think implementing rigid body dynamics is the correct way to constraint a backbone and not SHAKE.
The types of clusters supported:
This leaves out things like CH4.
The current implementation is complete for all RATTLE variants and for SHAKE with a single bond as these are all analytical solutions and do not require iteration. The 3 remaining SHAKE kernels require iteration and I am not sure how best to handle that as load balancing is important for GPU performance. Open to suggestions as to how that should be done.
Some extra work will be required to actually support angle constraints. Maybe we leave this to another PR, and I just throw an error if I detect angle constraints.