
Add top-level mutex class #3739


Draft · d-torrance wants to merge 6 commits into development

Conversation

@d-torrance (Member) commented Apr 18, 2025

This week, @LukeOeding inquired on Zulip about the possibility of a parallelSum function. In order to make something like this work, we'd need some kind of mutex object to make sure that multiple threads aren't trying to modify the same thing at the same time.

We're already using a wrapper around pthread_mutex_t in the interpreter, so we export this to the top level as a new Mutex class. In addition to the constructor, there are three methods:

  • lock, which blocks until it can lock the mutex
  • tryLock, which tries to lock the mutex and raises an error if it can't
  • unlock, which unlocks the mutex
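
As a quick illustration of the interface (just a sketch; no session output shown):

m = new Mutex;
lock m;     -- blocks until the mutex is available, then locks it
unlock m;   -- releases it
tryLock m;  -- succeeds here, since the mutex is currently free
unlock m;   -- a second tryLock before this unlock would raise an error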

For example, here's a possible implementation of parallelSum using this new class:

i2 : parallelSum = (X, f, result) -> (
         mutex := new Mutex;
         T := apply(X, x -> schedule(() -> (      -- one task per element of X
                     y := f x;                    -- computed in parallel
                     lock mutex;                  -- serialize updates to result
                     result += y;
                     unlock mutex)));
         while not all(T, isReady) do null;       -- busy-wait until every task finishes
         result)

o2 = parallelSum

o2 : FunctionClosure

i3 : R = QQ[a..z]

o3 = R

o3 : PolynomialRing

i4 : parallelSum(gens R, x -> x^2, 0_R)

      2    2    2    2    2    2    2    2    2    2    2    2    2    2    2  
o4 = a  + b  + c  + d  + e  + f  + g  + h  + i  + j  + k  + l  + m  + n  + o  +
     --------------------------------------------------------------------------
      2    2    2    2    2    2    2    2    2    2    2
     p  + q  + r  + s  + t  + u  + v  + w  + x  + y  + z

o4 : R

@jkyang92 - Would you be willing to review this when you get a chance?

d-torrance requested a review from jkyang92 on April 18, 2025
@mahrud (Member) commented Apr 18, 2025

Is this actually faster than a sequential sum? The while not all(T, isReady) do null; makes me suspect not.

@d-torrance (Member, Author)

Yeah -- the while loop just waits until each of the tasks is done, but they're still running in parallel.

Here's a silly example:

i2 : sleepIdentity = x -> (sleep 1; x)

o2 = sleepIdentity

o2 : FunctionClosure

i3 : elapsedTime sum(1..10, sleepIdentity)
 -- 10.0012s elapsed

o3 = 55

i4 : elapsedTime parallelSum(1..10, sleepIdentity, 0)
 -- 2.36233s elapsed

o4 = 55

@mahrud (Member) commented Apr 18, 2025

Could you try a real example?

@d-torrance (Member, Author)

Here's a more interesting example:

i2 : R = QQ[x_0..x_8];

i3 : I = ideal random(R^4, R^{-2});

o3 : Ideal of R

i4 : J = ideal gens R;

o4 : Ideal of R

i5 : elapsedTime sum(gens R, f -> saturate(ideal I_*, f));
 -- 3.96602s elapsed

o5 : Ideal of R

i6 : elapsedTime parallelSum(gens R, f -> saturate(ideal I_*, f), ideal 0_R);
 -- 2.80849s elapsed

o6 : Ideal of R

@mahrud (Member) commented Apr 19, 2025

No, really -- you parallelized computing the entries, but the summing part is even slower:

i1 : parallelSum = (X, f, result) -> (
         mutex := new Mutex;
         T := apply(X, x -> schedule(() -> (
                     y := f x;
                     lock mutex;
                     result += y;
                     unlock mutex)));
         while not all(T, isReady) do null;
         result)

o1 = parallelSum

o1 : FunctionClosure

i2 : R = QQ[x_0..x_8];

i3 : I = ideal random(R^4, R^{-2});

o3 : Ideal of R

i4 : J = ideal gens R;

o4 : Ideal of R

i5 : L = apply(gens R, f -> saturate(ideal I_*, f));

i6 : benchmark "sum L" -- 0.007

o6 = .00718855184482758

o6 : RR (of precision 53)

i7 : benchmark "parallelSum(L, identity, ideal 0_R)"

o7 = .01879672231067965

o7 : RR (of precision 53)

And even in parallelizing the individual saturations, the mutexes and the while loop are hurting more than they're helping. For instance, using the await/async pattern without mutexes is faster:

i35 : elapsedTime apply(10, i -> sum await apply(gens R, async(f -> saturate(ideal I_*, f))));
 -- 26.623s elapsed

i36 : elapsedTime apply(10, i -> parallelSum(gens R, f -> saturate(ideal I_*, f), ideal 0_R));
 -- 30.9081s elapsed

@d-torrance (Member, Author)

Of course the summing part is slower -- there's the added overhead of dealing with the mutexes.

I'm not proposing that we add this particular parallelSum function to Core or anything, and maybe this was a bad example, since we can always compute the summands concurrently and then add them sequentially at the end without mutexes, as in your example.

But mutexes would be useful in parallel algorithms where the threads share some mutable data structure -- say, a queue in a breadth-first search.
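
For instance, here's a rough sketch of a worker pool draining a shared work queue, using the class from this PR (the pool size of 4 and the name parallelApply are just for illustration):

parallelApply = (X, f) -> (
    mutex := new Mutex;
    next := 0;                                  -- shared index into the work queue
    results := new MutableList from toList(#X : null);
    worker := () -> while true do (
        lock mutex;                             -- guard the shared counter
        i := next;
        next = next + 1;
        unlock mutex;
        if i >= #X then break;
        results#i = f X#i);                     -- each slot is written by exactly one task
    T := apply(4, j -> schedule worker);        -- fixed pool of 4 worker tasks
    while not all(T, isReady) do null;
    toList results)

Each worker pops the next index from the queue under the lock, so no two tasks ever process the same item, regardless of how the work is balanced.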

@mahrud (Member) commented Apr 19, 2025

Sure, I'm not opposed to adding mutexes, and in theory they could be very useful in things like capture to parallelize running examples and tests. But I want to see an example that demonstrates a realistic advantage in Macaulay2 first; otherwise it's hard to tell whether there are bugs or leaks under the surface.

@mahrud (Member) commented Apr 19, 2025

More specifically, I am worried about what happens when there is an error, or the code is interrupted, while a mutex is locked. Language-wise, too, I wonder whether what we want is the low-level pthread_mutex wrapper proposed here, or something higher level, like a keyword that prevents a piece of code from being executed by multiple threads simultaneously and automatically unlocks if the code is interrupted.

Here is an example of what I would rather have:

n = 0
f = x -> threadLock ( n += x )

where threadLock is a keyword that asks the parser to bind the resulting Code object to a mutex that is locked while the code is being evaluated and unlocked upon completion or interruption. The tricky part here is the interruption, because the only thing worse than sequentially computing something very time-consuming that should be parallel is crashing M2 while running it in parallel!
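
In terms of the low-level class proposed here, f = x -> threadLock ( n += x ) would roughly desugar to the following sketch, using try to release the lock on error (nMutex is a hypothetical name, and try does not handle interrupts -- which is exactly the hard part):

n = 0
nMutex = new Mutex
f = x -> (
    lock nMutex;
    try (n += x; unlock nMutex) else (
        unlock nMutex;                  -- release the lock even on error
        error "error while updating n"))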

@d-torrance (Member, Author)

Ooh, that's a cool idea! The current proposed implementation definitely has its downsides -- e.g., lock m; lock m will create an uninterruptible loop -- and a higher-level threadLock would help avoid that.

Converting to a draft for now.

d-torrance marked this pull request as draft on April 19, 2025
@d-torrance (Member, Author)

Closing this -- I have a working draft of a threadLock keyword. I still need to work out a few kinks (e.g., what happens when we cancel a task that currently holds the lock?), but I should have a PR soon.

@mahrud (Member) commented Jun 26, 2025

I frequently wish I could write blah = M -> threadLock M.cache.blah ??= (...) in such a way that, if blah M is called by many threads at the same time, only one thread runs the computation and caches the value, while the rest instantly return the cached value.
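
With the Mutex class from this PR, one could approximate this today with a double-checked caching sketch along these lines (blahMutex and expensiveComputation are hypothetical):

blahMutex = new Mutex
blah = M -> (
    if M.cache#?"blah" then M.cache#"blah"       -- fast path: value already cached
    else (
        lock blahMutex;
        r := if M.cache#?"blah"                  -- another thread may have won the race
            then M.cache#"blah"
            else M.cache#"blah" = expensiveComputation M;
        unlock blahMutex;
        r))

The second check under the lock is what guarantees the computation runs only once, while the unlocked first check keeps the common cached case cheap.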

@mahrud (Member) commented Jun 26, 2025

Actually, I think having top-level mutexes might be a pretty good solution for #3895 ...

d-torrance reopened this on Jul 1, 2025