misc: Thread-safe cache rewrite #2683
base: main
Conversation
Pull Request Overview
This PR rewrites the memoization system to be thread-safe, replacing the legacy `memoized_func` with Python's built-in `functools.cache` and redesigning `memoized_meth` and `memoized_generator` for concurrent access. Additionally, it introduces a global lock for the symbol cache to enable thread-safe symbol construction.
- Replaces `memoized_func` with `functools.cache` throughout the codebase
- Implements thread-local caching for `memoized_meth`, with one cache per thread (sketched below)
- Creates a thread-safe `SafeTee` implementation for `memoized_generator` that allows concurrent iteration
- Adds a global locking mechanism for symbol cache operations
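To make the per-thread caching idea concrete, here is a minimal sketch of a decorator that keeps one cache per thread. It is illustrative only, not the PR's actual implementation (which attaches the thread-local storage to the instance); the name `memoized_meth_sketch` is hypothetical.

```python
import threading
from functools import wraps


def memoized_meth_sketch(meth):
    """Cache a method's results separately for each calling thread."""
    local = threading.local()

    @wraps(meth)
    def wrapper(self, *args, **kwargs):
        # Each thread lazily creates its own cache dict, so no locking is needed
        cache = getattr(local, 'cache', None)
        if cache is None:
            cache = local.cache = {}
        key = (id(self), args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = meth(self, *args, **kwargs)
        return cache[key]

    return wrapper
```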
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_tools.py | Adds comprehensive tests for concurrent behavior of memoized methods and generators |
| devito/types/caching.py | Introduces a global lock for thread-safe symbol cache operations |
| devito/types/basic.py | Updates symbol construction to use the global cache lock |
| devito/tools/memoization.py | Complete rewrite of memoization decorators with thread-safety and a SafeTee implementation |
| devito/arch/compiler.py | Replaces memoized_func with functools.cache |
| devito/arch/archinfo.py | Replaces memoized_func with functools.cache |
Comments suppressed due to low confidence (3)
- tests/test_tools.py:273: the test uses `time.sleep()` but doesn't import the `time` module, which will cause a `NameError` when the test runs (`time.sleep(0.2)`).
- tests/test_tools.py:330: the test uses `time.sleep()` but doesn't import the `time` module, which will cause a `NameError` when the test runs (`time.sleep(0.25)`).
- tests/test_tools.py:333: the test uses `time.sleep()` but doesn't import the `time` module, which will cause a `NameError` when the test runs (`time.sleep(0.25)`).
```python
try:
    # Try to retrieve the cached value
    res = cache[key]
except KeyError:
```
[nitpick] Using try-except for cache lookup may be slower than `if key in cache` for the common case where keys are present. Consider using `cache.get(key)` with a sentinel value to avoid the exception overhead.
Suggested change:

```diff
-try:
-    # Try to retrieve the cached value
-    res = cache[key]
-except KeyError:
+# Use a sentinel to avoid exception overhead
+sentinel = object()
+res = cache.get(key, sentinel)
+if res is sentinel:
```
After re-reading the AI, I might agree. Backtraces that encounter caches which use exception handling for flow control are painful to read
```python
try:
    # Try to retrieve the cached value
    res = cache[key]
except KeyError:
```
[nitpick] Using try-except for cache lookup may be slower than `if key in cache` for the common case where keys are present. Consider using `cache.get(key)` with a sentinel value to avoid the exception overhead.
Suggested change:

```diff
-try:
-    # Try to retrieve the cached value
-    res = cache[key]
-except KeyError:
+# Use a sentinel to check for cache misses
+_MISSING = object()
+res = cache.get(key, _MISSING)
+if res is _MISSING:
```
At a bare minimum we need a new `TestMultithreading` batch in `test_caching`.
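Such a batch could look roughly like the following sketch. It is only illustrative: it assumes the symbol cache keeps returning the same instance for identical constructions (as it does single-threaded), and the class and test names are hypothetical.

```python
import threading

from devito.types import Symbol


class TestMultithreading:

    def test_concurrent_symbol_construction(self):
        """Constructing the same Symbol from many threads should hit one cache entry."""
        results = []

        def build():
            results.append(Symbol(name='s'))

        threads = [threading.Thread(target=build) for _ in range(8)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Every thread should have received the same cached object
        assert all(s is results[0] for s in results)
```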
```python
it if necessary.
"""
# Try-catch is theoretically faster on the happy path
_local: local
```
this is improvable... :)
```python
self.func = func
# If the cache doesn't exist, initialize it
except AttributeError:
    with self._lock:
```
If it's thread-local, why would you need a lock?
The attribute itself contains thread-local data, but it still needs to exist on the instance; I think it still needs to be initialized safely, or one thread might erase existing thread-local storage for another thread.
Although, on second thought, maybe the cost of that occasional clash is worth getting rid of the lock.
I see now.

Potentially easy workaround (brutal pseudocode just to give you the gist):

```
class ThreadSafeCache(dict):
    # {thread_id -> {key -> value}}
    def __init__(...):
        <this has to lead to a dict of dicts somehow... perhaps subclass defaultdict... not sure>
    def __getitem__(self, k):
        return self[thread_id][k] ...
```

This way you need no lock.
I had to do something similar in the dark age here https://github.com/devitocodes/devito/blob/main/devito/tools/timing.py#L17 for slightly different reasons but still multi-threading related
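Fleshing that gist out, a runnable version might look like the following (an assumption on my part, not code from the PR): each thread only ever reads and writes its own inner dict, keyed by its thread id. Composition is used rather than subclassing `dict`, to avoid fighting with `defaultdict.__missing__`.

```python
from collections import defaultdict
from threading import get_ident


class ThreadSafeCache:
    """{thread_id -> {key -> value}}: one plain dict per thread, no locking."""

    def __init__(self):
        # Missing thread ids automatically get a fresh inner dict
        self._data = defaultdict(dict)

    def _mine(self):
        # The inner dict owned by the calling thread
        return self._data[get_ident()]

    def __getitem__(self, k):
        return self._mine()[k]

    def __setitem__(self, k, v):
        self._mine()[k] = v

    def __contains__(self, k):
        return k in self._mine()
```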
Well, then you need a lock for all accesses, no? Or at least to place dictionaries in the top-level dict, since AFAIK that is not safe for concurrent modification.
```python
class SafeTee(Iterator[YieldType]):
```
We could call this just `tee` and return Python's `tee` (overriding `__new__`) if mono-thread.
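As a sketch of that idea (my own illustration, not code from this PR; `_multithreaded()` is a hypothetical predicate): `__new__` may return any object, so the class can hand back the stdlib iterator and skip its own machinery entirely when thread-safety is not needed.

```python
import itertools
import threading
from collections.abc import Iterator


def _multithreaded() -> bool:
    # Hypothetical check for "do we actually need thread-safety?"
    return threading.active_count() > 1


class tee(Iterator):
    """Falls back to the stdlib tee when only one thread is running."""

    def __new__(cls, iterable):
        if not _multithreaded():
            # Returning a non-instance from __new__ skips __init__ entirely,
            # so the single-threaded path pays no locking overhead
            return itertools.tee(iterable, 1)[0]
        return super().__new__(cls)

    def __init__(self, iterable):
        # Thread-safe path (details elided): a lock guards the shared iterator
        self._lock = threading.Lock()
        self._it = iter(iterable)

    def __next__(self):
        with self._lock:
            return next(self._it)
```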
Feels a little gross, to be honest. What's the purpose of doing that if `SafeTee` isn't needed outside of memoized generators?
I mean, ultimately all I want is that single-threaded programs do not suffer from the inherent overheads of thread-safe data structures.
So, re-reading my comment, I'm guilty of mixing up syntactic sugar and suggestions; the suggestion would be falling back to plain `tee` if mono-thread (does it matter? again, the mono-thread overhead of switching to a thread-safe data structure is still unclear to me -- and hopefully none!)
Yeah, I mean, with falling back: that's something I was going to go through and update for all of this code, to the extent that it's possible, to avoid any locking overhead if the GIL is enabled (since presumably we won't be trying to multithread anything with the GIL, as everything is basically interpreter-bound).
The overhead is arguably negligible but definitely nonzero. I can try to do some profiling to see if it's worth eliminating when the GIL is enabled
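One possible way to do that check, assuming CPython 3.13+ where `sys._is_gil_enabled()` exists (older versions always have a GIL, hence the fallback), and purely as a sketch of the dispatch rather than anything in this PR:

```python
import sys

# True on a GIL build (or a free-threaded build running with the GIL re-enabled)
GIL_ENABLED = getattr(sys, '_is_gil_enabled', lambda: True)()


def pick_impl(locked_impl, unlocked_impl):
    # With the GIL, individual dict operations cannot interleave across threads,
    # so the cheaper unlocked implementation is assumed to be sufficient
    return unlocked_impl if GIL_ENABLED else locked_impl
```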
```python
name = kwargs.pop('name', None) or args.pop(0)
newobj = cls.__xnew__(cls, name, **assumptions)
# Lock against the symbol cache and double-check the cache
with CacheManager.lock():
```
I'm not comfortable with introducing thread-safety logic into the types hierarchy, because `__new__` gets overridden often, so this might create unexpected issues and a subsequent burden on the developer. Why don't we move all this complexity into `types/caching`?
Secondly, if mono-thread, it should somehow fall back to the previous "unlocked cache" data structure, unless you tell me the overhead of acquiring and releasing a lock with mono-thread is practically zero.
Seconded on both of these points. Wrt the first, it is also worth considering that the integration of cache locks into `__new__` effectively acts as a barrier to modifying and working on subclasses of `Symbol` (possibly preventing contributions), increases the chance for bugs to creep in through mistakes, and will generally uglify and complicate the codebase, since this change will need propagating down into subclasses.
```
@@ -76,8 +78,10 @@ def _cache_get(cls, key):
    obj = obj_cached()
    if obj is None:
        # Cleanup _SymbolCache (though practically unnecessary)
        # does not fail if it's already gone
        _SymbolCache.pop(key, None)
    with _cache_lock:
```
this code you're changing is monumentally critical, and at this stage I cannot claim these changes are safe yet
Yeah, was going to try to figure out some reasonable tests for this
```python
if obj() is None:
    # (key could be removed in another thread since get() above)
    _SymbolCache.pop(key, None)
for key, obj_cached in cache_copied.items():
```
this code you're changing is monumentally critical, and at this stage I cannot claim these changes are safe yet
```python
return self + arg
Obj.add_to(1)  # not enough arguments
Obj.add_to(1, 2)  # returns 3, result is not cached
def __init__(self, meth: Callable[Concatenate[InstanceType, ParamsType],
```
Tidy the typing up by putting `Callable[Concatenate[InstanceType, ParamsType], ReturnType]` on its own line as `methodtype` or similar.
```python
_local: local
try:
    # Attempt to access the thread-local data
    _local = obj._memoized_meth__local
```
Maybe call it `on_local`? The double underscore in `obj._memoized_meth__local` is upsetting.
The double underscore is emulating name mangling (like, when a class `Something` declares an attribute `__starting_with_double_underscores`, it's mangled to `_Something__starting_with_double_underscores` to avoid clashing). Can rename it, but I did want it to be clear that this attribute is assigned by `memoized_meth`, in case someone happens to inspect a live object later and wonders where the heck `_on_local` comes from.
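For reference, the mangling mechanism being emulated here is standard Python behaviour:

```python
class Something:
    def __init__(self):
        self.__cache = {}  # actually stored as _Something__cache


obj = Something()
assert hasattr(obj, '_Something__cache')  # the mangled name is what exists
assert not hasattr(obj, '__cache')        # the unmangled name does not
```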
```python
return result
with self._lock:
    # Check again in case another thread initialized outside the lock
    if not hasattr(obj, '_memoized_generator__cache'):
```
Nitpick: I think the double underscore can be reduced to a single one in this method name
See reasoning above; this emulates name mangling but I guess just removing one of the underscores is not unreasonable
```python
source_iter = self._meth(obj, *args, **kwargs)
res = cache[key] = SafeTee(source_iter)

return res.tee()
```
Isn't this essentially copying the `SafeTee` you have just created? Why do you need to do that?
Fair point, yeah, we can just return the root `SafeTee` if we created it here.
```python
return newobj
# Store new instance in symbol cache
Cached.__init__(newobj, key)
```
This is a fairly fundamental class in Devito and will need substantial testing to be sure that these changes are not breaking any edge cases
```
@@ -209,3 +211,156 @@ def __init__(self, value: int):
    # Cache should be cleared after Operator construction
    cache_size = Object._instance_cache.cache_info()[-1]
    assert cache_size == 0


class TestMemoizedMethods:
```
Do these tests get specifically run in the GIL-less CI?
You should also probably have some tests to make sure this doesn't do anything weird when using `_rebuild`.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2683      +/-   ##
==========================================
+ Coverage   78.59%   87.52%   +8.92%
==========================================
  Files         245      245
  Lines       49089    49230     +141
  Branches     4322     4322
==========================================
+ Hits        38582    43087    +4505
+ Misses       9714     5408    -4306
+ Partials      793      735      -58
```

☔ View full report in Codecov by Sentry.
I would like to see some more tests and maybe a bit more exposition. Maybe not a notebook, but a wiki page. Some reference for how all this is designed and supposed to work.
When I had to previously fix major bugs in a caching framework, most of what I was doing was fixing how people were incorrectly using the existing cache. There needs to be a readily available reference with examples, pitfalls and anti-patterns.
Most contributors (I expect) will have never had to deal with concurrent code execution or the principles that the whole paradigm is based on.
```python
def __init__(self, meth: Callable[Concatenate[InstanceType, ParamsType],
             ReturnType]) -> None:
```
Suggested change:

```diff
-def __init__(self, meth: Callable[Concatenate[InstanceType, ParamsType],
-             ReturnType]) -> None:
+def __init__(
+    self,
+    meth: Callable[Concatenate[InstanceType, ParamsType], ReturnType]
+) -> None:
```
This would be my preferred style for all function signatures that are too long due to type hinting
Black style! Yeah I have been sorta torn between these two styles but if there's a preference I'll go with this
I would personally just stick `Callable[Concatenate[InstanceType, ParamsType], ReturnType]` on its own line, like `MethodType = Callable[Concatenate[InstanceType, ParamsType], ReturnType]`.
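That suggestion spelled out, as a sketch (the alias name `MethodType` is just the one proposed above, the type variables mirror those already used in the PR, and `memoized_meth_sketch` is a hypothetical stand-in for the decorator class):

```python
from typing import Callable, Concatenate, ParamSpec, TypeVar

InstanceType = TypeVar('InstanceType')
ParamsType = ParamSpec('ParamsType')
ReturnType = TypeVar('ReturnType')

# The alias keeps long signatures on a single line
MethodType = Callable[Concatenate[InstanceType, ParamsType], ReturnType]


class memoized_meth_sketch:
    def __init__(self, meth: MethodType) -> None:
        self._meth = meth
```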
```python
Retrieves the thread-local cache for the given object instance, initializing
it if necessary.
"""
# Try-catch is theoretically faster on the happy path
```
Faster than what?
Faster than two dictionary lookups
add it maybe
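For context, the two lookup styles being weighed are roughly the following (illustrative only; `compute` stands in for the memoized call):

```python
cache, key = {'k': 42}, 'k'
compute = lambda k: 42

# EAFP: one dict lookup on a hit; exception machinery only on a miss
try:
    res = cache[key]
except KeyError:
    res = cache[key] = compute(key)

# LBYL: two dict lookups on a hit, but no exception handling on a miss
if key in cache:
    res = cache[key]
else:
    res = cache[key] = compute(key)
```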
```python
@memoized_meth
def compute(self, x):
    self.misses += 1
    return x * 2
```
Can these deeply nested objects be fixtures?
Yeah I was thinking of factoring them out to fixtures, will do that when I revisit more cache testing
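A sketch of what that refactor could look like, based only on the `compute` snippet quoted above (the fixture and class names are hypothetical):

```python
import pytest

from devito.tools import memoized_meth


@pytest.fixture
def counted_obj():
    """An object whose memoized method counts cache misses."""
    class Counted:
        def __init__(self):
            self.misses = 0

        @memoized_meth
        def compute(self, x):
            self.misses += 1
            return x * 2

    return Counted()


def test_compute_is_memoized(counted_obj):
    assert counted_obj.compute(3) == 6
    assert counted_obj.compute(3) == 6
    # Only the first call should miss the cache
    assert counted_obj.misses == 1
```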
```python
A thread-safe version of `itertools.tee` that allows multiple iterators to safely
share the same buffer.

In theory, this comes at a cost to performance of iterating elements that haven't
```
I started typing a long reply to this, but then I deleted it... I'll raise a question instead:
Does this make data dependence analysis (Scope) run faster when executing with multiple threads, and what's the cutoff point? I imagine mono-thread will suffer from overhead.
Depending on the answer to this question, my next question would be: is this really the way we want to exploit multi-threading to parallelise data dependence analysis?
Generators are inherently sequential, and it feels like here we're forcing unnatural behavior... won't all threads clash by attempting to yield contiguous elements back to back? Do you see what I mean here?
I mean, part of the issue here is that we have to (want to) maintain the structure of the existing implementation based on generators -- an implementation that over the years has been extremely optimized for single-threaded applications. If multi-threading were all we had to worry about, we might as well forget about generators and use the thread pool to collectively construct the entire set of dependencies once and for all...
Another thought spurred by ☝️:
We could have the thread pool construct a subset of Dependencies (e.g. each thread constructs K Dependencies at a time, possibly K=1) and then yield them one at a time.
Another orthogonal idea: we use N-1 threads to eagerly populate the Scope at instantiation time, so that when at the caller site we access any of the Scope attributes, we find Dependencies already pre-built (rather than constructing them on the fly).
In this latter idea, we would be creating a producer-consumer pattern. Whoever instantiates the Scope is the consumer, and instantiation fires up the producer thread(s), which are now in charge of populating the Scope, while the consumer runs its logic one element at a time.
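A bare-bones sketch of that producer-consumer shape (purely illustrative; `build_dependency` and the item source are hypothetical stand-ins for the actual Scope machinery):

```python
import queue
import threading

_DONE = object()  # sentinel each producer emits when it runs out of work


def populate(items, build_dependency, n_producers=3):
    """Producers build dependencies in the background; the caller consumes lazily."""
    out = queue.Queue()
    pending = iter(items)
    lock = threading.Lock()

    def producer():
        while True:
            with lock:
                item = next(pending, _DONE)
            if item is _DONE:
                out.put(_DONE)
                return
            out.put(build_dependency(item))

    for _ in range(n_producers):
        threading.Thread(target=producer, daemon=True).start()

    finished = 0
    while finished < n_producers:
        dep = out.get()
        if dep is _DONE:
            finished += 1
        else:
            yield dep
```

Here the producers share the upstream iterator under a small lock and the consumer stays a plain generator, which would keep the call site unchanged.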
Yeah, this is something I was thinking about a lot when I was rewriting the generator cache. I wanted to see if eagerly evaluating them once might make more sense for multi-threaded use.
My concern was that lots of the higher-level opportunities for parallelism are (or contain) methods that heavily utilize `Scope`s, so there might be potential for a deadlock if a bunch of threads are waiting for a `Scope` to be populated but there aren't any workers available to act as producers. So this might need some more complex machinery like a work queue, I think? And I know how you feel about complexity.
Rewrites `memoized_meth` and `memoized_generator` for concurrency, as well as dropping the legacy `memoized_func` in favor of `functools.cache`. Also adds a global lock for the symbol cache, allowing thread-safe `Symbol` construction and cache manipulation.

The `memoized_meth` decorator now stores one cache per thread, whereas `memoized_generator` stores a single cache for a given method (though there may still be misses if the cache is initialized concurrently). This means neither method cache has a call-once guarantee.

Memoized generators block for the initial call of the generator function and construct a thread-safe version of `itertools.tee`, which allows for concurrent iteration (but blocks when iterating elements that are not yet in the buffer). After the source generator is consumed, there is no blocking, and subsequently threads can iterate the buffer in parallel (see `SafeTee` in `tools.memoization`).
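To make the described behaviour concrete, here is a small usage sketch (the `Expensive` class and the thread choreography are my own illustration; only `memoized_generator` is assumed to be importable from `devito.tools`):

```python
import threading

from devito.tools import memoized_generator


class Expensive:

    @memoized_generator
    def values(self):
        # Lazily produced sequence; the underlying generator typically runs once
        for i in range(5):
            yield i * i


obj = Expensive()
results = []


def consume():
    # Each caller gets its own iterator over the shared, lock-protected buffer
    results.append(list(obj.values()))


threads = [threading.Thread(target=consume) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert all(r == [0, 1, 4, 9, 16] for r in results)
```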