misc: Thread-safe cache rewrite #2683

Open · wants to merge 6 commits into main

Conversation

Contributor

@enwask enwask commented Jul 24, 2025

Rewrites memoized_meth and memoized_generator for concurrency, and drops the legacy memoized_func in favor of functools.cache. Also adds a global lock for the symbol cache to allow thread-safe Symbol construction and cache manipulation.

The memoized_meth decorator now stores one cache per thread, whereas memoized_generator stores a single cache for a given method (though there may still be misses if the cache is initialized concurrently). This means neither decorator provides a call-once guarantee.

Memoized generators block for the initial call of the generator function and construct a thread-safe version of itertools.tee, which allows concurrent iteration (but blocks when iterating elements that are not yet in the buffer). After the source generator is consumed, there is no blocking and threads can subsequently iterate the buffer in parallel (see SafeTee in tools.memoization).
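
For illustration, a minimal sketch of the shared-buffer tee idea (hypothetical names; the actual SafeTee in tools.memoization may differ in detail):

import threading
from itertools import count


class _SharedTee:
    # Sketch only: many consumers share one buffer over a single source
    # generator; buffered elements are read without locking

    def __init__(self, source):
        self._source = iter(source)
        self._buffer = []
        self._lock = threading.Lock()
        self._exhausted = False

    def tee(self):
        # Each consumer gets its own cursor over the shared buffer
        for i in count():
            if i < len(self._buffer):
                # Already buffered: no locking on this path
                yield self._buffer[i]
                continue
            with self._lock:
                # Re-check under the lock; another thread may have advanced
                while i >= len(self._buffer) and not self._exhausted:
                    try:
                        self._buffer.append(next(self._source))
                    except StopIteration:
                        self._exhausted = True
                if i >= len(self._buffer):
                    return
                item = self._buffer[i]
            yield item


# Usage: each caller of the memoized generator iterates its own tee()
shared = _SharedTee(x * x for x in range(5))
assert list(shared.tee()) == list(shared.tee()) == [0, 1, 4, 9, 16]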

@enwask enwask marked this pull request as ready for review July 24, 2025 15:05
@ggorman ggorman requested a review from Copilot July 25, 2025 11:30
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR rewrites the memoization system to be thread-safe, replacing the legacy memoized_func with Python's built-in functools.cache and redesigning memoized_meth and memoized_generator for concurrent access. Additionally, it introduces a global lock for the symbol cache to enable thread-safe symbol construction.

  • Replaces memoized_func with functools.cache throughout the codebase
  • Implements thread-local caching for memoized_meth with one cache per thread
  • Creates a thread-safe SafeTee implementation for memoized_generator that allows concurrent iteration
  • Adds global locking mechanism for symbol cache operations

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

File: Description
tests/test_tools.py: Adds comprehensive tests for concurrent behavior of memoized methods and generators
devito/types/caching.py: Introduces global lock for thread-safe symbol cache operations
devito/types/basic.py: Updates symbol construction to use the global cache lock
devito/tools/memoization.py: Complete rewrite of memoization decorators with thread-safety and SafeTee implementation
devito/arch/compiler.py: Replaces memoized_func with functools.cache
devito/arch/archinfo.py: Replaces memoized_func with functools.cache

Comments suppressed due to low confidence (3)

tests/test_tools.py:273

  • The test uses time.sleep() but doesn't import the time module. This will cause a NameError when the test runs.
                time.sleep(0.2)

tests/test_tools.py:330

  • The test uses time.sleep() but doesn't import the time module. This will cause a NameError when the test runs.
                time.sleep(0.25)

tests/test_tools.py:333

  • The test uses time.sleep() but doesn't import the time module. This will cause a NameError when the test runs.
                time.sleep(0.25)

Comment on lines 72 to 75
try:
# Try to retrieve the cached value
res = cache[key]
except KeyError:
Copilot AI Jul 25, 2025

[nitpick] Using try-except for cache lookup may be slower than if key in cache for the common case where keys are present. Consider using cache.get(key) with a sentinel value to avoid the exception overhead.

Suggested change
- try:
-     # Try to retrieve the cached value
-     res = cache[key]
- except KeyError:
+ # Use a sentinel to avoid exception overhead
+ sentinel = object()
+ res = cache.get(key, sentinel)
+ if res is sentinel:


Contributor

After re-reading the AI, I might agree. Backtraces that encounter caches which use exception handling for flow control are painful to read

Comment on lines +201 to +204
try:
# Try to retrieve the cached value
res = cache[key]
except KeyError:
Copilot AI Jul 25, 2025

[nitpick] Using try-except for cache lookup may be slower than if key in cache for the common case where keys are present. Consider using cache.get(key) with a sentinel value to avoid the exception overhead.

Suggested change
- try:
-     # Try to retrieve the cached value
-     res = cache[key]
- except KeyError:
+ # Use a sentinel to check for cache misses
+ _MISSING = object()
+ res = cache.get(key, _MISSING)
+ if res is _MISSING:


Contributor

@FabioLuporini FabioLuporini left a comment

at a bare minimum we need a new TestMultithreading batch in test_caching
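
Something along these lines, perhaps (a hypothetical sketch; the identity assertion assumes identically-constructed Symbols resolve to the same cached instance):

from concurrent.futures import ThreadPoolExecutor

from devito.types import Symbol  # assumed import path


class TestMultithreading:

    def test_concurrent_symbol_construction(self):
        with ThreadPoolExecutor(max_workers=8) as executor:
            symbols = list(executor.map(lambda i: Symbol(name='s'), range(64)))

        # With the symbol cache in effect, identically-constructed Symbols
        # should all resolve to the same cached instance, even when built
        # concurrently
        assert all(s is symbols[0] for s in symbols)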

it if necessary.
"""
# Try-catch is theoretically faster on the happy path
_local: local
Contributor

this is improvable... :)

self.func = func
# If the cache doesn't exist, initialize it
except AttributeError:
with self._lock:
Contributor

if it's thread-local why would you need a lock?

Contributor Author

The attribute itself contains thread-local data, but it still needs to exist on the instance; I think it still needs to be initialized safely, or one thread might erase existing thread-local storage for another thread.

Although on second thought, maybe the cost of that occasional clash is worth getting rid of the lock

Contributor

I see now.

Potentially easy workaround (brutal pseudocode just to give u the gist):

class ThreadSafeCache(dict):

    # {thread_id -> {key -> value}}

    def __init__(.......):
        # <this has to lead to a dict of dicts somehow...
        #  perhaps subclass defaultdict... not sure>

    def __getitem__(self, k, v=None):
        return self[threadid][k] ...

This way you need no lock

I had to do something similar in the dark age here https://github.com/devitocodes/devito/blob/main/devito/tools/timing.py#L17 for slightly different reasons but still multi-threading related
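
For illustration, a minimal runnable sketch of that idea (hypothetical names; threading.get_ident keys the outer mapping):

import threading
from collections import defaultdict


class ThreadKeyedCache:
    # Hypothetical sketch: an outer {thread_id -> {key -> value}} mapping,
    # so each thread only ever reads and writes its own sub-dict

    def __init__(self):
        self._data = defaultdict(dict)

    def __getitem__(self, key):
        return self._data[threading.get_ident()][key]

    def __setitem__(self, key, value):
        # Relies on the outer defaultdict insertion being safe under
        # concurrent modification, which is exactly the concern raised below
        self._data[threading.get_ident()][key] = value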

Contributor Author

Well then you need a lock for all accesses, no? Or at least for placing dictionaries in the top-level dict, since AFAIK that is not safe for concurrent modification



class SafeTee(Iterator[YieldType]):
Contributor

we could call this just tee and return python's tee (overriding __new__) if mono thread

Contributor Author

Feels a little gross to be honest. What's the purpose of doing that if SafeTee isn't needed outside of memoized generators?

Contributor

I mean, ultimately all I want is that single-threaded programs do not suffer from the inherent overheads of thread-safe data structures.

So, re-reading my comment, I'm guilty of mixing up syntactic sugar and suggestions; the suggestion would be falling back to plain tee if mono-thread (does it matter? again, the mono-thread overhead of switching to a thread-safe data structure is still unclear to me -- and hopefully none!)

Contributor Author

Yeah, regarding falling back: that's something I was going to go through and update for all of this code, to the extent that it's possible. Avoid any locking overhead if the GIL is enabled (since presumably we won't be trying to multithread anything with the GIL, as everything is basically interpreter-bound)

Contributor Author

The overhead is arguably negligible but definitely nonzero. I can try to do some profiling to see if it's worth eliminating when the GIL is enabled
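
For reference, a conditional no-op lock might look roughly like this (hypothetical helper; sys._is_gil_enabled only exists on Python 3.13+ builds, hence the getattr fallback):

import sys
import threading
from contextlib import nullcontext


def make_cache_lock():
    # On GIL builds the lock is pure overhead, so hand back a no-op context
    # manager; on free-threaded builds return a real re-entrant lock
    gil_enabled = getattr(sys, '_is_gil_enabled', lambda: True)()
    return nullcontext() if gil_enabled else threading.RLock()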

name = kwargs.pop('name', None) or args.pop(0)
newobj = cls.__xnew__(cls, name, **assumptions)
# Lock against the symbol cache and double-check the cache
with CacheManager.lock():
Contributor

I'm not comfortable with introducing thread-safety logic into the types hierarchy, because __new__ gets overridden often, so this might create unexpected issues and subsequently a burden on the developer. Why don't we move all this complexity into types/caching?

Secondly, if mono-thread, it should somehow fall back to the previous "unlocked cache" data structure, unless you tell me the overhead of acquiring and releasing a lock in a mono-thread run is practically zero

Contributor

Seconded on both of these points. Wrt the first, it is also worth considering that the integration of cache locks into __new__ effectively acts as a barrier to modifying and working on subclasses of Symbol (possibly preventing contributions), increases the chance for bugs to creep in through mistakes, and will generally uglify and complicate the codebase, since this change will need propagating down into subclasses
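
To sketch what moving the complexity into types/caching might look like (hypothetical helper with a plain-dict stand-in; the real _SymbolCache holds weakrefs and is more involved):

import threading

_cache_lock = threading.RLock()
_symbol_cache = {}  # stand-in for the real symbol cache


def get_or_create(key, factory):
    # Callers (Symbol.__new__ and its subclasses) never touch the lock;
    # they just hand over a factory for the object to build on a miss
    obj = _symbol_cache.get(key)
    if obj is not None:
        return obj
    with _cache_lock:
        # Double-check under the lock before constructing
        obj = _symbol_cache.get(key)
        if obj is None:
            obj = _symbol_cache[key] = factory()
        return obj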

@@ -76,8 +78,10 @@ def _cache_get(cls, key):
obj = obj_cached()
if obj is None:
# Cleanup _SymbolCache (though practically unnecessary)
# does not fail if it's already gone
_SymbolCache.pop(key, None)
with _cache_lock:
Contributor

this code you're changing is monumentally critical, and at this stage I cannot claim these changes are safe yet

Contributor Author

Yeah, was going to try to figure out some reasonable tests for this

if obj() is None:
# (key could be removed in another thread since get() above)
_SymbolCache.pop(key, None)
for key, obj_cached in cache_copied.items():
Contributor

this code you're changing is monumentally critical, and at this stage I cannot claim these changes are safe yet

return self + arg
Obj.add_to(1) # not enough arguments
Obj.add_to(1, 2) # returns 3, result is not cached
def __init__(self, meth: Callable[Concatenate[InstanceType, ParamsType],
Contributor

Tidy the typing up by putting Callable[Concatenate[InstanceType, ParamsType], ReturnType] on its own line as methodtype or similar

_local: local
try:
# Attempt to access the thread-local data
_local = obj._memoized_meth__local
Contributor

Maybe call it on_local? The double underscore in obj._memoized_meth__local is upsetting

Contributor Author

The double underscore is emulating name mangling (like, when a class Something declares an attribute __starting_with_double_underscores it's mangled to _Something__starting_with_double_underscores to avoid clashing). Can rename it, but I did want it to be clear that this attribute is assigned by memoized_meth if someone happens to inspect a live object later and wonder what the heck _on_local comes from
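
For reference, the standard mangling being emulated:

class Something:
    def __init__(self):
        self.__cache = {}  # mangled by the compiler to _Something__cache


obj = Something()
assert hasattr(obj, '_Something__cache')
# An attribute set from outside the class body (as a decorator does) gets no
# automatic mangling, hence spelling obj._memoized_meth__local out by hand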

return result
with self._lock:
# Check again in case another thread initialized outside the lock
if not hasattr(obj, '_memoized_generator__cache'):
Contributor

Nitpick: I think the double underscore can be reduced to a single one in this method name

Contributor Author

See reasoning above; this emulates name mangling but I guess just removing one of the underscores is not unreasonable

source_iter = self._meth(obj, *args, **kwargs)
res = cache[key] = SafeTee(source_iter)

return res.tee()
Contributor

Isn't this essentially copying the SafeTee you have just created? Why do you need to do that?

Contributor Author

Fair point, yeah we can just return the root SafeTee if we created it here


return newobj
# Store new instance in symbol cache
Cached.__init__(newobj, key)
Contributor

This is a fairly fundamental class in Devito and will need substantial testing to be sure that these changes are not breaking any edge cases

@@ -209,3 +211,156 @@ def __init__(self, value: int):
# Cache should be cleared after Operator construction
cache_size = Object._instance_cache.cache_info()[-1]
assert cache_size == 0


class TestMemoizedMethods:
Contributor

Do these tests get specifically run in the GIL-less CI?

Contributor

You should also probably have some tests to make sure this doesn't do anything weird when using _rebuild


codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 89.13043% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.52%. Comparing base (cd9058b) to head (8bb95f2).
⚠️ Report is 101 commits behind head on main.

Files with missing lines Patch % Lines
tests/test_tools.py 83.33% 17 Missing ⚠️
devito/tools/memoization.py 95.18% 1 Missing and 3 partials ⚠️
devito/types/basic.py 84.61% 1 Missing and 1 partial ⚠️
devito/types/caching.py 84.61% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2683      +/-   ##
==========================================
+ Coverage   78.59%   87.52%   +8.92%     
==========================================
  Files         245      245              
  Lines       49089    49230     +141     
  Branches     4322     4322              
==========================================
+ Hits        38582    43087    +4505     
+ Misses       9714     5408    -4306     
+ Partials      793      735      -58     


Contributor

@JDBetteridge JDBetteridge left a comment

I would like to see some more tests and maybe a bit more exposition. Maybe not a notebook, but a wiki page. Some reference for how all this is designed and supposed to work.

When I had to previously fix major bugs in a caching framework, most of what I was doing was fixing how people were incorrectly using the existing cache. There needs to be a readily available reference with examples, pitfalls and anti-patterns.

Most contributors (I expect) will have never had to deal with concurrent code execution or the principles that the whole paradigm is based on.

Comment on lines +22 to +23
def __init__(self, meth: Callable[Concatenate[InstanceType, ParamsType],
ReturnType]) -> None:
Contributor

Suggested change
- def __init__(self, meth: Callable[Concatenate[InstanceType, ParamsType],
-              ReturnType]) -> None:
+ def __init__(
+     self,
+     meth: Callable[Concatenate[InstanceType, ParamsType], ReturnType]
+ ) -> None:

Contributor

This would be my preferred style for all function signatures that are too long due to type hinting

Contributor Author

Black style! Yeah I have been sorta torn between these two styles but if there's a preference I'll go with this

Contributor

@EdCaunt EdCaunt Aug 7, 2025

I would personally just stick Callable[Concatenate[InstanceType, ParamsType], ReturnType] on its own line, like MethodType = Callable[Concatenate[InstanceType, ParamsType], ReturnType]
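
i.e. roughly (a sketch of the alias style; the decorator body is unchanged):

from typing import Callable, Concatenate, ParamSpec, TypeVar

InstanceType = TypeVar('InstanceType')
ParamsType = ParamSpec('ParamsType')
ReturnType = TypeVar('ReturnType')

# One alias keeps every annotated signature on a single readable line
MethodType = Callable[Concatenate[InstanceType, ParamsType], ReturnType]


class memoized_meth:

    def __init__(self, meth: MethodType) -> None:
        # ... rest of the decorator as in the PR
        self._meth = meth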

Retrieves the thread-local cache for the given object instance, initializing
it if necessary.
"""
# Try-catch is theoretically faster on the happy path
Contributor

Faster than what?

Contributor Author

Faster than two dictionary lookups

Contributor

add it maybe
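
For the record, the two lookup styles being weighed (illustrative only):

_MISSING = object()


def lookup_eafp(cache, key, compute):
    try:
        # Single dict lookup on a hit (the common case), but an exception,
        # and a noisier traceback, on every miss
        return cache[key]
    except KeyError:
        res = cache[key] = compute()
        return res


def lookup_sentinel(cache, key, compute):
    # Single lookup either way, no exception machinery involved
    res = cache.get(key, _MISSING)
    if res is _MISSING:
        res = cache[key] = compute()
    return res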

@memoized_meth
def compute(self, x):
self.misses += 1
return x * 2
Contributor

Can these deeply nested objects be fixtures?

Contributor Author

Yeah I was thinking of factoring them out to fixtures, will do that when I revisit more cache testing
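
e.g. something along these lines (hypothetical fixture, assuming memoized_meth is importable from devito.tools):

import pytest

from devito.tools import memoized_meth  # assumed import path


@pytest.fixture
def counting_obj():
    class Counting:
        def __init__(self):
            self.misses = 0

        @memoized_meth
        def compute(self, x):
            self.misses += 1
            return x * 2

    return Counting()


def test_memoized_meth_caches(counting_obj):
    # Second call with the same argument should hit the cache
    assert counting_obj.compute(3) == 6
    assert counting_obj.compute(3) == 6
    assert counting_obj.misses == 1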

A thread-safe version of `itertools.tee` that allows multiple iterators to safely
share the same buffer.

In theory, this comes at a cost to performance of iterating elements that haven't
Contributor

I started typing a long reply to this, but then I deleted it... I'll raise a question instead:

does this make data dependence analysis (Scope) run faster when executing with multiple threads, and what's the cutoff point? I imagine mono-thread will suffer from overhead

depending on the answer to this question, my next question would be: is this really the way we want to exploit multi-threading to parallelise data dependence analysis?

Generators are inherently sequential, and it feels like here we're forcing unnatural behavior... won't all threads clash by attempting to yield contiguous elements back to back? do you see what I mean here?

Contributor

I mean, part of the issue here is that we have to (want to) maintain the structure of the existing implementation based on generators -- an implementation that over the years has been extremely optimized for single-threaded applications. If multi-threading is all we had to worry about, we might as well forget about generators and use the thread pool to collectively construct the entire set of dependencies once and for all...

Contributor

another thought spurred by ☝️ :

we could have the thread pool construct a subset of Dependencies (e.g. each thread constructs K Dependencies at a time, possibly K=1) and then yield them one at a time

another orthogonal idea: we use N-1 threads to eagerly populate the Scope at instantiation time, so that when at the caller site we access any of the Scope attributes we find Dependencies already pre-built (rather than constructing them on the fly).

Contributor

in this latter idea, we would be creating a producer-consumer pattern. Whoever instantiates the Scope is the consumer, and instantiation fires up the producer thread(s) that are now in charge of populating the Scope, while the consumer runs its logic one element at a time
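
A bare-bones illustration of that producer-consumer shape (hypothetical names; not tied to the actual Scope API):

import queue
import threading

_DONE = object()


def populate_eagerly(build_dependency, items, nworkers=3):
    # Worker threads build "dependencies" in the background while the
    # consumer drains them one at a time, as it would when iterating a
    # Scope attribute
    q = queue.Queue()
    pending = list(items)
    pending_lock = threading.Lock()

    def producer():
        while True:
            with pending_lock:
                if not pending:
                    break
                item = pending.pop()
            q.put(build_dependency(item))
        q.put(_DONE)

    workers = [threading.Thread(target=producer) for _ in range(nworkers)]
    for w in workers:
        w.start()

    finished = 0
    while finished < nworkers:
        dep = q.get()
        if dep is _DONE:
            finished += 1
            continue
        yield dep

    for w in workers:
        w.join()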

Contributor Author

@enwask enwask Aug 6, 2025

Yeah, this is something I was thinking about a lot when I was rewriting the generator cache. I wanted to see if eagerly evaluating them once might make more sense for multi-threaded use.

My concern was that lots of the higher-level opportunities for parallelism are (or contain) methods that heavily utilize Scopes, so there might be potential for a deadlock if a bunch of threads are waiting for a Scope to be populated but there aren't any workers available to act as producers. So this might need some more complex machinery like a work queue I think? And I know how you feel about complexity
