Update for new version of HF transformers. #104
+14 −14
We've recently merged a layer-wise refactor of the cache system in Transformers: huggingface/transformers#39106.
While testing your repo for compatibility, I had to adapt parts of the code to the new interface. To help with the migration, I've included my changes below. These are not intended as a full PR (I've only tested a small subset), but they should serve as a helpful guide.
Some updates are deprecations (e.g., `cache.key_cache[i]` is still supported via a backward-compatibility layer, though `cache.layers[i].keys` is preferred). However, there are also breaking changes, particularly in private attributes: for example, `cache._quantized_key_cache` is now `cache.cache_processor._quantized_keys`.
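For downstream code that needs to run on both versions, here is a minimal sketch of what the rename looks like. The `keys`/`key_cache` attributes are the ones quoted above; the matching `values`/`value_cache` attributes and the `get_layer_kv` helper are my own illustrative assumptions, not part of the official API:

```python
# Minimal sketch, not tested against the final release: `get_layer_kv` is a
# hypothetical helper, and the `values` / `value_cache` attributes are assumed
# to mirror the `keys` / `key_cache` attributes mentioned above.
def get_layer_kv(past_key_values, layer_idx):
    if hasattr(past_key_values, "layers"):
        # New layer-wise interface on transformers main (post #39106).
        layer = past_key_values.layers[layer_idx]
        return layer.keys, layer.values
    # Legacy interface, still reachable through the deprecation shim.
    return (
        past_key_values.key_cache[layer_idx],
        past_key_values.value_cache[layer_idx],
    )
```

Feature-detecting with `hasattr` avoids pinning to a specific version string while both interfaces are in circulation.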
I also encountered some CUDA illegal memory access errors, which I suspect are related to huggingface/transformers#39474 and to the contiguous-memory requirements of FlashAttention v2.
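On the FlashAttention side, this is a hedged sketch of the kind of guard I tried; whether non-contiguous key/value slices are really the cause of the illegal memory access is only a suspicion, and `safe_flash_attn` is just an illustrative wrapper:

```python
from flash_attn import flash_attn_func  # FlashAttention v2

def safe_flash_attn(q, k, v, causal=True):
    # FlashAttention v2 kernels expect contiguous (batch, seqlen, n_heads, head_dim)
    # tensors; key/value slices pulled out of the refactored cache are not
    # guaranteed to be contiguous, which could explain the illegal memory access.
    q, k, v = (t.contiguous() for t in (q, k, v))
    return flash_attn_func(q, k, v, causal=causal)
```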
In short, the upcoming Transformers release introduces necessary but potentially breaking changes that may impact this repo. I recommend testing against the `main` branch, and I'm happy to help if further issues come up.