Try using ChunkCodec as an HDF5 filter #1207

Draft: wants to merge 1 commit into base: master

Conversation

nhz2
Member

@nhz2 nhz2 commented Aug 10, 2025

This PR aims to explore how the ChunkCodecCore API could be used to implement an HDF5 filter, and to see what changes to the ChunkCodecCore API could make this easier.

@mkitti
Member

mkitti commented Aug 10, 2025

Could we implement this alongside the existing (as yet unreleased) extension rather than in place of it?

If I recall correctly, there are some version bounds that may restrict usage to Julia 1.10 or later.

@nhz2
Member Author

nhz2 commented Aug 10, 2025

Yes, I can try that. The next version of the ChunkCodec packages can work with Julia 1.6, so I'm not sure what the version bounds issue would be.

Another issue is that even though decoding with BZ2Codec is compatible with what the HDF5 filter was doing, there are differences in how concatenated compressed data is handled. Like the command-line tool bunzip2, BZ2Codec decoding accepts concatenated frames and returns the concatenation of their decompressed data.
Unlike bunzip2, BZ2Codec decoding will error if the compressed stream has invalid data appended to it.

From what I can tell, the HDF5 filter will only decode the first frame and ignore all data appended afterwards.
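
To make the three behaviours concrete, here is a rough sketch in Python using only the standard-library `bz2` module. It is an analogy, not the ChunkCodecs.jl or HDF5 filter code: the function names `decode_first_frame` and `decode_all_frames` and the `strict` flag are made up for illustration, and the mapping to each tool follows the descriptions in this comment rather than a reading of their sources.

```python
import bz2


def decode_first_frame(data: bytes) -> bytes:
    """Decode only the first bzip2 stream; anything after it is ignored.

    This mirrors the behaviour attributed to the HDF5 filter above.
    """
    d = bz2.BZ2Decompressor()
    out = d.decompress(data)  # raises OSError if the first stream is invalid
    if not d.eof:
        raise ValueError("truncated bzip2 stream")
    return out


def decode_all_frames(data: bytes, strict: bool = True) -> bytes:
    """Decode every concatenated bzip2 stream and join the results.

    strict=True  -> error on invalid trailing data (as described for BZ2Codec)
    strict=False -> stop quietly at invalid trailing data (bunzip2-like)
    """
    parts = []
    while data:
        d = bz2.BZ2Decompressor()
        try:
            chunk = d.decompress(data)
            valid = d.eof  # False if the stream ended prematurely
        except OSError:
            valid = False  # not a bzip2 stream at all
        if not valid:
            if strict or not parts:
                raise ValueError("invalid or truncated bzip2 data")
            break  # lenient mode: keep what was decoded so far
        parts.append(chunk)
        data = d.unused_data  # bytes that follow the end of this stream
    return b"".join(parts)


if __name__ == "__main__":
    two_frames = bz2.compress(b"hello ") + bz2.compress(b"world")
    print(decode_first_frame(two_frames))
    print(decode_all_frames(two_frames))
    print(decode_all_frames(two_frames + b"junk", strict=False))
```

A file written as a single frame decodes identically under all three policies; the results only diverge when extra frames or stray bytes follow the first frame, which is the compatibility question raised here.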

@mkitti
Member

mkitti commented Aug 10, 2025

I think it makes sense to have multiple implementations with distinct features and that we should be especially broad in terms of compatibility on the decoding side.

We should also raise these issues with HDF Group.

Something we might need to work out is the priority among multiple implementations of the same codec.

@nhz2
Member Author

nhz2 commented Aug 11, 2025

In general, I think we should have one excellent default filter implementation, and then make it easier for advanced users to use custom implementations with the https://juliaio.github.io/HDF5.jl/stable/interface/dataset/#Chunks API, which is currently a bit difficult to use correctly with filtered data.

This is probably not much better than the current bzip2 filter, so there isn't much point in merging this right now.

Also, I may want to change the return type of some of the low-level chunk codec functions; see JuliaIO/ChunkCodecs.jl#72.
