This repository was archived by the owner on Jun 22, 2025. It is now read-only.

Refactor Bulk Insert to Use Memory<object> for Enhanced Flexibility and Efficiency #628

Open
wants to merge 6 commits into base: main

Conversation

@xwwwx xwwwx commented Apr 22, 2025

🔧 Summary
This PR refactors the ClickHouse bulk insert mechanism to leverage Memory<object> and IMemoryOwner<Memory<object>>, replacing the previous object[][]-based implementation.

🧾 Key Changes

  • Batch.Rows

    • Changed from object[][] to IMemoryOwner<Memory<object>>

    • Updated Dispose() to properly release owned memory

  • WriteToServerAsync

    • Added support for IEnumerable<Memory<object>>

    • Existing overloads (object[], DataTable, etc.) now internally convert to Memory<object>

  • IntoBatches

    • Refactored to use MemoryPool<T> instead of ArrayPool<T>

    • Returns IMemoryOwner<T> for improved memory safety

  • Row Serializers (IRowSerializer, RowBinarySerializer, RowBinaryWithDefaultsSerializer)

    • Refactored to use Span<object> instead of object[]

    • Enables zero-allocation serialization where possible
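As a rough sketch, the refactored surface described above might look like the following. The signatures and type names here are illustrative only (`SerializeRow`, `BulkCopySketch`, and the `BinaryWriter` parameter are assumptions), not the PR's exact code:

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: serializers consume rows as Span<object>,
// so a slice of a pooled buffer can be written without copying.
public interface IRowSerializer
{
    void SerializeRow(Span<object> row, BinaryWriter writer);
}

public class BulkCopySketch
{
    // Hypothetical overload accepting Memory<object> rows directly;
    // object[] converts implicitly to Memory<object>, so the existing
    // object[]-based overloads can simply forward to this one.
    public Task WriteToServerAsync(
        IEnumerable<Memory<object>> rows,
        CancellationToken token = default)
    {
        foreach (Memory<object> row in rows)
        {
            Span<object> span = row.Span; // a view, not a copy
            // ... serialize span via an IRowSerializer ...
        }
        return Task.CompletedTask;
    }
}
```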

Benefits

  • Improved Flexibility
    Allows upstream systems to pass reusable or pooled Memory<object> buffers directly.

  • Better Memory Management
    Controlled lifetime via IMemoryOwner<T>, reduced GC pressure, and support for memory pooling.

  • Zero Allocation Serialization
    Using Span<object> avoids intermediate heap allocations when writing rows.

  • Backward Compatibility Maintained
    Existing interfaces remain usable; new overloads provide better performance paths.

📌 Motivation
This refactor lays the foundation for more scalable, high-performance bulk insert scenarios. It will enable cleaner integration with upstream systems that use buffer pooling or stream processing architectures.

@xwwwx xwwwx requested a review from DarkWanderer as a code owner April 22, 2025 05:23

codecov bot commented Apr 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.56%. Comparing base (0d4b060) to head (722904a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #628      +/-   ##
==========================================
+ Coverage   82.18%   82.56%   +0.38%     
==========================================
  Files         103      103              
  Lines        2245     2243       -2     
  Branches      340      339       -1     
==========================================
+ Hits         1845     1852       +7     
+ Misses        275      264      -11     
- Partials      125      127       +2     

☔ View full report in Codecov by Sentry.
@DarkWanderer
Owner

Hi,

Thank you for your PR, I appreciate the intent to help with optimization. However, according to benchmarks it doesn't improve performance and in fact makes it slightly worse:

With ArrayPool:

| Method | Count | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| BulkInsertInt32 | 100000 | 32.02 ms | 0.503 ms | 0.618 ms | 812.5000 | 812.5000 | 812.5000 | 8.14 MB |

With MemoryPool:

| Method | Count | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| BulkInsertInt32 | 100000 | 33.02 ms | 0.636 ms | 1.285 ms | 906.2500 | 906.2500 | 906.2500 | 8.27 MB |

The primary cause for that seems to be that MemoryPool is actually backed by ArrayPool internally: https://stackoverflow.com/a/61859719/1732138
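A small illustration of that point, assuming standard `System.Buffers` behavior: `MemoryPool<T>.Shared` rents its memory from `ArrayPool<T>.Shared` and additionally hands back an `IMemoryOwner<T>` wrapper per rent, so it cannot out-allocate the array pool it wraps:

```csharp
using System;
using System.Buffers;

// Renting straight from ArrayPool: no per-rent wrapper object.
object[] array = ArrayPool<object>.Shared.Rent(1024);
try
{
    // ... fill and use the array ...
}
finally
{
    ArrayPool<object>.Shared.Return(array);
}

// Renting from MemoryPool.Shared: the returned IMemoryOwner<object>
// is itself a small heap allocation wrapping a pooled array, and
// disposing it returns the array to the underlying ArrayPool.
using (IMemoryOwner<object> owner = MemoryPool<object>.Shared.Rent(1024))
{
    Memory<object> memory = owner.Memory; // view over the pooled array
}
```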

But a good attempt nevertheless

@xwwwx
Author

xwwwx commented Apr 23, 2025

Thank you for the feedback!

The main motivation behind this change is not to improve performance per se, but to improve flexibility for upstream callers.

In our use case, we use a pooled object[] array and slice it into variable-sized segments for bulk insert. Since the segment size isn't fixed, using Memory allows us to safely represent those slices without allocating new arrays. This helps us reuse memory buffers efficiently across multiple insert operations.
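For illustration, that slicing pattern looks roughly like this (buffer and slice sizes are made up):

```csharp
using System;
using System.Buffers;

// One pooled buffer, sliced into variable-sized batches; no new
// arrays are allocated for the slices themselves.
object[] buffer = ArrayPool<object>.Shared.Rent(4096);
try
{
    Memory<object> whole = buffer;                 // implicit conversion
    Memory<object> batchA = whole.Slice(0, 100);   // first 100 cells
    Memory<object> batchB = whole.Slice(100, 250); // next 250 cells
    // Both slices are views over the same pooled array and can be
    // handed to a Memory<object>-accepting bulk insert overload.
}
finally
{
    ArrayPool<object>.Shared.Return(buffer);
}
```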

I agree that Memory may introduce some overhead compared to object[][], and it's definitely not zero-allocation. However, for us the tradeoff in allocation vs. reusability and composability is worthwhile.

That said, I’m open to suggestions on how we can better support this kind of usage while minimizing the impact on performance and the library’s core design.

@xwwwx
Author

xwwwx commented Apr 23, 2025

I've reverted to ArrayPool as suggested.
Please check whether this restores the original performance. 🙇‍♂️
