
Conversation

@EnricoMi (Contributor) commented Jun 26, 2025

There is currently no way to control the number of partitions, as they are split based on the number of lines. What can be controlled is the number of batches the tbl files are generated in, and the parallelism used for each batch. The result is that each batch produces as many partitions as the rows per table require.
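
A rough, self-contained illustration of that relationship, with purely hypothetical numbers (the real values come from the benchmark's scale factor and its settings, such as PartitionMaxSize):

```python
import math

# Hypothetical figures for illustration only; they are not taken from the
# benchmark code or its default settings.
rows_per_table = 6_000_000        # e.g. lineitem at a small scale factor
batches = 10                      # number of batches the .tbl files are generated in
max_rows_per_partition = 100_000  # cap on rows per partition

rows_per_batch = math.ceil(rows_per_table / batches)
partitions_per_batch = math.ceil(rows_per_batch / max_rows_per_partition)
total_partitions = batches * partitions_per_batch
print(f"{rows_per_batch=} {partitions_per_batch=} {total_partitions=}")
# Changing `batches` or the per-batch parallelism does not set the partition
# count directly; only the per-partition row cap does, which is why the fix
# below makes PartitionMaxSize actually take effect.
```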

Fixes:

  • use of PartitionMaxSize
  • bucket name when syncing to AWS S3
  • puts each parquet file into its own directory; syncing this layout to S3 improves read
    throughput, because AWS throttles bandwidth per prefix (directory) (see the sketch after this list)
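
A minimal sketch of that per-file-directory layout, assuming a local tables/ output directory, a placeholder bucket name, and a hypothetical helper name (none of these are taken from the benchmark code):

```python
from pathlib import Path

import polars as pl


def write_partitions(df: pl.DataFrame, table: str, max_rows: int, out_dir: Path) -> None:
    """Write each partition of `df` into its own directory, e.g.
    tables/lineitem/partition-0000/partition-0000.parquet, so that every
    parquet file ends up under a distinct S3 prefix after syncing."""
    for i, offset in enumerate(range(0, df.height, max_rows)):
        part_dir = out_dir / table / f"partition-{i:04d}"
        part_dir.mkdir(parents=True, exist_ok=True)
        df.slice(offset, max_rows).write_parquet(part_dir / f"partition-{i:04d}.parquet")


# Hypothetical usage; the bucket name is a placeholder:
#   write_partitions(lineitem_df, "lineitem", max_rows=100_000, out_dir=Path("tables"))
#   aws s3 sync tables/ s3://my-benchmark-bucket/tables/
```

Because S3 request and bandwidth limits scale per prefix, giving each file its own prefix lets parallel readers pull more aggregate throughput than if all files shared a single directory.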

@EnricoMi force-pushed the gen-in-batches branch 2 times, most recently from 0772508 to ce39f74 on June 26, 2025 at 12:57
@EnricoMi force-pushed the gen-in-batches branch 2 times, most recently from cce0b1d to 93f1fc3 on June 26, 2025 at 13:16
@ritchie46 merged commit c3e288e into pola-rs:main on Jun 28, 2025
2 checks passed
r-brink pushed a commit to r-brink/polars-benchmark that referenced this pull request on Jul 1, 2025