
Conversation

@EnricoMi (Contributor) commented Jun 26, 2025

There is currently no way to control the number of partitions, as they are split based on the number of lines. What can be controlled is the number of batches the tbl files are generated in, and the parallelism used for each batch. The result is that each batch produces as many partitions as the rows per table require.
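
A rough, self-contained illustration of that relationship, with purely hypothetical numbers (the real values come from the benchmark's scale factor and its settings, such as PartitionMaxSize):

```python
import math

# Hypothetical figures for illustration only; they are not taken from the
# benchmark code or its default settings.
rows_per_table = 6_000_000        # e.g. lineitem at a small scale factor
batches = 10                      # number of batches the .tbl files are generated in
max_rows_per_partition = 100_000  # cap on rows per partition

rows_per_batch = math.ceil(rows_per_table / batches)
partitions_per_batch = math.ceil(rows_per_batch / max_rows_per_partition)
total_partitions = batches * partitions_per_batch
print(f"{rows_per_batch=} {partitions_per_batch=} {total_partitions=}")
# Changing `batches` or the per-batch parallelism does not set the partition
# count directly; only the per-partition row cap does, which is why the fix
# below makes PartitionMaxSize actually take effect.
```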

Fixes:

  • use of PartitionMaxSize
  • bucket name when syncing to AWS S3
  • puts each parquet file into its own directory; syncing this layout to S3 improves read
    throughput, because AWS throttles bandwidth per prefix (directory) (see the sketch after this list)
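
A minimal sketch of that per-file-directory layout, assuming a local tables/ output directory, a placeholder bucket name, and a hypothetical helper name (none of these are taken from the benchmark code):

```python
from pathlib import Path

import polars as pl


def write_partitions(df: pl.DataFrame, table: str, max_rows: int, out_dir: Path) -> None:
    """Write each partition of `df` into its own directory, e.g.
    tables/lineitem/partition-0000/partition-0000.parquet, so that every
    parquet file ends up under a distinct S3 prefix after syncing."""
    for i, offset in enumerate(range(0, df.height, max_rows)):
        part_dir = out_dir / table / f"partition-{i:04d}"
        part_dir.mkdir(parents=True, exist_ok=True)
        df.slice(offset, max_rows).write_parquet(part_dir / f"partition-{i:04d}.parquet")


# Hypothetical usage; the bucket name is a placeholder:
#   write_partitions(lineitem_df, "lineitem", max_rows=100_000, out_dir=Path("tables"))
#   aws s3 sync tables/ s3://my-benchmark-bucket/tables/
```

Because S3 request and bandwidth limits scale per prefix, giving each file its own prefix lets parallel readers pull more aggregate throughput than if all files shared a single directory.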

@EnricoMi force-pushed the gen-in-batches branch 2 times, most recently from 0772508 to ce39f74 on June 26, 2025 at 12:57
@EnricoMi force-pushed the gen-in-batches branch 2 times, most recently from cce0b1d to 93f1fc3 on June 26, 2025 at 13:16
@ritchie46 merged commit c3e288e into pola-rs:main on Jun 28, 2025
2 checks passed
r-brink pushed a commit to r-brink/polars-benchmark that referenced this pull request on Jul 1, 2025