Skip to content

Conversation

ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented Aug 28, 2024

Description

Iceberg timestamps are written with microsecond precision as defined by the spec. Before this change, the parquet batch reader optimization did not properly support reading microsecond-precision time types. Presto returned results which differed depending on whether the optimization was enabled

This change adds proper support in the batch reader.

Motivation and Context

Previously Presto would return wrong results in the Iceberg connector when parquet batch read optimizations are enabled.

Behavior before this change:

CREATE TABLE t(i time);
INSERT INTO t VALUES time '1:00';
-- without batch optimization
presto:tpch> SET SESSION iceberg.parquet_batch_read_optimization_enabled = false;
presto:tpch> SELECT CAST(i as bigint) FROM t;
  _col0
----------
 32400000 -- (5 trailing zeroes)
(1 row)


-- with batch optimization
presto:tpch> SET SESSION iceberg.parquet_batch_read_optimization_enabled = true;
SET SESSION
presto:tpch> SELECT CAST(i as bigint) FROM t;
    _col0
-------------
 32400000000 -- (8 trailing zeroes)
(1 row)

Behavior after this change:

CREATE TABLE t(i time);
INSERT INTO t VALUES time '1:00';
-- without batch optimization
presto:tpch> SET SESSION iceberg.parquet_batch_read_optimization_enabled = false;
presto:tpch> SELECT CAST(i as bigint) FROM t;
  _col0
----------
 32400000 -- (5 trailing zeroes)
(1 row)


-- with batch optimization
presto:tpch> SET SESSION iceberg.parquet_batch_read_optimization_enabled = true;
SET SESSION
presto:tpch> SELECT CAST(i as bigint) FROM t;
    _col0
-------------
 32400000 -- (5 trailing zeroes)
(1 row)

Impact

Proper time type support.

Test Plan

Added test to verify equivalent return values with and without the optimization in the Iceberg connector.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
* Fix time-type columns to return properly when ``iceberg.parquet-batch-read-optimization-enabled`` is set to ``TRUE``. :pr:`23542`

@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-time-micros branch from 2447d11 to cf488ca Compare August 28, 2024 21:11
@ZacBlanco ZacBlanco changed the title [parquet] Add support for microsecond precision Time type [parquet] Add support us precision Time type in batch reader Aug 28, 2024
@ZacBlanco ZacBlanco changed the title [parquet] Add support us precision Time type in batch reader [parquet] Support us precision Time type in batch reader Aug 28, 2024
@ZacBlanco ZacBlanco marked this pull request as ready for review August 29, 2024 03:14
@ZacBlanco ZacBlanco requested review from shangxinli, a team and hantangwangd as code owners August 29, 2024 03:14
@ZacBlanco ZacBlanco requested a review from presto-oss August 29, 2024 03:14
@steveburnett
Copy link
Contributor

Nit suggestion for release note entry:

== RELEASE NOTES ==

Iceberg Connector Changes
* Fix time-type columns to return properly when ``iceberg.parquet-batch-read-optimization-enabled`` is set to ``TRUE``. :pr:`23542`

@tdcmeehan tdcmeehan requested a review from yingsu00 August 29, 2024 17:21
@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-time-micros branch 2 times, most recently from 72a0341 to 2597100 Compare September 3, 2024 22:09
@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-time-micros branch from 2597100 to 4720f2f Compare September 4, 2024 17:00
hantangwangd
hantangwangd previously approved these changes Sep 5, 2024
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Lgtm, one little nit.

@tdcmeehan tdcmeehan self-assigned this Sep 5, 2024
@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-time-micros branch 2 times, most recently from 9eccb5c to a937f57 Compare September 5, 2024 20:31
hantangwangd
hantangwangd previously approved these changes Sep 6, 2024
Copy link
Contributor

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZacBlanco I see the Time decoders are exactly the same with existing Timestamp ones. There's no point to duplicate so much code. Maybe we can rename the existing Timestamp decoders like Int64TimestampMicrosRLEDictionaryValuesDecoder to Int64TimeAndTimestampMicrosRLEDictionaryValuesDecoder?

@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-time-micros branch 4 times, most recently from 22b54d4 to 45d8680 Compare September 9, 2024 17:30
@ZacBlanco
Copy link
Contributor Author

Thanks for the review @yingsu00 . I agree - let's try not to duplicate code. I updated the naming of the decoders across the module and updated the conditionals to check isTimestampMicros || isTimeMicros

@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-time-micros branch from 45d8680 to 370ed94 Compare September 9, 2024 17:51
Iceberg timestamps are written with microsecond precision as defined
by the spec. Before this change, the parquet batch reader optimization
did not support reading microsecond-precision `time` types. This led
to Presto returning incorrect results.

This change adds proper support in the batch reader.
@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-time-micros branch from 370ed94 to 0f0cedb Compare September 9, 2024 18:33
@yingsu00 yingsu00 merged commit 0902cb3 into prestodb:master Sep 10, 2024
56 checks passed
@jaystarshot jaystarshot mentioned this pull request Nov 1, 2024
25 tasks
@tdcmeehan tdcmeehan added the from:IBM PR from IBM label Dec 13, 2024
@prestodb-ci prestodb-ci requested review from a team, nishithakbhaskaran and jp-sivaprasad and removed request for a team December 13, 2024 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:IBM PR from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants