clarify reduced range of valid values

bumblefudge · bumblefudge · commit e7b0c9806935 · 2024-09-18T16:42:53.000+02:00
diff --git a/draft-goldman-tid.md b/draft-goldman-tid.md
@@ -40,9 +40,9 @@ informative:
 Developed for use in a high-bandwidth distributed social networking context, the
 TID is essentially a highly-compact UUIDv6 variant that optimizes for a few
 specific properties (most notably being sortable both bytewise and lexically)
-and fits into the space-efficient `int64` type of languages that support it.
-It uses a bespoke base-32 encoding alphabet rather than the similar base32hex
-encoding, as the latter mandates a final padding character and is less URL-safe.
+and fits into the space-efficient `int64` type of languages that support it. One
+way it achieves these properties is by using a bespoke base-32 encoding alphabet
+rather than the similar base32hex encoding.
 
 --- middle
 
@@ -53,19 +53,21 @@ properties:
 
 1. sortable both bytewise and lexically when encoded with a base32 variant (also
 specified in this document);
-1. collision-resistant up to a 1024 independent parallel timestamping services,
+1. collision-resistant for up to 024 independent parallel timestamping services,
 with this set of 1024 broken up in 3 contiguous namespace to support various
-use-cases described below
+use-cases described below;
 1. based on microseconds since unix epoch to simplify translation to other
 timestamp formats;
-1. fits in an `int64` and
+1. fits in an `int64` for efficient storage, sorting and compute; and,
 1. works well across the type systems or most major compiled languages in use
 today for application-level development.
 
 Many minor choices, such as the choice of code points in the alternate base32
 encoding and the signed nature of the timestamp bytespace are primarily informed
 by cross-language ergonomics.
 
+## Base32tid encoding
+
 The 32 code points chosen to encode from binary, in order, are:
 
 ~~~~ bash
@@ -86,46 +88,62 @@ base32tid  234567abcdefghijklmnopqrstuvwxyz
 base32     ABCDEFGHIJKLMNOPQRSTUVWXYZ234567
 ~~~~
 
-# Timestamp Component
+# TID computation
+
+## Timestamp Component
 
-The form of timestamp used to generate a TID is the number microseconds since
+The form of timestamp used to generate a TID is the number of microseconds since
 the Unix epoch (1970-01-01T00:00:00+00:00), i.e. with three more digits than an
 {{?RFC5102}} `dateTimeMilliseconds`. In cases where a timestamping service may
 be returning timestamps faster than 1000 times every millisecond, uniqueness
 should be favored over microsecond accuracy; i.e., the ID generator should
 return the current microsecond since epoch OR the last microsecond returned plus
 1, whichever is greater.
 
-Note that the range of times that fit in the `int54` is actually
-purposefully limited to 2^53 by dropping the negative half of the range. This is
-to avoid a quirk of the Java type system that converts 54-byte integers to
-floats. The following tables shows the min, zero, and max values of the integer
-range of microseconds, expressed in the 11-codepoint `base32tid` encoding:
-
-~~~~ bash
-s222-222-2222 1684-07-28T00:12:25.259008 min  i64
-2222-222-2222 1970-01-01T00:00:00.000000 zero i64
-bzzz-zzz-zzzz 2255-06-05T23:47:34.740992 max  i64
-~~~~
-
-# Node Identifier Component
+The effective range of TIDs is limited by the compaction into `int64` form and
+with 10 bytes of its range being used to encode the nodeId segment; for reasons
+that will be explained below, it is further limited by one byte to avoid some
+translation problems with the targeted encodings and type systems, leaving 53
+bytes of signed space for a subset of unix microsecond timestamps. Effectively,
+this means that the range of microseconds before or after 1970, expressed as a
+signed integer, is not (-2^63+1) to (2^63-1), but (-2^53+1) to (2^53-1). The
+following tables shows the min, zero, and max values of the integer range of
+microseconds, expressed in the 11-codepoint `base32tid` encoding. The additional
+2 codepoints for the nodeId segment, covered below, are omitted for clarity.
+
+|tid          |microseconds     |valid?            |ISO timestamp      |
+|---          |---              |---               |---                |
+|s222-222-2222|-9007199254740991|yes (min value)   |1684-07-28T00:12:25|
+|2222-222-2222|                0|yes               |1970-01-01T00:00:00|
+|bzzz-zzz-zzzz| 9007199254740991|yes (max value)   |2255-06-05T23:47:34|
+|zzzz-zzz-zzzz|18014398509481982|no (binary unsafe)|2540-11-07T23:35:09|
+
+Note that half the possible range of values encodable in 11 codepoints are
+considered invalid TIDs, as their binary form would not fit safely in an `int64`
+bytestring. As the canonical form of TIDs is an `int64` bytestring, the invalid
+half of the string-encodable range should not be mistaken for valid TIDs and
+software handling these TID should validate strings accordingly.
+
+## Node Identifier Component
 
 The `node` identifier, by analogy to the equivalent element in an {{?RFC9562}}
-UUIDv6, is a spatially-unique identifier, but from much smaller space (10 bits,
-as opposed to UUIDv6's 48 bits). It is divided into three contiguous ranges. The
-first 32 values (0-31, i.e. "20" - "2z" base-encoded) are reserved for "best
-effort" TIDs. The bulk of the range, (32-991, i.e. "30" - "yz" base-encoded) is
-reserved for context-dependent use. The remaining 32 entries (992-1023, i.e.
-"z0" - "zz" base-encoded) are reserved for globally unique TIDs.
-
-"Best effort" node identifiers can be generated without coordination but may
-collide.
-
-Context-dependent should be use in the context of a specific application where
-they can be derived stably from application context. The application developer
-should take steps to ensure the that in any given time range, no node
-identifiers are in simultaneous use by two different actors. No process is
-specified for coordinating leases of node identifiers to actors.
+UUIDv6, is a spatially-unique identifier, but occupying a much smaller space (10
+bits, as opposed to UUIDv6's 48 bits). It is divided into three contiguous
+ranges. The first 32 values (0-31, i.e. "20" - "2z" base-encoded) are reserved
+for "best effort" collision-resistance TIDs. The bulk of the range, (32-991,
+i.e. "30" - "yz" base-encoded) is reserved for context-dependent use. The
+remaining 32 entries (992-1023, i.e. "z0" - "zz" base-encoded) are reserved for
+globally unique TIDs.
+
+"Best effort" node identifiers can be generated without coordination or deferral
+to external authorities, but are considered likely to collide when merged with
+data from external sources.
+
+Context-dependent node identifiers should be use in the context of a specific
+application where they can be derived stably from application context. The
+application developer should take steps to ensure the that in any given time
+range, no node identifiers are in simultaneous use by two different actors. No
+process is specified for coordinating leases of node identifiers to actors.
 
 Globally-unique node identifiers should only be used after being registered
 globally. At time of writing, there is only one public TimeID service
@@ -135,23 +153,39 @@ operating.
 |---------|-------------------|------------|---------------|---------|
 | z0      | http://ccn.bz/tid | todo       | 2222-222-2222 | ongoing |
 
-# Base-Encoded String Expression
+## Base-Encoded String Expression
 
 The TimeID concatenates the timestamp and the node identifier. The string format
 is 11 code points of timestamp and 2 code points of node identifier, displayed
 for readability with `-` segment dividers after the 4th, 7th, and 11th code
 points:
 
 ~~~~ bash
-TTTT-TTT-TTTT-CC
+STTT-TTT-TTTT-CC
 ~~~~
 
-Since each code point in base32-encoding represents 5 bits, we need to
-sign-extend the 54 bits of the time stamp to 55 bits to convert to 11 char of
-b32.
+Where:
+
+* S represents a character with limited range, representing 3 bytes of timestamp
+* each T represents 5-bytes of the timestamp, and
+* each C represent a 5-byte character from the node identifier
+
+The timestamp is expressed in a string-form TID as the first 11 codepoints, i.e.
+55 bytes, but in the binary form as the first 54 bytes of an `int64`. Note that
+the range of times that fit into those 54 bytes is actually a little smaller
+than 0 +/- (2^55-1); namely, it is 0 +/- (2^53-1). This is to accomodate various
+type-system quirks in the targeted languages and encodings: firstly, unsigned
+integers are problematic for JSON encoding (particularly JSON tooling),
+requiring the "assumed byte" (the 55th byte hard-coded into the decoding
+process) to designate a positive or negative number. Similarly, the 54th byte
+has to "sign extend" that positive or negative sign to keep the total integer
+expressible in a sign-extended 63 bytes to accomodate a quirk of the Java type
+system that converts integers bigger than 63 bytes to `float64`s. Sacrificing
+these two bytes of range safeguards round-trip translation of these `int64`s to
+JSON or Java and back.
 
 We can take the timestamp `2024-07-19T09:40:46.480310` as an example
-to show the process. This is `1721382046481` in seconds since Unix
+to show the process. This is `1721382046481000` microseconds since Unix
 epoch. On a node with identifer `01`, this base-encodes to:
 
 ~~~~ bash