@@ -40,9 +40,9 @@ informative:
40
40
Developed for use in a high-bandwidth distributed social networking context, the
41
41
TID is essentially a highly-compact UUIDv6 variant that optimizes for a few
42
42
specific properties (most notably being sortable both bytewise and lexically)
43
- and fits into the space-efficient ` int64 ` type of languages that support it.
44
- It uses a bespoke base-32 encoding alphabet rather than the similar base32hex
45
- encoding, as the latter mandates a final padding character and is less URL-safe .
43
+ and fits into the space-efficient ` int64 ` type of languages that support it. One
44
+ way it achieves these properties is by using a bespoke base-32 encoding alphabet
45
+ rather than the similar base32hex encoding .
46
46
47
47
--- middle
48
48
@@ -53,19 +53,21 @@ properties:
53
53
54
54
1 . sortable both bytewise and lexically when encoded with a base32 variant (also
55
55
specified in this document);
56
- 1 . collision-resistant up to a 1024 independent parallel timestamping services,
56
+ 1 . collision-resistant for up to 1024 independent parallel timestamping services,
57
57
with this set of 1024 broken up into 3 contiguous namespaces to support various
58
- use-cases described below
58
+ use-cases described below;
59
59
1 . based on microseconds since unix epoch to simplify translation to other
60
60
timestamp formats;
61
- 1 . fits in an ` int64 ` and
61
+ 1 . fits in an ` int64 ` for efficient storage, sorting and compute; and,
62
62
1 . works well across the type systems of most major compiled languages in use
63
63
today for application-level development.
64
64
65
65
Many minor choices, such as the choice of code points in the alternate base32
66
66
encoding and the signed nature of the timestamp bytespace are primarily informed
67
67
by cross-language ergonomics.
68
68
69
+ ## Base32tid encoding
70
+
69
71
The 32 code points chosen to encode from binary, in order, are:
70
72
71
73
~~~~ bash
@@ -86,46 +88,62 @@ base32tid 234567abcdefghijklmnopqrstuvwxyz
86
88
base32 ABCDEFGHIJKLMNOPQRSTUVWXYZ234567
87
89
~~~~
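As a minimal sketch of why this alphabet gives the lexical sorting property listed above, the Python below encodes a non-negative integer into a fixed-width `base32tid` string. The function name `encode_base32tid` and the fixed-width, most-significant-code-point-first layout are assumptions made for illustration; negative values are deliberately left out of the sketch.

```python
BASE32TID = "234567abcdefghijklmnopqrstuvwxyz"


def encode_base32tid(value: int, width: int) -> str:
    """Encode a non-negative integer as `width` base32tid code points.

    Hypothetical helper: the name and the fixed-width layout are
    assumptions, not defined by the draft.
    """
    if value < 0 or value >= 32 ** width:
        raise ValueError("value does not fit in the requested width")
    digits = []
    for _ in range(width):
        digits.append(BASE32TID[value & 0b11111])  # take the lowest 5 bits
        value >>= 5
    return "".join(reversed(digits))  # most significant code point first


samples = [0, 1, 31, 32, 1024, 2**53 - 1]
encoded = [encode_base32tid(n, 11) for n in samples]
# The alphabet is in ascending ASCII order, so for equal-width strings
# string order follows numeric order.
assert encoded == sorted(encoded)
```

Because the digits `2`-`7` precede the lowercase letters in ASCII, equal-width strings compare the same way as the integers they encode, which the standard `base32` alphabet (letters before digits) does not give.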
88
90
89
- # Timestamp Component
91
+ # TID computation
92
+
93
+ ## Timestamp Component
90
94
91
- The form of timestamp used to generate a TID is the number microseconds since
95
+ The form of timestamp used to generate a TID is the number of microseconds since
92
96
the Unix epoch (1970-01-01T00:00:00+00:00), i.e. with three more digits than an
93
97
{{?RFC5102}} ` dateTimeMilliseconds ` . In cases where a timestamping service may
94
98
be returning timestamps faster than 1000 times every millisecond, uniqueness
95
99
should be favored over microsecond accuracy; i.e., the ID generator should
96
100
return the current microsecond since epoch OR the last microsecond returned plus
97
101
1, whichever is greater.
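The "current microsecond OR last returned plus 1, whichever is greater" rule can be sketched as follows. This is an illustrative, single-threaded sketch only: the class name `MonotonicMicros` and the injectable `now_us` clock are assumptions, and a real service would also need locking and persistence across restarts.

```python
import time


class MonotonicMicros:
    """Return strictly increasing microsecond timestamps (hypothetical helper)."""

    def __init__(self, now_us=None):
        # `now_us` is an assumed injection point so the clock can be replaced
        # in tests; by default, use the system clock in whole microseconds.
        if now_us is None:
            now_us = lambda: time.time_ns() // 1000
        self._now_us = now_us
        self._last = None

    def next_micros(self) -> int:
        current = self._now_us()
        if self._last is not None and current <= self._last:
            # Favor uniqueness over accuracy: step past the last value issued.
            current = self._last + 1
        self._last = current
        return current


gen = MonotonicMicros()
first = gen.next_micros()
second = gen.next_micros()
assert second > first
```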
98
102
99
- Note that the range of times that fit in the ` int54 ` is actually
100
- purposefully limited to 2^53 by dropping the negative half of the range. This is
101
- to avoid a quirk of the Java type system that converts 54-byte integers to
102
- floats. The following tables shows the min, zero, and max values of the integer
103
- range of microseconds, expressed in the 11-codepoint ` base32tid ` encoding:
104
-
105
- ~~~~ bash
106
- s222-222-2222 1684-07-28T00:12:25.259008 min i64
107
- 2222-222-2222 1970-01-01T00:00:00.000000 zero i64
108
- bzzz-zzz-zzzz 2255-06-05T23:47:34.740992 max i64
109
- ~~~~
110
-
111
- # Node Identifier Component
103
+ The effective range of TIDs is limited by the compaction into ` int64 ` form and
104
+ with 10 bits of its range being used to encode the nodeId segment; for reasons
105
+ that will be explained below, it is further limited by one bit to avoid some
106
+ translation problems with the targeted encodings and type systems, leaving 53
107
+ bits of signed space for a subset of unix microsecond timestamps. Effectively,
108
+ this means that the range of microseconds before or after 1970, expressed as a
109
+ signed integer, is not (-2^63+1) to (2^63-1), but (-2^53+1) to (2^53-1). The
110
+ following table shows the min, zero, and max values of the integer range of
111
+ microseconds, expressed in the 11-codepoint ` base32tid ` encoding. The additional
112
+ 2 codepoints for the nodeId segment, covered below, are omitted for clarity.
113
+
114
+ | tid | microseconds | valid? | ISO timestamp |
115
+ | --- | --- | --- | --- |
116
+ | s222-222-2222| -9007199254740991| yes (min value) | 1684-07-28T00:12:25|
117
+ | 2222-222-2222| 0| yes | 1970-01-01T00:00:00|
118
+ | bzzz-zzz-zzzz| 9007199254740991| yes (max value) | 2255-06-05T23:47:34|
119
+ | zzzz-zzz-zzzz| 18014398509481982| no (binary unsafe)| 2540-11-07T23:35:09|
120
+
121
+ Note that half the possible range of values encodable in 11 codepoints are
122
+ considered invalid TIDs, as their binary form would not fit safely in an ` int64 `
123
+ bytestring. As the canonical form of TIDs is an ` int64 ` bytestring, the invalid
124
+ half of the string-encodable range should not be mistaken for valid TIDs and
125
+ software handling these TIDs should validate strings accordingly.
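A minimal sketch of the integer side of that validation, assuming the range of (-2^53+1) to (2^53-1) microseconds stated above. The helper names are hypothetical, and decoding a string TID into its integer form is out of scope here.

```python
from datetime import datetime, timedelta, timezone

# Bounds taken from the range stated in the text; the constant names are assumptions.
MAX_TID_MICROS = 2**53 - 1
MIN_TID_MICROS = -(2**53 - 1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def is_valid_tid_micros(micros: int) -> bool:
    """True if `micros` lies inside the timestamp range described above."""
    return MIN_TID_MICROS <= micros <= MAX_TID_MICROS


def micros_to_iso(micros: int) -> str:
    """Render a valid microsecond timestamp as an ISO 8601 string in UTC."""
    if not is_valid_tid_micros(micros):
        raise ValueError("timestamp is outside the valid TID range")
    return (EPOCH + timedelta(microseconds=micros)).isoformat()


print(micros_to_iso(0))
print(micros_to_iso(MAX_TID_MICROS))
print(is_valid_tid_micros(2**53))  # one past the maximum, so rejected
```

Rejecting out-of-range integers before encoding keeps a generator from emitting strings that fall in the invalid half of the 11-codepoint space.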
126
+
127
+ ## Node Identifier Component
112
128
113
129
The ` node ` identifier, by analogy to the equivalent element in an {{?RFC9562}}
114
- UUIDv6, is a spatially-unique identifier, but from much smaller space (10 bits,
115
- as opposed to UUIDv6's 48 bits). It is divided into three contiguous ranges. The
116
- first 32 values (0-31, i.e. "20" - "2z" base-encoded) are reserved for "best
117
- effort" TIDs. The bulk of the range, (32-991, i.e. "30" - "yz" base-encoded) is
118
- reserved for context-dependent use. The remaining 32 entries (992-1023, i.e.
119
- "z0" - "zz" base-encoded) are reserved for globally unique TIDs.
120
-
121
- "Best effort" node identifiers can be generated without coordination but may
122
- collide.
123
-
124
- Context-dependent should be use in the context of a specific application where
125
- they can be derived stably from application context. The application developer
126
- should take steps to ensure the that in any given time range, no node
127
- identifiers are in simultaneous use by two different actors. No process is
128
- specified for coordinating leases of node identifiers to actors.
130
+ UUIDv6, is a spatially-unique identifier, but occupying a much smaller space (10
131
+ bits, as opposed to UUIDv6's 48 bits). It is divided into three contiguous
132
+ ranges. The first 32 values (0-31, i.e. "20" - "2z" base-encoded) are reserved
133
+ for "best effort" collision-resistant TIDs. The bulk of the range, (32-991,
134
+ i.e. "30" - "yz" base-encoded) is reserved for context-dependent use. The
135
+ remaining 32 entries (992-1023, i.e. "z0" - "zz" base-encoded) are reserved for
136
+ globally unique TIDs.
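The three contiguous ranges can be sketched as a small classifier over the 10-bit integer value. The function name and the returned labels are assumptions chosen for illustration; only the numeric boundaries come from the text above.

```python
def classify_node_id(node_id: int) -> str:
    """Map a 10-bit node identifier to its range (hypothetical helper)."""
    if not 0 <= node_id <= 1023:
        raise ValueError("node identifier must fit in 10 bits (0-1023)")
    if node_id <= 31:
        return "best-effort"
    if node_id <= 991:
        return "context-dependent"
    return "globally-unique"


for sample in (0, 31, 32, 991, 992, 1023):
    print(sample, classify_node_id(sample))
```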
137
+
138
+ "Best effort" node identifiers can be generated without coordination or deferral
139
+ to external authorities, but are considered likely to collide when merged with
140
+ data from external sources.
141
+
142
+ Context-dependent node identifiers should be used in the context of a specific
143
+ application where they can be derived stably from application context. The
144
+ application developer should take steps to ensure that in any given time
145
+ range, no node identifiers are in simultaneous use by two different actors. No
146
+ process is specified for coordinating leases of node identifiers to actors.
129
147
130
148
Globally-unique node identifiers should only be used after being registered
131
149
globally. At time of writing, there is only one public TimeID service
@@ -135,23 +153,39 @@ operating.
135
153
| ---------| -------------------| ------------| ---------------| ---------|
136
154
| z0 | http://ccn.bz/tid | todo | 2222-222-2222 | ongoing |
137
155
138
- # Base-Encoded String Expression
156
+ ## Base-Encoded String Expression
139
157
140
158
The TimeID concatenates the timestamp and the node identifier. The string format
141
159
is 11 code points of timestamp and 2 code points of node identifier, displayed
142
160
for readability with ` - ` segment dividers after the 4th, 7th, and 11th code
143
161
points:
144
162
145
163
~~~~ bash
146
- TTTT -TTT-TTTT-CC
164
+ STTT -TTT-TTTT-CC
147
165
~~~~
148
166
149
- Since each code point in base32-encoding represents 5 bits, we need to
150
- sign-extend the 54 bits of the time stamp to 55 bits to convert to 11 char of
151
- b32.
167
+ Where:
168
+
169
+ * S represents a character with limited range, representing 3 bits of timestamp,
170
+ * each T represents 5 bits of the timestamp, and
171
+ * each C represents a 5-bit character from the node identifier
172
+
173
+ The timestamp is expressed in a string-form TID as the first 11 codepoints, i.e.
174
+ 55 bits, but in the binary form as the first 54 bits of an ` int64 ` . Note that
175
+ the range of times that fit into those 54 bits is actually a little smaller
176
+ than 0 +/- (2^55-1); namely, it is 0 +/- (2^53-1). This is to accommodate various
177
+ type-system quirks in the targeted languages and encodings: firstly, unsigned
178
+ integers are problematic for JSON encoding (particularly JSON tooling),
179
+ requiring the "assumed bit" (the 55th bit hard-coded into the decoding
180
+ process) to designate a positive or negative number. Similarly, the 54th bit
181
+ has to "sign extend" that positive or negative sign to keep the total integer
182
+ expressible in a sign-extended 63 bits to accommodate a quirk of the Java type
183
+ system that converts integers bigger than 63 bits to ` float64 ` s. Sacrificing
184
+ these two bits of range safeguards round-trip translation of these ` int64 ` s to
185
+ JSON or Java and back.
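The display grouping described in this section (dividers after the 4th, 7th, and 11th code points) can be sketched as below. This only handles the string layout, not the bit-level conversion; the helper names `group_tid` and `ungroup_tid` are assumptions.

```python
BASE32TID = "234567abcdefghijklmnopqrstuvwxyz"


def group_tid(code_points: str) -> str:
    """Insert `-` dividers into 13 base32tid code points (hypothetical helper)."""
    if len(code_points) != 13:
        raise ValueError("expected 11 timestamp + 2 node code points")
    if any(c not in BASE32TID for c in code_points):
        raise ValueError("code point outside the base32tid alphabet")
    parts = [code_points[0:4], code_points[4:7],
             code_points[7:11], code_points[11:13]]
    return "-".join(parts)


def ungroup_tid(display: str) -> str:
    """Strip the readability dividers again."""
    return display.replace("-", "")


grouped = group_tid("2" * 13)
print(grouped)
assert ungroup_tid(grouped) == "2" * 13
```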
152
186
153
187
We can take the timestamp ` 2024-07-19T09:40:46.480310 ` as an example
154
- to show the process. This is ` 1721382046481 ` in seconds since Unix
188
+ to show the process. This is ` 1721382046481000 ` microseconds since Unix
155
189
epoch. On a node with identifier ` 01 ` , this base-encodes to:
156
190
157
191
~~~~ bash