Commit f3b0e04

Merge pull request #9 from oxinabox/ox/notype
Remove the InternedString type, return strings directly
2 parents 1522ae8 + 9bfc681 commit f3b0e04

12 files changed

+257
-265
lines changed

News.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
v0.5.0
2+
------
3+
- The `InternedString` type is gone. The name is deprecated to `String`, so using it no longer causes immediate interning.
4+
- Interning is now fully transparent: `intern(::S)::S`.
5+
- Works with all types of input, e.g. Strs.jl strings.
6+
- Operations (regex or otherwise) on interned strings no longer return interned strings, as there is no longer a type to catch. This is mostly OK: doing all the interning at the end doesn't change the number of allocations, just their timing.
7+
- Additional 2.5x speedup on top of v0.4.0
8+
9+
10+
v0.4.0
11+
------
12+
- Serious performance optimization of the pool lookup. 2-5x speed-up
13+
14+
15+
v0.3.0
16+
-------
17+
- More operations, especially regex ones, on InternedStrings return InternedStrings.
18+
19+
v0.2.0
20+
-----
21+
- Basic operations like `split` on InternedStrings return InternedStrings.
22+
- String Macro created
23+
24+
25+
v0.1.0
26+
------
27+
InternedString type created
28+
It works fully like a String

README.md

Lines changed: 79 additions & 52 deletions
@@ -7,41 +7,76 @@ For not having duplicate strings in memory.
77

88
## Usage
99

10-
`InternedString(s)` returns an interned string.
11-
it won't allocate new memory if an interned string with that content already exists.
10+
`intern(s)` returns an interned string.
11+
The short of it is that you can call `intern(s)` on any strings you expect to have multiple copies of in memory, and you will enjoy memory savings.
12+
You'll also enjoy much faster equality checks.
13+
14+
15+
If a string with that content was interned before, calling `intern(s)` returns (a reference to) the earlier string; if this is the first time the string has been interned, it returns (a reference to) its input.
16+
Writing `s = intern(s)`, or otherwise dropping the old references to the strings you intern, lets the duplicate copies be garbage collected, so memory is only used by unique strings.
17+
18+
The interned strings are fully transparent -- they are normal references to the original string.
19+
So when all references to that string (i.e. all "copies" of it from interning) go out of scope, it will be garbage collected.
1220
And when that interned string goes out of scope, it **will** be garbage collected, so don't worry about it.
1321

1422
For convenience it also comes in string macro form:
15-
`i"My String Uses Less Memory than Yours"`, makes an interned string with that content.
23+
`i"My String Uses Less Memory than Yours"`, makes a string with that content and interns it immediately.
24+
25+
### What types can I intern?
26+
You can intern any type really.
27+
It doesn't actually have to be a string at all.
28+
Strange things will happen if you mutate something that has been interned though; so it is recommended for use with immutable types only.
29+
30+
All types go into their own interning pool.
31+
Except `SubString`s, which are interned into their parent string type,
32+
as we do not want to hold on to a reference to the parent string while an interned reference exists.
33+
You can overload the behavior of `intern(::MyType)` in the usual way.
34+
35+
You might like to intern the strings from [Strs.jl](https://github.com/JuliaString/Strs.jl)
36+
37+
### What exactly is going on?
38+
If you're not familiar with the concept of string interning, perhaps the following example will help.
39+
40+
```
41+
julia> using InternedStrings
1642
17-
Use them just like you would Strings and enjoy your memory savings.
43+
julia> a = "Gold"
44+
"Gold"
1845
46+
julia> typeof(a), object_id(a) #This is the original reference
47+
(String, 0x2052f7ed641c9475)
1948
20-
#### `split` and regex the functions don't return substrings anymore :-( :-(
21-
Yes, `split`ing an InternedString does not make a vector of `SubString{InternedString}`.
22-
It just make an `InternedString`.
23-
Similar for all the regex function.
49+
julia> a = intern(a)
50+
"Gold"
2451
25-
Ideally we would also change every `SubStrings{InternedString}` everywhere, to be just `InternedString`.
26-
But it is a bit too breaking.
52+
julia> typeof(a), object_id(a) # No change still same memory
53+
(String, 0x2052f7ed641c9475)
2754
28-
SubStrings and InternedStrings solve roughly the same problem.
29-
But with different techniques and trade-offs.
30-
If you are using InternedStrings you probably don't want a substring anywhere.
31-
Since you might mistakenly end-up holding on to a really big string.
32-
The very problem this is designed to avoid.
55+
julia> b = "Gold"
56+
"Gold"
3357
34-
Please raise issues if you find functions that are returning SubStrings,
35-
that shouldn't be.
58+
julia> typeof(b),object_id(b) # New memory, see different ID
59+
(String, 0x927fe26348e44a27)
60+
61+
julia> b = intern(b) # Replace it,
62+
"Gold"
63+
64+
julia> typeof(b),object_id(b) # See it is same memory as for the original `a`
65+
(String, 0x2052f7ed641c9475)
66+
67+
#now the memory allocated to "b" with id=0x927fe26348e44a27 can be garbage collected
68+
69+
julia> object_id(intern("Gold")) # Same again
70+
0x2052f7ed641c9475
71+
```
3672

3773

3874
## Motivation (/Ranting)
39-
In natural language processing,
40-
when looking at a document,
75+
In natural language processing, when looking at a document,
4176
the first thing to do is to break it up into tokens.
42-
Tokenization can often be done simply:
43-
the most simple-case is just `split`,
77+
Tokenization can often be done simply: the simplest case is just `split`,
4478
more complex cases use some regex, or even something fairly sophisticated.
79+
See [WordTokenizers.jl](https://github.com/JuliaText/WordTokenizers.jl)
4580

4681
There is an issue though:
4782
How much are these tokens costing you in memory use?
@@ -77,39 +112,33 @@ If you are smart you will spot it and convert them to Strings, so the content ca
77112
But I am not smart, and have made that mistake many times.
78113

79114

80-
One option is to use [WeakRefStrings.jl](https://github.com/quinnj/WeakRefStrings.jl).
81-
In those, keeping you WeakRef substrings in memory won't keep the original string in memory.
82-
Only now you are responsible for managing that memory yourself.
83-
And for strengthening those references as required.
84-
85115
So there has to be a better way.
86116
We want to:
87117

88-
1. Have lots of Strings, without lots of allocations (like SubString/WeakRefString, unlike String)
89-
2. Not have to worry about mistakenly keeping original huge source string in memory (like WeakRefString/String, unlike SubString)
90-
3. Not have to worry about managing the memory of the strings ourself (like SubString/String, unlike WeakRefString)
91-
4. Just outright use less memory. (Unlike any String string type)
118+
1. Have lots of Strings, without lots of allocations (like SubString, unlike String)
119+
2. Not have to worry about mistakenly keeping original huge source string in memory (like String, unlike SubString)
120+
3. Not have to worry about managing the memory of the strings ourselves
121+
4. Just outright use less memory.
92122

93123
Can we do that? Yes we can.
94124

95-
#### InternedString
125+
#### `intern`
96126

97-
Every InternedString is a strong reference to a real String.
98-
But unlike normal Strings, if two InternedStrings are content equal, they are reference equal.
127+
The value returned by `intern` is a strong reference to a real String.
128+
But unlike normal use of Strings, if `s1==s2` then `intern(s1)===intern(s2)`, i.e. if strings are content equal, they are reference equal (once interned).
99129
That is to say if they look like each other, then they are each other.
100130

101-
When a new InternedStrings is created,
102-
before allocating new memory, we check to see if there already is an InternedString with that content, and if so we just grab a (Strong) reference to that existing String.
103-
This solves point **1.** by reducing allocations, (though not as much as SubStrings, which only have to allocated there pointers and length markers)
104-
131+
When a string is interned, we check to see if there already is an interned string with that content, and if so return it.
132+
Interning a string has no ongoing new allocations -- not even the pointer and length marker that `SubString` has.
133+
This solves point **1.** by reducing allocations.
105134

106135
You don't have to worry about mistakenly keeping the huge source string in memory (as you do with `SubString`),
107136
as they do not have a reference to that huge string, unless they **are** that huge source string.
108137
So that solves point **2.**
109138

110139
On point **3.** you don't have to worry about managing the memory yourself,
111-
because each InternedString is a strong reference to it's content.
112-
Once the last InternedString with that content goes out of scope (and is garbage collected),
140+
because each is just a normal reference to its content.
141+
Once the last string with that content goes out of scope (and is garbage collected),
113142
removing the copy in the interning pool will be handled automatically (it is a WeakRef, so won't keep it alive).
114143

115144

@@ -120,14 +149,15 @@ The original 10⁸ byte document, with 10⁷ words probably only has about 50,00
120149
is has 3.5×10⁵ words, but that is before rare words, numbers etc are removed)
121150
At an average of 10 bytes long you only need to be keeping 5×10⁵ bytes of content,
122151
plus for each 8 bytes of pointers/length markers (8×10⁴), plus 1 byte each for null terminating them all. (Grand total: 5.9×10⁵ bytes vs original 10⁸+9 bytes).
152+
The only difference memory-wise between tokenizing into Strings or SubStrings is that the memory for the content in SubStrings is all contiguous, whereas for Strings it needs to be reallocated.
153+
123154

124-
Since each `InternedString` is only one point (to the actual String)
125-
you only have 4×10⁷ bytes of pointers (don't need the 4 bytes of length markers).
126-
vs SubString's 8×10⁷ bytes of pointers/length markers,
127-
or individual String's 9×10⁷ bytes of pointers/length markers/null terminating.
155+
- Original: 10⁸ byte content, 8 bytes pointers/length markers (To be tokenized to 10⁷ words)
156+
- Tokenized: 10×10⁷=10⁸ byte content, 8×10⁷ bytes pointers/length markers. Total 1.8×10⁸ bytes.
157+
- Tokenized and interned: 10×5×10⁴=5×10⁵ byte content, 8×10⁷ bytes pointers/length markers. Total 0.805×10⁸ bytes.
128158

129159
These numbers are all pretty rough, I've probably screwed up in a few places.
130-
Point is though, this saves you like an order of magnitude in memory.
160+
Point is though, this can more than halve the memory use.
131161
It only gets better when you increase the size of the original document.
132162
This is because the size of the vocabulary increases only logarithmically with the size of the document.
133163

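The rough estimate in the three bullets above can be rechecked with a few lines of arithmetic. This is only a sketch: the figures come from the README text, the variable names are made up for illustration, and it ignores per-object overhead just as the README does.

```julia
# Figures taken from the README hunk; all names here are illustrative.
n_tokens      = 10^7       # words in the tokenized document
n_unique      = 5 * 10^4   # distinct words left after interning
bytes_per_tok = 10         # average token length in bytes
bytes_per_ref = 8          # pointer/length marker held per token

# Every token owns its own copy of the content.
tokenized = n_tokens * bytes_per_tok + n_tokens * bytes_per_ref

# Content is stored once per distinct word; each token still needs its reference.
interned = n_unique * bytes_per_tok + n_tokens * bytes_per_ref

println(tokenized)              # 180000000, i.e. 1.8×10⁸
println(interned)               # 80500000, i.e. 0.805×10⁸
println(interned / tokenized)   # below 0.5, hence "more than halve"
```

The per-token references dominate the interned total, which is why the saving here is a bit better than half rather than an order of magnitude.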
@@ -168,15 +198,12 @@ But they are focused on Pooling for a single array.
168198

169199
The unmaintained (and unregistered) PooledElements.jl, did global pools.
170200
However, no automatic garbage collection.
171-
Also not referentially sane -- magic is required make sure it was workign with serialization etc.
201+
Also not referentially sane -- magic is required to make sure it works with serialization etc.
172202

173203
### What is the downside?
174-
There is basically no downside to InternedString vs String.
175-
String is always 1 pointer allocation + content allocation.
176-
InternedString is always 1 pointer allocation + maybe 1 extra pointer and content (if new).
177-
So worse case you end up paying to allocate 1 extra pointer.
204+
There is basically no downside to interning a String.
205+
It just takes a little time to hash the string to check if it is there or not.
178206

179207
There are more downsides vs SubString.
180-
Substring is per token always 1 pointer + 1 length marker allocation,
181-
never more (but you never get to release the content from its parent).
182-
InternedString is as above (chance at 1 pointer only, chance at more if new), but you get the release the content from it's parent.
208+
Substrings avoid allocating memory for segments of content,
209+
which means you can put off and potentially outright avoid expensive allocations.
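
The SubString trade-off discussed in this README can be checked with Base alone. A minimal sketch, assuming Julia 1.x; the names `doc`, `tok`, and `owned` are illustrative and not part of the package:

```julia
# Illustrative only: `doc`, `tok`, and `owned` are made-up names.
doc = repeat("Gold ", 1000)   # a 5000-byte "document"
tok = first(split(doc))       # split yields SubString{String} tokens

# The token is a view: it keeps its whole parent string reachable.
@assert tok isa SubString{String}
@assert sizeof(parent(tok)) == 5000

# Converting to String copies only the token's bytes, so the copy
# no longer depends on the parent.
owned = String(tok)
@assert owned == "Gold"
@assert sizeof(owned) == 4
```

Holding `tok` keeps all 5000 bytes alive, while holding `owned` keeps 4; that is the allocation a SubString avoids and the retention it risks.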

src/InternedStrings.jl

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,12 @@
11
module InternedStrings
22
using Base
33

4-
export InternedString, @i_str
4+
export @i_str, intern
55

66
include("corefunctionality.jl")
7-
include("operations.jl")
87

98

10-
end # module
9+
Base.@deprecate_binding(InternedString, String, true)
10+
#InternedString(s)=intern(String(s))
11+
12+
end

src/corefunctionality.jl

Lines changed: 31 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,12 @@
1-
const pool = WeakKeyDict{String, Void}()
1+
########################
2+
# The pool/interning lookup core code
23

34
# This forces the type to be inferred (I don't know that the @noinline is required or even good)
45
@noinline getvalue(::Type{K}, wk) where K = wk.value::K
56

6-
7-
@inline function intern!(wkd::WeakKeyDict{K}, key)::K where K
8-
intern!(wkd, convert(K, key))
9-
end
10-
117
# NOTE: This code is carefully optimised. Do not tweak it (for readability or otherwise) without benchmarking
12-
@inline function intern!(wkd::WeakKeyDict{K}, kk::K)::K where K
13-
8+
@inline function intern!(wkd::WeakKeyDict{K}, key)::K where K
9+
kk::K = convert(K, key)
1410

1511
lock(wkd.lock)
1612
# hand positioning the locks and unlocks (rather than do block or try finally, seems to be faster)
@@ -30,28 +26,40 @@ end
3026
return kk # Return the strong ref
3127
end
3228
end
29+
#####################################################
30+
# Setup for types
3331

34-
struct InternedString <: AbstractString
35-
value::String
32+
const pool = Dict{DataType, WeakKeyDict}()
3633

37-
InternedString(s) = new(intern!(pool, s))
34+
@inline function get_pool(::Type{T})::WeakKeyDict{T, Void} where T
35+
get!(pool, T) do
36+
WeakKeyDict{T, Void}()
37+
end
3838
end
3939

40-
macro i_str(s)
41-
true_string_expr = esc(parse(string('"', unescape_string(s), '"')))
42-
Expr(:call, InternedString,true_string_expr)
40+
41+
###################################
42+
43+
function intern(s::T)::T where T
44+
intern!(get_pool(T), s)
4345
end
4446

45-
Base.convert(::Type{InternedString}, s::AbstractString) = InternedString(s)
46-
Base.convert(::Type{String}, s::InternedString) = String(s)
47-
Base.String(s::InternedString) = s.value
47+
intern(s::String)=intern!(get_pool(String), s) # Break stack-overflow
48+
4849

4950

50-
Base.endof(s::InternedString) = endof(s.value)
51-
Base.next(s::InternedString, i::Int) = next(s.value, i)
51+
"""
52+
Substrings are interned as their parent string type
53+
"""
54+
function intern(substr::SubString{T})::T where T
55+
intern(T(substr))
56+
end
57+
5258

53-
Base.:(==)(s1::InternedString, s2::InternedString) = s1.value === s2.value # InternedStrings have refernitally equal values
54-
Base.:(==)(s1::String, s2::InternedString) = s1 == s2.value # use faster than the AbstractString equality check
55-
Base.:(==)(s1::InternedString, s2::String) = s2 == s1
59+
#############################
5660

57-
Base.hash(s::InternedString, h::UInt) = hash(s.value, h)
61+
62+
macro i_str(s)
63+
true_string_expr = esc(parse(string('"', unescape_string(s), '"')))
64+
Expr(:call, intern, true_string_expr)
65+
end
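
The per-type pool design in this file can be sketched in a self-contained form. This is not the package's hand-optimised `intern!`; it is a simplified illustration assuming Julia 1.x (so `Nothing` where the diff has `Void`), and `POOL_LOCK`, `POOLS`, `pool_for`, and `sketch_intern` are hypothetical names. It compares `pointer`s rather than relying on `===`, since on recent Julia versions `===` on `String`s is, as far as I know, content-based.

```julia
# Hypothetical names throughout; a simplified stand-in for the real code.
const POOL_LOCK = ReentrantLock()
const POOLS = Dict{DataType, WeakKeyDict}()

# One weak-keyed pool per concrete type, created on first use.
function pool_for(::Type{T}) where {T}
    get!(() -> WeakKeyDict{T, Nothing}(), POOLS, T)::WeakKeyDict{T, Nothing}
end

function sketch_intern(s::T) where {T}
    lock(POOL_LOCK) do
        pool = pool_for(T)
        existing = getkey(pool, s, nothing)   # an equal key already pooled, if any
        existing === nothing || return existing
        pool[s] = nothing                     # first sighting: remember it weakly
        return s
    end
end

# SubStrings are copied into their parent string type before interning.
sketch_intern(sub::SubString{T}) where {T} = sketch_intern(T(sub))

a = sketch_intern(string("Go", "ld"))          # first "Gold": returned as-is
b = sketch_intern(string("Go", "ld"))          # equal content: the pooled copy
c = sketch_intern(SubString("Golden", 1, 4))   # view: interned as a String

println(pointer(a) == pointer(b) == pointer(c))
```

Because the pool holds its keys weakly, an entry lasts only as long as some strong reference such as `a` does; the sketch trades the real code's hand-placed locking for a single global lock to stay short.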
