+- `InternedString` type is gone. It is deprecated to a plain string, which does not cause immediate interning.
+- Now it is fully transparent, `intern(::S)::S`.
+- Works with all types of input, e.g. Strs.jl strings.
+- Operations (regex or otherwise) on interned strings no longer return interned strings, as there is no longer a type to catch. This is mostly OK: doing all the interning at the end does not change the number of allocations, just the timing.
+- Additional 2.5x speedup on top of v0.4.0
+
+
+v0.4.0
+------
+- Serious performance optimization of the pool lookup: 2-5x speed-up.
+
+
+v0.3.0
+-------
+- More operations, especially regex, on InternedStrings return InternedStrings.
+
+v0.2.0
+-----
+- Basic operations on InternedStrings, like `split`, return InternedStrings.
README.md: 79 additions & 52 deletions
@@ -7,41 +7,76 @@ For not having duplicate strings in memory.

 ## Usage

-`InternedString(s)` returns an interned string.
-it won't allocate new memory if an interned string with that content already exists.
+`intern(s)` returns an interned string.
+The short of it is that you can call `intern(s)` on any strings you expect to have multiple copies of in memory, and you will enjoy memory savings.
+You'll also enjoy much faster equality checks.
+
+
+If a string with that content was interned before, calling `intern(s)` will return (a reference to) the earlier string; if this is the first time the string was interned, it will return (a reference to) its input.
+Using `s = intern(s)`, or otherwise getting rid of old references to the memory that you are interning, allows the old copies to be garbage collected, so you only have memory used by unique strings.
+
+The interned strings are fully transparent -- they are normal references to the original string.
+So when all references to that string (i.e. all "copies" of it from interning) go out of scope, it will be garbage collected.
 And when that interned string goes out of scope, it **will** be garbage collected, so don't worry about it.

 For convenience it also comes in string macro form:
-`i"My String Uses Less Memory than Yours"`, makes an interned string with that content.
+`i"My String Uses Less Memory than Yours"` makes a string with that content and interns it immediately.
+
+### What types can I intern?
+You can intern any type really.
+It doesn't actually have to be a string at all.
+Strange things will happen if you mutate something that has been interned though, so it is recommended for use with immutable types only.
+
+All types go into their own interning pool.
+Except `SubString`s, which are interned into their parent string type,
+as we do not want to be holding on to a reference to the parent string while an interned reference exists.
+You can overload the behavior of `intern(::MyType)` in the usual way.
+
+You might like to intern the strings from [Strs.jl](https://github.com/JuliaString/Strs.jl)
+
+### What exactly is going on?
+If you're not familiar with the concept of string interning, perhaps the following example will help.
+
+```
+julia> using InternedStrings

-Use them just like you would Strings and enjoy your memory savings.
+julia> a = "Gold"
+"Gold"

+julia> typeof(a), object_id(a) # This is the original reference
+(String, 0x2052f7ed641c9475)

-#### `split` and regex the functions don't return substrings anymore :-( :-(
-Yes, `split`ing an InternedString does not make a vector of `SubString{InternedString}`.
-It just makes an `InternedString`.
-Similar for all the regex functions.
+julia> a = intern(a)
+"Gold"

-Ideally we would also change every `SubString{InternedString}` everywhere, to be just `InternedString`.
-But it is a bit too breaking.
+julia> typeof(a), object_id(a) # No change, still the same memory
+(String, 0x2052f7ed641c9475)

-SubStrings and InternedStrings solve roughly the same problem.
-But with different techniques and trade-offs.
-If you are using InternedStrings you probably don't want a substring anywhere.
-Since you might mistakenly end-up holding on to a really big string.
-The very problem this is designed to avoid.
+julia> b = "Gold"
+"Gold"

-Please raise issues if you find functions that are returning SubStrings,
-that shouldn't be.
+julia> typeof(b), object_id(b) # New memory, see the different ID
+(String, 0x927fe26348e44a27)
+
+julia> b = intern(b) # Replace it
+"Gold"
+
+julia> typeof(b), object_id(b) # See it is the same memory as for the original `a`
+(String, 0x2052f7ed641c9475)
+
+# Now the memory allocated to "b" with id=0x927fe26348e44a27 can be garbage collected
+
+julia> object_id(intern("Gold")) # Same again
+0x2052f7ed641c9475
+```


 ## Motivation (/Ranting)
-In natural language processing,
-when looking at a document,
+In natural language processing, when looking at a document,
 the first thing to do is to break it up into tokens.
-Tokenization can often be done simply:
-the most simple-case is just `split`,
+Tokenization can often be done simply: the simplest case is just `split`,
 more complex use some regex, or even something fairly sophisticated.
+See [WordTokenizers.jl](https://github.com/JuliaText/WordTokenizers.jl)

 There is an issue though:
 How much are these tokens costing you in memory use?
@@ -77,39 +112,33 @@ If you are smart you will spot it and convert them to Strings, so the content ca
 But I am not smart, and have made that mistake many times.


-One option is to use [WeakRefStrings.jl](https://github.com/quinnj/WeakRefStrings.jl).
-In those, keeping your WeakRef substrings in memory won't keep the original string in memory.
-Only now you are responsible for managing that memory yourself.
-And for strengthening those references as required.
-
 So there has to be a better way.
 We want to:

-1. Have lots of Strings, without lots of allocations (like SubString/WeakRefString, unlike String)
-2. Not have to worry about mistakenly keeping original huge source string in memory (like WeakRefString/String, unlike SubString)
-3. Not have to worry about managing the memory of the strings ourself (like SubString/String, unlike WeakRefString)
-4. Just outright use less memory. (Unlike any String string type)
+1. Have lots of Strings, without lots of allocations (like SubString, unlike String)
+2. Not have to worry about mistakenly keeping the original huge source string in memory (like String, unlike SubString)
+3. Not have to worry about managing the memory of the strings ourselves
+4. Just outright use less memory.

 Can we do that? Yes we can.

-#### InternedString
+#### `intern`

-Every InternedString is a strong reference to a real String.
-But unlike normal Strings, if two InternedStrings are content equal, they are reference equal.
+The value returned by `intern` is a strong reference to a real String.
+But unlike for normal use of Strings, if `s1 == s2` then `intern(s1) === intern(s2)`, i.e. strings that are content equal are reference equal (once interned).
 That is to say if they look like each other, then they are each other.

-When a new InternedString is created,
-before allocating new memory, we check to see if there already is an InternedString with that content, and if so we just grab a (Strong) reference to that existing String.
-This solves point **1.** by reducing allocations, (though not as much as SubStrings, which only have to allocate their pointers and length markers)
-
+When a string is interned, we check to see if there already is an interned string with that content, and if so return it.
+Interning a string has no ongoing new allocations -- not even the pointer and length marker that `SubString` has.
+This solves point **1.** by reducing allocations.

 You don't have to worry about mistakenly keeping the huge source string in memory (as you do with `SubString`),
 as they do not have a reference to that huge string, unless they **are** that huge source string.
 So that solves point **2.**

 On point **3.** you don't have to worry about managing the memory yourself,
-because each InternedString is a strong reference to its content.
-Once the last InternedString with that content goes out of scope (and is garbage collected),
+because each is just a normal reference to its content.
+Once the last string with that content goes out of scope (and is garbage collected),
 removing the copy in the interning pool will be handled automatically (it is a WeakRef, so won't keep it alive).


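The lookup-then-reuse idea described under `intern` above can be sketched in a few lines of Base Julia. This is only an illustration, not the package's implementation: `TOY_POOL` and `toy_intern` are hypothetical names, and the pool here is a plain `Dict` holding strong references, so unlike the WeakRef-based pool described above its entries are never garbage collected.

```julia
# Hypothetical toy pool: maps string content to the first String seen with that content.
# Assumption: a strong-reference Dict stands in for the real WeakRef-based pool.
const TOY_POOL = Dict{String,String}()

# Return the pooled String with the same content as `s`,
# storing `s` itself if no such String has been pooled yet.
function toy_intern(s::AbstractString)
    str = String(s)
    return get!(TOY_POOL, str, str)
end

a = toy_intern(join(["Go", "ld"]))  # first "Gold": stored in the pool and returned
b = join(["Go", "ld"])              # a second "Gold", built separately at run time
c = toy_intern(b)                   # lookup finds the pooled copy and returns it

println(a == b)                     # content equal
println(pointer(a) == pointer(c))   # `c` refers to the same memory as `a`
```

After `b = toy_intern(b)`, nothing would refer to the second copy any more, which is what lets the garbage collector reclaim it.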
@@ -120,14 +149,15 @@ The original 10⁸ byte document, with 10⁷ words probably only has about 50,00
 is has 3.5×10⁵ words, but that is before rare words, numbers etc are removed)
 At an average of 10 bytes long you only need to be keeping 5×10⁵ bytes of content,
 plus for each 8 bytes of pointers/length markers (8×10⁴), plus 1 byte each for null terminating them all. (Grand total: 5.9×10⁵ bytes vs original 10⁸+9 bytes).
+The only difference memory-wise between tokenizing into Strings or SubStrings is that the memory for the content in SubStrings is all contiguous, whereas for Strings it needs to be reallocated.
+

-Since each `InternedString` is only one pointer (to the actual String)
-you only have 4×10⁷ bytes of pointers (don't need the 4 bytes of length markers).
-vs SubString's 8×10⁷ bytes of pointers/length markers,
-or individual String's 9×10⁷ bytes of pointers/length markers/null terminating.
+- Original: 10⁸ byte content, 8 bytes pointers/length markers (To be tokenized to 10⁷ words)
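
To make the String versus SubString trade-off in this section concrete, here is a small sketch using only Base Julia. The sample sentence is made up for illustration, and no interning is involved.

```julia
doc = "the quick brown fox jumps over the lazy dog"

# `split` returns views into `doc`, not copies of its content.
tokens = split(doc)
println(eltype(tokens))

# Each SubString keeps a reference to its parent string,
# so holding on to a single token keeps all of `doc` alive.
println(parent(tokens[1]) == doc)

# Converting to String copies each token's bytes into its own allocation
# and drops the reference to `doc`.
owned = String.(tokens)
println(eltype(owned))
```

With `owned`, the original document can be garbage collected, at the cost of one allocation per token, duplicates included.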