mmap-persistent SymSpell spell checking algorithm


This is a from-scratch implementation in Nim of Wolf Garbe's Symmetric Delete Algorithm for correct spelling suggestions. The basic idea is simple: corpus words within edit distance N of a query can only be at most N units longer or N units shorter and the shorter must be derived from deletes of longer strings. So, build a map from all shortenings to all corpus words which generate them. When queried with a string, we can then lookup all possible edits lengthening the query into a corpus word. We also on-the-fly compute all shortenings of a query (& shortenings of lengthenings). From both sets, we filter "maybe within N edits of a corpus word" to "actually within N edits". The filter can be any "distance" successfully bounded by N-indels. This idea is like Monte Carlo numerical integration of shapes within easy bounding boxes (but this is deterministic & points which pass are reported, not just counted).

While this does allow for fast queries, it takes takes a long time and a lot of space to make a lengthenings table compared to thousands of linear scans of a large-ish correct word corpus. So, it is a useful strategy if you A) can save the big map to disk and very efficiently load it and/or B) have many queries to amortize build costs over. Rebuilding is lame while "real" DB query latency is hostile. So, this module does an efficient external file format with five files to "mmap & go":

A .tabl pointing to (keyIx->varlen.keys, .sugg=varlen[array[CNo]])
and a .meta(ix,cnt) file pointing to varlen .corp.

varlen[arr[CNo]] is an typical allocation arena with early entries the heads of per-list-size free lists.

I originally wrote this to understand and perhaps debunk SymSpell, though the work has (sort of) validated it. Or not. Let you, dear reader, be the judge. I think that I have at least found information I didn't see elsewhere that brackets its applicability which is worth letting people know about. In particular, SymSpell offers only modest speed-up vs-linear scan at large (4,5,..) edit distances of a medium- sized (40 kWord) corpus, as shown in this plot. This does roughly contradict Garbe's "large distance, large dictionary" sales pitch. False positive rates for d>3 probably makes that regime uninteresting. Still, SymSpell benefit remains only 7.3x-ish for 80 kWord @d=3 which is not great. Indeed, parallel storage/processing optimizations on both linear scan and SymSpell querying might even nullify such a small advantage.

The basic experimental set up is to use "frequency_dictionary_en_82_765.txt" from the SymSpell repository as our input. We create synthetic "batches of typos" by sampling words from that very same distribution and then deleting one or more random characters. Create 2000 synthetic batches and feed these into a 'compare' mode of 'suggest' that times both a linear scan approach and an mmap & query SymSpell approach. The averages should thus be pretty representative. Batch size of 6 typos per document/user interaction felt about right, but typo rates probably vary a lot. The script is provided for users wanting to study how things vary. The error bars are the standard deviation of the mean. The distribution is wide with 95th percentile times often 4X the 5th percentile.

Both linear scanning and SymSpell querying have an additional optimization of shrinking the max distance passed to optimized edit distance calculators once "enough" correct suggestions have been found at lesser distances. This can speed up such distance computations somewhat, especially for linear scans of short words near popular portions of word-space or when very few matches are requested.

You can see the general scaling of SymSpell costs with max distance and vocabulary size from this plot.

Along the way, I also found that SymSpell is a pretty good stress test for a some system-layer functionality - in particular memory allocators, string hash functions, and even virtual memory operation. Indeed, unless you really need that last 10x-50x speed boost from 100s of microseconds to 10s of microseconds, it is likely not advisable to try to implement SymSpell (at least without the systems layer warnings in this README). A simple linear scan of a corpus which has been pre-sorted by descending recommendation order is fast enough and very simple - you needn't even sort results. You simply bin them by edit distance.

For the curious about the systems layer problems, sensitivity to string hash functions is perhaps the most obvious. Because each key in the main table is formed by 1-deletions (and more) of other keys the amount of key similarity is an almost hash-attack level end-point. One doesn't quite need cryptographically secure hash functions, but almost any "weak but fast" will be defeated. The symptom is the usual one - long collision chains and slow operations on what is usually a gigantic table. My timing results use a variant of the late Robert Uzgalis' BuzHash that runs in (1.75 cycles + 0.63 cycles/byte). The default Nim hash also works all right, but will have not quite as good numbers. (For one, it's 4.3cyc+4.2 cyc/byte, but also not quite as randomizing.) I tried hard to have SymSpell be cast in the best light possible here, and that meant using the best hash. Similarly, I did not try to implement the prefix optimization that discusses which slows things down a little at some significant memory savings.

The memory allocator problem comes from a long-tailed distribution of how many "correct suggestions" any given typo has. The distribution shape defeats most attempts to "speed-up" memory allocators with, say, "power of two" spaced region sizes. Almost any spacing besides the minimal one results in very low space utilization by suggestion lists. Indeed, an early non-persistent version of the code blew up most GCs Nim offers. Only the Boehm-Demers-Weiser garbage collector actually allowing completion of a table build in reasonable time. Thankfully, minimal spacing is fast enough and we get 70% utilization or so.

Finally, while 4096 byte virtual memory pages are rarely a performance obstacle, the size and access pattern of SymSpell queries is particularly hostile to use from a fresh mmap. For larger dictionaries and distances <=~ 3, using 2M pages so-called "huge TLB" pages resulted in >2x speed-ups for a "fresh mapping", as can be seen in this graph That relative speed-up of large pages does owe to the small absolute time SymSpell queries take, of course, but still indicates many non-local 4k page accesses (which will become relevant in later discussion).

It also bears mentioning that table building time is still costly in this fairly optimized implementation. When using a Linux tmpfs RAM filesystem the resource usage looks like this:

maxDist dt(sec) size(bytes) Misspells
   1      0.182  19,456,529    647786
   2      0.957  76,853,412   2562636
   3      3.203 175,542,825   6852017
   4      8.127 364,447,609  13981420
   5     17.034 694,357,352  23487763

Those times could be sped up 1.2-1.3x by clairvoyantly pre-sizing the table to avoid hash table resizes. Some reasonably precise formula for the number of delete misspellings for a given (corpus, distance) might be able to provide this modest speed-up systematically. It is also unlikely that "less persistent" implementations can achieve much better build times.

maxDist dt(sec) size(bytes) Misspells
   1      0.141  19,456,529    647786
   2      0.710  76,853,412   2562636
   3      2.589 175,542,825   6852017
   4      6.654 364,447,609  13981420
   5     13.891 694,357,352  23487763

Also Garbe claims significant space reduction for modest slow downs which should also help build times. The main take-away though, is just making concrete my "long time and a lot of space" from the introductory paragraphs. This puts real pressure on saving the answer of this build, especially for larger dictionaries with longer words. One really does need thousands of future queries before the investment in build time pays off in query performance. This natural "save the answer" response then begs the question of cold-cache performance.

The corpus file alone is a mere 752,702 bytes and can be scanned in hundreds of microseconds off a modern NVMe storage fully cold-cache. Cold-cache, non-RAM FS times for SymSpell, meanwhile degrade dramatically, especially on high latency storage like Winchester drives. Every corrections query involves at least one, but sometimes hundreds or even in rarer cases thousands of random accesses to the .tabl file for weakly related keys. Arranging for locality to such accesses seems challenging if not impossible.

For example, Cold-cache scanning on a 10 ms + 1e-5 ms/MB (aka 10 ms, 100 MB/s) Winchester drive or network storage unit would probably be approximately 10 + 1e-5*750e3 or about 17.5 ms, generally half the time of a single frame of a typical video file. Getting suggestions for multiple words can easily share that IO as well, making amortized per-word time for batch-of-6-typos more like 3 ms. Meanwhile, just one cold-cache SymSpell query on freshly opened data files could take 100s of random accesses or seconds (at 10 ms latency).

We can measure this cost directly via getrusage instrumentation to measure minor page faults. suggest query2 does this. In a cold cache scenario, these would all be major 10 ms-ish faults with very little locality. Results for a sample of 10,000 typos at various distances are this graph We see that for the fairly relevant d=3, 5% of the time there are 200+ accesses and 1% of the time there are 373 accesses which translate to 2..4 seconds at 10 ms / access, 100..200x the 17.5 ms of a cold-cache linear scan. Running this measurment against an actual 10 ms-slow device would be quite slow. This 10,000 sample case did 2870453 faults, which would be 287045 seconds which is 80 hours. Even a 1000 sample case would take a good fraction of a day.

While multiple words in a linear scan share IO perfectly, this is not the case for the random access pattern of SymSpell IO. For the first batch of 6 off a cold cache, a very noticable fraction of a minute is conceivable for SymSpell and not very conceivable for a linear scan. Indeed, if multiple words are expected, the wisest SymSpell IO strategy would be to get the whole data set paged into RAM via streaming IO (only 1.75 sec for the d=3 case @100 MB/s) and then run the queries.

This cold-cache scenario is yet another "system layer performance fragility" of SymSpell. Notice that linear scan's order 10s of ms cold-cache can now be 100x faster than SymSpell's seconds. Indeed the probability is order 50% that the first typo query will be 25x slower or worse than a linear scan as well as this time being in an end-user noticable regime. Faster storage than 10 ms + 1e-5 ms/MB storage like SSDs and NVMe is more common these days, but even keep SymSpell a performance winner in deployment, one wants to ensure cached pages.

Of course, the above commentary also applies to non-persistent SymSpell implementations in execution environments where swap/page files are possible. Competition for memory may be beyond a developers control, and pre-paging non-persistent data can be more involved than "cat spell.* > /dev/null". Also, SymSpell's large memory requirements make it likely one of the fiercer memory competitors. There are of course system facilities to help with this problem such as mlock and MAP_LOCKED and such (also a bit easier to use with persistent files than volatile "language runtime" data structures), if the developer thinks to use them. [ They may also require CAP_IPC_LOCK or superuser priviliges. ]

The TL;DR? While a well-implemented SymSpell with a well guarded deployment environment can indeed be always faster than a similarly well-implemented linear scan, it is far more "performance risky" without a variety of cautions. It may be 10-100x faster than a linear scan in some hot cache circumstances or 10-100x slower in cold-cache circumstances. Meanwhile, a cold-cache linear scan might be only about 20x worse than a hot-cache linear scan while for SymSpell cold vs hot could be 10000x different. In those terms, SymSpell is 500x more performance risky than a linear scan, not even considering things like allocator and hash function implementation risk.

The TL;DR;DR? "YMMV from hell". ;-)

Need help? Read Nimble
No tagged versions available
License: MIT

Project website Hosted docs