In a time critical section of a recent project, I came across having to optimize the conversion of three digit US month abbreviations (as commonly found in log files) to integers in C++. That is, for "Jan" yield 1, for "Feb" yield 2, etc, for "Dec" yield 12.
In C++ the simplest implementation probably looks like the following:
std::string string; // input value
std::transform (string.begin(), string.end(),
string.begin(), ::toupper);
if (string == "JAN") return 1;
if (string == "FEB") return 2;
// ...
if (string == "DEC") return 12;
return 0; /* mismatch */
In many cases the time required here is fast enough. It is linear in the number of months and depending on the actual value being looked up. But for an optimized inner loop I needed something faster, ideally with running time independent of the actual input value and avoiding branch misses where possible. I could take advantage of a constrained input set, which means the 'mismatch' case is never hit in practice.
To summarize:
-
Find an integer from a fixed set of strings.
-
Ideal runtime is O(1).
-
False positives are acceptable, false negatives are not.
That actually sounds a lot like using a hash function. After some bit fiddling, I ended up using a very simple and quick static function that yields a result instantly but may produce false positives.
I also had to get rid of using std::string
objects to avoid allocation penalties. This is the result:
static constexpr const char l3month_table[] = {
12, 5, 0, 8, 0, 0, 0, 1, 7, 4, 6, 3, 11, 9, 0, 10, 2
}; // 17 elements
/// Lookup month from 3 letters, with 30% chance returns 0 for invalid inputs.
static constexpr inline unsigned int
l3month (const char *l3str)
{
return l3month_table[((l3str[1] & ~0x20) +
(l3str[2] & ~0x20)) %
sizeof (l3month_table) /
sizeof (l3month_table[0])];
}
The hash function operates only on the last 2 ASCII letters of the 3-letter month abbreviations, as these two are sufficient to distinguish between all 12 cases and turn out to yield good hash values. The expression (letter & ~0x20) removes the lowercase ASCII bit, so upper and lower case letters are treated the same without using a potentially costly if-branch. Adding the uppercased ASCII letters modulo 17 yields unique results, so this simple hash value is used to index a 17 element hash table to produce the final result.
In effect, this perfectly detects all 12 month names in 3 letter form and has an almost 30% chance of catching invalid month names, in which case 0 is returned - useful for debugging or assertions if input contracts are broken.
As far as I know, the function is as small as possible given the constraints. Anyone can simplify it further?