Aug 052014
 

Map Jan-Dec to 1-12

In a time critical section of a recent project, I came across having to optimize the conversion of three digit US month abbreviations (as commonly found in log files) to integers in C++. That is, for “Jan” yield 1, for “Feb” yield 2, etc, for “Dec” yield 12.

In C++ the simplest implementation probably looks like the following:

std::string string; // input value
std::transform (string.begin(), string.end(), string.begin(), ::toupper);
if (string == "JAN") return 1;
if (string == "FEB") return 2;
// ...
if (string == "DEC") return 12;
return 0; /* mismatch */

In many cases the time required here is fast enough. It is linear in the number of months and depending on the actual value being looked up. But for an optimized inner loop I needed something faster, ideally with running time independent of the actual input value and avoiding branch misses where possible. I could take advantage of a constrained input set, which means the ‘mismatch’ case is never hit in practice.

To summarize:

  • Find an integer from a fixed set of strings.
  • Ideal runtime is O(1).
  • False positives are acceptable, false negatives are not.

That actually sounds a lot like using a hash function. After some bit fiddling, I ended up using a very simple and quick static function that yields a result instantly but may produce false positives. I also had to get rid of using std::string objects to avoid allocation penalties. This is the result:


static constexpr const char l3month_table[] = {
  12, 5, 0, 8, 0, 0, 0, 1, 7, 4, 6, 3, 11, 9, 0, 10, 2
}; // 17 elements

/// Lookup month from 3 letters, with 30% chance returns 0 for invalid inputs.
static constexpr inline unsigned int
l3month (const char *l3str)
{
  return l3month_table[((l3str[1] & ~0x20) + (l3str[2] & ~0x20)) %
                       sizeof (l3month_table) / sizeof (l3month_table[0])];
}

The hash function operates only on the last 2 ASCII letters of the 3-letter month abbreviations, as these two are sufficient to distinguish between all 12 cases and turn out to yield good hash values. The expression (letter & ~0x20) removes the lowercase ASCII bit, so upper and lower case letters are treated the same without using a potentially costly if-branch. Adding the uppercased ASCII letters modulo 17 yields unique results, so this simple hash value is used to index a 17 element hash table to produce the final result.

In effect, this perfectly detects all 12 month names in 3 letter form and has an almost 30% chance of catching invalid month names, in which case 0 is returned – useful for debugging or assertions if input contracts are broken.

As far as I know, the function is as small as possible given the constraints. Anyone can simplify it further?

Tweet about this on TwitterShare on Google+Share on LinkedInShare on FacebookFlattr the authorBuffer this pageShare on RedditDigg thisShare on VKShare on YummlyPin on PinterestShare on StumbleUponShare on TumblrPrint this pageEmail this to someone

[suffusion-the-author display='description']

  5 Responses to “C++ Month Name Hashing”

  1. For the record, gperf is the tool to produce perfect hashes. Here is a slightly edited version of what it produces for your problem:


    static unsigned int
    hash (const char *str)
    {
    static unsigned char asso_values[] =
    {
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 0, 9, 4, 12, 2,
    12, 9, 12, 12, 12, 12, 8, 12, 2, 4,
    3, 12, 1, 12, 3, 1, 4, 12, 12, 0,
    12, 12, 12, 12, 12, 12, 12, 0, 9, 4,
    12, 2, 12, 9, 12, 12, 12, 12, 8, 12,
    2, 4, 3, 12, 1, 12, 3, 1, 4, 12,
    12, 0, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    12, 12, 12, 12, 12, 12
    };
    return 1 + asso_values[(unsigned char)str[2]] + asso_values[(unsigned char)str[1]];
    }

    • Thanks Alexander, I’ve looked at gperf before, but it’s table is much bigger than l3month(), and of course you still need the masking/conversions for lower/upper case.

      • gperf also has an –ignore-case option.

        Here’s another manually constructed method: Use a single table lookup with a 1byte hash constructed from the 4 LSB of bytes 2 and 3.

        unsigned h = (str[1] & 0x0f) | ((str[2] & 0x0f) << 4);
        return lookup[h];

        This one seems to be roughly on par with the version gperf produces, performance-wise.

        It does a little bit more bit fiddling, but only one table lookup, which might give an edge on machines with a slow L1 cache.

        It still uses a lookup table of 256 elements just as the version gperf produces, but for the cache it makes no difference. And maybe you can allocate it dynamically for bulk conversions?

  2. Oops, Please ignore the previous comment. I have not tested the code. It returns months in the wrong order.

  3. You can save one indexing and one &:


    int month2int_littleendian(const char *month)
    {
    const char hash[] = { 3, 12, 8, 2, 1, 11, 7, 5, 0, 10, 4, 9, 6 };
    return hash[(* (const int32_t *) month & ~0x20202020 + 146732) % 13];
    }

    You need to implement a different function for big-endianness though.

    Modulo 13 is selectable: you can use another modulo value (it should likely be coprime with 16) by also changing the offset number and the hash table. I found the above one with an helper program.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>