Wednesday, October 5, 2011

The day Sequences saved the world

In my last post I brought up the idea of reverse indexes and how those could save you from latch contentions.Well, after another day of hard work it turns out that the new Sequence feature in SQL Denali really is a kind of universal life saver for high load OLTP applications… I don’t know how many sequences we did already to get rid of latch contention and last page inserts, but I can tell you, it were quite a few…

There are two things I have to add though regarding my last post:

First… If you run sequences in really high load environments you have to use the CACHE feature. In the case of a server crash this might leave you with gaps in your sequence, but if you don’t use the Cache you will get locking issues on the sequence once you hit somewhere around 10.000 Fetch next statements per second.

And second… While my T-SQL bit reverser works perfectly fine it is sort of CPU intensive… Our 80-core server burnt about 1% CPU per 1000 rows inserted, with more than half of that going into the bit reverser. I did a lot of tests following this finding, and it almost hurts me to say that, but in this one case SQL CLR really is the best solution you can have… Using the CLR function instead of the T-SQL function we are almost down to nothing for the bit reverse.

Oh, and here is the code of the CLR function: (If anyone wants the compiled DLL or the complete solution please ping me and I’ll mail it to you.)

public static Int64 Reverse(Int64 inVal)
    ulong Result = 0;
    ulong Input = (ulong)inVal;
    int i;
    for (i = 0; i < 63; i++) // 64 bits...
        Result <<= 1; // Shift result by one digit
        if ((Input & 1) == 1) // If lowest bit of input is 1 add one to result
        Input >>= 1; // Now shift the input so that the next bit is lowest
    return (Int64) Result;


  1. Excellent blog post Rick, you beat me to it. I have just posted a blog entry that adds more data to support your test.

    You should see a pingback from my blog very soon

  2. Actually Rick... How about this way to generate the sequence (if we assume you are a BIGINT):

    @@SPID * 10000000000 + [Next Sequence Number]

  3. Nice idea Thomas, but I see two drawbacks:
    1) You loose IDs (as you can't predict a SPID), and this could also lead to duplicates if you are not extra carefull.
    2) SPIDs tend to group together, especially if your app uses connection pooling...

  4. RSince you are adding a sequence and a large offset, it will not result in duplicates. The sequence is already unique, and offsetting it by 10B will only make it "non unique" if you have more than 10B inserts per SPID (unlikely?)

    The grouping together is exactly the idea of the SPID trick. To avoid page latch, you have to "split the key space". Bit flipping achieves this - but so does the SPID splitting. For SPID splitting, each 10B range will have its own little "subspace" that where inserts are linear (which is OK; because PAGELATCH only happens at concurrency, which you wont have per SPID, only per server).

    The SPID grouping should give you less fragmentation, but more importantly, less page splits (but it does require the assumption that you can "pre-partition the space" - your bitflip doesnt require that.

  5. 10B inserts are not that unlikely... Two years ago someone told me that we would never need more than int16... now we are at int64 already...

    The point behind grouping is that if I have two SPIDs reused (e.g. 52 and 53) by two heavy load servers I get a grouping around those two spids, while every other "subspace" might be less congested.
    You are still right that this will give you good segmentation at less CPU cost, but when you need the best possible segmentation I still think I am a little ahead ;-)

  6. Good point on the SPID grouping consuming part of the "key space" faster than others.

    Another options to consume the space evenly would be to use scheduler_id (since that is guaranteed to be approximately equal)

    Would be interesting to see how these methods measure up against each other on CPU and page splitting

  7. Instead if looping, how does this attempts stand up? I know it's only 32-bit.

    unsigned int v; // 32-bit word to reverse bit order

    // swap odd and even bits
    v = ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1);
    // swap consecutive pairs
    v = ((v >> 2) & 0x33333333) | ((v & 0x33333333) << 2);
    // swap nibbles ...
    v = ((v >> 4) & 0x0F0F0F0F) | ((v & 0x0F0F0F0F) << 4);
    // swap bytes
    v = ((v >> 8) & 0x00FF00FF) | ((v & 0x00FF00FF) << 8);
    // swap 2-byte long pairs
    v = ( v >> 16 ) | ( v << 16);

  8. Thanks for that input Unknown... Took me a while to wrap my head around it, but it is a really cool one. Recoded in Assembler it takes a little over 100 CPU cycles, which is another improvement over what I had. And it works with 64bit just as it does with 32...
    I think I will stresstest that and post it, if it should stand up.

  9. Thank you sir. Can you please send me the complete code (CLR) for key generation using bit reversal.


  10. it’s ok to show some appreciation and say ‘great post’
    Asp .NET developer