Problem: Given a 32bit integer, return the number of set bits in its binary representation.
Terminology: bit, crumb (2 bits), nibble (4 bits), byte (8 bits), playte (16 bits), dynner (32 bits).
Solutions:
 Use a lookup table. Split the 32 bit word into 2×16 (or 4×8, whichever is the biggest that can still fit in RAM/CPU cache) bit numbers, then compare each of the bytes individually. Advantage: Depending on CPU architecture, may be the fastest method. Disadvantage: Have to fit lookup table into L1 cache.

Use Kernighan’s (Wegner’s) method: While(v){v&=v1}. Subtraction by 1 clears the least significant set bit, so masking a number with itself minus one just clears its least significant set bit. So worst case runtime is O(n) – same as naive approach.

Treebased/divideandconquer/parallel method (for 32 bit numbers with 32 bit arithmetic and bitwise operations): First get the number of set bits in each crumb, then number of set bits in each nibble, then byte. Then multiply by 0x01010101 to add up the 4 bytes and you have the answer. The reason we multiply after gathering the popcount in each byte is because the popcount may be up to 32, which is too big to be represented by a nibble (2^4 = 16), so we use a byte.
The way this works is that each operation takes the sums up a level, until you get to the level where you can directly add up all the sums with a single multiply operation.
We want to map each original crumb to a count of that crumb. Thus, each crumb should be mapped to a crumb as follows:
00 > 00 unchanged
01 > 01 unchanged
10 > 01 changed
11 > 10 changed
Notice the crucial observation here that when the high bit is unset, the result is the same, and when the high bit is set, then the result is simply the original value minus one (2>1, 3>2).
In other words, we can simply subtract the high bit from the original crumb. If the high bit was 1, then we subtract 1. Otherwise we subtract zero leaving it unchanged as desired.
Therefore we want to bring the original crumb’s high bit to the low bit of the new crumb and ignore everything else. So we shift the original crumb right by one (and we AND it with 01 because we don’t care about the bit to the left of the high bit, we only want either 1, if the high bit was set, or 0, if the high bit was unset) and then subtract it from the original crumb.
Thus the following statement:
v = v – ((v>>1) & 01010101010101010101010101010101);
Now that we have the number of set bits in each crumb, we can use that to obtain the number of set bits in each nibble. So we want to simply add up the crumbs into nibbles:
0000 > 0000
0001 > 0001
0100 > 0001
0010 > 0010
1000 > 0010
0101 > 0010
1001 > 0011
0110 > 0011
1010 > 0100
Notice here that we are no longer counting the number of bits but summing the counts. Therefore we do an addition operation instead of subtraction.
We want to add every crumb with the crumb that is adjacent to it, i.e we want to add the right crumb to the left crumb. Notice that we are now operating on the nibble level right now – we want to sum up the crumbs in each nibble so that a single number is left in the nibble representing the sum of its crumbs.
We can’t use multiply here because each crumb may contain a count of 02, and each crumb can only represent a number from 03. 2+2 is 4 so the sum of the popcounts in two crumbs cannot be stored in a crumb so we need to merge them into a nibble. The way we do that is we bitmask the original number to get only the rightmost crumbs in each nibble, and then we right shift the original number by 2 and then bitmask it by the same bitmask so we get the leftmost crumbs in each nibble, then add these two numbers together to obtain the sum of the crumbs in each nibble.
The reason that we have to mask out the higher crumb and lower crumb before adding them up is because the addition may overflow i.e the sum of the popcounts may be greater than the maximum value representable by a crumb. So we could have 2 + 2 = 0b10 + 0b10 = 0b100. In this case if we simply added the original to the shifted version then we would have some junk in the higher crumb which may interfere with the overflow bit during the addition, which is why we have to zero the higher crumb out before adding them.
Thus:
v = (v & 00110011001100110011001100110011) + ((v>>2) & 00110011001100110011001100110011);
Can we do a multiply here? As we discussed above, a nibble can only represent 015 which means we can’t do a multiply to add up all the popcounts in each nibble into a single nibble, so we do the next best thing which is to add up each pair of nibbles into a byte. This is done similarly to what we did for crumbs except we don’t have to bitmask the two numbers before adding them up.
To see why we don’t need to mask them first, consider the maximum possible popcount in each nibble. Each nibble contains the popcount inside that nibble. Each nibble has 4 bits which means the popcount in each nibble is 04. Now add up two nibbles and the result is 08. A nibble can represent 015 which means the result of adding up the two nibbles will not overflow. This means we do not have to bitmask the two numbers before adding them as we know that the addition will not overflow meaning the result of the addition will store the correct result in the lower nibble (if we add the original number to the number rightshifted by 4 bits).
After the addition is done, since the correct results are stored in the lower nibbles, we then mask out the higher nibble to get the correct popcount in each byte.
Thus:
v = v + (v >> 4) & 0xF0F0F0F
In C, bitwise addition has higher precedence than bitwise and.
Now that we have all the popcounts in each byte, we can simply add up all the bytes with a single multiply operation into the highest byte:
v = v * 0x01010101
It works like this:
0x01010101 is 0000 0001 0000 0001 0000 0001 0000 0001 in binary.
Denote a 32 bit number as composed of bytes A, B, C, D like this: ABCD
Then ABCD * 0x1010101 is equal to:
ABCD * 0001 + ABCD * 0001 0000 0000 + ABCD * 0001 0000 0000 0000 0000 + ABCD * 0001 0000 0000 0000 0000 0000 0000
Which is equal to:
ABCD + BCD 0000 0000 + CD 0000 0000 0000 0000 + D 0000 0000 0000 0000 0000 0000
Which means that you get:
Highest byte = A + B + C + D
2nd highest byte = B + C + D
3rd highest byte = C + D
Lowest byte = D
Which means the sum of the 4 bytes is stored in the highest byte.
Thus:
v = v >> 24
return v
The one presented here may not be faster than the lookup table, depending on the processor architecture (e.g size of the L1 cache). Treebased methods have the advantage that they do not require additional memory, and are on average much faster than Kernighan’s method (unless very few bits are set).
My own implementations of popcount
Original code can be found here