# Lecture 3: Bit Hacks

Flash and JavaScript are required for this feature.

Description: Prof. Shun discusses an array of bit hacks, the types of hacks compilers do, and bit hacks to do by hand when the compiler doesn’t optimize.

Instructor: Julian Shun

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JULIAN SHUN: Good afternoon, everybody. So welcome to the third lecture of 6.172. Today we're going to talk about bit hacks, and today's going to be a really fun lecture. So, first of all, let's recall the binary representation of a word. So a w-bit word is represented as follows. So we're going to number the bits from x0 to xw minus 1 starting from the rightmost side.

And the unsigned integer value stored in x with this binary representation can be computed as follows. So it's essentially the sum of a whole bunch of powers of 2. And you sum the product of the bit with the appropriate power of 2. So if the bit is 1 in position k, then you multiply by 2 to the k. And if it's 0, then you just add 0.

So, for example, let's say we have this 8-bit word here. And if we apply this equation, we get-- first we get 2 because there is one bit in the first position. So we multiply 1 by 2 to 1, which is 2. Then in the second position, we also have a 1. So we multiply 1 by 2 to the 2, which is 4. And then we have 16 and 128. So we just sum up all of these powers of 2 and that gives us the unsigned integer value.

And that 0b prefix here represents a Boolean constant. So that means we're going to interpret this constant as a Boolean value. There's also signed integers. So you can also represent negative numbers, which is useful, and this is called the two's complement representation. And here's the formula for computing the two's complement representation of a word.

So for bit 0 all the way up to bit w minus 2, you do the same thing as above. But for the leftmost bit or bit w minus 1, you subtract that bit multiplied by 2 to the w minus 1. So for this example here, we saw 2 plus 4 plus 16. That's the same as above. But for the leftmost bit, since we have a 1 here, we're going to subtract 2 the 7, which is 128. And this gives us the signed value for the integer, which is negative 106.

Does that make sense? Any questions about this representation? So the leftmost bit is known as a sign bit because it tells you whether you need to subtract by this negative value or not. So if it's 0, then you don't have to subtract anything. If it's 1, then you subtract by a large integer value.

So in two's complement, the all 0's word is just 0. So you just apply the above formula and everything is 0. So you just get 0. What's the value of the all 1's word? Yes.

AUDIENCE: 1.

JULIAN SHUN: Negative 1, right? So the reason why it's negative 1, so you can just use the formula. And we're going to sum up a bunch of powers of 2. All of the x sub k's are going to be 1. So we're summing up 2 to the k from k equals 0 to w minus 2, and that's a geometric series which sums to 2 to the w minus 1 minus 1. And then for the sign bit, we're going to subtract 2 to the w minus 1. So now the 2 to the w minus 1's cancel out and we're just left with negative 1.

So this is an important property to know about two's complement representation. The all 1's word is just negative 1. And this leads to important identity which says that x plus the one's complement of x-- the one's complement is just all the bits of x flipped-- is equal to negative 1. This is because if you add x with all of it bits flipped, then you're just going to end up with the all 1's word. And we saw on the previous slide that that's equal to negative 1.

And from this identity, we have that negative x is equal to the one's complement of x plus 1. So this relates the two's complement to the one's complement representation. Let's look at an example. So let's look at-- let's say x is equal to this constant here. The one's complement of x or tilde of x is just all of the bits of x flipped.

And then to get negative x, we add 1 to the one's complement of x. And the fact of adding 1 here is we're going to take the rightmost 0 bit in the one's complement, flip that to a 1. And then for all of the bits to the right of that, we flip them to 0's. So another way to see this is you look at the representation of x and you flip all of the bits up to the rightmost 1 but not including that rightmost 1 bit, and then you just copy everything over.

So any questions about this? OK. So this is a table showing the relationship between hex and binary representation. So hex representation is base 16. And the reason why we use hex is because sometimes we have these big binary constants and we don't want to write-- have to type all of these symbols into our code. And hex gives us a more compact format to write these constants.

And this table, you can basically just look up, for each possible hex value, what its binary representation is. And for the values from 0 to 9, we're just going to use the same as decimal representation for hex. And then for values 10 to 15, we're going to use the characters from A to F.

To translate from hex to binary, you just take each hex digit, look it up in this table, write out the binary equivalent, and then you concatenate together all of the binary values you've got. So in this example I have this hex constant which says DEC1DE2C0DE4F00D. So now I just look up each of these hex values in this table. So D is 1101, E is 1110, C is 1100, and so on. And I just concatenate all of these values together and that gives me my binary representation.

And you can also go the other way around, converting binary to hex. And you do the same thing, just look it up in this table. And the prefix 0x here designates a hex constant, just like 0b designates the Boolean constant. So if you're using these constants in your code and you're writing it in hex, then you should use the 0x prefix.

So C has a bunch of bitwise operators. And here's a table describing what these bitwise operators do. So the ampersand is just logical AND. The vertical bar is logical OR. This caret sign is the XOR or exclusive OR. And XOR just says that if either of the two bits is 1, then we return 1. And if both of the bits are 0 or both of them are 1, then we return 0.

The tilde sign is the one's complement or the not. And then we have left shift and right shift operators. So let's look at how these operatives work on this example here. So we have these two 8-bit words, A and B. To compute A AND B, we just look at every two bits in the same position in A and B and compute the AND of those two bits. So 1 ANDed with 0 is 0, so we get 0 here. 0 ANDed with 1 is 0. 1 ended with 1 is 1, and so on.

A OR B is similar but now you apply the OR operator instead of the AND operator. So if either one of the two positions is 1, then you return 1. And if both are 0, then you return 0. So an A OR B, all of the bits except for this bit here is 0 because in the original two words both of the corresponding bits were 0.

For A XOR B, we check if exactly one of the two bits is 1. So for the leftmost bit, we have 1 and 0, so we have exactly one bit set to 1 and we get a 1 here. The second bit is 0 and 1, so that's 1. The third bit is 1, 1, so that's 0, and so on. Tilde of A is just the one's complement of A. We saw that before. We just flip all the bits.

A right shifted by 3, we just shift the bit string to the right by 3, and then we fill in the digits or the bits on the left with 0's. And then A left shifted with 2, we do the same thing but to the left. And then we fill in these empty bits with 0's. So these are the bitwise operators in C. Any questions?

AUDIENCE: They're not [INAUDIBLE]?

JULIAN SHUN: For a right shift, there is a-- there is a shift that will fill in the upper digits with whatever the leftmost digit was. But if you're working with unsigned integers, then it's not going to do that. For signed integers it will. And when we're doing bit manipulations, we're usually going to stick to unsigned integers, so we don't have to worry about that.

So now let's look at some common idioms that you can do using these bitwise operators. So the first one we'll look at is setting the kth bit in a word x to 1. So the idea here is to use a shift followed by an OR. So we're going to compute 1 left-shift it by k if we want to set the kth bit to a 1. And this gives us a mask with a 1 in exactly the kth position, and 0's everywhere else.

And then now when we OR that in to x, that's going to change the bit from a 0 to a 1 if it was a 0. And if that bit was already set to 1, then this doesn't do anything. And then for all of the other positions, since we're doing an OR with 0, we're just copying over the bits from x. So that's setting the kth bit.

We can also clear the kth bit. And the idea here is to use a shift, a complement, and then an AND. So again we're going to generate this mask, 1 left-shifted by k. But now we're going to take the complement of this. So now we have a 0 in exactly the kth position and 1's everywhere else.

And now when we AND this mask with x, in the kth position it's going to clear that bit because we're ANDing it with a 0. So the result is going to be 0 no matter what was there before. And then for all the remaining bits, since we're ANDing with 1, we're just copying it over from the original word.

You can toggle the kth bit or flip the kth bit using a shift and then an XOR. So, again, we're going to generate this mask. And then now, when we do an XOR with this mask, it's going to change a bit from a 0 to 1, or from a 1 to a 0, because that's what XOR does. So in this example, it's changing from a 0 to 1. But if it was already a 1, then it's going to toggle it back to 0. Any questions?

So let's look at another bit trick. So here we're trying to extract a bit field from a word x. And this is important if you're working with encoded data. And the idea here is to do a mask and a shift. So we're going to generate a mask with 1's in exactly the positions that we want to extract out of this word, and then 0's everywhere else. And then we're going to AND the x with the mask, and that's going to give us the bits in the four positions that we wanted to extract in this example, and then we have 0's everywhere else.

And then now we're going to right-shift this value that we extracted so that it appears in the least significant digits so that we can use it in our computation. So this is a very useful bit trick to know if you're working with compressed or encoded data. And if you use the bit field facilities in C, it's actually going to generate assembly code that will do masking and shifting for you.

You can also set a bit field in a word. So let's say we want to set a bit field in x to some value y. The idea is to first invert this mask to clear those bits we want to set in x. And then we OR in the shifted value of y. So let's say we have these two words, x and y here. We're going to generate the mask as we did before, but now we're going to flip all the bits in the mask by taking the one's complement.

And then we AND the-- we AND the one's complement of the mask with x, and that's going to clear the bits in x because we have 0's in exactly those positions in that mask, and when you AND that into x it will return to 0. And then for all the other positions, we're just copying in the bits of x.

And then, finally, we're going to left-shift y by an appropriate amount so that we can line up the value with these four bit positions here. And then we can now just OR those values in. And this will set the positions in x to the value y.

In order to be safe, you should actually do a mask on the shifted y value before you OR it in, because you don't know that the value of y is within the range of the mask. So if y has some garbage values in the higher bits, when you OR this in it might pollute the original value of x. So, for safety, you should actually do a mask before you OR the value, the shifted value of y in. So any questions on this?

So now let's look at how we can swap two integers. So we want to swap the values of x and y. The standard way to do this is to use a temporary variable t. So we set t equal to x, x equal to y, and then y equal to t. This does involve a temporary variable, however. So now the question is whether we can do a swap without using a temporary variable. It turns out that you can using bit tricks.

So here's the code for doing a no-temp swap. So you first set x equal to x XOR y, then y equal to x XOR y, and then x equal to x XOR y. So has anyone seen this before? OK, good. So some of you have seen this before. And for the rest of you all, I'll tell you how it works in the next couple slides.

So let's first look at an example of how to run this code before we go into why it actually works. So we're going to start with these two words in x and y. We're first going to do x equal x XOR y. And now we store the result in x. And this is the result when you do the XOR of these two words.

And then now we do y equal to x XOR y. And notice how the value of x here has already changed. So we're doing the XOR of these two words and setting that to y. And here this value is actually the same as x. So we've already placed x in y.

And, finally, we do another XOR. We set x equal to x XOR y. And then this gives us this value, which is y. So at the end, we've just swapped x and y without using any temporary variable.

So the reason why this works is because XOR is its own inverse. So if you do x XOR y, and then XOR the result of that with y, you just get back x itself. So let's look at the truth table to see why this is true. So in the x and y columns, I've shown all the possibilities. So there are four different possibilities of x and y. And then I also have the values of x XOR y. So it's 1 in the rows where I have exactly one 1, and then 0 in the remaining rows.

And then now if I do x XOR y XORed with y, I'm going to XOR these values with y. 0 XOR 0 is 0. 1 XOR 1 is 0. 1 XOR 0 is 1. And 0 XOR 1 is 1. And notice that these values are the same as the values of x. So when I XOR something in twice, it just cancels out and I get back the original thing.

So now let's go into why this bit trick actually does a swap. So in the first line, what we're doing is we're generating a mask with 1's where the bits in x and y differ. Because that's what XOR is going to give you. It's going to return a 1 if the bits are different, and 0 otherwise. So this is a mask that tells us in which positions the bits in x and y differ. And I'm going to store that into x.

And then in the second line, when I do x XOR y, this is going to flip the bits in y that are different from x, because I'm XORing with this mask, which tells me which of the bits differ from x. And then if I XOR with that mask, I'm flipping the bits in y that differ from x, and this will just give me back x. And I store that in y. So we see that the original value of x is in y now.

And then in the last line, I do the same thing but now I'm flipping the bits in x that are different from y. So I still have the mask that's stored in x. And then I can XOR that mask with y, and y has the original value of x. So this is flipping the bits in x that differ from y, and now I have the original value of y stored in x. So this is a pretty cool trick, right? Any questions on why this works?

So one thing about this bit trick here is that it's actually poor at exploiting instruction-level parallelism, so it's actually going to be slower than the naive code that uses a temporary variable. Because in the original code I had, I could actually execute two lines in parallel. I can store value into the temporary and then also change one of the values of x and y at the same time. Whereas in this code here, there's a sequential dependence among these three lines. I can't execute any of the lines in parallel.

We'll learn more about instruction-level parallelism in next week's lectures, but I just wanted to point out that the performance of this isn't actually that great. But this is actually a pretty cool trick to know. Sometimes it shows up in job interviews.

So the next thing we're going to look at is finding the minimum of two integers, x and y. So let's say we want to store the result of the minimum in a variable r. Here's the standard way to do this. We just use an if-else statement. So if x is less than y, than r is x. And, otherwise, r is set to y.

Here's an equivalent expression. It just uses the ternary operator in C. It does exactly the same thing as the if-else statement on the left. One performance problem with this code is that there is a branch in the code. So we have this if statement that checks if x is less than y. And modern machines will do branch prediction. And for whatever branch it predicts the code to take, it's going to do prefetching and execute some of the instructions in advance.

But the problem is if it mispredicts the branch, it does a lot of wasted work, and the processor has to empty the pipeline and undo all of the work that it did. So this is a performance issue due to branch misprediction. Modern compilers are usually good enough to optimize this branch away, but sometimes the compiler isn't good enough to optimize the branch away.

So is there a way to do a minimum without using a branch? All right. So here's how you do it. So we set r equal to y XOR x or y ANDed with negative x less than y. So it's pretty obvious, right? So why does this work?

So first we need to know that the C language represents the Boolean values true and false with the integers 1 and 0, respectively. So now let's look at the two possible cases. First, let's look at a case where x is less than y, and then we'll look at the case where x is greater than or equal to y.

So in the first case, when x is less than y, the comparison here x less than y is going to return 1. And then we're going to negate that, which gives us negative 1. And recall from earlier, negative 1 is the all 1's word in two's complement representation. So when we AND x XOR y with all 1's word, that just gives us x XOR y.

And now we're left with y XOR x XOR y. And we know that the-- we know that the inverse of XOR is itself. And therefore the two y's cancel out here and we're just left with x. And in this case x is indeed the minimum.

In the other case, we have x greater than or equal to y. Then the expression x less than y is going to return 0. Negative of 0 is still 0. And then when we AND x XOR y with 0, we're left with 0. And this just gives us y XOR 0, which is y. And in this case y is the minimum of the two integers. So any questions?

So how many of you-- how many of you knew this already? Good. So we learned something new today. So let's see how branches work in a real function. So here we're trying to merge together two sorted arrays, and this is a subroutine that's used in merge sort if you've seen it before.

So the inputs to this function are three arrays. So we want to merge together arrays A and B and store the result in C. And then we also pass the function the sizes of A and B in na and nb. So what does the restrict keyword do here? Does anyone know?

So the restrict keyword tells the compiler that this is going to be the only pointer that can point to that particular data. And this enables the compiler to do more optimizations. So when you're writing programs and you know that there can only be one pointer pointing to specific pieces of data, then you can declare that restrict keyword, and this gives the compiler more freedom to do optimizations.

So now let's look at this procedure here. So while the sizes of A and B are nonzero, we're going to go into this if-else clause and we're going to check if the element pointed to by A is less than or equal to the element pointed to by B. And if so, we're going to store that element pointed to by A into C. And then we're going to increment both the C and A pointers. And then we're going to decrement na. This tells us that there's one less element in A that we need to merge in now.

And, otherwise, we do the same thing but with array B and nb. And if one of the two arrays becomes empty, then we go to one of these two while loops at the bottom and we just copy all the remaining elements in the non-empty array into C. So here, if na is greater than 0, then A is a non-empty array, and then we just copy the remaining elements of A into C. And, otherwise, we copy the remaining elements of B into C.

So let's do a simple example. Let's say we want to merge these two arrays in green into the blue array here. So let's say the top array is A, and the bottom array is B, and the blue array is C. So, initially, A and B are pointing to the beginning of these two green arrays.

And since both arrays are non-empty, we're going to compare the first two elements here. And we see that 3 is less than 4, so we're going to place 3 into the array C. And then we're going to increment the pointer in A to point to the next element. And we're also going to increment the pointer C to point to the next slot.

Now we're going to compare 4 and 12. 4 is less than 12, so we place 4 into the array C, and we increment array B. And then we just keep doing this. So 12 is less than 14. 14 is less than 19. 19 Is less than 21. 21 is less than 46. And here 23 is less than 46.

And at this point, one of the arrays becomes empty. So B is empty now. So now we get to the second while loop. And we see that A still has elements in it, and we just copy the remaining elements in A into C. And then we're done. So that's how the standard code for merging two sorted arrays works.

So let's look at each of these branches to see if it's predictable. So a predictable branch is a branch that most of the time it returns the same answer, and only rarely does it return a different answer. And an unpredictable branch is one where it sometimes returns one value and sometimes returns another value and you can't really predict it. So let's look at the first branch. Does anyone know if this branch is predictable? Yes.

AUDIENCE: That would be unpredictable because it depends on what input you're given.

JULIAN SHUN: So it turns out that this branch is actually predictable because it's going to return true most of the time except for the last time. So it's only going to return false when nb is equal to 0. And at that point you're just going to execute this once and then you're done. But most of the time nb is going to be greater than 0 when you execute this, and we call this a predictable branch. What about the second one? So--

AUDIENCE: Also predictable?

JULIAN SHUN: Yes. So it's also predictable for the same reason. What about the third one? Yes.

AUDIENCE: No. Because we really-- if we already knew which was bigger, then we already have the sorted array then.

JULIAN SHUN: Yes. So this turns out to be unpredictable because we don't know the values in A and B a priori. So this condition inside the if statement is going to return true about half of the time because we don't know what values are in A and B. And that's going to be an unpredictable branch because it's going to return true or false about 50/50.

What about the last one? Yes. Why?

AUDIENCE: Yes, because for similar reasons of 1 and 2. It's probably [INAUDIBLE].

JULIAN SHUN: Yes. So it is predictable. The reason why it's predictable is that most the time it's going to return true. And that once it returns false you're never going to look at that again inside this function call. So it returns true most of the time, and we call that a predictable branch.

So branches 1, 2, and 4 are OK because they're predictable branches, but branch 3 is going to cause a problem. It's an unpredictable branch, and the hardware doesn't really like this because it can't do prefetching efficiently. So to fix this, we can use our no-branch minimum bit trick that we learned a couple slides ago.

So now what we're doing is we're going to have a variable called cmp which stores the result of the comparison between the first element of A and the first element of B. And then now we're going to get the minimum of A and B as follows. It's the same bit trick that we saw before. So now the variable min is going to store the smaller of the first element of A and the first element of B. And we also have the result of this comparison here. So that's stored in cmp.

So first we're going to place the minimum value in C. And then, based on the result of cmp, we're going to increment one of A or B. So if A was less than or equal to B, then cmp is going to be 1. And A plus equal cmp is going to increment A by 1. And then B plus equal to not cmp is going to not do anything, because not cmp is 0.

And then for na, we're going to decrement by cmp. So it's going to be 1 if A is less than or equal to B, and 0 otherwise. And then for nb, we're going to decrement by the not of the cmp. So only one of these two lines is actually going to do something based on the result of the comparison. And then the rest of the code is the same as before. Any questions? So now we've gotten rid of this unpredictable branch that we had before.

So one thing about this optimization is that it works well on certain machines. However, on modern machines, using a good compiler like Clang with the minus O3 flag, the branchless version is usually going to be slower than the branching version because the compiler is actually smart enough to get rid of the branch inside the original version of minimum. There's this instruction called cmov or a conditional move. It's basically a branchless instruction for doing a comparison. We'll learn more about that next week.

So this trick actually usually doesn't really work. There might be some machines and some compilers that works, but most of the time, the compiler is better at optimizing this code than you are. So one of the common themes so far is that I've told you about a really cool bit trick and then I told you that it doesn't really work. So why are we even learning about these bit tricks then if they don't even work?

So first is because the compiler does some of these bit tricks, and it's helpful to understand what these bit tricks are so you can figure out what the compiler is doing when you look at the assembly code. Secondly, sometimes the compiler doesn't do these optimizations for you and you have to do it yourself. Thirdly, many bit hacks for words extend naturally to bit and word hacks for vectors, which are widely used in high-performance code. So it's good to know about these tricks. These bit tricks also arise in other domains. And, finally, because they're just fun to learn about. And for project 1, you'll be playing around with some of these bit tricks, so it's good to know about these things that I've talked about already.

Here I'll talk about a bit trick that actually does work. So here we're trying to do modular addition. So we want to do x plus y mod n. And here let's assume that x is between 0 and n minus 1, and y is also between 0 and n minus 1. So the standard way to do this is just to use the mod operator, x plus y mod n. However, this does a division, which is relatively expensive compared to other operations unless n is a power of 2. But most of the time, you don't know if n is a power of 2 at compile time, so the compiler can't actually translate this to a right shift operation, and then it has to do a division.

So here's another way to do it without using division. So we're first going to set z equal to the sum of x and y. And then if z is less than n, then it's already within the range and we can just return z. If z is greater than or equal to n, well we know we can be at most 2n minus 2 because x and y were both at most n minus 1. So all we have to do is to subtract n and bring it back into range.

However, this code has an unpredictable branch here because we don't know whether z is less than n or not. So now we can use the same trick as minimum. So now we're going to set r equal to z minus n ANDed with the negative of z greater than or equal to n.

So if z is less than n, then this is going to return 0 in here. And n ANDed with 0 is 0, so we're just left with z. And if z is greater than or equal to n, then this is going to be 1. We negate that, we get negative 1, which is the all 1's word. n ANDed with all 1's is just n. So that is z minus n, which will bring the result back into range.

So any questions? Yes.

AUDIENCE: It seems like there essentially is still a branch based on the value of z. So why would that be faster?

JULIAN SHUN: So this branch here is just generating either a Boolean value 1 or 0. There's actually-- like the code that you execute after it, it's still the same in either case. So the branch misprediction only hurts you if there are two different code paths. In this version, there are two different code paths, because one is doing z and one is doing z minus n.

So the next problem we will look at is computing or rounding a value up to the nearest power of 2. And this is just 2 to the ceiling of log base 2 of n. And recall that lg of n is the log base 2 of n. That's the notation we'll be using in this class. Here's some code to do this.

So we have our value of n here. First, we're going to decrement n. And then we're going to do an OR of n with n right-shifted by 1. Then an OR with n and n right-shifted by 2, and so on, all the way up to 32. So we do this for all powers of 2 up to 32. And then, finally, we increment n at the end. So let's look at an example to see why this works.

So we're starting with this value of n here. First we're going to decrement it. And what that does is it flips the rightmost 1 bit to 0, and then it fills in all the 0's right of that with 1's. And then when we do this line, which says n is equal to n ORed with n right-shifted by 1, that's essentially propagating all of the 1 bits one position to the right and then ORing those in.

So we can see that this 1 bit got copied one position to the right. This 1 bit got copied to one position to the right. These 1's also propagate, but since they were already 1's it doesn't do anything.

For the next line, we're propagating the 1 bits two positions to the right. So this 1 bit here gets copied here. This 1 gets copied here, and so on. And then the next line is going to propagate bits four positions the right. Then 8, 16, and 32. For this example here, when I get to this line I'm already done. But, in general, you have more bits in a word, which I can't fit on this slide.

And now we have something that's exactly one less than a power of 2. And when we add 1 to that, we just get a power of 2. So we're going to zero out all of these 1 bits and then place a 1 here. And this is exactly the power of 2 that's greater than the value n.

So the first line here is essentially guaranteeing us that the log nth minus 1 bit is set. And we need that bit to be set because we want to propagate that bit to all the positions to the right of it. And then these six lines here are populating all the bits to the right with 1's. And then the last bit is setting the log nth bit to 1 and then clearing all of the other bits.

So one question is why did we have to decrement n at the beginning? Yes.

AUDIENCE: In case n is already [INAUDIBLE].

JULIAN SHUN: Yes. So if n is already a power of 2 and if we don't decrement n, this is isn't going to work because the log nth minus 1 bit isn't set. But if we decrement n, then it's going to guarantee us that the log nth minus 1 bit is set so that we can propagate that to the right. Any questions? Yes.

AUDIENCE: [INAUDIBLE]?

JULIAN SHUN: Because, in general, you're using 64-bit words. Here I don't have that many bits here because I can't fit in on the slide, but in general you have more bits. Let's look at another problem. Here we want to compute the mask of the least significant 1 in a word x. So we want a mask that has a 1 in only the position of the least significant 1 in x, and 0's everywhere else. So how can we do this?

So we can set r, the result, equal to x ANDed with negative x. So let's look at why this works. So here is x. And recall negative x is the two's complement of x plus 1. So what we do is we flip all of the bits up to the rightmost 1 but not including it, and then we just copy all of the bits over. That's how we get negative x from x.

And then now when we compare x and negative x, we see that all of the bits when we AND them together are going to be 0 except for the bit at the position corresponding to the least significant 1 bit in x. And that's going to be 1 since we're ANDing 1 and 1, and everything else is going to be 0. And this will give us the mask that we want. So this works because the binary representation of minus x is just the one's complement of x plus 1.

So now, a question is how can we find the index of this bit? So here I'm just generating a mask that has a 1 in the least significant 1 in x, but it doesn't actually tell me the index of this bit. In other words, I want to find the log base 2 of a power of 2. So that's the problem we want to solve, and here's some code that lets us do this.

So we have this constant called the de Bruijn. It's written in hex here. And then we have this table of size 64 called convert. And now all we have to do is multiply x by this de Bruijn constant, right shift it by 58 positions, and then look up the result in the convert table. And that's going to give us the log base 2 of the power of 2. Any questions?

[STUDENTS LAUGH]

So this looks like magic to us. So in the spirit of magic, we're going to do a mathemagic trick. And to do this trick, I'm going to need five volunteers, and the only requirement is that you need to be able to follow directions. So who wants to volunteer for this magic trick? Yes, 1, 2, 3, 4-- one more-- 5. All right, come on up. So line up here.

[STUDENTS APPLAUD]

Yes, just line up right here. Can you move a little bit over to the left? OK, cool. So today I have the pleasure of welcoming Jess Ray, also known as The Golden Raytio, to join us for a lecture and help us perform this cool magic trick. So let's give her a round of applause.

[STUDENTS APPLAUD]

JESS RAY: I'm going to be doing a little bit of magic trick for you all today. I'm going to be reading your guys' minds. And I know you're looking skeptical, but I'm hoping I can convince you here. So we'll get to that part in a second. But, first, the first big step in reading minds is you got to clear the air, like get rid of all the negative vibes, all the bad energy. Throw that out. So I'm going to need a little help from you guys in doing this.

So, first, we have this sweet little bell here. Let's see. Who wants the bell?

AUDIENCE: I'll take it, I guess.

JESS RAY: All right. Can you hold that for a second? So what this bell is going to do is help us get rid of some of those negative ideas. Can you give it a ring? Oh yes. So that painful ringing you're hearing in your ears right now is actually just clearing up the air for us, making it so I can read your minds. Thank you. Stop that.

All right. Next we have this magic tone here. Who would like to give this a spin? Can you shake that around a couple of times? Spin it. Spin it with your wrist there, like-- you can go like this. There we go. All right. Perfect.

All right. It's feeling a little clearer here. I can start-- you can start getting things off your mind. Don't worry, I won't tell anybody what you're thinking. Oh, let's see what else. Let me channel the spirits. Help me out here. All right, I'm feeling good. All right.

So what we're going to be doing is, as I said, reading your mind. I'm going to be doing this by giving you cards, and I'm going to tell you what each of you are holding for the card. So I have some cards here. Well, I guess these are a little small. Let's see. Go a little bigger. Meh. Here we go. Let's-- this looks better.

All right. These are kind of heavy. Get rid of these junk ones up here, all the junk. All right. So I need your help for this. So what I want you to do is take the cards and cut the deck as many times as you want. So, basically, just going like that however much. Just don't actually shuffle them randomly.

AUDIENCE: All right. Here you go.

JESS RAY: All right, cool. So now I'm going to hand each of you a card. Don't let me see it. Feel free to look at it. There you go. All right.

So the reason I'm wearing this awesome onesie is this helps me sweat out the bad energy. I'm literally sweating right now. But there's one more piece that we need for this mind reading trick. The magic hat. All right.

See if this fits on my head. There we go. Where's the switch? All right. Turn it on. All right, I'm feeling good here. All right, you guys ready? All right.

So I do need a little help getting this trick started. So if you are holding a red card, can you just raise your hand? So no? Who's got the red card? Red, red. You don't have red? OK. All right. So the first one and the third one. All right.

So let me handle the mind reading abilities here. Now what I'm going to do is I'm going to go left to right and tell you what you're holding. Obviously, I know the color, but I'll tell you what suit it is, and also I will tell you what the number is.

So first card, obviously I know you have a red. Hmm. I'm feeling a diamond and also a four?

AUDIENCE: That was it.

JESS RAY: Yes. All right. All right. Good start, good start. All right. Got to-- got to think about what the next one is here. All right. So I know you had a black card. Let's see. Black of spades. Is it the ace of spades? Oh yes. There we go.

All right. So back to red. All right. This one, let's see. Red, diamond, two. All right, all right. We're doing good so far. Can I get the last two?

All right, let's see what we can do here. All right, black, club, four. All right. Last one, last one.

All right. Oh, it's going to be a tough one. Black, spade, eight.

[STUDENTS APPLAUD]

And if we had time, I could you mystify you and go through the rest of the deck, but we won't do that. So thank you guys very much. I hope your minds were blown. Yes. So me collect the cards back from you. Thank you. All right. Thank you. Now I can get out of this and stop sweating.

[STUDENTS APPLAUD]

JULIAN SHUN: It's pretty cool, right? So why does this actually work? To know why this trick actually works, we need to first study what a de Bruijn sequence is. So a de Bruijn sequence s of length 2 to the k is a cyclic bit sequence such that each of the 2 to the k possible bit strings of length k occurs exactly once as a substring in s.

So this a pretty long definition, so let's look at an example. So here is a de Bruijn sequence for k equals 3. So the length of this sequence is 8 because 2 to the 3 is 8. And you can see that each of the possible three-bit substrings occurs exactly once in this cyclic bit string of length 8. So it wraps around and you can consider this as a cyclic string.

So we see that 000 appears at position 0. 001 is at position 1. Then 010 is at position 6. 011 is at position 2. 100 is at position 7. 101 is at 5. 110 is at 4. And then 111 is at 3. So all of the 8 possible substrings of length 3 occur exactly once in this de Bruijn sequence.

So now we're going to create this convert table of length 8. In general, this will be 2 to the k. And here, k is 3. And in this convert table, what we're storing in each position is the index in the de Bruijn sequence where the bit string corresponding to that position starts in the de Bruijn sequence.

So here we see that convert of 2 is 6 because the bit string corresponding to 2 is 010, and that begins at position 6 in the de Bruijn sequence. We also see that convert of 4 is 7 because 4 is 100, and that begins at position 7 in the de Bruijn sequence.

Now we have this convert table. And recall that we're trying to compute the log base 2 of a power of 2. So hopefully you guys remember that. So the way to do this is we're going to multiply the de Bruijn sequence constant by this power of 2.

So let's say we're working with the integer 16, which is 2 to the 4. So we're going to multiply this de Bruijn sequence by 2 to the 4. And when we multiply by a power of 2, that's the same as left shifting. So that's going to left shift the de Bruijn sequence four positions to the left. And then now we want to see which of the eight possible substrings appears at the beginning of this sequence. And after we do the left shift, 110 appears at the beginning of the sequence.

And we want to extract this out, and we can do that by right shifting five positions. And 110 is just 6. And we can figure out where 5 starts in this de Bruijn sequence by looking it up in the convert table. We see that convert of 6 is 4. So the string 110 appears starting at position 4 in the de Bruijn sequence, and that means that we did a left shift by 4 in the first step, and that gives us the log base 2 of the power of 2, because the only reason why we did a left shift by 4 is because the power of 2 was 2 to the 4.

So this returns us the log base 2 of the integer that we started with. And one thing to note is that it's important to start with all 0's in this sequence here, because we're representing this as a cyclic bit sequence. So when we do a left shift, we need to make sure that the values that fill in on the right side are correct. So notice that in the sixth and seventh positions, we need 0's at the end when we overflow. So because the de Bruijn sequence starts with all 0's, when we do the left shift, it's automatically filling with 0's, giving us the correct substring.

So the magic trick that Jess did had 32 cards, and in that case k was equal to 5. And the cards were arranged according to a de Bruijn sequence of length 32. And each of the cards corresponded to one particular bit string of length 5. And the color of the card corresponded to the bit.

So when she asked you what the color of your card was, she could determine the bits corresponding to the first card in the sequence because she has the 5 bits corresponding to that card. And then with that she has some clever way to determine the rest of the cards. So that's how the de Bruijn sequence is related to the magic trick that you just saw. Any questions? Yes.

AUDIENCE: The de Bruijn sequence, do you need to do cyclic translation?

JULIAN SHUN: So there could be multiple de Bruijn sequences. We just need one particular de Bruijn sequence to make this bit trick work. Yes. So this example is just for k equals 3. And the code I showed you before, that was for k equals 8, so you can do up to 64-bit words. Yes.

AUDIENCE: How do we know that the sequence exists?

JULIAN SHUN: So there is a mathematical proof that says that. I can give you some pointers so that you can look at it after class. But there's a proof that says that for any length there is a de Bruijn sequence. Yes.

AUDIENCE: Sorry, I missed the procedure. So how exactly do you determine the log base 2?

JULIAN SHUN: So we have-- we're starting with some integer that is a power of 2. So when we multiply by that power of 2, it's left-shifting by the log base 2 of that. And then we can determine how much we left-shifted because we know-- we can just look at the first three bits of this sequence after we did the left shift, and then look at where that three-bit sequence appears in the original de Bruijn sequence before we shifted it.

And to do that, you can look it up in the convert table. This is what we did when we looked up the bit string 110 in the convert table. And that tells us that it starts in the fourth position. That means that we left-shifted by 4, and that means that the value of n was 2 to the 4. Does that make sense? Yes.

AUDIENCE: So just to clarify this only works if you multiply the sequence by a power of 2, then it gives you back which power of 2 it was?

JULIAN SHUN: Yes. So this only works if you're starting with a power of 2. So if it's not a power of 2, this doesn't work. Any other questions? Yes. So if it's not a power of 2, you can round it up to the nearest power of 2 using another bit trick that we saw earlier. And then you can use this bit trick here.

The performance of this bit trick is limited by the performance of multiplication and table lookup. So you have to do a multiplication by some constant, and then you have to do table lookup in this convert table. So a table lookup does a memory reference, which could be expensive. And nowadays there's actually a hardware instruction to compute this, so you don't actually have to implement this trick. But this trick is still pretty cool. And in the past this is how you would do it before there was a hardware instruction that came out.

So let's look at another problem. So this is the n queens problem. How many of you have seen this before? Yes. So many of you have seen this before. As a reminder, we're trying to place n queens on an n by n chessboard so that no queen attacks another queen. In other words, there are no two queens in any row, any column, or any diagonal.

And, commonly, we want to count the number of possible solutions to the n queens problem for a particular value of n. And in this example here, this is a valid configuration. You can check, for each of the queens, they can't attack any other queen on the board.

So one common strategy for implementing the n queens algorithm is to use backtracking. We're going to try placing queens row by row. We know that there can only be one queen per row, so we just need to determine which position in that row the queen will appear in. And then if we can't place a queen in any row, then we backtrack.

So, for example, in the first row, we'll just place the queen in the first position, because there's no queens on the board yet, so the first position is valid. For the second row, we're going to try to place in the first position, but we can't place it there because then it will attack the first queen. And then the second position is also invalid, so the third position is where we place the second queen.

Now, for the third row we're going to check the positions until we get to one that's valid, and this is going to be the fifth position. Do this again. Here we can do it in the second position. For the fifth row, let's see where this is going to end up. OK. So it goes in the fourth position. What about the sixth row?

Whoops. So all of the eight positions are invalid, because if we place the queen in any of those positions, it's going to attack one of the queens that we already placed. So now we're going to backtrack. We're going to find another position for the fifth queen. So let's try some more positions. So we can place it at the end. Now we try again.

All right. So, unfortunately, we couldn't find a position for the sixth row again. We have to backtrack. But we already tried all the positions in the fifth row, so we backtrack to the fourth row. And you get the idea. And then whenever we find a configuration where all eight queens are valid, then we increment some counter by 1. And at the end we just return this counter, which tells us the number of solutions to the n queens puzzle.

So you can implement this quite easily using a recursive procedure. You can implement this backtracking search. But one question is how should we represent the board to facilitate efficient queen placement? So one way to represent the board is to use an array of n squared bytes. And for each byte, we just have a 1 if there is a queen in that position, and 0 otherwise. Is there a better way to represent the board?

AUDIENCE: You can track all of the bits such that a 1 bit represents a queen at some place on the board?

JULIAN SHUN: Yes. So that's a good answer. So instead of using bytes, we can use bits, because the value can only be 0 or 1. We only need one bit to represent that. So we can just have an array of n squared bits. Is there a better way to do this? Yes.

AUDIENCE: You could just say in each row where a queen is with a byte?

JULIAN SHUN: Yes. So good answer. So a better way to do this is to just use an array of n bytes. Because we know that on each row there can only be one queen, so we just need to store the position of that queen. So we have an array of n bytes, one byte for each row, and then you just used the byte to store the position of the queen in that row.

It turns out, to implement this algorithm, there's a even more compact representation, which is to use three-bit vectors of size n, 2n minus 1, and 2n minus 1. So let's see how this works. So the first bit vector we're going to use is of length n. We're going to call this the down vector. And the down vector just stores a 1 in the columns that have a queen in it and 0 in the columns that are empty.

And then when we want to check whether placing a queen is safe in any position, we first have to check whether that column is empty. And you can do this by ANDing the down bit vector with 1 left-shifted by c, where c is a column where you want to place the queen. And if that's nonzero, that means there's already a queen in that column and you can't place it.

Otherwise, we're going to have to do another check, and we're going to create this other bit vector called left. The length of this bit vector is 2n minus 1. And it stores a 1 in the diagonal that has a queen in it, and 0's otherwise. And there are 2n minus 2 possible diagonals.

And then now, when we want to place a queen in row r and column c, we can check whether it's safe by doing left ANDed with 1 left-shifted by r plus c. And this is going to be nonzero if there is already a queen in that particular diagonal. So in that case, we can't place a queen there.

And, otherwise, we're going to do a final check using this right bit vector, which is essentially the same but we're looking at the diagonals going down to the right. So, again, we have a 1 in the diagonals that have a queen and 0's otherwise. And then now the check is going to be right ANDed with 1 left-shifted by n minus 1 minus r plus c. And if a particular candidate passes all three of these checks, then we know that there's not going to be a conflict and we can place the queen in that particular position.

So this is a bit vector representation. You actually still have to write the code to count the number of queens using this bit vector representation, and it's actually an interesting exercise. So I encourage you to try to do this at home. But I just told you about the bit vector representation. So any questions? Yes.

AUDIENCE: Could you just repeat what the down vector bit hack was for figuring out [INAUDIBLE]?

JULIAN SHUN: Yes. So the down vector, it stores a 1 in the columns that have a queen in it and 0's otherwise. And what you do is, if you want to place a queen in column c, you first create the mask 1 left-shifted by c. And then you AND it with a down vector. And that's going to be nonzero if there's a queen in that column. Any other questions? Yes.

AUDIENCE: Why isn't there a horizontal one?

JULIAN SHUN: So it turns out that you don't need. Just these three checks is enough to guarantee-- guarantee that you can place a queen in a position if it passes all three of the checks. Yes. So a fourth check would just be redundant.

AUDIENCE: So we don't need a horizontal one because we're not placing two queens in the same row.

JULIAN SHUN: Yes. That's true. Good point. Yes. So we're only placing one queen in each particular row.

So let's look at another problem. This is called population count, or pop count for short. And the problem here is we want to count the number of 1 bits in some word x. Here's a way to do this that repeatedly eliminates the least significant 1 bit in a word.

So we have this for loop where r is initialized to 0. And we're going to repeat this loop until x becomes 0. And then each time we go through this loop, we increment r. And inside the loop we're going to set x equal to x ANDed with x minus 1. And this is going to clear the least significant 1 bit in x.

So let's look at an example. So let's say we have this value here for x. Well, to get x minus 1, we flip the rightmost 1 bit in x from a 1 to 0. And then we fill in all of the bits to the right of that with 1's. And then now when we AND those two things together, we're going to copy all of the bits up to the rightmost 1. And then for the rightmost 1, we're going to zero it out because we're ending with a 0. And then all of the bits to the right of that are still going to be 0. So x ANDed with x minus 1 is just going to get rid of the least significant 1 bit.

And then we repeat this process until x becomes 0. In that case we've already eliminated all the 1's and we know the answer, which is stored in r. Questions? So this code will be pretty fast if the number of 1 bits is small, but the running time is proportional to the number of 1 bits in a word. So in the worst case, if most of the bits are set to 1, then you're going to need a lot of iterations to run this code.

So let's look at a more efficient way to do this. This is to use table lookup. So we're going to create a table of size 256, which stores for each 8-bit word the number of 1's in that 8-bit word. So we have all possible 8-bit words stored in this table.

And then now, to get the number of 1 bits in x, for every 8-bit substring in x, we're going to look it up in this count table and add it to r. And then we're going to right-shift x by 8 so that we can get the next word. And then when x becomes 0, we know we're done. So that's table lookup. And the performance here depends on the size of x. If we have a 64-bit word, we need to do this at most eight times, whereas in the initial code we might have to do it 64 times if we had 64 1 bits.

The cost of this code is bottlenecked by the memory operations, because this table here is stored in memory. So every time you access it you have to go to memory to fetch the value there. And here are some approximate costs for accessing memory in various levels of the hierarchy. If something's stored in register, it's very fast. It only takes you 1 cycle. If it's stored in L1 cache, it's about 4 cycles, L2 cache about 10 cycles, L3 cache about 50 cycles. And then, finally, if you have to go to DRAM because it's not in cache, it's much more expensive, 150 cycles. It's an order of magnitude slower than doing something-- fetching something that's already stored in a register.

So let's now look at a third way to do population count where we don't actually have to go to cache or DRAM. Essentially, we can do everything in registers. So here's how you do it. So we're going to create these five masks-- or six masks, from M0 up to M5. And these masks-- the values of these masks are shown in the comments here. In this notation here, x to the k just means x repeated k times. So the mask M5 has 32 0's, followed by 32 1's. The mask M0 has the bit string 01 repeated 32 times, and so on.

After we create these masks, we're going to execute these six instructions at the bottom, and this is going to give us the number of 1's in the word. So let's do an example to see how this works. So let's say we start with this bit string here. In the first step, what we're going to do is we're going to AND x with the mask M0.

And then we're also going to AND x right-shifted by 1 with the mask M0. and recall that the mask M0 is just 01 repeated 32 times, and therefore the mask is essentially extracting all of the even bits. So x ANDed with M0 gives us all of the even bits. And then when we right-shift x by 1 and AND it with M0, that's going to give us all the odd bits.

And then we're going to line those two things up and add them together. And the result of doing this is it's going to tell us for every group of two bits the number of 1 bits in that group. So now for each of these pairs of bits, it's telling us how many of them are 1. So in the leftmost group here, we add two 1's. So the result of adding 1 and 1 is 1 0, which is 2. For the rightmost group, we have two 0's, and the count there is 00. And this is the same for all of the other groups.

So this gives us the number of 1's in every pair of positions. Now we're going to AND the result with M1. And we're going to right-shift it by 2 and also AND it with M1 and add those two things together. And M1 is a mask that will give us the bottom two bits in every group of four bits. So when we right-shift x by 2, that's giving us the top two bits.

And then now we add those together, and it will give us the count of the number of 1 bits in every group of size 4. And these counts are stored in the result here now. So you can verify that each of these groups has the count of the number of 1 bits. So, for example, we have 100 here. And this is correct since there are four 1 bits.

Now we do this again with the mask M2. That's going to give us the counts for all groups of size 8. Then we go to groups of size 16. And then, finally, we add these two together, giving us the number of bits in this group of size 32. And this is actually the pop count. So the value here is 17. And you can verify that there are indeed 17 1's in the input word x. Any questions?

So the performance of this code, which is based on parallel divide and conquer, is going to be proportional to log base 2 of w, where w is the word length. Because on every step I'm doubling the size of my groups. And after I do this log base 2 w times, I have the whole group.

In the first two instructions that I executed here, I have to actually do the AND separately for x right-shifted by 1 and x, and also x right-shifted by 2 and x, and then add them together, because there is an overflow issue. The overflow issue is that the size of the groups here might not be large enough to actually store the count of the number of 1 bits in that group. But once I get to the larger groups, the count can always be stored in a group of that size and I don't need to worry about overflow. So for the last four lines, I can actually save one instruction. I don't need to do the AND twice.

So it turns out that most modern machines nowadays have an intrinsic pop count instruction implemented in hardware, which is faster than anything you can code yourself. And you can access this pop count instruction via compiler intrinsics, for example in GCC or Clang. And in GCC, it's __builtin_popcount. One warning though is that if you write this code using these intrinsics, if you try to compile the code on a machine that doesn't support it, your code isn't going to compile. So it makes your code less portable. But this intrinsic is faster than the parallel divide and conquer version.

So one question is, how can you get the log base 2 of a power of 2 quickly using a pop count instruction? So instead of using the de Bruijn sequence trick. Yes.

AUDIENCE: You decrement then you pop count.

JULIAN SHUN: Yes. So what you do is you subtract 1 from the power of 2, and that's going to flood all of the lower bits with 1's. And then now when you execute pop count, it's going to count the number of 1's, and that gives us the log base 2 of the power of 2. So good answer.

So those all the bit tricks I'm going to be talking about today. There's a lot of resources online if you're interested in learning more. There's this really good website maintained by Sean Eron Anderson. There's also the Knuth's textbook, which has some bit tricks in there. There's a chess programming website which has a lot of cool bit tricks. Some of those are used in implementing chess programs. And then, finally, this book called Hacker's Delight. So we'll be playing around with many of these bit tricks in project 1, so happy bit hacking.