Wednesday, March 22, 2017

How does git bisect choose the commit to test?


The documentation says that git bisect "uses a binary search algorithm to find which commit in your project’s history introduced a bug". Great, but what is a binary search in a DAG?

The problem

The problem is to find a particular element (the commit that introduced the bug, assumed unique) in a set of elements (all the commits between a known good commit and a known bad commit). The point of the binary search is to halve the set of elements to test at each iteration. So, how should we choose a commit in the commit DAG such that, once tested, we can rule out half of the elements?

A wrong solution

A first idea would be to sort the commits by date and do a "regular" binary search, as in a one-dimensional array. This does not work because the faulty commit will not necessarily contaminate all the commits created later in time. Here is a concrete example of a small DAG where this algorithm will not find the commit that introduced the bug:
The commit that introduces the bug is 4. If we sort by date, we will do a binary search in the following one-dimensional array:
Binary search cannot find commit 4; it would return 7 as the first bad commit.

A right solution

Let's use this example commit DAG:
If a commit tests good, we know that all its ancestors are good, and we can continue the search in the set of all commits minus this commit and its ancestors. If a commit tests bad, we have to continue the search among its ancestors. Let n be the total number of commits; in this example, n=22. To cut the problem in two, we need to test a commit that has n/2 ancestors, or failing that, the commit whose ancestor count is closest to n/2. (Binary search will not run in log2(n) time if we cannot divide by 2 exactly; a worst-case DAG leads to O(n) time.)

As far as I know, it's not possible to count the ancestors of each of the n commits in O(n) time. A BFS from the known good commit (1, in our example) will have trouble when it arrives at merge commits: it cannot simply add the counts of the two parents, because they may share ancestors.
The solution chosen by Git is to count the ancestors of every merge commit with a DFS. The other commits have a single parent, so each has exactly one more ancestor than its parent, and their counts can be computed in a single graph traversal.

In our example, counting all the ancestors gives this result:
The "green" commit is the best choice of commit to test next.

Note: if the "green" commit (id=14) tests good, the next iteration will have a DAG with multiple good commits (ids = 14, 6 and 11).

The implementation in git 2.12

This algorithm is implemented in the function "do_find_bisection" in [git root]/bisect.c.

In git, a child commit has pointer(s) to its parent(s). That's convenient for counting the number of ancestors of merge commits with a DFS, as we described, since we need to go from child to parent. This counting is implemented in the function "count_distance" in [git root]/bisect.c.

But the child-to-parent direction is not convenient for counting the ancestors of non-merge commits, because we must go from parent to child. I guess that's why the function "find_bisection", which calls "do_find_bisection", prepares a linked list of the commits in "most ancestor" to "most child" BFS order.

Wednesday, October 19, 2016

Why the warning "implicit declaration of function" in C should really draw your attention

If C code uses a function "foo" but does not include the header declaring "foo", it may still compile, but the compiler will assign a default type to foo:
int foo();
i.e., a function taking an unspecified list of arguments and returning an int.

If the function foo (as defined in its library) does indeed return an int, then the program will work (e.g., printf, strcmp, ... will work). But here is an example of what happens when the function returns not an int but an unsigned long long int, as strtoull does.

Sample code:
#include <stdio.h>
#include <inttypes.h>

// omitting the header declaring strtoull
// #include <stdlib.h>

int main (int argc, char *argv[]) {
    char *str_number = "3000111000";
    uint64_t an_uint64 = 0;
    an_uint64 = strtoull(str_number, NULL, 10);
    printf("value: %" PRIu64 "\n", an_uint64);
    return 0;
}

Let's compile this code on an x86_64 Linux machine. As expected, gcc displays a warning because it can't find any declaration for strtoull:
jma@antec:~/tmp$ gcc --std=c99 test.c -o test
test.c: In function ‘main’:
test.c:10:2: warning: implicit declaration of function ‘strtoull’ [-Wimplicit-function-declaration]
  an_uint64 = strtoull(str_number, NULL, 10);
Nevertheless, the program compiles and we can run the executable:
jma@antec:~/tmp$ ./test
value: 18446744072414695320
The program does not work as expected. The output we wanted was:
value: 3000111000

So what happened?
gcc believes the return value of strtoull lies in a 32-bit register because its (inferred) type says that it returns an int (32 bits). But that return value has to be stored in "an_uint64", which is a uint64_t. So gcc generates an instruction (cltq) that sign-extends the 32-bit register to 64 bits:
(gdb) disas
Dump of assembler code for function main:
   0x0000000000400556 <+0>: push   %rbp
   0x0000000000400557 <+1>: mov    %rsp,%rbp
   0x000000000040055a <+4>: sub    $0x20,%rsp
   0x000000000040055e <+8>: mov    %edi,-0x14(%rbp)
   0x0000000000400561 <+11>: mov    %rsi,-0x20(%rbp)
   0x0000000000400565 <+15>: movq   $0x400634,-0x8(%rbp)
   0x000000000040056d <+23>: mov    -0x8(%rbp),%rax
   0x0000000000400571 <+27>: mov    $0xa,%edx
   0x0000000000400576 <+32>: mov    $0x0,%esi
   0x000000000040057b <+37>: mov    %rax,%rdi
   0x000000000040057e <+40>: mov    $0x0,%eax
   0x0000000000400583 <+45>: callq  0x400440 <strtoull@plt>
=> 0x0000000000400588 <+50>: cltq   
   0x000000000040058a <+52>: mov    %rax,-0x10(%rbp)
   0x000000000040058e <+56>: mov    -0x10(%rbp),%rax
   0x0000000000400592 <+60>: mov    %rax,%rsi
   0x0000000000400595 <+63>: mov    $0x40063f,%edi
   0x000000000040059a <+68>: mov    $0x0,%eax
   0x000000000040059f <+73>: callq  0x400420 <printf@plt>
   0x00000000004005a4 <+78>: mov    $0x0,%eax
   0x00000000004005a9 <+83>: leaveq 
   0x00000000004005aa <+84>: retq
The expected result was found in %rax before the execution of cltq:
(gdb) info register rax
rax            0xb2d20f98 3000111000
(gdb) stepi
0x000000000040058a in main ()
(gdb) info register rax
rax            0xffffffffb2d20f98 -1294856296
When we include the header that declares strtoull, gcc generates assembly without the cltq instruction, because it knows the result is already in %rax, on 64 bits.

Even though it might look "convenient" that gcc automatically declares a function, relying on this behavior is not robust and should be avoided.