Wednesday, March 22, 2017

How does git bisect choose the commit to test ?

Introduction

The documentation says that git bisect "uses a binary search algorithm to find which commit in your project’s history introduced a bug". Great, but what is a binary search in a DAG ?

The problem

The problem is to find a particular element (the commit that introduced a bug, assumed unique) in a set of elements (all the commits between a known good commit and a known bad commit). The point of the binary search is to divide the set of elements to test, by 2, at each iteration. So, how should we choose a commit in the commit DAG that, once tested, will let us know that half of the elements does not contain the faulty commit ?

A wrong solution

A first idea would be to sort the commits regarding the date, and do a "regular" binary search, as in a one dimension array. This does not work because the faulty commit will not necessarily contaminate all the commits that are created later in time. Here is a concrete example of a small DAG where this algorithm will not find the commit that introduced the bug:
The commit that introduces the bug is 4. If we sort by date, we will do a binary search in the following one-dimension array:
Binary search cannot find the commit 4, it would return 7 as the first bad commit.

A right solution

Let's use this example commit DAG:
If a commit tests good, we know that all his ancestors are good, and we can continue the search in the set of all commits minus this commit and his ancestors. If a commit tests bad, we have to continue the search in its ancestors. Let n be the total number of commits, in this example n=22. To cut the problem in 2, we need to test a commit that has n/2 ancestors, and if it's not possible, a commit that is the closest to n/2 ancestors. (Binary search will not be in log2(n) time if we cannot divide by 2 exactly, worst-case DAG leading to O(n) time)

As far as I know, it's not possible to count the ancestors of each of the n commits in O(n) time. A BFS from the known good commit (1, in our example) will have troubles arriving on merge commits.
The solution chosen by Git is to count the ancestors of every merge commit, with a DFS. The other commits, having a single parent, have one ancestor more than their parent so they can be computed in a single graph traversal.

In our example, counting all the ancestors gives this result:
The "green" commit is the best choice of commit to test next.

Note: if "green commit" (id=14) tests good, the next iteration will have a tree with multiple good commits (ids = 14, 6 and 11).

The implementation in git 2.12

This algorithm is implemented in the function "do_find_bisection" in [git root]/bisect.c.

In git, child commit have pointer(s) to its parent(s). That's convenient to count the number of ancestors in DFS for merge commits as we described, since we need to go from child to parent. This counting is implemented in function "count_distance" in [git root]/bisect.c.

But the child to parent direction is not convenient to count the ancestors for non-merge commits, because we must go from parent to child. I guess that's why the function "find_bisection", that calls "do_find_bisection", prepares a linked list of the commits in "most ancestor" to "most child" BFS order.