Sorting floats

A floating-point mystery

Here's a bug I ran into several years ago. Assume we have a bunch of points defined by their coordinates and we want their pairwise distances in increasing order:

distances = [dist(x, y) for x, y in itertools.combinations(points, 2)]
distances.sort()

Here dist is a function that computes some kind of distance measure. The details don't matter much but let's say that it's Manhattan distance,

def dist(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

Now we have the distances in order so that distances[0] is the smallest and distances[-1] is the largest... right? But it turned out that distances[0] was not the same as min(distances).

If you want to ponder the mystery yourself, the question is how to complete the assignment points = ... so that the assertion at the end fails:

import itertools

def dist(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

points = ...

distances = [dist(x, y) for x, y in itertools.combinations(points, 2)]
distances.sort()

assert distances[0] == min(distances)

What happened

The problem was caused by some unexpected floating-point numbers in the input data. Here's an example:

inf = float('inf')
points = [(0, 0, 0), (0, 0, inf), (1, -1, inf), (1, 0, 2), (1, 1, 2)]

Here some of the coordinate values are infinite, so those points are infinitely far away from everything else. The infinities should sort nicely at the end, so they are not the problem; the problem is the distance between the two points at infinity:

>>> [dist(x, y) for x, y in itertools.combinations(points, 2)]
[inf, inf, 3, 4, nan, inf, inf, inf, inf, 1]

One of the numbers in the list is printed out as nan and is actually “not a number”, NaN, which is the result of $\infty-\infty$. The NaN value propagates through almost all computations because computing something from undefined inputs should of course be undefined. And it turns out that sorting a sequence containing a NaN isn't guaranteed to work!

>>> sorted(dist(x, y) for x, y in itertools.combinations(points, 2))
[3, 4, inf, inf, nan, 1, inf, inf, inf, inf]

The infinities moved towards the end of the sequence but not all of them went all the way to the right, and the smallest value did not make it all the way to the beginning. It's as if the NaN value splits the sequence in two pieces, each of which is sorted separately.

Of course, in this example it's easy to spot the infinities, but in the original case it was just a rarely occurring value in a much larger set of inputs. If you inspect the beginning of the sequence, it is nicely sorted, and you would have had to browse much further to see the strange values.

Why this happened

Sorting algorithms, including Python's, typically expect the values to be sorted to belong to a totally ordered set. This requires that any two values can be ordered: one is smaller than the other, or the values are equal. I said that NaN values propagate through almost all computations, but comparisons are an exception — they just return false, no matter which way the numbers were being compared. So since NaN cannot be correctly compared to anything, it can end up breaking the sorting algorithm.

Why did we get the two subsequences sorted separately, then? It actually depends on the exact input, but that kind of thing seems to happen with many inputs I've tried. Python's sorting algorithm is quite advanced, but let's try with a much simpler algorithm as a model:

loop through all consecutive pairs $(a, b)$ of elements, and if $a>b$, swap them
if any swaps were made, start over, otherwise stop

It's not hard to see that assuming a totally ordered set, this results in a sorted sequence, even if the algorithm can take unreasonably long to run. In our example, the sequence is

$$ \infty, \infty, 3, 4, \textrm{nan}, \infty, \infty, \infty, \infty, 1 $$

and the first iteration will move one of the $\infty$s past the $3$ and $4$, but when it compares $\textrm{inf}>\textrm{nan}$, the result is false. Similarly, the first iteration will move one $\infty$ past the $1$. On the second iteration, the $3$ and $4$ percolate to the beginning, and eventually also the $1$ makes its way to just after the $\textrm{nan}$, but again, since $\textrm{nan}>1$ is false, it can't move any further to the left.

How to fix it

If you don't expect to see infinities or NaNs in input, you might want to check for them. The fix to my problem was the equivalent of

assert all (math.isfinite(c) for x in points for c in x), \
    "Coordinates must be finite"

Another solution is to use Numpy's sort, which puts NaNs last since numpy 1.4.0.

In R, is.na treats NaN the same as NA (“not available”) so it gets omitted in many contexts – for example, sort omits NA values by default.

Go is adding a total-ordering comparison (via).

Background: other problems with floats

I got the inspiration to write this when Julia Evans asked

[...] how floating point numbers can be treacherous -- what are specific examples of when they've betrayed you?

She got a huge number of replies and wrote a great summary of many of them.