Fanning Software Consulting

Histogram: The Razor's Edge

QUESTION: So I've been reading up on HISTOGRAM and how it is optimized to be wickedly fast. I've also discovered the fancy things one can do with reverse indices. I've even read JD Smith's exegesis on the topic: HISTOGRAM: The Breathless Horror and Disgust.

So, just when I thought it was safe to start using HISTOGRAM with the frequency that Californians use "like", I was brought this scary result by the guy on the other side of my wall (his name is Tiberius).

   IDL> print, histogram([-5.50,-5.45,-5.40,-5.35,-5.30,-5.25],min=-5.50,binsize=0.05)
        1           2           0           2           0           1

Wait a minute, this should be a uniform distribution!

Now, I admit that Tiberius contrived this example with the intent to cause harm to HISTOGRAM by placing the values at the exact boundaries of each bin. But, we didn't expect it to break! Honest. After scratching our heads for a bit we realized that HISTOGRAM is conducting its internal affairs by subtracting MIN from the data and then dividing by BINSIZE:

   IDL> print, ([-5.50,-5.45,-5.40,-5.35,-5.30,-5.25]-(-5.50))/0.05, format='(f21.19)'
   0.0000000000000000000
   1.0000038146972656250
   1.9999980926513671875
   3.0000019073486328125
   3.9999961853027343750
   5.0000000000000000000

Well, there ya go. It's a round-off problem that results from trying to balance the values on a razor's edge. The subtraction and division knock a few values off balance. But, the result is still wrong and I just don't know enough about roundoff error to know if this is an insurmountable problem (but my educated guess is "yes".) Is there some clever way to make HISTOGRAM behave properly in such situations?

ANSWER: The answer is supplied by a number of IDL users from the IDL newsgroup, where this question was originally asked. First, and concisely, here are the solutions by Craig Markwardt.

Here are five partial solutions to the problem (a sketch of the first two follows the list):

  1. Use double precision values.
  2. Multiply all values by (1 + eps), where eps= (MACHAR()).EPS. (This assumes you always want to round "up" to the next bin.)
  3. Add a random deviate to each value. (This doesn't solve the razor's edge problem per se, but reduces the chance that a human-entered quantity will land on a bin-edge, and also reduces the bias of always rounding "up" to the next bin.)
  4. Work in powers of 2 instead of multiples of 0.05.
  5. Learn to live with it.
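
For concreteness, here is how the first two suggestions might look in IDL. This is my own quick sketch, not Craig's code, and remember that both are only partial fixes:

   ;; Tiberius' data once more.
   data = [-5.50, -5.45, -5.40, -5.35, -5.30, -5.25]

   ;; 1. Use double precision values throughout.
   h1 = Histogram(Double(data), MIN=-5.50D, BINSIZE=0.05D)

   ;; 2. Scale every value by one part in EPS to push edge values off
   ;;    the razor. Watch the sign: for negative data like this,
   ;;    (1 + eps) nudges values toward the *lower* bin, so (1 - eps)
   ;;    is used here to round "up" into the next bin.
   eps = (Machar()).eps
   h2 = Histogram(data * (1.0 - eps), MIN=-5.50, BINSIZE=0.05)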

The real problem, of course, is that computers cannot represent real numbers as easily as we think they can. Ed Meinel points this out.

The technical answer is that computers represent numbers in base 2; humans represent numbers in base 10. Mapping numbers from base 2 to base 10 is exact; mapping numbers from base 10 to base 2 ain't.

For example, mapping 5+5/10 = 5+1/2 is an exact correspondence. Mapping 5+45/100 = 5+1/4+1/8+1/16+1/128+1/256+1/2048+... is not exact, even in double precision.

So, depending on the machine precision, 5.45 on the computer is either slightly greater than or slightly less than an exact representation of 5.45. On top of that, 0.05 isn't exact on the computer either, so your bin size is also slightly different from what you are expecting.
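
A quick check of the bin size at the IDL prompt bears this out:

   IDL> print, 0.05, format='(f21.19)'
   0.0500000007450580597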

The bottom line is this: if you want the results to be exact, think like a machine.

Bob Stockwell elaborates on the proper solution: round to integer values.

The histogram gives the correct result. Check out what you put into the histogram function.

   IDL> print, [-5.50,-5.45,-5.40,-5.35,-5.30,-5.25],format='(f50.25)'
   -5.5000000000000000000000000
   -5.4499998092651367000000000
   -5.4000000953674316000000000
   -5.3499999046325684000000000
   -5.3000001907348633000000000
   -5.2500000000000000000000000

So, if such razor edge stuff is a concern of yours, pre-process the data and use integers (i.e., round to integers). This also applies to any conditional test of a float (for instance, FOR i = 0.1, 10, 0.001 DO BEGIN ... etc.). Here is your example, rounded to integers. You get the result you expect.

   IDL> Print, Histogram(Round([-5.50,-5.45,-5.40,-5.35,-5.30,-5.25]/0.05), $
           Min=-5.50/0.05, Binsize=1)
   1 1 1 1 1 1

Paul van Delst finishes the discussion with a suggestion.

The subtraction and division don't knock a few values "off balance"; the values are simply not representable to the precision you require. Such is the nature of floating point numberology. As other posters suggested, you need to preprocess the values somehow to make them integers, or take the precision into account when you do the required comparisons. It's worth writing a separate function to do floating point number comparisons. In Fortran I do the following:

!         ABS( x - y ) < ( ULP * SPACING( MAX(ABS(x),ABS(y)) ) )
!
!       If the result is .TRUE., the numbers are considered equal.
!
!       The intrinsic function SPACING(x) returns the absolute spacing of numbers
!       near the value of x,
!
!         SPACING(x) = 2.0**( EXPONENT(x) - DIGITS(x) )    for x /= 0
!
!         SPACING(x) = TINY(x)                             for x == 0
!
!       The ULP optional argument scales the comparison.

where

!       ULP:  Unit of data precision. The acronym stands for "unit in
!             the last place," the smallest possible increment or decrement
!             that can be made using a machine's floating point arithmetic.
!             A 0.5 ulp maximum error is the best you could hope for, since
!             this corresponds to always rounding to the nearest representable
!             floating-point number. Value must be positive - if a
!             negative value is supplied, the absolute value is used.
!             If not specified, the default value is 1.

I don't know off the top of my head how to translate all that to IDL, but it would be a useful thing to do.

After re-reading this article a month or so later, Paul added these thoughts for those of you brave enough to attempt his suggestion.

Hmm. Reading that razor's edge article made me dig out a little f90 program I wrote to determine the real number spacing.

Given some number, A, one can do:

   exponent=ceil(alog(abs(A))/alog(2.0d))
   radix=2.0d  ; (machar(/double)).ibeta ??
   epsilon=(machar(/double)).eps
   spacing=epsilon * (radix^(exponent-1))

So for, say, A = 1.234568d+16:

   EXPONENT        LONG      =           54
   SPACING         DOUBLE    =        2.0000000

For A = 1.234568d+01:

   EXPONENT        LONG      =            4
   SPACING         DOUBLE    =    1.7763568e-15

For A = 1.234568d-01:

   EXPONENT        LONG      =           -3
   SPACING         DOUBLE    =    1.3877788e-17

and for A = 1.234568d-16:

   EXPONENT        LONG      =          -52
   SPACING         DOUBLE    =    2.4651903e-32

which agree with the outputs of my f90 code using the intrinsic functions EXPONENT and SPACING. The problem is, of course, what to do when A = 0.0. I would just use spacing = epsilon/2.0. I think that, in this case, doing something like this would work.

   two=2.0d
   radix=two
   epsilon=(machar(/double)).eps   ; as defined above
   IF ( A EQ 0.0 ) THEN $
     spacing = epsilon/two $
   ELSE BEGIN
     exponent=ceil(alog(abs(A))/alog(two))
     spacing=epsilon * (radix^(exponent-1))
   ENDELSE

where you are counting on the equality check EQ not to work for numbers that aren't represented as exactly zero (since exact zero can be represented). But I'm not sure. You may also need a check to limit the calculated exponent to the allowed range (the minexp and maxexp fields of the output from Machar).

Anyway.... back to work....
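
Putting Paul's pieces together, an IDL translation of his Fortran test might look something like this. This is only a sketch of mine, not Paul's code: the function name and the ULP keyword are invented here, and it uses his spacing formula from above (which, note, comes out a factor of two low when the larger value is an exact power of two).

   FUNCTION Compare_Float, x, y, ULP=ulp
      ;; Returns 1 if scalars x and y differ by no more than ULP
      ;; spacings of the larger value. Single precision assumed.
      IF N_Elements(ulp) EQ 0 THEN ulp = 1.0
      mach = Machar()
      z = Abs(x) > Abs(y)            ; the larger magnitude of the two
      IF z EQ 0.0 THEN BEGIN
         spacing = mach.xmin         ; Fortran's TINY(x) equivalent
      ENDIF ELSE BEGIN
         exponent = Ceil(ALog(z) / ALog(2.0))
         spacing = mach.eps * 2.0^(exponent - 1)
      ENDELSE
      RETURN, Abs(x - y) LT Abs(ulp) * spacing
   END

For example, the 0.9 values discussed later in this article differ by exactly one spacing, so the strict less-than test wants ULP=2 here:

   IDL> print, Compare_Float(0.9, 0.6 + 0.3, ULP=2)
      1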

And, recently, JD Smith had a couple of thoughts about comparing floats in IDL expressions such as IF A EQ B THEN ...

I know a lot has been said about the issue of floating point representation, but I thought I'd have a stab.

There's nothing special about float comparison, other than the fact that it's hard to write down a floating point number uniquely, by hand as a literal in your code, as ASCII in a text file, etc. This issue just doesn't exist for integers, for which there is a one-to-one mapping between integer representation in ASCII text and binary representation on disk (as long as the number is within the integer range).

A float is equal to another float only when their exact representation in memory is equal. Note that the following is not an issue:

   IDL> print,array_equal(replicate(4.923,100),4.923)
      1

It will *always* be true for any floating point number specified like this, even ones which cannot be represented exactly at the given precision:

   IDL> print,array_equal(replicate(4.92309879079079087908790879087,100), $
   		       4.92309879079079087908790879087)
      1

Problems only begin to occur when you start to rely on your (incorrect) intuitive expectation of how a computer *should* handle a floating point number:

   IDL> print,array_equal(replicate(4.923,100),1.00 + 1.02 + 1.045 + 1.858)
      0

But, why not, you ask, since:

   IDL> print, 1.00 + 1.02 + 1.045 + 1.858
         4.92300

Well, not all of those numbers (if considered as exact values drawn from the infinite set of all real numbers) can actually be represented uniquely as a float. The classic example of this issue is the decimal number 0.9, which in binary representation is:

   0.111001100110011001100110011........

and repeating so on forever. Unfortunately, you can only allocate so many bits for a float, so:

   IDL> print, 0.9 eq (0.6 + 0.3)
      0
   IDL> print,FORMAT='(F0.8)',0.9, 0.6 + 0.3
   0.89999998
   0.90000004

Note that this doesn't mean that 0.9 and numbers like it will *never* be equal to the "sum of their parts". Sometimes you get lucky and the bit truncation works out the same:

   IDL> print, 0.9 eq (0.5 + 0.4)
      1

Here's a nice example in C to get a better handle on this, which will let you peek behind the curtain and see the real bit representation of a given floating point number:

   #include <stdio.h>
   int main() {
     float f=0.9,f1;

     /* Print the float's bits as 4 hex bytes via a pointer cast. */
     printf("Literal -- %f: %x\n",f,*(unsigned int *) &f);

     f1=(float) 0.3 + (float) 0.6;
     printf("0.6+0.3 -- %f: %x\n",f1,*(unsigned int *) &f1);

     printf("Floats are %s\n",f1==f?"Equal":"Not Equal");
     return 0;
   }

For convenience it prints the floats as 4 bytes of hexadecimal instead of 32 bits of binary, and it uses a de-referencing trick to get at the actual binary data for the float. Here are the results:

   % ./compare_float
   Literal -- 0.900000: 3f666666
   0.6+0.3 -- 0.900000: 3f666667
   Floats are Not Equal

Note how I had to cast the literal numbers 0.3 and 0.6 to floats, since C treats floating point literals as doubles by default and carries out the arithmetic in double precision before assigning the result to a float. IDL works in single precision natively for such literals (which is why you aren't guaranteed to get the same answer with IDL and C).
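
You can see the flip side of this in IDL, where a floating point literal is single precision unless you say otherwise (my example, not JD's):

   IDL> print, 0.9 EQ 0.9d
      0
   IDL> print, 0.9d, 0.9, format='(f18.16)'
   0.9000000000000000
   0.8999999761581421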

Note also how the bit representation of the floats is not equal. Here they are in binary:

   3f666666: 00111111011001100110011001100110
   3f666667: 00111111011001100110011001100111

Just a single bit difference, but what a difference it is!

If you really want to test floats which arrived in a given array from various sources for "equality", you need to specify what "equality" means (if something other than the computer's quite naive definition of equality as "exact same representation in binary format"). You might try:

   ARRAY_EQUAL(abs(a-a[0]) lt epsilon,1b)

where epsilon is your maximum allowed difference within which floats should be considered "equal". A good number might be (machar()).eps, i.e.:

   IDL> print,abs(0.9 - (0.6+0.3)) lt (machar()).eps
      1

Ahh, all is right with the world.

A recent discussion in the IDL newsgroup brought this topic up again. Craig Markwardt, as usual, had something intelligent to say about the problem that I thought was worth noting here. Here is the conversation. First, from Jacqueline, who appeared to solve her immediate problem.

Because of your help, I realised that the problem was in the way I was defining the intervals. In fact, they were too large. Then I re-wrote the program, introducing an eps=1.d-3 value. The loop turned out to be:
 
   mmod=[findgen(10)*0.01+0.01,findgen(3)*0.1+0.2]
   eps=1.d-3
   for j=0,12 do begin
     a=where(massa ge mmod[j]-eps and massa le mmod[j]+eps,count)
     print,mmod[j],count
   endfor

Now I get the right counts.

Craig answers.

Jacqueline, from your description of the problem, your samples are right at the edges of the bins. I guess you've found a way to do this, but it doesn't look pretty, and using a WHERE() function inside a FOR loop, well, it's kind of weird.

But one will always get strange results if one tries to histogram/bin data when the sample values fall exactly on bin edges. (Believe me, I goofed on this for data from a multi-million dollar space mission.)

There are lots of ways to solve this, but they all rely on moving the bin edges away from your data samples. These techniques will always work, and do not rely on having special knowledge of how the data is quantized.

   ;; With HISTOGRAM (bin edges start at 0.005)
   hh = histogram(massa, min=0.005, max=0.405, binsize=0.010)

   ;; I guess these are where your values are sampled, right?
   mmod = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4]
   ;; Then put the edges just to the left, plus one edge on the right hand side
   edges = [mmod - 0.005, 0.5]

   ;; With HISTOGRAM and VALUE_LOCATE
   hh = histogram(value_locate(edges, massa), min=0, max=12)

   ;; With a loop, no WHERE (initialize hh first)
   hh = lonarr(13)
   for i = 0, 12 do begin
     hh[i] = total(massa GE edges[i] AND massa LT edges[i+1])
   endfor

   ;; OK, you really want a loop with WHERE?
   for i = 0, 12 do begin
     wh = where(massa GE edges[i] AND massa LT edges[i+1], count)
     hh[i] = count
     print, mmod[i], hh[i]
   endfor

The expression,

    massa GE edges[i] AND massa LT edges[i+1]

does two things. Since EDGES[i+1] will be used as the right-hand edge of the i-th bin, and then used again as the left edge of the (i+1)-th bin, one can never miss counts. There are *never* any holes due to floating point arithmetic. By the way, VALUE_LOCATE() does all this behind the scenes; that's what it was designed to do.

Also, look at the strategic use of GE on one bin edge, and LT on another bin edge. This guarantees that one sample can't fall into two bins, i.e. prevents double-counting.
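
In case you haven't used it, VALUE_LOCATE reports, for each value, the index of the bin edge at or below it, which is exactly the binning operation Craig describes. A toy example of my own:

   IDL> edges = [0.005, 0.015, 0.025]
   IDL> print, value_locate(edges, [0.01, 0.015, 0.02])
              0           1           1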
