# Help! The Sky is Falling!

**QUESTION:** I'm not sure exactly how to express this question. It appears periodically
on the IDL newsgroup under the general category of *The Sky is Falling!* It usually starts out
like this: "*I've found a bug of such magnitude that it confirms my feeling that IDL is a piece of blah, blah, ... And I'm sure every thinking person with half a brain will
see how perverse this is and blah, blah, blah ...*"
What it boils down to is that the person posting the article is just about to get his or her first
lesson in
how a computer represents numbers. It is a lesson we should *all* learn!

Here is the most recent incarnation of the question:

I ran into a number transformation error yesterday that is still confusing me this morning. The problem is that the number 443496.984 is being turned into the number 443496.969 from basic assignments using Float() or Double(), despite the fact that even floats should easily be able to handle a number this large (floats can handle "±10^38, with approximately six or seven decimal places of significance").

** ANSWER:** To address this specific question, I present the excellent answer Paul van Delst offered
in this case. Then, I want to add a couple of other articles I think you
will want to read. But, first, here is what Paul had to say to this person's question:

Don't confuse the magnitude of the number with its precision. This all comes about because most floating point numbers can't be represented exactly in binary (I say most because numbers like 0.0 and 1.0 usually can. It's the stuff like 0.1 and 443496.984 that cause the heartache). Numbers are stored with a mantissa (describes the actual number you want) and an exponent (describes the, well, the exponent.). So, **assuming** that 7 significant figures is a definite "cutoff" for precision, then:

1.1e+37 == 2 significant figures. Precision probably good to 1.100000e+37 1.01e+37 == 3 significant figures. Precision probably good to 1.010000e+37 .... 1.000001e+37 == 7 significant figures. At the limit of precision etc..

and for your number:

6.984 == 4 sigfig, good to 6.984000 96.984 == 5 sigfig, good to 96.98400 496.984 == 6 sigfig, good to 496.9840 3496.984 == 7 sigfig, good to 3496.984 43496.984 == 8 sigfig, good to 43496.98 443496.984 == 9 sigfig, good to 443496.9

So, if the 7 significant figure assumption is a good one (it may not be), then if you set a single precision value to 443496.984 and print it out, a result of 443496.9XX where the XX can be anything is perfectly reasonable.

[**Question**:] But, why doesn't x2=Double(443496.984) produce the correct result?

Because 443496.984 is, again, a *single precision* literal constant. That is, you're
converting a single precision number (where the last 2 decimap points can be anything, see above)
to a double precision one. You'll then have a very precise, "nearly correct" number. As others
have pointed out you need to create the number as a double to begin with,

x2 = 443496.984d0

so that you'll have, say, 16-17 significant figures, or a number "good" to 443496.9840000000.

I reckon that, if in doubt, *always* use double precision for floating point variables. There's nothing worse that trying to debug code and discovering weird results are related to the precision of the represetation (well, maybe apart from an insidious compiler bug, but that would never happen with IDL! :o)

Here is another form of the question. This one most often misunderstands
how the IDL **Print** command works.

**Question: ** I'd like create a varible which always holds numbers out to 6 places
passed the decimal point, and 4 numbers before the decimal point (e.g.
1999.123456). I tried double
precision but this seems to only hold 8 digits total for decimal numbers.
However I would like to create a variable corresponding to a fraction of a
year (e.g 1995.123456). I have not been able to find a procedure allowing
me to create a 10 digit decimal number, without the number being rounded
off to 8 digits. Does anyone know of a way to declare a variable to hold a 10 digit
decimal number?

**Answer:** Here is Liam Gumley's reponse to this question.

I think you are confusing internal machine precision with print formatting precision. For example, to print the double precision PI system variable:

IDL> print, !dpi 3.1415927 IDL> print, !dpi, format='(e20.10)' 3.1415926536e+00The default print format only prints 8 digits, but you can change the print format to print more digits. On machines with IEEE arithmetic, 64 bit double precision stores about 15 digits of information. For a nice discussion of how floating point numbers are represented, see "Section 4: Floating-point Numbers" in this reference.

Finally, here is an article William Clodius wrote for the IDL newsgroup in November of 2000.

Almost every computer newsgroup on programming languages or numerics has this discussion at one time or another. The vast majority of programming languages rely on the hardware representation of floating point numbers so that this issue is essentially programming language independent. The vast majority of computer systems (where computer systems excludes hand held calculators which often implement binary coded decimal) now implement the core of the IEEE 754 standard, so my remarks will be confined to such systems. However the minority of computer systems that do not implement the IEEE 754 standard share many of the same quirks.

Details on these quirks, with an emphasis on the IEEE 754 standard, are available from reports by David Goldberg, "What Every Computer Scientist Should Know about Floating Point Arithmetic," and the subsequent supplement to that report by Doug Priest, "Appendix D". Both reports were written for Sun Computer systems and are available in pdf and postscript formats. [Editor's Note: There is now an easier to understand reference available, too.]

In almost any binary system including IEEE 754, most of the time floating point numbers can be thought of as divided into three parts, a sign, a mantisa, and an exponent stored in a fixed size "word". In IEEE 754 the mantisa can be thought of as an integer with values from 2^n_mant to 2*2^n_mant-1, where n_mant is the number of bits available for the mantisa. Note that the mantisa is non-zero. In IEEE 754 the exponent can be thought of as a scale factor multiplying the integer, where the scale factor is a simple poser of two whose relative range is determined by the number of bits available for the exponent. IEEE 754 requires that the computer make available to the user two such representations: what we normally think of as 32 bit single and 64 bit double precision.

The fact that the data is stored in a fixed size word results in the first surprise to very inexperienced users: this representation is correct for only a finite number of values. This representation cannot deal with all elements of the countable set of all rationals, let alone the uncountable set of all irrationals. Inaccuracies and errors are almost unavoidable in any attempt to use this data type, except for knowledgeable users in limited domains.

This binary representation does not let IEEE 754 exactly represent numbers that are not simple exact multiples of powers of two. This results in the second surprise to many users: most simple decimal floating point numbers (e.g., 0.3 or 0.1) cannot be exactly represented and any attempt to store such numbers results in errors. E.g., 0.1 might become 0.1000000005 when stored.

Most manipulations of IEEE 754 numbers result in intermediate values that cannot be represented exactly by single and double precision numbers. This results in the third surprise, most manipulation result in a cumulative loss of accuracy. To minimize this loss of accuracy it requires that intermediate calculations be performed at a higher precision with well defined rounding rules to obtain the final representation of the intermediate results. Most systems appear to use double precision to represent intermediate results for single precision calculations. All systems must use a precision higher than double for intermediate results of double precision calculations. This higher precision representation need not be available to users, however the Intel extended precision is essentially this higher precision intermediate type (it differs slightly in ways that irritate numericists). When this higher precision type is available it need not have the well defined rounding and error propagation propers of single and double precision.

While most of the time IEEE 754 has this behavior there are exceptions represented by special values. Obviously there has to be a zero value. IEEE 754 also has special values such as + or - infinity to represent such things as dividing a finite number by zero, NaN (Not a Number) for representing such things as zero divided by zero, and a signed zero. This allows calcuations to procede at high speed code and the post-facto recognition and (if neecessary) correction of problems with the algorithm or its "inaccurate" implementation in IEEE 754. It also generates floating point exceptions that can be detected automatically withot examining all the output numbers. Many languages were developed before IEEE 754 and do not map naturally to this model.

The existence of infinities and NaN leads to a fourth surprise to many users: obviously bad results can be generated and the computer does not stop the instance it detects such "bad" values. As there are prblem domains where such values are expected and not an indication of problems, it is up to the user to check for such values if they can be generated and when generated indicate a problem.

Some other surprises.

The definition of the IEEE 754 mantisa, an integer with values from 2^n_mant to 2*2^n_mant-1, where n_mant is the number of bits available for the mantisa, is termed a normalized number. This is error prone for very small numbers. IEEE 754 mandates that there be available for such small numbers what are termed denorms where the mantissa is interpreted as an integer from 0 to 2^n_mant, so that accuracy degrades gradually for such values. However, this complicates the implementation of the floating point, so some processors, e.g., the DEC Alpha make this available only in software at a greatly reduced performance.

The special values for IEEE 754 were chosen with specific applications in mind and are not always the best for other applications. In particular some applications benefit from unsigned zeros and infinities others from signed NaNs none of which are provided by IEEE 754. Only sophisticated users tend to complain about this.

The default rounding behavior for IEEE 754 from an "extended" precision intermediate that is exactly halfway between values in the result representation, always to the same final bit (effectively alternately up and down), often surpises very observant users, although it is designed to reduce the propagation of systematic errors in most applications.

And, finally, just one more time, with the answer provide by William again.

**Question:** To try and find where the problem is, we tried the following lines:

IDL> a = DOUBLE(42766.080001) IDL> print,a,FORMAT='(F24.17)' 42766.07812500000000000

As you see, the number we get out isn't the same as the number we entered. I'm guessing it's to do with the way IDL stores numbers in memory, but my understanding of low-level computational processes isn't great.

Can anybody help me understand what's going on, and/or if there's a way around? I'd really appreciate whatever help is on offer, so thanks in advance.

**Answer:**

All computers IDL is available for store numbers in memory using a binary representation. This representation comes in at least two forms, single (float) and double precision. Both representations can be thought of as typically representing a number by an integer multiplied by a scale factor (exponent) that is an integer power of two. Double uses twice as many bits as float to allow a larger range of integers and scale factors. Because of the finite range of the integers, and because the exponent is a power of two and not a power of ten, only an infinitesimal fraction of the numbers that can be written exactly in decimal can be represented exactly in a finite binary representation. This is a common source of confusion for users of most programming languages. (There are some languages that use less efficient representation such as decimal or rational arithmetic, but such languages, in addition to their inefficiencies, often provide only the simplest mathematical operations.)

In addition to this common source of confusion, your code has an additional problem that is almost as common among such languages. You apparently don't understand the lexical conventions used to distinguish between literals that represent single and double precision numbers. IDL iginores your DOUBLE in deciding this. Instead it interprets your 42766.08001 as a single precision literal, and finds the nearest representable value, which is only accurate to about 7 decimal places. If you want a literal to be interpretted as a double precision, it mus have D# (or d#) as a suffix, where # is an appropriate decimal exponent, i.e. you could represent 42766.08001 as any of these:

42766.08001D0 42766.08001 d0 42.76608001 D3 4.276608001 D4 0.4276608001 D5 ...

to have it interpreted as a double precision number.

# Latest Incarnation

Here is the latest incarnation
of the problem, slightly edited, from an April 18, 2007 IDL newsgroup article entitled *Fix(4.7*100) is...469.*

There's something I can not explain to myself, so maybe someone can enlighten me?

IDL> print, fix(4.70*100) 469and also:

IDL> print, string(4.70*100,format='(i3)') 469While, everything else that came into my head to try was okay. For example:

IDL> print, fix(5.70*100) 570 IDL> print, fix(3.70*100) 370 IDL> print, fix(4.60*100) 460 IDL> print, string(4.60*100,format='(i3)') 460What's up with this!?

I suggested the person asking the question should read this article, but that only got the water boiling. It was soon followed up with this.

It seems weirder than usual though. What about this:

IDL> print, 470.0 - (4.70*100) 3.05176e-005which is bigger than the smallest float:

IDL> print, (machar()).eps 1.19209e-007So how can it be the float accuracy problem if the difference between the expected and the real value is 256 times bigger than the float error?

After this, I was accused of not understanding the problem.

I'm afraid, David, this is not actually the question you describe in your article. I do not expect better accuracy than I provide.

There is nothing wrong here with the floating point accuracy.

IDL> print, 4.700*100.00 470.000It is the conversion to integer (I imagine) which makes no sense.

IDL> print, string(4.700*100.00,format='(i3)') 469

Christopher Thom came to my defense:

No. Read the article again...and the one on double precision. It is exactly what is described there. You have provided IDL with a number that has 8 decimal places of precision. 4.7 is really somewhere between 4.6999999 - 4.7000001, but cannot be precisely represented. For example:

IDL> print, 4.7 4.70000 IDL> print, 4.7,format='(f18.16)' 4.6999998092651367The important point is that converting the

actual number as represented in the computerto an integer, isnotconverting the number youthinkis represented in the computer.So...if you take the number that is actually in IDL...move the decimal place two places to the right, you get:

IDL> print, 4.7*100,format='(f18.14)' 469.99996948242188Now chop off every thing after the decimal place (which is what fix() does)...and 469 is a prefectly reasonable answer to the question you asked. If you want a better answer, you need to ask a better question.

I can't speak as to exactly how the conversion to integers happens within the

Stringcommand you gave, but I imagine it's probably the same.

Now another user jumps in to warn us about using the **FIX **command
if this kind of thing is important to you (and how can it *not *be?).

Personally, I think

FIXshould be used with some care, because the documentation is a bit vague on how the conversion to integer type is done, and the result does depend on the input type. For numerical values, I think that usage ofROUND,CEILorFLOORis safer (but be aware that they produce long integers if the input is large enough, but this is mostly what you want anyway).But

FIXtruncates your number when converting it to integer. Compare:IDL> print, 4.999999 5.00000 IDL> print, fix(4.999999) 4You don't like this behavior? Fine, just use

ROUNDinstead:IDL> print, round(4.999999) 5

But this *still *doesn't convince everyone.

But, look. What is being displayed is different than what is being stored.

IDL> print, 470.0 - (4.70*100) 3.05176e-005

I had to argue that, on the contrary, what is being displayed reflects *exactly*
what is being stored.

IDL> print, 470, format='(f18.14)' 470.00000000000000 IDL> print, 4.70*100, format='(f18.14)' 469.99996948242187 IDL> print, 470.00000000000000D - 469.99996948242187D, format='(f18.14)' 0.00003051757813

There was more of this for the cognoscente, 42 articles worth, but far from a record. But
I leave it to you to find it and read it for yourself. Bottom line? Computers are not always doing
what you *assume *they are doing.

## Et tu, Brutus? Incriminating FINDGEN, of All Things!

The latest public scare involving numbers on computers has put the venerable FINDGEN under scrutiny. Consider this piece of travesty.

IDL> array = Findgen(10000, 10000) IDL> Print, Max(array) 7.50000e+007

Say what!? I'm pretty sure that 10000 x 10000 = 1e8. What's going on?

It turns out that IDL FINDGEN function uses a floating point counter, rather than a integer counter when creating the array. Lajos Fawlty offers this code to illustrate.

PRO Test CPU, tPool_NThreads=1 n = 10l^8 nn = n-1 ; Normal FINDGEN implmentation a1 = Findgen(n) ; IDL's implementation a2 = FltArr(n) count = 0.0 FOR j=0l,nn DO a2[j] = count++ ; An 64-bit integer implementation a3 = FltArr(n) count = 0LL FOR j=0L,nn DO a3[j] = count++ Print, a1[nn], a2[nn], a3[nn], Format='(3F15.3)' END

Running the program give these results.

IDL> Test 16777216.000 16777216.000 100000000.000

Note that the program is restricted to one CPU, since the results will depend on how many CPU cores you are running in your IDL session, too. (Starting values for threads are calculated as integers.)

The folks working on IDL point out that changing the counter to an integer will make the FINDGEN function at least four times slower and
are reluctant to change. [A concern, I guess, because more people
use the FINDGEN function than use function graphics.] Alain Leacheux
comes to their rescue by pointing out that in *any *
computer programming language, including IDL, it is a mistake to think
you can use a floating point counter that is any larger than the
inverse of the floating point precision of the machines. In IDL, such
precision is given my the Machar function.

IDL> Print, Long(1/(Machar()).eps) 8388608

Note that the incorrect result above is twice this number.

To illustrate what Alain means, consider this code.

IDL> Print, 16777216LL + Findgen(10), Format='(F25.0)' 16777216. 16777218. 16777220. 16777220. 16777220. 16777222. 16777224. 16777224. 16777224.

There are jumps in the sequence, because when the 64-bit integer is converted back to a Float (by adding a float to the number), the floating point number does not have enough precision to handle the addition of a small value.

The only way to handle problems like this is to not use floating point counters creater than 8388608. You must use double precision values when working with larger counters like this.

IDL> Print, 16777216LL + Dindgen(10), Format='(F25.0)' 16777216. 16777217. 16777218. 16777219. 16777220. 16777221. 16777222. 16777223. 16777224. 16777225.

Copyright © 1997-2012 David W. Fanning

Last Updated 24 April 2012