Fanning Software Consulting

Representing Missing Data in IDL

QUESTION: To all ye who have attained IDL nirvana and to the one who speaketh the truth :p

Missing data points in my binary data (coded as 32-bit words) are denoted by large numbers, like 999999. In order not to plot these missing values, I am using !Values.F_NAN. But the array ought to be floating type to set it to NaN directly. Besides, the type of a variable within a structure can't be modified.

What I did was this:

   array = fltarr(dim)             ; dim is dimension, i.e., #days * #data/day
   array = float(data[*].mydat)    ; data[dim].mydat is data variable
   array [where(array [*] eq 999999.)] = !Values.F_NAN

It works, but I was wondering if there is a “better way” to do this?

ANSWER: Ken Bowman answers this question on the IDL Newsgroup.

The first line in your code above is unnecessary, as the following line will create the array as a floating point array automatically.

And the third line in your code above is not a great idea, since it will fail when there are no missing data, because WHERE then returns -1 rather than a list of indices. Plus, the [*] syntax is completely unnecessary. You should do something like this instead:

   i = WHERE(a EQ 999999.0, count)
   IF (count GT 0) THEN a[i] = !VALUES.F_NAN

Other than that, the concept seems fine. You have to create a FLOAT variable in order to use NaNs, which I heartily endorse.

The only alternative is to create the original data structure using a FLOAT instead of a LONG (presumably when you read the data). I prefer to replace missing data codes with NaNs at the point I read the data. That way I don't use them inadvertently.
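A minimal sketch of replacing the codes at read time, assuming the raw values arrive as a LONG array and that 999999 is the missing-data code. The function name CodesToNaN and its MISSING keyword are hypothetical, not part of IDL:

```idl
; Hypothetical helper: convert integer data to FLOAT and replace
; a missing-data code with NaN. The name and keyword are illustrative.
FUNCTION CodesToNaN, raw, MISSING=missing

   ; Assume 999999 is the missing-data code unless told otherwise.
   code = (N_Elements(missing) EQ 0) ? 999999L : missing

   ; Find the missing values while the data are still integers,
   ; so the comparison is exact.
   i = Where(raw EQ code, count)

   ; Convert to floating point, then flag the missing values.
   result = Float(raw)
   IF count GT 0 THEN result[i] = !Values.F_NaN

   RETURN, result
END
```

With the data structure from the question, this could be called as array = CodesToNaN(data.mydat) immediately after the data are read.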

I pointed out that if it was really only a problem with plotting the data, then a MAX_VALUE keyword would work perfectly well, without any need to change the data to NaNs:

   Plot, array, MAX_VALUE=999999-1

Ken agreed with this, but pointed out that representing data as NaNs often prevented other problems downstream of the plotting.

This is true, but using “special numbers” to indicate missing data is rife with the possibility of using the missing value as valid data without noticing it. I'm a big advocate of using NaNs because they ensure that if you use them by mistake, your result will be a NaN (which is usually hard to ignore).
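A short illustration of the difference, assuming 999999 as the missing-data code (the variable names are illustrative). The special number is averaged in without any warning, while the NaN carries through to the result:

```idl
; Three good values and one missing value, coded two different ways.
coded  = [1.0, 2.0, 999999.0, 4.0]
nanned = [1.0, 2.0, !Values.F_NaN, 4.0]

; The special number is averaged in silently, giving a wrong answer.
Print, Total(coded) / N_Elements(coded)

; The NaN propagates to the result, so the mistake is obvious.
Print, Total(nanned) / N_Elements(nanned)

; Skipping the NaN explicitly gives the mean of the good values.
Print, Mean(nanned, /NaN)
```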

This caused the original questioner to ask another question.

QUESTION: This prompts me to ask another question, if I may. Since I have lots of missing data, and I do lots of math operations (array operations, FFT, etc.), will these NaNs propagate all the way through in such situations? Should I be using them in conjunction with the FINITE command? Any pointers as to where one ought to be careful with these NaNs?

ANSWER: Ken answered with a warning about a potential bug in TOTAL in IDL 6.3 that could cause the IDL user problems.

Many IDL functions include /NAN keywords to skip NaNs in operations (TOTAL, MEAN, etc.). In other cases, you will have to find the good data with WHERE(FINITE(...), COUNT = count).
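For a routine used without a /NAN keyword, the WHERE(FINITE(...)) pattern might look like the sketch below, which uses CORRELATE on two made-up vectors and keeps only the pairs where both values are good. Note that a direct comparison such as x EQ !Values.F_NaN never matches, because a NaN is not equal to anything, including itself. Use FINITE to test for it.

```idl
; Two illustrative vectors, each with one missing value.
x = [1.0, 2.0, !Values.F_NaN, 4.0, 5.0]
y = [2.1, !Values.F_NaN, 6.2, 7.9, 10.1]

; Indices where both x and y hold good (finite) values.
good = Where(Finite(x) AND Finite(y), count)

; Operate on the good pairs only, and only if there are any.
IF count GT 0 THEN Print, Correlate(x[good], y[good]) ELSE Print, 'No valid data.'

; FINITE with /NAN locates the NaNs themselves.
bad = Where(Finite(x, /NaN), nbad)
Print, 'Number of NaNs in x: ', nbad
```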

There is one special case that you have to watch out for when using TOTAL with the /NAN keyword. If all of the elements are NaNs, the result returned is not a NaN, but a zero!

   IDL> x = replicate(!values.f_nan, 5)
   IDL> print, x
             NaN          NaN          NaN          NaN          NaN
   IDL> print, total(x)
             NaN
   IDL> print, total(x, /nan)
         0.00000

I think this is a serious implementation bug because it renders the /NAN keyword useless in most circumstances, but I guess we are stuck with it.

Inconsistently, this happens with TOTAL, but not with MEAN.

   IDL> print, mean(x, /nan)
             NaN
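In versions where TOTAL behaves as described above, one way to guard against the all-NaN case is to count the finite values before trusting the sum. A minimal sketch, with a hypothetical wrapper name (NaNTotal is not an IDL routine):

```idl
; Hypothetical wrapper: sums the finite values of x, but returns
; NaN (rather than zero) when x contains no finite values at all.
FUNCTION NaNTotal, x

   ; Locate the good values and count them.
   good = Where(Finite(x), count)

   ; Nothing to sum, so report that with a NaN.
   IF count EQ 0 THEN RETURN, !Values.F_NaN

   RETURN, Total(x[good])
END
```

Called as Print, NaNTotal(Replicate(!Values.F_NaN, 5)), this returns a NaN instead of a zero, so an array with no valid data cannot be mistaken for one that sums to zero.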

Editor's Note: This inconsistency has been fixed in the IDL 7.1 version I am looking at currently.
