“Correcting” Data with Thin Plate Splines

QUESTION: I have a two-dimensional data set and some of the data points are missing. For various reasons I would like to replace the missing data points with the “correct” data values. Is this kosher? And is it possible to do it in IDL?

ANSWER: I have extremely strong misgivings about whether it is kosher to replace missing data with “correct” data. And I am certainly not alone in this view. For one thing, neither you nor I have the foggiest idea what the correct value of the data should be. No one really expects science data to be correct in a specific case. But, perhaps, we might be able to guess where the data might have appeared had we been able to collect it. And, perhaps, there are good and sufficient reasons to do so. If you can get your peer-reviewed paper past the referees, then who am I to complain?

So I am going to show you one way you might be able to guess at some data values using IDL. Please note that I am using nice data. Real science data may not look like this, and the method I am going to demonstrate may not be appropriate for your particular data. You and the people you are trying to convince will have to be responsible for knowing if this is an appropriate way to treat your data. My purpose here is to show you one way to “correct” your data if the missing data is not too large, and the data values are well-behaved and reasonably smooth.

The method I am going to use is a thin plate spline method as implemented with the IDL routine GRID_TPS. As this method is rather compute-intensive, I am going to demonstrate on a small data set. If you have a large 2D data set, you may have to adapt the following method for small regions of the large data set you wish to repair. For small missing regions, a 25-by-25 region surrounding the missing data is often reasonable. The GRID_TPS function requires a minimum of seven non-colinear points to come to a solution. The more you have, the better the solution, but at the cost of computer resources. Thin plate splines are often better than polynomials for fitting complex functions with local distortions.

To start, let's simulate some missing data. I assume you have installed the Coyote Library on your machine. Type the following commands to put some holes in a 2D data set. I will put one rather large hole in the data, and 25 other random holes.

   peak = cgDemoData(2)    dims = Size(peak, /Dimensions)    missingPeak = peak
   missingPeak[15:18, 24:25] = !Values.F_NAN
   badIndices = Randomu(-3L, 25) * dims[0] * dims[1]
   missingPeak [badIndices] = !Values.F_NAN
   cgSurface, missingPeak, /ELEVATION, /SHADED

You will see something similar to the figure on the right below. If you rotate the figure around on your computer with your mouse, you will see definite holes in the surface. The idea is to be able to somehow fix these holes in the surface by guessing at the values in the missing locations and thereby restore the surface to something resembling the figure on the left.

On the left the original two-dimensional data set, and on the right the surface with missing data shown as holes.

To use GRID_TPS, we need vectors that describe the good XY locations and the value of the surface at those locations. We will use these locations and values to reconstruct or interpolate the missing values. Type the following commands to create the X, Y, and Z vectors we need.

    goodIndices = Where(Finite(missingPeak), count)
   xy = Array_Indices(dims, goodIndices, /DIMENSIONS)    x = Reform(xy[0,*])
   y = Reform(xy[1,*])    z = missingPeak[goodIndices]

The only thing left is to create a gridded surface with GRID_TPS, using these values. Our grid will be the size of the original grid, start at location [0,0] in X and Y, and will have a grid spacing of 1 in both X and Y directions. In other words, it will be a grid with the original dimensions.

   fittedSurface = GRID_TPS(x, y, z, NGRID=dims, START=[0,0], DELTA=[1,1])
   cgSurface, fittedSurface, /ELEVATION, /SHADED

You see in the figure below that the fitted surface looks almost identical to the original surface.

The reconstructed surface looks almost identical to the original.

But how close is it really? Type these commands to locate the bad indices and fix them. The difference between the corrected data and the original data is small.

   badIndices = Where(Finite(missingPeak) EQ 0)    correctedPeak = missingPeak
   correctedPeak[badIndices] = fittedSurface[badIndices]
   difference = peak - correctedPeak
   Print, 'Min/Max Difference from Orginal: ', Min(difference), Max(difference)
        Min/Max Difference from Orginal:      -2.58771      2.6143

In fact, the error is a very small percentage of the total data range.

    Print, 'Maximum Percent Difference: ', Max(Abs(difference)) / (Max(peak)-Min(peak)) * 100.0
      Maximum Percent Difference:      0.168670

The small percentage error gives confidence that thin plate splines might be an appropriate technique for data correction in this instance.

Note, that if you wanted to use the MIN_CURVE_SURF function to create a surface with a minimum curvature, you could execute this code. The MIN_CURVE_SURF function takes quite a bit longer to execute and should only be used with small areas.

   fittedSurface = Min_Curve_Surf(z, x, y, GS=[1,1], BOUNDS=[0,0,40,40], NX=41, NY=41)

Version of IDL used to prepare this article: IDL 7.1.2