Position Matching

QUESTION: I have two lists of stars, each containing X and Y positions. I wish to match up stars in the first list with stars within a certain radius from the second list. In fact, I would like to know which star from the second list is closest to the star in the first list.

I have code that gets the job done, but it uses a FOR loop over each star in the first list and a WHERE command to search for the closest star from the second list. As you might imagine, when my star lists get to be 10,000 or 20,000 in size, this program can take several minutes to run. This isn't putting my research on hold or anything like that, but it is slower than I would like it to be. Can you suggest a way to speed the job up?

ANSWER: This problem is similar to another star problem whose program speed was decreased dramatically with some careful attention to the proper algorithm. In that case, the problem was to discover the sixth closest star to another star in the same list. Two methods were offered. The first was a brute force method that required considerable memory because the entire array had to be loaded into memory at once. The second used a Delaunay triangulation approach to effectively limit the search space. The second method reduced a problem that took days to run into something that ran in about 30 seconds.

JD Smith has this to say about the problem.

The Delaunay triangulation is just a cheeky way to organize points in 2D (and higher dimension, but less efficiently). That algorithm uses the fact that the Delaunay triangulation graph has as a sub-graph the nearest neighbors. Then you can start with your star of interest, and work your way out to nearby stars along the Delaunay triangulation lines, to find the Nth nearest neighbor, by comparing a small number of stars. For matching two lists, this, as you pointed out, is awkward.

As Paolo noted, the array method can be made to work by dividing it into “fits in memory” sized chunks. As also mentioned on the page you read, such a method doesn't necessarily mean you're doing it the most efficient way (just maximizing the brute force throughput). For searching 20,000 stars, however, the segmented brute force approach with arrays will probably work fine. I could do 20000x20000 set of lists in under a minute on my (slowish) machine with 2GB. I suspect if you get just a few minutes for similar sizes with a purely loop solution, your machine is much faster than mine. Here's an implementation. Tune “chunk”, which limits the size of arrays to compare, to optimize speed.
   ; Match stars in one list to another, with brute force array techniques
   n1 = 20000
   x1 = randomu(sd,n1)
   y1 = randomu(sd,n1)

   n2 = n1
   x2 = randomu(sd,n2)
   y2 = randomu(sd,n2)

   t = systime(1)

   ;; Divide the problem into manageable chunks: use [x2,y2] in full
   chunk = 1.e6        ; largest number of elements to check at once
   nchunk = ceil(n1/(float(chunk)/n2)) > 2
   n1piece = ceil(float(n1)/nchunk)

   print, nchunk, ' Chunks of size ', n1piece, 'x', n2

   max_r = .001     ; maximum allowed radius

   mpos = lonarr(n1)
   for i =0L,nchunk-1 do begin 
      low = n1piece*i
      high = (n1piece*(i+1)) < (n1-1)
      cnt = high-low+1
      d = (rebin(x2,n2,cnt,/SAMPLE)- $
         rebin(transpose(x1[low:high]), n2, cnt, /SAMPLE))^2+ $
         (rebin(y2,n2,cnt,/SAMPLE)- $
         rebin(transpose(y1[low:high]), n2, cnt, /SAMPLE))^2
      void = min(d,DIMENSION=1, p)
      mpos[low] = p mod n2
      wh = where(sqrt(d[p]) gt max_r, cnt)
      if cnt gt 0 then mpos[wh] = -1L
   endfor 

   print, systime(1) - t
That works well enough, but is certainly not optimal. It uses the full set of [x2,y2] stars, comparing them against chunks of stars from the list [x1,y1] at a time. All stars on the target list are compared to all stars on the search list.

In all cases like this, the best approach to speed up the calculation is to think to yourself “how can I reduce the number of possible points which must be matched, before I commence the matching.” For closest match in a single set of stars, this led to the Delaunay triangulation method. In this case, you have set a natural scale to the problem, max_r, which will be very useful, allowing you to subdivide and conquer. The argument is as follows. If you bin the search stars into bins of size 2*max_r, the closest point to a given target star [x,y], which is at least as close as max_r in radius, must fall into one of four bins (the bin which [x,y] is in, and the three bins to the upper-left, upper-right, lower-left, or lower-right of it, depending on where it falls in its bin). If there is no star in any of those bins, then there is no star within max_r.

I'll use HIST_ND to bin the search stars into a large grid. Then, instead of searching all points for the closest, I'll only search ones which fell in that bin (conveniently indexed using REVERSE_INDICES), and the relevant three adjacent bins (depending on location within the bin). You can use the same “brute-force” array tricks here within the bin, but of course they are infinitely faster, as you've pre-trimmed out the vast majority of possible matches. Sprinkle in a few more vectorizing HISTOGRAM tricks (in particular the “dual histogram” method (as described in the drizzle discussion), and you get the code below.

With this code, matching 20000x20000 points takes almost no time at all, 0.1 seconds. I can match 1,000,000 vs. 1,000,000 stars in roughly 4.5 seconds, with a strong dependence on the initial binning size (too coarse, and bins will have too many points to fit in memory, too sparse, and you'll have too many empty bins). If your maximum radius is tiny (compared to the maximum distance between stars), it probably pays just to make larger bin sizes, and then weed out the ones which are “too far” post-facto. (I've left that undone -- a simple WHERE will suffice.) If your maximum radius is large, the bin size will be too coarse, and you won't have removed many for a given target search. You'll be searching many tens or hundreds of thousands of stars per bin, and be right back in the same sort of memory trouble you had originally.

I should emphasize that this code does not guarantee that the closest match itself is returned, only making the guarantee that if the closest match is within one-half of the bin size, then it is correctly returned. For this problem, this sets a minimum bin size: two times the max search radius. You can of course go to larger bin sizes (and you may want to if your stars are sprinkled very sparsely over the grid, or you require a very precise match, such that the histogram could grow excessively large). If you go smaller you risk missing the correct star.
   ; Match stars in one list to another, within some tolerance.
   ; Pre-bin into a 2D histogram, and use DUAL HISTOGRAM matching to select

   n1 = 1000000                      ;number of stars
   x1 = randomu(sd,n1)               ;points to find matches near
   y1 = randomu(sd,n1)

   n2 = n1
   x2 = randomu(sd,n2)               ;points to search in
   y2 = randomu(sd,n2)

   t1 = systime(1)

   max_r = .0005                     ;maximum allowed radius for a match
   bs = 2*max_r                      ;this is the smallest binsize allowed
   h = hist_nd([1#x2,1#y2],bs,MIN=0,MAX=1,REVERSE_INDICES=ri)
   bs = bs[0]
   d = size(h,/DIMENSIONS)

   ;; Bin location of X1,Y1 in the X2,Y2 grid
   xoff = x1/bs       &  yoff = y1/bs   
   xbin = floor(xoff) &  ybin = floor(yoff)
   bin = (xbin + d[0]*ybin)<(d[0]*d[1]-1L) ;The bin it's in

   ;; We must search 4 bins worth for closest match, depending on
   ;; location within bin (towards any of the 4 quadrants).
   xoff = 1-2*((xoff-xbin) lt 0.5)        ;add bin left or right
   yoff = 1-2*((yoff-ybin) lt 0.5)        ;add bin down or up

   min_pos = make_array(n1,VALUE=-1L)
   min_dist = fltarr(n1,/NOZERO)

   for i=0,1 do begin ;; Loop over 4 bins in the correct quadrant direction
      for j=0,1 do begin 
         b = 0L>(bin+i*xoff+j*yoff*d[0])<(d[0]*d[1]-1) ;current bins (offset)
         
         ;; Dual HISTOGRAM method, loop by repeat count in bins
         h2 = histogram(h[b],MIN=1,REVERSE_INDICES=ri2)
         
         ;; Process all bins with the same number of repeats >= 1
         for k=0L,n_elements(h2)-1 do begin 
            if h2[k] eq 0 then continue
            these_bins = ri2[ri2[k]:ri2[k+1]-1] ;the points with k+1 repeats in bin
            
            if k eq 0 then begin   ; single point (n)
               these_points = ri[ri[b[these_bins]]]
            endif else begin       ; range over k+1 points, (n x k+1)
               these_points = ri[ri[rebin(b[these_bins],h2[k],k+1,/SAMPLE)]+ $
                               rebin(lindgen(1,k+1),h2[k],k+1,/SAMPLE)]
               these_bins = rebin(temporary(these_bins),h2[k],k+1,/SAMPLE)
            endelse 
            
            dist = (x2[these_points]-x1[these_bins])^2 + $
                 (y2[these_points]-y1[these_bins])^2 

            if k gt 0 then begin            ;multiple point in bin: find closest
               dist = min(dist,DIMENSION=2,p)
               these_points = these_points[p] ;index of closest point in bin
               these_bins = ri2[ri2[k]:ri2[k+1]-1] ;original bin list
            endif 
            
            ;; See if a minimum is already set
            set = where(min_pos[these_bins] ge 0, nset, $
                      COMPLEMENT=unset,NCOMPLEMENT=nunset)
            
            if nset gt 0 then begin 
               ;; Only update those where the new distance is less
               closer = where(dist[set] lt min_dist[these_bins[set]], cnt)
               if cnt gt 0 then begin 
                  set = set[closer]
                  min_pos[these_bins[set]] = these_points[set]
                  min_dist[these_bins[set]] = dist[set]
               endif 
            endif 
            
            if nunset gt 0 then begin ;; Nothing set, closest by default
               min_pos[these_bins[unset]] = these_points[unset]
               min_dist[these_bins[unset]] = dist[unset]
            endif 
         endfor 
      endfor 
   endfor 


   print,systime(1)-t1

Note that JD has now refined this algorithm and made some improvements so that this can be a general matching algorithm. The function is named MATCH_2D and can be obtained from JD's web page..