Obtaining Random Indices

QUESTION: I have an array with two columns and a thousand rows. Each row has a latitude and longitude value. For my analysis, I need to randomly select 100 latitude/longitude pairs. What is the best way to do this in IDL?

ANSWER: The obvious answer to this question is to simply use RandomU to select 100 random values between 0 and 999:

   indices = Round( RandomU(seed, 100) * 999 )

But it was pointed out in the IDL newsgroup that this doesn't guarantee unique indices. In other words, it is possible, using this method, to return the same index multiple times. To guarantee only unique indices are returned requires a different technique.

Ken Bowman suggested a simple technique in which the random numbers are first sorted, and then the first 100 are selected, which does result in unique index numbers:

   randomNumbers = RANDOMU(seed, 1000)            
   sortedRandomNumbers = SORT(r)     
   indices = sortedRandomNumbers[0:99]     

This works well, but JD Smith points out it can be terribly inefficient if you wanted, say 10 random elements in an array of 100,000. You would have to make and sort 100,000 values to get the 10 you needed. He writes:

There are all sorts of iterative higher-order algorithms for selection without replacement, but they don't match to IDL well. One simple trick would be to start by generating M random numbers, check for duplicates, and generate M-n more, accumulating until you have enough.

   M = 10
   len = 100000L
   inds = LonArr(n, /NOZERO)
   n = M
   WHILE n GT 0 DO BEGIN 
       inds[M-n] = Long(RandomU(seed, n)*len)
       inds = inds[Sort(inds)]
       u = Uniq(inds)
       n = M - N_Elements(u)
       inds[0] = inds[u]
   ENDWHILE

For this case, the speedup is immense. On average this method is about 3500 times faster than sorting. What about a case with more duplicates likely? How about len=100000, with M=25000?

   Sort All randoms:             0.13349121
   Brute force replacement:      0.091892505

The method is still about 1.5 times faster.

Obviously, if you wanted len-1 random indices, this won't scale, but in that case, you could just invert the problem, choose the random indices to be discarded, and use Histogram to generate the "real" list. Here's a general function which does this for you.

   FUNCITON Random_Indices, len, n_in
     swap = n_in gt len/2
     IF swap THEN n = len-n_in ELSE n = n_in
     inds = LonArr(n, /NOZERO)
     M = n
     WHILE n GT 0 DO BEGIN
        inds[M-n] = Long( RandomU(seed, n)*len )
        inds = inds[Sort(inds)]
        u = Uniq(inds)
        n = M-n_elements(u)
        inds[0] = inds[u]
     ENDWHILE
     
     IF swap THEN inds = Where(Histogram(inds,MIN=0,MAX=len-1) EQ 0)
     RETURN, inds
   end

It is outperformed by the simple sort method only when M is close to len/2.

   r=randomu(sd,len)
   inds=(sort(r))[0:M-1]

For example, I found that selecting from length 100000 fewer than 30000 or more than 70000 elements favored Random_Indices. At worst case (M=len/2), it's roughly 3 times slower. The Random_Indices method also returns the indices in sorted order (which you may or may not care about). You could obviously also make a hybrid approach which switches from one method to the other for

   abs(M-len/2) lt len/5

or so, but the tuning would be machine-specific.

This algorithm has been implemented in a slightly more robust way in the Coyote Library program, cgRandomIndices.

Google
 
Web Coyote's Guide to IDL Programming