[Tutorial] Randomization in Calc

Forum rules
No question in this forum please
For any question related to a topic, create a new thread in the relevant section.

[Tutorial] Randomization in Calc

Postby MrProgrammer » Thu Jan 10, 2013 12:53 am

This tutorial explains several different techniques for randomization in Calc:
  1. Creating uniformly-distributed random numbers between specified minimum and maximum values
  2. Creating uniformly-distributed random integers between specified minimum and maximum values
  3. Creating uniformly-distributed unique random integers between specified minimum and maximum values
  4. Creating normally-distributed random numbers with specified mean and standard deviation
  5. Arranging a set of values in random order
  6. Selecting a random sample of size N (subset) from a set of values
  7. Lot sampling
All of these are demonstrated in the attachment. The techniques are related, and sometimes a technique used for one situation can be applied in a different situation. The attachment uses defined names in some formulas, which can be displayed with Insert → Names → Define.

A. Creating uniformly-distributed random numbers between specified minimum and maximum values
The =RAND() formula provides uniformly-distributed random numbers between 0 and 1. Let's call the difference between the minimum and maximum the Range. Then the desired values are the product of RAND() and the Range, increased by the minimum value. All RAND() functions are recalculated when you press F9, when the spreadsheet is initially opened, and (if Tools → Cell Contents → AutoCalculate is checked) when any cell is changed. To preserve the random values so that they are not changed further, select them, Edit → Copy, then Edit → Paste Special → Uncheck Paste All → Uncheck formulas → Check Numbers → OK.

When using Calc's RAND() function you need to be aware of limitations in that generator. Most implementations of Calc use a 32-bit generator, and the random numbers begin to repeat after about 4.3×10⁹ values. This sounds like a lot, but suppose you want to randomly shuffle a deck of cards. There are 52 factorial possible permutations and 52! is about 8.1×10⁶⁷. Thus the random number generator can only select a tiny fraction of the possible arrangements of the deck. If this will be a problem for you, consider using the Mersenne Twister Random Number Generator Add-In, which is a 19937-bit generator and doesn't repeat until more than 10⁶⁰⁰⁰ values have been returned.

B. Creating uniformly-distributed random integers between specified minimum and maximum values
The =RANDBETWEEN(min;max) formula provides uniformly-distributed random integers from min to max, inclusive. The values are independent, so it is possible, if you use RANDBETWEEN() more than once, to get duplicate integers. Using RANDBETWEEN() is equivalent to drawing numbered balls from a hat, replacing the drawn ball before drawing the next one. RANDBETWEEN() is recalculated if you use Recalculate All (⌘⇑F9 on a Mac or Ctrl+Shift+F9 on other platforms). You could also use the technique in A and truncate the values to integers with the INT() function, however with that technique you will get integers that may include the minimum value but will never include the maximum value. There is also a Mersenne Twister random integer generator.

C. Creating N uniformly-distributed unique random integers between specified minimum and maximum values
If you need multiple integers and they must be distinct, technique B is not suitable. However, you can use Floyd's Algorithm (described below, example in the attachment). This is similar to drawing numbered balls from a hat, discarding the drawn ball before drawing the next one. Warning: the integers that are generated by the algorithm are neither in ascending order, nor descending order, nor in random order since larger values tend to occur later in the list. If you need the generated values to be in ascending or descending order, you can sort them. If you need them to be in random order, use Technique C and then use the technique described in E. Technique C can be used even when the Range (difference between minimum and maximum) is larger then the number of rows allowed in a sheet; for example, it can select 12 random unique integers from 76 to a billion. We can't really do that by drawing balls from a hat, unless we have a very large hat and 999999925 balls.

There are some alternate techniques that can be used only when the Range does not exceed the number of rows in a sheet:
    If you want the values in ascending order, put the integers from the minimum to the maximum in cells, then use the technique in F.
    If you want the values in random order, put the integers from the minimum to the maximum in cells, sort them in random order (technique E), and select the first N of them.
Hint: To put sequential integers in cells, select these cells then use Edit → Fill → Series → Increment → 1.
Hint: To put random integers into a grid (common for soccer/football pools), first put the integers in a column, use technique E to get them in random order, and then use formulas in the grid to copy them from the column, as shown in the attachment. To generate a new random grid arrangement, select cell 'Soccer Pool'.M1 and use Data > Sort.

D. Creating normally-distributed random numbers with specified mean and standard deviation
The =NORMINV() formula will create the desired values. To create normally-distributed random numbers with specified mean and variance, just calculate the standard deviation as the square root of the variance and then use NORMINV().

E. Arranging a set of values in random order
Specifically, let's say the data to arrange randomly is in column A, there's other related data in column B, and column C is unused. A simple technique is put the formula =RAND() in the first cell of column C and fill the formula down, for example by double clicking the fill handle. Then Data → Sort can be used arrange these cells according to the random values in C. After the sort is done, Calc will re-evaluate the =RAND() formulas and put new random values in C, so those values won't stay sorted, but the rows were arranged randomly, as desired. To create another random arrangement, just use Data → Sort again. If column C is no longer needed you can delete it.

F. Selecting a random sample of size N (subset) from a set
An efficient technique is to select values from top to bottom with a probability directly proportional the number of values still needed and inversely proportional to the number of values remaining, as shown in the attachment. One could also use technique E to put the values in random order and then select the first N of them, but this requires two steps and, because a sort is required, it is less efficient than using the formulas in the attachment.

G. Lot sampling
Suppose we want to select N samples from L lots. If N<L, the preceeding paragraph is the optimal way to do this, but it may be simpler to select samples at one rate initially, then switch to a different rate for the remaining lots. This isn't as random since the sampling rate isn't uniform and because not all subsets of size N are equally likely, but it may be sufficient for some sitiations. The Lot Sampling sheet shows how to select the sampling rates, given the number of lots and the desired number of samples. When N<L, samples/lot is less than 1, so we select 1 sample from each group of lots. When N≥L, samples/lot is at least 1, so we select at least one sample per lot.

Note that questions are not allowed in the Tutorials section of the forum. Ask them in the Applications > Calc section.

Randomization in Calc.ods
(18.87 KiB) Downloaded 933 times

The section in grey is intended for programmers. It is not necessary to understand how the algorithm for technique C works if you just want to use the example in the attachment.
Floyd's Algorithm description
Say we have N cells to fill with unique integers from IMin through IMax. The algorithm generates a random integer for each cell, however not every integer is allowed in every cell. Each cell has a Limit, its largest allowed integer, and any integer from IMin to its Limit is allowed. The Limit for the last cell is IMax, the Limit for the penultimate cell is IMax-1, then IMax-2 for the preceding cell, etc. We start by selecting a random integer in the first cell from IMin to its Limit, which is IMax-N+1. We select a random candidate for the second cell (IMin to its Limit, IMax-N+2). If the candidate is different than the first cell, it is selected, otherwise the cell is set to the Limit. Since the Limit for the second cell is larger than the Limit for the first cell we are guaranteed that the two cells will be different. A candidate is chosen for the third cell and compared with the first two cells. If different, it's chosen, otherwise the third cell's Limit is chosen. By this process, each cell's integer will be different from the previous cells.

Although it is not obvious, each possible combination of N integers is equally probable. It might appear that the integer IMax is being discriminated against, since it can only appear in the last cell. However, IMax is selected there if it's chosen by the random number generator or if a previously selected integer was chosen. Thus IMax is favored for selection in that cell. The favoritism effect exactly balances the discrimination in the other cells and IMax is as likely to be chosen as any other integer. IMax-1 can only appear in the last two cells, but it is favored for selection in the penultimate cell. It can be proved that (A) the probability of selection for any integer is the same and (B) the probability of selection for any combination of integers is the same.
Mr. Programmer
OpenOffice 3.2.0 Build 9483 on Mac OS X 10.9.5.   The locale for any menus or Calc formulas in my posts is English (USA).
User avatar
MrProgrammer
Volunteer
 
Posts: 3160
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

Return to Calc

Who is online

Users browsing this forum: No registered users and 1 guest