[Tutorial] Randomization in Calc

Forum rules
No question in this section please
For any question related to a topic, create a new thread in the relevant section.
Locked
User avatar
MrProgrammer
Moderator
Posts: 4883
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

[Tutorial] Randomization in Calc

Post by MrProgrammer »

This tutorial explains several different techniques for randomization in Calc:
  1. Creating uniformly-distributed random numbers between specified minimum and maximum values
  2. Creating uniformly-distributed random integers between specified minimum and maximum values
  3. Creating uniformly-distributed unique random integers between specified minimum and maximum values
  4. Creating normally-distributed random numbers with specified mean and standard deviation
  5. Arranging a set of values in random order
  6. Selecting a random sample of size N (subset) from a set of values
  7. Lot sampling
  8. Create random numbers with specified minimum, maximum, and median
  9. Control of randomization
  10. Gaining control of randomization using Paste Special
  11. Gaining control of randomization using a "seed"
All of these are demonstrated in the attachment. The techniques are related, and sometimes a technique used for one situation can be applied in a different situation. The attachment uses defined names in some formulas, which can be displayed with Insert → Names → Define.

A. Creating uniformly-distributed random numbers between specified minimum and maximum values
The =RAND() formula provides uniformly-distributed random numbers between 0 and 1. Let's call the difference between the minimum and maximum the Range. Then the desired values are the product of RAND() and the Range, increased by the minimum value. All RAND() functions are recalculated when you press F9, when the spreadsheet is initially opened, and (if Tools → Cell Contents → AutoCalculate is checked) when any cell is changed. To preserve the random values so that they are not changed further, select them, Edit → Copy, then Edit → Paste Special → Uncheck Paste All → Uncheck formulas → Check Numbers → OK.

When using Calc's RAND() function you need to be aware of limitations in that generator. Most implementations of Calc use a 32-bit generator, and the random numbers begin to repeat after about 4.3×10⁹ values. This sounds like a lot, but suppose you want to randomly shuffle a deck of cards. There are 52 factorial possible permutations and 52! is about 8.1×10⁶⁷. Thus the random number generator can only select a tiny fraction of the possible arrangements of the deck. If this will be a problem for you, consider using the Mersenne Twister Random Number Generator Add-In, which is a 19937-bit generator and doesn't repeat until more than 10⁶⁰⁰⁰ values have been returned.

B. Creating uniformly-distributed random integers between specified minimum and maximum values
The =RANDBETWEEN(min;max) formula provides uniformly-distributed random integers from min to max, inclusive. The values are independent, so it is possible, if you use RANDBETWEEN() more than once, to get duplicate integers. Using RANDBETWEEN() is equivalent to drawing numbered balls from a hat, replacing the drawn ball before drawing the next one. RANDBETWEEN() is recalculated if you use Recalculate All (⌘⇑F9 on a Mac or Ctrl+Shift+F9 on other platforms). You could also use the technique in A and truncate the values to integers with the INT() function, however with that technique you will get integers that may include the minimum value but will never include the maximum value. There is also a Mersenne Twister random integer generator.

C. Creating N uniformly-distributed unique random integers between specified minimum and maximum values
If you need multiple integers and they must be distinct, technique B is not suitable. However, you can use Floyd's Algorithm (described below, example in the attachment). This is similar to drawing numbered balls from a hat, discarding the drawn ball before drawing the next one. Warning: the integers that are generated by the algorithm are neither in ascending order, nor descending order, nor in random order since larger values tend to occur later in the list. If you need the generated values to be in ascending or descending order, you can sort them. If you need them to be in random order, use Technique C and then use the technique described in E. Technique C can be used even when the Range (difference between minimum and maximum) is larger than the number of rows allowed in a sheet; for example, it can select 12 random unique integers from 76 to a billion. We can't really do that by drawing balls from a hat, unless we have a very large hat and 999999925 balls.

You must not use Technique C when N is equal to the size of the range of the range. Since every number in the range is chosen, the selection is not random, and due to the algorithm, the order isn't random either. However, there are some alternate techniques to create unique random integers which can be used when the Range does not exceed the number of rows in a sheet:
  • If you want the values in ascending order, put the integers from the minimum to the maximum in cells, then use the technique in F.
    If you want the values in random order, put the integers from the minimum to the maximum in cells, sort them in random order (technique E), and select the first N of them.
Hint: To put sequential integers in cells, select these cells then use Edit → Fill → Series → Increment → 1.
Hint: To put random integers into a grid (common for soccer/football pools), first put the integers in a column, use technique E to get them in random order, and then use formulas in the grid to copy them from the column, as shown in the attachment. To generate a new random grid arrangement, select cell 'Soccer Pool'.M1 and use Data > Sort. Another idea for a sports pool is in [Solved] Football Pool.

D. Creating normally-distributed random numbers with specified mean and standard deviation
The =NORMINV() formula will create the desired values. To create normally-distributed random numbers with specified mean and variance, just calculate the standard deviation as the square root of the variance and then use NORMINV().

E. Arranging a set of values in random order
Specifically, let's say the data to arrange randomly is in column A, there's other related data in column B, and column C is unused. A simple technique is put the formula =RAND() in the first cell of column C and fill the formula down, for example by double clicking the fill handle. Then Data → Sort can be used arrange these cells according to the random values in C. After the sort is done, Calc will re-evaluate the =RAND() formulas and put new random values in C, so those values won't stay sorted, but the rows were arranged randomly, as desired. To create another random arrangement, just use Data → Sort again. If column C is no longer needed you can delete it.

F. Selecting a random sample of size N (subset) from a set
An efficient technique is to select values from top to bottom with a probability directly proportional the number of values still needed and inversely proportional to the number of values remaining, as shown in the attachment. One could also use technique E to put the values in random order and then select the first N of them, but this requires two steps and, because a sort is required, it is less efficient than using the formulas in the attachment.

G. Lot sampling
Suppose we want to select N samples from L lots. If N<L, the preceding paragraph is the optimal way to do this, but it may be simpler to select samples at one rate initially, then switch to a different rate for the remaining lots. This isn't as random since the sampling rate isn't uniform and because not all subsets of size N are equally likely, but it may be sufficient for some situations. The Lot Sampling sheet shows how to select the sampling rates, given the number of lots and the desired number of samples. When N<L, samples/lot is less than 1, so we select 1 sample from each group of lots. When N≥L, samples/lot is at least 1, so we select at least one sample per lot.

H. Create random numbers with specified minimum, maximum, and median
Use formula =Min+(Max-Min)*BETAINV(RAND();1/LOG((Max-Min)/(Median-Min);2);1). For example if the minimum is 0, the maximum 1, and the median ⅔, the formula is =BETAINV(RAND();1/LOG(1.5;2);1) which after resolving constants with F9 is =BETAINV(RAND();1.70951129135146;1).

Note that questions are not allowed in the Tutorials section of the forum. Ask them in the Applications > Calc section.
Randomization in Calc.ods
(18.87 KiB) Downloaded 2096 times


The section in grey is intended for programmers. It is not necessary to understand how the algorithm for technique C works if you just want to use the example in the attachment.
A Sample Of Brilliance - Floyd's Algorithm description
Say we have N cells to fill with unique integers from IMin through IMax. The algorithm generates a random integer for each cell, however not every integer is allowed in every cell. Each cell has a Limit, its largest allowed integer, and any integer from IMin to its Limit is allowed. The Limit for the last cell is IMax, the Limit for the penultimate cell is IMax-1, then IMax-2 for the preceding cell, etc. We start by selecting a random integer in the first cell from IMin to its Limit, which is IMax-N+1. We select a random candidate for the second cell (IMin to its Limit, IMax-N+2). If the candidate is different than the first cell, it is selected, otherwise the cell is set to the Limit. Since the Limit for the second cell is larger than the Limit for the first cell we are guaranteed that the two cells will be different. A candidate is chosen for the third cell and compared with the first two cells. If different, it's chosen, otherwise the third cell's Limit is chosen. By this process, each cell's integer will be different from the previous cells.

Although it is not obvious, each possible combination of N integers is equally probable. It might appear that the integer IMax is being discriminated against, since it can only appear in the last cell. However, IMax is selected there if it's chosen by the random number generator or if a previously selected integer was chosen. Thus IMax is favored for selection in that cell. The favoritism effect exactly balances the discrimination in the other cells and IMax is as likely to be chosen as any other integer. IMax-1 can only appear in the last two cells, but it is favored for selection in the penultimate cell. It can be proved that (A) the probability of selection for any integer is the same and (B) the probability of selection for any combination of integers is the same.
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.6.3, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).
User avatar
MrProgrammer
Moderator
Posts: 4883
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

More: [Tutorial] Randomization in Calc

Post by MrProgrammer »

I. Control of randomization
Most Calc functions depend directly on their operands, thus they retain their value until the operands are changed. For example, suppose we have a number in D5. As you work with the spreadsheet, the formula =SQRT(D5) is never recalculated until a new number is entered into cell D5. Formula =RAND() is different because it has no operands but still changes value as the spreadsheet is used. In OpenOffice, Calc formula =RANDBETWEEN(low;high) is also different. Its value is recalculated when the operands are changed. But the value also changes when ⇧⌘F9 is pressed on a Mac, when Ctrl+Shift+F9 is pressed on other platforms, or whenever the spreadsheet is opened. You do not control when RAND and RANDBETWEEN update their values; Calc controls that.

J. Gaining control of randomization using Paste Special
To allow you to control when new random numbers are used in formulas you can use two columns as shown in sheet PasteSpecial of the attachment.
• Select the numbers in column A
• Edit → Copy
• Select cell B2
• Edit → Paste Special
   → Selection: Uncheck everything → Check Numbers
   → Options: Uncheck everything
   → Operations: None
   → Shift Cells: Don't Shift
   → OK
Paste Special updates the yellow cells, changing the formulas from column A into static values in column B. You can record a "PasteValues" macro to perform the Paste Special. Or, you could record a macro to perform all of the steps. Potentially, you could use a single column with a recorded macro which fills its cells with =RAND() and then uses Paste Special to convert those formulas to numbers. In any case, the random dice rolls in column C do not change until you perform the Paste Special operation.

K. Gaining control of randomization using a "seed"
Many random number generators allow one to specify a starting value for the generator, called a "seed". Although Calc functions RAND and RANDBETWEEN do not offer that option, you can create your own seeded generator as shown on sheet L_Ecuyer of the attachment. Specify two seed values in cells A2 and B2. This will generate a set of random numbers between 0 and 1 in column C, as if you'd used =RAND() there. But they, and the random dice rolls in column D, do not change until you specify new seed values. The constants in cells E2, F2, G2, and H2 are optimized to produce high-quality random numbers so do not change them. Only change the two seed values.
Randomization in Calc1.ods
Sheets PasteSpecial and L_Ecuyer
(29.54 KiB) Downloaded 248 times
Note that questions are not allowed in the Tutorials section of the forum. Ask them in the Applications > Calc section.
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.6.3, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).
Locked