[Tutorial] Sorting and Filtering data with formulas

Forum rules
No question in this section please
For any question related to a topic, create a new thread in the relevant section.
Locked
User avatar
MrProgrammer
Moderator
Posts: 4895
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

[Tutorial] Sorting and Filtering data with formulas

Post by MrProgrammer »

Sorting data
The easiest way to sort data in Calc is to use Data, Sort from the menu. However this does not keep the data sorted if changes are made to it after sorting. It is possible to construct formulas which will maintain a copy of the data in sorted order. The first sheet in the attached file shows how to do that. It provides for sorting up to 20 numbers, of which 16 are currently in use. The capacity of the sheet can be expanded beyond 20 by adjusting the text $21, which occurs in some of the formulas, to a larger number using the Edit, Find & Replace dialog and then filling the formulas down to unused rows with the fill handle. See OOo Help if you don't know about the fill handle.

There are some disadvantages to this technique, however:
  • The formulas are complicated to build, and doing so may take more of your time than simply using Data, Sort even if you have to do that dozens of times.
  • You must make a copy of the data because formulas cannot sort "in place", so the size of your spreadsheet will grow.
  • For large amounts of data, the formulas will be slow because they are inefficient. Data, Sort performs just one sort. But for each row the formula sorts all 20 items, and then discards 19 of them, as explained below. Thus for 20 rows, 20 individual sorts are performed. On current hardware you won't notice a delay for 20 items, but you probably will for 20,000 items, which implies 20,000 sorts of 20,000 items.
Here's how the formulas in the first sheet work. Cell B2 uses COUNTA to determine how many of the 20 cells in the first column are in use. Column C is an ascending series which numbers the items. The formula in column D checks if the item in column A is empty. If so, column D is blank, too. Otherwise OFFSET selects the 16 used items, then SMALL sorts them and picks the smallest for the first value in D. SMALL(…;1) is the same as MIN(…). In the cell below, SMALL sorts the same values again and picks the second smallest, SMALL(…;2). Cell D17 holds the largest of the 16 items, SMALL(…;16). In column E, using LARGE produces the items in reverse order.

Filtering data
The easiest way to filter data in Calc is to use Data, Filter from the menu. However this also needs to be reapplied if changes are made after filtering. It is possible to construct formulas which will maintain a copy of the data after performing filtering. The second sheet in the attached file shows how to do that. It provides for filtering up to 20 numbers, of which 16 are currently in use. The capacity of the sheet can be expanded beyond 20 by adjusting the text $21, which occurs in some of the formulas, to a larger number using the Edit, Find & Replace dialog and then filling the formulas down to unused rows. The disadvantages of the filtering formulas are similar to those for the sorting formulas, though inefficiency is less of a concern.

Here's how the formulas in the second sheet work. Cell B2 uses COUNTA to determine how many of the 20 cells in the first column are in use. Column C is the formula which is used to determine if the item should be included by the filter. My example tests to see if the item is an odd number, but it can be any test that returns a TRUE/FALSE value. Column D locates the positions of the successive TRUE values. We'll look at cell D5. D4=4 has previously determined that the third TRUE item is the 4th one (for 61). The OFFSET function in D5 selects the last twelve (16-4) items in C. These values (in C6 through C17) correspond to the data after 61, that is 908 through 703. In Calc, formulas which evaluate to TRUE have the value 1. The MATCH function looks for the first 1 value in the last twelve items. In this case it is the third one, cell C8. Thus MATCH(…) is 3 and when added to D4=4 the formula determines that the position of the next TRUE value is the seventh item (C8). The formulas in D12 through D21 evaluate to an error value because MATCH cannot find more TRUE values. In column E we test to see if an error occurred in column D and if so supply an empty string as the value. Otherwise INDEX selects the position determined by column D, so E5 picks the seventh (D5=7) item in column A (897).

Filtering and sorting data
Finally, one can combine these techniques to obtain a filtered, sorted list. The third sheet in the attached file shows how to do that. Note that due to the use of COUNTA in B2 of the first sheet, the sorting formulas there will not work properly if any of the cells in A2 thorough A17 are empty. If your list contains empty cells and you want formulas which will created a dynamically sorted list, use the technique on sheet 3 and NOT(ISEMPTY(…)) as your test in column C.

Since this is intended as a tutorial, the sheets in the attachment use several auxiliary columns to hold intermediate calculations. These make it easier to understand how the formulas work. Depending on your requirements and your skill level, you may want to combine the formulas so that fewer auxiliary columns are needed. The attachment does not use any Array formulas, but that technique can also be used to sort/filter with fewer (perhaps even zero) auxiliary columns at the expense of longer and more complicated formulas. Someone may want to post an advanced tutorial for an array function solution.

There are no doubt many alternate ways to write these formulas. You could use OFFSET where I've used INDEX, though different parameters would be needed. So if you're wondering why these methods were chosen, it was because I thought that they were simple to understand for a tutorial, and because I liked them. (De gustibus non disputandum est.) If you prefer, for example, another method for locating and selecting the TRUE values in column C of the second sheet, definitely go ahead and use it instead. But this one has been tested and works. The colors in the second and third sheets are produced by conditional formatting and are only meant to help you quickly identify the odd values. The sheets will work if all the conditional formatting is removed (with Format, Default formatting).

I would advise beginners to avoid all of these techniques, unless you fully understand how they work, and just use Data, Sort and/or Data, Filter.

Note that questions are not allowed in the Tutorials section of the forum. Ask them in the Applications > Calc section.
Attachments
OOo Sorting and Filtering data with formulas.ods
Sorting and Filtering data with formulas.ods
(12.35 KiB) Downloaded 6363 times
Last edited by MrProgrammer on Sun Apr 07, 2013 6:12 pm, edited 1 time in total.
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.6.3, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).
User avatar
MrProgrammer
Moderator
Posts: 4895
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

Re: [Tutorial] Sorting and Filtering data with formulas

Post by MrProgrammer »

Another method for sorting with formulas is discussed in [Python] Array Sort add-in function. This should be much faster than techniques that use the SMALL() or LARGE() functions.
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.6.3, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).
User avatar
MrProgrammer
Moderator
Posts: 4895
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

Re: [Tutorial] Sorting and Filtering data with formulas

Post by MrProgrammer »

One difficulty with my original (February 2011) sorting method was the use of the LARGE() and SMALL() functions since those functions could only process numeric values; that method couldn't sort text strings. The revised formulas in today's attachment now use the SUMPRODUCT() function and can sort either numbers or text strings. If you don't understand that distinction, read section 1. Types of data in Ten concepts that every Calc user should know. My guess is that the SUMPRODUCT() technique will be more efficient than using LARGE() and SMALL(), but perhaps not as fast as Charlie Young's Array Sort add-in function (link provided in this tutorial's previous post). Note however that the add-in will only work with Calc, not with GNUmeric, Excel, or other spreadsheet programs. The SUMPRODUCT() technique (or LARGE/SMALL) should work in other spreadsheet programs.

The new attachment includes two additional sheets showing how to perform sorts on multiple keys, four keys in the examples. Sheet MultiKey Concat uses the simple method of concatenating the four keys to create a new key for sorting. If you use this method, be sure the keys in a column are padded to equal lengths before concatenation. The last sheet, MultiKey SubRank, uses a different method to perform the sort without padding/concatenating the keys; it may be harder to understand.

Some of the formulas in the attachment now use defined names, created by Insert > Names > Define. Use that menu path to view the definitions for names Data, Ranks, Subset, and SubRanks. Many of the formulas in the sheets depend on using absolute and relative cell references correctly as described in section 8. Using formulas and cell references of the Ten Concepts tutorial.

Sorting data
Here's how the revised formulas in the first sheet work. Cell B2 uses COUNTA to determine how many of the 20 cells in the first column are in use. Column C calculates the ranks of the items, 1=smallest to 16=largest. The formula in column D checks if the item in column A is empty. If so, column D is blank, too. Otherwise, MATCH in D2 looks for a 1 in the ranks, then INDEX selects the data in the row where 1 was found. In D3, INDEX selects the data in the row where was 2 was found, etc. In E2, INDEX selects the data where 16 was found. In E3, INDEX selects the data where 15 was found, etc.

Filtering data
The formulas in the second attachment are unchanged from the first attachment, except that column E uses a defined name, Data.

Filtering and sorting data
Column G uses SUMPRODUCT() to calculate the ranks of the subset (the odd numbers). Columns H and I use MATCH() and the ranks instead of LARGE and SMALL.

Multi-key sorting using key concatenation
Columns A though D are the keys. Columns E and F are related data to be sorted with the keys. Column G concatenates the keys, which in this simple example don't need any padding. Column H uses SUMPRODUCT to determine the ranks of the keys. Column I uses MATCH to determine how the rows should be ordered. Columns J though O use INDEX to build a sorted list. Some of the rows, highlighted in yellow, orange, and blue contain duplicate sets of keys. The sorted list preserves the order, top to bottom, of the data for the duplicate keys. Computer Scientists refer to this as a "stable" sort. Cell coloring is done through Conditional Formatting and is unrelated to the sorting technique. It's only purpose is to help you locate duplicate keys more easily.

Multi-key sorting using sub-ranking
Columns A though D are the keys. Columns E and F are related data to be sorted with the keys. Column G determines the ranks of the keys in column A only. Column H determines the ranks of the keys in column B if their keys in column A were identical. Column I determines the ranks of the keys in column C if their keys in columns A and B were identical. Column J determines the ranks of the keys in column D if their keys in column A through C were identical. Column K determines the sequence of rows where all keys are identical so the sort will be stable. Column L determines the ranks of the four keys; it's the sum of columns G through K. Column M uses MATCH to determine how the rows should be ordered. Columns N though S use INDEX to build a sorted list.

Note that questions are not allowed in the Tutorials section of the forum. Ask them in the Applications > Calc section.
Attachments
Sorting and Filtering data with formulas1.ods
(20.72 KiB) Downloaded 3265 times
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.6.3, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).
Locked