[Tutorial] Sorting and Filtering data with formulas

Forum rules
No question in this forum please
For any question related to a topic, create a new thread in the relevant section.

[Tutorial] Sorting and Filtering data with formulas

Postby MrProgrammer » Thu Feb 03, 2011 6:21 pm

Sorting data
The easiest way to sort data in Calc is to use Data, Sort from the menu. However this does not keep the data sorted if changes are made to it after sorting. It is possible to construct formulas which will maintain a copy of the data in sorted order. The first sheet in the attached file shows how to do that. It provides for sorting up to 20 numbers, of which 16 are currently in use. The capacity of the sheet can be expanded beyond 20 by adjusting the text $21, which occurs in some of the formulas, to a larger number using the Edit, Find & Replace dialog and then filling the formulas down to unused rows with the fill handle. See OOo Help if you don't know about the fill handle.

There are some disadvantages to this technique, however:
  • The formulas are complicated to build, and doing so may take more of your time than simply using Data, Sort even if you have to do that dozens of times.
  • You must make a copy of the data because formulas cannot sort "in place", so the size of your spreadsheet will grow.
  • For large amounts of data, the formulas will be slow because they are inefficient. Data, Sort performs just one sort. But for each row the formula sorts all 20 items, and then discards 19 of them, as explained below. Thus for 20 rows, 20 individual sorts are performed. On current hardware you won't notice a delay for 20 items, but you probably will for 20,000 items, which implies 20,000 sorts of 20,000 items.
Here's how the formulas in the first sheet work. Cell B2 uses COUNTA to determine how many of the 20 cells in the first column are in use. Column C is an ascending series which numbers the items. The formula in column D checks if the item in column A is empty. If so, column D is blank, too. Otherwise OFFSET selects the 16 used items, then SMALL sorts them and picks the smallest for the first value in D. SMALL(…;1) is the same as MIN(…). In the cell below, SMALL sorts the same values again and picks the second smallest, SMALL(…;2). Cell D17 holds the largest of the 16 items, SMALL(…;16). In column E, using LARGE produces the items in reverse order.

Filtering data
The easiest way to filter data in Calc is to use Data, Filter from the menu. However this also needs to be reapplied if changes are made after filtering. It is possible to construct formulas which will maintain a copy of the data after performing filtering. The second sheet in the attached file shows how to do that. It provides for filtering up to 20 numbers, of which 16 are currently in use. The capacity of the sheet can be expanded beyond 20 by adjusting the text $21, which occurs in some of the formulas, to a larger number using the Edit, Find & Replace dialog and then filling the formulas down to unused rows. The disadvantages of the filtering formulas are similar to those for the sorting formulas, though inefficiency is less of a concern.

Here's how the formulas in the second sheet work. Cell B2 uses COUNTA to determine how many of the 20 cells in the first column are in use. Column C is the formula which is used to determine if the item should be included by the filter. My example tests to see if the item is an odd number, but it can be any test that returns a TRUE/FALSE value. Column D locates the positions of the successive TRUE values. We'll look at cell D5. D4=4 has previously determined that the third TRUE item is the 4th one (for 61). The OFFSET function in D5 selects the last twelve (16-4) items in C. These values (in C6 through C17) correspond to the data after 61, that is 908 through 703. In Calc, formulas which evaluate to TRUE have the value 1. The MATCH function looks for the first 1 value in the last twelve items. In this case it is the third one, cell C8. Thus MATCH(…) is 3 and when added to D4=4 the formula determines that the position of the next TRUE value is the seventh item (C8). The formulas in D12 through D21 evaluate to an error value because MATCH cannot find more TRUE values. In column E we test to see if an error occurred in column D and if so supply an empty string as the value. Otherwise INDEX selects the position determined by column D, so E5 picks the seventh (D5=7) item in column A (897).

Filtering and sorting data
Finally, one can combine these techniques to obtain a filtered, sorted list. The third sheet in the attached file shows how to do that. Note that due to the use of COUNTA in B2 of the first sheet, the sorting formulas there will not work properly if any of the cells in A2 thorough A17 are empty. If your list contains empty cells and you want formulas which will created a dynamically sorted list, use the technique on sheet 3 and NOT(ISEMPTY(…)) as your test in column C.

Since this is intended as a tutorial, the sheets in the attachment use several auxiliary columns to hold intermediate calculations. These make it easier to understand how the formulas work. Depending on your requirements and your skill level, you may want to combine the formulas so that fewer auxiliary columns are needed. The attachment does not use any Array formulas, but that technique can also be used to sort/filter with fewer (perhaps even zero) auxiliary columns at the expense of longer and more complicated formulas. Someone may want to post an advanced tutorial for an array function solution.

There are no doubt many alternate ways to write these formulas. You could use OFFSET where I've used INDEX, though different parameters would be needed. So if you're wondering why these methods were chosen, it was because I thought that they were simple to understand for a tutorial, and because I liked them. (De gustibus non disputandum est.) If you prefer, for example, another method for locating and selecting the TRUE values in column C of the second sheet, definitely go ahead and use it instead. But this one has been tested and works. The colors in the second and third sheets are produced by conditional formatting and are only meant to help you quickly identify the odd values. The sheets will work if all the conditional formatting is removed (with Format, Default formatting).

I would advise beginners to avoid all of these techniques, unless you fully understand how they work, and just use Data, Sort and/or Data, Filter.

Note that questions are not allowed in the Tutorials section of the forum. Ask them in the Applications > Calc section.
Attachments
OOo Sorting and Filtering data with formulas.ods
Sorting and Filtering data with formulas.ods
(12.35 KiB) Downloaded 4089 times
Last edited by MrProgrammer on Sun Apr 07, 2013 6:12 pm, edited 1 time in total.
Mr. Programmer
OpenOffice 3.2.0 Build 9483 on Mac OS X 10.9.5.   The locale for any menus or Calc formulas in my posts is English (USA).
User avatar
MrProgrammer
Volunteer
 
Posts: 3161
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

Re: [Tutorial] Sorting and Filtering data with formulas

Postby Iritscen » Sun Dec 18, 2011 11:14 pm

Many thanks for this tutorial, MrProgrammer; I needed a way to sort a list of numbers automatically, and couldn't put the pieces together mentally until I read your post. As you know and probably chose to omit for readability, the sorting can be performed in just one column instead of three.

Concise Sorting (contiguous)
In the sample sheet's sorting tab, one can replace the formula in D2 with:
=IF(ISBLANK(A2);"";SMALL(OFFSET($A$2:$A$21;0;0;COUNT(A$2:A$21);1);CELL("row";D2)-1))
Then one can copy the formula down column D to get the same sorted results, and do away with columns B and C. MrProgrammer's way of presenting the solution is obviously more readable. I just thought that not everyone would know the "using CELL() to provide incremental numbers" trick and might find it helpful when trying to save columns in their spreadsheet.

Non-Contiguous Data
However, some users may need to sort a column with some blank spaces between numbers (which was my situation). In this case, the whole formula unfortunately unravels, dropping numbers from the sort results and producing errors. The reason is that the use of the OFFSET() and ISBLANK() functions is based on the assumption that the data is contiguous.

One alternative approach is to change the counting formula in B2 to:
=ROW(INDEX(A1:A65535;MAX(IF(ISBLANK(A1:A65535);"";ROW(A1:A65535))))) (be sure to enter this as an array formula)
The 65535 is just an example of a max number. You might know that your data can only be 5000 rows long and use that number instead, or you might want to play it safe and go with OOo's maximum row number of 1048576 — but that takes a few seconds to calculate. Note that this is no longer a counting algorithm, which is useless for non-contiguous data, but rather provides the last used row in the range A1:A65535.

Then change the formula in D2 to:
=IF(ISERR(SMALL(INDIRECT("A2:A"&B$2);CELL("row";D2)-1));"";SMALL(INDIRECT("A2:A"&B$2);CELL("row";D2)-1))
Then copy that formula down the column. One could always replace B$2 with the formula in B2 to require no additional columns, but the formula is already quite long and B2 formula's is pretty CPU-intensive so you don't want to perform it multiple times. Unfortunately the use of IF(ISERR()) to screen out #VALUE! errors requires the SMALL() function to be called twice in each cell, unless someone knows a way to slim things down. It still runs pretty fast, as far as I can tell.

Ultimately, some people might prefer to just make a macro to do all this, but it's still doable with formulas, so that's what I'm sticking with for my own needs!

Edit: Corrected B2 counting formula, I forgot that I added ROW() to the beginning of it. Otherwise it returns the last value, not the last value's row.
Iritscen
 
Posts: 4
Joined: Wed Nov 26, 2008 3:03 am

Re: [Tutorial] Sorting and Filtering data with formulas

Postby MrProgrammer » Mon Oct 29, 2012 6:53 pm

Another method for sorting with formulas is discussed in [Python] Array Sort add-in function. This should be much faster than techniques that use the SMALL() or LARGE() functions.
Mr. Programmer
OpenOffice 3.2.0 Build 9483 on Mac OS X 10.9.5.   The locale for any menus or Calc formulas in my posts is English (USA).
User avatar
MrProgrammer
Volunteer
 
Posts: 3161
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

Re: [Tutorial] Sorting and Filtering data with formulas

Postby MrProgrammer » Fri Feb 21, 2014 8:02 pm

One difficulty with my original (February 2011) sorting method was the use of the LARGE() and SMALL() functions since those functions could only process numeric values; that method couldn't sort text strings. The revised formulas in today's attachment now use the SUMPRODUCT() function and can sort either numbers or text strings. If you don't understand that distinction, read section 1. Types of data in Ten concepts that every Calc user should know. My guess is that the SUMPRODUCT() technique will be more efficient than using LARGE() and SMALL(), but perhaps not as fast as Charlie Young's Array Sort add-in function (link provided in this tutorial's previous post). Note however that the add-in will only work with Calc, not with GNUmeric, Excel, or other spreadsheet programs. The SUMPRODUCT() technique (or LARGE/SMALL) should work in other spreadsheet programs.

The new attachment includes two additional sheets showing how to perform sorts on multiple keys, four keys in the examples. Sheet MultiKey Concat uses the simple method of concatenating the four keys to create a new key for sorting. If you use this method, be sure the keys in a column are padded to equal lengths before concatenation. The last sheet, MultiKey SubRank, uses a different method to perform the sort without padding/concatenating the keys; it may be harder to understand.

Some of the formulas in the attachment now use defined names, created by Insert > Names > Define. Use that menu path to view the definitions for names Data, Ranks, Subset, and SubRanks. Many of the formulas in the sheets depend on using absolute and relative cell references correctly as described in section 8. Using formulas and cell references of the Ten Concepts tutorial.

Sorting data
Here's how the revised formulas in the first sheet work. Cell B2 uses COUNTA to determine how many of the 20 cells in the first column are in use. Column C calculates the ranks of the items, 1=smallest to 16=largest. The formula in column D checks if the item in column A is empty. If so, column D is blank, too. Otherwise, MATCH in D2 looks for a 1 in the ranks, then INDEX selects the data in the row where 1 was found. In D3, INDEX selects the data in the row where was 2 was found, etc. In E2, INDEX selects the data where 16 was found. In E3, INDEX selects the data where 15 was found, etc.

Filtering data
The formulas in the second attachment are unchanged from the first attachment, except that column E uses a defined name, Data.

Filtering and sorting data
Column G uses SUMPRODUCT() to calculate the ranks of the subset (the odd numbers). Columns H and I use MATCH() and the ranks instead of LARGE and SMALL.

Multi-key sorting using key concatenation
Columns A though D are the keys. Columns E and F are related data to be sorted with the keys. Column G concatenates the keys, which in this simple example don't need any padding. Column H uses SUMPRODUCT to determine the ranks of the keys. Column I uses MATCH to determine how the rows should be ordered. Columns J though O use INDEX to build a sorted list. Some of the rows, highlighted in yellow, orange, and blue contain duplicate sets of keys. The sorted list preserves the order, top to bottom, of the data for the duplicate keys. Computer Scientists refer to this as a "stable" sort. Cell coloring is done through Conditional Formatting and is unrelated to the sorting technique. It's only purpose is to help you locate duplicate keys more easily.

Multi-key sorting using sub-ranking
Columns A though D are the keys. Columns E and F are related data to be sorted with the keys. Column G determines the ranks of the keys in column A only. Column H determines the ranks of the keys in column B if their keys in column A were identical. Column I determines the ranks of the keys in column C if their keys in columns A and B were identical. Column J determines the ranks of the keys in column D if their keys in column A through C were identical. Column K determines the sequence of rows where all keys are identical so the sort will be stable. Column L determines the ranks of the four keys; it's the sum of columns G through K. Column M uses MATCH to determine how the rows should be ordered. Columns N though S use INDEX to build a sorted list.

Note that questions are not allowed in the Tutorials section of the forum. Ask them in the Applications > Calc section.
Attachments
Sorting and Filtering data with formulas1.ods
(20.72 KiB) Downloaded 1684 times
Mr. Programmer
OpenOffice 3.2.0 Build 9483 on Mac OS X 10.9.5.   The locale for any menus or Calc formulas in my posts is English (USA).
User avatar
MrProgrammer
Volunteer
 
Posts: 3161
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA


Return to Calc

Who is online

Users browsing this forum: No registered users and 1 guest