Page 1 of 1

Remove duplicates from Calc

Posted: Sat May 03, 2008 3:01 am
by leupi
What would be the recommended way to remove duplicate entries from a column in Calc? In Word 2007 there is a simple 'Remove Duplicates' menu item; is there anything like this in Calc? If I recall, there was a way to do it in Calc but it was a bit more involved.

Thanks,
Todd

Re: Remove duplicates from calc

Posted: Sat May 03, 2008 7:01 am
by squenson
Select the entire range containing data to filter, then click on the menu Data > Filter > Standard Filter and:
1. Use a condition that is always TRUE, like field1 = Not empty
2. Click on the button more, select Remove Duplicate, select Copy to and put the address of an empty cell
The whole range (without duplicate) will be analyzed and copied at that new address.
RemoveDuplicates.PNG

Re: Remove duplicates from calc

Posted: Wed Jul 09, 2008 8:04 pm
by bzsolt
Hi squenson,

thanks for this explanation. Can you give me an idea how to keep the removed rows (since I have other columns with data in them)?

TIA
Zsolt

Re: Remove duplicates from calc

Posted: Wed Jul 09, 2008 10:52 pm
by squenson
I would use two new columns:
1. In the first one, put a sequential number starting from 1 in the first row, and drag it down (you get 2, 3, etc. in this column)
2. Sort the range according to the column that will eventually contain duplicate value
3. In the second new column, put a formula that tests the value if the same as the previous row. If yes, put 1, if not, put 0. Example: =IF(A3=A2;1;0). Copy the formula down the column up to the last row. Rows with a 1 are duplicate rows.
4. You can now use the filtered list technique mentioned in my previous post to select only these rows and copy them somewhere. Don't use, in that case, the remove duplicate as all rows you want to select have a 1 in the specified column.

Last, if you want to sort back your range to the initial state, select the whole range and sort it according to the sequential numbers you have created in the first step.

Re: Remove duplicates from calc

Posted: Tue Nov 18, 2008 1:20 pm
by gnulab
squenson wrote:Select the entire range containing data to filter, then click on the menu Data > Filter > Standard Filter and:
1. Use a condition that is always TRUE, like field1 = Not empty
2. Click on the button more, select Remove Duplicate, select Copy to and put the address of an empty cell
The whole range (without duplicate) will be analyzed and copied at that new address.
RemoveDuplicates.PNG
Although this method seems to be very intuitive, somehow it doesn't seem to filter out the duplicates.
After fumbling with it for 10-15 minutes, I ended up using the 2-steps method suggested by squenson (the IF function, then filter the 0's)

I don't know what went wrong with the 1st method.

I have 3 columns of data, here are the examples:
A | B | C
CN.S08 AXLE AXLE FRONT BALANCER
CN.J13 AXLE AXLE FRONT KNUCKLE
CN.X02 AXLE AXLE FRONT NUT
CN.C15 AXLE AXLE REAR COMPLETE
CN.C11 AXLE AXLE SHAFT
CN.L08 AXLE AXLE SHAFT
CN.N04 AXLE AXLE SHAFT

Objective is to filter out duplicate datas in column C (axle shaft), ie only the row containing CN.C11 AXLE AXLE SHAFT will appear once, instead of multiple times.
I guess the filter thinks that although row CN.C11 & CN.L08 contain the same value in column C (axle shaft), but value in column A (CN.C11, CNL08) is different. Thus it spits out the same value as before.

Henry

Re: Remove duplicates from calc

Posted: Tue Nov 18, 2008 1:35 pm
by Villeroy
First question when removing duplicates: What makes two records duplicates?
Two equal values in field "Family Name"? Certainly not.
Two equal combinations in fields "Family Name" AND "Forname"? May be.
Two equal combinations in fields "Family Name" AND "Forname" AND "Bith Date"? Not necessarily.
In case of text values, case sensitively?

Second question when removing duplicates: Which one is the duplicate to keep?
Doe John 1941-12-31 Carpenter [added some years ago]
Doe John 1941-12-31 Retired [added recently]
Doe John 1981-12-31 Student [added some years ago]
Doe John 1981-12-31 Dr. med. [added more recently]
...obviously the 4 records refer to the same 2 persons, but no software can decide which one is to be removed.

Re: Remove duplicates from calc

Posted: Thu Nov 20, 2008 4:50 pm
by jrkrideau
It depends on the data but if you're cleaning up a dirty data set with duplicates, weird outliners, etc you' start to recognize them. :)

I'm not sure that I'd want to use Calc to do the work but it's a common problem.

Re: Remove duplicates from calc

Posted: Thu Nov 20, 2008 6:25 pm
by Villeroy
Anyway,

Code: Select all

=SUMPRODUCT($A1=$A$1:$A$99;$B1=$B$1:$B$99;$C1=$C$1:$C$99)
... returns the frequency of combinations in this row's A, B and C within rows 1 to 99.
Adjust to your needs, copy down and filter this column by ">1".
Then decide about the records to stay.

Remove/recalcalculate duplicates in multiple rows/columns

Posted: Tue Jul 07, 2009 12:37 am
by colin
Find and automatically recalculate unique text values or remove duplicate values in multiple rows or columns.

Uses if and countif functions.

Example spreadsheet illustrates:

1) multiple rows of text
2) duplicate values in cells of each row
3) desire to strip out duplicate values in a given row.

Uses vvjoin macro for concatenation but this is not absolutely necessary.

Basic concept may be modified easily to deal with columns and/or to present results in different ways.

Re: Remove duplicates from calc

Posted: Tue Sep 15, 2009 4:25 pm
by kehs
Sorry to bring up this old topic, is there anyway while using the standard filter to paste special instead of copying the cells. In the current spreadsheet the data i am meant to filter out is created by formulas within in the cells. If there are any other way of doing it please mention
Many Thanks
kehs

Re: Remove duplicates from calc

Posted: Tue Sep 15, 2009 4:32 pm
by Villeroy
kehs wrote:Sorry to bring up this old topic, is there anyway while using the standard filter to paste special instead of copying the cells. In the current spreadsheet the data i am meant to filter out is created by formulas within in the cells. If there are any other way of doing it please mention
Many Thanks
kehs
Database, database, database. Always use a database. Never ever use a spreadsheet if you want to organize your data tidily.

Re: Remove duplicates from calc

Posted: Tue Sep 15, 2009 5:34 pm
by kehs
Hahaha will learn that, but as of right now is there anyway to paste special with the standard filter?

Re: Remove duplicates from calc

Posted: Tue Sep 15, 2009 6:02 pm
by Villeroy
Your question is rather unclear. I believe you want to filter to another range (with an output target in the standard filter's extra options) and want to copy values only rather than formulas.
No, there is no way to do this other than using a database or writing sophisticated macro stuff.

Re: Remove duplicates from calc

Posted: Tue Sep 15, 2009 7:09 pm
by Villeroy
Stop, wait a minute!
Filter in place, copy and paste-special numbers, text, times and anything else but no formulas.
Is this what you want?

Re: Remove duplicates from calc

Posted: Wed Sep 16, 2009 12:11 am
by kehs
Coloumn A: This would consist of the dates copied from a database.
Column B: Would contain formula modifying the dates in coloumn A to give a speicifc format and layout.

What i am trying to accomplish is remove any duplicates in Column B and count the total number of dates LEFT once the duplicates are removed.

Anyhelp would be appriciated :D

Re: Remove duplicates from calc

Posted: Wed Sep 16, 2009 11:42 am
by Villeroy
Coloumn A: This would consist of the dates copied from a database.
You copy data from a database into a spreadsheet and wonder why you lose most of the database functionality.
In the database(should work with any database):
SELECT DISTINCT "Dates" FROM "Tables"
SELECT COUNT(*) FROM (SELECT DISTINCT "Dates" FROM "Tables")
Column B: Would contain formula modifying the dates in coloumn A to give a speicifc format and layout.
Apply your "specific number format" to A?

Data-Filter is a great way to remove duplicates from calc

Posted: Fri Apr 27, 2012 5:43 pm
by GopherBaroque
Hello, I want to thank squenson for the initial solution:
Select the entire range containing data to filter, then click on the menu Data > Filter > Standard Filter and:
1. Use a condition that is always TRUE, like field1 = Not empty
2. Click on the button more, select Remove Duplicate, select Copy to and put the address of an empty cell
The whole range (without duplicate) will be analyzed and copied at that new address.
which worked great on my little list. Acknowledging that a spreadsheet is not really a database, I had a long list of items that were provided in a spreadsheet, with a small number of unique entries arranged in a large number of combinations. It was easy to cut and paste cells to make a single long column of all intances and the above technique was an efficient one-shot filter that quickly gave me the list of unique entries to copy over as input to my analysis of the combinations.

Re: Remove duplicates from calc

Posted: Mon Jun 04, 2012 9:53 pm
by coguaro
is possible copy the result to clipboard with one blank row before all row similar of this

cells whit duplicates
1
2
3
1
4
5
6

cell copied
1

2

3

4

5

6
some one please help me

sorry for my english :) :D

Re: Remove duplicates from calc

Posted: Tue Jun 05, 2012 1:46 am
by kingfisher
You can use the menu selection Data > Filter > Standard to copy the range without duplicates to another location (say the next empty column).

Example attached.
filter.jpeg

Re: Remove duplicates from Calc

Posted: Thu May 23, 2013 6:03 am
by Kumara
5 years later, squenson's generous help is still helping people. I'm particularly impressed by the reply at http://forum.openoffice.org/en/forum/vi ... 379#p35891. Brilliant! I registered just to register my thanks.

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 11:43 am
by kalant
OK, there are some pretty elegant solutions to the matter, but mine is a little different than the original problem of the topic! I know it's an old one but still I'm posting my request for help, so I beg you be generous :)

So the situation is like this: I've got a 5 column data set A:E containing:
A:C are: month, day and hour. D is "some string" and E is "someone's name" - they all look like this:
A | B | C | D | E |
Jan | 3 |15:58:32|3ef56| from=<George>|
Jan | 3 |15:59:28|3ef56| to=<Peter> |

... and it goes like this for like say 1000 rows. So obviously the data has to be sorted by month, day, hour and some number in column D. The thing is that D values need to come in couples in every two sequential rows and for every data set I know that the E value for one of the duplicate D value rows must be lets say "George" for the sake of this example. After sorting the data sometimes there is a third row containing the same value in D column, like so:

A | B | C | D | E |
Jan | 3 |15:58:32|3ef56| from=<don't_even_ask>|
Jan | 3 |16:52:31|33f56| from=<George> |
Jan | 3 |16:53:20|33f56| to=<Peter> |

.... and this is a problem. In this particular example is clear that the first row should go. Obviously Calc is not capable of making this decision, but it sure can point out when there are more than two occurrences of the value in D. The best outcome would be if this is the case all the rows to be say colored in RED. I don' even know if that is possible, it was just a thought. It would help identifying the case and remove the rows by hand.

Also, sometimes there is this case:

A | B | C | D | E |
Jan | 3 |15:58:32|3ef56| to=<George>|

... and there is no row with the D value matching the one here - the D value is unique for the entire data set. Those rows I'd like to remove immediately.

Any ideas you OO wizards? :)
I would appreciate all the help I can get

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 12:28 pm
by Villeroy
Jan | 3 |15:58:32|3ef56| from=<George>| to=<Peter>

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 1:20 pm
by kalant
Villeroy wrote:Jan | 3 |15:58:32|3ef56| from=<George>| to=<Peter>
What exactly do you mean by that?

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 1:31 pm
by Villeroy
Your data are redundant. So you have pairs of records where the same information may fit in one record. If the one hour difference between "from" and "to" matters, the "from" and the "to" records belong into separate tables so you can look up the "from" for each "to" (or vice versa).

Apart from that, it is a huge mistake to do such things in a spreadsheet. A database would be by far less error prone, much more convenient to maintain but more difficult to set up.

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 2:08 pm
by kalant
Villeroy wrote:Your data are redundant. So you have pairs of records where the same information may fit in one record. If the one hour difference between "from" and "to" matters, the "from" and the "to" records belong into separate tables so you can look up the "from" for each "to" (or vice versa).
Yes I've thought about merging the records before but then the data does not answer the purpose, so yes the hour difference matters and the data format should stay as shown.
Villeroy wrote: Apart from that, it is a huge mistake to do such things in a spreadsheet. A database would be by far less error prone, much more convenient to maintain but more difficult to set up.
And yes, if the database still existed a relatively simple (or maybe not that simple ) request would be sufficient to do the job. Unfortunately what I've showed here is an shredded excerpt(there is more garbage to it that need to be removed before it gets to this state) from logs of a none existing database. So I don't have much to choose from, I have to do the job with what I've got.

So I sense the answer is somewhere in that part of your reply:
Villeroy wrote:....the "from" and the "to" records belong into separate tables so you can look up the "from" for each "to" (or vice versa).

... but could you be more specific about it please?

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 2:32 pm
by kalant
I have tried this:

=IF(COUNTIF($D$1:$D$748;D1)>1;"Duplicate";""

and it finds all the rows that do have duplicate value in D so in my case the ones that don't are not marked with "Duplicate" and are easy to spot. So this is good, but is there a way to delete those rows right away and not by hand?

********************************
After that I tried this one

=IF(COUNTIF($D$1:$D$748;D1)>2;"Duplicate";""

....and it does the job to find if there are more than one match for the current checked value. And of course marks them with "Duplicate" in the formula column. So this is also good start, but it would be even better if there is a way to make Calc to mark the whole row in color for example. Does anyone knows if this can be done?

Thanks to Villeroy who helped me on my other topic too! Cheers buddy :)

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 3:20 pm
by Villeroy
That feature is called "conditional formatting".
With any cell in row 1 being the active cell:
menu:Format>Conditional ...
Condition #1
<Formula Is>
COUNTIF($D$1:$D$748;$D2)>1
and assign some colored cell style.

For many shades of color you can put style names in a list, say X1:X5 and use:
Condition #1
<Formula Is>
STYLE(INDEX($X$1:$X$5;COUNTIF($D$1:$D$748;$D2)))
If D2 occurs 3 times in D1:D748, it reads the 3rd style name from list X1:X5 and function STYLE will apply this style.

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 4:29 pm
by kalant
Villeroy wrote: With any cell in row 1 being the active cell:
menu:Format>Conditional ...
Condition #1
<Formula Is>
COUNTIF($D$1:$D$748;$D2)>1
and assign some colored cell style.
This one works fine unfortunately it affects the active cell only. And I can not expand this conditional formatting for all the cells in the column. If I select all the cells in the column the conditional formatting does not affect none of them.

Re: Remove duplicates from Calc

Posted: Wed Jun 05, 2013 9:58 pm
by Villeroy
It works with all the cells you have selected and you can copy/paste-special the formatting.