Remove duplicates from Calc

Discuss the spreadsheet application

Remove duplicates from Calc

Postby leupi » Sat May 03, 2008 3:01 am

What would be the recommended way to remove duplicate entries from a column in Calc? In Word 2007 there is a simple 'Remove Duplicates' menu item; is there anything like this in Calc? If I recall, there was a way to do it in Calc but it was a bit more involved.

Thanks,
Todd
leupi
 
Posts: 1
Joined: Sat May 03, 2008 2:56 am

Re: Remove duplicates from calc

Postby squenson » Sat May 03, 2008 7:01 am

Select the entire range containing data to filter, then click on the menu Data > Filter > Standard Filter and:
1. Use a condition that is always TRUE, like field1 = Not empty
2. Click on the button more, select Remove Duplicate, select Copy to and put the address of an empty cell
The whole range (without duplicate) will be analyzed and copied at that new address.
RemoveDuplicates.PNG
LibreOffice 4.2.3.3. on Ubuntu 14.04
User avatar
squenson
Volunteer
 
Posts: 1879
Joined: Wed Jan 30, 2008 9:21 pm
Location: Lausanne, Switzerland

Re: Remove duplicates from calc

Postby bzsolt » Wed Jul 09, 2008 8:04 pm

Hi squenson,

thanks for this explanation. Can you give me an idea how to keep the removed rows (since I have other columns with data in them)?

TIA
Zsolt
OOo 2.4.X on MS Windows Vista
bzsolt
 
Posts: 1
Joined: Wed Jul 09, 2008 8:00 pm

Re: Remove duplicates from calc

Postby squenson » Wed Jul 09, 2008 10:52 pm

I would use two new columns:
1. In the first one, put a sequential number starting from 1 in the first row, and drag it down (you get 2, 3, etc. in this column)
2. Sort the range according to the column that will eventually contain duplicate value
3. In the second new column, put a formula that tests the value if the same as the previous row. If yes, put 1, if not, put 0. Example: =IF(A3=A2;1;0). Copy the formula down the column up to the last row. Rows with a 1 are duplicate rows.
4. You can now use the filtered list technique mentioned in my previous post to select only these rows and copy them somewhere. Don't use, in that case, the remove duplicate as all rows you want to select have a 1 in the specified column.

Last, if you want to sort back your range to the initial state, select the whole range and sort it according to the sequential numbers you have created in the first step.
LibreOffice 4.2.3.3. on Ubuntu 14.04
User avatar
squenson
Volunteer
 
Posts: 1879
Joined: Wed Jan 30, 2008 9:21 pm
Location: Lausanne, Switzerland

Re: Remove duplicates from calc

Postby gnulab » Tue Nov 18, 2008 1:20 pm

squenson wrote:Select the entire range containing data to filter, then click on the menu Data > Filter > Standard Filter and:
1. Use a condition that is always TRUE, like field1 = Not empty
2. Click on the button more, select Remove Duplicate, select Copy to and put the address of an empty cell
The whole range (without duplicate) will be analyzed and copied at that new address.
RemoveDuplicates.PNG


Although this method seems to be very intuitive, somehow it doesn't seem to filter out the duplicates.
After fumbling with it for 10-15 minutes, I ended up using the 2-steps method suggested by squenson (the IF function, then filter the 0's)

I don't know what went wrong with the 1st method.

I have 3 columns of data, here are the examples:
A | B | C
CN.S08 AXLE AXLE FRONT BALANCER
CN.J13 AXLE AXLE FRONT KNUCKLE
CN.X02 AXLE AXLE FRONT NUT
CN.C15 AXLE AXLE REAR COMPLETE
CN.C11 AXLE AXLE SHAFT
CN.L08 AXLE AXLE SHAFT
CN.N04 AXLE AXLE SHAFT

Objective is to filter out duplicate datas in column C (axle shaft), ie only the row containing CN.C11 AXLE AXLE SHAFT will appear once, instead of multiple times.
I guess the filter thinks that although row CN.C11 & CN.L08 contain the same value in column C (axle shaft), but value in column A (CN.C11, CNL08) is different. Thus it spits out the same value as before.

Henry
OOo 2.4.X on Ms Windows XP
gnulab
 
Posts: 4
Joined: Tue Nov 18, 2008 12:15 pm

Re: Remove duplicates from calc

Postby Villeroy » Tue Nov 18, 2008 1:35 pm

First question when removing duplicates: What makes two records duplicates?
Two equal values in field "Family Name"? Certainly not.
Two equal combinations in fields "Family Name" AND "Forname"? May be.
Two equal combinations in fields "Family Name" AND "Forname" AND "Bith Date"? Not necessarily.
In case of text values, case sensitively?

Second question when removing duplicates: Which one is the duplicate to keep?
Doe John 1941-12-31 Carpenter [added some years ago]
Doe John 1941-12-31 Retired [added recently]
Doe John 1981-12-31 Student [added some years ago]
Doe John 1981-12-31 Dr. med. [added more recently]
...obviously the 4 records refer to the same 2 persons, but no software can decide which one is to be removed.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Remove duplicates from calc

Postby jrkrideau » Thu Nov 20, 2008 4:50 pm

It depends on the data but if you're cleaning up a dirty data set with duplicates, weird outliners, etc you' start to recognize them. :)

I'm not sure that I'd want to use Calc to do the work but it's a common problem.
AOO 4.0.1, Ubuntu 13.04
jrkrideau
 
Posts: 1913
Joined: Sun Dec 30, 2007 10:00 pm
Location: Kingston Ontario Canada

Re: Remove duplicates from calc

Postby Villeroy » Thu Nov 20, 2008 6:25 pm

Anyway,
Code: Select all   Expand viewCollapse view
=SUMPRODUCT($A1=$A$1:$A$99;$B1=$B$1:$B$99;$C1=$C$1:$C$99)

... returns the frequency of combinations in this row's A, B and C within rows 1 to 99.
Adjust to your needs, copy down and filter this column by ">1".
Then decide about the records to stay.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Remove/recalcalculate duplicates in multiple rows/columns

Postby colin » Tue Jul 07, 2009 12:37 am

Find and automatically recalculate unique text values or remove duplicate values in multiple rows or columns.

Uses if and countif functions.

Example spreadsheet illustrates:

1) multiple rows of text
2) duplicate values in cells of each row
3) desire to strip out duplicate values in a given row.

Uses vvjoin macro for concatenation but this is not absolutely necessary.

Basic concept may be modified easily to deal with columns and/or to present results in different ways.
Attachments
remove duplicate example.ods
(14.97 KiB) Downloaded 2083 times
colin
 
Posts: 9
Joined: Mon Apr 28, 2008 7:33 am

Re: Remove duplicates from calc

Postby kehs » Tue Sep 15, 2009 4:25 pm

Sorry to bring up this old topic, is there anyway while using the standard filter to paste special instead of copying the cells. In the current spreadsheet the data i am meant to filter out is created by formulas within in the cells. If there are any other way of doing it please mention
Many Thanks
kehs
OpenOffice 3.1 on Windows
kehs
 
Posts: 5
Joined: Tue Sep 15, 2009 5:10 am

Re: Remove duplicates from calc

Postby Villeroy » Tue Sep 15, 2009 4:32 pm

kehs wrote:Sorry to bring up this old topic, is there anyway while using the standard filter to paste special instead of copying the cells. In the current spreadsheet the data i am meant to filter out is created by formulas within in the cells. If there are any other way of doing it please mention
Many Thanks
kehs

Database, database, database. Always use a database. Never ever use a spreadsheet if you want to organize your data tidily.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Remove duplicates from calc

Postby kehs » Tue Sep 15, 2009 5:34 pm

Hahaha will learn that, but as of right now is there anyway to paste special with the standard filter?
OpenOffice 3.1 on Windows
kehs
 
Posts: 5
Joined: Tue Sep 15, 2009 5:10 am

Re: Remove duplicates from calc

Postby Villeroy » Tue Sep 15, 2009 6:02 pm

Your question is rather unclear. I believe you want to filter to another range (with an output target in the standard filter's extra options) and want to copy values only rather than formulas.
No, there is no way to do this other than using a database or writing sophisticated macro stuff.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Remove duplicates from calc

Postby Villeroy » Tue Sep 15, 2009 7:09 pm

Stop, wait a minute!
Filter in place, copy and paste-special numbers, text, times and anything else but no formulas.
Is this what you want?
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Remove duplicates from calc

Postby kehs » Wed Sep 16, 2009 12:11 am

Coloumn A: This would consist of the dates copied from a database.
Column B: Would contain formula modifying the dates in coloumn A to give a speicifc format and layout.

What i am trying to accomplish is remove any duplicates in Column B and count the total number of dates LEFT once the duplicates are removed.

Anyhelp would be appriciated :D
OpenOffice 3.1 on Windows
kehs
 
Posts: 5
Joined: Tue Sep 15, 2009 5:10 am

Re: Remove duplicates from calc

Postby Villeroy » Wed Sep 16, 2009 11:42 am

Coloumn A: This would consist of the dates copied from a database.

You copy data from a database into a spreadsheet and wonder why you lose most of the database functionality.
In the database(should work with any database):
SELECT DISTINCT "Dates" FROM "Tables"
SELECT COUNT(*) FROM (SELECT DISTINCT "Dates" FROM "Tables")

Column B: Would contain formula modifying the dates in coloumn A to give a speicifc format and layout.

Apply your "specific number format" to A?
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Data-Filter is a great way to remove duplicates from calc

Postby GopherBaroque » Fri Apr 27, 2012 5:43 pm

Hello, I want to thank squenson for the initial solution:

Select the entire range containing data to filter, then click on the menu Data > Filter > Standard Filter and:
1. Use a condition that is always TRUE, like field1 = Not empty
2. Click on the button more, select Remove Duplicate, select Copy to and put the address of an empty cell
The whole range (without duplicate) will be analyzed and copied at that new address.


which worked great on my little list. Acknowledging that a spreadsheet is not really a database, I had a long list of items that were provided in a spreadsheet, with a small number of unique entries arranged in a large number of combinations. It was easy to cut and paste cells to make a single long column of all intances and the above technique was an efficient one-shot filter that quickly gave me the list of unique entries to copy over as input to my analysis of the combinations.
OpenOffice 3.1 on Windows XP, 7
User avatar
GopherBaroque
 
Posts: 2
Joined: Fri Oct 01, 2010 12:14 am

Re: Remove duplicates from calc

Postby coguaro » Mon Jun 04, 2012 9:53 pm

is possible copy the result to clipboard with one blank row before all row similar of this

cells whit duplicates
1
2
3
1
4
5
6

cell copied
1

2

3

4

5

6
some one please help me

sorry for my english :) :D
OpenOffice 3.1 on Windows XP
coguaro
 
Posts: 1
Joined: Mon Jun 04, 2012 9:49 pm

Re: Remove duplicates from calc

Postby kingfisher » Tue Jun 05, 2012 1:46 am

You can use the menu selection Data > Filter > Standard to copy the range without duplicates to another location (say the next empty column).

Example attached.
filter.jpeg
OpenOffice 3.3 on PCLinuxOS. There are 3 kinds of people: those who can count and those who can't.
User avatar
kingfisher
Volunteer
 
Posts: 1861
Joined: Tue Nov 20, 2007 10:53 am

Re: Remove duplicates from Calc

Postby Kumara » Thu May 23, 2013 6:03 am

5 years later, squenson's generous help is still helping people. I'm particularly impressed by the reply at viewtopic.php?f=9&t=5379#p35891. Brilliant! I registered just to register my thanks.
Kumara
 
Posts: 14
Joined: Thu May 23, 2013 5:50 am

Re: Remove duplicates from Calc

Postby kalant » Wed Jun 05, 2013 11:43 am

OK, there are some pretty elegant solutions to the matter, but mine is a little different than the original problem of the topic! I know it's an old one but still I'm posting my request for help, so I beg you be generous :)

So the situation is like this: I've got a 5 column data set A:E containing:
A:C are: month, day and hour. D is "some string" and E is "someone's name" - they all look like this:
A | B | C | D | E |
Jan | 3 |15:58:32|3ef56| from=<George>|
Jan | 3 |15:59:28|3ef56| to=<Peter> |

... and it goes like this for like say 1000 rows. So obviously the data has to be sorted by month, day, hour and some number in column D. The thing is that D values need to come in couples in every two sequential rows and for every data set I know that the E value for one of the duplicate D value rows must be lets say "George" for the sake of this example. After sorting the data sometimes there is a third row containing the same value in D column, like so:

A | B | C | D | E |
Jan | 3 |15:58:32|3ef56| from=<don't_even_ask>|
Jan | 3 |16:52:31|33f56| from=<George> |
Jan | 3 |16:53:20|33f56| to=<Peter> |

.... and this is a problem. In this particular example is clear that the first row should go. Obviously Calc is not capable of making this decision, but it sure can point out when there are more than two occurrences of the value in D. The best outcome would be if this is the case all the rows to be say colored in RED. I don' even know if that is possible, it was just a thought. It would help identifying the case and remove the rows by hand.

Also, sometimes there is this case:

A | B | C | D | E |
Jan | 3 |15:58:32|3ef56| to=<George>|

... and there is no row with the D value matching the one here - the D value is unique for the entire data set. Those rows I'd like to remove immediately.

Any ideas you OO wizards? :)
I would appreciate all the help I can get
OpenOffice.org 3.4.1 on Windows XP Pro and on Windows 7
kalant
 
Posts: 15
Joined: Fri May 31, 2013 3:30 pm

Re: Remove duplicates from Calc

Postby Villeroy » Wed Jun 05, 2013 12:28 pm

Jan | 3 |15:58:32|3ef56| from=<George>| to=<Peter>
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Remove duplicates from Calc

Postby kalant » Wed Jun 05, 2013 1:20 pm

Villeroy wrote:Jan | 3 |15:58:32|3ef56| from=<George>| to=<Peter>


What exactly do you mean by that?
OpenOffice.org 3.4.1 on Windows XP Pro and on Windows 7
kalant
 
Posts: 15
Joined: Fri May 31, 2013 3:30 pm

Re: Remove duplicates from Calc

Postby Villeroy » Wed Jun 05, 2013 1:31 pm

Your data are redundant. So you have pairs of records where the same information may fit in one record. If the one hour difference between "from" and "to" matters, the "from" and the "to" records belong into separate tables so you can look up the "from" for each "to" (or vice versa).

Apart from that, it is a huge mistake to do such things in a spreadsheet. A database would be by far less error prone, much more convenient to maintain but more difficult to set up.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Remove duplicates from Calc

Postby kalant » Wed Jun 05, 2013 2:08 pm

Villeroy wrote:Your data are redundant. So you have pairs of records where the same information may fit in one record. If the one hour difference between "from" and "to" matters, the "from" and the "to" records belong into separate tables so you can look up the "from" for each "to" (or vice versa).

Yes I've thought about merging the records before but then the data does not answer the purpose, so yes the hour difference matters and the data format should stay as shown.

Villeroy wrote:Apart from that, it is a huge mistake to do such things in a spreadsheet. A database would be by far less error prone, much more convenient to maintain but more difficult to set up.

And yes, if the database still existed a relatively simple (or maybe not that simple ) request would be sufficient to do the job. Unfortunately what I've showed here is an shredded excerpt(there is more garbage to it that need to be removed before it gets to this state) from logs of a none existing database. So I don't have much to choose from, I have to do the job with what I've got.

So I sense the answer is somewhere in that part of your reply:
Villeroy wrote:....the "from" and the "to" records belong into separate tables so you can look up the "from" for each "to" (or vice versa).

... but could you be more specific about it please?
OpenOffice.org 3.4.1 on Windows XP Pro and on Windows 7
kalant
 
Posts: 15
Joined: Fri May 31, 2013 3:30 pm

Re: Remove duplicates from Calc

Postby kalant » Wed Jun 05, 2013 2:32 pm

I have tried this:

=IF(COUNTIF($D$1:$D$748;D1)>1;"Duplicate";""

and it finds all the rows that do have duplicate value in D so in my case the ones that don't are not marked with "Duplicate" and are easy to spot. So this is good, but is there a way to delete those rows right away and not by hand?

********************************
After that I tried this one

=IF(COUNTIF($D$1:$D$748;D1)>2;"Duplicate";""

....and it does the job to find if there are more than one match for the current checked value. And of course marks them with "Duplicate" in the formula column. So this is also good start, but it would be even better if there is a way to make Calc to mark the whole row in color for example. Does anyone knows if this can be done?

Thanks to Villeroy who helped me on my other topic too! Cheers buddy :)
OpenOffice.org 3.4.1 on Windows XP Pro and on Windows 7
kalant
 
Posts: 15
Joined: Fri May 31, 2013 3:30 pm

Re: Remove duplicates from Calc

Postby Villeroy » Wed Jun 05, 2013 3:20 pm

That feature is called "conditional formatting".
With any cell in row 1 being the active cell:
menu:Format>Conditional ...
Condition #1
<Formula Is>
COUNTIF($D$1:$D$748;$D2)>1
and assign some colored cell style.

For many shades of color you can put style names in a list, say X1:X5 and use:
Condition #1
<Formula Is>
STYLE(INDEX($X$1:$X$5;COUNTIF($D$1:$D$748;$D2)))
If D2 occurs 3 times in D1:D748, it reads the 3rd style name from list X1:X5 and function STYLE will apply this style.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Remove duplicates from Calc

Postby kalant » Wed Jun 05, 2013 4:29 pm

Villeroy wrote:With any cell in row 1 being the active cell:
menu:Format>Conditional ...
Condition #1
<Formula Is>
COUNTIF($D$1:$D$748;$D2)>1
and assign some colored cell style.

This one works fine unfortunately it affects the active cell only. And I can not expand this conditional formatting for all the cells in the column. If I select all the cells in the column the conditional formatting does not affect none of them.
OpenOffice.org 3.4.1 on Windows XP Pro and on Windows 7
kalant
 
Posts: 15
Joined: Fri May 31, 2013 3:30 pm

Re: Remove duplicates from Calc

Postby Villeroy » Wed Jun 05, 2013 9:58 pm

It works with all the cells you have selected and you can copy/paste-special the formatting.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 12.04, Apache OpenOffice 4.x
User avatar
Villeroy
Volunteer
 
Posts: 17307
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany


Return to Calc

Who is online

Users browsing this forum: No registered users and 39 guests