[Solved] First and last name duplicates detector

Discuss the spreadsheet application

[Solved] First and last name duplicates detector

Postby ooGuillaume » Sun Jul 21, 2019 3:32 pm

Hello,

Using LibreOffice 5.1.6.2 on Xubuntu 16.04, I'd like to create a duplicate detector for first and last name combinations.

We have people writing down their names in a notebook. Some people are new, some people will write down their name again next year, some people had already written it down the year before.
Every week or so, we add the most recent names in our Calc sheet (which serves as a subscribers database).
But we only need one line per person.

Hence the need for this 'duplicate detector'. :)

Attached to this post is a sample file showing what I've done so far:

Column A is first name, B is last name. In C I've added a modified combination of the two, going through this formula:
Code: Select all   Expand viewCollapse view
=LOWER(CONCATENATE(NORMALIZENAME(SUBSTITUTE(SUBSTITUTE(A2;" ";"");"-";""));NORMALIZENAME(SUBSTITUTE(SUBSTITUTE(B2;" ";"");"-";""))))

NormalizeName is a function I've taken from https://www.extendoffice.com/documents/ ... cters.html. Although for Microsoft Excel originally, the code was accepted by Calc:
Code: Select all   Expand viewCollapse view
Function NormalizeName(thestring As String)
   Dim A As String * 1
   Dim B As String * 1
   Dim i As Integer
   Const AccChars= "ĄÂÃÄÀÁÅÆĆÇĘÈÉÊËÌÍÎÏŁÐŃÑÒÓÔÕÖØÙÚÛÜŚÝÞŹŻßąàáâãäåæćçęèéêëìíîïłðńñóòóôõöøùúûüśýþÿźżäöüúůýžÚŮÝŽéëïóöüÉËÏÓÖÜùûüÿàâæçéèêÙÛÜŸÀÂÆ"
   Const RegChars= "AAAAAAAACCEEEEEIIIILDNNOOOOOOUUUUSYpZZbaaaaaaaacceeeeeiiiilonnooooooouuuusybyzzaouuuyzUUYZeeioouEEIOOUuuuyaaaceeeUUUYAAA"

   For i = 1 To Len(AccChars)
      A = Mid(AccChars, i, 1)
      B = Mid(RegChars, i, 1)
      thestring = Replace(thestring, A, B)
   Next

   NormalizeName = thestring
End Function

Characters' list was taken from https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=81932#p379192.

This combination in column C is then used in conditional formatting of columns A and B to apply a different style to duplicates, with this formula:
Code: Select all   Expand viewCollapse view
COUNTIF($C$2:$C$2763;$C2)>1


I'd like to find a faster, cleaner and more complete solution.
- We have 2700+ lines, and the NormalizeName function seems to slow down the sheet loading and use.
- Using the substitute function to remove white spaces and dashes could be used for other characters but would further complicate the whole formula. Maybe it's better to integrate this into the NormalizeName macro.
- An intelligent search could detect typing errors such as "Elliott Ness" vs. "Elliot Ness".

What I like about the solution at the moment is that it detects duplicates on the fly.
I'm not fond of the intermediate column C and would appreciate avoiding it if possible.

To sum it up, I'm interested in any input to improve this, maybe with a completely different approach, or by making this one more effective?
Please let me know if more info is needed.

Thank you!
Attachments
DuplicateNamesDetector.ods
(108.32 KiB) Downloaded 3 times
Last edited by ooGuillaume on Mon Jul 22, 2019 3:21 pm, edited 1 time in total.
LibreOffice 6.2.5.2 on Ubuntu 16.04
ooGuillaume
 
Posts: 7
Joined: Sat Jul 20, 2019 10:03 am

Re: A duplicate detector for first and last name combination

Postby floris v » Sun Jul 21, 2019 4:59 pm

I don't have a solution but a question that may mess things up a bit more. How do you handle two different people with the same first and last name?
AOO 4.1.6 op Linux Mint
If your problem has been solved or your question has been answered, please edit the first post in this thread and add [Solved] to the title bar.
Nederlandstalig forum
User avatar
floris v
Moderator
 
Posts: 4112
Joined: Wed Nov 28, 2007 1:21 pm
Location: Netherlands

Re: A duplicate detector for first and last name combination

Postby RusselB » Sun Jul 21, 2019 5:00 pm

Trying to do this without a helper column (in this case your column C) would be, at best, massive headache...it might be slower (depends on what is used in place of the helper column). and might not even be possible (while using Calc as a database).
If you don't like the column C because you can see it, simply hide the column.
I don't know if there's a way to code this into your macro, but the Find & Replace dialog (Ctrl+H) has a Similarity Search under Other Options.
Something you might want to look into, as a similarity search, from what I gather from my reading (note: I haven't actually used it), might eliminate your need for the Normalize Name code.
Problems I can think of: 1) The search might not be programmable into a on-the-fly macro, thus making it something that might need to be done weekly.
2) It might be slower, due to the fact that it has to do many comparisons to see if the similarity matches.
OpenOffice 4.1.6 and LibreOffice 6.0.6.2 on Windows 7 Pro & Ultimate
If you believe your problem has been resolved, please go to your first post in this topic, click the Edit button and add [Solved] to the beginning of the Subject line.
User avatar
RusselB
Moderator
 
Posts: 5282
Joined: Fri Jan 03, 2014 7:31 am
Location: Sarnia, ON

Re: A duplicate detector for first and last name combination

Postby Villeroy » Sun Jul 21, 2019 10:38 pm

A database could enforce unique combinations of columns so that you would not be able to enter any duplicates in the first place.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04, no OpenOffice, LibreOffice 6.x
User avatar
Villeroy
Volunteer
 
Posts: 26968
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: A duplicate detector for first and last name combination

Postby ooGuillaume » Mon Jul 22, 2019 1:29 pm

About handling people with the same name:
They come and subscribe to a workshop we offer, so sometimes we have knowledge outside of the Calc sheet to determine this. They can write down their e-mail address, which also helps. But sometimes we don't know, and I changed the solution accordingly. Thanks.

About using a database:
Our Calc sheet also has a history of subscriptions dates and fees, which could benefit from a database. But with the skills and time at hand, it's not something I suppose we could tackle right now. Thanks anyway, it could be an important improvement.

About using the similary search:
Thank you, I was able to use that.

On opening the file, I check that the columns I need are in place, in case someone else messed with the file.
On changed content, I fill the intermediate column with a normalized full name so it's only done once and for all.
Then I check if I find a duplicate, and if so, a Message Box tells the user there might already be an entry for this name, and takes them there.
If they decide it's not the same person, they just have to come back to the line they just filled.

I'm a former C++ programmer but know next to nothing about good programming in Basic.
If you see horrible things in the code below, please do tell me!

Code: Select all   Expand viewCollapse view
Global FirstNameColumnIndex, LastNameColumnIndex, FullNameColumnIndex As Integer
Global Doc As Object
Global Sheet As Object
Global MaxRowIndexForDuplicateSearch As Integer

Sub OnOpenFile
   ' Duplicate detection code is ran at each content change
   ' It needs First, Last and Full name columns to be where they belong
   FirstNameColumnIndex = 0
   LastNameColumnIndex = 1
   FullNameColumnIndex = 3

   ' We look for duplicate in all entries
   ' In July 2019 there were 2700
   ' Raise the number below if we ever reach it
   MaxRowIndexForDuplicateSearch = 10000

   Doc = StarDesktop.CurrentComponent
   Sheet = Doc.Sheets(0)
   
   If Sheet.getCellByPosition(FirstNameColumnIndex, 0).String <> "Prénom" Then
      MsgBox "Column A must be ""First Name"" for duplicate detection."
   EndIf
   If Sheet.getCellByPosition(LastNameColumnIndex, 0).String <> "NOM" Then
      MsgBox "Column B must be ""Last Name"" for duplicate detection."
   EndIf
   If Sheet.getCellByPosition(FullNameColumnIndex, 0).String <> "Détection doublons" Then
      MsgBox "Column D must be ""Intermediate helper for duplicate detection"" for duplicate detection."
   EndIf
End Sub

Sub OnChangedContent(ChangedRange)
   ' CellAddr is to be the first cell in the ChangedRange
   Dim CellAddr As Object
   If Not ChangedRange.supportsService("com.sun.star.sheet.SheetCell") Then
      Dim ChangedRows() : ChangedRows() = ChangedRange.getDataArray()
      Dim FirstChangedRow() : FirstChangedRow() = LBound(ChangedRows())
      CellAddr = ChangedRange.getCellByPosition(0, 0).getCellAddress()
   Else
      CellAddr = ChangedRange.getCellAddress()
   EndIf
   ' W eonly car about changes in the First or Last name columns
    If CellAddr.Row = 0 Or CellAddr.Column > 1 Then Exit Sub
   
   Dim FirstNameCell, LastNameCell, FullNameCell As Object
   Dim FunctionAccess As Object
   Dim FullName As String

   Doc = StarDesktop.CurrentComponent
   Sheet = Doc.Sheets(0)

   FirstNameCell = Sheet.getCellByPosition(0, CellAddr.Row)
   LastNameCell = Sheet.getCellByPosition(1, CellAddr.Row)
   FullNameCell = Sheet.getCellByPosition(3, CellAddr.Row)
   FullName = FirstNameCell.String & LastNameCell.String

   FunctionAccess = CreateUnoService("com.sun.star.sheet.FunctionAccess")

   FullName = FunctionAccess.callFunction("LOWER", Array(FullName))
   FullName = FunctionAccess.callFunction("SUBSTITUTE", Array(FullName, " ", ""))
   FullName = FunctionAccess.callFunction("SUBSTITUTE", Array(FullName, "-", ""))
   FullName = FunctionAccess.callFunction("SUBSTITUTE", Array(FullName, "–", ""))
   FullName = NormalizeName(FullName)
   
   ' We fill column D with a normalized version of the full name
   ' Example : François Étienne becomes francoisetienne
   FullNameCell.String = FullName
   
   ' We look for potential duplicates. If one is found, we tell the user and send them there
   If FullName <> "" Then
      Dim Descript, Cell, FullNameRange As Object
      
      CellRange = Sheet.getCellRangeByPosition(3, CellAddr.Row + 1, 3, MaxRowIndexForDuplicateSearch)
      Descript = CellRange.createSearchDescriptor()
      Descript.SearchString = FullName
      Descript.SearchSimilarity = True
      Descript.SearchSimilarityAdd = 1
      Descript.SearchSimilarityRemove = 1
      Descript.SearchSimilarityExchange = 1

      Cell = CellRange.findFirst(Descript)
      If Not IsNull(Cell) Then
         Dim DuplicateRow As Integer
         DuplicateRow = Cell.getCellAddress().Row
         MsgBox "Doublon possible détecté avec " & Cell.String & " ligne " & DuplicateRow + 1
         Doc.currentController.select(Sheet.getCellByPosition(0, DuplicateRow))
      EndIf
   EndIf
End Sub

' Remove diacritic characters from a lower case string
Function NormalizeName(thestring As String)
   Dim A As String * 1
   Dim B As String * 1
   Dim i As Integer
   Const AccChars= "Þßąàáâãäåæćçęèéêëìíîïłðńñóòóôõöøùúûüśýþÿźżäöüúůýžéëïóöüùûüÿàâæçéèê"
   Const RegChars= "pbaaaaaaaacceeeeeiiiilonnooooooouuuusybyzzaouuuyzeeioouuuuyaaaceee"

   For i = 1 To Len(AccChars)
      A = Mid(AccChars, i, 1)
      B = Mid(RegChars, i, 1)
      thestring = Replace(thestring, A, B)
   Next

   NormalizeName = thestring
End Function

Thanks for the answers (and more generally to all helping people on these boards!)
LibreOffice 6.2.5.2 on Ubuntu 16.04
ooGuillaume
 
Posts: 7
Joined: Sat Jul 20, 2019 10:03 am

Re: [SOLVED] First and last name duplicates detector

Postby Villeroy » Mon Jul 22, 2019 3:45 pm

SOUNDEX is a function that creates a kind of hash value. Equally sounding names share the same soundex value. The function is targeted at English pronounciation. The database engine shipped with Open/LibreOffice includes a SOUNDEX implementation. The rest is trivial SQL and a form where you can edit all instances of all duplicates.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04, no OpenOffice, LibreOffice 6.x
User avatar
Villeroy
Volunteer
 
Posts: 26968
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: [Solved] First and last name duplicates detector

Postby Villeroy » Mon Jul 22, 2019 10:43 pm

Simple database demo with embedded HSQLDB and SOUNDEX.
One table with forename, surname, birth date and auto-id (row number). Forename, surname and birth date are enforced to be unique.
A query selecting the table plus the soundex of forename and surname.
A query selecting all combinations of soundexes of forenames and surnames and how often these combinations exist.
A form which shows all persons having the same soundex of the selected pair of soundex. I added duplicates person entries with similarly written names to the first 3 pairs. The form allows you to decide if similar persons are the same person or not.
No maco code.
Attachments
DummyPersons.odb
(19.37 KiB) Downloaded 2 times
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04, no OpenOffice, LibreOffice 6.x
User avatar
Villeroy
Volunteer
 
Posts: 26968
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany


Return to Calc

Who is online

Users browsing this forum: No registered users and 14 guests