Find Duplicate Words

Posted: Tue Oct 03, 2017 10:08 am
by bobgrey1997
I have been working on some coding for hundreds of files relating to towns and cities in my state. Many towns sit on a county border and therefore show up multiple times. I need to find all of these so that I can change them slightly to avoid conflicts in the files. The program I have been using has no feature for finding duplicated words, and while doing some research, I found that OpenOffice Writer can do it: paste the entire file into Writer, open Find and Replace, search for \b(\w+)\s+\1\b, enable "Regular Expression" under "More Options", and click Find All. After reading that, I downloaded the entire OpenOffice package for that one feature. When I try to use it, I get "Search key not found"!
I even made a new document and wrote "The The" and tried searching. "Search key not found"!
Every thread I have seen on this says to use this same search pattern, and they are all from before 2010. It no longer works, so how do we find duplicate words? I have also found a solution that uses a spreadsheet and a long series of formulas to remove the duplicates, but that will not help. I do not want to remove them; I want to find them so that I can change them slightly.

Re: Find Duplicate Words

Posted: Tue Oct 03, 2017 12:19 pm
by jrkrideau
Try "Find All"

It would help to know what program the data came from and what language they are in.

Re: Find Duplicate Words

Posted: Tue Oct 03, 2017 2:45 pm
by acknak
The pattern you mentioned works fine for me.

Steps to test it:

Start OpenOffice. Make sure that you have the current version (4.1.3; see Help > About ...)
File > New > Text Document
Type: dt
Press F3 to get a "dummy text" paragraph.
Duplicate some word in the paragraph, say ... He heard quiet quiet steps behind him.

Edit > Find & Replace
Search for: \b(\w+)\s+\1\b
Replace with:
Options/Regular expressions: ON

Click Find or Find All

If that does not work, then something's wrong with your install/setup, or you've missed a step somewhere.
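
To check the pattern itself separately from the OpenOffice install, it can be run through another regex engine. The following is a minimal sketch in Python; the function name find_adjacent_duplicates is made up for illustration, and Python's re module is not the engine Writer uses, so this only shows that the expression is sound, not that Writer will behave identically.

```python
import re

# Same expression as in the Find & Replace step:
# a word, some whitespace, then the same word again.
STRICT = re.compile(r"\b(\w+)\s+\1\b")

def find_adjacent_duplicates(text):
    """Return (character offset, matched text) for each doubled word.

    Hypothetical helper, not part of OpenOffice.
    """
    return [(m.start(), m.group(0)) for m in STRICT.finditer(text)]

sample = "He heard quiet quiet steps behind him."
print(find_adjacent_duplicates(sample))     # [(9, 'quiet quiet')]
print(find_adjacent_duplicates("The The"))  # [(0, 'The The')]
```

If both test strings match here but Writer still reports "Search key not found", that points at the install or the dialog settings rather than at the pattern.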

If that works but your document does not, then your document may contain characters that don't match the pattern. You can try relaxing it a bit with something like this: \b(\w+)(\W+\1\b)+
If you try that, make sure to set Match case: ON

That pattern will match one or more "non-word" characters between the duplicates.
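
As a rough illustration of the difference between the two patterns, here is a small Python sketch. The sample sentence and the helper name matches are invented for the example, and treating Python's re.IGNORECASE flag as the equivalent of switching Match case OFF in Writer is an assumption, not something confirmed for OpenOffice.

```python
import re

# Strict pattern: duplicates separated by whitespace only.
STRICT = re.compile(r"\b(\w+)\s+\1\b")
# Relaxed pattern: duplicates separated by any run of non-word characters,
# repeated one or more times.
RELAXED = re.compile(r"\b(\w+)(\W+\1\b)+")
# Assumption: roughly what the relaxed pattern does with "Match case" OFF.
RELAXED_NOCASE = re.compile(r"\b(\w+)(\W+\1\b)+", re.IGNORECASE)

def matches(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]

text = "The the steps were quiet, quiet -- quiet behind him."

print(matches(STRICT, text))          # []
print(matches(RELAXED, text))         # ['quiet, quiet -- quiet']
print(matches(RELAXED_NOCASE, text))  # ['The the', 'quiet, quiet -- quiet']
```

The strict pattern misses the repeats because a comma or dashes sit between them, while the relaxed one picks up the whole run. With case ignored, "The the" is also reported, which may or may not be wanted.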

If none of that helps, then maybe you can create a small sample document with a bit of the text you're working with and attach that here.