How to find duplicate paragraphs?
Posted: Mon Jul 09, 2012 4:34 pm
by NealRanzoni

OK, I am using an OpenOffice 3.3 text document.
My problem is that I pasted several pages into the file and, God only knows how, I have ended up with repeated information. I have been manually searching paragraph by paragraph with Ctrl-F, over and over. Is there a way for the file to check itself for repeated information, other than searching it one paragraph at a time?
Thanks in advance for your help.
Sincerely, Neal Ranzoni
Title Edited. A descriptive title for posts helps others who are searching for solutions and increases the chances of a reply (Hagar, Moderator).
Re: I really need to find a fix today.
Posted: Mon Jul 09, 2012 4:58 pm
by thomasjk
Not that I'm aware of. There could be special utilities that do this but I don't know of any.
Re: I really need to find a fix today.
Posted: Mon Jul 09, 2012 5:09 pm
by acknak
I sure can't think of any easy way.
If you're sure that the duplication is at the paragraph level (that is, whole paragraphs--or more--are duplicated, rather than just sentences or phrases), you might be able to do it with a Calc spreadsheet:
Copy/paste all the paragraphs into a Calc column
Copy the column and transpose it into one row
Fill in the entire grid with a formula comparing the paragraph in that row with the paragraph in that column.
Any matches that appear off the diagonal are your unwanted duplications.
This will be limited to the maximum number of columns that Calc can hold (1,024 in OpenOffice 3.3).
Not really easy, but it does work.
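For anyone who would rather see the grid idea outside a spreadsheet, here is a minimal Python sketch of the same comparison. It assumes the paragraphs are already available as a plain list of strings (how they get out of Writer is left open), and the name `duplicate_pairs` is made up for illustration. Each `grid[i][j]` entry stands in for the Calc cell that compares row i with column j.

```python
def duplicate_pairs(paragraphs):
    """Return (i, j) index pairs, i < j, where two non-empty paragraphs match exactly."""
    n = len(paragraphs)
    # grid[i][j] plays the role of the Calc cell comparing row i with column j.
    grid = [[paragraphs[i] == paragraphs[j] for j in range(n)] for i in range(n)]
    pairs = []
    for i in range(n):
        # Only look above the diagonal: the grid is symmetric, and the
        # diagonal itself just compares each paragraph with itself.
        for j in range(i + 1, n):
            if grid[i][j] and paragraphs[i] != "":
                pairs.append((i, j))
    return pairs


sample = ["Intro", "Chapter one", "Intro", "", "Chapter one", ""]
print(duplicate_pairs(sample))
```

Unlike the spreadsheet, a list like this has no column limit, but the number of comparisons still grows with the square of the paragraph count.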
Re: I really need to find a fix today.
Posted: Mon Jul 09, 2012 6:49 pm
by Bill
Did you try "Undo"?
Re: How to find duplicate paragraphs?
Posted: Tue Jul 10, 2012 10:38 pm
by JohnV
I took this on as a programming challenge. This code leaves blank paragraphs where the duplicates used to be.
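Code:
Sub DeleteDuplicateParagraphs
    ' Blank out every paragraph that repeats an earlier paragraph.
    oDoc = ThisComponent
    oEnum = oDoc.Text.createEnumeration
    c = 0
    While oEnum.hasMoreElements
        thisPara = oEnum.nextElement
        c = c + 1
        ' The enumeration also returns text tables, which have no getString.
        If thisPara.supportsService("com.sun.star.text.Paragraph") Then
            s = thisPara.getString
            If Len(s) > 0 Then
                Check(s, c, oDoc)
            EndIf
        EndIf
    Wend
End Sub

Sub Check(s, c, oDoc)
    ' Skip the first c elements, i.e. everything up to and including the current paragraph.
    oEnum1 = oDoc.Text.createEnumeration
    cc = 0
    While oEnum1.hasMoreElements And cc < c
        oEnum1.nextElement
        cc = cc + 1
    Wend
    ' Compare every later paragraph with s and empty the ones that match.
    While oEnum1.hasMoreElements
        nextPara = oEnum1.nextElement
        If nextPara.supportsService("com.sun.star.text.Paragraph") Then
            If nextPara.getString = s Then
                nextPara.setString("")
            EndIf
        EndIf
    Wend
End Sub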
Re: How to find duplicate paragraphs?
Posted: Wed Jul 11, 2012 11:00 am
by karolus
Hello
The "challenge" solved in Python:
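Code:
context = XSCRIPTCONTEXT

def iterable(enumerable):
    """Yield each element of a UNO enumeration access."""
    enum = enumerable.createEnumeration()
    while enum.hasMoreElements():
        yield enum.nextElement()

def remove_duplicate_paragraphs():
    """Empty every paragraph whose text already occurred earlier in the document."""
    doc = context.getDocument()
    text = doc.getText()
    seen = set()
    for paragraph in iterable(text):
        # The enumeration also returns text tables, which have no getString().
        if not paragraph.supportsService("com.sun.star.text.Paragraph"):
            continue
        content = paragraph.getString()
        if content in seen:
            paragraph.setString("")
        else:
            seen.add(content)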
Karo
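Both macros in this thread blank out the duplicates straight away. A more cautious variant would only list the repeats, so they can be reviewed before anything is deleted. The following is a rough sketch of that idea in plain Python, not part of either macro. It assumes the paragraph texts have already been collected into a list of strings, and `find_duplicates` is a made-up helper name.

```python
from collections import defaultdict


def find_duplicates(paragraphs):
    """Map each repeated paragraph text to the list of positions where it occurs."""
    positions = defaultdict(list)
    for index, text in enumerate(paragraphs):
        if text.strip():  # ignore blank paragraphs
            positions[text].append(index)
    # Keep only the texts that occur more than once.
    return {text: where for text, where in positions.items() if len(where) > 1}


sample = [
    "Once upon a time.",
    "The end.",
    "",
    "Once upon a time.",
    "The end.",
    "Once upon a time.",
]
for text, where in find_duplicates(sample).items():
    print(where, text)
```

The positions are zero-based list indexes, so they would need to be mapped back to paragraphs in the document by whatever method produced the list.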
Re: How to find duplicate paragraphs?
Posted: Tue Jul 17, 2012 11:08 am
by NealRanzoni
Here is the problem. I was writing a damn book and pasted all the parts in, but somehow I pasted anywhere from 1-3 copies of the same parts. So I have been going paragraph by paragraph, searching for and deleting copies of what was already pasted. The issue is that it is over 200 pages and it just got really old. I am close to either scrapping the damn book or starting it over.
Re: How to find duplicate paragraphs?
Posted: Tue Jul 17, 2012 11:15 am
by RoryOF
Is the book broken into chapters? If not, insert chapter breaks at suitable intervals, so that you can deal only with a smallish section at a time (you can always remove the chapters later if they don't suit your text flow). The old-fashioned method might be best: print it out and go through the printout with a highlighter. Mark the repeats, then sit down at your computer and systematically delete them, handling one "chapter" at a time. I can think of no surer way, nor of any easier way.
Edit: For what it is worth, I give here a link to a method of finding such paragraphs in MS Word
http://www.techandlife.com/2012/06/find ... soft-word/
I haven't tried this, and don't know if or how it works. It will almost certainly need modification for OpenOffice, which, as the textbook writers say, I leave as an exercise for those interested!
Re: How to find duplicate paragraphs?
Posted: Tue Jul 17, 2012 2:34 pm
by NealRanzoni
I would be the interested one, lol. I will give this a quick shot and then move on to the guys' posts with the code in them. I am dreading starting from scratch, but the book is published and OMG if someone buys it before I get it fixed.
I will try both, and if I cannot fix this disaster I created I will just start over on Sunday when I am finally back at my desk full time.
Thanks to everyone who is working hard not to LOL@me and is shooting me ideas.
Re: How to find duplicate paragraphs?
Posted: Tue Jul 17, 2012 2:39 pm
by NealRanzoni
"karolus" & "JohnV": guys, this may be beyond my comprehension until I have time to sit down (Sunday), but I see a website/program waiting to be made based on this. I am sure this would be a marketable tool, guys.
I am strapped for sleep and computer time this week, but I will be figuring out the code from both posts this weekend. If I can get anything working by this weekend (well, early next week) I will be back posting the good or bad news.
Thanks again, Neal Ranzoni
Re: How to find duplicate paragraphs?
Posted: Tue Jul 17, 2012 2:55 pm
by acknak
The Word article RoryOF linked to won't work in Writer: it depends on being able to search across paragraph breaks, which Writer still can't do.
Did you try the method I outlined? It worked in a few minutes with the document I tested with; the main restriction is that you have to have duplicated paragraphs, and not smaller bits.
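The restriction to whole, identical paragraphs could be loosened a little by scoring how similar two paragraphs are instead of testing for exact equality. Below is a rough sketch using `difflib.SequenceMatcher` from the Python standard library. The function name `near_duplicates` and the 0.9 threshold are assumptions picked for illustration, and the paragraphs are assumed to be available as a list of strings.

```python
from difflib import SequenceMatcher


def near_duplicates(paragraphs, threshold=0.9):
    """Return (i, j, ratio) for paragraph pairs whose similarity reaches threshold."""
    hits = []
    for i, first in enumerate(paragraphs):
        if not first.strip():
            continue  # skip blank paragraphs
        for j in range(i + 1, len(paragraphs)):
            second = paragraphs[j]
            if not second.strip():
                continue
            # autojunk=False keeps long paragraphs from being scored with
            # the "popular character" heuristic.
            ratio = SequenceMatcher(None, first, second, autojunk=False).ratio()
            if ratio >= threshold:
                hits.append((i, j, round(ratio, 2)))
    return hits


sample = [
    "The quick brown fox jumps over the lazy dog.",
    "Something else entirely.",
    "The quick brown fox jumped over the lazy dog.",
]
for i, j, ratio in near_duplicates(sample):
    print(i, j, ratio)
```

This compares every pair of paragraphs, so it is likely to be slow on a 200-page document, and the threshold would need some experimenting to avoid false matches.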
Re: How to find duplicate paragraphs?
Posted: Tue Jul 17, 2012 3:01 pm
by RoryOF
acknak wrote: The Word article RoryOF linked to won't work in Writer: it depends on being able to search across paragraph breaks, which Writer still can't do.
AltSearch will search across paragraphs, as far as I remember, so a version of the Word code ought to be possible for some of our programming geniuses; I don't expect it would be blindingly fast, and can already think of scenarios which might cause it to fail, such as highly repetitive lists. I gave the URL merely as an indication of _a_ method of tackling such a problem.
Re: How to find duplicate paragraphs?
Posted: Tue Jul 17, 2012 3:20 pm
by acknak
AltSearch should handle that; right, I forgot about that.
At least as I understand from a quick look, the Word method only finds adjacent duplicates. If the duplicates are not together, they won't be found.
The matrix approach I gave is limited to 1000 paragraphs, due to Calc's column limit, but that's still a decent-sized chunk of text.
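To make the difference between "adjacent only" and "anywhere" concrete, here is a small Python sketch of both checks. It is not the Word macro or the Calc grid, just an illustration on a plain list of strings, and both function names are made up.

```python
def adjacent_duplicates(paragraphs):
    """Indexes of non-empty paragraphs that repeat the paragraph directly before them."""
    return [
        i
        for i in range(1, len(paragraphs))
        if paragraphs[i] and paragraphs[i] == paragraphs[i - 1]
    ]


def all_duplicates(paragraphs):
    """Indexes of non-empty paragraphs that repeat any earlier paragraph."""
    seen = set()
    repeats = []
    for i, text in enumerate(paragraphs):
        if not text:
            continue
        if text in seen:
            repeats.append(i)
        else:
            seen.add(text)
    return repeats


sample = ["A", "A", "B", "C", "B"]
print(adjacent_duplicates(sample))  # only the back-to-back repeat
print(all_duplicates(sample))       # also the repeat separated by other text
```

In the sample, the second "B" is separated from the first by another paragraph, so only the second function reports it.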
Re: How to find duplicate paragraphs?
Posted: Wed Jul 18, 2012 11:32 am
by karolus
Hello
I tried to do this task with AltSearch, but without success... [maybe I'm overlooking something] ...
Karo
Re: How to find duplicate paragraphs?
Posted: Wed Jul 18, 2012 5:01 pm
by jrkrideau
This may sound stupid, but does anyone know if there is decent open-source plagiarism software out there?
See for example
http://en.wikipedia.org/wiki/Plagiarism ... -documents
Neal's problem sounds analogous to finding plagiarized material in an article or student essay.
http://www.grammarly.com/?q=plagiarism& ... QAodxzvveA seems something like what I mean, though it's not clear from the page whether it is meant to detect plagiarism or to help the copier avoid charges thereof.
