How to find duplicate paragraphs?

NealRanzoni
Posts: 4
Joined: Mon Jul 09, 2012 4:28 pm

How to find duplicate paragraphs?

Post by NealRanzoni »

OK, I am using an OpenOffice 3.3 text document.

My problem is that I pasted several pages into the file and, God only knows how, I ended up with repeated information. I have been manually searching paragraph by paragraph with Ctrl-F, over and over. Is there a way for the file to check itself for repeated information, other than searching it one paragraph at a time?

Thanks in advance for your help.

Sincerely, Neal Ranzoni

Title Edited. A descriptive title for posts helps others who are searching for solutions and increases the chances of a reply (Hagar, Moderator).
OpenOffice3.3
I know I am on Windows7
thomasjk
Volunteer
Posts: 4454
Joined: Tue Dec 25, 2007 4:52 pm
Location: North Carolina

Re: I really need to find a fix today.

Post by thomasjk »

Not that I'm aware of. There could be special utilities that do this, but I don't know of any.
Tom K.
Windows 11 23H2
LibreOffice
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: I really need to find a fix today.

Post by acknak »

I sure can't think of any easy way.

If you're sure that the duplication is at the paragraph level (that is, whole paragraphs--or more--are duplicated, rather than just sentences or phrases), you might be able to do it with Calc spreadsheet:

Copy/paste all the paragraphs into a Calc column

Copy the column and transpose it into one row

Fill the entire grid with a formula that compares the paragraph in that row with the paragraph in that column.

Any matches that appear off the diagonal are your unwanted duplications.

This will be limited to the maximum number of columns that Calc can hold (1,024, I think).

Not really easy, but it does work.
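
The grid idea above can also be tried outside Calc. This is only a minimal sketch, assuming the paragraphs have already been exported to a Python list of strings; the function name `off_diagonal_matches` and the sample data are made up for illustration:

```python
def off_diagonal_matches(paragraphs):
    """Return (row, column) pairs where two different positions hold the same text.

    Mirrors the spreadsheet grid: cell (r, c) compares paragraph r with
    paragraph c. Only cells above the diagonal are reported, because the
    grid is symmetric and every diagonal cell matches itself.
    """
    matches = []
    for r, left in enumerate(paragraphs):
        for c in range(r + 1, len(paragraphs)):
            # Empty paragraphs are ignored so blank lines do not count as repeats.
            if left and left == paragraphs[c]:
                matches.append((r, c))
    return matches


# Hypothetical sample: paragraph 0 is repeated at position 2.
sample = ["Chapter one.", "Some text.", "Chapter one.", ""]
print(off_diagonal_matches(sample))
```

Like the spreadsheet version, this compares every pair, so the work grows with the square of the paragraph count.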
AOO4/LO5 • Linux • Fedora 23
Bill
Volunteer
Posts: 8952
Joined: Sat Nov 24, 2007 6:48 am

Re: I really need to find a fix today.

Post by Bill »

Did you try "Undo"?
JohnV
Volunteer
Posts: 1585
Joined: Mon Oct 08, 2007 1:32 am
Location: Kentucky, USA

Re: How to find duplicate paragraphs?

Post by JohnV »

I took this on as a programming challenge. This code leaves blank paragraphs where the duplicates used to be.

Code:

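Sub DeleteDuplicateParagraphs
    ' Blank every paragraph whose text repeats an earlier paragraph.
    Dim oDoc, oEnum, thisPara, s, c
    oDoc = ThisComponent
    oEnum = oDoc.Text.createEnumeration
    c = 0
    While oEnum.hasMoreElements
        thisPara = oEnum.nextElement
        c = c + 1
        ' The enumeration also returns text tables, which have no getString.
        If thisPara.supportsService("com.sun.star.text.Paragraph") Then
            s = thisPara.getString
            If Len(s) > 0 Then
                Check(s, c, oDoc)
            End If
        End If
    Wend
End Sub

Sub Check(s, c, oDoc)
    ' Blank every paragraph after position c whose text equals s.
    Dim oEnum1, nextPara, cc
    oEnum1 = oDoc.Text.createEnumeration
    cc = 0
    ' Skip the first c elements, up to and including the paragraph being checked.
    ' (Testing c >= cc here would skip one element too many and miss a
    ' duplicate that directly follows its original.)
    While oEnum1.hasMoreElements And cc < c
        oEnum1.nextElement
        cc = cc + 1
    Wend
    While oEnum1.hasMoreElements
        nextPara = oEnum1.nextElement
        If nextPara.supportsService("com.sun.star.text.Paragraph") Then
            If nextPara.getString = s Then
                nextPara.setString("")
            End If
        End If
    Wend
End Sub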
karolus
Volunteer
Posts: 1234
Joined: Sat Jul 02, 2011 9:47 am

Re: How to find duplicate paragraphs?

Post by karolus »

Hello

The "Challenge" solved in Python:

Code:

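context = XSCRIPTCONTEXT  # provided by the office's Python script provider

def iterable(enumerable):
    """Yield each element of a UNO enumeration access."""
    enum = enumerable.createEnumeration()
    while enum.hasMoreElements():
        yield enum.nextElement()

def remove_duplicate_paragraphs():
    """Blank every paragraph whose text repeats an earlier paragraph."""
    doc = context.getDocument()
    text = doc.getText()

    seen = set()  # paragraph strings met so far; a set keeps lookups fast

    for paragraph in iterable(text):
        # The enumeration also yields text tables, which have no getString.
        if not paragraph.supportsService("com.sun.star.text.Paragraph"):
            continue
        content = paragraph.getString()
        if not content:
            continue  # paragraphs that are already empty are left alone
        if content in seen:
            paragraph.setString("")
        else:
            seen.add(content)

# Show only the macro itself in the "Run Macro" dialog.
g_exportedScripts = (remove_duplicate_paragraphs,)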
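
Because the macro above changes the document directly, it may be worth trying the same logic on plain strings first, without an office runtime. A minimal sketch, assuming the paragraph texts are available as a Python list; `find_duplicates` is a hypothetical helper and not part of any office API:

```python
def find_duplicates(paragraphs):
    """Return (index, first_index) for each paragraph that repeats an earlier one."""
    first_seen = {}  # paragraph text -> index of its first occurrence
    duplicates = []
    for index, content in enumerate(paragraphs):
        if not content:
            continue  # empty paragraphs are never reported
        if content in first_seen:
            duplicates.append((index, first_seen[content]))
        else:
            first_seen[content] = index
    return duplicates


# Hypothetical sample with two repeated paragraphs and two blank ones.
sample = ["Intro", "Body", "Intro", "", "Body", ""]
for index, original in find_duplicates(sample):
    print(f"paragraph {index} repeats paragraph {original}")
```

Reporting the pairs instead of blanking them gives a chance to review what would be removed before anything is deleted.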

Karo
Libreoffice 25.2… on Debian 13 (trixie) (on RaspberryPI5)
Libreoffice 25.8… flatpak on Debian 13 (Bookworm) (on RaspberryPI5)
NealRanzoni
Posts: 4
Joined: Mon Jul 09, 2012 4:28 pm

Re: How to find duplicate paragraphs?

Post by NealRanzoni »

Here is the problem. I was writing a damn book and pasted all the parts in. But somehow I pasted anywhere from one to three copies of the same parts. So I have been going paragraph by paragraph, searching for and deleting copies of what was already pasted. The issue is that it is over 200 pages and it just got really old. I am close to either scrapping the damn book or starting it over.
OpenOffice3.3
I know I am on Windows7
RoryOF
Moderator
Posts: 35104
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: How to find duplicate paragraphs?

Post by RoryOF »

Is the book broken into chapters? If not, insert chapter breaks at suitable intervals, so that you can deal only with a smallish section at a time (you can always remove the chapters later if they don't suit your text flow). The old-fashioned method might be best: print it out and go through the printout with a highlighter. Mark the repeats, then sit down at your computer and systematically delete them, handling one "chapter" at a time. I can think of no surer way, nor of any easier way.
Edit: For what it is worth, I give here a link to a method of finding such paragraphs in MS Word:
http://www.techandlife.com/2012/06/find ... soft-word/
I haven't tried this, and don't know if or how it works. It will almost certainly need modification for OpenOffice, which, as the textbook writers say, I leave as an exercise for those interested! 
Apache OpenOffice 4.1.15 on Xubuntu 22.04.5 LTS
NealRanzoni
Posts: 4
Joined: Mon Jul 09, 2012 4:28 pm

Re: How to find duplicate paragraphs?

Post by NealRanzoni »

I would be the interested one, lol. I will give this a quick shot and then move on to the guys' posts with the code in them. I am dreading starting from scratch, but the book is published and OMG if someone buys it before I get it fixed.

I will try both, and if I cannot fix this disaster I created I will just start over on Sunday, when I am finally back at my desk full time.

Thanks to everyone who is working hard not to LOL@me and is shooting me ideas.
OpenOffice3.3
I know I am on Windows7
NealRanzoni
Posts: 4
Joined: Mon Jul 09, 2012 4:28 pm

Re: How to find duplicate paragraphs?

Post by NealRanzoni »

karolus and JohnV: guys, this may be beyond my comprehension until I have time to sit down (Sunday), but I see a website/program waiting to be made based on this. I am sure this would be a marketable tool.

I am strapped for sleep and computer time this week, but I will be figuring out the code from both posts this weekend. If I can get anything working by this weekend (well, early next week), I will be back to post the good or bad news.

Thanks again, Neal Ranzoni
OpenOffice3.3
I know I am on Windows7
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: How to find duplicate paragraphs?

Post by acknak »

The Word article RoryOF linked to won't work in Writer: it depends on being able to search across paragraph breaks, which Writer still can't do.

Did you try the method I outlined? It worked in a few minutes with the document I tested with; the main restriction is that you have to have duplicated paragraphs, and not smaller bits.
AOO4/LO5 • Linux • Fedora 23
RoryOF
Moderator
Posts: 35104
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: How to find duplicate paragraphs?

Post by RoryOF »

acknak wrote: The Word article RoryOF linked to won't work in Writer: it depends on being able to search across paragraph breaks, which Writer still can't do.
AltSearch will search across paragraphs, as far as I remember, so a version of the Word code ought to be possible from some of our programming geniuses; I don't expect it would be blindingly fast, and can already think of scenarios which might cause it to fail, such as highly repetitive lists. I gave the URL merely as an indication of _a_ method of tackling such a problem.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.5 LTS
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: How to find duplicate paragraphs?

Post by acknak »

AltSearch should handle that; right, I forgot about that.

At least as I understand from a quick look, the Word method only finds adjacent duplicates. If the duplicates are not together, they won't be found.
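
One way around the adjacency limit, which is my own suggestion and not something taken from the Word article, is to sort the paragraphs first so that identical texts end up next to each other. A minimal sketch, assuming the paragraphs are available as a Python list of strings; `duplicates_via_sorting` is a hypothetical name:

```python
def duplicates_via_sorting(paragraphs):
    """Return the original indices of paragraphs that repeat an earlier one.

    Sorting (text, index) pairs brings identical texts next to each other,
    so comparing each entry with its neighbour is enough afterwards.
    """
    ordered = sorted(
        (content, index) for index, content in enumerate(paragraphs) if content
    )
    repeats = []
    for (prev_text, _), (text, index) in zip(ordered, ordered[1:]):
        # Ties are ordered by index, so the later copy is the one reported.
        if text == prev_text:
            repeats.append(index)
    return sorted(repeats)


# Hypothetical sample: no two identical paragraphs are adjacent here.
sample = ["A", "B", "C", "A", "B"]
print(duplicates_via_sorting(sample))
```

The original indices are carried along through the sort, so the repeats can still be located in document order afterwards.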

The matrix approach I gave is limited to 1000 paragraphs, due to Calc's column limit, but that's still a decent-sized chunk of text.
AOO4/LO5 • Linux • Fedora 23
karolus
Volunteer
Posts: 1234
Joined: Sat Jul 02, 2011 9:47 am

Re: How to find duplicate paragraphs?

Post by karolus »

Hello

I tried to do this task with AltSearch, but without success... [maybe I overlooked something]...

Karo
Libreoffice 25.2… on Debian 13 (trixie) (on RaspberryPI5)
Libreoffice 25.8… flatpak on Debian 13 (Bookworm) (on RaspberryPI5)
jrkrideau
Volunteer
Posts: 3816
Joined: Sun Dec 30, 2007 10:00 pm
Location: Kingston Ontario Canada

Re: How to find duplicate paragraphs?

Post by jrkrideau »

This may sound stupid, but does anyone know if there is decent open-source plagiarism software out there?
See for example http://en.wikipedia.org/wiki/Plagiarism ... -documents

Neal's problem sounds analogous to finding plagiarized materials in an article or student essay.

http://www.grammarly.com/?q=plagiarism& ... QAodxzvveA seems something like what I mean, though it's not clear from the page whether it is meant to detect plagiarism or to help the copier avoid charges thereof. :)
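
In the same spirit, paragraphs that are almost but not exactly identical can be flagged with the `difflib` module from Python's standard library. This is only a rough sketch and not a plagiarism detector: the 0.9 threshold is an arbitrary assumption, `near_duplicates` is a hypothetical name, and comparing every pair may be slow on a document of a few hundred pages.

```python
from difflib import SequenceMatcher


def near_duplicates(paragraphs, threshold=0.9):
    """Return (i, j, ratio) for paragraph pairs whose similarity meets the threshold."""
    pairs = []
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if not paragraphs[i] or not paragraphs[j]:
                continue  # skip empty paragraphs
            ratio = SequenceMatcher(None, paragraphs[i], paragraphs[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 2)))
    return pairs


# Hypothetical sample: paragraphs 0 and 2 differ by a single word form.
sample = [
    "The quick brown fox jumps over the lazy dog.",
    "An unrelated paragraph about spreadsheets.",
    "The quick brown fox jumped over the lazy dog.",
]
for i, j, ratio in near_duplicates(sample):
    print(f"paragraphs {i} and {j} look alike (similarity {ratio})")
```

Lowering the threshold catches looser rewordings at the cost of more false alarms, so any hits would still need a manual look.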
LibreOffice 7.3.7. 2; Ubuntu 22.04