[Solved] Delete the paragraph not ending in punctuation

clausisme
Posts: 24
Joined: Fri Mar 28, 2014 3:23 pm

[Solved] Delete the paragraph not ending in punctuation

Post by clausisme »

I have a scanned book, and sometimes I get paragraph breaks where they shouldn't be.
How can I use Find & Replace to delete the paragraph marker from paragraphs that don't end in a punctuation mark?

Right now I can find them using

Code: Select all

\w$
and then manually delete them, but this problem happens on every single page of every book I scan. Is there any workaround?

Edit: And I guess a space should be put in place of the removed paragraph marker.
Last edited by clausisme on Mon Nov 19, 2018 11:15 am, edited 1 time in total.
OpenOffice 4.01 on Windows 7
Zizi64
Volunteer
Posts: 11359
Joined: Wed May 26, 2010 7:55 am
Location: Budapest, Hungary

Re: Delete the paragraph in sentences not ending in punctuat

Post by Zizi64 »

Please upload an ODF type sample file here.
Tibor Kovacs, Hungary; LO7.5.8 /Win7-10 x64Prof.
PortableApps/winPenPack: LO3.3.0-7.6.2;AOO4.1.14
Please, edit the initial post in the topic: add the word [Solved] at the beginning of the subject line - if your problem has been solved.
clausisme
Posts: 24
Joined: Fri Mar 28, 2014 3:23 pm

Re: Delete the paragraph in sentences not ending in punctuat

Post by clausisme »

It's really like an "Enter" hit in the middle of the sentence.
Attachments
sample.odt
(36.92 KiB) Downloaded 80 times
robleyd
Moderator
Posts: 5082
Joined: Mon Aug 19, 2013 3:47 am
Location: Murbko, Australia

Re: Delete the paragraph in sentences not ending in punctuat

Post by robleyd »

The AltSearch extension, Alternative dialog Find & Replace for Writer (AltSearch), will do what you need.
Search for
\>\s\p
and replace it with " " (a space character).
\> = end of a word
\s = any space character
\p = paragraph ending mark
Note that in the sample you provided most - not all - of the paragraph markers are preceded by a space. In some cases, you will have e.g. a comma, then space then end of para; along with other combinations so you'll have to do a few search/replace to get your document anywhere near normal. Not to mention the places where multiple spaces will be inserted. Welcome to the wonderful world of OCR!
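
For anyone doing the same cleanup outside Writer, here is a minimal Python sketch of the idea, assuming the OCR output has been saved as plain text with one paragraph per line. The function name is made up for illustration, and `\n` stands in for AltSearch's `\p`. Unlike the pattern above, it also accepts zero or several trailing spaces before the break, and it collapses the multiple spaces mentioned above.

```python
import re


def join_broken_paragraphs(text: str) -> str:
    """Join a line to the next one when it ends in a word character."""
    # (?<=\w) : the break must directly follow a letter, digit or underscore
    # [ \t]*  : optional trailing spaces or tabs before the break
    # \n      : the paragraph end (stand-in for AltSearch's \p)
    # (?=\S)  : only join when text follows immediately on the next line
    joined = re.sub(r"(?<=\w)[ \t]*\n(?=\S)", " ", text)
    # Collapse runs of spaces left behind by the OCR or by the join.
    return re.sub(r" {2,}", " ", joined)


if __name__ == "__main__":
    sample = "It was a dark \nand stormy night.\nThe rain fell\nin torrents.\n"
    print(join_broken_paragraphs(sample))
```

Lines ending in a comma or other punctuation are left alone on purpose, so those cases still need their own pass, and the result should be checked by eye.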

My personal preference is to scan/OCR and then use a regex capable text editor to tidy up the OCR output before transferring the text file to an application for final formatting - in my case usually Sigil for creating epub files.
Cheers
David
OS - Slackware 15 64 bit
Apache OpenOffice 4.1.15
LibreOffice 24.2.2.2; SlackBuild for 24.2.2 by Eric Hameleers
Zizi64
Volunteer
Posts: 11359
Joined: Wed May 26, 2010 7:55 am
Location: Budapest, Hungary

Re: Delete the paragraph in sentences not ending in punctuat

Post by Zizi64 »

 Edit: I was too slow 
You cannot replace the ENTER characters (paragraph ends) with the built-in F&R function. You can replace the SHIFT-ENTER characters (line breaks) only. This restriction is related to the maximum number of characters in a paragraph of an ODF type document.


In the attached document, some "regular" paragraphs end with a space character after the punctuation character. First remove those spaces by using the
search:

Code: Select all

[\.] $
replace:

Code: Select all

.
with the "Regular expressions" option.
Repeat this for each punctuation character used (?, !, ...).
You can use the built-in F&R function for this task.
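
Outside Writer, the per-character repetition can be folded into a single pass. This is only a minimal Python sketch, assuming plain text with one paragraph per line. The function name and the punctuation set are illustrative choices, not something from the attached document.

```python
import re

# Assumption: an illustrative set of closing punctuation; extend it as needed.
TRAILING_SPACE = re.compile(r"([.?!:;,])[ \t]+$", re.MULTILINE)


def strip_space_after_punctuation(text: str) -> str:
    """Remove spaces or tabs that follow punctuation at the end of a line."""
    # \1 puts back the punctuation character that was captured.
    return TRAILING_SPACE.sub(r"\1", text)


if __name__ == "__main__":
    print(repr(strip_space_after_punctuation("Is it so? \nYes. \nGood")))
```

With `re.MULTILINE`, `$` matches at the end of every line, which plays the role of the end-of-paragraph `$` in the F&R dialog.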


Install and use the Alternative Find and Replace (AltSearch) extension to replace the hard ENTERs that follow a space character.
Then you can use the "space at the end of the virtual paragraph" regular expression with the AltSearch extension:
search:

Code: Select all

 \p
replace it by a Space character.

Code: Select all

 
Last edited by Zizi64 on Mon Nov 19, 2018 11:25 am, edited 1 time in total.
clausisme
Posts: 24
Joined: Fri Mar 28, 2014 3:23 pm

Re: [Solved] Delete the paragraph not ending in punctuation

Post by clausisme »

Thank you both, that worked. I did not know about the Alternative Find and Replace (AltSearch) extension.
Find and replace is a bit confusing to me, as some regexes work while others don't. I'll look up Sigil.
RoryOF
Moderator
Posts: 34612
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: [Solved] Delete the paragraph not ending in punctuation

Post by RoryOF »

You should still check your document, as regular expressions, however used, can do funny things. I normally reformat an OCRed document as close as possible to the original (as a temporary measure), using Styles, and then check that the total page count is correct; if not, I look to see whether the page breaks fall in the correct places and examine the text to find the rogue breaks.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
clausisme
Posts: 24
Joined: Fri Mar 28, 2014 3:23 pm

Re: [Solved] Delete the paragraph not ending in punctuation

Post by clausisme »

This goes beyond my original problem, but can you guys give me an example of your book-scanning workflow? Like: I scan > I put it into this > I remove that, etc.
robleyd
Moderator
Posts: 5082
Joined: Mon Aug 19, 2013 3:47 am
Location: Murbko, Australia

Re: [Solved] Delete the paragraph not ending in punctuation

Post by robleyd »

"I'll look up Sigil."
Note that Sigil is a multi-platform EPUB ebook Editor, not a text editor. I only use Windows when I need to get screenshots or check how something behaves in that environment so I can't recommend a specific text editor; I'm sure that a number of people will have a suggestion here :D

Do a web search for Windows text editors, do a bit of reading, download a few and try them and see what works best for you!

As for the process of fixing OCR - I usually do the scan/OCR and then look to see what the current issues are. You can be sure that you will have to replace unwanted line breaks, and multiple spaces. The rest will vary.
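
To get a quick picture of "what the current issues are" before fixing anything, a small script can count the usual suspects. This is only a sketch, assuming the OCR output is plain text with one paragraph per line. The function name and the three categories are my own illustration, not a complete list.

```python
import re
from collections import Counter


def survey_ocr_issues(text: str) -> Counter:
    """Count lines showing some common OCR layout problems."""
    counts = Counter()
    for line in text.split("\n"):
        stripped = line.rstrip()
        if not stripped:
            continue  # skip empty lines
        if stripped != line:
            counts["trailing whitespace"] += 1
        if re.search(r"\w$", stripped):
            # Same idea as the \w$ search from the first post.
            counts["ends without punctuation"] += 1
        if re.search(r" {2,}", stripped):
            counts["multiple spaces"] += 1
    return counts


if __name__ == "__main__":
    sample = "It was a dark \nand stormy  night.\nThe end.\n"
    for issue, n in survey_ocr_issues(sample).items():
        print(f"{issue}: {n}")
```

The counts only tell you where to look; headings and list items also end without punctuation, so they will show up as false positives.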
Cheers
David
RoryOF
Moderator
Posts: 34612
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: [Solved] Delete the paragraph not ending in punctuation

Post by RoryOF »

Scanning is simple, in the sense that you are taking a picture of each page, and the output from a scanner is simply that - a picture. Scanning and OCRing is the process of turning a document, such as a book, into text that one can edit in OpenOffice or some other program. To do this, one needs a scanner and an extra program for OCR (Optical Character Recognition). Sometimes (perhaps rarely now, but often years ago) these come on a CD with the scanner, but more frequently now they are commercial applications such as OmniPage, ABBYY FineReader, etc.

Basically one scans the book page by page, into a format that the OCR program accepts (often .TIF, sometimes .PDF, but read the instructions). Then feed the files from the scanner into the OCR program; this will attempt to read them, producing (read the instructions once again) a .doc, an HTML, an .odt or other format file of text. This file may contain the text on a page-by-page basis or the entire text. There will be OCR errors, so you will have to edit the text when you give the file to your word processor, but many OCR errors ought to be picked up by spellcheck.

When scanning the text, it can be helpful to define a mask for the scan area so that page headers and footers are not scanned. This can instead be done at the OCR stage. If scanning a book, I normally scan in (say) 40-page sessions until the book is scanned, then move to the OCR stage. Ideally one would chop the spine off the book, feed the paper stack into an automatic document feeder capable of feeding double-sided documents, let it scan the entire stack, then feed the output file into the OCR application. But I cannot ever bring myself to chop up a book!

Start doing this by working on merely a few pages of a book repeatedly until you develop a workflow. You will still have much editing to be done on a full book.