[Solved] Tidying OCRed text

RoryOF
Moderator
Posts: 34610
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland


Post by RoryOF »

Having OCRed a book and sorted most of the formatting problems with a few Find and Replace passes, I am left with one remaining problem: the transition from one page to the next frequently leaves the first line of the new page as a separate paragraph that starts with a lowercase character. I can find these using OO's Find and Replace with a regular expression:

Find ^[:lower:], Match case checked, More options, Regular Expressions checked

Is there any way, using either Find and Replace or AltSearch, that I can Replace with
<space><found lower char>

that is, omitting the paragraph mark, replacing it with a space and the found lowercase character.

I can and have done such replacements by hand in the past; I ask out of curiosity.

Rory
Last edited by MrProgrammer on Sat Apr 08, 2023 3:53 pm, edited 1 time in total.
Reason: Tagged ✓ [Solved] -- MrProgrammer, forum moderator
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
esperantisto
Volunteer
Posts: 578
Joined: Mon Oct 08, 2007 1:31 am

Re: Tidying OCRed text

Post by esperantisto »

Take a look at OOoFBTools, its Join broken lines/paragraphs feature.

P. S. Not an answer to your question, though.
AOO 4.2.0 (of 2015) / LO 7.x / Win 7 / openSUSE Linux Leap 15.4 (64-bit)
Villeroy
Volunteer
Posts: 31279
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Tidying OCRed text

Post by Villeroy »

With match case and regex turned on:
Search: ^([:lower:])
Replace: _$1
where _ is a literal space
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice

Re: Tidying OCRed text

Post by RoryOF »

@Villeroy: that replaces with the space and the found character, but leaves the paragraph mark in position. The () brackets gave me the reference to the found [:lower:] character, which I was lacking.
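For anyone wanting to see why the paragraph mark survives, here is a minimal Python sketch of the same search. It is only an analogue, not what OpenOffice runs internally: a newline stands in for the paragraph mark, and `[a-z]` stands in for `[:lower:]` (Python's `re` has no POSIX classes, so this assumes plain ASCII text). The point is that `^` is a zero-width anchor: it matches a position, not the break before it, so the replacement never touches the break.

```python
import re

# Two "paragraphs"; the second starts with a lowercase letter.
text = "The end of one page\ncontinues on the next page."

# Analogue of Search: ^([:lower:])  Replace: <space>$1
# (?m) lets ^ match at the start of every line; ^ consumes no characters.
result = re.sub(r"(?m)^([a-z])", r" \1", text)

# The newline is still there, now followed by the inserted space.
print(repr(result))
```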

@esperantisto: I'll look at OOoFBTools later - going out to do Friday things now.

I can see a way of doing this with three F&R passes (I think); I'll come back with that later.

Re: Tidying OCRed text

Post by RoryOF »

Subject to testing on a large file, here is a method in three passes:

Find $ Replace %%%% More options: Regular expressions checked. Replace All

Find %%%%([:lower:]) Replace <space>$1 More options: Regular expressions checked, Match case checked. Replace All (Match case checked is important!)

Find %%%% Replace \n More options: Regular expressions checked. Replace All

%%%% is some character or sequence of characters that does not occur in the text. <space> is a literal space character.
Edit: Tested on a 115K word file. It seems to work correctly, subject to checking during proofreading and final layout.
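The three passes can also be sketched outside the office suite, for example in Python on an exported plain-text file. This is a rough illustration under stated assumptions, not a tested replacement for the Find and Replace method: each paragraph is assumed to be one line ending in a newline, `[a-z]` stands in for `[:lower:]` (ASCII only), and the names `PLACEHOLDER` and `join_broken_paragraphs` are made up for this example.

```python
import re

# Assumed not to occur anywhere in the text, as with %%%% above.
PLACEHOLDER = "%%%%"

def join_broken_paragraphs(text: str) -> str:
    # Pass 1: turn every paragraph break into the placeholder.
    text = text.replace("\n", PLACEHOLDER)
    # Pass 2: placeholder followed by a lowercase letter becomes
    # a space plus that letter (the search is case-sensitive).
    text = re.sub(re.escape(PLACEHOLDER) + r"([a-z])", r" \1", text)
    # Pass 3: restore the remaining placeholders to paragraph breaks.
    return text.replace(PLACEHOLDER, "\n")

sample = "The line breaks at the page\nboundary here.\nA new paragraph starts."
joined = join_broken_paragraphs(sample)
print(joined)
```

In Python the placeholder is not strictly needed, because the break itself can be matched: `re.sub(r"\n([a-z])", r" \1", text)` should give the same result in one pass. The three-pass form is kept here only because it mirrors the steps above.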