Having OCRed a book and sorted most of the formatting problems with a few Find and Replace passes, I am left with one remaining problem: the transition from one page to another frequently leaves the first line of the next page starting with a lowercase character. These I can find using OO's Find and Replace and a regular expression:
Find ^[:lower:], Match case checked, More options, Regular Expressions checked
Is there any way, using either Find and Replace or AltSearch, that I can Replace with
<space><found lower char>
that is, omitting the paragraph mark, replacing it with a space and the found lowercase character.
I can and have done such replacements by hand in the past; I ask out of curiosity.
Rory
[Solved] Tidying OCRed text
Last edited by MrProgrammer on Sat Apr 08, 2023 3:53 pm, edited 1 time in total.
Reason: Tagged ✓ [Solved] -- MrProgrammer, forum moderator
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Re: Tidying OCRed text
Take a look at OOoFBTools, its Join broken lines/paragraphs feature.
P. S. Not an answer to your question, though.
AOO 4.2.0 (of 2015) / LO 7.x / Win 7 / openSUSE Linux Leap 15.4 (64-bit)
Re: Tidying OCRed text
With match case and regex turned on:
Search: ^([:lower:])
Replace: _$1
where _ is a literal space
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice
Re: Tidying OCRed text
@Villeroy: that replaces with the space and the found character, but leaves the paragraph mark in position. The () brackets gave me the found [:lower:] parameter, which I was lacking.
@esperantisto: I'll look at OOoFBTools later - going out to do Friday things now.
I can see a way of doing this with three F&R passes (I think); I'll come back with that later.
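To illustrate why the paragraph mark survives, here is a minimal Python sketch (not OpenOffice itself). It assumes paragraphs are separated by "\n" and uses ASCII [a-z] as a stand-in for [:lower:]; both function names are made up for the example. The `^` anchor only marks a position, so the break before it is never part of the match; the break goes away only when the pattern itself consumes it.

```python
import re

def prefix_space_only(text: str) -> str:
    # Mimics Search ^([:lower:]) / Replace <space>$1:
    # ^ matches a position, so the preceding break is left untouched.
    return re.sub(r"^([a-z])", r" \1", text, flags=re.MULTILINE)

def join_across_break(text: str) -> str:
    # Here the break itself is part of the match, so it is replaced too.
    return re.sub(r"\n([a-z])", r" \1", text)

sample = "End of one page\nstart of the next."
print(repr(prefix_space_only(sample)))
print(repr(join_across_break(sample)))
```

The first call keeps the break and merely indents the second line; the second call joins the two lines with a single space.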
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Re: Tidying OCRed text
Subject to testing on a large file, here is a method:
Find $ Replace %%%% More options: Regular expressions checked. Replace All.
Find %%%%([:lower:]) Replace <space>$1 More options: Regular expressions checked, Match case checked. Replace All. (Match case checked is important!)
Find %%%% Replace \n More options: Regular expressions checked. Replace All.
%%%% is some character or sequence of characters that does not occur in the text. <space> is a literal space character.
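The three passes above can be sketched outside the word processor as well. This is only a rough Python model, not what OpenOffice does internally: it assumes paragraphs are separated by "\n", uses ASCII [a-z] in place of [:lower:], and the function name is hypothetical. As I understand Writer's regex behaviour, $ on its own in the Find box matches the paragraph break and \n in the Replace box inserts one, which is what passes 1 and 3 rely on.

```python
import re

MARKER = "%%%%"  # placeholder; must not occur anywhere in the text

def join_broken_paragraphs(text: str, marker: str = MARKER) -> str:
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another")
    # Pass 1: turn every paragraph break into the marker.
    joined = text.replace("\n", marker)
    # Pass 2: a marker followed by a lowercase letter becomes space + letter.
    # The match is case-sensitive, like "Match case checked".
    joined = re.sub(re.escape(marker) + r"([a-z])", r" \1", joined)
    # Pass 3: every remaining marker goes back to a paragraph break.
    return joined.replace(marker, "\n")

sample = "The quick brown\nfox jumped.\nNext Chapter\nbegins here."
print(join_broken_paragraphs(sample))
```

Breaks followed by an uppercase letter are restored unchanged in pass 3, which is why the case-sensitive match in pass 2 matters.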
Edit: Tested on a 115K word file. Seems to work correctly subject to checking on proofreading and final layout.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS