[Solved] OCR troubles: spaces showing up when justified.

Darren N
Posts: 40
Joined: Mon Jan 12, 2015 3:09 pm

[Solved] OCR troubles: spaces showing up when justified.

Post by Darren N »

I am correcting an OCRed novel from the 1800s.

It is going to be published with justified text. However, when I click the justify icon, I get these gaps at the end of some lines every few pages. To fix one, I press the Delete key followed by the space bar once, and the gap is gone.

Is there any way to detect these, instead of hitting "Delete" then "space" at the end of every sentence?
Attachments
5555555555555555555555555555555.jpg
Last edited by robleyd on Fri Sep 03, 2021 4:05 am, edited 1 time in total.
Reason: Tag [Solved]
OpenOffice 4.1.1 on Windows 7
robleyd
Moderator
Posts: 5036
Joined: Mon Aug 19, 2013 3:47 am
Location: Murbko, Australia

Re: OCR troubles: spaces showing up when format is justified

Post by robleyd »

Possibly your OCR program has inserted line breaks; turn on View | Nonprinting Characters or use keyboard shortcut Ctrl+F10. If you see a left hooked arrow, as in the image below, this is the case.

You can use Find/Replace to replace the line feeds with e.g. a space. Open Find/Replace; in the Find field put \n, in the Replace field a space, or whatever character you want. Click on More Options and make sure Regular Expressions is checked.
Attachments
oo_linebreak.png
Cheers
David
OS - Slackware 15 64 bit
Apache OpenOffice 4.1.15
LibreOffice 24.2.1.2; SlackBuild for 24.2.1 by Eric Hameleers
RoryOF
Moderator
Posts: 34570
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: OCR troubles: spaces showing up when format is justified

Post by RoryOF »

An OCR scan usually keeps each line as a separate paragraph, terminating it with a pilcrow (the backwards P character); turn on /View /Nonprinting characters to see this. You have two choices: either reprint the document in facsimile (i.e., without OCRing it), or remove the intrusive paragraph marks and reflow the text.

In the sample you post, the paragraphs seem to be marked by blank paragraphs.

What I suggest is:

Position the cursor at the top of the file and do a Find and Replace for empty paragraphs (Find ^$, Replace %%%%), having opened "More options" and selected "Regular expressions". Press Replace All.

Now Find $ and Replace with <space>, with "Regular expressions" still checked; press Replace All.

Then set it to Find %%%% %%%% and replace with %%%% (Replace All) (Regular Expressions unchecked). Repeat that Find and Replace until it finds no more of that pattern.

Then set it to Find " %%%%" (leading space as I could see onscreen the %%%% had a leading space) and Replace with \n (Regular Expressions checked).
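
For anyone who would rather try the logic on a plain-text export first, here is a rough Python sketch of the same four passes. It is only an illustration under assumptions: the function name reflow is made up, each OCR line is taken to be one line of text, blank lines are taken to separate the real paragraphs, and %%%% is assumed not to occur in the book. It does not touch Writer itself.

```python
MARKER = "%%%%"  # placeholder from the steps above; assumed absent from the text


def reflow(text: str) -> str:
    """Join OCR lines into paragraphs, using blank lines as paragraph breaks."""
    lines = text.split("\n")

    # Pass 1: replace every empty paragraph (blank line) with the marker.
    lines = [MARKER if line.strip() == "" else line for line in lines]

    # Pass 2: replace every remaining paragraph end with a single space.
    joined = " ".join(lines)

    # Pass 3: collapse runs of markers; one pass is not always enough,
    # so repeat until the pattern is no longer found.
    doubled = f"{MARKER} {MARKER}"
    while doubled in joined:
        joined = joined.replace(doubled, MARKER)

    # Pass 4: turn each remaining marker into a paragraph break. Splitting on
    # the marker and stripping stray spaces has the same effect as replacing
    # " %%%%" with a paragraph break, and also copes with a marker at the
    # very start or end of the text.
    paragraphs = [part.strip() for part in joined.split(MARKER)]
    return "\n".join(p for p in paragraphs if p)


sample = "It was a dark\nand stormy night.\n\n\nThe rain fell\nin torrents.\n"
print(reflow(sample))
```

The sample prints two reflowed paragraphs, one per line. Any double spaces left inside a paragraph are deliberately not handled here.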

You might also alter the Preformatted Text paragraph style to have an automatic indent at the start of each paragraph.

To do all this for an OCRed "War and Peace" (550,000 words) takes less than ten minutes. Do not interrupt the Find and Replace processes; wait for the message telling how many replacements were made, as that indicates the current pass is finished.

There is a macro somewhere on the Forum to do this, but I have never used it; the above sequences work for me.


I strongly suggest you do this on a copy of the file. You may also need to Find <space><space> and Replace with <space> repeatedly until no more extraneous spaces are found.

When I use <space> above, I mean a single press of the spacebar.
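
Why the double-space replace has to be repeated can be seen in a small Python sketch (plain strings only, nothing to do with Writer's own dialog; both function names are made up for illustration). A single pass halves a run of spaces rather than removing it, so long runs need several passes, whereas a regular expression that matches two or more spaces does the job in one.

```python
import re


def collapse_spaces_loop(text: str) -> tuple[str, int]:
    """Replace two spaces with one, repeating until none are left.

    Returns the cleaned text and the number of passes that were needed.
    """
    passes = 0
    while "  " in text:
        text = text.replace("  ", " ")
        passes += 1
    return text, passes


def collapse_spaces_regex(text: str) -> str:
    """Collapse any run of two or more spaces to one space in a single pass."""
    return re.sub(r" {2,}", " ", text)


cleaned, passes = collapse_spaces_loop("end of line     next line")
print(cleaned, passes)
print(collapse_spaces_regex("end of line     next line"))
```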
Edit: When carrying out any serious editing/formatting, it is useful to have /View /Nonprinting characters turned on.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Darren N
Posts: 40
Joined: Mon Jan 12, 2015 3:09 pm

Re: OCR troubles: spaces showing up when format is justified

Post by Darren N »

Thank you.

All is fixed. :)
OpenOffice 4.1.1 on Windows 7