[Solved] Width of text on the page

wildagain
Posts: 8
Joined: Fri Oct 08, 2021 7:27 pm

Post by wildagain »

I have transferred text from an OCR application to Open Office. It was text in a newspaper column and that is how it came out in Writer. I have searched for hours to find out how to convert this text into normal regular page width. I have been using word processors for 40 years starting with Wordstar and I can't think of anything more fundamental than the width of text lines.
Last edited by wildagain on Sat Oct 09, 2021 10:26 pm, edited 1 time in total.
Open Office 4.1.8 on Windows 10
John_Ha
Volunteer
Posts: 9583
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Re: width of text on the page

Post by John_Ha »

View > Non Printing characters ..., or click the ¶ (pilcrow) icon. Delete the End of paragraph or newline markers at the end of each line.
If that doesn't fix it upload a small file showing the problem so that it can be analysed. Press POSTREPLY and click the Upload attachment tab below where you type (128 kB max); or use a file share site such as mediafire, Dropbox or Google Drive for a larger file.

Showing that a problem has been solved helps others searching so, if your problem is now solved, please view your first post in this thread and click the Edit button (top right in the post) and add [Solved] in front of the subject.
LO 6.4.4.2, Windows 10 Home 64 bit

See the Writer Guide, the Writer FAQ, the Writer Tutorials and Writer for students.

Remember: Always save your Writer files as .odt files. - see here for the many reasons why.
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: width of text on the page

Post by RoryOF »

Enable /View /Formatting marks (these are non-printing). OCR normally terminates each line with a paragraph mark; you have to edit these out so that one line runs into the next and the text can fit the wider measure. After OCR, real paragraph ends usually show up as two empty paragraphs. I cope this way:

I search for empty paragraphs:

Find ^$, Replace %%%%, drop More options, select Regular expressions. Press the Replace All button.

Then find the ends of the short lines:

Find $, Replace <space character> (press the space bar). Leave More options, Regular expressions selected. Once again press the Replace All button.

Now turn the %%%% chars back into paragraph marks:

Find %%%%, Replace \n (leave More options, Regular expressions selected). Once again press the Replace All button.

You now should be nearer what you need.

As the More options setting can be sticky, it is best to unset it so that F&R works on ordinary text.
Edit: I think there is a macro to do this task; I always use the F&R method above, as it is quick and I can control it for unusual formatting. I can do all of War and Peace in less than ten minutes.
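For anyone who wants to see the logic of the three passes outside Writer, here is a minimal Python sketch. It is an illustration only, not Writer's own regular-expression engine: it assumes the OCR text is available as a plain string with one OCR line per text line, and the function name `rejoin_lines` is made up for this example. Unlike the Find & Replace steps, it collapses a run of empty lines into a single paragraph break.

```python
import re

PLACEHOLDER = "%%%%"  # same marker as in the steps above; must not occur in the text itself


def rejoin_lines(text: str) -> str:
    """Join short OCR lines into paragraphs, using empty lines as paragraph boundaries."""
    lines = text.split("\n")
    # Pass 1: mark every empty line (a real paragraph boundary) with the placeholder.
    marked = [PLACEHOLDER if line.strip() == "" else line.strip() for line in lines]
    # Pass 2: replace every remaining line end with a single space.
    joined = " ".join(marked)
    # Pass 3: turn each run of placeholders back into one paragraph break.
    pattern = r"\s*(?:" + re.escape(PLACEHOLDER) + r"\s*)+"
    return re.sub(pattern, "\n", joined).strip()


if __name__ == "__main__":
    sample = "first line\nof para one.\n\n\nsecond para\ncontinues here.\n"
    print(rejoin_lines(sample))
```

The placeholder trick is the important part: the real boundaries are protected before every line end is flattened, then restored afterwards.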
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
wildagain
Posts: 8
Joined: Fri Oct 08, 2021 7:27 pm

Re: width of text on the page

Post by wildagain »

Surely I don't have to do that manually for every line? Here is a section of the file that came from the OCR.
Attachments
openoffice writer.odt
Open Office 4.1.8 on Windows 10
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: width of text on the page

Post by RoryOF »

The method I described using $ on its own worked, as you will see in the attached file.
Attachments
openoffice writer altered.odt
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Bill
Volunteer
Posts: 8932
Joined: Sat Nov 24, 2007 6:48 am

Re: width of text on the page

Post by Bill »

The problem isn't "width of text". That is set by the margin and indent settings. The problem is that the OCR is terminating each line with a paragraph break before reaching the margin. That is what starts a new line before the current line is filled. To fix that, the paragraph breaks which don't actually start a new paragraph must be removed.

The sample doesn't show any pattern that can be used to reliably automate the process. The sample is one long sentence in one paragraph. Where does a real new paragraph actually start?

If the OCR separated real paragraphs by inserting an empty paragraph where a new paragraph should start, then Alt-Search's batch process "Join paragraphs non separated by empty paragraphs" could be used to join the one-line paragraphs into real paragraphs.
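The idea of joining only the lines that are not separated by an empty paragraph can be sketched in Python as follows. This is not AltSearch's code, whose internals are not shown in this thread; it is a hedged illustration of the same rule, with the hypothetical function name `join_unseparated`, and it assumes the text has already been split into a list of lines.

```python
from itertools import groupby


def join_unseparated(lines):
    """Merge consecutive non-empty lines; empty lines act as paragraph separators."""
    paragraphs = []
    # groupby yields alternating runs of blank and non-blank lines.
    for is_blank, group in groupby(lines, key=lambda line: line.strip() == ""):
        if not is_blank:
            paragraphs.append(" ".join(part.strip() for part in group))
    return paragraphs


if __name__ == "__main__":
    print(join_unseparated(["a", "b", "", "c", "d"]))  # two paragraphs
    print(join_unseparated(["x", "y", "z"]))           # no empty lines: one paragraph
```

Note what happens when the input has no empty lines at all: everything ends up in a single paragraph, so the separators have to exist before the join is run.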
AOO 4.1.14 on Ubuntu MATE 22.04
wildagain
Posts: 8
Joined: Fri Oct 08, 2021 7:27 pm

Re: width of text on the page

Post by wildagain »

Thank you for that. I see that the culprit is the OCR. Is there something you have to do with OCR'd text before you save it to Writer?

But am I missing something here? You have a block of text and you want to widen it on the page. Can't you "select all" and then move the right border to the required width, which would spread the text to normal page width? Intuitively it should recognize that you ignore the line breaks left by the OCR process. Surely you don't have to write code to accomplish this. You can work the other way with columns by taking a full page and reducing it to columns. Why can't you do it as easily in reverse?
Open Office 4.1.8 on Windows 10
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: width of text on the page

Post by RoryOF »

Turn on /View /Formatting marks to see that OCR treats each line as a paragraph. You have to amalgamate these "short" paragraphs into longer paragraphs using the methods I explained earlier, or the method Bill suggested.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Bill
Volunteer
Posts: 8932
Joined: Sat Nov 24, 2007 6:48 am

Re: width of text on the page

Post by Bill »

wildagain wrote:You can work the other way with columns by taking a full page and reducing it to columns. Why can't you do it as easily in reverse?
No, that's not doing it "in reverse". "In reverse" would be inserting paragraph breaks in each line to shorten the lines. Adding columns does not insert paragraph breaks; it moves the right margin, which causes Writer to rewrap the text at the new margin automatically.
AOO 4.1.14 on Ubuntu MATE 22.04
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Width of text on the page

Post by JeJe »

The attached document is the same as the one you posted, but with a button added that runs a macro which
-creates a new document and copies everything over
-starts a new paragraph only when a line ends in . ? or !
-merges lines that end in a hyphen with the next line, dropping the hyphen.

Code is this:

Sub Main
	' Walk through every paragraph of the current document and copy its text into a new Writer document.
	en = ThisComponent.Text.createEnumeration()
	newDoc = StarDesktop.loadComponentFromURL("private:factory/swriter", "_blank", 0, Array())
	newText = newDoc.Text
	Do While en.hasMoreElements()
		oPar = en.nextElement()
		st = oPar.getString()
		endchar = Right(st, 1)

		If endchar = "." Or endchar = "?" Or endchar = "!" Then
			' Sentence end: copy the line, then start a new paragraph.
			newText.insertString(newText.getEnd(), st, False)
			newText.insertControlCharacter(newText.getEnd(), com.sun.star.text.ControlCharacter.PARAGRAPH_BREAK, False)
		ElseIf endchar = "-" Then
			' Hyphenated line end: drop the hyphen and join directly to the next line.
			st = Left(st, Len(st) - 1)
			newText.insertString(newText.getEnd(), st, False)
		Else
			' Ordinary line end: join to the next line with a single space.
			newText.insertString(newText.getEnd(), st & " ", False)
		End If
	Loop
End Sub
Edit: Your text looks to have problems beyond the line endings though...
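The rule the macro applies can also be tried outside the office suite. The following is a rough Python sketch of the same heuristic, not a translation of the macro's UNO calls; `merge_ocr_lines` is a name invented for this example, and the input is assumed to be a list of OCR lines.

```python
SENTENCE_END = (".", "?", "!")


def merge_ocr_lines(lines):
    """Rebuild paragraphs: break after sentence ends, drop line-end hyphens, else join with a space."""
    paragraphs = []
    current = ""
    for line in lines:
        line = line.strip()
        if line.endswith(SENTENCE_END):
            # Sentence end: close the current paragraph.
            paragraphs.append(current + line)
            current = ""
        elif line.endswith("-"):
            # Word split across two lines: remove the hyphen and add no space.
            current += line[:-1]
        elif line:
            # Ordinary line: continue the paragraph with a single space.
            current += line + " "
    if current.strip():
        # Text left over after the last sentence end.
        paragraphs.append(current.strip())
    return paragraphs


if __name__ == "__main__":
    for paragraph in merge_ocr_lines(["Situated on a slope of the", "southern tip of Ross Is-", "land.", "Next para", "ends here!"]):
        print(paragraph)
```

One limitation carries over from the heuristic itself: a genuine hyphenated compound that happens to fall at a line end loses its hyphen, so the result still needs a read-through.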
Attachments
openoffice writer2.odt
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
wildagain
Posts: 8
Joined: Fri Oct 08, 2021 7:27 pm

Re: Width of text on the page

Post by wildagain »

Yes, I should have used a cleaner sample. That was the OCR dealing unsuccessfully with two columns. This excerpt has the appropriate paragraph breaks. Could you walk me through the process of turning it into a full page width format?


McMURDO SOUND, Antarctica
—(UPI) — Apart from the cold,
the most striking impression on
a newcomer to the Antarctic is
made by the continent's desolate
but magnificent scenic beauty.
The area around this U.S. Navy
air base on McMurdo Sound,
about 800 miles from the South
Pole, is a fitting example of this.
The base was built early in 1956
to provide air support for the
construction and maintenance of
U. S. International Geophysical
Year science b a s e s scattered
throughout the Antarctic. It has
been continuing in this role since
IGY ended Dec. 31 and is now
part of the U. S. Antarctic Research
Program.
Situated on a s l o p e of the
southern tip of Ross Island, an
ice-locked mound of volcanic ash
which overlooks McMurdo Sound,
the base is a cluster of about 40
huts of various sizes including a
chapel, and surrounded by supply
dumps.
At the northern end of the island,
looming over the base, are
two huge volcanoes, 13,350-foot
Mount Erebus, believed to be Antarctica's
only active volcano, and,
next to it, extinct Mount Terror,
some 10,000 feet tall. A "smaller"
mountain 8,000-foot Mount Terra
Nova, sits between the taller peaks
and, like them, is almost entirely
covered by ice and snow.
Erebus is known as "the Fujiyama
of Antarctica" and is 400
feet taller than the famous Japanese
volcano. A cloud-like plume
of steam is usually seen pouring
from the crater.
Opposite the b a s e, about 30
miles across the ice on the other
side of the Sound, is another volcano,
9,000-foot Mount Discovery
which is comparatively free ol
snow. Several long islands, with
hills rising to more than 3,000
feet, surround Discovery.
Open Office 4.1.8 on Windows 10
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Width of text on the page

Post by JeJe »

Your new sample, when copied, has every line ending in a line feed character, which is different from a paragraph break. That could be the website changing it.

If my sample is any good, replace the text with whatever you want converted and push the button - macros will have to be enabled for the document.
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Width of text on the page

Post by JeJe »

A slightly different macro would only start a new paragraph when the line ends in a sentence end character (.?!) and is also no longer than a certain length, which I've chosen from your sample as 25 characters. The line
Program.
is obviously followed by a new paragraph as it's short, whereas
Blah blah blah blah blah blah program.
which is about as long as the other lines, is ambiguous: it might be a paragraph end or not, and that can't be resolved automatically.

(You can change the 25 to a different number to suit)


Sub Main
	' Same as the first macro, except that a paragraph break is only inserted after a short line.
	en = ThisComponent.Text.createEnumeration()
	newDoc = StarDesktop.loadComponentFromURL("private:factory/swriter", "_blank", 0, Array())
	newText = newDoc.Text
	Do While en.hasMoreElements()
		oPar = en.nextElement()
		st = oPar.getString()
		endchar = Right(st, 1)

		If (endchar = "." Or endchar = "?" Or endchar = "!") And Len(st) <= 25 Then
			' Short line ending a sentence: treat it as the last line of a paragraph.
			newText.insertString(newText.getEnd(), st, False)
			newText.insertControlCharacter(newText.getEnd(), com.sun.star.text.ControlCharacter.PARAGRAPH_BREAK, False)
		ElseIf endchar = "-" Then
			' Hyphenated line end: drop the hyphen and join directly to the next line.
			st = Left(st, Len(st) - 1)
			newText.insertString(newText.getEnd(), st, False)
		Else
			' Any other line, including long lines that end a sentence: join with a single space.
			newText.insertString(newText.getEnd(), st & " ", False)
		End If
	Loop
End Sub
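To get a feel for how the 25-character threshold behaves on the sample lines, the test can be isolated in a few lines of Python. This is a sketch under assumptions: the helper names `is_paragraph_end` and `merge_short_endings` are invented here, hyphen handling is left out for brevity, and the lines are a shortened selection from the sample posted earlier in the thread.

```python
def is_paragraph_end(line: str, max_len: int = 25) -> bool:
    """True when a line ends a sentence and is short enough to be a paragraph's last line."""
    return line.endswith((".", "?", "!")) and len(line) <= max_len


def merge_short_endings(lines, max_len=25):
    """Join lines into paragraphs, breaking only after short sentence-ending lines."""
    paragraphs, parts = [], []
    for line in (raw.strip() for raw in lines):
        if not line:
            continue
        parts.append(line)
        if is_paragraph_end(line, max_len):
            paragraphs.append(" ".join(parts))
            parts = []
    if parts:
        # Trailing lines that never reached a short sentence end.
        paragraphs.append(" ".join(parts))
    return paragraphs


if __name__ == "__main__":
    sample = [
        "about 800 miles from the South",
        "Pole, is a fitting example of this.",
        "The base was built early in 1956",
        "part of the U. S. Antarctic Research",
        "Program.",
        "Situated on a slope of the",
        "dumps.",
    ]
    for paragraph in merge_short_endings(sample):
        print(paragraph)
```

With the default threshold, the long line ending in "of this." does not trigger a break while "Program." does, which is exactly the trade-off described above: raising `max_len` catches more real paragraph ends but also more false ones.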
Attachments
openoffice writer3.odt
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
wildagain
Posts: 8
Joined: Fri Oct 08, 2021 7:27 pm

Re: Width of text on the page

Post by wildagain »

Thank you for your help.
But as I indicated above (and not wanting to look a gift horse in the mouth, since it's a free app), I am surprised that something so fundamental is so difficult and almost requires writing code to process. Surely programmers could come up with a routine whereby when someone blocks off text and wants to widen it to normal page width, the app would take a look at the "carriage returns" at the end of each line, remove them and then wrap the text to the wider format. Can you explain why that can't be done?
Anyway, I'm afraid I'm going to have to give up the OCR and type the whole thing myself to get the format I require. Thanks again
Open Office 4.1.8 on Windows 10
Bill
Volunteer
Posts: 8932
Joined: Sat Nov 24, 2007 6:48 am

Re: Width of text on the page

Post by Bill »

wildagain wrote:Surely programmers could come up with a routine whereby when someone blocks off text and wants to widen it to normal page width, the app would take a look at the "carriage returns" at the end of each line, remove them and then wrap the text to the wider format. Can you explain why that can't be done?
It has already been done. Install the AltSearch extension mentioned in my first post. You can then select the text you want in one paragraph and run the batch process in AltSearch to remove the extra paragraph breaks from the selected text. The text will then wrap to the page margin. Your sample already has the normal wider text area.

Or, instead of retyping the document, go through the OCR document and insert empty paragraphs where you want to start a new paragraph, then run the batch process in AltSearch.
AOO 4.1.14 on Ubuntu MATE 22.04
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Width of text on the page

Post by RoryOF »

I'd insert empty paragraphs by hand, if the OCR process didn't do so, then use the full method I described. Because you post your sample inline, we cannot tell for certain how the lines are ended as the Forum software may tweak them. If you attach a sample file, as a file, not an image, we would know for certain what we were dealing with.

When /View /Formatting marks is enabled, line breaks show up in OpenOffice as a left-pointing hooked arrow; paragraph marks show up as a pilcrow (a backwards P character).
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Width of text on the page

Post by JeJe »

wildagain wrote:Anyway, I'm afraid I'm going to have to give up the OCR and type the whole thing myself to get the format I require. Thanks again
Why... what's wrong with the approaches here including mine where you just have to click a button?

The problem is your OCR... it doesn't output in the format you want. Solving it isn't completely simple; it needs a tiny bit of adaptation... a general word processor can't have a ready-made solution for every format problem... that's why this one has a programming language, so people can write one.
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
Hagar Delest
Moderator
Posts: 32627
Joined: Sun Oct 07, 2007 9:07 pm
Location: France

Re: Width of text on the page

Post by Hagar Delest »

wildagain wrote:I have been using word processors for 40 years starting with Wordstar and I can't think of anything more fundamental than the width of text lines.
I'm baffled that after 40 years using word processors you don't grasp the mechanics behind the line feed or carriage return.
Some people have come up with code for what you want, so why not use it?
The difficulty in the code is spotting where the true paragraph breaks are, especially when the line that ends a paragraph is almost the same length as a line with a mere line feed that is followed by a sentence belonging to the same paragraph. This is no trivial operation at all.

Please add [Solved] at the beginning of the title in your first post (top of the topic) with the *EDIT button if your issue has been fixed.
LibreOffice 7.6.2.1 on Xubuntu 23.10 and 7.6.4.1 portable on Windows 10
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Width of text on the page

Post by RoryOF »

Making the assumption that a paragraph begins with a Capital letter which starts a line, one can insert an empty paragraph before this with Find and Replace using a Regular Expression. Then the various methods detailed above will reorganise your file. There is a likelihood that the Regular expression will introduce some small number of spurious paragraph breaks, which can be recombined with their preceding text by visual identification and hand editing.

In the Find box, enter [:upper:]
In the Replace box, enter \n&
Check the "Match case" checkbox
Drop "More options" and select "Regular Expressions".
Press "Replace all" button.
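The assumption that a capital letter at the start of a line begins a paragraph can be tested on plain text with a short Python sketch. Two things here are my own assumptions and not part of the steps above: the match is anchored to the start of a line (via a lookbehind for the line end), and only ASCII capitals A to Z are considered. The function name `mark_capitalised_lines` is hypothetical.

```python
import re


def mark_capitalised_lines(text: str) -> str:
    """Insert an empty line before every line (after the first) that starts with a capital letter."""
    # (?<=\n) requires a preceding line end; (?=[A-Z]) peeks at the capital without consuming it.
    return re.sub(r"(?<=\n)(?=[A-Z])", "\n", text)


if __name__ == "__main__":
    print(mark_capitalised_lines("The base\nwas built\nSituated on"))
    # A proper noun at a line start gets a spurious break, as warned above:
    print(mark_capitalised_lines("on the\nRoss Island"))
```

The second call shows the kind of spurious break that has to be found by eye and joined back by hand.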
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Width of text on the page

Post by JeJe »

There's one ambiguous line in the 2nd sample which is
Pole, is a fitting example of this.
The base was built early in 1956
But given that newspapers generally break everything up into short paragraphs, the best approach is probably just to treat every line that ends a sentence as the end of a paragraph (as in my first effort).
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
wildagain
Posts: 8
Joined: Fri Oct 08, 2021 7:27 pm

Re: Width of text on the page

Post by wildagain »

Bill: Thank you for your patience with someone who is not a programmer or used to dealing with code. I downloaded and installed AltSearch, but cannot see how to invoke it. Where do you look to find it and apply it? When I go to the Extension Manager it tells me I have installed it, but I want to know how to proceed.
Open Office 4.1.8 on Windows 10
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Width of text on the page

Post by RoryOF »

AltSearch is quite complex to use, in my opinion: a complicated syntax for anything other than the simplest. Run the Regular expression I gave, then the Find and Replace methods outlined in my first post. The whole job will be complete (all steps) in under ten minutes.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
wildagain
Posts: 8
Joined: Fri Oct 08, 2021 7:27 pm

Re: Width of text on the page

Post by wildagain »

Bill: Found it. Took a while to show up.
Open Office 4.1.8 on Windows 10
Bill
Volunteer
Posts: 8932
Joined: Sat Nov 24, 2007 6:48 am

Re: Width of text on the page

Post by Bill »

RoryOF wrote:AltSearch is quite complex to use, in my opinion: a complicated syntax for anything other than the simplest. Run the Regular expression I gave, then the Find and Replace methods outlined in my first post. The whole job will be complete (all steps) in under ten minutes.
This is a batch process in AltSearch, not a search, so there is no complicated syntax or even a Regular Expression involved. Just select the text containing the paragraph breaks to be deleted, then open AltSearch, click "batch>>" to open the Batch Manager, select the batch, then click "Execute".
AOO 4.1.14 on Ubuntu MATE 22.04
John_Ha
Volunteer
Posts: 9583
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Re: Width of text on the page

Post by John_Ha »

wildagain wrote:Anyway, I'm afraid I'm going to have to give up the OCR and type the whole thing myself to get the format I require.
Did you read the first reply to your post? It told you exactly what to do using only the DELETE key. It's 22 posts earlier.

Surely you must know where the DELETE key is on the keyboard! If you do know how to find it, click at the end of each line and press DELETE. It's done in one keystroke. It will take fewer keystrokes to do them all than you have posted here, and many, many fewer than retyping.

You might even want to read [Tutorial] How do I remove end_of_paragraph marks?

wildagain
Posts: 8
Joined: Fri Oct 08, 2021 7:27 pm

Re: Width of text on the page

Post by wildagain »

Bill: thank you greatly. I finally got it with the batch routine. It certainly beats going through the file line by line as someone suggested. For anyone looking for the same method, here it is:
Download and install the AltSearch extension, open the file, select all the text, open AltSearch, click on Batch, select "Join paragraphs non separated by empty paragraphs" and press Execute. The only glitch is that it comes out as one big paragraph. I suspect I can restore the original paragraph breaks by inserting an extra paragraph break in the manuscript before I do the above.
And thank you to everyone who chipped in on this.

And now can we get OpenOffice to incorporate a method to accomplish easily what I have now done with AltSearch and its batch routine?

Unfortunately, when I bought the Apple II+ with CP/M in 1981, I didn't learn to code, as my 12-year-old son did very quickly; he talked me into buying an early Hayes modem. His computer savvy got him a fee waiver when he did his doctorate at Michigan, for being faculty advisor to undergraduates on how they could best use the faculty's IT resources.
Open Office 4.1.8 on Windows 10
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Width of text on the page

Post by RoryOF »

OpenOffice does not like a paragraph greater than 64 k characters, so you may find it impossible to enter new text. This hardcoded paragraph limit is why it is best to identify the actual paragraphs before moving on to adjust line lengths.
Edit: 64K characters is typically about 11,000 to 12,000 words.
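If the text is being prepared outside Writer first, a quick length check can flag paragraphs that would run into this limit. The sketch below is only an illustration: the constant takes "64 k characters" literally as 64 * 1024 and treats the exact cut-off as approximate, and the function name `oversized_paragraphs` is invented for this example.

```python
MAX_PARAGRAPH_CHARS = 64 * 1024  # "64 k characters" as stated above; exact cut-off is an assumption


def oversized_paragraphs(paragraphs, limit=MAX_PARAGRAPH_CHARS):
    """Return (index, length) for every paragraph at or above the limit."""
    return [(i, len(p)) for i, p in enumerate(paragraphs) if len(p) >= limit]


if __name__ == "__main__":
    joined = ["a short paragraph", "x" * 70000]
    for index, length in oversized_paragraphs(joined):
        print(f"paragraph {index} has {length} characters; split it before pasting")
```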
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Width of text on the page

Post by JeJe »

wildagain wrote:And now can we get OpenOffice to incorporate a method to accomplish easily what I have now done with AltSearch and its batch routine?
Did you even try the document I posted? You never said what was wrong with it, which I'm still (mildly) curious about. It might help someone else to know...
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
Bill
Volunteer
Posts: 8932
Joined: Sat Nov 24, 2007 6:48 am

Re: [Solved] Width of text on the page

Post by Bill »

@JeJe: I don't see a problem with the macro. There is a problem with the OCR document itself which seems to have left out most of the punctuation marks at the end of sentences.
AOO 4.1.14 on Ubuntu MATE 22.04