
[Tutorial] How do I remove end_of_paragraph marks?

Posted: Mon Jan 13, 2020 4:56 pm
by John_Ha
If you copy text from a PDF, and sometimes from other sources such as the web, you will find that it is made up of "single line paragraphs" and the text does not flow. Go View > Non printing characters ... (View > Formatting marks in LO) to see the end_of_paragraph markers, which appear as pilcrows (¶). You can hide them again later if you wish.

You can remove the unnecessary end_of_paragraph marks by running a few Find and Replace searches using regular expressions. You will almost certainly have to clean up the text afterwards but the majority of work will be done.

Note: If you copy from an email you will often find that each line ends with a newline character, which is shown as a backwards-facing arrow.
Edit: The OOoFBTools add-on is excellent and highly recommended.

The searches below could be improved. OOoFBTools does a better job. 
You need four searches:

1. Find all genuine end_of_paragraph marks (lines ending in full stop, question or exclamation marks) and protect them by changing them to QAZWSX
2. Replace all unnecessary end_of_paragraph marks by a space
3. Put back the protected genuine end_of_paragraph marks from search 1
4. Delete any spaces at the beginning of lines.

Go Edit > Find and Replace. Click More Options and tick Regular expressions. Run the following four searches, where sp means a single space character. I do not think there is a limit to how much text you can process at a time - the example files have just over 7,000 words, which is two chapters of Vanity Fair.


Search 1
Find   : (\.|\?|!)$
Replace: $0QAZWSX
Click replace all

Search 2
Find   : $
Replace: sp
Click replace all

Search 3
Find   : QAZWSX
Replace: \n 
Click replace all

Search 4
Find   : ^sp 
Replace: leave the field blank
Click replace all
You will now probably need to clean up the text as there will be some errors.
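
For anyone who would rather do the same job outside Writer, the four searches can be sketched in Python on plain text. This is only an illustration, not part of the Writer procedure: it assumes the copied text is a string in which every line break stands for an end_of_paragraph mark, and the function name join_broken_lines is made up for this example. Writer's regular expression engine may differ from Python's in details.

```python
import re

# Placeholder from the tutorial; assumed not to occur anywhere in the real text.
MARKER = "QAZWSX"


def join_broken_lines(text: str) -> str:
    """Apply the four searches to plain text where each line is one 'paragraph'."""
    # Search 1: protect genuine paragraph ends (lines ending in . ? or !).
    text = re.sub(r"([.?!])$", r"\1" + MARKER, text, flags=re.MULTILINE)
    # Search 2: replace every line break with a space.
    text = text.replace("\n", " ")
    # Search 3: turn each protected marker back into a line break.
    text = text.replace(MARKER, "\n")
    # Search 4: delete the spaces left at the beginning of lines.
    text = re.sub(r"^ +", "", text, flags=re.MULTILINE)
    return text


sample = "The quick brown fox\njumps over the\nlazy dog.\nDid it land\nsafely?"
print(join_broken_lines(sample))
```

As in Writer, the result still needs a manual clean-up: headings and lines ending in other punctuation are joined to the following line unless they are protected first.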

It is also helpful to add an extra end_of_paragraph marker (by pressing Enter) after any headings or lines you know should not be run together - see image. You can fine-tune the searches to cope with quotation marks, colons, right brackets, etc. appearing at the end of lines.
Text from Start of Vanity Fair.PDF copied into Writer
Note the unnecessary end_of_paragraph marks which need to be deleted
See Start of Vanity Fair.PDF, where every line is a paragraph and the paragraph markers need to be removed. See Start of Vanity Fair.odt, which was created from the PDF by adding end_of_paragraph markers after the headings and running the searches.
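
The fine-tuning of Search 1 mentioned above (quotation marks, colons, right brackets at the end of lines) could look something like the pattern below. It is a guess at one reasonable extension, not a tested Writer search: the choice to treat a colon as a genuine paragraph end, the set of closing characters, and the name protect_genuine_ends are all assumptions made for this sketch, written in Python so it can be tried on plain text first.

```python
import re

# Hypothetical extension of Search 1: . ? ! or : optionally followed by one
# closing character (straight or curly quote, right parenthesis, right bracket)
# at the end of a line.
GENUINE_END = re.compile(r"""([.?!:]["'\u201d\u2019)\]]?)$""", re.MULTILINE)


def protect_genuine_ends(text: str, marker: str = "QAZWSX") -> str:
    """Append the marker to every line that looks like a genuine paragraph end."""
    return GENUINE_END.sub(lambda m: m.group(1) + marker, text)


example = 'He said:\n"Stop!"\nand then\n(quietly.)\nleft'
print(protect_genuine_ends(example))
```

Lines ending in plain words ("and then", "left") are left unprotected, so the later searches would still join them to the following line.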

Re: [Tutorial] How do I remove end_of_paragraph marks?

Posted: Mon Jan 13, 2020 11:27 pm
by Hagar Delest

Re: [Tutorial] How do I remove end_of_paragraph marks?

Posted: Tue Jan 14, 2020 4:21 pm
by esperantisto
And this extension is even better help: OOoFBTools. Choose Join broken lines/paragraphs (for automatic operation on the entire text) or Process ends of lines/paragraphs (to manually process a selection). No need to reinvent the wheel :-)

Re: [Tutorial] How do I remove end_of_paragraph marks?

Posted: Tue Jan 14, 2020 6:07 pm
by John_Ha
esperantisto wrote:And this extension is even better help: OOoFBTools.
That's very nice though I don't think many would find it with a name of OOo FBTools and description of "The cross platform OpenOffice.org extension OOo FBTools used to convert to and processing eBooks in FictionBook2 format." I am pleased to have flushed it out as it looks very powerful.

I went OOoFBTools > Join broken lines of a paragraph..., with the settings as below. It produced a virtually identical result to the searches I used above.

I will include it in the final tutorial.