You can remove the unnecessary end_of_paragraph marks by running a few Find and Replace searches using regular expressions. You will almost certainly have to clean up the text afterwards, but the majority of the work will be done.
Note: If you copy from an email, you will often find that each line ends with a newline character (a line break), which is shown as a backwards-facing arrow.
Edit: The OOoFBTools add-on is excellent and highly recommended. The searches below could be improved, and OOoFBTools does a better job.
The four searches do the following:
1. Find all genuine end_of_paragraph marks (lines ending in a full stop, question mark or exclamation mark) and protect them by adding the marker QAZWSX.
2. Replace all unnecessary end_of_paragraph marks with a space.
3. Put back the protected genuine end_of_paragraph marks from search 1.
4. Delete any spaces at the beginning of lines.
Go to Edit > Find and Replace, click More Options and tick Regular expressions. Run the following four searches, where sp means a single space character. I do not think there is a limit to how much text you can process at a time: the example files contain just over 7,000 words, which is two chapters of Vanity Fair.
Search 1
Find : (\.|\?|!)$
Replace: $0QAZWSX
Click replace all
Search 2
Find : $
Replace: sp
Click replace all
Search 3
Find : QAZWSX
Replace: \n
Click replace all
Search 4
Find : ^sp
Replace: leave the field blank
Click replace all
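For anyone who prefers to do the same thing outside Writer, the four searches can be sketched in Python. This is only an approximation under stated assumptions: the text is plain text with one paragraph per line, and the function name join_lines is made up for illustration. Search 2 is done with a plain newline replacement here, since Python's $ does not remove the line end the way the Writer search does.

```python
import re

MARKER = "QAZWSX"  # placeholder from the post; must not occur in the text itself


def join_lines(text: str) -> str:
    """Hypothetical helper mirroring searches 1 to 4 on plain text."""
    # Search 1: protect line ends that follow a full stop, ? or !
    text = re.sub(r"([.?!])$", r"\1" + MARKER, text, flags=re.MULTILINE)
    # Search 2: replace every line end with a space
    text = text.replace("\n", " ")
    # Search 3: put back the protected paragraph ends
    text = text.replace(MARKER, "\n")
    # Search 4: delete a single space at the beginning of each line
    text = re.sub(r"^ ", "", text, flags=re.MULTILINE)
    return text


sample = "While the present century\nwas in its teens.\nSecond paragraph starts\nhere!\n"
print(join_lines(sample))
```

As in Writer, the result still needs a manual check, because a line that happens to end in a full stop in the middle of a paragraph is treated as a genuine paragraph end.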
It is also helpful to add an extra end_of_paragraph marker (by pressing Enter) after any headings or lines you know should not be run together - see image. You can fine-tune the searches to cope with quotation marks, colons, right brackets etc. appearing at the end of lines.
See Start of Vanity Fair.PDF, where every line is a paragraph and the paragraph markers need to be removed. See also Start of Vanity Fair.odt, which was created from the PDF by adding end_of_paragraph markers after the headings and running the searches.
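As one possible way of fine-tuning search 1, here is a Python sketch of a wider pattern. It is an assumption rather than a tested Writer search: it treats a colon as a genuine paragraph end and allows closing quotation marks or right brackets after the punctuation. The names PARAGRAPH_END and protect are made up for illustration.

```python
import re

# Assumed extension of search 1: ., ?, ! or : optionally followed by
# closing quotation marks or right brackets, at the end of a line.
PARAGRAPH_END = re.compile(r"([.?!:][\"'”’)\]]*)$", re.MULTILINE)


def protect(text: str, marker: str = "QAZWSX") -> str:
    """Hypothetical helper: add the marker after each genuine paragraph end."""
    return PARAGRAPH_END.sub(lambda m: m.group(1) + marker, text)


print(protect('She said "Go."\nand left\n'))
```

Whether a colon really ends a paragraph depends on the text, so drop it from the character class if it joins too little.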