[Tutorial] How to fix SAXParse errors in LibreOffice files



Post by John_Ha » Fri Jul 28, 2017 5:52 pm

These problems seem only to arise in .docx files saved by LibreOffice.

Try one of the following three self-help methods to fix LibreOffice .docx files with SAX parse errors. You only need to use one of them!

Be sure to work on a copy of the file just in case something goes wrong ...

1. AOO seems to be able to open these files ...

... so download Apache OpenOffice (AOO) from http://www.openoffice.org/download/index.html. Create a new user on your PC and install AOO for that user only. AOO and LibreOffice (LO) seem to interact, in that LO picks up some of the AOO properties, and installing AOO under a separate user isolates it completely from LO. Open the .docx file with AOO and save it as a .odt file. Then uninstall AOO and delete the added user.

Note that this will delete any MS Word Textboxes and their contents. AOO does not support MS Word Textboxes (presumably because they are not part of the OOXML standard).

2. Remove the repeated definitions from document.xml

This requires you to unzip the .docx file, extract the \word\document.xml file, and remove all the occurrences of the repeated attribute specified in the error message you get when you open the .docx file. Note that there may be more than one attribute repeated in the file so you may have to do this for the other repeated attribute(s). Repeated attributes reported here include w:themeShade, w:themeColor and w:cstheme. Some files uploaded to the forum have had many (30+?) repeats.

The error message says that "w:themeColor" has been redefined.

This means that there are two (or more) occurrences of "w:themeColor" following each other. There should be only one each time it occurs.
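If you want to know up front which attributes are repeated, and in how many places, a short Python script can list them. This is only a sketch and not part of the original procedure: it assumes document.xml is UTF-8 encoded, and the function names find_repeated_attributes and find_repeated_in_docx are made up for illustration. It uses regular expressions rather than an XML parser, because a parser would reject the file for the same reason LibreOffice does.

```python
import re
import zipfile
from collections import Counter

# One start tag (or empty-element tag) that carries at least one attribute.
# Group 1 is the tag name, group 2 is the raw attribute text.
TAG_RE = re.compile(
    r'<([\w:.-]+)((?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'))+)\s*/?>'
)
# One name="value" pair; group 1 is the attribute name.
ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*(?:"[^"]*"|\'[^\']*\')')


def find_repeated_attributes(xml_text):
    """Return a Counter: attribute name -> number of tags that repeat it."""
    repeated = Counter()
    for tag in TAG_RE.finditer(xml_text):
        names = Counter(ATTR_RE.findall(tag.group(2)))
        for name, count in names.items():
            if count > 1:
                repeated[name] += 1
    return repeated


def find_repeated_in_docx(docx_path):
    """Read word/document.xml straight out of the .docx (a ZIP file)."""
    with zipfile.ZipFile(docx_path) as archive:
        # Assumption: the part is UTF-8 encoded.
        xml_text = archive.read("word/document.xml").decode("utf-8")
    return find_repeated_attributes(xml_text)
```

Calling find_repeated_in_docx on a copy of the damaged file returns each repeated attribute name with the number of tags affected, so you know what to search for in the steps below.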

1 Unzip the .docx file and extract \word\document.xml. A .docx file is actually a ZIP file, so just unzip it, or rename fred.docx to fred.zip and double-click it.

The .docx file when unzipped in 7-Zip. Double-click \word\ to find document.xml inside.

2 Open document.xml in Notepad++, search for the repeated attribute (e.g. w:themeColor), and delete the second instance of w:themeColor="accent1" each time it occurs, leaving the trailing / as below. Save document.xml.

Code:
    <w:sz w:val="20"/>
    <w:szCs w:val="20"/>
    <w:highlight w:val="yellow"/>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="5B9BD5" w:themeColor="accent1" w:themeColor="accent1"/>

This image shows document.xml being edited in Notepad++ without invoking the "pretty print" add-on.

Two instances of w:themeColor="accent1" follow each other. Delete the second (and third, fourth etc ...) each time it occurs, always leaving the trailing /

3 Put document.xml back into the .docx file. If you renamed fred.docx to fred.zip then drag document.xml back into it and rename fred.zip back to fred.docx.

The .docx file should now open properly.

Note that it is easier to find the repeated occurrences if you "pretty print" the XML using the XML Tools plugin for Notepad++. However, if you use pretty print, be sure to linearise the XML before saving it (it is an XML Tools option); otherwise lots of tabs and newlines will be saved in the file and will then appear in the repaired document.
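For files with many repeats, the manual edit in method 2 can be scripted. The following Python sketch is one possible approach, not a tested tool: it assumes document.xml is UTF-8 encoded, it keeps the first occurrence of each attribute in a tag and drops the later ones (even if their values differ), and the names drop_repeated_attributes and repair_docx are invented for this example. It writes a new file and leaves the original untouched.

```python
import re
import zipfile

# A start tag (or empty-element tag) that carries at least one attribute.
TAG_RE = re.compile(
    r'<[\w:.-]+(?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'))+\s*/?>'
)
# One attribute with its leading whitespace; group 1 is the attribute name.
ATTR_RE = re.compile(r'\s+([\w:.-]+)\s*=\s*(?:"[^"]*"|\'[^\']*\')')


def drop_repeated_attributes(xml_text):
    """Within each tag, keep the first copy of an attribute and delete the rest."""

    def fix_tag(tag_match):
        seen = set()

        def keep_first(attr_match):
            name = attr_match.group(1)
            if name in seen:
                return ""  # second, third, ... occurrence: remove it
            seen.add(name)
            return attr_match.group(0)

        return ATTR_RE.sub(keep_first, tag_match.group(0))

    return TAG_RE.sub(fix_tag, xml_text)


def repair_docx(src_path, dst_path):
    """Copy every part of the .docx, rewriting only word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                # Assumption: the part is UTF-8 encoded.
                fixed = drop_repeated_attributes(data.decode("utf-8"))
                data = fixed.encode("utf-8")
            dst.writestr(item, data)
```

A call such as repair_docx("fred-copy.docx", "fred-fixed.docx") (file names are placeholders) produces a second file which you can then try to open in LibreOffice. If the repeats are in another part of the package, such as a header or footer file, this sketch will not touch them.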

3. Extract \word\document.xml from the .docx file and strip off all the XML tags to leave just the text


On Windows:
Rename the file from fred.docx to fred.zip.
Double-click fred.zip.
Navigate to the \word folder.
Drag document.xml onto the desktop.
- Install Notepad++ and the XML Tools plug-in. Open document.xml with Notepad++. Go Plugins > XML Tools > Pretty print XML with line breaks. Delete the XML tags, leaving just the text.
- Alternatively, search the web for "pretty print XML" and upload document.xml to a pretty-print web site, which will format it. Delete the XML tags.


On Linux:
Rename the file from fred.docx to fred.zip.
Unzip fred.zip - you may need to install a ZIP utility on Linux.
Navigate to the \word folder.
Extract document.xml.
- Install an XML editor. Open document.xml with the XML editor and format it with "pretty print". Delete the XML tags, leaving just the text.
- Alternatively, search the web for "pretty print XML" and upload document.xml to a pretty-print web site, which will format it. Delete the XML tags.

Edit: The easiest way to delete all the XML tags is with a Find and Replace, using a regular expression to find all the tags. (Note: A regular expression search will also work in LO itself, as long as you do not exceed the character limit for a paragraph [64k in AOO].)

It works fine in Notepad++.

1. Open document.xml.
2. Go Search > Replace ..., with search argument <[^>]+> and replace argument blank (or space).
3. Tick Regular Expressions.
4. Click Replace All.

All XML tags are deleted and you are left with just the text. You will need to re-format it and recreate tables and footnotes etc. If you pretty printed before searching and you do not Linearise the XML after searching, you will be left with many tabs which you need to delete manually. 
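The same Find and Replace can be done without an editor. The Python sketch below applies the regular expression from step 2, <[^>]+>, to document.xml read directly from the .docx. Two things in it are my own additions rather than part of the steps above: it adds a newline at the end of each paragraph (the </w:p> tag) so the text does not run together, and it converts XML escapes such as &amp;amp; back to plain characters. It also assumes the file is UTF-8 encoded; the function names are made up for illustration.

```python
import html
import re
import zipfile


def strip_tags(xml_text):
    """Delete every XML tag and return only the text between them."""
    # Addition (not in the original steps): end each paragraph with a newline.
    xml_text = xml_text.replace("</w:p>", "</w:p>\n")
    # The regular expression from the tutorial, replaced with nothing.
    text = re.sub(r"<[^>]+>", "", xml_text)
    # Addition: turn escapes like &amp; and &lt; back into & and <.
    return html.unescape(text)


def docx_to_plain_text(docx_path):
    """Read word/document.xml out of the .docx (a ZIP file) and strip its tags."""
    with zipfile.ZipFile(docx_path) as archive:
        # Assumption: the part is UTF-8 encoded.
        xml_text = archive.read("word/document.xml").decode("utf-8")
    return strip_tags(xml_text)
```

As with the manual method, only the text survives: tables, footnotes and formatting still have to be recreated by hand.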
