[Tutorial] How to fix SAXParse error in LibreOff .docx files

John_Ha
Volunteer
Posts: 9583
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Post by John_Ha »

These problems seem only to arise in .docx files saved by LibreOffice.
Edit: See [Tutorial] Differences between Writer and MS Word files for why you should always work in, and save files as, .odt, and never save a file as .doc, .docx or .rtf.

Only create a .doc, .docx or .rtf as a copy if you have to, but always keep the .odt as the master.
 
Try one of the following four self-help methods to fix LibreOffice .docx files with SAX parse errors - you only need to use one of them and your chance of a successful recovery is very high.

0. Always work on a copy of the file because, if things go wrong, you will still have the original.
 Edit: When you get the SAXParse error message it tells you the name of the repeated attribute - note it as you will need to search for it.

The error message goes on to say: Do you want to continue to open the file?

If you say YES, everything up to the error is displayed, and everything after the error is missing. If, despite the warning above, you are working on the original file, DO NOT NOW SAVE THE FILE (unless you use a different name) BECAUSE YOU WILL DELETE ALL THE MISSING DATA FROM THE .DOCX and you will never then be able to recover the missing data as you will have deleted it from the file.

1. The free Microsoft Word Viewer seems to be able to open these files ... EDIT: It may only show the content before the error, and everything after the error is truncated

... so download it from How to obtain the latest Microsoft Word Viewer. Open the file and copy the data into a Writer document.

Why can Microsoft Word Viewer open them? Presumably because Microsoft Word Viewer has better, more robust error handling code than LO, and can cope with the repeated attribute without throwing an error.

It is also very useful always to have the Viewer available in case you have a document with MS Word Textboxes because AOO does not display Textboxes (though LO does).

2. AOO seems to be able to open these files ... EDIT: It may only show the content before the error, and everything after the error is truncated

... so download Apache OpenOffice from http://www.openoffice.org/download/index.html. Create a new user on your PC and install AOO for that user only. AOO and LO seem to interact in that LO grabs some of the AOO properties and this will completely isolate AOO from LO. Open the .docx file with AOO. Save it as a .odt file. Uninstall AOO and delete the added user. Note that this will delete any MS Word Textboxes and their contents because AOO does not support MS Word Textboxes (presumably because they are not part of the OOXML standard).

Why can AOO open them? Presumably because AOO has better, more robust error handling code than LO, and AOO can cope with the repeated attribute without throwing an error.

3. Remove the repeated definitions from document.xml

This requires you to unzip the .docx file, extract the \word\document.xml file, and remove all the occurrences of the repeated attribute specified in the error message you get when you open the .docx file. Note that there may be more than one attribute repeated in the file so you may have to do this for the other repeated attribute(s). Repeated attributes reported here include w:themeShade, w:themeColor and w:cstheme. Some files uploaded to the forum have had many (30+?) repeats.
NotePad++ XML Tools plugin error message when opening document.xml says that " w:themeColor " has been re-defined.
This means that there are two (or more) occurrences of " w:themeColor " following each other. There should be only one each time it occurs.
You can also use the XML Tools plugin to check XML syntax which will check to see if you have removed the problem (or have other problems in the file).
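If you would rather not scan document.xml by eye, a short script can list which attributes are repeated. This is only a sketch in Python and is not part of the method as originally written. The function name repeated_attributes is made up for illustration, and the script assumes that attribute values never contain < or >, which XML does not guarantee.

```python
import re
from collections import Counter

# One start tag or empty-element tag, e.g. <w:color w:val="x"/>.
# Assumption: attribute values contain no '<' or '>' characters.
TAG_RE = re.compile(r'<[A-Za-z_][^<>]*>')
# One attribute: a name, '=', then a quoted value. Group 1 is the name.
ATTR_RE = re.compile(r'([\w:.-]+)\s*=\s*(?:"[^"]*"|\'[^\']*\')')

def repeated_attributes(xml_text):
    """Count, per attribute name, how many tags define it more than once."""
    counts = Counter()
    for tag in TAG_RE.findall(xml_text):
        names = ATTR_RE.findall(tag)
        for name in {n for n in names if names.count(n) > 1}:
            counts[name] += 1
    return counts

sample = '<w:color w:val="5B9BD5" w:themeColor="accent1" w:themeColor="accent1"/>'
print(repeated_attributes(sample))  # Counter({'w:themeColor': 1})
```

To run it on a real file, read the extracted document.xml as UTF-8 text and pass that text to repeated_attributes(). Every name it reports is an attribute you will need to search for.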
1 Unzip the .docx file and extract \word\document.xml.

A .docx file is actually a ZIP file so just unZIP it; or rename fred.docx to fred.zip, and double click it.
.docx file when unzipped in 7-ZIP. Double-click \word\ to find document.xml inside
2 Open document.xml in Notepad++ and search for the repeated attribute (eg w:themeColor)

Delete the second instance of w:themeColor="accent1" each time it occurs, leaving the trailing /> as below:

Before correcting ...

<w:rPr>
    <w:sz w:val="20"/>
    <w:szCs w:val="20"/>
    <w:highlight w:val="yellow"/>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="5B9BD5" w:themeColor="accent1" w:themeColor="accent1"/>
</w:rPr>
After correcting ...

<w:rPr>
    <w:sz w:val="20"/>
    <w:szCs w:val="20"/>
    <w:highlight w:val="yellow"/>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="5B9BD5" w:themeColor="accent1"/>
</w:rPr>
This image shows document.xml being edited in Notepad++ without invoking the "pretty print" add-on.
Two instances of w:themeColor="accent1" follow each other. Delete the second (and third, fourth etc ...) each time it occurs, always leaving the trailing />
3 Put document.xml back into the .docx file in the \word folder.

If you renamed fred.docx to fred.zip then drag document.xml back into it and rename fred.zip back to fred.docx.

The .docx file should now open properly.
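The three steps above can also be scripted. What follows is a minimal sketch in Python, not a tested repair tool. The function names drop_repeated_attributes and repair_docx are hypothetical, and it assumes that document.xml is UTF-8 and that attribute values contain no < or > characters. It keeps the first occurrence of each attribute in a tag and deletes the later repeats, as in the before and after example. It always writes a new file, so the damaged original is left untouched.

```python
import re
import zipfile

# One start tag or empty-element tag (assumes no '<' or '>' inside attribute values).
TAG_RE = re.compile(r'<[A-Za-z_][^<>]*>')
# Leading whitespace plus one attribute; group 1 is the attribute name.
ATTR_RE = re.compile(r'\s+([\w:.-]+)\s*=\s*(?:"[^"]*"|\'[^\']*\')')

def drop_repeated_attributes(xml_text):
    """Keep the first occurrence of each attribute in a tag, delete later repeats."""
    def fix_tag(tag_match):
        seen = set()

        def keep_first(attr_match):
            name = attr_match.group(1)
            if name in seen:
                return ''               # a repeat: delete it, including its leading space
            seen.add(name)
            return attr_match.group(0)  # first occurrence: keep it unchanged

        return ATTR_RE.sub(keep_first, tag_match.group(0))

    return TAG_RE.sub(fix_tag, xml_text)

def repair_docx(damaged_path, repaired_path):
    """Copy every entry of the .docx into a new file, fixing only word/document.xml."""
    with zipfile.ZipFile(damaged_path) as src, \
         zipfile.ZipFile(repaired_path, 'w', zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == 'word/document.xml':
                # Assumption: the part is UTF-8 encoded.
                text = data.decode('utf-8')
                data = drop_repeated_attributes(text).encode('utf-8')
            dst.writestr(item, data)
```

For example, repair_docx('fred.docx', 'fred_repaired.docx') would write the cleaned copy next to the original. If the repaired copy still fails to open, there may be other errors in the XML which this sketch does not handle.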

Note that it is easier to find the repeated occurrences if you "pretty print" the XML using the XML Tools plugin for Notepad++. BUT - if you use Pretty print, be sure to Linearise the XML before saving it (it is an XML Tools option) or lots of tabs and newlines will be saved in the file which then appear in the repaired document.
 Edit: Files have recently been analysed where there are additional different errors in the XML. These errors were found and corrected by using the XML Syntax Checker to check the XML. See here 
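If Notepad++ is not available, a similar well-formedness check can be approximated with the expat parser that ships with Python. This is a sketch only: the function name first_xml_error is made up for illustration, and the parser stops at the first error it meets, so you would need to re-run it after each correction.

```python
import xml.parsers.expat

def first_xml_error(xml_bytes):
    """Return a description of the first well-formedness error, or None if there is none."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)  # True marks this as the final chunk of data
    except xml.parsers.expat.ExpatError as err:
        reason = xml.parsers.expat.ErrorString(err.code)
        return f'{reason} at line {err.lineno}, column {err.offset}'
    return None

bad = b'<w:color w:val="5B9BD5" w:themeColor="accent1" w:themeColor="accent1"/>'
print(first_xml_error(bad))
```

On a real file you would pass the bytes of the extracted document.xml, read with open('document.xml', 'rb'). A result of None means the parser found nothing wrong with the XML syntax; it does not prove that the document content is complete.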

4. Extract \word\document.xml from the .docx file and strip off all the XML tags to leave just the text

Windows:

Rename the file from fred.docx to fred.ZIP.
Double click fred.ZIP.
Navigate to the \word folder.
Drag document.XML onto the desktop.
- Install Notepad++ and the XML Tools plug-in. Open document.xml with Notepad++. Go Plugins > XML Tools > Pretty print XML with line breaks. Delete the XML tags leaving just the text.
- Alternatively, Google pretty print and upload document.xml to a pretty print web site which will format it. Delete the XML tags.

Linux:

Rename the file from fred.docx to fred.ZIP.
Unzip fred.ZIP - you may need to install a ZIP utility on Linux.
Navigate to the word/ folder.
Extract document.xml.
- Install an XML editor. Open document.xml with the XML editor and format it "pretty print". Delete the XML tags leaving just the text.
- Alternatively, Google pretty print and upload document.xml to a pretty print web site which will format it. Delete the XML tags.
 Edit: The easiest way to delete all the XML tags is with a Find and Replace, where you use a Regular Expression to find all the tags. (Note: A regular expression search will work in LO itself as long as you do not break the character limit for a paragraph [64k in AOO, more in LO??].)

It works fine in NotePad++.

1. Open document.xml.
2. Go Search > Replace ..., with search argument <[^>]+> and replace argument blank (or space).
3. Tick Regular Expressions.
4. Click Replace All.

All XML tags are deleted and you are left with just the text. You will need to re-format it and recreate tables and footnotes etc. If you pretty printed before searching and you do not Linearise the XML before you save the file, you will be left with many tabs which you will need to delete manually. 
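The same Find and Replace can be done with a few lines of Python, which avoids the tabs left behind by pretty printing. This is a sketch under stated assumptions: the function name xml_to_text is hypothetical, the search pattern is the <[^>]+> expression given above, and turning each closing </w:p> tag into a newline is my own addition so that paragraphs stay on separate lines.

```python
import html
import re

def xml_to_text(xml_text):
    """Strip all XML tags from document.xml text, leaving just the text."""
    # Addition to the method above: end each Word paragraph with a newline.
    text = re.sub(r'</w:p>', '\n', xml_text)
    # The regular expression from the tutorial: delete every remaining tag.
    text = re.sub(r'<[^>]+>', '', text)
    # Turn entities such as &amp; back into the characters they stand for.
    return html.unescape(text)

sample = ('<w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>'
          '<w:p><w:r><w:t>End</w:t></w:r></w:p>')
print(xml_to_text(sample))
```

As with the manual method, tables, footnotes and formatting are lost and have to be recreated. Because the pattern does not require the XML to be well formed, it also works on a document.xml which still contains the repeated attributes.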
5. See other examples in this Tutorial

See [Tutorial] Format error discovered in sub-document for other examples of SAXParse corruptions.

If everything else fails, your only hope then is to see [Tutorial] How to find and un-delete Writer temporary files for how to:

a) use Previous Versions (W7 and later) to recover previous versions of the file (is there something similar on MacOS and Linux?);

b) recover your file as it was when you last opened or saved it; or as it was when it was last saved with AutoRecovery;

c) find previous versions of the file in the folder it is located in, but which have since been deleted;

d) find any temporary files AOO wrote while you were editing the file but which have not yet been deleted;

e) un-delete the temporary files AOO wrote while you were editing the file, and then deleted.

d) and e) will recover your file as it was when you last opened or last saved it.
LO 6.4.4.2, Windows 10 Home 64 bit

See the Writer Guide, the Writer FAQ, the Writer Tutorials and Writer for students.

Remember: Always save your Writer files as .odt files. - see here for the many reasons why.
mikekaganski
Posts: 12
Joined: Mon Oct 30, 2017 12:39 pm

Re: [Tutorial] How to fix SAXParse error in LibreOff .docx f

Post by mikekaganski »

Sigh. This partially useful tutorial still has all these nonsensical statements after almost three years!
John_Ha wrote:These problems seem only to arise in .docx files saved by LibreOffice.
This only appears in LibreOffice, because the same erroneous data is handled silently by AOO, which truncates the stream after the error. LibreOffice did indeed introduce some regressions that caused more such corruptions (mostly fixed), but corrupted OOXML is not exclusive to LibreOffice. Only the message is, and it helps users at least realize that something bad has happened to the document.
John_Ha wrote:0. Always work on a copy of the file because, if things go wrong, you will still have the original.
 Edit: When you get the SAXParse error message it tells you the name of the repeated attribute - note it as you will need to search for it.

The error message goes on to say: Do you want to continue to open the file?

If you say YES, everything up to the error is displayed, and everything after the error is missing. If, despite the warning above, you are working on the original file DO NOT NOW SAVE THE FILE (unless you use a different name) BECAUSE YOU WILL DELETE ALL THE MISSING DATA FROM THE .DOCX and you will never then be able to recover the missing data as you will have deleted it from the file. 
And here we see why the message is helpful: with LibreOffice, if the document was damaged, the user at least gets a notification and may try to recover using this helpful tutorial. If the same happens in AOO, the user just opens the document, which has parts silently dropped; there is no guarantee that the user notices, e.g., that some footnotes are lost. Then the user makes some changes and saves the document without any doubt. And guess what happens to the dropped information then (a hint: see the cited fragment above).
John_Ha wrote:2. AOO seems to be able to open these files ... EDIT: It may only show the content before the error, and everything after the error is truncated

... so download Apache OpenOffice from http://www.openoffice.org/download/index.html. Create a new user on your PC and install AOO for that user only. AOO and LO seem to interact in that LO grabs some of the AOO properties and this will completely isolate AOO from LO. Open the .docx file with AOO. Save it as a .odt file. Uninstall AOO and delete the added user. Note that this will delete any MS Word Textboxes and their contents because AOO does not support MS Word Textboxes (presumably because they are not part of the OOXML standard).

Why can AOO open them? Presumably because AOO has better, more robust error handling code than LO, and AOO can cope with the repeated attribute without throwing an error.
The "EDIT" above tells the truth: yes, AOO drops the data, only silently, giving a warm feeling of "safety" that persuades you to trust it to do things correctly, only for you to learn later, after the file has been saved, that the data is unrecoverable. LibreOffice does the same if the user chooses to continue opening the file, but this time with a warning. So no, there is no "better, more robust error handling code" in AOO. And so no, this step is definitely not useful (I can't comment on MS Word Viewer, which has a similar EDIT).
LibreOffice 7.6 on Windows 10
RoryOF
Moderator
Posts: 34570
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: [Tutorial] How to fix SAXParse error in LibreOff .docx f

Post by RoryOF »

OpenOffice cannot save in .docx format, so doesn't cause the error.

With respect, mikekaganski, how OpenOffice or LibreOffice handles the corrupted file is irrelevant - the file is damaged and needs repair. Neither program can handle the damaged file with any integrity. They both report that the file is damaged, and that is sufficient, until (if ever) a mechanism is put in place to allow such a program to repair the file.

So if one gets a SAXParse error, do not save over the existing document that reports the error. The original document reporting the error (or an exact copy of it, made at system level) is needed for repair.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
John_Ha

Re: [Tutorial] How to fix SAXParse error in LibreOff .docx f

Post by John_Ha »

mikekaganski wrote:Sigh. This partially useful tutorial still has all these nonsensical statements after almost three years!
Your disparaging sarcasm is unwarranted.

This tutorial has helped many people as it gives detailed instructions as to how to fix a file and recover the data. Numerous volunteers in the forum have repaired LO users' damaged files by following these instructions and those in [Tutorial] Format error discovered in sub-document, which covers similar ground, so I suggest people do find it useful despite your nitpicking the descriptive parts.

Have you repaired any? If so, and you have any methods which may help users who see these errors to repair their files and recover their data, then please add it.

AOO does not seem to create or see SAXParse errors - only LO seems to create and see them.

As an aside we are users not developers so we have no access to the code. We analyse problems by what users tell us so inevitably we will not get every "nitpicking detail of the descriptive part" of the diagnosis correct. Hence the prolific use of "may ...", "seems to ..." and "presumably because ..." throughout the Tutorial.

The key point is that numerous forum volunteers and moderators now use these instructions to help users recover their data - that was the intent of writing the Tutorial. If none are available a user can do it him/herself.

As you can see many have been very pleased to have their files recovered using these methods:
concepTV wrote:IT WORKS! THANK YOU BOTH! I`M VERY VERY HAPPY!! YOU REALLY SAVED MY LIVE! :bravo: :bravo: :bravo: :bravo: :bravo:
:D :D :D :D :D :D :D
sarthak04 wrote:thanks a lot robleyd....you saved me a lot of trouble. Every thing looks as it was before the error. :bravo: :bravo:
GeneticBio wrote:Thanks so much that worked!! That is really really appreciated kind internet person. <3 I didn't have time to redo that today.
Wahine wrote:RORY! I love you, you bloody life-saver. Saved my arse here. :super: :bravo:
The document is perfect! Can you explain what the hell went wrong here? Should I install AOO instead of LO?
silviarev wrote:OMG thank you SO MUCH!! It works!! You saved me :bravo: :D
jennifersita wrote:Rory, you are a lifesaver. Thank you so much. I think you did recover everything! :-)
musicgopher wrote:Thanks so much! :D I am most appreciative.
sandimilicevic wrote:O YES, everithing is fine. Thank you very very much
Last edited by John_Ha on Mon Feb 24, 2020 3:32 pm, edited 5 times in total.
RoryOF

Re: [Tutorial] How to fix SAXParse error in LibreOff .docx f

Post by RoryOF »

I think, in every case I have accessed a damaged file of the type under discussion, OpenOffice reported that the file was damaged and gave an indication of the error type and its location.