[Solved] LibreOffice File format error found at SAXParse

Help with installation and general system troubleshooting questions concerning the office suite LibreOffice.
lajeandom
Posts: 4
Joined: Tue Dec 13, 2016 12:07 pm

[Solved] LibreOffice File format error found at SAXParse

Post by lajeandom »

Hello all, URGENT

I saved a file as .docx and it is now corrupted, and it is due today for a client. I tried everything, even modifying the source code in Visual Studio, but I am hopeless at anything related to code.

Please help me out.

Here's a link to the file:

https://www.dropbox.com/s/hu2me5fl02cd7 ... .docx?dl=0

Thank you very much in advance for the help!
OpenOffice 3.1 on windows 10
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: [Solved] File format error found at SAXParse

Post by RoryOF »

The repaired file is attached. Please check that all content and formatting are as you require.
Attachments
DAJFR_KEYS1 repaired.docx
(909.29 KiB) Downloaded 511 times
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
lajeandom
Posts: 4
Joined: Tue Dec 13, 2016 12:07 pm

Re: [Solved] File format error found at SAXParse

Post by lajeandom »

RoryOF wrote:The repaired file is attached. Please check that all content and formatting are as you require.
Wow! You are an angel! I lost so many hours today trying to fix this problem. I don't know how you did it but it works! Thank you a million times.
John_Ha
Volunteer
Posts: 9583
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Re: [Solved] File format error found at SAXParse

Post by John_Ha »

You need to report this as a bug with LibreOffice as, until it is fixed, it will continue to happen. See How to Report Bugs in LibreOffice
LO 6.4.4.2, Windows 10 Home 64 bit

See the Writer Guide, the Writer FAQ, the Writer Tutorials and Writer for students.

Remember: Always save your Writer files as .odt files. - see here for the many reasons why.
lajeandom
Posts: 4
Joined: Tue Dec 13, 2016 12:07 pm

Re: [Solved] File format error found at SAXParse

Post by lajeandom »

John_Ha wrote:You need to report this as a bug with LibreOffice as, until it is fixed, it will continue to happen. See How to Report Bugs in LibreOffice
Could you summarize in a few words what the bug was? I don't really know how to research it or how to phrase it.
John_Ha
Volunteer
Posts: 9583
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Re: [Solved] File format error found at SAXParse

Post by John_Ha »

lajeandom wrote:Could you summarize in a few words what the bug was? I don't really know how to research it or how to phrase it.
Report it as:

Title: SAXParse exception error - multiple occurrences of attribute re-defined in document.xml

When opening the attached file [upload your broken file in the bug report], which was saved by LO as a .docx file, I get the error message [your error message - it will be something like "SAXParseException:'[word/document.xml line 2]: Attribute w:themeShade redefined',Stream 'word/document.xml',Line, Column 159269(row,col)"].

Analysis of \word\document.xml shows repeated occurrences of [the attribute being defined twice - something like w:themeShade] as in [sample line of code from your file].

See the thread [Solved] File format error found at SAXParse at viewtopic.php?f=7&t=80923#p373226 which has several examples of .docx files with repeated attributes.

Resolution: Prevent LO writing these attributes twice.


If you upload your broken file to Dropbox I will extract a sample line of code for you to use in your bug report. Let me know if you do not still have your broken file as you could use one of the files in this thread - I will write the words for you.
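
For anyone who would rather pull the sample line out themselves, here is a minimal Python sketch of that analysis. It is only an illustration, not a LibreOffice tool: the function names are made up for this post, and it assumes the broken part is word/document.xml and that no attribute value contains a ">" character. A regular expression is used instead of an XML parser because a parser will refuse a file with redefined attributes.

```python
import re
import zipfile
from collections import Counter

# One start tag or empty-element tag, e.g. <w:color w:val="1F497D"/>.
# Simplification: assumes no attribute value contains ">".
TAG_RE = re.compile(r"<[A-Za-z_][^<>]*>")
# One attribute inside a tag: whitespace, name, "=", quoted value.
ATTR_RE = re.compile(r"""\s+([\w:.-]+)\s*=\s*(?:"[^"]*"|'[^']*')""")


def find_repeated_attributes(xml_text):
    """Return a list of (tag_text, [names of attributes defined more than once])."""
    hits = []
    for match in TAG_RE.finditer(xml_text):
        tag = match.group(0)
        counts = Counter(ATTR_RE.findall(tag))
        repeated = sorted(name for name, n in counts.items() if n > 1)
        if repeated:
            hits.append((tag, repeated))
    return hits


def scan_docx(docx_path, member="word/document.xml"):
    """Read one XML part straight out of the .docx (a ZIP archive) and scan it."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_text = archive.read(member).decode("utf-8")
    return find_repeated_attributes(xml_text)
```

Calling scan_docx on a copy of the broken file gives one entry per offending tag, and the tag text in any entry can be pasted into the bug report as the sample line of code.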
John_Ha
Volunteer
Posts: 9583
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Self-help methods to fix .docx files with SAXParse error

Post by John_Ha »

These problems seem to arise only in documents created by LibreOffice. Be sure to work on a copy of the file in case something goes wrong.

Here are three self-help methods to fix LibreOffice .docx files with SAXParse errors. You only need to use one of them!

1 AOO seems to be able to open these files ...
 Edit: ... but it only displays the content before the error. Everything after the error is not displayed and, worse, it is permanently deleted from the file if you save it, so you can no longer apply the other fixes!
Download Apache OpenOffice from http://www.openoffice.org/download/index.html. Create a new user on your PC and install AOO for that user only. AOO and LO seem to interact, in that LO grabs some of the AOO properties, and installing AOO under a separate user completely isolates it from LO. Open the .docx file with AOO. Save it as a .odt file. Uninstall AOO and delete the added user.

2 Follow the directions given at ...

... viewtopic.php?f=101&t=86936&#p403228. This requires you to unzip the .docx file, extract the \word\document.xml file, and remove all the occurrences of the repeated attribute specified in the error message you get when you open the .docx file. Note that there may be more than one attribute repeated in the file, so you may have to do this for the other repeated attribute(s) as well. Repeated attributes reported here include w:themeShade, w:themeColor and w:cstheme. Files uploaded to this thread have had many (30+?) repeats.
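
With 30 or more repeats, method 2 can also be scripted. The following is a rough Python sketch, not the procedure from the linked post: the function names are hypothetical, and it assumes that keeping the first occurrence of each repeated attribute is acceptable, that the broken part is word/document.xml encoded as UTF-8, and that no attribute value contains ">". It writes a new file rather than changing the original.

```python
import re
import zipfile

# One start tag or empty-element tag (assumes no ">" inside attribute values).
TAG_RE = re.compile(r"<[A-Za-z_][^<>]*>")
# One attribute inside a tag: whitespace, name, "=", quoted value.
ATTR_RE = re.compile(r"""\s+([\w:.-]+)\s*=\s*(?:"[^"]*"|'[^']*')""")


def drop_repeated_attributes(tag):
    """Keep only the first occurrence of each attribute name within one tag."""
    seen = set()

    def keep_first(match):
        name = match.group(1)
        if name in seen:
            return ""  # a repeat: remove it together with its leading space
        seen.add(name)
        return match.group(0)

    return ATTR_RE.sub(keep_first, tag)


def repair_xml(xml_text):
    """Apply the clean-up to every tag; text between tags is left untouched."""
    return TAG_RE.sub(lambda m: drop_repeated_attributes(m.group(0)), xml_text)


def repair_docx(src_path, dst_path, member="word/document.xml"):
    """Copy every part of the .docx to a new archive, repairing only `member`."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                data = repair_xml(data.decode("utf-8")).encode("utf-8")
            dst.writestr(info.filename, data)
```

For example, repair_docx("fred.docx", "fred_repaired.docx") leaves fred.docx as it was. Because every repeated attribute in every tag is handled in one pass, there is no need to rerun it for w:themeColor or w:cstheme separately.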

3 Extract \word\document.xml from the .docx file and strip off all the XML tags to leave just the text

Windows:

Rename the file from fred.docx to fred.ZIP.
Double click fred.ZIP.
Navigate to the \word folder.
Drag document.xml onto the desktop.
- Install Notepad++ and the XML Tools plug-in. Open document.xml with Notepad++. Go to Plugins > XML Tools > Pretty print XML with line breaks. Delete the XML tags leaving just the text.
- Alternatively, Google pretty print and upload document.xml to a pretty print web site which will format it. Delete the XML tags.
 Edit: I had a file with about 30 errors and I had to find them manually using Notepad++.

I downloaded XML Copy Editor and found it much easier to use as it stepped through the file finding each line with an error.

However, XML Copy Editor would not pretty print because of the errors, so I needed to use Notepad++ to pretty print the file which I then saved. I edited the saved file with XML Copy Editor, saved it, and used Notepad++ to re-linearise it.

XML Copy Editor missed some errors when using F2 to step through the file. However issuing the pretty command in XML Copy Editor located these errors. 
Linux:

Rename the file from fred.docx to fred.ZIP.
Unzip fred.ZIP - you may need to install a ZIP utility on Linux.
Navigate to the word folder.
Extract document.xml.
- Install an XML editor. Open document.xml with the XML editor and format it "pretty print". Delete the XML tags leaving just the text.
- Alternatively, Google pretty print and upload document.xml to a pretty print web site which will format it. Delete the XML tags.
 Edit: The easiest way to delete all the XML tags is by using Find and Replace with Regular Expressions. It should work in LO as long as you do not exceed the character limit for a paragraph (64k in AOO).

It works fine in Notepad++. Open document.xml. Pretty print it (this needs the XML Tools plugin - if you skip the pretty print you will end up with a single paragraph). Go to Search > Replace ..., with search argument <[^>]+> and the replace argument left blank. Tick Regular Expressions. Click Replace All.

All XML tags are deleted and you are left with just the text. 
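
The same find and replace can be done without an editor. This is a small Python sketch under a few assumptions: the function names are invented here, the text lives in word/document.xml encoded as UTF-8, and each paragraph ends with a closing </w:p> tag. Turning that closing tag into a line break stands in for the pretty print step, so the text does not end up as a single paragraph.

```python
import html
import re
import zipfile


def strip_tags(xml_text):
    """Remove all XML tags, keeping one line of text per paragraph."""
    # Turn each paragraph end into a line break before the tags disappear.
    xml_text = re.sub(r"</w:p>", "\n", xml_text)
    # The search argument from the post: anything between < and >.
    text = re.sub(r"<[^>]+>", "", xml_text)
    # Convert entities such as &amp; back into plain characters.
    return html.unescape(text)


def extract_text(docx_path, member="word/document.xml"):
    """Read one XML part straight out of the .docx and return its plain text."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_text = archive.read(member).decode("utf-8")
    return strip_tags(xml_text)
```

The result of extract_text("fred.docx") can be written to a .txt file and pasted into a new document. As with the manual method, all formatting, images and tables are lost and only the text survives.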
Last edited by John_Ha on Tue Oct 12, 2021 6:06 pm, edited 8 times in total.
lajeandom
Posts: 4
Joined: Tue Dec 13, 2016 12:07 pm

Re: Self-help methods to fix .docx files with SAX Parse erro

Post by lajeandom »

Thanks for all the precious information, guys. I will research and post the bug if needed asap. I copied and pasted all your posts, so if this happens again to someone I know (or to myself, but I am staying away from docx files now lol) at least I will know how to solve the issue.
John_Ha
Volunteer
Posts: 9583
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Re: [Solved] LibreOffice File format error found at SAXParse

Post by John_Ha »

lajeandom wrote:I am staying away from docx file now
That is an extremely wise decision. See [Tutorial] Differences between Writer and MS Word files for why you should always work in and save files as .odt.
lajeandom wrote:if this happens again to someone that I know ...
The SAXParse error is a LibreOffice problem, not an AOO problem. I have posted Fixing .docx files with SAXParse error in the LO Forum so that LO users will find the post.