[Tutorial] Format error discovered in sub-document

Hagar Delest
Moderator
Posts: 32594
Joined: Sun Oct 07, 2007 9:07 pm
Location: France

Post by Hagar Delest »

Here is a tutorial to fix documents (mainly .odt) that show the following error message: Format error discovered in the file in sub-document content.xml at position 2,155278(row,col).
The row is always 2 but the column differs depending on your document.
This is based on this post from John_Ha. Other tricks are given along this (long) topic: [Hint] How did I fix my ODT file.

Here is what you see in such a case:
Content_01.png
1. Open the file for the surgery
First make a copy of the file (in case something goes wrong). Then open the copy with an archive manager:
  • Either right-click the file, choose Open with, and select an archive manager
  • Or change the extension of the file from .odt to .zip and open it
You should now see the content of the file:
Content_02.png
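
If you prefer a script to an archive manager, the same step can be sketched in Python with the standard zipfile module. This is only an illustration: the function name extract_content_xml and the default output name are my own choices, not part of any office tool.

```python
import zipfile
from pathlib import Path


def extract_content_xml(odt_path, out_path="content.xml"):
    """Copy content.xml out of an .odt archive and return the member names."""
    with zipfile.ZipFile(odt_path) as archive:
        # The archive listing is what the archive manager would display.
        names = archive.namelist()
        # content.xml is the sub-document named in the error message.
        Path(out_path).write_bytes(archive.read("content.xml"))
    return names
```

Run it on the copy of your document, for example extract_content_xml("copy-of-document.odt"), where the file name is a placeholder.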

2. Do the surgery

Post by Hagar Delest »

To edit the file, you need an XML editor, for example Notepad++ with the XML Tools plugin or XML Copy Editor (both are used later in this topic). Then, with the editor, open the content.xml file from the archive.
It should warn you that there is indeed a problem with the same row and column position (or 1 col next to it).
Note that at this point, the XML structure is not correct and cannot be formatted by the editor.
The file is then displayed as 2 lines only, the second being a huge one.
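
Because the second line is so long, it can be hard to reach the reported column by hand. A minimal Python sketch for printing the text around the reported position follows. The function name show_context and the width parameter are hypothetical, and it assumes the column in the error message counts characters; if your document contains many non-ASCII characters the position may be slightly off, so use a generous width.

```python
def show_context(xml_path, row, col, width=120):
    """Return the text surrounding the 1-based (row, col) position."""
    # newline="" keeps the line structure exactly as stored in the file.
    with open(xml_path, encoding="utf-8", errors="replace", newline="") as handle:
        lines = handle.read().split("\n")
    line = lines[row - 1]
    # Assumption: the reported column is a 1-based character index.
    start = max(col - 1 - width, 0)
    return line[start:col - 1 + width]
```

For the message quoted at the top of this tutorial you would call print(show_context("content.xml", 2, 155278)).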

Place your cursor at that position:
Content_03.png
You can notice that in this case, there is a "office:name" parameter that is repeated and it doesn't look very logical (yellow highlighting in the picture below).
Thus, delete the string: office:name="__Annotation__765_9324755062" :
Content_04.png
Edit Apr. 2, 2018: in fact, both "office:name="__Annotation__714_93247550611111"" and "office:name="__Annotation__765_9324755062"" text have wrongly been inserted in the middle of the Style P1 definition and all instances of it need to be deleted. See Re: Format error discovered and Re: [Solved] Read-Error.

Try to identify a text string close to the change. It will be helpful later when checking how far the changes affect the repaired document.

3. Check the result

Post by Hagar Delest »

Now check if the XML file is correct:
Content_05.png
Note: there may be other errors in the document. In that case, repeat the fix-and-check cycle until none remains.
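
The check can also be scripted. The sketch below uses Python's built-in expat parser to report the first well-formedness error; the function name first_xml_error is my own. Like the editors, the parser stops at the first error, so you fix it and run the check again until it returns None.

```python
import xml.parsers.expat


def first_xml_error(xml_bytes):
    """Return (line, column, message) for the first error, or None if well-formed."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # True tells the parser that this is the complete document.
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError as err:
        # expat reports a 1-based line number and a 0-based column offset.
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None
```

Read the file in binary mode, for example first_xml_error(open("content.xml", "rb").read()). The column it reports may differ by one from the number shown by the office suite.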

You can now use the Pretty-print view feature of the editor to format the XML.
In Notepad++, go to Plugins > XML Tools > Pretty print XML with line breaks.
It will display with its structure now readable:
Content_06.png
In Notepad++, you have to Linearise the XML before saving it or lots of tabs and newlines will be saved in the file which then appear in the repaired document.
Save the modified content.xml file in the archive.
Close the archive and change its extension back to .odt if you had changed it to .zip.
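
If your archive manager cannot save a file back into the archive, the replacement can be sketched in Python as below. The function name replace_content_xml is hypothetical. It writes a new archive rather than modifying the original, keeps the entries in their original order, and keeps each entry's compression method (the mimetype entry of an ODF file is normally stored first and uncompressed).

```python
import zipfile


def replace_content_xml(src_odt, dst_odt, new_content):
    """Write a copy of src_odt to dst_odt with content.xml replaced by new_content (bytes)."""
    with zipfile.ZipFile(src_odt) as src, zipfile.ZipFile(dst_odt, "w") as dst:
        for info in src.infolist():
            if info.filename == "content.xml":
                data = new_content
            else:
                data = src.read(info.filename)
            # Reuse the original compression method of every entry.
            dst.writestr(info.filename, data, compress_type=info.compress_type)
```

Example call, with placeholder file names: replace_content_xml("broken.odt", "repaired.odt", open("content.xml", "rb").read()).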

Your file should now open in Writer.
Check its content, especially the part related to the change made in the content.xml file.
If you had spotted a specific text string close to the position where you applied changes, search for it.
In some cases, significant parts of the content.xml file have to be deleted, which removes data from the recovered file. You will have to type that content again and reapply its formatting.

Fixing docx files with SAXParse error (LibreOffice bug)

Post by Hagar Delest »

The editing of an XML file can also be useful to fix the bug specific to LibreOffice when saving in .docx (SAXParse error).
See John_Ha's post describing the 3 possible methods: Self-help methods to fix .docx files with SAXParse error.
Note: the 3 methods are alternatives, you don't need to apply all of them!

The 2nd method (explained in this post) is the closest to this tutorial. However, a .docx file can have multiple occurrences of this bug. If there are too many of them, the other 2 methods may be quicker in the end.
John_Ha
Volunteer
Posts: 9583
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Re: [Tutorial] Format error discovered in sub-document

Post by John_Ha »

Since Hagar's posts above we have had more examples posted to the forum and have been able to investigate the problem in more detail.

It appears that there are (at present) two different problems which require slightly different solutions. The sequence for fixing either problem is as follows (see full details in Hagar's posts above).

1. Open the file - you will get an error message like the image below. Record the 3309 number - it tells you where the error is located in the file.
error message.png
2. Unzip the .odt file and extract the content.xml file.

3. Open content.xml with an XML editor and click in the file until the editor shows that the cursor is at or close to the number (3309) you recorded. You have now found the location of the error. Once you have found the location you may find it easier to "pretty print" the file so it is easier to see what is happening.

4. Repair the error as described for the two cases below. If you pretty printed in Step 3, Linearise the XML (this undoes the pretty printing) before saving the file.

5. Save content.xml.

6. Insert content.xml back into the .odt file.

The .odt file is now repaired.

The two cases are as follows:

Case 1: Multiple added "office:name="__Annotation__714_93247550611111""

These additions appear in the middle of the first style definition in the file and corrupt it. They should not be there so the fix is to delete all occurrences of them so as to restore the style definition.
Note that the P1 Style has been corrupted by the addition of several Annotations.
You need to delete ALL occurrences of the Annotations until the P1 Style has been corrected.
After making any correction(s) it is sensible to use the editor's XML Syntax Checker to check the XML is grammatically correct. Correct any further errors which are shown to exist.

We are pretty certain this error happens when a .odt file has
  • two or more comments attached to highlighted ranges of text (as opposed to a location in text)
  • the comments have been deleted - probably when Record > Changes was on
If the user now
  • keeps Edit > Record Changes ON
  • deletes text containing two ranges with comments attached to those ranges

AOO (but not LO) then corrupts the file.

A bug report has been raised - see Issue 128356 - Track Changes and Annotations on text range can cause corruption. Applies to 4.x (all versions?).

Case 2: Repeated attributes such as w:themeShade, w:themeColor and w:cstheme

These repeated attribute definitions can appear anywhere in the file, and can appear multiple times, and in different places in the file. The fix is to find all repeats, and delete only the repeats so as to leave just one occurrence. So, in the example below, delete w:themeColor="accent1" in the red box.
When an attribute like w:themeColor is repeated you should delete the REPEATS and leave just ONE occurrence.
After making any correction(s) it is sensible to use the editor's XML Syntax Checker to check the XML is grammatically correct. Correct any further errors which are shown to exist.
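
When there are many repeats, the Case 2 fix (keep the first occurrence of each attribute in a tag, delete the later ones) can be sketched in Python as below. This is an illustration only, not the tool mentioned elsewhere in this topic: the names drop_repeated_attributes, ATTR and START_TAG are hypothetical, the regular expressions assume ordinary quoted attributes, and comments or CDATA sections are not treated specially. If a repeat carries a different value, only the first value survives. It does not perform the Case 1 fix, where all occurrences have to be deleted.

```python
import re

# One attribute: whitespace, a name, "=", then a quoted value.
ATTR = re.compile(r"""\s+([^\s=<>/"']+)\s*=\s*("[^"<]*"|'[^'<]*')""")
# A start tag or empty-element tag with its run of attributes.
START_TAG = re.compile(
    r"""<([A-Za-z_][^\s<>/]*)"""
    r"""((?:\s+[^\s=<>/"']+\s*=\s*(?:"[^"<]*"|'[^'<]*'))*)\s*(/?)>"""
)


def drop_repeated_attributes(xml_text):
    """Keep the first occurrence of each attribute name within every tag."""
    def fix_tag(match):
        name, attrs, slash = match.groups()
        seen = set()
        kept = []
        total = 0
        for attr in ATTR.finditer(attrs):
            total += 1
            if attr.group(1) not in seen:
                seen.add(attr.group(1))
                kept.append(f" {attr.group(1)}={attr.group(2)}")
        if len(kept) == total:
            # No repeats in this tag: leave it exactly as it was.
            return match.group(0)
        return f"<{name}{''.join(kept)}{slash}>"

    return START_TAG.sub(fix_tag, xml_text)
```

For example, with a made-up tag, drop_repeated_attributes('<w:color w:val="2E74B5" w:themeColor="accent1" w:themeShade="BF" w:themeColor="accent1"/>') drops the second w:themeColor. Always work on a copy and run the syntax check afterwards.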

Whereas these errors can occur in .odt files they also occur in .docx files which have been created or edited by LibreOffice. See [Tutorial] How to fix SAXParse errors in LibreOffice files for full instructions how to fix them.

Why do these errors occur?

We are not sure and investigation is continuing to understand these errors.

It is now known that the first error can be caused when AOO is used to edit a .odt file; the document has Edit > Record > Changes set to ON; two (or more?) comments, each attached to a range of characters, are deleted; and the document is saved. AOO 4.1.7 creates the error and it can be replicated. LO 6.3.5.2 does not create the error.

We do not understand how Case 2 errors occur.

SAXParse errors are caused by a known LO bug.
Last edited by John_Ha on Thu Dec 31, 2020 7:30 pm, edited 3 times in total.
LO 6.4.4.2, Windows 10 Home 64 bit

See the Writer Guide, the Writer FAQ, the Writer Tutorials and Writer for students.

Remember: Always save your Writer files as .odt files. - see here for the many reasons why.

Re: [Tutorial] Format error discovered in sub-document

Post by John_Ha »

Another type of format error has now been observed where a single character in a definition was changed.

Case 3: Single character in a definition is changed

See [Solved] Format error discovered in the file in sub-document where a spreadsheet .ods file was corrupted. The fix was to edit content.xml and change "pable" back to "table".
Clipboard01.gif
Clipboard01.gif (8.04 KiB) Viewed 30836 times

Re: [Tutorial] Format error discovered in sub-document

Post by John_Ha »

I have undertaken a little more analysis of errors of this kind. I had previously created a bug report for it, namely Issue 127745 - Read Error: Format error discovered ... at n,nnnn (row,col).

I have conducted some more tests and I have now come to the following conclusions:

I now believe the office:name error is caused by AOO when
  • AOO is used to edit a .odt file which has
  • two (or more?) comments, each attached to a range of characters (and may be deleted?)
  • Edit > Record > Changes is ON
  • the user deletes a run of text which includes two highlighted ranges where a comment has been attached to a range of text
  • the document is saved
AOO 4.1.7 creates the error and it can be replicated. LO 6.3.5.2 does not create the error.

There are a number of reports of this error on the asklibreoffice forum where LO users get the problem. It suggests, therefore, either that earlier versions of LO caused it; or that the person to whom the LO user sent the file caused it. That user is far more likely to use MS Word than AOO (typically a university supervisor is commenting on a student's work) so it is possible that MS Word similarly corrupts .odt files.

1. The error seems to arise when an AOO or LO user sends a .odt file (or .docx file if LO) to a person who uses MS Word, where that person adds comments to a range of characters. Note that adding comments does not require Edit > Changes > Record ..., to be switched on but a person adding comments is usually also recording changes so it is a little difficult to separate the two as potential causes.

2. I am now not sure how the corruption happens. Does MS Word corrupt the original .odt file and return a corrupted file to the AOO user? Or does MS Word not corrupt the file, but AOO cannot handle what MS Word sends back, and AOO then corrupts the file?

3. If I correct a corrupted file by deleting the repeated attributes I then get different behaviours with AOO and LO:

When AOO 4.1.6 saves the corrected file under another name, AOO corrupts the corrected file. Hence the new corruption is definitely introduced by AOO.

When LO 6.0.2.1 saves the corrected file under another name, LO does not corrupt the corrected file.

This does suggest that LO may be more stable than AOO when exchanging files with MS Word.
Last edited by John_Ha on Thu Dec 31, 2020 7:40 pm, edited 1 time in total.

Re: [Tutorial] Format error discovered in sub-document

Post by John_Ha »

This is a bug in AOO, but not in LO, which corrupts the file when
  • Edit > Changes ..., is set to RECORD, and
  • a user deletes a range of text which includes two comments each of which is attached to a range of text.
Other sequences may also cause the problem but the above procedure is reproducible.

See Issue 128356 - Track Changes and Annotations on text range can cause corruption. Applies to 4.x (all versions?)
Deleting the highlighted run of text while Edit > Changes is set to RECORD causes the corruption
I therefore strongly recommend users who use tracked changes to move to LibreOffice.
Last edited by John_Ha on Thu Feb 17, 2022 10:24 pm, edited 3 times in total.

Re: [Tutorial] Format error discovered in sub-document

Post by John_Ha »

I had a file with about 30 errors and I had to find them manually using Notepad++.

I downloaded XML Copy Editor and found it much easier to use as it stepped through the file finding each line with an error.

However, XML Copy Editor would not pretty print because of the errors, so I needed to use Notepad++ to pretty print the file which I then saved. I edited the saved file with XML Copy Editor, saved it, and used Notepad++ to re-linearise it.

XML Copy Editor missed some errors when using F2 to step through the file. However, issuing the pretty-print command in XML Copy Editor located these errors.
RoryOF
Moderator
Posts: 34570
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: [Tutorial] Format error discovered in sub-document

Post by RoryOF »

John_Ha wrote: However, XML Copy Editor would not pretty print because of the errors,
That is the major shortcoming I find with XML Copy Editor, the inability to PrettyPrint when there are errors in the file. It leaves one having to search through massive walls of text to locate the errors.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
MrProgrammer
Moderator
Posts: 4883
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

Re: [Tutorial] Format error discovered in sub-document

Post by MrProgrammer »

John_Ha wrote: Wed Apr 04, 2018 12:19 am Case 1: Multiple added "office:name="__Annotation__714_93247550611111""
Case 2: Repeated attributes such as w:themeShade, w:themeColor and w:cstheme
John_Ha wrote: Wed Jun 27, 2018 10:56 am Case 3: Single character in a definition is changed
For people with MacOS or Linux, [Tutorial] Delete duplicate attributes tool can often fix the document without manual editing. It does not attempt to fix case 3.

RoryOF wrote: Mon Feb 08, 2021 3:57 pm That is the major shortcoming I find with XML Copy Editor, the inability to PrettyPrint when there are errors in the file.
I have had similar challenges with the tools I use, so if I have to manually edit the XML when I extract it I use:
unzip -p file.od? content.xml | sed -e $'s/</\\\n</g' >bad.xml

It's not quite "pretty print", just less ugly than the XML in the document. Command $'s/</\\\n</g' ensures each tag starts on a new line.
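
For anyone without unzip and sed (on Windows, for instance), roughly the same result can be obtained with a short Python sketch. The function name tags_on_new_lines is my own, and the output is meant for reading only, like bad.xml above: the added newlines shift the positions reported in the error message and should not be saved back into the document.

```python
import zipfile


def tags_on_new_lines(odf_path):
    """Return content.xml from the archive with every tag starting on a new line."""
    with zipfile.ZipFile(odf_path) as archive:
        text = archive.read("content.xml").decode("utf-8", errors="replace")
    # Same idea as the sed command: put a newline in front of each "<".
    return text.replace("<", "\n<")
```

To save the result, write the returned string to a file, for example open("bad.xml", "w", encoding="utf-8").write(tags_on_new_lines("file.odt")), where file.odt is a placeholder.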
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.6.3, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).