Edit: An updated version of this tool is in [Tutorial] Delete duplicate attributes tool
I plan to make this into a tutorial and am interested in any comments:
Many cases of the "Format error discovered in sub-document" message are caused by duplicate attributes in an XML tag. I have written a tool in Perl to fix that type of problem. Within each tag, it discards any attribute whose name has already appeared, keeping only the first occurrence. You'll need to install Perl to use this tool, though it is commonly already available on modern systems.
For example (office:name is duplicated):
<style:style
office:name="__Annotation__1" office:name="__Annotation__11" office:name="__Annotation__111"
style:name="P1" style:family="paragraph" style:parent-style-name="Text_20_body"
style:master-page-name="Standard">
after using the tool becomes
<style:style
office:name="__Annotation__1"
style:name="P1" style:family="paragraph" style:parent-style-name="Text_20_body"
style:master-page-name="Standard">
Duplicate attributes may occur in content.xml or styles.xml, and possibly in other sub-files. The tool is only intended to fix this one specific problem. It can't fix problems like bad checksums, truncated files, mangled tags, mismatched quotes, etc. If you attempt to use it for those errors, it will probably just ignore them, though it might make the situation worse. Duplicate attribute problems seem to occur regularly in the forum, and I hope this tool will simplify solving them.
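To check whether a sub-file has this problem at all, a rough shell sketch like the one below may help. It is not part of the tool: the function name list_duplicate_attrs is made up for illustration, it assumes a grep that supports -o (GNU and BSD grep do), and it assumes attribute values contain no literal ">" and no text that looks like " name=". It reads XML on STDIN and prints each attribute name that appears more than once within a single tag.

```shell
# Hypothetical helper: print attribute names duplicated within one tag
list_duplicate_attrs() {
    tr '\n' ' ' |                  # Join lines so a tag never spans two lines
    tr '>' '\n' |                  # Then put (roughly) one tag on each line
    sed -n 's/.*<//p' |            # Keep only the text after the last "<"
    while IFS= read -r tag; do
        printf '%s\n' "$tag" |
            { grep -oE '[[:space:]][A-Za-z_][A-Za-z0-9_.:-]*=' || true; } |
            tr -d ' \t=' |         # Leave just the attribute name
            sort | uniq -d         # Names seen more than once in this tag
    done
}
```

For example, unzip -p "«file»" content.xml | list_duplicate_attrs should print nothing for a sub-file without duplicate attributes, in which case the format error probably has some other cause.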
To use the tool, copy the program below into a file on your computer. Extract the bad XML. Run Perl, passing the name of the program file as a parameter, with STDIN pointing to the bad XML and STDOUT pointing to a new file that will hold the repaired XML. Then insert the repaired file back into the OpenOffice document. For example:
prog=~/Documents/Computers/Projects/Perl/DDA.pl # The program below
unzip -p "«file»" content.xml >bad.xml # Extract content.xml from the damaged file to bad.xml
perl $prog <bad.xml >content.xml # Run Perl program reading bad.xml and writing content.xml
zip "«file»" content.xml # Repair the document
rm bad.xml content.xml # Delete work files
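The same steps can be wrapped in a small shell function so that the damaged document is backed up first and the work files live in a temporary directory. This is only a sketch under some assumptions: repair_member is a made-up name, it expects zip, unzip, perl and mktemp to be installed, and it only handles sub-files at the top level of the document, such as content.xml or styles.xml.

```shell
# repair_member DOCUMENT MEMBER PROGRAM
# Hypothetical helper: run PROGRAM (the Perl program) over one sub-file of DOCUMENT
repair_member() {
    doc=$1; member=$2; prog=$3
    work=$(mktemp -d) || return 1                                # Private directory for work files
    cp -p "$doc" "$doc.bak" || return 1                          # Keep a copy of the damaged document
    unzip -p "$doc" "$member" >"$work/bad.xml" || return 1       # Extract the bad XML
    perl "$prog" <"$work/bad.xml" >"$work/$member" || return 1   # Write the repaired XML
    abs=$(cd "$(dirname "$doc")" && pwd)/$(basename "$doc")      # Absolute path, needed after cd
    (cd "$work" && zip -q "$abs" "$member") || return 1          # Replace the sub-file in the document
    rm -r "$work"                                                # Delete work files
}
```

With this, repair_member "«file»" styles.xml "$prog" would repair styles.xml instead of content.xml, and the untouched original stays next to the document with .bak appended to its name.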
Problems which the tool can repair include:
[Solved] Format error discovered in … content.xml
[Solved] Format error in sub-document content
[Solved] Read-Error after using Edit > Changes > Record
[Solved] Format error while opening presentation
[Solved] Read-error format error discovered in sub-document styles.xml
[Solved] Format error discovered in content.xml at 2,44475(row,col) ← Today!
Here is the Perl program:
#!/usr/bin/env perl
# V1R1M0 20220723 Delete duplicate attributes in bad AOO XML
# V1R2M0 20220729 Use subroutine for parsing and output
# Input is STDIN; Output is STDOUT
die "$0 does not accept parameters; Use STDIN and STDOUT; Exit" if @ARGV;
use strict; use warnings; # Program initialization
my ($in); # Input XML (used in subroutine)
my ($aname,$aequal); # Attribute name, "=" sign
my ($dl,$adata,$dr); # Attribute value, left/right delimiters
my %attr; # Attribute name hash (values unimportant)
sub Output { # Output matched data; Return the rest
if ($in =~ m"($_[0])(.*)$"s) { print $1; return $2; } # Output the match; Return what follows it
print $in; return; # No match: Output all remaining input; Return undef
}
$/ = undef; $in = <STDIN>; # Read entire file
for (;;) { # Look for <TagName and whitespace
$in = Output('.*?<[^!/?][^\s/>]+\s*'); # Output data before tag attributes
last unless $in; # Exit if no more input
%attr = (); # Clear attribute hash for new tag
while ($in !~ m"^/?>"s) { # Process attributes until end of tag (/> or >)
($aname,$aequal,$in) = $in =~ m"([^\s=]+)(\s*=\s*)(.*)$"s or die "$0: No attribute name found; Exit"; # Attr name ends at whitespace or =
($dl,$adata,$dr,$in) = $in =~ m"(.)(.*?)(\1\s*)(.*)$"s or die "$0: Unterminated attribute value; Exit"; # Attr value is between delimiters
next if exists($attr{$aname}); # Skip known attribute
$attr{$aname} = ''; # Save attribute name in hash
print $aname.$aequal.$dl.$adata.$dr; # Copy attribute to STDOUT if not known
} # End of attributes
$in = Output('/?>'); # Output end of tag, either /> or >
} # End of input