[Tutorial] Delete duplicate attributes tool

Post by MrProgrammer

Note that questions are not allowed in the Tutorials section of the forum. Ask them in one of the Applications forums (Writer, Calc, …).


Many cases of the "Format error discovered in sub-document" message are caused by duplicate attributes in an XML tag. I have written a tool in Perl to fix that type of problem. It discards any attributes with a duplicate name, keeping only the first occurrence. You'll need to have Perl installed to use this tool, though it is commonly pre-installed on modern systems. Any recent MacOS system will include Perl. You can install Perl on Linux or Windows too.

This is an illustration of the problem which causes that "Format error" message (office:name is duplicated):
<style:style
   office:name="__Annotation__1" 
   office:name="__Annotation__11"
   office:name="__Annotation__111"
   style:name="P1" style:family="paragraph" style:parent-style-name="Text_20_body"
   style:master-page-name="Standard">
After using the tool, it becomes:
<style:style
   office:name="__Annotation__1"
   style:name="P1" style:family="paragraph" style:parent-style-name="Text_20_body"
   style:master-page-name="Standard">
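
To check whether a sub-file has this problem before repairing it, a rough shell sketch along these lines can list attribute names that repeat within one tag. The function name dup_attrs is hypothetical and not part of the tool. The sketch assumes attributes written as name="value" with double quotes, no > character inside any value, and a grep that supports -o (GNU and BSD grep do). It starts several processes per tag, so it can be slow on a large document.

```shell
# dup_attrs: hypothetical helper, not part of the DDA.pl tool.
# Reads XML on STDIN and prints each attribute name that repeats within a tag.
# Assumes double-quoted values and no ">" inside any attribute value.
dup_attrs() {
  tr '\n' ' ' |                                   # Join lines so a tag is never split
  grep -o '<[^!/?][^>]*>' |                       # One opening or empty tag per line
  while IFS= read -r tag; do
    printf '%s\n' "$tag" |
      grep -o '[^[:space:]="<]*="[^"]*"' |        # Each name="value" pair in this tag
      sed 's/=.*//' |                             # Keep only the attribute name
      sort | uniq -d                              # Names seen more than once
  done
}
```

For the illustration above, piping the damaged tag into dup_attrs prints office:name. A name is printed once for every tag in which it repeats.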

Duplicate attributes may occur in content.xml or styles.xml, and possibly in other sub-files. The tool is only intended to fix this one specific problem. It can't fix problems like bad checksums, truncated files, mangled tags, mismatched quotes, etc. If you attempt to use it for those errors, it will probably just ignore them, though it might make the situation worse. It would be wise to make a backup of your document before attempting to repair it. Duplicate attribute problems seem to occur regularly in the forum as a result of Issue 128356 - Track Changes and Annotations on text range can cause corruption. Applies to 4.x (all versions?). I hope this tool will simplify solving them.

To use the tool, copy the program below into a file, say DDA.pl, on your computer. Extract the bad XML with unzip. Run perl, passing the name of my program as a parameter, and pointing STDIN to the bad XML and STDOUT to a new location to hold the repaired file. You must use the < and > characters in the perl command, as shown below. The bad XML is often content.xml, but it might be styles.xml or something else. The "Format error discovered in sub-document" message will tell you which one to fix. If no attributes were removed, the tool found nothing that it could fix. If the report shows that at least one attribute was removed, then use zip to insert the repaired file into the OpenOffice document. For example:

prog=~/Documents/Computers/Projects/Perl/DDA.pl   # The name of the file containing the program below
unzip -p "«file»" content.xml >bad.xml            # Extract content.xml from the damaged file to bad.xml
perl "$prog" <bad.xml >content.xml                # Run Perl program reading bad.xml and writing content.xml
zip "«file»" content.xml                          # Repair the document by inserting the corrected content
rm bad.xml content.xml                            # Delete work files
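
The tool's exit code (0 if any attributes were removed, 1 if none were) can be used to skip the zip step when there was nothing to fix. This is a minimal sketch for a POSIX shell. The function name run_fixer and its argument order are hypothetical, introduced only for this illustration.

```shell
# run_fixer: hypothetical wrapper, not part of the DDA.pl tool.
# Usage: run_fixer INPUT OUTPUT COMMAND [ARGS...]
# Runs COMMAND with STDIN from INPUT and STDOUT to OUTPUT, and keeps
# OUTPUT only when COMMAND exits with code 0 (attributes were removed).
run_fixer() {
  fixer_in=$1; fixer_out=$2; shift 2
  fixer_rc=0
  "$@" <"$fixer_in" >"$fixer_out" || fixer_rc=$?   # Remember the exit code
  case $fixer_rc in
    0) echo "Repaired XML written to $fixer_out" ;;
    1) echo "Nothing removed; discarding $fixer_out"; rm -f "$fixer_out" ;;
    *) echo "Fixer failed with exit code $fixer_rc" >&2; rm -f "$fixer_out" ;;
  esac
  return "$fixer_rc"
}
```

With this helper, the perl and zip steps could be combined as: run_fixer bad.xml content.xml perl "$prog" && zip "«file»" content.xml, so that zip runs only after a successful repair. Copying the document first, for example with cp -p "«file»" "«file».bak", keeps the recommended backup.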

On MacOS you'll enter those commands in Terminal. «file» in the unzip and zip commands represents the name of the document you want to fix. You'll need to supply its path, unless the document is in the working directory. You should put quotes around the document name, as shown.

Problems which the tool can repair include:
2021-04-07 [Solved] Format error discovered in … content.xml
2021-05-26 [Solved] Format error in sub-document content
2022-01-26 [Solved] Read-Error after using Edit > Changes > Record
2022-05-15 [Solved] Format error while opening presentation
2022-06-15 [Solved] Read-error format error discovered in … styles.xml
2022-06-30 [Solved] Format error discovered in content.xml …
2022-08-28 [Solved] Format Error in content.xml
Some documents could contain problems in addition to duplicate attributes, so a document the tool has repaired could still fail with a Format error message. We have a tutorial which discusses some of the other problems, but often this tool is all you'll need to fix your document.

Here is the Perl program which you must copy to a file on your computer:

#!/usr/bin/env perl
# V1R1M0 2022-07-23 Delete duplicate attributes in bad AOO XML
# V1R2M0 2022-07-29 Use subroutine for parsing and output
# V1R2M1 2022-08-03 Exit if any parameters were passed
# V1R3M0 2022-08-06 Use variables $1 (captured text) and $+[0] (post-match position)
# V1R4M0 2022-08-08 Report counts of attributes which were processed
# Input is STDIN; Output is STDOUT; Report to STDERR
# Exit code is 0 if any attributes were removed, 1 if none were removed, or 255 on error
die "$0 does not accept parameters; Use STDIN and STDOUT;\n" if @ARGV; # Exit code 255

use strict; use warnings;                        # Program initialization
$/ = undef;                                      # No input record delimiter, read entire file
my $in = <STDIN>;                                # Input XML (used in subroutine)
my $attr;                                        # Current attribute «attr»="«value»"
my %attr;                                        # Attribute name hash (values unimportant)
my @counts = ("Attributes %s: %i\n" x 3,         # Initialize array for report ...
   'read   ',0,'written',0,'removed');           # ... in count is [2], out count is [4]

sub Output {                                     # Try to match a pattern
if ($in =~ m"($_[0])"s)                          # Look for matching pattern
   { print $1; return substr($in,$+[0]); }       # Send matched text to STDOUT; return the rest
print $in;                                       # If no match, send rest of XML to STDOUT
printf STDERR @counts,$counts[2]-$counts[4];     # Print counts of in,out,removed to STDERR
exit ($counts[2]==$counts[4]);                   # Set exit code as described above
}

for (;;) {                                       # Look for <TagName and whitespace
   $in = Output('.*?<[^!/?][^\s/>]+\s*');        # Output data before tag attributes
   %attr = ();                                   # Clear attribute hash for new tag
   while ($in !~ m"^/?>"s) {                     # Process attributes until end of tag (/> or >)
      $counts[2]++;                              # Increment input attribute count
      $in =~ m"([^\s=]+)\s*=\s*(.).*?\2\s*"s     # Attr name, =, delim, Attr value, delim
         or die "Malformed attribute\n";         # Stop instead of looping if no match
      $attr = substr($in,0,$+[0]);               # Save attribute for printing
      $in = substr($in,$+[0]);                   # Remove attribute from XML
      next if exists($attr{$1});                 # Skip previously found attribute
      $attr{$1} = '';                            # Remember attribute name in hash
      print $attr;                               # Copy attribute to STDOUT if new in tag
      $counts[4]++;                              # Increment output attribute count
   }                                             # End of attributes in tag
   $in = Output('/?>');                          # Output end of tag, either /> or >
}                                                # End of input
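
The heart of the program is the %attr hash: within one tag, an attribute is copied to the output only the first time its name is seen. The same "first occurrence wins" idea can be sketched in shell with awk, here on a simplified input of one attribute per line. That input layout is an assumption made for illustration, since the Perl program parses attributes directly from the tag. The function name first_per_name is hypothetical.

```shell
# first_per_name: hypothetical illustration, not part of the DDA.pl tool.
# Reads lines of the form name="value" and prints only the first line
# for each name, like the exists($attr{$1}) test in the Perl loop.
first_per_name() {
  awk -F= '!seen[$1]++'      # seen[name] is 0 (false) only on first sight
}

first_per_name <<'EOF'
office:name="__Annotation__1"
office:name="__Annotation__11"
office:name="__Annotation__111"
style:name="P1"
EOF
```

Run as written, this prints the office:name="__Annotation__1" line and the style:name="P1" line, matching the repaired tag in the illustration near the top of this tutorial.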

Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.6.3, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).