[Solved] Request for comments: Delete duplicate attributes tool

Post by **MrProgrammer** » Mon Aug 01, 2022 8:07 pm

Edit: An updated version of this tool is in [Tutorial] Delete duplicate attributes tool

I plan to make this into a tutorial and am interested in any comments:

Many cases of the "Format error discovered in sub-document" message are caused by duplicate attributes in an XML tag. I have written a tool in Perl to fix that type of problem. It discards any attributes with a duplicate name, keeping only the first of them. You'll need Perl to use this tool; it is already installed on many modern systems.

For example (office:name is duplicated):

<style:style
   office:name="__Annotation__1" office:name="__Annotation__11" office:name="__Annotation__111"
   style:name="P1" style:family="paragraph" style:parent-style-name="Text_20_body"
   style:master-page-name="Standard">

after using the tool becomes

<style:style
   office:name="__Annotation__1"
   style:name="P1" style:family="paragraph" style:parent-style-name="Text_20_body"
   style:master-page-name="Standard">

Duplicate attributes may occur in content.xml or styles.xml, and possibly in other sub-files. The tool is only intended to fix this one specific problem. It can't fix problems like bad checksums, truncated files, mangled tags, mismatched quotes, etc. If you attempt to use it for those errors, it will probably just ignore them, though it might make the situation worse. Duplicate attribute problems seem to occur regularly in the forum, and I hope this tool will simplify solving them.
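
Since the tool only addresses duplicate attributes, it can help to confirm that an extracted sub-file really contains one. The following is only a rough sketch and not part of the tool: has_dup_attr is a hypothetical helper name, the sample file names are made up, and the grep pattern is a heuristic. It checks one line at a time, so a tag split across several lines is not examined, and an attribute value containing > or text that looks like ' name=' can fool it.

```shell
#!/bin/sh
# Heuristic check for a repeated attribute name inside one tag.
# has_dup_attr is a hypothetical helper, not part of the Perl tool.
has_dup_attr() {
    # Matches: <... NAME= ... NAME=  with no < or > in between (same tag, same line)
    grep -q '<[^<>]*[[:space:]]\([^[:space:]=<>][^[:space:]=<>]*\)=[^<>]*[[:space:]]\1=' "$1"
}

# Two made-up sample files for illustration
printf '%s\n' '<style:style office:name="a" office:name="b" style:name="P1">' >sample-bad.xml
printf '%s\n' '<style:style office:name="a" style:name="P1">' >sample-good.xml

for f in sample-bad.xml sample-good.xml; do
    if has_dup_attr "$f"; then
        echo "$f: duplicate attribute found"
    else
        echo "$f: no duplicate attribute found"
    fi
done
rm sample-bad.xml sample-good.xml         # Delete the sample files
```

Running the same check on the repaired file afterwards gives a quick indication of whether any duplicates remain, within the limits of the heuristic.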

To use the tool, copy the program below into a file on your computer. Extract the bad XML. Run Perl, passing the name of the program file as a parameter, with STDIN pointing to the bad XML and STDOUT pointing to a new location to hold the repaired file. Then insert the repaired file into the OpenOffice document. For example:

prog=~/Documents/Computers/Projects/Perl/DDA.pl   # The program below
unzip -p "«file»" content.xml >bad.xml            # Extract content.xml from the damaged file to bad.xml
perl $prog <bad.xml >content.xml                  # Run Perl program reading bad.xml and writing content.xml
zip "«file»" content.xml                          # Repair the document
rm bad.xml content.xml                            # Delete work files

Problems which the tool can repair include:
[Solved] Format error discovered in … content.xml
[Solved] Format error in sub-document content
[Solved] Read-Error after using Edit > Changes > Record
[Solved] Format error while opening presentation
[Solved] Read-error format error discovered in sub-document styles.xml
[Solved] Format error discovered in content.xml at 2,44475(row,col) ← Today!

Here is the Perl program:

#!/usr/bin/env perl
# V1R1M0 20220723 Delete duplicate attributes in bad AOO XML
# V1R2M0 20220729 Use subroutine for parsing and output
# Input is STDIN; Output is STDOUT
die "$0 does not accept parameters; Use STDIN and STDOUT; Exit" if @ARGV;

use strict; use warnings;                                         # Program initialization
my ($in);                                                         # Input XML (used in subroutine)
my ($aname,$aequal);                                              # Attribute name, "=" sign
my ($dl,$adata,$dr);                                              # Attribute value, left/right delimiters
my %attr;                                                         # Attribute name hash (values unimportant)

sub Output {                                                      # Output matched data; Return the rest
$in =~ m"($_[0])(.*)$"s;                                          # Look for data to output and to return
print(defined $1 ? $1 : $in); return $2;                          # Output first match, or $in if no match
}

$/ = undef; $in = <STDIN>;                                        # Read entire file
for (;;) {                                                        # Look for <TagName and whitespace
   $in = Output('.*?<[^!/?][^\s/>]+\s*');                         # Output data before tag attributes
   last unless $in;                                               # Exit if no more input
   %attr = ();                                                    # Clear attribute hash for new tag
   while ($in !~ m"^/?>"s) {                                      # Process attributes until end of tag (/> or >)
      ($aname,$aequal,$in) = $in =~ m"([^\s=]+)(\s*=\s*)(.*)$"s;  # Attr name ends at whitespace or =
      ($dl,$adata,$dr,$in) = $in =~ m"(.)(.*?)(\1\s*)(.*)$"s;     # Attr value is between delimiters
      next if exists($attr{$aname});                              # Skip known attribute
      $attr{$aname} = '';                                         # Save attribute name in hash
      print $aname.$aequal.$dl.$adata.$dr;                        # Copy attribute to STDOUT if not known
   }                                                              # End of attributes
   $in = Output('/?>');                                           # Output end of tag, either /> or >
}                                                                 # End of input

Post by **Villeroy** » Tue Aug 02, 2022 9:33 am

The program does not return. Output of program top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26304 andreas   20   0   20588   4764   4308 S   0,0  0,0   0:00.00 perl

Post by **MrProgrammer** » Tue Aug 02, 2022 3:35 pm

What was the input file, Villeroy? It reads from STDIN.

Post by **Villeroy** » Tue Aug 02, 2022 7:17 pm

OK, OK. I overlooked the < > around the input file name.
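
For anyone else who runs into this: when the program is started without the < redirection, it waits for input from the terminal and so appears to hang. One possible guard, offered only as a sketch, is a small shell wrapper that refuses to start a filter when standard input is a terminal. The name from_stdin_only is hypothetical, and cat stands in for the Perl program in the demonstration.

```shell
#!/bin/sh
# from_stdin_only is a hypothetical wrapper, not part of the tool.
# It runs the given filter command only when STDIN has been redirected.
from_stdin_only() {
    if [ -t 0 ]; then                     # STDIN is a terminal: nothing was redirected
        echo "No input: use  <bad.xml >content.xml" >&2
        return 2
    fi
    "$@"                                  # Run the filter, for example perl "$prog"
}

# Demonstration with cat standing in for the Perl program
printf '%s\n' '<a x="1"/>' | from_stdin_only cat
```

With the program from the first post this would be called as: from_stdin_only perl "$prog" <bad.xml >content.xml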
