[Solved] Request for comments: Delete duplicate attributes tool

MrProgrammer
Moderator
Posts: 4905
Joined: Fri Jun 04, 2010 7:57 pm
Location: Wisconsin, USA

Post by MrProgrammer »

 Edit: An updated version of this tool is in [Tutorial] Delete duplicate attributes tool 

I plan to make this into a tutorial and am interested in any comments:

Many cases of the "Format error discovered in sub-document" message are caused by duplicate attributes in an XML tag. I have written a tool in Perl to fix that type of problem. It discards any attributes with a duplicate name, keeping only the first of them. You'll need to install Perl to use this tool, though it is already available on many modern systems.

For example (office:name is duplicated):
<style:style
   office:name="__Annotation__1" office:name="__Annotation__11" office:name="__Annotation__111"
   style:name="P1" style:family="paragraph" style:parent-style-name="Text_20_body"
   style:master-page-name="Standard">
after using the tool becomes
<style:style
   office:name="__Annotation__1"
   style:name="P1" style:family="paragraph" style:parent-style-name="Text_20_body"
   style:master-page-name="Standard">

Duplicate attributes may occur in content.xml or styles.xml, and possibly in other sub-files. The tool is only intended to fix this one specific problem. It can't fix problems like bad checksums, truncated files, mangled tags, mismatched quotes, etc. If you attempt to use it for those errors, it will probably just ignore them, though it might make the situation worse. Duplicate attribute problems seem to occur regularly in the forum, and I hope this tool will simplify solving them.
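
To see whether a given sub-file has this problem before running the tool, a shell sketch like the following may help. It is not part of the tool: find_dup_attrs is a hypothetical helper name, and it assumes double-quoted attribute values with no literal > inside them. It prints each attribute name that occurs more than once within a single tag.

```shell
# find_dup_attrs (hypothetical helper): read XML on STDIN and print
# every attribute name that is repeated within one tag.
# Assumes double-quoted values and no literal ">" inside a value.
find_dup_attrs() {
    tr '\n' ' ' | tr '>' '\n' | awk '
    {
        tag = $0
        if (!sub(/^[^<]*</, "", tag)) next       # no tag in this record
        split("", seen)                          # clear the per-tag name table
        while (match(tag, /[^ \t="]+[ \t]*=[ \t]*"[^"]*"/)) {
            attr = substr(tag, RSTART, RLENGTH)  # one name="value" pair
            tag  = substr(tag, RSTART + RLENGTH) # rest of the tag
            name = attr
            sub(/[ \t]*=.*/, "", name)           # keep only the attribute name
            if (seen[name]++ == 1) print name    # report each duplicate once per tag
        }
    }'
}
```

For example, unzip -p "«file»" content.xml | find_dup_attrs would print office:name for the tag shown above. No output suggests the format error has some other cause, which this tool is not meant to fix.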

To use the tool, copy the program below into a file on your computer. Extract the bad XML. Run Perl, passing the name of that file as a parameter, pointing STDIN to the bad XML and STDOUT to a new location to hold the repaired file. Then insert the repaired file into the OpenOffice document. For example:

prog=~/Documents/Computers/Projects/Perl/DDA.pl   # The program below
unzip -p "«file»" content.xml >bad.xml            # Extract content.xml from the damaged file to bad.xml
perl $prog <bad.xml >content.xml                  # Run Perl program reading bad.xml and writing content.xml
zip "«file»" content.xml                          # Repair the document
rm bad.xml content.xml                            # Delete work files
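
Since the tool only deletes text, the repaired file should be non-empty and no larger than the bad one. A small check along these lines, run before the zip step, may catch a failed run (for example, Perl not being installed, which leaves content.xml empty). check_repair is a hypothetical name and not part of the tool.

```shell
# check_repair BAD REPAIRED (hypothetical helper)
# Succeeds only if REPAIRED is non-empty and no larger than BAD.
check_repair() {
    if [ ! -s "$2" ]; then
        echo "check_repair: $2 is missing or empty" >&2
        return 1
    fi
    bad_size=$(($(wc -c <"$1")))         # arithmetic expansion strips wc padding
    new_size=$(($(wc -c <"$2")))
    if [ "$new_size" -gt "$bad_size" ]; then
        echo "check_repair: $2 is larger than $1" >&2
        return 1
    fi
}
```

It could be used as: check_repair bad.xml content.xml && zip "«file»" content.xml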

Problems which the tool can repair include:
[Solved] Format error discovered in … content.xml
[Solved] Format error in sub-document content
[Solved] Read-Error after using Edit > Changes > Record
[Solved] Format error while opening presentation
[Solved] Read-error format error discovered in sub-document styles.xml
[Solved] Format error discovered in content.xml at 2,44475(row,col)          ← Today!

Here is the Perl program:

#!/usr/bin/env perl
# V1R1M0 20220723 Delete duplicate attributes in bad AOO XML
# V1R2M0 20220729 Use subroutine for parsing and output
# Input is STDIN; Output is STDOUT
die "$0 does not accept parameters; Use STDIN and STDOUT; Exit" if @ARGV;

use strict; use warnings;                                         # Program initialization
my ($in);                                                         # Input XML (used in subroutine)
my ($aname,$aequal);                                              # Attribute name, "=" sign
my ($dl,$adata,$dr);                                              # Attribute value, left/right delimiters
my %attr;                                                         # Attribute name hash (values unimportant)

sub Output {                                                      # Output matched data; Return the rest
$in =~ m"($_[0])(.*)$"s;                                          # Look for data to output and to return
print(defined $1 ? $1 : $in); return $2;                          # Output first match, or $in if no match
}

$/ = undef; $in = <STDIN>;                                        # Read entire file
for (;;) {                                                        # Look for <TagName and whitespace
   $in = Output('.*?<[^!/?][^\s/>]+\s*');                         # Output data before tag attributes
   last unless $in;                                               # Exit if no more input
   %attr = ();                                                    # Clear attribute hash for new tag
   while ($in !~ m"^/?>"s) {                                      # Process attributes until end of tag (/> or >)
      ($aname,$aequal,$in) = $in =~ m"([^\s=]+)(\s*=\s*)(.*)$"s;  # Attr name ends at whitespace or =
      ($dl,$adata,$dr,$in) = $in =~ m"(.)(.*?)(\1\s*)(.*)$"s;     # Attr value is between delimiters
      next if exists($attr{$aname});                              # Skip known attribute
      $attr{$aname} = '';                                         # Save attribute name in hash
      print $aname.$aequal.$dl.$adata.$dr;                        # Copy attribute to STDOUT if not known
   }                                                              # End of attributes
   $in = Output('/?>');                                           # Output end of tag, either /> or >
}                                                                 # End of input
Mr. Programmer
AOO 4.1.7 Build 9800, MacOS 13.6.3, iMac Intel.   The locale for any menus or Calc formulas in my posts is English (USA).
Villeroy
Volunteer
Posts: 31279
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Request for comments: Delete duplicate attributes tool

Post by Villeroy »

The program does not return. Output of program top:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                      
26304 andreas   20   0   20588   4764   4308 S   0,0  0,0   0:00.00 perl
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice

Re: Request for comments: Delete duplicate attributes tool

Post by MrProgrammer »

What was the input file, Villeroy? It reads from STDIN.

Re: Request for comments: Delete duplicate attributes tool

Post by Villeroy »

OK, OK. I overlooked the < > around the input file name.