Removing redundant format tags

uwedulz
Posts: 5
Joined: Thu Jan 19, 2012 10:13 am

Removing redundant format tags

Post by uwedulz »

Hi, I am a freelance translator using OmegaT as translation tool.
I have recently run into the problem that documents created with a newer version of MS Office contain lots of redundant tags. These tags enclose virtually every single word.

Although (or because) OmegaT displays those tags, translating such texts while keeping the formatting is very time-consuming. Furthermore, the redundant tags run counter to the idea of translation memories, since otherwise identical segments no longer match, which makes CAT tools difficult to use.

It looks like this:

Code:

Nabídky<f0> </f0>z menu<f1> </f1>pro<f2> </f2>překlad<f3> </f3>(nejlépe<f4> </f4>ve<f5> </f5>všech<f6> </f6>jazycích)
All of these tags are redundant, since every word has the same format. The original XML looks like this:

Code:

<text:h text:style-name="Heading_20_1" text:outline-level="1">
Nabídky
<text:span text:style-name="T1"> </text:span>
z menu
<text:span text:style-name="T1"> </text:span>
pro
<text:span text:style-name="T1"> </text:span>
překlad
<text:span text:style-name="T1"> </text:span>
(nejlépe
<text:span text:style-name="T1"> </text:span>
ve
<text:span text:style-name="T1"> </text:span>
všech
<text:span text:style-name="T1"> </text:span>
jazycích)
</text:h>
Until now I have worked around this problem manually: I cut an entire paragraph that has uniform formatting and immediately paste it back as unformatted text. This removes all redundant tags (in the above case, all <text:span text:style-name="T1"> </text:span> tags), but it is very time-consuming for longer texts.

My idea is that someone could create a macro which does this job automatically, so that I can run it over a whole document. Although I can imagine the logic required for such a macro, I do not have sufficient programming skills to write it myself.
So if someone is willing and able to cooperate with me, or at least to provide some initial macro code, I would be more than happy.

Update:
I fixed my particular problem the easy way. From reading several forums I realized that LibreOffice 3.4 has a bug similar to the weird MS Office behaviour: it puts tags around every single word. So I just used LibreOffice 3.3 to open the doc file and save it as odt. All redundant tags are gone and I can now work on the file easily. BTW, AbiWord did not do the job: it was unable to import the graphics and did not keep the page layout.

I am leaving this topic unsolved for now, because it would be nice to have an OpenOffice version of that CodeZapper script.
Last edited by uwedulz on Fri Jan 20, 2012 9:52 am, edited 1 time in total.
Ubuntu [current]
Libreoffice 3.4.4 (distro-provided)
esperantisto
Volunteer
Posts: 578
Joined: Mon Oct 08, 2007 1:31 am

Re: Removing redundant format tags

Post by esperantisto »

1. Try simply resetting the formatting by selecting the text (all or part of it) and pressing Ctrl+M.
2. Try to remove custom paragraph/character styles. There was a macro for it somewhere at http://www.oooforum.org/.
3. Read Marc Prior’s howto on handling DOCX files: http://www.omegat.org/en/howtos/docx.html.
AOO 4.2.0 (of 2015) / LO 7.x / Win 7 / openSUSE Linux Leap 15.4 (64-bit)
uwedulz
Posts: 5
Joined: Thu Jan 19, 2012 10:13 am

Re: Removing redundant format tags

Post by uwedulz »

Thanks for your quick answer, esperantisto. I have had a look at all the pages you referred to.
Marc Prior’s howto on handling DOCX files mainly refers to Word, which I obviously do not have, so the tips and hints there do not help much. Actually, the CodeZapper macro referenced there might be the solution I am looking for, if it were available for OpenOffice/LibreOffice.

I tried the style removal script and it seems that it did at least not damage my files.
Removing custom formatting with Ctrl-M also removes indent formatting, so if I have a text which uses custom text indents this does not help much.

What I successfully did manually was to unzip the ODT and remove all above mentioned tags from content.xml using the following regular expression:

Code:

<text:span text:style-name="\w+"> </text:span>
This made the file I have at hand about 1 MB smaller and did not destroy the overall formatting.
Now the problem is to make this work the other way round, when there is text instead of a space within the tags. Simply removing those tags does destroy the formatting. I would have to leave the first tag intact and remove only the tags that follow it, for as long as no different tag appears.
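The manual unzip-and-replace step described above could be scripted roughly as follows. This is only a sketch in Python, not a tested tool: the function names strip_space_spans and clean_odt are made up for illustration, and it assumes that each removed span should be replaced by a plain space so that the neighbouring words do not run together. Work on a copy of the document.

```python
import re
import zipfile

# A span of any style whose only content is a single space.
SPACE_SPAN = re.compile(r'<text:span text:style-name="\w+"> </text:span>')


def strip_space_spans(xml: str) -> str:
    """Replace every space-only span with a plain space (assumption: the
    space itself must survive, only the markup around it is redundant)."""
    return SPACE_SPAN.sub(" ", xml)


def clean_odt(src_path: str, dst_path: str) -> None:
    """Copy an ODT archive to dst_path, rewriting only content.xml."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                data = strip_space_spans(data.decode("utf-8")).encode("utf-8")
            # Reusing the original ZipInfo keeps the entry order and the
            # compression type, so "mimetype" stays first and uncompressed.
            dst.writestr(item, data)
```

A call such as clean_odt("in.odt", "out.odt") (hypothetical file names) writes a cleaned copy and leaves the original untouched. It only handles spans that contain exactly one space; spans with text inside are not changed.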
Ubuntu [current]
Libreoffice 3.4.4 (distro-provided)
Villeroy
Volunteer
Posts: 31345
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Removing redundant format tags

Post by Villeroy »

You should learn about styles.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice
uwedulz
Posts: 5
Joined: Thu Jan 19, 2012 10:13 am

Re: Removing redundant format tags

Post by uwedulz »

Villeroy wrote:You should learn about styles.
Sorry, you totally missed my point. I am receiving a document from a client and want to avoid going through the whole text to format it properly. The document is in Word format (doc), so I convert it to odt. I never said that I had a problem with styles, and believe me, I have been well familiar with styles ever since I started using WordPro many years ago.

However, I edited my first post since I found a solution to my particular problem, but unfortunately no solution to the overall problem.
Ubuntu [current]
Libreoffice 3.4.4 (distro-provided)
Villeroy
Volunteer
Posts: 31345
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Removing redundant format tags

Post by Villeroy »

uwedulz wrote: I tried the style removal script and it seems that it did at least not damage my files.
We cannot know which script you are referring to. If it removes unused styles, then of course it will not do anything visible, since unused styles do not affect anything, nor do they take much space.
uwedulz wrote: Removing custom formatting with Ctrl-M also removes indent formatting, so if I have a text which uses custom text indents this does not help much.
Ctrl+M removes all hard formatting, so all formats fall back to the underlying styles. You can easily add the indent and all other formatting attributes to the respective paragraph style(s).
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice
uwedulz
Posts: 5
Joined: Thu Jan 19, 2012 10:13 am

Re: Removing redundant format tags

Post by uwedulz »

I am referring to http://asap-traduction.com/CodeZapper. I got that from reading the links esperantisto mentioned above.
The logic of the script would be: if a tag is immediately followed by an identical tag, remove the closing tag of the first one and the opening tag of the second one, and repeat this until a different tag begins. That way all redundant tags can be eliminated easily.
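The merging logic described above could be sketched like this in Python. It is an illustration under stated assumptions, not a port of CodeZapper: the name merge_adjacent_spans is hypothetical, the spans are assumed to carry only a text:style-name attribute, to contain plain text without nested tags, and to follow each other with nothing in between.

```python
import re

# A span, directly followed by the opening tag of a span with the same style.
# Group 1 is the style name, group 2 the text of the first span.
ADJACENT = re.compile(
    r'<text:span text:style-name="([^"]+)">([^<]*)</text:span>'
    r'<text:span text:style-name="\1">'
)


def merge_adjacent_spans(xml: str) -> str:
    """Drop the closing tag of a span and the opening tag of the next one
    when both use the same style, repeating until nothing changes."""
    previous = None
    while previous != xml:
        previous = xml
        # Keep the first opening tag and its text; the text of the second
        # span then continues inside the same element.
        xml = ADJACENT.sub(r'<text:span text:style-name="\1">\2', xml)
    return xml
```

The loop is needed because one pass merges only every other pair in a longer run of identical spans. Spans with different style names are left alone, so a real character style between two redundant ones still ends the run.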
Ubuntu [current]
Libreoffice 3.4.4 (distro-provided)
Villeroy
Volunteer
Posts: 31345
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: Removing redundant format tags

Post by Villeroy »

This is what styles do in CSS (cascading style sheets), in this office suite, or in that other suite where hardly any of its billion users ever used the concept of styles. You have a named thing "Caramba" which includes Spanish language, bold, italic, blue Arial font with 10pt indentation and whatever else. Each document part having all these attributes is simply assigned the style "Caramba". Removing all hard attributes and applying style-based formatting removes only the redundant attributes. Changing the attributes in one place (the Stylist window) will change all the document snippets where the modified style is in use. No need to apply any downloaded snake oil.
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice
rudolfo
Volunteer
Posts: 1488
Joined: Wed Mar 19, 2008 11:34 am
Location: Germany

Re: Removing redundant format tags

Post by rudolfo »

I think the problem here is not about using styles or not, nor about the right or wrong way to use styles.
What I understand from uwedulz is that he has a terrible tag soup of redundant styles applied to parts of a paragraph, but among this tag soup there are some (logical) character styles that emphasize words or phrases or mark hyperlinks. And he doesn't want to lose these well-justified logical formats.
Unfortunately, Default Formatting with Ctrl-M removes not only the direct formatting but also that logical character style markup. With the CodeZapper approach he would only lose redundancy, and he surely won't miss it.
OpenOffice 3.1.1 (2.4.3 until October 2009) and LibreOffice 3.3.2 on Windows 2000, AOO 3.4.1 on Windows 7
There are several macro languages in OOo, but none of them is called Visual Basic or VB(A)! Please call it OOo Basic, Star Basic or simply Basic.
uwedulz
Posts: 5
Joined: Thu Jan 19, 2012 10:13 am

Re: Removing redundant format tags

Post by uwedulz »

rudolfo wrote: I think the problem here is not about using styles or not, nor about the right or wrong way to use styles.
What I understand from uwedulz is that he has a terrible tag soup of redundant styles applied to parts of a paragraph, but among this tag soup there are some (logical) character styles that emphasize words or phrases or mark hyperlinks. And he doesn't want to lose these well-justified logical formats.
Unfortunately, Default Formatting with Ctrl-M removes not only the direct formatting but also that logical character style markup. With the CodeZapper approach he would only lose redundancy, and he surely won't miss it.
Thank you for reading my post thoroughly and exactly getting my point.
Ubuntu [current]
Libreoffice 3.4.4 (distro-provided)
Elisae
Posts: 1
Joined: Wed Aug 15, 2012 8:26 am

Re: Removing redundant format tags

Post by Elisae »

Hi uwedulz,

I have come across the same problem. Have you found any new solutions between January and today? The codes I get when I open the document in odt format in OmegaT are the following.
<f0>Read </f0><f1><s2/></f1><f3>completely <s4/></f3><f5>t</f5><f6>h</f6><f7>rough </f7><f8><s9/></f8><f10>each </f10><f11><s12/></f11><f13>step </f13><f14><s15/></f14><f16>in </f16><f17><s18/></f17><f19>every </f19><f20><s21/></f20><f22>procedure <s23/>before </f22><f24><s25/></f24><f26>s</f26><f27>t</f27><f28>art</f28><f29>i</f29><f30>ng </f30><f31><s32/></f31><f33>the </f33><f34><s35/></f34><f36>procedure; <s37/>any except</f36><f38>i</f38><f39>ons</f39><f40> </f40><f41>may</f41><f42> </f42><f43>result</f43><f44> </f44><f45>in</f45><f46> </f46><f47>a </f47><f48><s49/></f48><f50>failure</f50><f51> </f51><f52>t</f52><f53>o</f53><f54> </f54><f55>proper</f55><f56>l</f56><f57>y</f57><f58> </f58><f59>and</f59><f60> </f60><f61>safe</f61><f62>l</f62><f63>y</f63><f64> </f64><f65>comp</f65><f66>l</f66><f67>e</f67><f68>t</f68><f69>e</f69><f70> </f70><f71>the</f71><f72> </f72><f73>a</f73><f74>t</f74><f75>tempted</f75><f76> </f76><f77>procedure.</f77
So it is mostly <f> and <s> codes, and they stay the same whether I save the doc as odt and open it, or save it as odt, back to Word format, then back to odt, and open that.
I've tried using my own OpenOffice 3.3 and the LibreOffice version you mentioned, but nothing seems to take the codes out.
The only thing that has worked is to translate in the document itself: once I do that, the text fragment that I have inserted shows up without codes in OmegaT. But the document is 100 pages long.

Thank you!
Openoffice.org 3.1 with MacOS 10.4
zhivko
Posts: 7
Joined: Mon Feb 22, 2010 11:29 am

Re: Removing redundant format tags

Post by zhivko »

Same problem here.
I am getting something like:
<text:span text:style-name="T14">text1</text:span>
<text:span text:style-name="T15">text2</text:span>
<text:span text:style-name="T11">text3</text:span>
<text:span text:style-name="T14">text4</text:span>
<text:span text:style-name="T15">text5</text:span>

If I look at the document, it has one paragraph with text1 through text4 in plain text and text5 in bold.

What could cause such excessive use of text:style-name elements inside content.xml of .ODT file ?
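One way to check whether such style names are really redundant is to compare the automatic styles defined at the top of content.xml. The following Python sketch groups automatic text styles that carry exactly the same text properties. The function name group_identical_text_styles is made up for illustration, and the comparison is an assumption about what counts as "identical": it only looks at the parent style and the attributes of style:text-properties.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ODF namespace URIs used in content.xml.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
}
STYLE = "{" + NS["style"] + "}"


def group_identical_text_styles(content_xml: str) -> list:
    """Return lists of automatic text style names whose parent style and
    text properties are exactly the same."""
    root = ET.fromstring(content_xml)
    groups = defaultdict(list)
    for style in root.findall("office:automatic-styles/style:style", NS):
        if style.get(STYLE + "family") != "text":
            continue  # skip paragraph, table and other style families
        props = style.find("style:text-properties", NS)
        attrs = tuple(sorted(props.attrib.items())) if props is not None else ()
        key = (style.get(STYLE + "parent-style-name"), attrs)
        groups[key].append(style.get(STYLE + "name"))
    # Only groups with more than one member point at redundant styles.
    return [names for names in groups.values() if len(names) > 1]
```

If, for example, T11 and T14 end up in the same group, spans using them look identical in the document even though their names differ.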
OpenOffice 3.1 on Windows XP
Bill
Volunteer
Posts: 8952
Joined: Sat Nov 24, 2007 6:48 am

Re: Removing redundant format tags

Post by Bill »

Please post a sample file. I'd like to see how "T15" can produce plain text for text2 but produce bold text for text5.
AOO 4.1.14 on Ubuntu MATE 22.04
esperantisto
Volunteer
Posts: 578
Joined: Mon Oct 08, 2007 1:31 am

Re: Removing redundant format tags

Post by esperantisto »

zhivko wrote:What could cause such excessive use of text:style-name elements inside content.xml of .ODT file ?
The file is probably a product of an OCR program.
AOO 4.2.0 (of 2015) / LO 7.x / Win 7 / openSUSE Linux Leap 15.4 (64-bit)
esperantisto
Volunteer
Posts: 578
Joined: Mon Oct 08, 2007 1:31 am

Re: Removing redundant format tags

Post by esperantisto »

By the way, there’s now a solution for the original posting: OmegaT can hide tags. Just enable this option in the translation project properties and go on translating.
AOO 4.2.0 (of 2015) / LO 7.x / Win 7 / openSUSE Linux Leap 15.4 (64-bit)