Getting the content.xml

Paeon
Posts: 4
Joined: Fri Nov 13, 2009 2:01 am

Getting the content.xml

Post by Paeon »

Right now, as far as I know, to get at the content file of a Writer doc you have to save the file with the .odt extension, change the extension to .zip, and then decompress the zipped file to get a folder of the XML files.

What I'd like to do is simply export the content.xml file, without having to do the zip/decompress routine. The only thing I found was on the wiki (Flat XML export), but the link container was empty.

If I have to write an XSLT (seems overkill), does the filter necessarily have to have an import option? I will never need to import the file back into a word processor once it has been extracted. :cry:
OpenOffice 2.4 with MacOS 10.4
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: Getting the content.xml

Post by acknak »

If you want to export xml through the GUI, I would think that the flat xml is (potentially) the simplest solution since it's already available and complete--you just have to find out how to access it (and I don't have the slightest idea; I've not been able to find out myself). The xml saved this way will include both the document content and all the styles, settings, and metadata from the document.

If you really only want the document content, it's very simple to create an XSLT that sends its input (content.xml) to the output, unchanged. You can install that as an xml filter and the user can export the data using the normal File > Save/Save As menu.

If you just need a way to grab the content.xml part of an ODF file, and not necessarily through the OOo GUI, then I would simply use a command-line zip archive utility. That avoids the extra gyrations you mentioned: you can get content.xml in one operation, with no need to rename the file or unpack the other entries. Some editors will recognize a zip archive and let you open any of its entries as if it were a normal file; modern operating systems all support some kind of enhanced access to archive contents as well.
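With Info-ZIP's unzip, for example, `unzip -p document.odt content.xml > content.xml` writes just that one entry to a file (the file names are placeholders). The same single-step extraction can be sketched in Java with the standard java.util.zip classes. This is only an illustration; the class and method names are made up for the example and are not part of OOo.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: copies the content.xml entry out of an ODF package.
final class ContentXmlExtractor {

    // Writes the "content.xml" entry of odfFile to target, replacing target if it exists.
    static void extractContent(Path odfFile, Path target) throws IOException {
        try (ZipFile zip = new ZipFile(odfFile.toFile())) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("No content.xml entry in " + odfFile);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ContentXmlExtractor <document.odt> <output.xml>");
            return;
        }
        extractContent(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

No renaming to .zip is needed, because the zip classes look at the file's contents rather than its extension.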
AOO4/LO5 • Linux • Fedora 23
hol.sten
Volunteer
Posts: 495
Joined: Mon Oct 08, 2007 1:31 am
Location: Hamburg, Germany

Re: Getting the content.xml

Post by hol.sten »

Paeon wrote:If I have to write an XSLT (seems overkill), does the filter necessarily have to have an import option
Short answer to this question of yours: No! If, for whatever reason, you only need an export filter, you don't have to add an import filter (and vice versa).
OOo 3.2.0 on Ubuntu 10.04 • OOo 3.2.1 on Windows 7 64-bit and MS Windows XP
Robert Tucker
Volunteer
Posts: 1250
Joined: Mon Oct 08, 2007 1:34 am
Location: Manchester UK

Re: Getting the content.xml

Post by Robert Tucker »

Don't know if these references are of any interest:
You can’t just run this stylesheet against an .odt file with a typical transformation program because the OpenDocument file is in a .zip format. Instead, you need a special program that can read directly from the compressed file. That program is ODTransform.java.
http://books.evc-cit.info/odf_utils/odt_to_xhtml.html
http://books.evc-cit.info/oasis/ods_to_ ... sform.java
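For illustration only, the idea in the quoted passage (reading content.xml straight out of the zipped package and feeding it to a stylesheet) can be sketched with the JDK's own zip and JAXP classes. This is not the ODTransform.java program referenced above; the class and method names are hypothetical, and the stylesheet is assumed to be a plain XSLT 1.0 file on disk.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies a stylesheet to the content.xml entry of an ODF package.
final class OdfEntryTransform {

    // Returns the transformation result as a string.
    static String transformContent(Path odfFile, Path stylesheet) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet.toFile()));
        try (ZipFile zip = new ZipFile(odfFile.toFile())) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IllegalArgumentException("No content.xml entry in " + odfFile);
            }
            // The entry is decompressed on the fly; nothing is unpacked to disk.
            try (InputStream in = zip.getInputStream(entry)) {
                StringWriter out = new StringWriter();
                transformer.transform(new StreamSource(in), new StreamResult(out));
                return out.toString();
            }
        }
    }
}
```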
LibreOffice 7.x.x on Arch and Fedora.
rudolfo
Volunteer
Posts: 1488
Joined: Wed Mar 19, 2008 11:34 am
Location: Germany

Re: Getting the content.xml

Post by rudolfo »

Just as a follow-up on acknak's hint to create a simple xslt stylesheet to solve this ...
acknak wrote:If you really only want the document content, it's very simple to create an XSLT that sends its input (content.xml) to the output, unchanged. You can install that as an xml filter and the user can export the data using the normal File > Save/Save As menu.
XSLT has the copy-of element to make a deep copy of a node. If you do this for the root node you'll have a direct copy of the full xml input. Most of the introductions on XSLT have a stylesheet similar to the following one as their first example:

Code:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="utf-8" />

  <xsl:template match="/">
    <xsl:copy-of select="*"/>
  </xsl:template>

</xsl:stylesheet>
Save the above code as raw-export.xslt, direct-copy.xslt, or whatever suits you (delete the indent="yes" if you don't need the output to be human readable) and use the xml-filter dialog from Tools to configure it as an export filter. It should basically work for all kinds of OOo documents (text, spreadsheets, presentations, etc.), but the filter dialog requires you to specify a type. Either do this for every type that you need, or maybe "unknown" can do the job as well. I have only tested it with .odt documents.
After this you can use "Export..." from the File menu to export to an xml file. The output for a Writer document contains the xml from meta.xml, settings.xml, styles.xml and content.xml:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<office:document ... office:version="1.2" office:mimetype="application/vnd.oasis.opendocument.text">
   <office:meta>...</office:meta>
   <office:settings>...</office:settings>
   <office:scripts>...</office:scripts>
   <office:font-face-decls>...</office:font-face-decls>
   <office:styles>...</office:styles>
   <office:automatic-styles>...</office:automatic-styles>
   <office:master-styles>...</office:master-styles>
   <office:body>...</office:body>
</office:document>
I have never used the flat odt (.fodt) extension, but from what I have read about it, it seems to output more or less the same structure.
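To check which of these sections an exported file actually contains, a few lines of Java with the JDK's DOM parser will do. This is only a sketch: the class and method names are invented for the example, and it assumes the export is a well-formed flat document with a single root element as shown above.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: lists the sections found directly under the root element.
final class FlatXmlSections {

    // Returns the local names (without the "office:" prefix) of the root's child elements.
    static List<String> topLevelSections(InputStream flatXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so getLocalName() strips the prefix
        Document doc = factory.newDocumentBuilder().parse(flatXml);

        List<String> names = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getLocalName());
            }
        }
        return names;
    }
}
```

The same loop is a natural place to pick out just the body element if the styles and settings are not wanted.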
OpenOffice 3.1.1 (2.4.3 until October 2009) and LibreOffice 3.3.2 on Windows 2000, AOO 3.4.1 on Windows 7
There are several macro languages in OOo, but none of them is called Visual Basic or VB(A)! Please call it OOo Basic, Star Basic or simply Basic.
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: Getting the content.xml

Post by acknak »

Nice work! Thanks for the follow-up.
AOO4/LO5 • Linux • Fedora 23
sjaguar
Posts: 1
Joined: Wed Oct 30, 2024 12:31 am

Re: Getting the content.xml

Post by sjaguar »

In case anyone wonders, here are a couple of ways of getting content.xml using Java.

The first one converts the contents of the content.xml file into a byte array:

Code:

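import com.sun.star.container.NoSuchElementException; // assumed to be the UNO container exception
import com.sun.star.document.XStorageBasedDocument;
import com.sun.star.embed.ElementModes;
import com.sun.star.embed.XStorage;
import com.sun.star.io.XInputStream;
import com.sun.star.io.XStream;
import com.sun.star.lang.XComponent;
import com.sun.star.uno.UnoRuntime;

public byte[] readContent( XComponent component )
throws com.sun.star.io.IOException,
       NoSuchElementException,
       Exception
{
   byte[][]              buffer;
   int                   bytes;
   XStream               content;
   XInputStream          stream;
   XStorage              storage;
   XStorageBasedDocument storageDoc;

   // The document's package storage gives access to the individual streams (files) inside it.
   storageDoc = UnoRuntime.queryInterface( XStorageBasedDocument.class, component );
   storage    = storageDoc.getDocumentStorage();
   content    = storage.openStreamElement( "content.xml", ElementModes.READ );
   stream     = content.getInputStream();
   bytes      = stream.available();       // available() belongs to XInputStream, not XStream
   buffer     = new byte[ 1 ][ bytes ];   // UNO "out" parameter: one-element array holding the byte[]

   stream.readBytes( buffer, bytes );     // readBytes() is also an XInputStream method
   stream.closeInput();

   return buffer[ 0 ];
}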
It is also possible to obtain a reference to the file inside the OpenDocument file (.ods, .odt, etc.) as follows:

Code:

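import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// odURL is assumed to be a file: URL, e.g. file:///home/user/doc.odt (hypothetical path)
public Path getContent( String odURL )
throws IOException   // java.io.IOException, thrown by newFileSystem
{
   Path                  contentFile;
   Map< String, String > env;
   FileSystem            fSystem;
   Path                  odFile;
   URI                   uri;

   env         = new HashMap<>();
   odFile      = Paths.get( URI.create( odURL ) );
   uri         = URI.create( "jar:" + odFile.toUri() );   // e.g. jar:file:///home/user/doc.odt
   fSystem     = FileSystems.newFileSystem( uri, env );
   contentFile = fSystem.getPath( "/" + "content.xml" );

   return contentFile;
}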
Note: closing of fSystem has been left out for the sake of clarity, and should be taken care of in a way appropriate to the surrounding context.
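One way to handle the closing is try-with-resources, reading the data before the file system goes out of scope, since a Path obtained from a closed zip file system can no longer be read. The following is a minimal sketch under a few assumptions: the class and method names are hypothetical, the document is given as a local Path, and content.xml is decoded as UTF-8.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical helper: reads content.xml through the zip file system and closes it afterwards.
final class ZipFsContentReader {

    // Returns the text of content.xml from the given OpenDocument file.
    static String readContent(Path odFile) throws IOException {
        URI uri = URI.create("jar:" + odFile.toUri());
        // The file system is closed automatically when the try block exits,
        // so the entry has to be read completely inside the block.
        try (FileSystem fs = FileSystems.newFileSystem(uri, new HashMap<String, String>())) {
            Path content = fs.getPath("/content.xml");
            return new String(Files.readAllBytes(content), StandardCharsets.UTF_8); // UTF-8 assumed
        }
    }
}
```

Returning the data rather than the Path keeps the lifetime of the file system entirely inside one method.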

Cheers!
Last edited by sjaguar on Sat Jan 31, 2026 8:59 pm, edited 1 time in total.
LibreOffice 7.3.7.2, 24.2.4.2
OpenOffice 4.1.15
---
Debian 12
FreeBSD 14.1
Ubuntu 22.04