Getting the content.xml

Posted: Fri Jan 07, 2011 1:55 am
by Paeon
Right now, as far as I know, to get at the content file of a Writer doc you have to save the file with the .odt extension, change the extension to .zip, and then decompress the zipped file to get a folder of the XML files.

What I'd like to do is simply export the content.xml file, without having to do the zip/decompress routine. The only thing I found was on the wiki (Flat XML export), but the link container was empty.

If I have to write an XSLT (seems overkill), does the filter necessarily have to have an import option? I will never need to import the file back into a word processing program once it has been extracted. :cry:

Re: Getting the content.xml

Posted: Fri Jan 07, 2011 2:34 am
by acknak
If you want to export xml through the GUI, I would think that the flat xml is (potentially) the simplest solution since it's already available and complete--you just have to find out how to access it (and I don't have the slightest idea; I've not been able to find out myself). The xml saved this way will include both the document content and all the styles, settings, and metadata from the document.

If you really only want the document content, it's very simple to create an XSLT that sends its input (content.xml) to the output, unchanged. You can install that as an xml filter and the user can export the data using the normal File > Save/Save As menu.

If you just need a way to grab the content.xml part of an ODF file, and not necessarily through the OOo GUI, then I would simply use a command-line zip archive utility. That will avoid the extra gyrations you mentioned and you can get the content.xml in one operation, with no need to worry about file names or extra files. Some editors will recognize a zip archive and allow you to extract any of the archive entries just as if it were a normal file; modern OS's all support some kind of enhanced access to archive contents as well.
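
The one-step extraction can also be sketched in Java, since an ODF file is an ordinary zip archive. This is only a minimal sketch: the class name ContentXmlExtractor and its extract method are hypothetical names chosen for illustration, and it assumes the document is not password-protected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of OOo or any library.
public class ContentXmlExtractor {

    // Copies the content.xml entry of an ODF file (.odt, .ods, ...) to target.
    public static void extract(Path odfFile, Path target) throws IOException {
        try (ZipFile zip = new ZipFile(odfFile.toFile())) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("No content.xml in " + odfFile);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java ContentXmlExtractor <document.odt> <output.xml>");
            return;
        }
        extract(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Only the one entry is read, so no temporary folder and no renaming to .zip is involved.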

Re: Getting the content.xml

Posted: Fri Jan 07, 2011 10:42 am
by hol.sten
Paeon wrote:If I have to write an XSLT (seems overkill), does the filter necessarily have to have an import option
Short answer to this question of yours: No! If, for whatever reason, you only need an export filter, you don't have to add an import filter (and vice versa).

Re: Getting the content.xml

Posted: Fri Jan 07, 2011 12:34 pm
by Robert Tucker
Don't know if these references are of any interest:
You can’t just run this stylesheet against an .odt file with a typical transformation program because the OpenDocument file is in a .zip format. Instead, you need a special program that can read directly from the compressed file. That program is ODTransform.java.
http://books.evc-cit.info/odf_utils/odt_to_xhtml.html
http://books.evc-cit.info/oasis/ods_to_ ... sform.java
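
The idea in the quoted passage, running a stylesheet against an entry read straight from the compressed file, can be sketched with the JDK alone. This is not the ODTransform.java program from the links above; ZipEntryTransform and its transform method are hypothetical names, and the built-in identity stylesheet is only a stand-in for whatever XSLT 1.0 stylesheet you actually want to apply.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, written for illustration only.
public class ZipEntryTransform {

    // Stand-in stylesheet: deep-copies the root element unchanged.
    static final String IDENTITY_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'><xsl:copy-of select='*'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet to content.xml inside the ODF file
    // and returns the result as a string.
    public static String transform(String odfPath, String xslt) throws Exception {
        try (ZipFile zip = new ZipFile(odfPath)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new FileNotFoundException("No content.xml in " + odfPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                Transformer transformer = TransformerFactory.newInstance()
                        .newTransformer(new StreamSource(new StringReader(xslt)));
                StringWriter out = new StringWriter();
                transformer.transform(new StreamSource(in), new StreamResult(out));
                return out.toString();
            }
        }
    }
}
```

A call such as ZipEntryTransform.transform("document.odt", ZipEntryTransform.IDENTITY_XSLT) would return the content.xml markup; the file name here is a placeholder.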

Re: Getting the content.xml

Posted: Mon Jan 10, 2011 1:30 pm
by rudolfo
Just as a follow-up on acknak's hint to create a simple xslt stylesheet to solve this ...
acknak wrote:If you really only want the document content, it's very simple to create an XSLT that sends its input (content.xml) to the output, unchanged. You can install that as an xml filter and the user can export the data using the normal File > Save/Save As menu.
XSLT has the copy-of element to make a deep copy of a node. If you do this for the root node you'll have a direct copy of the full xml input. Most of the introductions on XSLT have a stylesheet similar to the following one as their first example:

Code:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="utf-8" />

  <xsl:template match="/">
    <xsl:copy-of select="*"/>
  </xsl:template>

</xsl:stylesheet>
Save the above code as raw-export.xslt, direct-copy.xslt, or whatever suits you (delete the indent="yes" if you don't need the output to be human-readable) and use the xml-filter dialog from Tools to configure it as an export filter. It should basically work for all kinds of OOo documents (text, spreadsheets, presentations, etc.), but the filter dialog requires you to specify a type. Either do this for all the types that you need, or maybe "unknown" can do the job as well. I tested it only for .odt documents, though.
After this you can use "Export..." from the File menu to export to an xml file. The output for a Writer document contains the xml from meta.xml, settings.xml, styles.xml and content.xml:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<office:document ... office:version="1.2" office:mimetype="application/vnd.oasis.opendocument.text">
   <office:meta>...</office:meta>
   <office:settings>...</office:settings>
   <office:scripts>...</office:scripts>
   <office:font-face-decls>...</office:font-face-decls>
   <office:styles>...</office:styles>
   <office:automatic-styles>...</office:automatic-styles>
   <office:master-styles>...</office:master-styles>
   <office:body>...</office:body>
</office:document>
I have never used the flat odt (.fodt) extension, but from what I have read about it, it seems to output more or less the same structure.

Re: Getting the content.xml

Posted: Mon Jan 10, 2011 4:13 pm
by acknak
Nice work! Thanks for the follow-up.

Re: Getting the content.xml

Posted: Sat Jan 31, 2026 8:57 pm
by sjaguar
In case anyone wonders, here are a couple of ways of getting content.xml using Java.

The first one converts the contents of the content.xml file into a byte array:

Code:

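import com.sun.star.container.NoSuchElementException;
import com.sun.star.document.XStorageBasedDocument;
import com.sun.star.embed.ElementModes;
import com.sun.star.embed.XStorage;
import com.sun.star.io.XInputStream;
import com.sun.star.io.XStream;
import com.sun.star.lang.XComponent;
import com.sun.star.uno.UnoRuntime;

public byte[] readContent( XComponent component )
throws com.sun.star.io.IOException,
       NoSuchElementException,
       Exception
{
   byte[][]              buffer;
   int                   bytes;
   XStream               content;
   XInputStream          stream;
   XStorage              storage;
   XStorageBasedDocument storageDoc;

   // Open the content.xml stream inside the document's own storage.
   storageDoc = UnoRuntime.queryInterface( XStorageBasedDocument.class, component );
   storage    = storageDoc.getDocumentStorage();
   content    = storage.openStreamElement( "content.xml", ElementModes.READ );
   stream     = content.getInputStream();

   // available() and readBytes() belong to the input stream, not to XStream.
   bytes      = stream.available();
   buffer     = new byte[ 1 ][ bytes ];

   stream.readBytes( buffer, bytes );
   stream.closeInput();

   return buffer[ 0 ];
}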
It is also possible to obtain a reference to the file inside the OpenDocument file (.ods, .odt, etc.) as follows:

Code:

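import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// odURL is treated as a local file path, e.g. "/home/me/doc.odt".
public Path getContent( String odURL )
throws IOException
{
   Path                  contentFile;
   Map< String, String > env;
   FileSystem            fSystem;
   Path                  odFile;
   URI                   uri;

   env         = new HashMap<>();
   odFile      = Paths.get( odURL );
   // The zip file system expects a URI of the form jar:file:///path/to/doc.odt
   uri         = URI.create( "jar:" + odFile.toUri() );
   fSystem     = FileSystems.newFileSystem( uri, env );
   contentFile = fSystem.getPath( "/" + "content.xml" );

   return contentFile;
}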
Note: closing fSystem has been left out for the sake of clarity and should be taken care of in a way appropriate to the surrounding context.
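
One possible way to handle the closing is to read the entry while the file system is still open and let try-with-resources close it afterwards. This is only a sketch under assumptions: ContentXmlReader and this readContent method are hypothetical names, the whole entry is assumed to fit in memory, and the content is assumed to be UTF-8.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, written for illustration only.
public class ContentXmlReader {

    // Reads content.xml from an OpenDocument file and returns it as text.
    public static String readContent(Path odFile) throws IOException {
        Map<String, String> env = new HashMap<>();
        URI uri = URI.create("jar:" + odFile.toUri());

        // The zip file system is closed automatically when the block exits,
        // so the entry has to be read inside the block.
        try (FileSystem fs = FileSystems.newFileSystem(uri, env)) {
            Path contentFile = fs.getPath("/content.xml");
            return new String(Files.readAllBytes(contentFile), StandardCharsets.UTF_8);
        }
    }
}
```

The trade-off is that a Path into the archive cannot be handed back to the caller this way, because it stops being usable once the file system is closed.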

Cheers!