Extract Embedded objects from ODT and other OO files

Posted: Fri Nov 24, 2017 9:44 pm
by billowgao
Hi there,

Is it possible to extract embedded objects from OpenOffice files?
When I look inside the zip file, I see files like: Object 1, Object 2, Object 3, Object 4...
It seems that the original file is stored inside Object 1, preceded by a header.

Is there an API to read the header, then extract the embedded object from the Object # file?

Where can I find the header format?

Thanks,

Billow

Re: Extract Embedded objects from ODT and other OO files

Posted: Sat Nov 25, 2017 1:58 am
by Lupp
At least if the embedded objects contain an OpenOffice / LibreOffice component, it is possible.
How to do it is exemplified for extraction from an AOO/LibO spreadsheet file (.ods) in an attachment to the recent thread viewtopic.php?f=9&t=91041&p=431198&hili ... ed#p431198 .
Doing it from a Writer file requires some changes due to the different document models. Text documents contain "textembeddedobject"s, and spreadsheet documents have one DrawPage per sheet, while text documents have only one DrawPage for the whole document. In both cases, each embedded object is represented by a shape that contains it and that is in turn a member of a DrawPage.

Re: Extract Embedded objects from ODT and other OO files

Posted: Fri Jan 05, 2018 4:26 am
by Waldo
Hi Billow,

Are you looking to extract embedded objects from ODT files specifically using the C++/Java API or by looking through the extracted zip file?

When extracting embedded objects from the zip file, they can take one of two forms -- a file-based Object or a directory-based Object.
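
To see which form each object takes without unzipping by hand, something like the following Python sketch could be used. It assumes the entries are named "Object 1", "Object 2", ... at the top level of the package, as in Billow's listing; the function name classify_embedded_objects is made up for this example.

```python
import zipfile


def classify_embedded_objects(path):
    """Map each top-level "Object N" name to "file" or "directory".

    Assumption: embedded objects are stored under names starting with
    "Object " at the top level of the package, as described in this thread.
    """
    kinds = {}
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            top, sep, _rest = name.partition("/")
            if not top.startswith("Object "):
                continue
            if sep:
                # A slash after the top-level name means the object is a sub-directory.
                kinds[top] = "directory"
            else:
                # A bare entry is a single file; keep "directory" if already seen.
                kinds.setdefault(top, "file")
    return kinds
```

Calling classify_embedded_objects("document.odt") on a document of your own returns a dictionary such as {"Object 1": "file", "Object 2": "directory"}, depending on what is embedded.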

The directory-based objects occur most frequently when an OpenOffice/LibreOffice document is embedded within another OpenOffice/LibreOffice document. If you look at the directory-based object's sub-directory, it will contain the same types of files you would find in a standard document, such as content.xml, settings.xml, styles.xml, etc. Unfortunately, you can't just zip up the directory-based object and give it an odp, odt, or ods extension, because the directory-based object is missing its own manifest.xml file. It's possible to take the manifest.xml file from the parent document, modify it, and then combine it with the files in the directory-based object, but this can get difficult depending on the complexity of the parent file and the embedded object.
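
For simple cases, the manifest approach could look roughly like the Python sketch below. It rests on a few assumptions: the parent manifest is at META-INF/manifest.xml, it lists the object's directory (for example "Object 1/") with a media type, and the object's files are listed with paths under that directory. The function name repackage_directory_object is hypothetical, and I have not checked the result against complex documents, so treat it as a starting point only.

```python
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
ET.register_namespace("manifest", MANIFEST_NS)


def repackage_directory_object(parent_path, object_name, out_path):
    """Copy a directory-based object into its own package; return its media type."""
    prefix = object_name + "/"
    path_attr = "{%s}full-path" % MANIFEST_NS
    type_attr = "{%s}media-type" % MANIFEST_NS
    with zipfile.ZipFile(parent_path) as src:
        root = ET.fromstring(src.read("META-INF/manifest.xml"))
        media_type = None
        for entry in list(root):
            full_path = entry.get(path_attr, "")
            if not full_path.startswith(prefix):
                # Entry belongs to the parent document, not to the object.
                root.remove(entry)
                continue
            relative = full_path[len(prefix):]
            if relative == "":
                # The object's directory entry becomes the new package root.
                media_type = entry.get(type_attr)
                relative = "/"
            entry.set(path_attr, relative)
        if not media_type:
            raise ValueError("no manifest entry with a media type for " + prefix)
        with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
            # The mimetype entry goes first and is stored uncompressed.
            dst.writestr("mimetype", media_type, compress_type=zipfile.ZIP_STORED)
            dst.writestr("META-INF/manifest.xml", ET.tostring(root, encoding="utf-8"))
            for name in src.namelist():
                relative = name[len(prefix):]
                if not name.startswith(prefix) or name.endswith("/"):
                    continue
                if relative in ("mimetype", "META-INF/manifest.xml"):
                    continue  # already written above
                dst.writestr(relative, src.read(name))
    return media_type
```

The returned media type tells you which extension to give the output file. Encrypted parents, linked images, and objects nested inside objects are not handled here, which is where the difficulty mentioned above comes in.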

The file-based objects occur most frequently when the embedded object was inserted by Microsoft Office and/or is a file type that cannot be expressed using the OpenDocument Format. In these cases, the native file is wrapped in a Microsoft OLE stream. There are tools like oledump.py that can extract data from an OLE stream, but I've never tested that one myself.
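
Before reaching for an OLE tool, you can check which file-based objects really are OLE containers by looking at the first eight bytes, which for an OLE compound file are D0 CF 11 E0 A1 B1 1A E1. A small Python sketch, again assuming the top-level "Object N" naming from the first post (find_ole_objects is a name invented for this example):

```python
import zipfile

# Signature found at the start of every OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole_objects(path):
    """Return the names of file-based "Object N" entries that start with the OLE signature."""
    found = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            name = info.filename
            # Skip directory-based objects and everything that is not an "Object N" entry.
            if "/" in name or not name.startswith("Object "):
                continue
            with zf.open(info) as fh:
                if fh.read(len(OLE_MAGIC)) == OLE_MAGIC:
                    found.append(name)
    return found
```

This only identifies the container. Getting the native file back out still requires parsing the OLE streams with a tool such as the oledump.py mentioned above.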