Extract Embedded objects from ODT and other OO files

billowgao
Posts: 1
Joined: Fri Nov 24, 2017 9:40 pm

Post by billowgao »

Hi there,

Is it possible to extract embedded objects from OpenOffice files?
When I look inside the zip file, I see entries such as Object 1, Object 2, Object 3, Object 4, and so on.
It seems that the original file is stored in Object 1, preceded by a header.

Is there an API to read the header, then extract the embedded object from the Object # file?

Where can I find the header format?

Thanks,

Billow
OpenOffice 3.1 on Windows 2010
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Extract Embedded objects from ODT and other OO files

Post by Lupp »

It is possible, at least if the embedded objects contain an OpenOffice / LibreOffice component.
How to do it is exemplified for extraction from an AOO/LibO spreadsheet file (.ods) in an attachment to the recent thread viewtopic.php?f=9&t=91041&p=431198&hili ... ed#p431198 .
Doing it from a Writer file requires some changes because the document models differ. Text documents contain "textembeddedobject"s, and spreadsheet documents have one DrawPage per sheet, while text documents have a single DrawPage for the whole document. In both cases each embedded object is represented by a shape that contains it and is in turn a member of a DrawPage.
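
For comparison, here is the file-level counterpart of that shape/object relationship, as a minimal sketch rather than the macro from the linked thread. It assumes the document is an unencrypted ODF package and that each embedded object is referenced from content.xml by a draw:object or draw:object-ole element inside a draw:frame. The function name list_object_references is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF namespace URIs used in content.xml.
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def list_object_references(source):
    """Return (frame name, href) pairs for embedded objects in content.xml.

    `source` is a path or a binary file-like object holding the document.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("content.xml"))

    object_tags = (f"{{{DRAW_NS}}}object", f"{{{DRAW_NS}}}object-ole")
    references = []
    for frame in root.iter(f"{{{DRAW_NS}}}frame"):
        for child in frame:
            if child.tag not in object_tags:
                continue  # e.g. draw:image, which is a picture, not an object
            href = child.get(f"{{{XLINK_NS}}}href")
            if href:
                references.append((frame.get(f"{{{DRAW_NS}}}name"), href))
    return references
```

The href values (for example "./Object 1") name the package entries that hold each object, so they can be matched against the entries seen in the zip file.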
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
Waldo
Posts: 4
Joined: Tue Jan 02, 2018 7:39 pm

Re: Extract Embedded objects from ODT and other OO files

Post by Waldo »

Hi Billow,

Are you looking to extract embedded objects from ODT files specifically using the C++/Java API or by looking through the extracted zip file?

When extracting embedded objects from the zip file, they can take one of two forms -- a file-based Object or a directory-based Object.
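
A minimal sketch of telling the two forms apart with Python's zipfile module. It assumes the objects use the default "Object N" entry names mentioned in the first post; objects stored under other names would be missed. The function name classify_objects is hypothetical.

```python
import zipfile


def classify_objects(source):
    """Split top-level 'Object N' entries into file-based and directory-based.

    `source` is a path or a binary file-like object holding the document.
    Returns two sets of names: (file_based, directory_based).
    """
    file_based, directory_based = set(), set()
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            top, separator, _rest = name.partition("/")
            # The trailing space keeps "ObjectReplacements/..." out of the result.
            if not top.startswith("Object "):
                continue
            if separator:
                # "Object 2/content.xml" or "Object 2/": the object is a directory.
                directory_based.add(top)
            else:
                # "Object 1" with no slash: the object is a single file.
                file_based.add(top)
    return file_based, directory_based
```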

The directory-based objects occur most frequently when an OpenOffice/LibreOffice document is embedded within another OpenOffice/LibreOffice document. If you look at the directory-based object's sub-directory, it will contain the same types of files you would find in a standard document, such as content.xml, settings.xml, styles.xml, etc. Unfortunately, you can't just zip up the directory-based object and give it an odp, odt, or ods extension, because the directory-based object is missing its own manifest.xml file. It's possible to take the manifest.xml file from the parent document, modify it, and then combine it with the files in the directory-based object, but this can get difficult depending on the complexity of the parent file and the embedded object.
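
A rough sketch of that manifest approach for the simple case. It assumes an unencrypted parent document whose manifest.xml lists the object's directory (for example "Object 1/") together with its media type, and it ignores anything the object might share with the parent. The function name repackage_object is hypothetical, and complex documents may need more work than this.

```python
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
FULL_PATH = f"{{{MANIFEST_NS}}}full-path"
MEDIA_TYPE = f"{{{MANIFEST_NS}}}media-type"


def repackage_object(parent, object_name, target):
    """Copy a directory-based object out of `parent` into a new package.

    `parent` and `target` are paths or binary file-like objects.
    Returns the media type that the parent manifest records for the object.
    """
    prefix = object_name + "/"
    ET.register_namespace("manifest", MANIFEST_NS)
    with zipfile.ZipFile(parent) as src:
        old_manifest = ET.fromstring(src.read("META-INF/manifest.xml"))
        new_manifest = ET.Element(old_manifest.tag, old_manifest.attrib)
        media_type = None
        for entry in old_manifest.iter(f"{{{MANIFEST_NS}}}file-entry"):
            path = entry.get(FULL_PATH, "")
            if not path.startswith(prefix):
                continue
            attributes = dict(entry.attrib)
            # "Object 1/" becomes the root entry "/" of the new package.
            attributes[FULL_PATH] = path[len(prefix):] or "/"
            if attributes[FULL_PATH] == "/":
                media_type = entry.get(MEDIA_TYPE)
            ET.SubElement(new_manifest, entry.tag, attributes)
        if not media_type:
            raise ValueError(f"no manifest entry with a media type for {prefix!r}")

        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
            # ODF packages expect "mimetype" first and stored uncompressed.
            dst.writestr("mimetype", media_type, compress_type=zipfile.ZIP_STORED)
            for name in src.namelist():
                if name.startswith(prefix) and not name.endswith("/"):
                    dst.writestr(name[len(prefix):], src.read(name))
            dst.writestr(
                "META-INF/manifest.xml",
                ET.tostring(new_manifest, encoding="utf-8", xml_declaration=True),
            )
    return media_type
```

The returned media type tells you which extension to give the new file, for example application/vnd.oasis.opendocument.spreadsheet for .ods.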

The file-based objects occur most frequently when the embedded object was inserted by Microsoft Office and/or is a file type that cannot be expressed using the OpenDocument Format. In these cases the native file is wrapped in a Microsoft OLE stream. There are tools like oledump.py which can extract data from an OLE stream but I've never tested that one myself.
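
Before reaching for an OLE tool, you can check which entries are OLE containers at all. This is only a sketch that looks for the 8-byte signature found at the start of OLE2 compound files; it does not parse the streams inside, which still needs a tool such as oledump.py. The function name find_ole_entries is hypothetical.

```python
import zipfile

# Signature bytes at the start of every OLE2 compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole_entries(source):
    """Return the names of package entries that start with the OLE2 signature.

    `source` is a path or a binary file-like object holding the document.
    """
    matches = []
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            if name.endswith("/"):
                continue  # directory entries carry no data
            with package.open(name) as entry:
                if entry.read(len(OLE_MAGIC)) == OLE_MAGIC:
                    matches.append(name)
    return matches
```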
LibreOffice 5.3.7 on Windows 7