Extract Embedded objects from ODT and other OO files

Postby billowgao » Fri Nov 24, 2017 9:44 pm

Hi there,

Is it possible to extract embedded objects from OpenOffice files?
When I look inside the zip file, I see files like: Object 1, Object 2, Object 3, Object 4....
It seems that the original file is stored in Object 1, preceded by a header.

Is there an API to read the header and then extract the embedded object from the Object # file?

Where can I find the header format?

Thanks,

Billow
OpenOffice 3.1 on Windows 2010
billowgao
 
Posts: 1
Joined: Fri Nov 24, 2017 9:40 pm

Re: Extract Embedded objects from ODT and other OO files

Postby Lupp » Sat Nov 25, 2017 1:58 am

It is possible, at least if the embedded objects contain an OpenOffice / LibreOffice component.
How to do it is exemplified for extraction from an AOO/LibO spreadsheet file (.ods) in an attachment to the recent thread https://forum.openoffice.org/en/forum/v ... ed#p431198 .
Doing it from a Writer file requires some changes due to the different document models. Text documents contain "textembeddedobject"s, and spreadsheet documents have a DrawPage per sheet, while text documents have only one DrawPage for the whole document. In both cases, each embedded object is represented by a shape that contains it and is itself a member of a DrawPage.
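
For the Writer case, a minimal Python (PyUNO) sketch of the idea might look like the following. It is not the code from the linked attachment. It assumes the document exposes its embedded objects through getEmbeddedObjects() and that each one returns its component through getEmbeddedObject(), which is then stored with storeToURL(). The function names, the service-to-extension table and the ".bin" fallback are illustrative assumptions.

```python
from pathlib import Path

# Assumed mapping from UNO service name to file extension.
# Presentation is checked before Drawing on purpose, in case a
# presentation component also reports the drawing service.
SERVICE_EXTENSIONS = (
    ("com.sun.star.text.TextDocument", ".odt"),
    ("com.sun.star.sheet.SpreadsheetDocument", ".ods"),
    ("com.sun.star.presentation.PresentationDocument", ".odp"),
    ("com.sun.star.drawing.DrawingDocument", ".odg"),
)


def guess_extension(component):
    """Pick an extension from the first service the component supports."""
    for service, extension in SERVICE_EXTENSIONS:
        if component.supportsService(service):
            return extension
    return ".bin"  # hypothetical fallback for component types not listed


def export_embedded_objects(doc, out_dir):
    """Store each embedded office component of a text document as a file.

    Returns the list of file names that were written.
    """
    out = Path(out_dir).resolve()
    out.mkdir(parents=True, exist_ok=True)
    objects = doc.getEmbeddedObjects()  # name access over the text embedded objects
    saved = []
    for name in objects.getElementNames():
        component = objects.getByName(name).getEmbeddedObject()
        if component is None:
            # No office component available (for example a foreign OLE object).
            continue
        target = out / (name + guess_extension(component))
        # An empty argument tuple leaves the filter choice to the office.
        component.storeToURL(target.as_uri(), ())
        saved.append(target.name)
    return saved
```

Inside a Python macro, the document would typically come from XSCRIPTCONTEXT.getDocument(). For a spreadsheet, the loop would instead have to walk the shapes on each sheet's DrawPage, as described above.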
On Windows 10: LibreOffice 6.1 and older versions, PortableOpenOffice 4.1.5 and older, StarOffice 5.2
---
Let's create a powerful UFO: United Free Office!
Lupp from München
Lupp
Volunteer
 
Posts: 2176
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Extract Embedded objects from ODT and other OO files

Postby Waldo » Fri Jan 05, 2018 4:26 am

Hi Billow,

Are you looking to extract embedded objects from ODT files specifically using the C++/Java API or by looking through the extracted zip file?

When extracting embedded objects from the zip file, they can take one of two forms -- a file-based Object or a directory-based Object.

The directory-based objects occur most frequently when an OpenOffice/LibreOffice document is embedded within another OpenOffice/LibreOffice document. If you look at the directory-based object's sub-directory, it will contain the same types of files you would find in a standard document, such as content.xml, settings.xml, styles.xml, etc. Unfortunately, you can't simply zip up the directory-based object and give it an odp, odt, or ods extension, because the directory-based object is missing its own manifest.xml file. It's possible to take the manifest.xml file from the parent document, modify it, and then combine it with the files in the directory-based object, but this can get difficult depending on the complexity of the parent file and the embedded object.
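
The manifest approach can be sketched in Python using only the standard library. This is a rough illustration, not a tested recipe for every document: it assumes the parent manifest lives at META-INF/manifest.xml, that it lists the object's directory (for example "Object 1/") with the object's media type, and that a standalone package wants an uncompressed "mimetype" entry first. The function name repackage_object is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
ET.register_namespace("manifest", MANIFEST_NS)


def repackage_object(parent_path, object_name, out_path):
    """Copy a directory-based object into its own ODF package.

    Returns the media type found for the object in the parent manifest.
    """
    prefix = object_name.rstrip("/") + "/"
    full_path = "{%s}full-path" % MANIFEST_NS
    media_type = "{%s}media-type" % MANIFEST_NS

    with zipfile.ZipFile(parent_path) as parent:
        root = ET.fromstring(parent.read("META-INF/manifest.xml"))
        mimetype = None
        for entry in list(root):
            path = entry.get(full_path, "")
            if not path.startswith(prefix):
                root.remove(entry)  # belongs to the parent or another object
                continue
            if path == prefix:
                mimetype = entry.get(media_type)  # the object's own type
            # Make the path relative to the new package root.
            entry.set(full_path, path[len(prefix):] or "/")
        if not mimetype:
            raise ValueError("no manifest entry found for %r" % prefix)

        with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as out:
            # The mimetype entry goes first and is stored uncompressed.
            out.writestr("mimetype", mimetype, compress_type=zipfile.ZIP_STORED)
            out.writestr(
                "META-INF/manifest.xml",
                ET.tostring(root, encoding="utf-8", xml_declaration=True),
            )
            for info in parent.infolist():
                if info.filename.startswith(prefix) and not info.is_dir():
                    out.writestr(info.filename[len(prefix):], parent.read(info))
    return mimetype
```

A call such as repackage_object("parent.odt", "Object 1", "object1.ods") would be the intended use, with the output extension chosen to match the returned media type. Objects that depend on content in the parent document may still not open cleanly.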

The file-based objects occur most frequently when the embedded object was inserted by Microsoft Office and/or is a file type that cannot be expressed using the OpenDocument Format. In these cases the native file is wrapped in a Microsoft OLE stream. There are tools like oledump.py which can extract data from an OLE stream but I've never tested that one myself.
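
To see which form each object in a given file takes, something like the following Python sketch could be used. It assumes the objects sit at the top level of the zip under names starting with "Object " and identifies an OLE wrapper by the usual 8-byte compound-file signature. The function name classify_objects and the labels it returns are made up for this example.

```python
import zipfile

# Signature found at the start of an OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify_objects(package):
    """Map each top-level 'Object N' to 'directory', 'ole' or 'file'."""
    result = {}
    with zipfile.ZipFile(package) as zf:
        for name in zf.namelist():
            top, slash, _rest = name.partition("/")
            if not top.startswith("Object "):
                continue  # skips content.xml, ObjectReplacements/, etc.
            if slash:
                # Anything below "Object N/" means a directory-based object.
                result[top] = "directory"
            else:
                with zf.open(name) as stream:
                    head = stream.read(len(OLE_MAGIC))
                result[top] = "ole" if head == OLE_MAGIC else "file"
    return result
```

The bytes of an entry reported as "ole" could then be handed to an OLE tool such as the oledump.py mentioned above.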
LibreOffice 5.3.7 on Windows 7
Waldo
 
Posts: 4
Joined: Tue Jan 02, 2018 7:39 pm

