Hi there,
Is it possible to extract embedded objects from OpenOffice files?
When I look inside the zip file, I see files like: Object 1, Object 2, Object 3, Object 4, ...
It seems that the original file is stored in Object 1, preceded by a header.
Is there an API to read the header and then extract the embedded object from the Object # file?
Where can I find the header format?
Thanks,
Billow
Extract Embedded objects from ODT and other OO files
OpenOffice 3.1 on Windows 2010
Re: Extract Embedded objects from ODT and other OO files
At least if the embedded objects contain an OpenOffice / LibreOffice component, it is possible.
How to do it is exemplified for extraction from an AOO/LibO spreadsheet file (.ods) in an attachment to the recent thread viewtopic.php?f=9&t=91041&p=431198&hili ... ed#p431198 .
Doing it from a Writer file requires some changes due to the different document models. Text documents contain "textembeddedobject"s, and spreadsheet documents have one DrawPage per sheet, while text documents have only one DrawPage for the whole document. In both cases, each embedded object is represented by a shape that contains it and is in turn a member of a DrawPage.
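To illustrate the difference between the two document models, here is a rough Python sketch that collects the names of embedded objects from an already loaded document. It is not the macro from the linked attachment. The function name embedded_object_names is made up for illustration, and doc is assumed to be a UNO document object obtained elsewhere (for example XSCRIPTCONTEXT.getDocument() in a Python macro). It also assumes that shapes holding embedded objects support the com.sun.star.drawing.OLE2Shape service and expose a PersistName property. The names reported for a text document may be display names rather than the storage names seen in the zip file.

```python
def embedded_object_names(doc):
    """Return the names of embedded objects in a loaded UNO document (sketch).

    Assumption: `doc` is a text or spreadsheet document model obtained
    through the office API; nothing here opens a file by itself.
    """
    # Text documents expose their embedded objects directly as a named container.
    if doc.supportsService("com.sun.star.text.TextDocument"):
        return list(doc.getEmbeddedObjects().getElementNames())

    # Spreadsheets (one DrawPage per sheet): walk every DrawPage and
    # keep the shapes that wrap an embedded object.
    names = []
    pages = doc.getDrawPages()
    for page_index in range(pages.getCount()):
        page = pages.getByIndex(page_index)
        for shape_index in range(page.getCount()):
            shape = page.getByIndex(shape_index)
            if shape.supportsService("com.sun.star.drawing.OLE2Shape"):
                # Assumption: PersistName holds the object's name inside the package.
                names.append(shape.getPropertyValue("PersistName"))
    return names
```

Getting from the name to the actual content still needs the export step shown in the linked attachment; this only covers finding the objects.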
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
Re: Extract Embedded objects from ODT and other OO files
Hi Billow,
Are you looking to extract embedded objects from ODT files specifically using the C++/Java API or by looking through the extracted zip file?
When extracting embedded objects from the zip file, they can take one of two forms -- a file-based Object or a directory-based Object.
The directory-based objects occur most frequently when an OpenOffice/LibreOffice document is embedded within another OpenOffice/LibreOffice document. If you look at the directory-based object's sub-directory, it will contain the same types of files you would find in a standard document, such as content.xml, settings.xml, styles.xml, etc. Unfortunately, you can't just zip up the directory-based object and give it an odp, odt, or ods extension, because the directory-based object is missing its own manifest.xml file. It's possible to take the manifest.xml file from the parent document, modify it, and then combine it with the files in the directory-based object, but this can get difficult depending on the complexity of the parent file and the embedded object.
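The manifest approach could look roughly like the Python sketch below, which uses only the standard library (Python 3.8 or later). The function name repackage_embedded is made up for illustration. It assumes the parent manifest has a file-entry for the object directory itself (for example "Object 1/") that carries the object's media type, and it ignores encryption and anything the object references outside its own directory. Treat it as a starting point, not something verified against every office version.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
ET.register_namespace("manifest", MANIFEST_NS)


def _q(name):
    """Qualify a manifest element or attribute name with its namespace."""
    return "{%s}%s" % (MANIFEST_NS, name)


def repackage_embedded(parent_bytes, object_dir):
    """Build a standalone package from a directory-based object (sketch)."""
    prefix = object_dir.rstrip("/") + "/"
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(parent_bytes)) as parent:
        parent_manifest = ET.fromstring(parent.read("META-INF/manifest.xml"))

        # Copy only the manifest entries that belong to the object,
        # rewriting their paths relative to the object directory.
        new_manifest = ET.Element(_q("manifest"))
        version = parent_manifest.get(_q("version"))
        if version:
            new_manifest.set(_q("version"), version)
        media_type = None
        for entry in parent_manifest.findall(_q("file-entry")):
            path = entry.get(_q("full-path"), "")
            if not path.startswith(prefix):
                continue
            relative = path[len(prefix):] or "/"
            if relative == "/":
                media_type = entry.get(_q("media-type"))
            copied = ET.SubElement(new_manifest, _q("file-entry"), dict(entry.attrib))
            copied.set(_q("full-path"), relative)
        if not media_type:
            raise ValueError("no media type found for %r in parent manifest" % object_dir)

        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as child:
            # The mimetype entry goes first and is stored uncompressed.
            child.writestr("mimetype", media_type, compress_type=zipfile.ZIP_STORED)
            for name in parent.namelist():
                if name.startswith(prefix) and not name.endswith("/"):
                    child.writestr(name[len(prefix):], parent.read(name))
            child.writestr(
                "META-INF/manifest.xml",
                ET.tostring(new_manifest, encoding="UTF-8", xml_declaration=True),
            )
    return out.getvalue()
```

Usage would be something like reading the parent .odt into bytes, calling repackage_embedded(data, "Object 1"), and writing the result to a file with the extension that matches the media type.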
The file-based objects occur most frequently when the embedded object was inserted by Microsoft Office and/or is a file type that cannot be expressed using the OpenDocument Format. In these cases the native file is wrapped in a Microsoft OLE stream. There are tools like oledump.py which can extract data from an OLE stream but I've never tested that one myself.
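To tell the two forms apart without unpacking anything by hand, a small Python sketch like the following could be used (standard library only). The function name classify_objects is made up for illustration. It assumes the objects are top-level entries whose names start with "Object ", as in the original question, and it recognises an OLE container only by the 8-byte signature at the start of a compound file. It does not parse the OLE stream itself; that is the job of a tool like oledump.py.

```python
import io
import zipfile

# Signature found at the start of an OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify_objects(package_bytes):
    """Map each top-level "Object N" entry to the form it takes (sketch)."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            top, separator, _rest = name.partition("/")
            if not top.startswith("Object "):
                continue
            if separator:
                # Entries below "Object N/" mean a directory-based object.
                result[top] = "directory"
            else:
                # A plain file: peek at the first bytes for the OLE signature.
                with package.open(name) as stream:
                    head = stream.read(len(OLE_MAGIC))
                result[top] = "ole-file" if head == OLE_MAGIC else "other-file"
    return result
```

Anything reported as "ole-file" can then be written out with package.read(name) and handed to an OLE tool to get at the native file inside.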
LibreOffice 5.3.7 on Windows 7