[Solved] Convert PDF to FODG/XML in JAVA

Reyleaf
Posts: 3
Joined: Mon Nov 19, 2018 9:43 am

[Solved] Convert PDF to FODG/XML in JAVA

Post by Reyleaf »

Hello everyone.

First of all, I’m an engineering student and not the best programmer.
To learn more, I asked my dad (who is not a programmer at all) whether he had a task for me.
He wants a program that converts a PDF to XML, examines the text in the XML, translates it, writes the translation back into the XML file, and converts the XML back to PDF.

After lots and lots of reading, I found LibreOffice and its document converter. A website (aconvert) uses LibreOffice for the conversion between the different file types.
After a few days and a lot of searching in this forum, I was able to convert my first files. [Thanks a lot to all of you]

I converted the PDFs to FODG (which is my preferred file type besides “normal” XML), but the converted files contain less data than the file I converted ‘by hand’. (Converted with filter draw8: 135 kB; with draw_pdf_export: 150 kB; by hand: 370 kB.)

1. So how did this happen? Is it because of the read-only mode? Or because the file isn’t opened normally?

I tried to read about this, but I didn’t find any convincing solution. If I add a PropertyValue for loading the document (ReadOnly=False), the code doesn’t work.

Code:

// Loading the wanted document
com.sun.star.beans.PropertyValue propertyValues[] =
    new com.sun.star.beans.PropertyValue[2];
propertyValues[0] = new com.sun.star.beans.PropertyValue();
propertyValues[0].Name = "Hidden";
propertyValues[0].Value = Boolean.TRUE;
propertyValues[1] = new com.sun.star.beans.PropertyValue();
propertyValues[1].Name = "ReadOnly";
propertyValues[1].Value = Boolean.FALSE;

I use Java in Eclipse on Windows 10 with LibreOffice 6.1, PDF 1.6, and the filter names draw8 or draw_pdf_export (I didn’t find any other that would work, nor a current filter list for any LibreOffice version after 3.5).

Thank you and have a nice day,
Rey



Whole code (nearly the original example)

Code:

/* -*- Mode: Java; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
/*************************************************************************
 *
 *  The Contents of this file are made available subject to the terms of
 *  the BSD license.
 *
 *  Copyright 2000, 2010 Oracle and/or its affiliates.
 *  All rights reserved.
 *
 *  Redistribution and use in source and binary forms, with or without
 *  modification, are permitted provided that the following conditions
 *  are met:
 *  1. Redistributions of source code must retain the above copyright
 *     notice, this list of conditions and the following disclaimer.
 *  2. Redistributions in binary form must reproduce the above copyright
 *     notice, this list of conditions and the following disclaimer in the
 *     documentation and/or other materials provided with the distribution.
 *  3. Neither the name of Sun Microsystems, Inc. nor the names of its
 *     contributors may be used to endorse or promote products derived
 *     from this software without specific prior written permission.
 *
 *  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 *  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 *  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 *  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 *  COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 *  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
 *  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
 *  OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 *  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
 *  TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
 *  USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 *************************************************************************/

import com.sun.star.uno.UnoRuntime;

import java.io.File;
import ooo.connector.BootstrapSocketConnector;


/** The class <CODE>DocumentConverter</CODE> allows you to convert all documents
 * in a given directory and in its subdirectories to a given type. A converted
 * document will be created in the same directory as the origin document.
 *
 */
public class DocumentConverter {
    /** The component loader used to load the documents
     */
    static com.sun.star.frame.XComponentLoader xCompLoader = null;
    /** Containing the given type to convert to
     */
    static String sConvertType = "";
    /** Containing the given extension
     */
    static String sExtension = "";
    /** Containing the current indentation of the console output
     */
    static String sIndent = "";
    /** Containing the directory where the converted files are saved
     */
    static String sOutputDir = "";

    /** Traversing the given directory recursively and converting their files to
     * the favoured type if possible
     * @param fileDirectory Containing the directory
     */
    static void traverse( File fileDirectory ) {
        // Testing whether the file is a directory; if it is not, an exception is thrown
        if ( !fileDirectory.isDirectory() ) {
            throw new IllegalArgumentException(
                "not a directory: " + fileDirectory.getName()
                );
        }

        // Prepare Url for the output directory
        File outdir = new File(DocumentConverter.sOutputDir);
        String sOutUrl = "file:///" + outdir.getAbsolutePath().replace( '\\', '/' );

        System.out.println("\nThe converted documents will be stored in \""
                           + outdir.getPath() + "\"!");

        System.out.println(sIndent + "[" + fileDirectory.getName() + "]");
        sIndent += "  ";

        // Getting all files and directories in the current directory
        File[] entries = fileDirectory.listFiles();


        // Iterating for each file and directory
        for ( int i = 0; i < entries.length; ++i ) {
            // Testing, if the entry in the list is a directory
            if ( entries[ i ].isDirectory() ) {
                // Recursive call for the new directory
                traverse( entries[ i ] );
            } else {
                // Converting the document to the favoured type
                try {
                    // Composing the URL by replacing all backslashes
                    String sUrl = "file:///"
                        + entries[ i ].getAbsolutePath().replace( '\\', '/' );

                    // Loading the wanted document
                    com.sun.star.beans.PropertyValue propertyValues[] =
                        new com.sun.star.beans.PropertyValue[1];
                    propertyValues[0] = new com.sun.star.beans.PropertyValue();
                    propertyValues[0].Name = "Hidden";
                    propertyValues[0].Value = Boolean.TRUE;

                    Object oDocToStore =
                        DocumentConverter.xCompLoader.loadComponentFromURL(
                            sUrl, "_blank", 0, propertyValues);

                    // Getting an object that will offer a simple way to store
                    // a document to a URL.
                    com.sun.star.frame.XStorable xStorable =
                        UnoRuntime.queryInterface(
                        com.sun.star.frame.XStorable.class, oDocToStore );

                    // Preparing properties for converting the document
                    propertyValues = new com.sun.star.beans.PropertyValue[2];
                    // Setting the flag for overwriting
                    propertyValues[0] = new com.sun.star.beans.PropertyValue();
                    propertyValues[0].Name = "Overwrite";
                    propertyValues[0].Value = Boolean.TRUE;
                    // Setting the filter name
                    propertyValues[1] = new com.sun.star.beans.PropertyValue();
                    propertyValues[1].Name = "FilterName";
                    propertyValues[1].Value = DocumentConverter.sConvertType;

                    // Appending the favoured extension to the origin document name
                    int index1 = sUrl.lastIndexOf('/');
                    int index2 = sUrl.lastIndexOf('.');
                    String sStoreUrl = sOutUrl + sUrl.substring(index1, index2 + 1)
                        + DocumentConverter.sExtension;

                    // Storing and converting the document
                    xStorable.storeAsURL(sStoreUrl, propertyValues);

                    // Closing the converted document. Use XCloseable.close if the
                    // interface is supported, otherwise use XComponent.dispose
                    com.sun.star.util.XCloseable xCloseable =
                        UnoRuntime.queryInterface(
                        com.sun.star.util.XCloseable.class, xStorable);

                    if ( xCloseable != null ) {
                        xCloseable.close(false);
                    } else {
                        com.sun.star.lang.XComponent xComp =
                            UnoRuntime.queryInterface(
                            com.sun.star.lang.XComponent.class, xStorable);

                        xComp.dispose();
                    }
                }
                catch( Exception e ) {
                    e.printStackTrace(System.err);
                }

                System.out.println(sIndent + entries[ i ].getName());
            }
        }

        sIndent = sIndent.substring(2);
    }

    /** Bootstrap UNO, getting the remote component context, getting a new instance
     * of the desktop (used interface XComponentLoader) and calling the
     * static method traverse
     * @param args The array of the type String contains the directory, in which
     *             all files should be converted, the favoured converting type
     *             and the wanted extension
     */
    public static void main( String args[] ) {
        // All four arguments are required; args[3] is read below
        if ( args.length < 4 ) {
            System.out.println("usage: java -jar DocumentConverter.jar " +
                "\"<directory to convert>\" \"<type to convert to>\" " +
                "\"<extension>\" \"<output_directory>\"");
            System.out.println("\ne.g.:");
            System.out.println("usage: java -jar DocumentConverter.jar " +
                "\"c:/myoffice\" \"swriter: MS Word 97\" \"doc\" \"c:/converted\"");
            System.exit(1);
        }

        com.sun.star.uno.XComponentContext xContext = null;

        try {
            // get the remote office component context
            String oooExeFolder = "C:\\Program Files\\LibreOffice\\program";
            xContext = BootstrapSocketConnector.bootstrap(oooExeFolder);
            System.out.println("Connected to a running office ...");

            // get the remote office service manager
            com.sun.star.lang.XMultiComponentFactory xMCF =
                xContext.getServiceManager();

            Object oDesktop = xMCF.createInstanceWithContext(
                "com.sun.star.frame.Desktop", xContext);

            xCompLoader = UnoRuntime.queryInterface(com.sun.star.frame.XComponentLoader.class,
                                      oDesktop);

            // Getting the given starting directory
            File file = new File(args[0]);

            // Getting the given type to convert to
            sConvertType = args[1];

            // Getting the given extension that should be appended to the
            // origin document
            sExtension = args[2];

            // Getting the given output directory
            sOutputDir = args[3];

            // Starting the conversion of documents in the given directory
            // and subdirectories
            traverse(file);

            System.exit(0);
        } catch( Exception e ) {
            e.printStackTrace(System.err);
            System.exit(1);
        }
    }
}

/* vim:set shiftwidth=4 softtabstop=4 expandtab: */
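
A side note on the URL handling in the listing above: it builds file URLs by prefixing "file:///" and replacing backslashes, so spaces and other special characters in a path are not percent-encoded, and the store URL is cut at the last '.' of the whole URL, which fails for files without an extension. The following is only a sketch of one alternative using plain java.io; the class and method names (StoreUrlSketch, toFileUrl, replaceExtension, storeUrl) are made up for illustration and are not part of the SDK example. It has not been tested against a running office.

```java
import java.io.File;

class StoreUrlSketch {

    /** Builds a percent-encoded URL in the file:/// form for a local file. */
    static String toFileUrl(File file) {
        // File.toURI() escapes spaces and other illegal characters;
        // getRawPath() keeps that escaping and always starts with a slash.
        return "file://" + file.getAbsoluteFile().toURI().getRawPath();
    }

    /** Replaces the extension of a plain file name, or appends one if none exists. */
    static String replaceExtension(String fileName, String newExtension) {
        int dot = fileName.lastIndexOf('.');
        // dot <= 0 means "no extension" (or a name that only starts with a dot)
        String base = (dot > 0) ? fileName.substring(0, dot) : fileName;
        return base + "." + newExtension;
    }

    /** URL of the converted file inside the output directory. */
    static String storeUrl(File outputDir, File input, String newExtension) {
        String name = replaceExtension(input.getName(), newExtension);
        return toFileUrl(new File(outputDir, name));
    }

    public static void main(String[] args) {
        // Hypothetical default paths, only used when no arguments are given
        File input = new File(args.length > 0 ? args[0] : "my docs/report.pdf");
        File outDir = new File(args.length > 1 ? args[1] : "out");
        System.out.println(toFileUrl(input));
        System.out.println(storeUrl(outDir, input, "fodg"));
    }
}
```

In the listing, sUrl and sStoreUrl could then presumably be computed as toFileUrl(entries[i]) and storeUrl(outdir, entries[i], sExtension).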
Last edited by Reyleaf on Tue Nov 20, 2018 12:12 pm, edited 1 time in total.
OpenOffice 4.1.5 on Windows 10; LibreOffice 6.1
Reyleaf
Posts: 3
Joined: Mon Nov 19, 2018 9:43 am

Re: Convert PDF to FODG/XML in JAVA

Post by Reyleaf »

No filter is the best filter.

It doesn't work perfectly, but it is enough for now.
Villeroy
Volunteer
Posts: 31279
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: [Solved] Convert PDF to FODG/XML in JAVA

Post by Villeroy »

PDF is not a document format. It is a printer format. You can convert anything printable to PDF but not the other way round, simply because PDF does not contain any information about the original document structure (if there ever was a "document" at all). Likewise, you never get a satisfactory result when converting a printed page back into computer data, even if you store the information in the same file format as the software that was used for the print-out. Likewise, every video, sound, or picture editor has its own format for editing and various output formats for the consumable media. PDF is the output format for anything printable when the work is done.
The office component that comes closest to the data structures in a PDF is the Draw component. We have an OpenOffice extension which tries its best to convert anything printable from any other application into multi-page vector graphics or into a collection of bitmaps (some PDFs consist of paper-size bitmaps). This conversion cannot be perfect. It is like pushing the toothpaste back into the tube.
The extension provides another feature. It can produce hybrid PDFs that include the original office document. These hybrids can be opened by PDF viewers as well as by OpenOffice / LibreOffice. The viewers load the PDF part; the office suites load the office document and save both parts when the office document has been modified.
The attached archive contains a Python module that can be run as a macro. It reports the available data about the supported file types and type detection mechanisms.
The reported filter name of a flat ODG file is "draw_ODG_FlatXML". The XML follows the same standard as the normal Open Document Graphic format (odg). ODG contains XML files in a zip archive; FODG wraps all data in a single, uncompressed XML file.
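
Because a FODG file is one plain XML file, the "examine the text" step from the first post can be done with the XML parser that ships with Java. The following is only a rough sketch, assuming the text of interest sits in text:p paragraph elements of the OpenDocument text namespace; the class and method names (FodgTextSketch, paragraphs, paragraphsFromString) are made up for illustration, and the result depends on how Draw split the PDF text into shapes.

```java
import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FodgTextSketch {
    /** Namespace of the text:* elements in OpenDocument files. */
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Returns the character data of every non-empty text:p element. */
    static List<String> paragraphs(InputSource source) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder().parse(source);
            NodeList nodes = doc.getElementsByTagNameNS(TEXT_NS, "p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() joins the text of nested text:span elements.
                // Empty helper elements such as text:s or text:tab carry no
                // character data, so the whitespace they stand for is lost here.
                String text = nodes.item(i).getTextContent();
                if (!text.trim().isEmpty()) {
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse FODG XML", e);
        }
    }

    /** Convenience variant for XML that is already in memory. */
    static List<String> paragraphsFromString(String xml) {
        return paragraphs(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) {
        // Expects the path of a .fodg file as the first argument
        String systemId = new File(args[0]).toURI().toString();
        for (String paragraph : paragraphs(new InputSource(systemId))) {
            System.out.println(paragraph);
        }
    }
}
```

Writing translated text back would need more care than this, since replacing the content of a paragraph also discards its nested span formatting.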
Attachments
reportFileTypes.zip
Python macro. Extract to directory <user_profile>/Scripts/python/
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice
Reyleaf
Posts: 3
Joined: Mon Nov 19, 2018 9:43 am

Re: [Solved] Convert PDF to FODG/XML in JAVA

Post by Reyleaf »

Thanks for the advice and the filter macro.
I don't want a 100% conversion rate, just as much as possible. That is enough for my purpose.

You definitely helped me a lot in understanding even more of the subject.