Importing Postscript

Posted: Thu Jun 05, 2008 9:27 pm
by bgaston
Which version of OpenOffice.org are you using? 2.4X
What Operating System (version) are you using? Windows XP
What is your question or comment? Is there a way to read Postscript or EPS into Impress or Writer? Any info is helpful. Thanks!

Thanks for all of the suggestions. I am using PostScript because the image was originally created in MATLAB. I do not have the original code; otherwise I would just modify it. However, I am authorized to use the PostScript file since it was distributed freely in an academic environment. The help about creating a PDF from the PostScript file is very helpful. Thanks!

Re: Importing Postscript

Posted: Thu Jun 05, 2008 9:51 pm
by acknak
Not Postscript, but you can insert an EPS graphic using Insert > Picture > From File.

You should understand that OOo is rather picky about what EPS files it will accept and display properly.

Re: Importing Postscript

Posted: Fri Jun 06, 2008 11:22 am
by sybille
At times, I've needed to get text from a postscript file into an ODT document. To do this on Linux, I've converted the PS file to PDF. I tend to use the command line and ps2pdf, which is part of the Ghostscript package and should be installed on just about any Linux distro that has packages for printing. It's also possible to do the same thing without the command line, for example by using Evince, the document viewer that comes with Gnome, and choosing print to PDF.

However it's done, the resulting PDF can be opened in a PDF viewer (like Evince) and the text can be copied, something that is not possible when opening the original PS file in the document viewer. And the copied text can be pasted into an ODT.

Of course, this method does not really preserve the layout and formatting of the text so it may not be useful for your project, but it is another option. I mention Linux tools because that's what I know and use, but I'm sure that there are some options for opening a PS file and printing it to PDF with Windows tools.
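
For anyone who wants to script the ps2pdf step described above, here is a minimal Python sketch. It assumes Ghostscript is installed and that ps2pdf is on the PATH. The helper names (ps2pdf_command, convert_ps_to_pdf) and the file names are made up for illustration; only the basic "ps2pdf input.ps output.pdf" invocation comes from Ghostscript itself.

```python
import shutil
import subprocess
from pathlib import Path


def ps2pdf_command(ps_path, pdf_path=None):
    """Build the argument list for Ghostscript's ps2pdf wrapper.

    If no output name is given, reuse the input name with a .pdf suffix.
    """
    ps_path = Path(ps_path)
    if pdf_path is None:
        pdf_path = ps_path.with_suffix(".pdf")
    return ["ps2pdf", str(ps_path), str(pdf_path)]


def convert_ps_to_pdf(ps_path, pdf_path=None):
    """Run ps2pdf and return the path of the PDF it was asked to write.

    Raises RuntimeError if ps2pdf is not installed, and
    subprocess.CalledProcessError if the conversion itself fails.
    """
    exe = shutil.which("ps2pdf")
    if exe is None:
        raise RuntimeError("ps2pdf not found on PATH; install Ghostscript first")
    cmd = ps2pdf_command(ps_path, pdf_path)
    cmd[0] = exe  # use the resolved executable path
    subprocess.run(cmd, check=True)
    return Path(cmd[2])


if __name__ == "__main__":
    # Hypothetical file name; replace with a real PostScript file.
    print(ps2pdf_command("figure.ps"))
```

Whether text can be copied from the resulting PDF still depends on how the original PostScript file was produced.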

Re: Importing Postscript

Posted: Fri Jun 06, 2008 5:12 pm
by acknak
Nice idea--thanks!

ps2ascii can also be used to extract text from a PS file (all formatting is lost).

The PDF import extension is now ready for testing with OpenOffice.org 3.0+. The extension will load a PDF file directly into OOo Draw. I don't know if it helps at all with getting text out of the PDF.

Re: Importing Postscript

Posted: Fri Jun 06, 2008 5:43 pm
by sybille
The problem with ps2ascii is that not only does it not preserve formatting, but in some cases it also removes the spaces between words in the output text. That's a complication...
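
One rough way to spot output where the word spaces have been lost is to look for unusually long "words". The Python sketch below is only a heuristic and is not part of ps2ascii or Ghostscript; the function names and the thresholds (20 characters, 5% of tokens) are arbitrary assumptions, and long URLs or identifiers will trigger false positives.

```python
def suspicious_tokens(text, max_len=20):
    """Return whitespace-separated tokens longer than max_len characters."""
    return [tok for tok in text.split() if len(tok) > max_len]


def looks_run_together(text, max_len=20, threshold=0.05):
    """Guess whether extracted text has lost its word spaces.

    Returns True when more than `threshold` of all tokens are longer
    than `max_len` characters. Both limits are assumed defaults.
    """
    tokens = text.split()
    if not tokens:
        return False
    long_count = len(suspicious_tokens(text, max_len))
    return long_count / len(tokens) > threshold


if __name__ == "__main__":
    sample = "Thisisalinewherethespaceshavebeenlost and this is fine"
    print(suspicious_tokens(sample))
    print(looks_run_together(sample))
```

If the check fires, going through PDF or OCR instead may give cleaner text.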

And too, it really depends on the nature of the postscript file. They're not all the same, so sometimes printing to PDF won't produce a PDF with text that can be copied.

Depending on the project, it can even be worthwhile to print to paper, scan, and then use optical character recognition to convert the scan back to text. Programs that use the tesseract-ocr engine give good enough results to make this an option.
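
For the OCR route, the tesseract command-line program can be called directly on a scanned page image. The Python sketch below assumes tesseract is installed and on the PATH and that the English language data ("eng") is available. The helper names and file names are hypothetical; the underlying invocation is "tesseract image outputbase -l lang", which writes outputbase.txt.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for a basic tesseract run.

    tesseract appends ".txt" to output_base for its text output.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognised text."""
    exe = shutil.which("tesseract")
    if exe is None:
        raise RuntimeError("tesseract not found on PATH")
    cmd = tesseract_command(image_path, output_base, lang)
    cmd[0] = exe  # use the resolved executable path
    subprocess.run(cmd, check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical scan file; replace with a real image.
    print(tesseract_command("scan.png", "scan"))
```

Recognition quality depends heavily on the scan resolution, so the output should be proofread.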

Anyway, it's nice to have a variety of approaches. I'm really looking forward to seeing how OOo's PDF importing will work and also to the hybrid PDF formats - exciting developments. :)

Re: Importing Postscript

Posted: Fri Jun 06, 2008 8:09 pm
by TerryE
One of the issues here is that both PS and PDF formats are optimised for and targeted at display and printing. In many enterprises, working documents are circulated in DOC (and hopefully, more and more often, in an Open Document Format). Baseline versions are circulated in PS or PDF format to "freeze" them for publication, and one of the reasons for this is to discourage uncontrolled modification or plagiarism. That's one of the influences on their design.

By nature PDF to RTF convertors seek to frustrate this intent. I do wonder, are you honouring the copyright statements in the documents that you are wanting to use, rich text and all?

Re: Importing Postscript

Posted: Fri Jun 06, 2008 9:09 pm
by sybille
TerryE wrote:I do wonder, are you honouring the copyright statements in the documents that you are wanting to use, rich text and all?
I can't speak for the original poster, but in my case it's essentially a matter of academic publishing and I am dealing with the authors directly.

Re: Importing Postscript

Posted: Fri Jun 06, 2008 10:48 pm
by TerryE
sybille, in that case then almost certainly they used some authoring package to prepare the PS/PDF in the first place. Why not ask for a copy of that?

Re: Importing Postscript

Posted: Fri Jun 06, 2008 11:53 pm
by sybille
It is very important to respect copyright, and I do appreciate that you're drawing attention to this.

Yet there are times when the best or the only option is to use free software tools to work with PS and PDF files, including to extract text. I believe that this does not in itself constitute any kind of disregard for or violation of copyright restrictions. It really depends on the file, knowing who holds the copyright to the material it contains, the purpose of the endeavor, and the laws of a given country.

Re: Importing Postscript

Posted: Sat Jun 07, 2008 12:20 pm
by TerryE
Sybille, I asked one Q which you've answered. My point is that it's a shame that the author has granted the right to copy, but has provided the material in an encoding format that it is difficult to extract text from. Many OCR packages will now bypass the need to print PDFs before scanning them in again. Google "PDF scanning" and you will see that there are quite a few cheap packages which provide PDF -> RTF conversion.

Re: Importing Postscript

Posted: Sat Jun 07, 2008 12:52 pm
by sybille
I'm not sure which question I missed?

For pdf to text, I use gscan2pdf with the tesseract-ocr engine (as opposed to gocr), and it works just great.
http://gscan2pdf.sourceforge.net/
Gscan2pdf supports PDF import as well as being a scanner tool, so it can be used for text extraction with existing PDFs. It does not yet support page layout of the scanned text (PDF to RTF), but that's in development with the Ocropus project which, like tesseract-ocr, is sponsored by Google.

For me, it's basically a matter of having a toolbox with different approaches for dealing with corner cases when the "source" files no longer exist or cannot be readily obtained for whatever reason.