Very bad conversions from OCR

Zevach
Posts: 33
Joined: Thu Nov 29, 2007 8:02 pm

Post by Zevach »

I have been playing around with an OCR program (ABBYY FineReader v. 9). My last “project” was to digitize an 18-page booklet with text and 7 simple images.

I first produced a PDF file from it – 685 KB, good quality. Then I produced several DOC files, to see how the document looked when opened in Word (Word 2003 in my case).

The file size varied greatly with the image format that I selected. With the default settings for “high-quality-print” images, I got a 36 MB DOC. When I selected PNG color images, the size dropped to 14.7 MB, and with JPEG images the size was 926 KB, close to the PDF size.

All the files opened fast in Word, and the quality of the conversion was very good.

Then I tried to open the three DOC files with Writer (2.4.1) – very bad results. In all three cases most of the images just disappeared and the text formatting was changed; the documents were unusable.

So it seems that OOo can't convert these “OCR DOC” files. Disappointing.
acknak
Moderator
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: Very bad conversions from OCR

Post by acknak »

Does the OCR software require you to install MS Office as well?

MS' ".doc" format is not a standard; the only way to know how it works is to look at how MS Office does it. OOo does this better than anyone else, and it still has lots of problems importing files that were produced by MS Office. Do you expect files produced by some other software to work better than files produced by MS Office?

You might want to try the PDF output. OOo 3 (now in beta; stable release in a few weeks) has an extension that can import PDF documents into OOo Draw. Perhaps from there you could either tweak the document directly or copy/paste the pieces into Writer to reconstruct a text document.

You can also try MS' ODF export filter, but if you really want a direct OCR-to-text document, ask the company to support ODF; at least it's an open file format.
AOO4/LO5 • Linux • Fedora 23