
[Dropped] Concept question about reading OCR files & creating list

Posted: Sat Apr 01, 2023 8:49 pm
by Cat101
The attorneys I work with create files which are PDF/A files - meaning they can't be changed and are OCR readable.

I know I can use FIND to search in the document, after I open it, BUT...

What I'm pondering is:

1) writing something which can read all their documents located in a daily file and create a list of them.
2) the list is created from what I read, as well as from some cross tables that give codes, and is loaded into the file.
3) then write a link to the document's location in the file - I'm estimating about 600 records per day - obviously it will need some error handling, but that will come later.
4) This will be a program which is kicked off by a daily cron job or a scheduler.
5) Then it pushes out the files & list, then moves the files to done.

I know how to do steps 4 & 5, basically.
It's steps 1 - 3 that I'm pondering.

First - is the reading & gathering information from an OCR document, in some automated manner, possible?
Second - If possible, which direction is best? CALC, Database, Combo, something else?
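As a rough illustration of steps 1 - 3, here is a minimal Python sketch, not a finished design. It assumes the text of each PDF can be obtained by some `extract_text` function supplied by the caller (how that is done is left open), and the `CODE_TABLE` keywords and codes are made-up placeholders for the real cross tables. The list is written as a CSV file, which Calc or a database can open.

```python
import csv
from pathlib import Path

# Hypothetical cross table: keyword found in the text -> filing code.
CODE_TABLE = {"motion": "MOT", "subpoena": "SUB", "affidavit": "AFF"}


def classify(text, code_table=CODE_TABLE):
    """Return the codes whose keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted({code for word, code in code_table.items() if word in lowered})


def build_index(folder, extract_text, out_csv):
    """Write one CSV row per PDF in folder: file name, codes, link to the file."""
    rows = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        link = pdf.resolve().as_uri()  # file:// link to the document
        try:
            text = extract_text(pdf)
        except Exception as exc:
            # Basic error handling: record the failure and keep going.
            rows.append([pdf.name, "ERROR: " + str(exc), link])
            continue
        rows.append([pdf.name, ";".join(classify(text)), link])
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "codes", "link"])
        writer.writerows(rows)
    return rows
```

Keeping the extraction step as a parameter means the same loop works whether the text comes from an embedded text layer or from a separate OCR pass.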

Re: Concept Question about reading OCR files & creating list

Posted: Sat Apr 01, 2023 9:03 pm
by RoryOF
It is possible that the OCR might not be necessary. I note, using OCR on many PDF documents, that my OCR front-end frequently announces that the text is embedded in the PDF and asks whether I really want it to OCR the PDF. I have not yet discovered an application which will extract this embedded text (I'm using Linux as my operating system).
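One way to test whether a PDF already carries a text layer is sketched below in Python. It assumes the `pdftotext` command-line tool from poppler-utils is installed; that tool is not mentioned in this thread, so treat it as one possible option. The `min_chars` threshold is an arbitrary illustrative value.

```python
import shutil
import subprocess


def embedded_text(pdf_path, pdftotext="pdftotext"):
    """Return the text layer of a PDF using the pdftotext CLI (assumed installed)."""
    if shutil.which(pdftotext) is None:
        raise RuntimeError("pdftotext not found; it ships with poppler-utils")
    result = subprocess.run(
        [pdftotext, "-layout", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(text, min_chars=20):
    """Heuristic: the PDF counts as searchable if enough non-space characters came out."""
    return sum(not ch.isspace() for ch in text) >= min_chars
```

If `has_text_layer(embedded_text(path))` is false, the file is probably an image-only scan and would need an OCR pass first.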

Re: Concept Question about reading OCR files & creating list

Posted: Sat Apr 01, 2023 9:54 pm
by Cat101
Thanks. I'm not sure of all the methods by which the PDF/A documents are created. But based on what I've seen in their process, it appears to be via their scanners (ScanSnap). The staff mentioned that the scanner created two documents, one ending in OCR, and they use the OCR document for filing. I didn't go down that rabbit hole at the time but maybe will have to.

I want to add that, while the original doc is created in OpenOffice, currently they print it for the attorney's signature. It is that signed document that I've seen them scan and file.

Re: Concept Question about reading OCR files & creating list

Posted: Sat Apr 01, 2023 10:22 pm
by RoryOF
For OCR I use gimagereader QT as front-end - (gimagereader gtk is very slow, for reasons I don't know), but the QT version flies. They use Tesseract as the OCR engine, very accurate with good scans. Numbers in particular should be checked.
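To make the point about checking numbers concrete, here is a small Python sketch that flags tokens in OCR output worth a manual look. The list of look-alike letters is an assumption for illustration, not something taken from Tesseract, and the rule is only a heuristic: it will also flag legitimate mixed codes.

```python
import re

# Letters that are easily confused with digits in OCR output (assumed list).
LOOKALIKES = "OoIlSBZ"
TOKEN = re.compile(r"[A-Za-z0-9]+")


def suspicious_numbers(text):
    """Return tokens that contain a digit plus at least one look-alike letter."""
    flagged = []
    for tok in TOKEN.findall(text):
        has_digit = any(c.isdigit() for c in tok)
        has_lookalike = any(c in LOOKALIKES for c in tok)
        if has_digit and has_lookalike:
            flagged.append(tok)
    return flagged
```

For example, `suspicious_numbers("Case 2O23-l45 filed 2023")` returns `["2O23", "l45"]`, while the clean `2023` is left alone.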

Re: Concept Question about reading OCR files & creating list

Posted: Sat Apr 01, 2023 10:29 pm
by MrProgrammer
RoryOF wrote: Sat Apr 01, 2023 9:03 pm I have not yet discovered an application which will extract this embedded text (I'm using Linux as my operating system).
[Solved] Can I embed font in ODF document?

Re: Concept Question about reading OCR files & creating list

Posted: Sun Apr 02, 2023 12:23 am
by Cat101
OMG... Way, way cool. I coded in Perl in another lifetime. I wonder if I can load a Perl program onto these attorneys' system to test? I may need to brush up on Perl. Thanks MrProgrammer

Re: Concept Question about reading OCR files & creating list

Posted: Sun Apr 02, 2023 12:38 am
by Cat101
RoryOF wrote: Sat Apr 01, 2023 10:22 pm For OCR I use gimagereader QT as front-end - (gimagereader gtk is very slow, for reasons I don't know), but the QT version flies. They use Tesseract as the OCR engine, very accurate with good scans. Numbers in particular should be checked.
Thanks Rory. At first I read gim-age-reader & said what? LOL
I'll check it out too.

Re: Concept Question about reading OCR files & creating list

Posted: Sun Apr 02, 2023 12:49 am
by Cat101
oh. https://tesseract-ocr.github.io/tessdoc/Home.html is way interesting & Open Source. YEAH
I can see lots of play-time in my future.

I think I'll leave this open for a while & see how far I get in a month.




Re: Concept Question about reading OCR files & creating list

Posted: Sun Apr 02, 2023 9:01 am
by RoryOF
MrProgrammer wrote: Sat Apr 01, 2023 10:29 pm
RoryOF wrote: Sat Apr 01, 2023 9:03 pm I have not yet discovered an application which will extract this embedded text (I'm using Linux as my operating system).
[Solved] Can I embed font in ODF document?
Thanks, Mr Programmer; I'll play with that later, when I have some time - my main backup computer is currently having hysterics and needs talking to.

Re: Concept Question about reading OCR files & creating list

Posted: Sun Apr 02, 2023 9:17 am
by robleyd
my main backup computer is currently having hysterics and needs talking to.
With a Windows install medium in your hand, to properly frighten it :-)

Re: Concept Question about reading OCR files & creating list

Posted: Sat Apr 08, 2023 10:15 pm
by Cat101
In my never-ending quest to find the easiest way, I stumbled on this for VB.Net & C# coders. Been there, done those. As well as iTextSharp in ASP.NET. Posts were from 2018.

https://social.msdn.microsoft.com/Forum ... isualbasic I have no idea at this point if anything in this link works, but thought I'd share.

This very thing (Extracting info) will be my project for next week - meeting with their techs to see what's available. And if the force is with me, it will be done next week. Then onto the API aspect of the program.




Re: Concept Question about reading OCR files & creating list

Posted: Sun Apr 09, 2023 9:38 am
by RoryOF
Have a look at Mr Programmer's script which he points to earlier in this thread.