Extracting numbers

Discuss the spreadsheet application
Post Reply
ed in Fla
Posts: 2
Joined: Sun Oct 14, 2018 2:47 pm

Extracting numbers

Post by ed in Fla »

Hi.

I have a bunch of .pdf's that I am cutting and pasting columns of data from. Typical file looks like this

https://wsfrprograms.fws.gov/subpages/l ... 202018.pdf

Pasted data looks like

Alabama 301,537
Alaska 31,879
Arizona 97,312
Arkansas 225,969

I want to get the numbers in their own column. Easy enough except you have
North Carolina 365,635
Two spaces before the number.

Thing is I want to create a time series and they didn't store it that way.



Thoughts?
OpenOffice 4.1.1
erbsenzahl
Volunteer
Posts: 255
Joined: Tue Apr 18, 2017 8:23 am
Location: Germany

Re: Extracting numbers

Post by erbsenzahl »

Extraction of PDFs isn't that simple because spaces are often added for reasons of lay-out. Best results often by using some OCR applications like Abbyy finereader, IRIS and other (no freeware, sorry).

If it's a simple column and space between text and figures you could select entire column, then Data menu, Text to columns, Separated by Space.

If it's not so simple (e.g. more space "characters" instead of one) you can SEARCH/REPLACE two spaces by one and so often that only one remains between text and figures...

Cheers
SeparationOfTextAndFigures.png
Last edited by erbsenzahl on Sun Oct 14, 2018 5:03 pm, edited 2 times in total.
LibreOffice current versions 7 and OpenOffice 4.1.15
on LinuxMint 20 - 21 Mate, W10-64 pro
User avatar
RusselB
Moderator
Posts: 6646
Joined: Fri Jan 03, 2014 7:31 am
Location: Sarnia, ON

Re: Extracting numbers

Post by RusselB »

Welcome to the Forums.

I was needing something a while back for a similar situation.
There were a number of replies (it's been a long time, and I don't feel like searching all of my posts for the correct one), but the one I liked the best, is

Code: Select all

=VALUE(MID(A1;FIND(CHAR(0);SUBSTITUTE(A1;" ";CHAR(0);LEN(A1)-LEN(SUBSTITUTE(A1;" ";""))));LEN(A1)))
This puts the 365,635 of North Carolina 365,635 into B1 (presuming the formula is in B1). It does NOT remove it from the entry in A1, and you do not want to remove the numbers manually afterwards, otherwise the entry in B1 will change.
If you need the state in one column alone and the (guessing) population in a different column, then put the first formula into C1 and

Code: Select all

=substitute(A1;C1;"")
into B1
OpenOffice 4.1.7, LibreOffice 7.0.1.2 on Windows 7 Pro, Ultimate & Windows 10 Home (2004)
If you believe your problem has been resolved, please go to your first post in this topic, click the Edit button and add [Solved] to the beginning of the Subject line.
User avatar
Lupp
Volunteer
Posts: 3552
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Extracting numbers

Post by Lupp »

How to understand?

-1- In your example pdf all the states (55 + 1DC ??) are listed with their two-letter-abbreviations.
-2- I cannot find cells with compound content similar to what you posted in plain text.
-3- The pdf is strucured badly (as usual; who did it?), but with an ordinary pdf-reader (Foxit in my case) I get the file presented with a readable table from where I can copy the content as a table and then 'Paste Speciel...' It into a spreadsheet using the RTF option.
Now merged cells and nonsense of the kind may cause problems, but at least the data are split.
Leading and trailing spaces can be removed using the TRIM() function, and the commas supposed to mimic thousands separators with SUBSTITUTE().

Concerning the example "North Carolina 365,635" I would again recommend to TRIM away superfluous spaces and to SUBSTITUTE the annoying commas by nothing. Then you find the first position behind the space splitting the number from the name by

Code: Select all

=SEARCH(" [^A-Z]";ExampleCellReference)
(with RegEx enabled) and you get the numbers as actual numbers under any locale.

(The pasted table from the pdf is much better usable in LibreOffice V6.1.1 than in Apache OpenOffice V4.1.5.)

See attachment containing the result.
Attachments
Hunting_License_Report_2018reworked.ods
(27.13 KiB) Downloaded 74 times
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
ed in Fla
Posts: 2
Joined: Sun Oct 14, 2018 2:47 pm

Re: Extracting numbers

Post by ed in Fla »

Hi folks. I want to thank you all for your help on this, somewhat trivial, project.

In poking around looking for trend data on "hunting" (broadly defined) I came across the trove that I have linked to and that got me thinking "maybe absolute number of licenses is decreasing but does that mean that $ are decreasing or that non-resident licenses are decreasing?" etc etc. I was also interested is seeing if there is a relationship between NICS (firearms background checks) and licenses.

All inquiries of the highest importance yet stymied by annoying data formats.

Thank you all again.
OpenOffice 4.1.1
Post Reply