[Solved] Alphabetized vocabulary list

MidtownKR
Posts: 4
Joined: Fri Oct 18, 2013 8:57 am

[Solved] Alphabetized vocabulary list

Post by MidtownKR »

Is there some feature where I could scan a document and instantly alphabetize all words appearing in that document? For example, create a vocabulary list for words appearing in that document. An enhancement would be to index the page on which each word appears, as in a foreign-language textbook.
Last edited by Hagar Delest on Tue Oct 22, 2013 11:06 pm, edited 1 time in total.
Reason: tagged [Solved].
OpenOffice 2.4 on Ubuntu 13.04
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Alphabetized vocabulary list

Post by RoryOF »

You have to OCR the scan to convert the words from a picture into text. This needs another application (OCR). Then you can change all spaces into paragraph marks with Find and Replace, save the file as text, and put the file of single-word lines into a dictionary. I vaguely remember that OO has such a facility.
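If the text is already available as plain text, the "one word per line" step can also be done outside the office suite. What follows is only a minimal sketch in Python, not an OpenOffice feature: the function name `vocabulary` and the sample sentence are made up for illustration, and the rule for what counts as a word (runs of letters, optionally joined by an apostrophe or hyphen) is an assumption you may want to adjust.

```python
import re

def vocabulary(text):
    """Return the distinct words in text, lowercased and sorted alphabetically.

    Assumption: a "word" is a run of letters, optionally joined to further
    runs of letters by an apostrophe or a hyphen (e.g. "cat's", "well-known").
    """
    pattern = r"[^\W\d_]+(?:['’-][^\W\d_]+)*"
    words = re.findall(pattern, text.lower())
    # set() drops the duplicates, sorted() alphabetizes what is left.
    return sorted(set(words))

# Hypothetical sample text; in practice, read the saved text file instead.
sample = "The cat sat on the mat. The mat was the cat's."
print("\n".join(vocabulary(sample)))
```

The printed result is one word per line, which is the same shape as the file of single-word lines described above.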
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
MidtownKR
Posts: 4
Joined: Fri Oct 18, 2013 8:57 am

Re: Alphabetized vocabulary list

Post by MidtownKR »

I want to create a reader with selections covering various topics. It would contain articles from various sources. With the selections, I want to create a vocabulary list without having to manually read through the whole thing and select words to go into a glossary (and end up entering the same word many times).

In other words, I would first create a .doc or .odt document (or similar), paste in long text selections, and then develop a glossary. I am looking for something that would compile a list of all words that appear in a selection (or the whole document) and alphabetize them. With that long list I would then manually remove unneeded or redundant items and then develop a glossary at the end of the book with the edited list compiled from that operation. But I need a quick and efficient way to create that list.
OpenOffice 2.4 on Ubuntu 13.04
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Alphabetized vocabulary list

Post by RoryOF »

In this thread
http://www.oooforum.org/forum/viewtopic.phtml?t=9980
user RunningUtes gives a method he used which I quote below. You may find it of use.
I needed a medical dictionary and looked on the forums. Seems that many people are also suffering from the same problem.

This is what I have discovered. The 'custom' dictionaries only allow for 2000 words. This really puts a limit on what words you can add.
These are the steps I followed:


I used a word list of medical terms (45,000 words and phrases)
Separate all the words by removing characters (space, comma, dash, etc.)
Remove any characters that will give you problems (Unicode doesn't work in the dictionaries)
Remove duplicates in this list, and sort alphabetically
Remove any words in this list that are duplicated in the "en_US.dic" file. I used MS Access (since I have been using OpenOffice for 2 days) to do this using the "Find Unmatched Query Wizard". This gave me a final list of around 35,000 words.
Append your custom word list to the end of the "en_US.dic" list. You have to write down the line number of the last word in this list, and type it at the top of the file. For example, before I started, my "en_US.dic" had 62076 words in it. Now I add my word list to this and type the total here (96394).
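
The de-duplicate, append, and recount steps above could also be scripted instead of done in MS Access. This is a rough sketch only, not the method RunningUtes used: the function name `merge_into_dic` and the sample entries are hypothetical, and it assumes the first line of the .dic file holds the entry count and that any affix flags follow a slash after the word.

```python
def merge_into_dic(dic_lines, new_words):
    """Return the lines of a merged .dic file.

    Assumptions: dic_lines[0] is the entry count, each later line is one
    entry, and an entry may carry affix flags after a slash ("apple/S").
    """
    entries = [line for line in dic_lines[1:] if line.strip()]
    # Compare on the bare word so "apple/S" counts as containing "apple".
    known = {entry.split("/", 1)[0] for entry in entries}
    # Drop duplicates and words already in the dictionary, then sort.
    additions = sorted({w for w in new_words if w and w not in known})
    merged = entries + additions
    # The new total goes on the first line of the file.
    return [str(len(merged))] + merged

# Hypothetical miniature dictionary and word list.
dic = ["3", "apple/S", "banana", "cherry"]
print(merge_into_dic(dic, ["banana", "aorta", "stent", "aorta"]))
```

Keep a backup of the original "en_US.dic" before writing the merged lines back to it.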



To simplify this process, I am including a link where you can just download a new "en_US.dic" file with all the words. You will need to save it in the "C:\Program Files\OpenOffice.org 2.0\share\dict\ooo" directory.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
MidtownKR
Posts: 4
Joined: Fri Oct 18, 2013 8:57 am

Re: Alphabetized vocabulary list

Post by MidtownKR »

This does not even come close. This guy wants to take an existing list and append it to a dictionary in Open Office. I need to generate a list from a text, and create a word list from that. I am trying to avoid the extremely cumbersome process of having to manually enter words one by one, and do this by running a simple command and generating the list of each word appearing in a given text. In other words, it would work like the word count operation where you highlight a selection and then select word count. But instead of word count, I am looking for something that itemizes each word somehow - perhaps to a spreadsheet that I could work with from there. Just some kind of instantly generated itemization of each word appearing in a selection.

I know about the custom dictionary, but you have to manually enter each word repeating three steps for each new word - that would take an extremely long time to compile the list. If the program can count words, I imagine there is some way to itemize each word alphabetically. Does this make sense?
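
To make the request concrete, here is a small sketch of the kind of itemization I mean, written in Python rather than as anything built into OpenOffice. The names `word_table` and `to_csv` and the sample sentence are made up for illustration, and it assumes the selection has been saved or pasted as plain text. It counts each word and writes an alphabetized word/count table as CSV, which a spreadsheet can open.

```python
import csv
import io
import re
from collections import Counter

def word_table(text):
    """Return (word, count) pairs sorted alphabetically by word."""
    # Assumption: a word is any run of letters; case is ignored.
    counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    return sorted(counts.items())

def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample; replace with the text of the selection.
print(to_csv(word_table("A rose is a rose is a rose.")))
```

Saving that output to a .csv file would give a list to prune by hand in Calc.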
OpenOffice 2.4 on Ubuntu 13.04
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Alphabetized vocabulary list

Post by RoryOF »

You may have to build the initial list using Scan and OCR and any other editing manipulation that may be necessary, then remove duplicates, perhaps by inserting the list into Calc and removing duplicates there (there are many threads on that). Then re-export the revised list so that you can incorporate it into a custom dictionary; from memory, one can insert a word list into a custom dictionary in one move. Or if you are happy to work in Calc (I wouldn't, but that would be my choice), you can leave out the latter steps.
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
Robert Tucker
Volunteer
Posts: 1250
Joined: Mon Oct 08, 2007 1:34 am
Location: Manchester UK

Re: Alphabetized vocabulary list

Post by Robert Tucker »

LibreOffice 7.x.x on Arch and Fedora.
MidtownKR
Posts: 4
Joined: Fri Oct 18, 2013 8:57 am

Re: Alphabetized vocabulary list

Post by MidtownKR »

Linguist is exactly what I am looking for - Thanks!
OpenOffice 2.4 on Ubuntu 13.04