OO Writer WordCount (excluding non-dictionary words)

JeJe
Volunteer
Posts: 2779
Joined: Wed Mar 09, 2016 2:40 pm

Post by JeJe »

[Evolving work, see below for development beyond this initial post]
_________________________________________________________________________________

OO's word count doesn't give a dictionary-word count - but counts anything between separators. (LibreOffice has fixed this, so this is just for OO)

The function below is adapted from this post:

https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=82678&hilit=+count#p382966

I've modified it because the original counts some "words" that consist only of punctuation.

Note: THIS WILL BE VERY SLOW WITH A LARGE OR EVEN A MODEST SIZED DOCUMENT

Better ideas/methods for a word count welcome...

Edit: note, just includes words in the main text paragraphs (through an enumeration), nothing else
Edit2: slight mods to allow choice of count of numbers in main sub and commenting out a test line I left in and shouldn't have.

Code:



'uncomment lines with wdlist in for msgbox with list of words
'dim wdlist
sub CountWordsThisComponentMainText
	LANGUAGE = ooolang
	txt = thiscomponent.text
	en = txt.createenumeration

	counttype = com.sun.star.i18n.WordType.DICTIONARY_WORD
	acceptnos = true
	do until en.hasmoreelements =false
		p = en.nextelement
		on error goto hr
		tot = tot +GetStringWordcount(p.string,counttype,LANGUAGE,	acceptnos)
		hr:
	loop
	msgbox "Dictionary Word Count =" & tot,,Thiscomponent.currentcontroller.frame.title
'msgbox wdlist
end sub


	'modified from https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=82678&hilit=+count#p382966
	'for choice of count type and language, and to exclude punctuation-only words
Function GetStringWordcount(aString,counttype,LANGUAGE,	acceptnos)
'*******************************************
	'Function: Count Words in provided string
	'Author: Andrew Brown

	'Last updated 18 March 2016
	'Last updated by:Daniel Wilson
'*******************************************
	' the ultimate, using the same breakiterator as the program
	
	
			'from the api 'com.sun.star.i18n.WordType.WORD_COUNT
	'const short ANY_WORD = 0;
	'Any "words" - words in the meaning of same character types, collection of alphanumeric characters, or collection of non-alphanumeric characters.
	'const short ANYWORD_IGNOREWHITESPACES = 1;
	'    Any "words" - words in the meaning of same character types, collection of alphanumeric characters, or collection of non-alphanumeric characters except blanks.
	'const short DICTIONARY_WORD = 2;
	'    "words" - in the meaning of a collection of alphanumeric characters and some punctuations, like dot for abbreviation.
	'const short WORD_COUNT = 3;
	'    The mode for counting words, it will combine punctuations and spaces as word trail.

	'
	Dim mystartpos As Long
	Dim numwords,nw
	Dim nextwd As New com.sun.star.i18n.Boundary
	Dim aLocale As New com.sun.star.lang.Locale
	Dim brk

	aLocale.Language=LANGUAGE ' "en"

	numwords=0
	mystartpos=0
	brk=CreateUNOService("com.sun.star.i18n.BreakIterator")

	astring = " " & astring 'doesn't count first word

	nextwd=brk.nextWord(aString,mystartpos,aLocale,counttype) 'start from the declared mystartpos (0)
	'   com.sun.star.i18n.WordType.WORD_COUNT
	Do While nextwd.startPos<> nextwd.endPos
		wd = mid(aString,nextwd.startpos+1,nextwd.endpos - nextwd.startpos)
		if isValidWord(wd,acceptnos) then
'			wdlist =wdlist & wd & chr(9)
			numwords=numwords+1
		end if
		nw=nextwd.startpos
		nextwd=brk.nextWord(aString,nw,aLocale,counttype)
		'      com.sun.star.i18n.WordType.WORD_COUNT
	Loop
	GetStringWordcount=numwords
'	msgbox numwords & "   " &  st
End Function

	'Useful Macro Information For OpenOffice By Andrew Pitonyak
	'5.7.2. OOo Locale
	'Listing 5.11: Obtain the current OpenOffice.org locale.
Function OOoLang() as string
	'Author : Laurent Godard
	'e-mail : listes.godard@laposte.net
	Dim oSet, oConfigProvider
	Dim oParm(0) As New com.sun.star.beans.PropertyValue
	Dim sProvider$, sAccess$
	sProvider = "com.sun.star.configuration.ConfigurationProvider"
	sAccess = "com.sun.star.configuration.ConfigurationAccess"
	oConfigProvider = createUnoService(sProvider)
	oParm(0).Name = "nodepath"
	oParm(0).Value = "/org.openoffice.Setup/L10N"
	oSet = oConfigProvider.createInstanceWithArguments(sAccess, oParm())
	Dim OOLangue as string
	OOLangue= oSet.getbyname("ooLocale") 'en-US
	OOoLang=lcase(Left(trim(OOLangue),2)) 'en (assigned to the function's own name so the value is actually returned)

End Function


function isValidWord(wd,acceptnos) 'eliminate words only containing punctuation
	'and numbers if want those not counted
	punct = ".!?,:;“”‘’()[]{}<>-—–/’…" & chr(34) & chr(39) '34 double quote '39 single quote
	nos= "0123456789"
	isValidWord =false
	for i = 1 to len(wd)
		c=(mid(wd,i,1))
		if instr(1,nos,c)>0 then
			if acceptnos then
				isValidWord = true
				exit for
			end if
		elseif instr(1,punct,c)>0 then
		else
			isValidWord = true
			exit for
		end if
	next
End function
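
On the note that only main-text paragraphs are counted: the enumeration also hands back elements that are not paragraphs (text tables, for example), which is presumably why the loop above needs the On Error jump. A minimal sketch of an alternative check, assuming the usual supportsService test is good enough here. The sub name ListMainTextElements is made up for illustration.

```basic
Sub ListMainTextElements
	' Hypothetical helper, not part of the macro above.
	Dim oEnum, oElem
	Dim nPars As Long, nOther As Long
	oEnum = ThisComponent.Text.createEnumeration()
	Do While oEnum.hasMoreElements()
		oElem = oEnum.nextElement()
		If oElem.supportsService("com.sun.star.text.Paragraph") Then
			nPars = nPars + 1      ' has a String property that can be word-counted
		Else
			nOther = nOther + 1    ' e.g. com.sun.star.text.TextTable
		End If
	Loop
	MsgBox "Paragraphs: " & nPars & Chr(10) & "Other elements: " & nOther
End Sub
```

Text inside tables, frames, headers and footers is still not visited by this enumeration, so those words stay uncounted either way.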
Last edited by JeJe on Wed Sep 16, 2020 12:38 am, edited 3 times in total.
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)

Post by JeJe »

Here are two more methods. The following does a split on the space character and uses a function similar to the one above to test if the word is valid.
Edit: one big problem is that a space isn't the only valid word delimiter, e.g. word+tab+word
Edit2: fixed that (partially) with a function that replaces whitespace characters 9-13 with a space before splitting on space. This method also runs very, very slowly.

Code:


sub CountWordsSplit
	txt = thiscomponent.text
	en = txt.createenumeration
	acceptnos = true
	do until en.hasmoreelements =false
		p = en.nextelement
		on error goto hr
		st =REPLACEOTHERWHITESPACE2(p.string)
		sts = split(st," ")
		for i = 0 to ubound(sts)
			if isValidWord2(sts(i),acceptnos) then
				tot = tot +1
'				wdlist=wdlist & sts(i) & chr(9)
			end if
		next
		hr:
	loop
	msgbox "Split method Word Count =" & tot,,Thiscomponent.currentcontroller.frame.title
'	msgbox wdlist
end sub

function isValidWord2(wd,acceptnos) 'eliminate words only containing punctuation
	'and numbers if want those not counted
	punct = ".!?,:;“”‘’()[]{}<>-—–/’…" & chr(34) & chr(39) & chr(9) & chr(10) & chr(13) '34 double quote '39 single quote
	nos= "0123456789"
	isValidWord2=false
	for i = 1 to len(wd)
		c=(mid(wd,i,1))
		if instr(1,nos,c)>0 then
			if acceptnos then
				isValidWord2 = true
				exit for
			end if
		elseif instr(1,punct,c)>0 then
		else
			isValidWord2 = true
			exit for
		end if
	next
End function


function REPLACEOTHERWHITESPACE2(byval st as string)
	for i = 1 to len(st)
		a=asc(mid(st,i,1))
		if (a>= 9 and a <= 13) then mid(st,i,1) =" "
	next
	REPLACEOTHERWHITESPACE2 =st
End function
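
To see what the whitespace replacement and the split actually produce, here is a small self-contained sketch. It does not call the functions above. SplitDemo and its variable names are made up for illustration, and the sample string is arbitrary.

```basic
Sub SplitDemo
	' Hypothetical demo: a tab between the first two words, two spaces before the third.
	Dim s As String, t As String, c As String
	Dim parts, i As Long, nNonEmpty As Long
	s = "one" & Chr(9) & "two  three"
	' Map character codes 9 to 13 to a plain space, as the macro above does.
	For i = 1 To Len(s)
		c = Mid(s, i, 1)
		If Asc(c) >= 9 And Asc(c) <= 13 Then c = " "
		t = t & c
	Next
	parts = Split(t, " ")
	' Consecutive spaces yield empty pieces, so those must not be counted.
	For i = 0 To UBound(parts)
		If Len(parts(i)) > 0 Then nNonEmpty = nNonEmpty + 1
	Next
	MsgBox "Pieces: " & (UBound(parts) + 1) & Chr(10) & "Non-empty: " & nNonEmpty
End Sub
```

The empty pieces are harmless in the macro above because isValidWord2 never enters its loop for a zero-length string and so returns false.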

This one uses Regex. Some of the Regex word count expressions I found wouldn't work in OO. This one does but has problems with abbreviations and apostrophe words.

Code:

Sub RegexWordCount
	Dim vDescriptor, vFound
	vDescriptor = ThisComponent.createSearchDescriptor()
	With vDescriptor
		'https://stackoverflow.com/questions/30866275/regular-expression-for-counting-words-in-a-sentence
		.SearchString ="\w+(-\w+)*" 'fails with abbreviations - M.O.D. will be 3 and Henry’s will be 2
		.SearchRegularExpression = true
	End With
	' Find all matches; the number of matches is the word count
	vFound = ThisComponent.findall(vDescriptor)
	if isEmpty(vfound) = false then
		msgbox  "Regex Word Count =" & vfound.count,,Thiscomponent.currentcontroller.frame.title
	end if
End Sub
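
The abbreviation and apostrophe cases can be checked on plain strings, without a document. Below is a sketch using the com.sun.star.util.TextSearch service. CountMatches and RegexLimitDemo are names made up for illustration, and I'm assuming the string search uses the same regex engine as the document search. Going by the note in the code comment, the two results should be 3 and 2.

```basic
Sub RegexLimitDemo
	Dim sPattern As String
	sPattern = "\w+(-\w+)*"
	MsgBox "M.O.D. -> " & CountMatches("M.O.D.", sPattern) & Chr(10) & _
		"Henry’s -> " & CountMatches("Henry’s", sPattern)
End Sub

' Hypothetical helper: counts non-overlapping, non-empty regex matches in a string.
Function CountMatches(sText As String, sPattern As String) As Long
	Dim oSearch, oOpt, oRes
	Dim nPos As Long, n As Long
	oSearch = CreateUnoService("com.sun.star.util.TextSearch")
	oOpt = CreateUnoStruct("com.sun.star.util.SearchOptions")
	oOpt.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
	oOpt.searchString = sPattern
	oSearch.setOptions(oOpt)
	nPos = 0
	Do While nPos < Len(sText)
		oRes = oSearch.searchForward(sText, nPos, Len(sText))
		If oRes.subRegExpressions = 0 Then Exit Do   ' no further match
		If oRes.endOffset(0) > oRes.startOffset(0) Then
			n = n + 1
			nPos = oRes.endOffset(0)                 ' continue after this match
		Else
			nPos = oRes.endOffset(0) + 1             ' step past an empty match
		End If
	Loop
	CountMatches = n
End Function
```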

Post by JeJe »

Here's a regex count cobbled together from the links in the code comments. It's not quite there: it treats this-word as one word, but this-word-and-more as two rather than one. Another thing I've noticed is that if a picture is anchored as a character, an adjacent word may not be counted (or selected, if the expression is run from the Find dialog).

Code:

Sub RegexWordCount2
	Dim vDescriptor, vFound
	vDescriptor = ThisComponent.createSearchDescriptor()
	With vDescriptor

		'https://css-tricks.com/build-word-counter-app/
		'\b['?-?(\w+)?]+\b
		'https://stackoverflow.com/questions/35076016/regex-to-match-acronyms
		'\b(?:[a-zA-Z]\.){2,}
		'https://stackoverflow.com/questions/41483674/regex-ignore-a-constant-string-that-matches-a-pattern
		'exclude matches

		.SearchString ="\b(?!(?:\(|\)))[['’-]?(\w+)?]+\b|\b(?:[a-zA-Z]\.){2,}"
		.SearchRegularExpression = true
	End With
	vFound = ThisComponent.findall(vDescriptor)
	if isEmpty(vfound) = false then
		msgbox "Regex Word Count =" & vfound.count,,Thiscomponent.currentcontroller.frame.title
	end if
End Sub

Post by JeJe »

Here's the Regex method applied to a string using XTextSearch

Code:


sub test
	
	st =  THISCOMPONENT.TEXT.STRING 'assumes smaller than string limit for this simple test
	pattern ="\b(?!(?:_|\(|\)))[['’-]?(\w+)?]+\b|\b(?:[a-zA-Z]\.){2,}"
	msgbox (RegExXTextSearchCount( st,pattern),,"XTextSearch Count")
end sub

'uncomment the commented lines for a msgbox with the list of words.
Function RegExXTextSearchCount(st as string,Expression) as long 
	dim ts,so,endpos as long,startpos as long,wcount as long,a

	ts = createUnoService("com.sun.star.util.TextSearch")
	so = createunostruct("com.sun.star.util.SearchOptions")
	so.algorithmtype = com.sun.star.util.SearchAlgorithms.REGEXP
	so.searchstring= Expression
	ts.setoptions(so)
	endpos = len(st)
	STARTPOS=1
	DO
		a=ts.searchForward(st,startpos-1,endpos)
		if a.subRegExpressions >0 then 
'			a1 = a.startoffset(0) +1
'			b1 = a.endoffset(0) +1
'			wordlist = wordlist & chr(9) &  mid(st,a1,b1-a1)
			STARTPOS = A.ENDOFFSET(0) +1
			WCOUNT = WCOUNT +1
		else
			EXIT DO
		end if
	LOOP
	RegExXTextSearchCount= WCOUNT
'	msgbox wordlist
end function
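
The test sub above reads the whole document into one string, so it depends on the text staying under the string limit. One possible way around that is to count paragraph by paragraph and add up the results. This is only a sketch and is untested on large documents. It assumes the RegExXTextSearchCount function above is in the same module. The sub name RegexCountByParagraph is made up, and the simple pattern is used only to keep the example short.

```basic
Sub RegexCountByParagraph
	Dim oEnum, oElem
	Dim nTotal As Long, sPattern As String, sPar As String
	sPattern = "\w+(-\w+)*"   ' any of the patterns in this thread could be used instead
	oEnum = ThisComponent.Text.createEnumeration()
	Do While oEnum.hasMoreElements()
		oElem = oEnum.nextElement()
		' Skip tables and other non-paragraph elements.
		If oElem.supportsService("com.sun.star.text.Paragraph") Then
			sPar = oElem.getString()
			If Len(sPar) > 0 Then
				nTotal = nTotal + RegExXTextSearchCount(sPar, sPattern)
			End If
		End If
	Loop
	MsgBox "Regex Word Count (per paragraph) =" & nTotal
End Sub
```

As with the first macro, this only visits the main text paragraphs.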


Post by JeJe »

Simple loop method: it goes through the text character by character. A run of characters between break characters is counted as a word only if it contains at least one character that is not in the nonwordchars list.

Code:


Sub Main
	msgbox  CountWordsLoop(thiscomponent.text.string,true) 'assume text less than string limit.
End Sub

function CountWordsLoop(st,acceptnos) 'uncomment commented lines for msgbox with list of words
	dim iswd as boolean
	BreakChars = " " & chr(9) & chr(10) &  chr(11) & chr(12) & chr(13)
	nonwordchars = ".!?,:;“”‘’()[]{}<>-—–/’…" & chr(34) & chr(39) 
	if acceptnos =false then nonwordchars =nonwordchars & "0123456789"
	ISWD = FALSE
	for i = 1 to len(st)
		ch=mid(st,i,1)
		if instr(1,BreakChars,ch)>0 then
			if iswd = true then
				tot  = tot+1
'				wd = mid(st,lastwdstart+1,i - lastwdstart-1)
'				allwds = allwds & " " & wd
				iswd = false
			end if
			lastwdstart = i
		else
			if iswd = false then
				if instr(1,nonwordchars,ch)=0 then iswd = true
			end if
		END IF
	next
	if iswd = true then
		tot  = tot+1
'		wd = mid(st,lastwdstart+1,i - lastwdstart-1)
'		allwds = allwds & " " & wd
	end if

	CountWordsLoop = tot
'	msgbox allwds
end function
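
A quick way to check the behaviour on a known string, assuming CountWordsLoop above is in the same module (CountWordsLoopDemo is a made-up name). Tracing the loop by hand, the sample below should give 3 with numbers accepted and 2 without: the lone "-" is punctuation only, and "42" only counts when acceptnos is true.

```basic
Sub CountWordsLoopDemo
	Dim s As String
	s = "Hello - world... 42"
	' "Hello" and "world..." each contain at least one word character.
	MsgBox "With numbers: " & CountWordsLoop(s, True) & Chr(10) & _
		"Without numbers: " & CountWordsLoop(s, False)
End Sub
```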

Post by JeJe »

Here's a very simple extension which shows a dialog to run the Regex count (slightly modified from above).

The benefit is you get to see the word boundaries which is a large part of the problem here. There's no universal way of counting words. There's a count provided by the word processor but it may not do what you want or expect - and you're left in the dark as to which words were counted and how.

Install and Run: Library JeWordCountRegex/ Module AAA/ Sub WordCountR

The Regex is not perfect - I don't really know Regex - like I say, it's cobbled together. But... you get to see any problems as the words get selected...

Appears to work in OO and LO.

There is an issue if a find is performed first - such as searching for bold words - OO can't seem to clear that search/selection and keeps going back to it when the regex word count is run.

Usual disclaimer - use at your own risk, always back up your work regularly, etc.
Attachments: JeWordCountRegex.oxt (3.86 KiB), wordcount.JPG