OO Writer WordCount (excluding non-dictionary words)

Post by JeJe » Sat Sep 12, 2020 4:33 pm

[Evolving work, see below for development beyond this initial post]
_________________________________________________________________________________

OO's built-in word count isn't a dictionary-word count: it counts anything between separators, so a run of punctuation on its own counts as a word. (LibreOffice has fixed this, so this is just for OO.)

There is a function I got from here:

https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=82678&hilit=+count#p382966

I've modified it because some of the "words" it returns consist of punctuation only.

Note: THIS WILL BE VERY SLOW WITH A LARGE OR EVEN A MODEST SIZED DOCUMENT

Better ideas/methods for a word count welcome...

Edit: note that this only counts words in the main text paragraphs (reached through an enumeration), nothing else.
Edit2: slight mods so the main sub can choose whether numbers are counted (acceptnos), and commented out a test line I had left in by mistake.

Code:


'uncomment lines with wdlist in for msgbox with list of words
'dim wdlist
sub CountWordsThisComponentMainText
   LANGUAGE = ooolang
   txt = thiscomponent.text
   en = txt.createenumeration

   counttype = com.sun.star.i18n.WordType.DICTIONARY_WORD
   acceptnos = true
   do until en.hasmoreelements =false
      p = en.nextelement
      on error goto hr
      tot = tot +GetStringWordcount(p.string,counttype,LANGUAGE,   acceptnos)
      hr:
   loop
   msgbox "Dictionary Word Count =" & tot,,Thiscomponent.currentcontroller.frame.title
'msgbox wdlist
end sub


   'modified from https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=82678&hilit=+count#p382966
   'for choice of count type and language, and to exclude punctuation-only words
Function GetStringWordcount(aString,counttype,LANGUAGE,   acceptnos)
'*******************************************
   'Function: Count Words in provided string
   'Author: Andrew Brown

   'Last updated 18 March 2016
   'Last updated by:Daniel Wilson
'*******************************************
   ' the ultimate, using the same breakiterator as the program
   
   
         'from the api 'com.sun.star.i18n.WordType.WORD_COUNT
   'const short ANY_WORD = 0;
   'Any "words" - words in the meaning of same character types, collection of alphanumeric characters, or collection of non-alphanumeric characters.
   'const short ANYWORD_IGNOREWHITESPACES = 1;
   '    Any "words" - words in the meaning of same character types, collection of alphanumeric characters, or collection of non-alphanumeric characters except blanks.
   'const short DICTIONARY_WORD = 2;
   '    "words" - in the meaning of a collection of alphanumeric characters and some punctuations, like dot for abbreviation.
   'const short WORD_COUNT = 3;
   '    The mode for counting words, it will combine punctuations and spaces as word trail.

   '
   Dim mystartpos As Long
   Dim numwords,nw
   Dim nextwd As New com.sun.star.i18n.Boundary
   Dim aLocale As New com.sun.star.lang.Locale
   Dim brk

   aLocale.Language=LANGUAGE ' "en"

   numwords=0
   mystartpos=0
   brk=CreateUNOService("com.sun.star.i18n.BreakIterator")

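   astring = " " & astring 'prepend a space, otherwise the first word isn't counted

   'use the declared mystartpos (the original passed an undeclared, empty startpos)
   nextwd=brk.nextWord(aString,mystartpos,aLocale,counttype)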
   '   com.sun.star.i18n.WordType.WORD_COUNT
   Do While nextwd.startPos<> nextwd.endPos
      wd = mid(aString,nextwd.startpos+1,nextwd.endpos - nextwd.startpos)
      if isValidWord(wd,acceptnos) then
'         wdlist =wdlist & wd & chr(9)
         numwords=numwords+1
      end if
      nw=nextwd.startpos
      nextwd=brk.nextWord(aString,nw,aLocale,counttype)
      '      com.sun.star.i18n.WordType.WORD_COUNT
   Loop
   GetStringWordcount=numwords
'   msgbox numwords & "   " &  st
End Function

   'Useful Macro Information For OpenOffice By Andrew Pitonyak
   '5.7.2. OOo Locale
   'Listing 5.11: Obtain the current OpenOffice.org locale.
Function OOoLang() as string
   'Author : Laurent Godard
   'e-mail : listes.godard@laposte.net
   Dim oSet, oConfigProvider
   Dim oParm(0) As New com.sun.star.beans.PropertyValue
   Dim sProvider$, sAccess$
   sProvider = "com.sun.star.configuration.ConfigurationProvider"
   sAccess = "com.sun.star.configuration.ConfigurationAccess"
   oConfigProvider = createUnoService(sProvider)
   oParm(0).Name = "nodepath"
   oParm(0).Value = "/org.openoffice.Setup/L10N"
   oSet = oConfigProvider.createInstanceWithArguments(sAccess, oParm())
   Dim OOLangue as string
   OOLangue= oSet.getbyname("ooLocale") 'en-US
   OOoLang=lcase(Left(trim(OOLangue),2)) 'en (assign to the function name OOoLang so the value is actually returned)

End Function


function isValidWord(wd,acceptnos) 'eliminate words only containing punctuation
   'and numbers if want those not counted
   punct = ".!?,:;“”‘’()[]{}<>-—–/’…" & chr(34) & chr(39) '34 double quote '39 single quote
   nos= "0123456789"
   isValidWord =false
   for i = 1 to len(wd)
      c=(mid(wd,i,1))
      if instr(1,nos,c)>0 then
         if acceptnos then
            isValidWord = true
            exit for
         end if
      elseif instr(1,punct,c)>0 then
      else
         isValidWord = true
         exit for
      end if
   next
End function
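
For anyone who wants to try the punctuation/number filter outside the office suite, here is a rough Python port of isValidWord. It is only a sketch of the same rule: the names is_valid_word, PUNCT and DIGITS are mine, and it assumes the same punctuation list as the Basic function above.

```python
# Sketch: Python port of the isValidWord rule from the Basic macro.
# PUNCT mirrors the Basic punct string; the names here are illustrative only.
PUNCT = set(".!?,:;“”‘’()[]{}<>-—–/’…" + '"' + "'")
DIGITS = set("0123456789")


def is_valid_word(word, accept_numbers=True):
    """Return True if word has at least one character that makes it countable."""
    for ch in word:
        if ch in DIGITS:
            # A digit only validates the word when numbers are accepted.
            if accept_numbers:
                return True
        elif ch not in PUNCT:
            # Any character that is neither a digit nor punctuation validates it.
            return True
    return False
```

So a token like "..." or "-" is rejected, "M.O.D." is accepted, and "2020" is accepted only when accept_numbers is true.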
Last edited by JeJe on Wed Sep 16, 2020 12:38 am, edited 3 times in total.
Openoffice 4.1.6
Windows 8
JeJe
Volunteer
 
Posts: 1322
Joined: Wed Mar 09, 2016 2:40 pm

Re: OO Writer WordCount (excluding non-dictionary words)

Post by JeJe » Sat Sep 12, 2020 11:12 pm

Here are two more methods. The first splits each paragraph on the space character and uses a function similar to the one above to test whether each piece is a valid word.
Edit: one big problem is that a space isn't the only valid word delimiter, e.g. word+tab+word.
Edit2: fixed that (partially) with a function that replaces whitespace characters 9-13 with a space before splitting. This method also runs very slowly.
Code:

sub CountWordsSplit
   txt = thiscomponent.text
   en = txt.createenumeration
   acceptnos = true
   do until en.hasmoreelements =false
      p = en.nextelement
      on error goto hr
      st =REPLACEOTHERWHITESPACE2(p.string)
      sts = split(st," ")
      for i = 0 to ubound(sts)
         if isValidWord2(sts(i),acceptnos) then
            tot = tot +1
'            wdlist=wdlist & sts(i) & chr(9)
         end if
      next
      hr:
   loop
   msgbox "Split method Word Count =" & tot,,Thiscomponent.currentcontroller.frame.title
'   msgbox wdlist
end sub

function isValidWord2(wd,acceptnos) 'eliminate words only containing punctuation
   'and numbers if want those not counted
   punct = ".!?,:;“”‘’()[]{}<>-—–/’…" & chr(34) & chr(39) & chr(9) & chr(10) & chr(13) '34 double quote '39 single quote
   nos= "0123456789"
   isValidWord2=false
   for i = 1 to len(wd)
      c=(mid(wd,i,1))
      if instr(1,nos,c)>0 then
         if acceptnos then
            isValidWord2 = true
            exit for
         end if
      elseif instr(1,punct,c)>0 then
      else
         isValidWord2 = true
         exit for
      end if
   next
End function


function REPLACEOTHERWHITESPACE2(byval st as string)
   'replace whitespace characters 9-13 (tab, line feed, etc.) with a space
   for i = 1 to len(st)
      a=asc(mid(st,i,1))
      if (a>= 9 and a <= 13) then mid(st,i,1) =" "
   next
   REPLACEOTHERWHITESPACE2 =st
End function
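
The split method can be sketched in Python as well. This is not the macro itself, just the same two steps (turn characters 9-13 into spaces, split on space, keep pieces that contain a real word character); count_words_split and the constants are names I made up for the illustration.

```python
# Sketch: the "replace whitespace 9-13, split on space, filter" method.
# Names are illustrative; PUNCT mirrors the punct string in the Basic code.
PUNCT = set(".!?,:;“”‘’()[]{}<>-—–/’…" + '"' + "'")
DIGITS = set("0123456789")


def count_words_split(text, accept_numbers=True):
    """Count space-separated pieces that are not punctuation-only."""
    # Same idea as REPLACEOTHERWHITESPACE2: tab, LF, VT, FF, CR become spaces.
    normalized = "".join(" " if 9 <= ord(ch) <= 13 else ch for ch in text)
    # Characters that never make a piece count as a word on their own.
    skip = PUNCT if accept_numbers else PUNCT | DIGITS
    count = 0
    for piece in normalized.split(" "):
        # Empty pieces (from double spaces) contain no characters and are skipped.
        if any(ch not in skip for ch in piece):
            count += 1
    return count
```

With this, "word" + tab + "word" gives 2, which is the case the plain split on space got wrong.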



This one uses Regex. Some of the Regex word count expressions I found wouldn't work in OO. This one does but has problems with abbreviations and apostrophe words.

Code:
Sub RegexWordCount 
Dim vDescriptor, vFound
vDescriptor = ThisComponent.createSearchDescriptor()
With vDescriptor
'https://stackoverflow.com/questions/30866275/regular-expression-for-counting-words-in-a-sentence
.SearchString ="\w+(-\w+)*" 'fails with abbreviations - M.O.D. will be 3 and Henry’s will be 2
.SearchRegularExpression = true
End With
' Find all matches
vFound = ThisComponent.findall(vDescriptor)
if isEmpty(vfound) = false then
msgbox  "Regex Word Count =" & vfound.count,,Thiscomponent.currentcontroller.frame.title
end if
End Sub
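
The abbreviation and apostrophe problem is easy to reproduce outside OO. The sketch below runs the same expression through Python's re module; that is an assumption on my part, since OO uses a different regex engine and edge cases may differ, but the two failures mentioned in the code comment show up the same way. regex_word_count is just an illustrative name.

```python
import re

# Same expression as in RegexWordCount, compiled with Python's re module
# (an assumption: OO's own regex engine may behave differently in edge cases).
SIMPLE_WORD = re.compile(r"\w+(-\w+)*")


def regex_word_count(text):
    """Count non-overlapping matches of the simple word expression."""
    return sum(1 for _ in SIMPLE_WORD.finditer(text))


if __name__ == "__main__":
    for sample in ("M.O.D.", "Henry’s", "this-word-and-more"):
        print(sample, regex_word_count(sample))
```

The dots split "M.O.D." into three matches and the apostrophe splits "Henry’s" into two, while hyphenated words stay as one match.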

Re: OO Writer WordCount (excluding non-dictionary words)

Post by JeJe » Sun Sep 13, 2020 2:47 am

Here's a regex count cobbled together from the links in the code comments. Not quite there: it treats this-word as one word, but this-word-and-more as two rather than one. Another thing I've noticed is that if a picture is anchored as a character, an adjacent word may not be counted (or selected, if the expression is run from the Find dialog).

Code:
Sub RegexWordCount2 
Dim vDescriptor, vFound
vDescriptor = ThisComponent.createSearchDescriptor()
With vDescriptor

'https://css-tricks.com/build-word-counter-app/
'\b['?-?(\w+)?]+\b
'https://stackoverflow.com/questions/35076016/regex-to-match-acronyms
'\b(?:[a-zA-Z]\.){2,}
'https://stackoverflow.com/questions/41483674/regex-ignore-a-constant-string-that-matches-a-pattern
'exclude matches

.SearchString ="\b(?!(?:\(|\)))[['’-]?(\w+)?]+\b|\b(?:[a-zA-Z]\.){2,}"
.SearchRegularExpression = true
End With
vFound = ThisComponent.findall(vDescriptor)
if isEmpty(vfound) = false then
msgbox "Regex Word Count =" & vfound.count,,Thiscomponent.currentcontroller.frame.title
end if
End Sub

Re: OO Writer WordCount (excluding non-dictionary words)

Post by JeJe » Mon Sep 14, 2020 5:38 pm

Here's the Regex method applied to a string using XTextSearch.

Code:

sub test
   
   st =  THISCOMPONENT.TEXT.STRING 'assumes smaller than string limit for this simple test
   pattern ="\b(?!(?:_|\(|\)))[['’-]?(\w+)?]+\b|\b(?:[a-zA-Z]\.){2,}"
   msgbox (RegExXTextSearchCount( st,pattern),,"XTextSearch Count")
end sub

'uncomment the commented lines for a msgbox with the list of words.
Function RegExXTextSearchCount(st as string,Expression) as long
   dim ts,so,endpos as long,startpos as long,wcount as long,a

   ts = createUnoService("com.sun.star.util.TextSearch")
   so = createunostruct("com.sun.star.util.SearchOptions")
   so.algorithmtype = com.sun.star.util.SearchAlgorithms.REGEXP
   so.searchstring= Expression
   ts.setoptions(so)
   endpos = len(st)
   STARTPOS=1
   DO
      a=ts.searchForward(st,startpos-1,endpos)
      if a.subRegExpressions >0 then
'         a1 = a.startoffset(0) +1
'         b1 = a.endoffset(0) +1
'         wordlist = wordlist & chr(9) &  mid(st,a1,b1-a1)
         STARTPOS = A.ENDOFFSET(0) +1
         WCOUNT = WCOUNT +1
      else
         EXIT DO
      end if
   LOOP
   RegExXTextSearchCount= WCOUNT
'   msgbox wordlist
end function


Re: OO Writer WordCount (excluding non-dictionary words)

Post by JeJe » Tue Sep 15, 2020 12:57 pm

Simple loop method: goes through the text one character at a time. A run of characters between break characters is counted as a word only if it contains at least one character that is not in the nonwordchars list.

Code:

Sub Main
   msgbox  CountWordsLoop(thiscomponent.text.string,true) 'assume text less than string limit.
End Sub

function CountWordsLoop(st,acceptnos) 'uncomment commented lines for msgbox with list of words
   dim iswd as boolean
   BreakChars = " " & chr(9) & chr(10) &  chr(11) & chr(12) & chr(13)
   nonwordchars = ".!?,:;“”‘’()[]{}<>-—–/’…" & chr(34) & chr(39)
   if acceptnos =false then nonwordchars =nonwordchars & "0123456789"
   ISWD = FALSE
   for i = 1 to len(st)
      ch=mid(st,i,1)
      if instr(1,BreakChars,ch)>0 then
         if iswd = true then
            tot  = tot+1
'            wd = mid(st,lastwdstart+1,i - lastwdstart-1)
'            allwds = allwds & " " & wd
            iswd = false
         end if
         lastwdstart = i
      else
         if iswd = false then
            if instr(1,nonwordchars,ch)=0 then iswd = true
         end if
      END IF
   next
   if iswd = true then
      tot  = tot+1
'      wd = mid(st,lastwdstart+1,i - lastwdstart-1)
'      allwds = allwds & " " & wd
   end if

   CountWordsLoop = tot
'   msgbox allwds
end function
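
Here is the same loop as a Python sketch, in case it helps to see the logic without the Basic syntax. It is a port under my own naming (count_words_loop, BREAK_CHARS, NON_WORD_CHARS), not a replacement for the macro.

```python
# Sketch: Python port of the CountWordsLoop state machine.
# The character sets mirror BreakChars and nonwordchars in the Basic code.
BREAK_CHARS = set(" \t\n\x0b\x0c\r")  # space plus characters 9-13
NON_WORD_CHARS = set(".!?,:;“”‘’()[]{}<>-—–/’…" + '"' + "'")


def count_words_loop(text, accept_numbers=True):
    """Count runs between break characters that contain a word character."""
    non_word = set(NON_WORD_CHARS)
    if not accept_numbers:
        non_word |= set("0123456789")
    total = 0
    in_word = False  # True once the current run has shown a word character
    for ch in text:
        if ch in BREAK_CHARS:
            if in_word:
                total += 1
                in_word = False
        elif not in_word and ch not in non_word:
            in_word = True
    # The text may end without a trailing break character.
    if in_word:
        total += 1
    return total
```

A single pass with one boolean flag is all it needs, which is why this version avoids building any list of pieces.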

Re: OO Writer WordCount (excluding non-dictionary words)

Post by JeJe » Wed Sep 16, 2020 12:34 am

Here's a very simple extension which shows a dialog to run the Regex count (slightly modified from above).

The benefit is you get to see the word boundaries which is a large part of the problem here. There's no universal way of counting words. There's a count provided by the word processor but it may not do what you want or expect - and you're left in the dark as to which words were counted and how.

Install and Run: Library JeWordCountRegex/ Module AAA/ Sub WordCountR

The Regex is not perfect. I don't really know Regex and, as I said, it's cobbled together. But you get to see any problems as the words get selected.

Appears to work in OO and LO.

There is an issue if a find is performed first - such as searching for bold words - OO can't seem to clear that search/selection and keeps going back to it when the regex word count is run.

Usual disclaimer - use at own risk, always backup your work regularly, etc

Attachments: JeWordCountRegex.oxt (3.86 KiB), wordcount.JPG

