Cursor word empty line

Creating a macro - Writing a Script - Using the API (OpenOffice Basic, Python, BeanShell, JavaScript)
BubikolRamios
Posts: 91
Joined: Sat Jan 04, 2014 1:28 pm

Post by BubikolRamios »

The macro below should read the document word by word, but:

it does not show 'de' in a message box
it shows '.' several times where it occurs only once
it does not get past the empty line to reach 'Beispiel:'

Any suggestions?



sample text:
Exkursionsflora der Alpen
und angrenzender Gebiete
© Dr. Thomas Götz, Singen (letzte Änderung: 18.04.2017)
Weitere Infos zum Projekt:
http: //www .tkgoetz .homepage .t-online. de/alpenflorahome.html Fehler und Anmerkungen bitte an: tk.goetz@t-online .de

Beispiel:

Code: Select all

Dim oDoc As Object
Dim Proceed As Boolean
Dim Cursor As Object

oDoc = ThisComponent
Cursor = oDoc.Text.createTextCursor()
Cursor.gotoStart(False) 'start of doc
Cursor.gotoEndOfWord(True) 'select the first word

Do
	MsgBox Cursor.String
	Cursor.gotoNextWord(False)
	Proceed = Cursor.gotoEndOfWord(True) 'False when the cursor is not at the end of a word
Loop While Proceed
OpenOffice 4.1.5 / Windows 7
FJCC
Moderator
Posts: 9248
Joined: Sat Nov 08, 2008 8:08 pm
Location: Colorado, USA

Re: Cursor word empty line

Post by FJCC »

The unusual behavior I see is caused by the spaces in the middle of
www .tkgoetz .homepage .t-online. de
There is a space before .tkgoetz and .homepage and .t-online and de. The search gets confused and returns the . character. There is a similar space in
tk.goetz@t-online .de
before the .de. If these spaces are removed the search works as expected, as far as I can see. Apparently, the code for gotoNextWord() does not know what to do if the next word starts with a period.
OpenOffice 4.1 on Windows 10 and Linux Mint
If your question is answered, please go to your first post, select the Edit button, and add [Solved] to the beginning of the title.
RoryOF
Moderator
Posts: 34586
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Cursor word empty line

Post by RoryOF »

The API says:
gotoNextWord
boolean gotoNextWord( [in] boolean bExpand );

Description
moves the cursor to the next word.

Note: the function returning true does not necessarily mean that the cursor is located at the next word, or any word at all! This may happen for example if it travels over empty paragraphs.
Returns
true if the cursor was moved. It returns false if the cursor can not advance further.

The OP will have to examine the value at the location and discard it if it is a ".", then continue with another gotoNextWord().
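
A minimal sketch of that suggestion in Python (one of the macro languages this forum section covers), assuming a cursor object that offers the word-cursor methods quoted in this thread (gotoStart, gotoEndOfWord, gotoNextWord, getString). The function name collect_words and its skip parameter are made up for illustration, and whether the result is clean on a real document depends on how the office's cursor actually behaves around punctuation.

```python
def collect_words(cursor, skip=(".",)):
    """Walk a word cursor and return the selected strings.

    `cursor` is assumed to provide gotoStart, gotoEndOfWord, gotoNextWord
    and getString, as a Writer text cursor does. Entries listed in `skip`
    (here a lone ".") and empty selections are discarded.
    """
    words = []
    cursor.gotoStart(False)
    cursor.gotoEndOfWord(True)          # select the first word, if any
    while True:
        text = cursor.getString()
        if text and text not in skip:
            words.append(text)
        # Let gotoNextWord decide whether to continue: it returns False
        # only when the cursor cannot advance any further.
        if not cursor.gotoNextWord(False):
            break
        # May return False (e.g. on an empty paragraph); the selection is
        # then empty and gets filtered out on the next pass.
        cursor.gotoEndOfWord(True)
    return words
```

Inside a Python macro this would presumably be called with a cursor created from the Writer document's text, the same way the Basic code above creates one.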
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

OpenOffice counts a lot of punctuation as words... it's terrible and completely unreliable as far as what a word is. You can see this by looking at the word counts it gives. LibreOffice does a much better job.
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
BubikolRamios
Posts: 91
Joined: Sat Jan 04, 2014 1:28 pm

Re: Cursor word empty line

Post by BubikolRamios »

Even skipping '.' does not get past the empty line. With the sample text
a b

c
it will not show 'c'.

Looking at this: https://wiki.openoffice.org/wiki/Docume ... hin_a_Text
Since isEndOfParagraph() returns true at each line end (wrong, I think), I figured I could use gotoNextParagraph when going to the next word fails.


It fails with "sub-procedure or function procedure not defined" (it compiles anyway, as far as I can see):

Code: Select all

oDoc = ThisComponent
'oText = oDoc.getText()
Cursor = oDoc.Text.createTextCursor()
Cursor.gotoStart (false)'start of doc
Cursor.gotoEndOfWord(True)
Proceed = true
do


if Cursor.String = "." then
 'do nothing, skip
else
MsgBox Cursor.String
end if
'if Cursor.isEndOfParagraph () then
'Proceed = Cursor.gotoNextParagraph (false)
'Proceed = Cursor.gotoEndOfWord(True)
'else
Cursor.gotoNextWord (false)
if not Cursor.gotoEndOfWord(True) then
Proceed = gotoNextParagraph(false)
end if

'end if




Loop While Proceed
BubikolRamios
Posts: 91
Joined: Sat Jan 04, 2014 1:28 pm

Re: Cursor word empty line

Post by BubikolRamios »

Managed to get past the empty line with this:

Code: Select all

Cursor.gotoNextWord (false)
if not Cursor.gotoEndOfWord(True) then
  Proceed = Cursor.gotoNextParagraph(true)
  Cursor.gotoEndOfWord(True)
end if

BubikolRamios
Posts: 91
Joined: Sat Jan 04, 2014 1:28 pm

Re: Cursor word empty line

Post by BubikolRamios »

also, sample text:
Zier-

Code: Select all

Cursor.gotoStart (false)'start of doc
Cursor.gotoEndOfWord(True)
MsgBox Cursor.String'--> empty string

If one removes '-' from 'Zier-'
it works ok.


Is this a bug ?
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

In your first code change to:

Code: Select all

do
MsgBox Cursor.String
proceed = Cursor.gotoNextWord (false)
Cursor.gotoEndOfWord(True)
Loop While Proceed
End Sub
UnklDonald418
Volunteer
Posts: 1544
Joined: Wed Jun 24, 2015 12:56 am
Location: Colorado, USA

Re: Cursor word empty line

Post by UnklDonald418 »

In two of his documents Andrew Pitonyak mentions that gotoNextWord() has had bugs going as far back as OO version 1.1, but he didn't elaborate.
A quick check on Bugzilla didn't turn up anything.
For what it's worth, here is what I came up with:

Code: Select all

REM  *****  BASIC  *****

Sub Main
Dim oDoc As Object
Dim Proceed As Boolean
Dim Cursor As Object
Dim words as string

	oDoc = ThisComponent
	Cursor = oDoc.Text.createTextCursor()
	Cursor.gotoStart (false)'start of doc
	Cursor.gotoEndOfWord(True)
	words = ""
do
	str1 = Cursor.String
REM goToNextWord() mis-handles words beginning with "." 
REM  this bit corrects that behavior 
	if str1 = "." then   
	  Cursor.gotoPreviousWord(False)
	  Cursor.goRight(2, False)
	  Cursor.gotoEndOfWord(False)
	  Cursor.gotoStartOfWord(True)
	  str1 = "." & Cursor.String
	endif
REM now continue on building list
	words = words & CHR$(10) & str1
	Proceed =  Cursor.gotoNextWord (False)
REM put a marker in the list when starting a new paragraph 
  	If Cursor.IsStartOfParagraph Then   'isStartOfParagraph
	  	words = words & CHR$(10) & "**** New Paragraph ****"
	End If 
REM check if all the words are on the list	
	If Cursor.gotoEndOfWord(True) = False then
	  Proceed = False
	end if
Loop While Proceed
  MsgBox words, 0,  "Word List"
End Sub
 Edit: Apparently I need new glasses. When I looked at my results this morning I noticed that the code above mishandles a stand-alone “.”.
If your problem has been solved, please edit this topic's initial post and add "[Solved]" to the beginning of the subject line
Apache OpenOffice 4.1.14 & LibreOffice 7.6.2.1 (x86_64) - Windows 10 Professional- Windows 11
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

gotoEndOfWord doesn't move the cursor when there are two paragraph marks. That's the design of the function: the cursor remains in its original position because it is not at the end of a word when there are two paragraph marks or two line feed marks.
gotoEndOfWord
boolean
gotoEndOfWord( [in] boolean bExpand );

Description
moves the cursor to the end of the current word.
Returns
true if the cursor is now at the end of a word, false otherwise. If false was returned the cursor will remain at its original position.
https://www.openoffice.org/api/docs/com ... oEndOfWord

You need to use gotoNextWord and then remove the punctuation etc. yourself:

Code: Select all

do
MsgBox Cursor.String
cursor.collapsetoend
proceed=Cursor.gotoNextWord (true)
Loop While Proceed
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

gotoNextWord gets thrown by periods as well, and I've played around with XBreakIterator and that doesn't seem to work properly either.
Paragraph text seems to be returned correctly, so the alternative is to write your own word-splitting code and apply it to each paragraph.

Code: Select all

Sub EnumerateParagraphs
	REM Author: Andrew Pitonyak
	Dim oParEnum 'Enumerator used to enumerate the paragraphs
	Dim oPar 'The enumerated paragraph
	REM Enumerate the paragraphs.
	REM Tables are enumerated along with paragraphs
	oParEnum = ThisComponent.getText().createEnumeration()
	Do While oParEnum.hasMoreElements()
		oPar = oParEnum.nextElement()
		REM This avoids the tables. Add an else statement if you want to
		REM process the tables.
		If oPar.supportsService("com.sun.star.text.Paragraph") Then
			'MsgBox oPar.getString(), 0, "I found a paragraph"
			getwordsinpara(oPar.getstring)

		ElseIf oPar.supportsService("com.sun.star.text.TextTable") Then
			'Print "I found a TextTable"
		Else
			'Print "What did I find?"
		End If
	Loop
End Sub


Sub getwordsinPara(txt As String) 'author - me; a Sub returns nothing, so no return type
	dim punct as string,i as long,c as string,wd as string,lenwd as long


	punct = " .;:!?(){}[]\/<>,*@" & Chr(34) & Chr(10)

	for i = 1 to len(txt)
		c = Mid(txt, i, 1)
		If InStr(1, punct, c) <> 0 Then
			if wd <>"" then

'				if isnumeric(wd) =false then

					msgbox wd
					'
'				end if
			end if
		
			wd = ""
			lenwd=0
		Else
			lenwd = lenwd+1
			wd =  wd & c
		End If
	next


			if wd <>"" then

'				if isnumeric(wd) =false then

					msgbox wd
					'
'				end if
			end if


end sub

Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Cursor word empty line

Post by Lupp »

Am I missing something completely?
Isn't a word simply whatever starts at a word boundary with a word character? A RegEx search should find that. Deliberately inserted spaces (to fight spam?) can't be eliminated without either natural or (that's a joke) artificial intelligence. And anyway: who ever defined how to count "words" in a technical construct like a URL, which follows its own very specific syntax?
F_R_ForCountingWords.png
If I do the F&R with "Replace All" as shown above, I get the words counted, either for all the searchable text, or for the current selection if the option is enabled. As far as I can judge the results are good. The sample text starting with "Exkursionsflora" gives me 35 words that way, and that's exactly what I would count using NI. There is, of course, the single "t" of "t-online" counted as a word, but that's the fault of a company using a silly name ("fake syntax"?).

A ViewCursor or TextCursor told to "gotoEndOfWord" may not apply the same criteria, since it may be designed around line wrapping or the like. I don't know enough about it. However, the results are obviously different, and the word statistics shown in the document properties are not reliable (neither in AOO nor in LibO).

Editing:
To avoid misunderstandings about what I meant, I attach a bit of code, too:

Code: Select all

Sub test()
Print wordCount()
End Sub

Function wordCount(Optional pSearchable)
If IsMissing(pSearchable) Then pSearchable = ThisComponent
doc0 = pSearchable
tRD = doc0.CreateReplaceDescriptor
tRD.SearchRegularExpression = True
tRD.SearchString = "\b\w+\b"
found = doc0.FindAll(tRD)
wordCount = found.Count
For j = 0 To found.Count - 1
  oneWord = found(j).String
  Print oneWord ' Now every found word is displayed one by one
Next j
End Function
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

Lupp - your function, applied to

Mark’s and Sammy’s.

gives only one word: "and".
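
For comparison, a small Python sketch that applies the same pattern, \b\w+\b, to this sentence. Python's re module is a different engine from the ICU one built into the office suite, so this only illustrates how the typographic apostrophe interacts with \w; it does not reproduce the result reported above. The function name is made up for the example.

```python
import re

WORD = re.compile(r"\b\w+\b")

def regex_words(text):
    """Return every run of word characters found in `text`."""
    return WORD.findall(text)

# The curly apostrophe (U+2019) is not a word character, so each
# possessive is split into two matches.
print(regex_words("Mark’s and Sammy’s."))
# ['Mark', 's', 'and', 'Sammy', 's']
```

Either way, the possessive does not survive as one word unless the pattern treats the apostrophe explicitly.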
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

A slightly improved version of my code above. It adds a few more punctuation symbols and keeps a web address as one word. It still doesn't handle some cases. There are several choices to make when you decide what your words are.

Code: Select all

Sub EnumerateParagraphs
	REM Author: Andrew Pitonyak
	Dim oParEnum 'Enumerator used to enumerate the paragraphs
	Dim oPar 'The enumerated paragraph
	REM Enumerate the paragraphs.
	REM Tables are enumerated along with paragraphs
	oParEnum = ThisComponent.getText().createEnumeration()
	Do While oParEnum.hasMoreElements()
		oPar = oParEnum.nextElement()
		REM This avoids the tables. Add an else statement if you want to
		REM process the tables.
		If oPar.supportsService("com.sun.star.text.Paragraph") Then
			'MsgBox oPar.getString(), 0, "I found a paragraph"
			getwordsinpara(oPar.getstring)

		ElseIf oPar.supportsService("com.sun.star.text.TextTable") Then
			'Print "I found a TextTable"
		Else
			'Print "What did I find?"
		End If
	Loop
End Sub


Sub getwordsinPara(txt As String) 'a Sub returns nothing, so no return type
	dim punct as string,i as long,c as string,wd as string,lenwd as long
	dim lentxt as long,webaddress as boolean, breakpos as long,hyphencount as long

	lentxt = len(txt)
	punct = " ;:!?(){}[]<>,*‘’“”—–…‹›«»•†‡§||#_" & Chr(34) & Chr(10) & chr(9)
	'".'-_@\/" handle separately ?

	for i = 1 to lentxt 'go though each letter
		c = Mid(txt, i, 1)

		select case c
		case "."
			IF WD = "" THEN
				breakpos = i
			ELSE
				if i < lentxt then
					IF InStr(1, punct, Mid(txt, i+1, 1)) <> 0 Then breakpos =i
				else
					breakpos =i
				end if
			END IF
			'apostrophe and single curly quote confusion - NOT handled yet
			'		case "'" 'not in punctuation list so always treating as part of word
			'		case "‘" in punctuation list so always treated as breakchar
			'		case "’" in punctuation list so always treated as breakchar
			'		case "_" 'underline in punctuation list so always treated as breakchar

		case "-" 'hyphen options when to treat as break char
			hyphencount = hyphencount +1

		CASE ":"
			IF wd <> "http" then breakpos = i

		case "/"
			if instr(1,wd,"http:") <>1 and instr(1,wd,"www.")<>1 then
				breakpos = i
			else
				webaddress = true
			end if
		case else
			If InStr(1, punct, c) <> 0 Then breakpos =i
		end select

		if breakpos<>0 then
			gosub handleword
			breakpos =0
			wd = ""
			lenwd=0
			webaddress = false
			hyphencount = 0
		Else
			lenwd = lenwd+1
			wd =  wd & c
		End If
	next

	gosub handleword
	exit sub


handleword:

	if wd<>"" then
		'handle hyphenated words: with 1 hyphen treat as one word, else split unless it is a date
		if webaddress = false then
			if hyphencount >1 and isdate(wd) =false then
				wds = split(wd,"-")
				for j=  0 to ubound(wds)
					'option to exclude numbers
					'					if isnumeric(wds(j)) =false then
					if wds(j) <> "" then msgbox wds(j) 'show the part, not the whole word
					'					end if
				next
				wd = "" 'the parts were shown, so do not show the whole word again
			end if
		end if

		if wd<> "" then
			'option to exclude numbers
			'				if isnumeric(wd) =false then
			msgbox wd
			'				end if
		end if
	end if

	return

end sub
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Cursor word empty line

Post by Lupp »

I would assume there is no way to list and count the words of an arbitrary document as long as there isn't a clear definition of what a word is.
I also don't feel capable of judging how many and which languages could be treated the same way. But I would judge that regular expressions are the appropriate means for an approach, both for defining words and for parsing documents.
Any syntactical analysis that scans text objects character by character, or that uses TextCursor objects, has to do much the same things as a RegEx engine, but with much less expertise behind it (concerning Unicode, for example). In addition, it will lack a reliable specification unless the author re-invents something like RegEx.

Existing RegEx engines are powerful and efficient, and this surely holds for the ICU engine integrated into AOO (and LibO, too). And if a more special text requires more special treatment, it should be much easier to rework the RegEx string in one place (even if it becomes rather complicated) than to rework an already coded program in many places.

You may run the code contained in the attached example as a more detailed demonstration of what I mean.
Attachments
aoo92697SplitInWords_1.odt
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

Lupp - yes, regular expressions are very powerful. Good demo. In it...
‘Mark and Steven’s party’ said Jenny.
gives Steven as a word. That's the apostrophe/single-quote confusion problem, which arises when people use the single quote as a typographically prettier apostrophe.
ex-partner
is counted as two separate words when it's really one word. Other hyphenations will be two words though - it's not simple.

It's perhaps not so much that we can't agree on a word count. There are some choices to make about how we count hyphenated words, word contractions and so on, but having made those decisions, we should agree on how many words there are.

“Sam and Gary,” said Jenny.
is counted by OpenOffice as 6 words. LibreOffice does a much better job of counting words, correctly making it five. OO is particularly bad.
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

Regarding hyphenation word counts - this article's interesting:

https://wordribbon.tips.net/T009228_Ign ... ounts.html

Counting all hyphenations as one word.


That would make this example that I got from here,

https://www.quickanddirtytips.com/educa ... -modifiers

a count of three words. Which people might disagree with.
a what-you-might-have-been-wondering-about topic
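
As an illustration of the "every hyphenation is one word" rule, here is a short Python sketch. The pattern and the function name are assumptions made for this example, not taken from the linked article; it glues runs of word characters together only when a single hyphen sits directly between them.

```python
import re

# One or more word-character runs joined by single hyphens.
HYPHENATED = re.compile(r"\w+(?:-\w+)*")

def words_joining_hyphens(text):
    """Split `text` into words, keeping hyphenated compounds whole."""
    return HYPHENATED.findall(text)

sample = "a what-you-might-have-been-wondering-about topic"
print(len(words_joining_hyphens(sample)))   # 3
```

A trailing hyphen, as in the earlier sample "Zier-", is simply left out of the match under this rule.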
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Cursor word empty line

Post by Lupp »

I actually cannot imagine a solution to the "hyphenation problem" based on pure syntax. Or may I say, it's impossible? And it changes with time. When I was still a pupil I was taught to use "type writer" for what we called "Schreibmaschine" in German. Then it was "type-writer" for some time if I remember correctly, and nowadays everybody seems to use "typewriter". This reasonable convention was reached no sooner than we came to need a trip to the museum to see one of these things.

In German we have the absurd case that a prefix needs to be separated and postponed, though not being a regular word with a distinct
meaning. An example?
Well: A: "Lasst uns anfangen!" B: "Ich habe schon angefangen, aber fange du endlich an!"
(A: "Let's start!" B: "I started already, but you should start ultimately!" or similar)

I'm not a linguist, and I don't even have a term for the "infixed perfect-prefix 'ge'" or whatever it is.
On the other hand I'm 73 now and fortunately I never had actual need of definitely listing and counting "words".

The misuse of a single quote as a "pretty apostrophe" is easily solved in RegEx by "('|’)" or even "('|’|‘)".
But national and regional or group-creating stubborn specifics will surely soon be introduced to present us with new funny nonsense once in a while.
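
A sketch of that idea in Python, assuming the rule that an apostrophe or single quote counts as part of a word only when it has word characters on both sides. The pattern, built around the alternatives named above, and the function name are illustrative choices; Python's re engine may differ in details from the ICU engine used by the office suite.

```python
import re

# A run of word characters, optionally continued by ' or ’ or ‘
# when more word characters follow immediately.
WORD_WITH_APOSTROPHE = re.compile(r"\w+(?:['’‘]\w+)*")

def words_keeping_apostrophes(text):
    """Split `text` into words, keeping inner apostrophes."""
    return WORD_WITH_APOSTROPHE.findall(text)

print(words_keeping_apostrophes("‘Mark and Steven’s party’ said Jenny."))
# ['Mark', 'and', 'Steven’s', 'party', 'said', 'Jenny']
```

Quotation marks at the edge of a word are dropped, because nothing word-like follows or precedes them.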

(Editing:)
Lupp wrote:I actually cannot imagine a solution to the "hyphenation problem" based on pure syntax.
JeJe wrote:Counting all hyphenations as one word.
Yes, I thought of this one, but...
JeJe wrote:Which people might disagree with.
... was anticipated.
I would agree, however. And the example
JeJe wrote:a what-you-might-have-been-wondering-about topic
is exactly the one I would like to apply the agreement to.
JeJe
Volunteer
Posts: 2763
Joined: Wed Mar 09, 2016 2:40 pm

Re: Cursor word empty line

Post by JeJe »

Here's another version of my count code in an attached Writer document, with a dialog showing the word list.
It should be easy to adapt for anyone else's count code.

(Just replace the word "EnumerateParagraphs" in the
"LoadWordsNew" sub in the "WCM" module with the name of your function which starts the adding of words.
And add the words by calling addWord(YourWord) from within your function.)
Attachments
Word Count.odt