Cursor word empty line

Posted: Tue Mar 06, 2018 4:23 pm
by BubikolRamios
The macro below should read the document word by word, but:

it does not show 'de' in a message box
it shows '.' several times where it occurs only once
it does not get past the empty line to reach 'Beispiel:'

Suggestions?



sample text:
Exkursionsflora der Alpen
und angrenzender Gebiete
© Dr. Thomas Götz, Singen (letzte Änderung: 18.04.2017)
Weitere Infos zum Projekt:
http: //www .tkgoetz .homepage .t-online. de/alpenflorahome.html Fehler und Anmerkungen bitte an: tk.goetz@t-online .de

Beispiel:

Code: Select all

Dim oDoc As Object
Dim Proceed As Boolean
Dim Cursor As Object

oDoc = ThisComponent
Cursor = oDoc.Text.createTextCursor()
Cursor.gotoStart(False) 'start of doc
Cursor.gotoEndOfWord(True) 'select the first word

Do
    MsgBox Cursor.String
    Cursor.gotoNextWord(False) 'move to the start of the next word
    Proceed = Cursor.gotoEndOfWord(True) 'select it; False stops the loop
Loop While Proceed

Re: Cursor word empty line

Posted: Tue Mar 06, 2018 4:56 pm
by FJCC
The unusual behavior I see is caused by the spaces in the middle of
www .tkgoetz .homepage .t-online. de
There is a space before .tkgoetz and .homepage and .t-online and de. The search gets confused and returns the . character. There is a similar space in
tk.goetz@t-online .de
before the .de. If these spaces are removed the search works as expected, as far as I can see. Apparently, the code for gotoNextWord() does not know what to do if the next word starts with a period.

Re: Cursor word empty line

Posted: Tue Mar 06, 2018 5:00 pm
by RoryOF
The API says
gotoNextWord
boolean
gotoNextWord( [in] boolean bExpand );

Description
moves the cursor to the next word.

Note: the function returning true does not necessarily mean that the cursor is located at the next word, or any word at all! This may happen for example if it travels over empty paragraphs.
Returns
true if the cursor was moved. It returns false if the cursor can not advance further.
The OP will have to examine the value at the location and discard it if it is a ".", then continue with another gotoNextWord().
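
The suggestion above amounts to filtering whatever the cursor reports. Here is a minimal sketch of that filtering logic, written in Python so it can be run outside the office suite. The list `cursor_strings` is a hypothetical stand-in for the successive values of Cursor.String, not real API output, and the helper names are made up for illustration.

```python
import unicodedata

def is_punctuation_only(token: str) -> bool:
    """True if the token is empty or holds only punctuation/whitespace."""
    return all(
        ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in token
    )

def keep_words(tokens):
    """Drop tokens that are not words, as one would skip them in the macro loop."""
    return [t for t in tokens if not is_punctuation_only(t)]

# Hypothetical sequence of strings a word cursor might report
cursor_strings = ["www", ".", "tkgoetz", ".", "", "de"]
words = keep_words(cursor_strings)
```

In the macro itself, the same test would sit where the loop currently calls MsgBox, with gotoNextWord() called again whenever the test rejects the selection.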

Re: Cursor word empty line

Posted: Tue Mar 06, 2018 9:51 pm
by JeJe
OpenOffice counts a lot of punctuation as words... it's terrible and completely unreliable as far as what a word is. You can see this by looking at the word counts it gives. LibreOffice does a much better job.

Re: Cursor word empty line

Posted: Wed Mar 07, 2018 9:47 am
by BubikolRamios
Even skipping '.' does not get past the empty line. Sample text:
a b

c
will not show 'c'.

Looking at this: https://wiki.openoffice.org/wiki/Docume ... hin_a_Text
Since isEndOfParagraph() returns true at each line end (wrong, I think), I figured I could use gotoNextParagraph when going to the next word fails.

The code fails with: "sub-procedure or function procedure not defined" (it compiles anyway, as far as I can see).

Code: Select all

oDoc = ThisComponent
'oText = oDoc.getText()
Cursor = oDoc.Text.createTextCursor()
Cursor.gotoStart (false)'start of doc
Cursor.gotoEndOfWord(True)
Proceed = true
do
    if Cursor.String = "." then
        'do nothing, skip
    else
        MsgBox Cursor.String
    end if
    'if Cursor.isEndOfParagraph () then
    'Proceed = Cursor.gotoNextParagraph (false)
    'Proceed = Cursor.gotoEndOfWord(True)
    'else
    Cursor.gotoNextWord (false)
    if not Cursor.gotoEndOfWord(True) then
        Proceed = gotoNextParagraph(false)
    end if
    'end if
Loop While Proceed

Re: Cursor word empty line

Posted: Wed Mar 07, 2018 10:15 am
by BubikolRamios
Managed to get past the empty line with this:

Code: Select all

Cursor.gotoNextWord (false)
if not Cursor.gotoEndOfWord(True) then
  Proceed = Cursor.gotoNextParagraph(true)
  Cursor.gotoEndOfWord(True)
end if


Re: Cursor word empty line

Posted: Wed Mar 07, 2018 10:37 pm
by BubikolRamios
Also, with this sample text:
Zier-

Code: Select all

Cursor.gotoStart (false)'start of doc
Cursor.gotoEndOfWord(True)
MsgBox Cursor.String'--> empty string

If one removes the '-' from 'Zier-', it works OK.

Is this a bug?

Re: Cursor word empty line

Posted: Wed Mar 07, 2018 10:57 pm
by JeJe
In your first code, change the loop to:

Code: Select all

do
MsgBox Cursor.String
proceed = Cursor.gotoNextWord (false)
Cursor.gotoEndOfWord(True)
Loop While Proceed
End Sub

Re: Cursor word empty line

Posted: Fri Mar 09, 2018 1:31 am
by UnklDonald418
In two of his documents Andrew Pitonyak mentions that gotoNextWord() has had bugs going as far back as OO version 1.1, but he didn't elaborate.
A quick check on Bugzilla didn't turn up anything.
For what it's worth, here is what I came up with:

Code: Select all

REM  *****  BASIC  *****

Sub Main
	Dim oDoc As Object
	Dim Proceed As Boolean
	Dim Cursor As Object
	Dim words As String
	Dim str1 As String

	oDoc = ThisComponent
	Cursor = oDoc.Text.createTextCursor()
	Cursor.gotoStart(False) 'start of doc
	Cursor.gotoEndOfWord(True)
	words = ""
	Do
		str1 = Cursor.String
		REM gotoNextWord() mishandles words beginning with "."
		REM this bit corrects that behavior
		If str1 = "." Then
			Cursor.gotoPreviousWord(False)
			Cursor.goRight(2, False)
			Cursor.gotoEndOfWord(False)
			Cursor.gotoStartOfWord(True)
			str1 = "." & Cursor.String
		End If
		REM now continue building the list
		words = words & Chr$(10) & str1
		Proceed = Cursor.gotoNextWord(False)
		REM put a marker in the list when starting a new paragraph
		If Cursor.isStartOfParagraph() Then
			words = words & Chr$(10) & "**** New Paragraph ****"
		End If
		REM stop when no further word can be selected
		If Cursor.gotoEndOfWord(True) = False Then
			Proceed = False
		End If
	Loop While Proceed
	MsgBox words, 0, "Word List"
End Sub
Edit: Apparently I need new glasses. When I looked at my results this morning I noticed that the code above mishandles a stand-alone “.”.

Re: Cursor word empty line

Posted: Fri Mar 09, 2018 11:42 pm
by JeJe
gotoEndOfWord doesn't move the cursor when there are two paragraph marks. It's the design of the function. The cursor remains in its original position, as it's not at the end of a word when there are two paragraph marks or two line feed marks.
gotoEndOfWord
boolean
gotoEndOfWord( [in] boolean bExpand );

Description
moves the cursor to the end of the current word.
Returns
true if the cursor is now at the end of a word, false otherwise. If false was returned the cursor will remain at its original position.
https://www.openoffice.org/api/docs/com ... oEndOfWord

You need to use gotoNextWord and remove the punctuation etc.

Code: Select all

do
MsgBox Cursor.String
cursor.collapsetoend
proceed=Cursor.gotoNextWord (true)
Loop While Proceed

Re: Cursor word empty line

Posted: Sat Mar 10, 2018 2:46 am
by JeJe
gotoNextWord gets thrown by periods as well, and I've played around with XBreakIterator and that doesn't seem to work properly either.
Paragraph text seems to be returned okay, so the alternative is to write your own code for each paragraph.

Code: Select all

Sub EnumerateParagraphs
	REM Author: Andrew Pitonyak
	Dim oParEnum 'Enumerator used to enumerate the paragraphs
	Dim oPar 'The enumerated paragraph
	REM Enumerate the paragraphs.
	REM Tables are enumerated along with paragraphs
	oParEnum = ThisComponent.getText().createEnumeration()
	Do While oParEnum.hasMoreElements()
		oPar = oParEnum.nextElement()
		REM This avoids the tables. Add an else statement if you want to
		REM process the tables.
		If oPar.supportsService("com.sun.star.text.Paragraph") Then
			'MsgBox oPar.getString(), 0, "I found a paragraph"
			getwordsinpara(oPar.getstring)

		ElseIf oPar.supportsService("com.sun.star.text.TextTable") Then
			'Print "I found a TextTable"
		Else
			'Print "What did I find?"
		End If
	Loop
End Sub


Sub getwordsinPara(txt As String) 'author - me
	Dim punct As String, i As Long, c As String, wd As String

	' Characters treated as word separators
	punct = " .;:!?(){}[]\/<>,*@" & Chr(34) & Chr(10)

	For i = 1 To Len(txt)
		c = Mid(txt, i, 1)
		If InStr(1, punct, c) <> 0 Then
			' Separator reached: report the word collected so far
			If wd <> "" Then
				' If IsNumeric(wd) = False Then   'option to exclude numbers
				MsgBox wd
				' End If
			End If
			wd = ""
		Else
			wd = wd & c
		End If
	Next

	' Report the last word if the paragraph does not end with a separator
	If wd <> "" Then
		' If IsNumeric(wd) = False Then   'option to exclude numbers
		MsgBox wd
		' End If
	End If
End Sub


Re: Cursor word empty line

Posted: Sat Mar 10, 2018 6:06 pm
by Lupp
What did I miss completely?
Aren't words what is left-delimited by a word boundary and starts with a word character? A RegEx search should find that. Stubbornly included spaces (fighting spam?) can't be eliminated without either natural or (that's a joke:) artificial intelligence. And anyway: who ever defined the counting of "words" for a technical construct like a URL, which follows its own very specific syntax?
[Attachment: F_R_ForCountingWords.png]
If I do the F&R with "Replace All" as shown above, I get the words counted, either for all the searchable text, or for the current selection if the option is enabled. As far as I can judge the results are good. The sample text starting with "Exkursionsflora" gives me 35 words that way, and that's exactly what I would count using NI. There is, of course, the single "t" of "t-online" counted as a word, but that's the fault of a company using a silly name ("fake syntax"?).

A ViewCursor or a TextCursor told to "gotoEndOfWord" may not apply the same criteria, as it may be expected to work wrap-oriented or the like. I don't know enough. However, the results are obviously different, and the statistics done for the document properties are not reliable. (Not in AOO and also not in LibO.)

Editing:
To avoid misunderstandings about what I meant, I attach a bit of code, too:

Code: Select all

Sub test()
Print wordCount()
End Sub

Function wordCount(Optional pSearchable)
If IsMissing(pSearchable) Then pSearchable = ThisComponent
doc0 = pSearchable
tRD = doc0.CreateReplaceDescriptor
tRD.SearchRegularExpression = True
tRD.SearchString = "\b\w+\b"
found = doc0.FindAll(tRD)
wordCount = found.Count
For j = 0 To found.Count - 1
  oneWord = found(j).String
  Print oneWord ' Now every found word is displayed one by one
Next j
End Function

Re: Cursor word empty line

Posted: Sat Mar 10, 2018 9:14 pm
by JeJe
Lupp - your function for

Mark’s and Sammy’s.

gives one word - the "and".

Re: Cursor word empty line

Posted: Sun Mar 11, 2018 2:50 pm
by JeJe
A slightly improved version of my code above. It adds a few more punctuation symbols and keeps a web address as one word. It doesn't handle some things. There are several choices to make when you decide what your words are.

Code: Select all

Sub EnumerateParagraphs
	REM Author: Andrew Pitonyak
	Dim oParEnum 'Enumerator used to enumerate the paragraphs
	Dim oPar 'The enumerated paragraph
	REM Enumerate the paragraphs.
	REM Tables are enumerated along with paragraphs
	oParEnum = ThisComponent.getText().createEnumeration()
	Do While oParEnum.hasMoreElements()
		oPar = oParEnum.nextElement()
		REM This avoids the tables. Add an else statement if you want to
		REM process the tables.
		If oPar.supportsService("com.sun.star.text.Paragraph") Then
			'MsgBox oPar.getString(), 0, "I found a paragraph"
			getwordsinpara(oPar.getstring)

		ElseIf oPar.supportsService("com.sun.star.text.TextTable") Then
			'Print "I found a TextTable"
		Else
			'Print "What did I find?"
		End If
	Loop
End Sub


sub getwordsinPara(txt as string)
	dim punct as string,i as long,c as string,wd as string,lenwd as long
	dim lentxt as long,webaddress as boolean, breakpos as long,hyphencount as long

	lentxt = len(txt)
	punct = " ;:!?(){}[]<>,*‘’“”—–…‹›«»•†‡§||#_" & Chr(34) & Chr(10) & chr(9)
	'".'-_@\/" handle separately ?

	for i = 1 to lentxt 'go through each letter
		c = Mid(txt, i, 1)

		select case c
		case "."
			IF WD = "" THEN
				breakpos = i
			ELSE
				if i < lentxt then
					IF InStr(1, punct, Mid(txt, i+1, 1)) <> 0 Then breakpos =i
				else
					breakpos =i
				end if
			END IF
			'apostrophe and single curly quote confusion - NOT handled yet
			'		case "'" 'not in punctuation list so always treating as part of word
			'		case "‘" in punctuation list so always treated as breakchar
			'		case "’" in punctuation list so always treated as breakchar
			'		case "_" 'underline in punctuation list so always treated as breakchar

		case "-" 'hyphen options when to treat as break char
			hyphencount = hyphencount +1

		CASE ":"
			IF wd <> "http" then breakpos = i

		case "/"
			if instr(1,wd,"http:") <>1 and instr(1,wd,"www.")<>1 then
				breakpos = i
			else
				webaddress = true
			end if
		case else
			If InStr(1, punct, c) <> 0 Then breakpos =i
		end select

		if breakpos<>0 then
			gosub handleword
			breakpos =0
			wd = ""
			lenwd=0
			webaddress = false
			hyphencount = 0
		Else
			lenwd = lenwd+1
			wd =  wd & c
		End If
	next

	gosub handleword
	exit sub


handleword:

	if wd <> "" then
		'handle hyphenated words: with 1 hyphen treat as one word, else split if not a date
		if webaddress = false and hyphencount > 1 and isdate(wd) = false then
			wds = split(wd, "-")
			for j = 0 to ubound(wds)
				'option to exclude numbers
				'if isnumeric(wds(j)) = false then
				if wds(j) <> "" then msgbox wds(j)
				'end if
			next
		else
			'option to exclude numbers
			'if isnumeric(wd) = false then
			msgbox wd
			'end if
		end if
	end if

	return

end sub

Re: Cursor word empty line

Posted: Sun Mar 11, 2018 3:33 pm
by Lupp
I would assume there is no way to list and count words from any document as long as there isn't a clear definition of what a word is.
I also don't feel capable at all of judging how many and which languages might be treated the same way. But I would judge that the appropriate means for an approach is the use of regular expressions, both for the definition of words and for the parsing of documents.
Any syntactical analysis based on scanning text objects character by character, or on using TextCursor objects, has to do things similar to what a RegEx engine does, but based on much less expertise concerning Unicode, for example. In addition, it will lack a reliable specification unless the author re-invents something like RegEx.

Existing RegEx engines are powerful and efficient, and this surely holds for the ICU engine integrated into AOO (and LibO, too). And if a more special text requires a more special treatment, it should be much easier to rework the RegEx string in one place (even if it becomes rather complicated) than to rework an already coded program in many places.

You may run the code contained in the attached example as a demonstration of what I mean in more detail.

Re: Cursor word empty line

Posted: Sun Mar 11, 2018 4:06 pm
by JeJe
Lupp - yes, regular expressions are very powerful. Good demo. In it...
‘Mark and Steven’s party’ said Jenny.
Gives Steven as a word. That's the apostrophe/single quote confusion problem, which arises when people use the single quote as a typographically prettier apostrophe.
ex-partner
Is counted as two separate words when it's really one word. Other hyphenations will be two words though, so it's not simple.

It's perhaps not so much that we can't agree on a word count. There are some choices about how we count hyphenated words, word contractions and so on... but having made those decisions, we should agree on how many words there are.

“Sam and Gary,” said Jenny.
Is counted by OpenOffice as 6 words. LibreOffice does a much better count of words, correctly making it five. OO is particularly bad.

Re: Cursor word empty line

Posted: Sun Mar 11, 2018 4:48 pm
by JeJe
Regarding hyphenation word counts - this article's interesting:

https://wordribbon.tips.net/T009228_Ign ... ounts.html

Counting all hyphenations as one word.


That would make this example that I got from here,

https://www.quickanddirtytips.com/educa ... -modifiers

a count of three words. Which people might disagree with.
a what-you-might-have-been-wondering-about topic
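
To make the "all hyphenations count as one word" rule concrete, here is a small sketch in Python (not Basic, so it can be run outside the office suite). The pattern and the function name `count_words` are assumptions chosen for illustration, and Python's `re` module is not the ICU engine the office suite uses, so results inside a document may differ.

```python
import re

# One run of word characters, optionally extended by "-word" groups,
# so a hyphenated compound is matched as a single word.
HYPHENATED_WORD = re.compile(r"\w+(?:-\w+)*")

def count_words(text: str) -> int:
    """Count words, treating each hyphenated compound as one word."""
    return len(HYPHENATED_WORD.findall(text))

example = "a what-you-might-have-been-wondering-about topic"
total = count_words(example)  # the compound counts once, giving 3
```

A trailing hyphen, as in the earlier 'Zier-' sample, is simply left out of the match under this rule.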

Re: Cursor word empty line

Posted: Sun Mar 11, 2018 5:08 pm
by Lupp
I actually cannot imagine a solution to the "hyphenation problem" based on pure syntax. Or may I say, it's impossible? And it changes with time. When I was still a pupil I was taught to use "type writer" for what we called "Schreibmaschine" in German. Then it was "type-writer" for some time, if I remember correctly, and nowadays everybody seems to use "typewriter". This reasonable convention was reached no sooner than we came to need a trip to the museum to see one of these things.

In German we have the absurd case that a prefix needs to be separated and postponed, though not being a regular word with a distinct meaning. An example?
Well: A: "Lasst uns anfangen!" B: "Ich habe schon angefangen, aber fange du endlich an!"
(A: "Let's start!" B: "I have already started, but you should finally start!" or similar)

I'm not a linguist, and I don't even have a term for the "infixed perfect-prefix 'ge'" or whatever it is.
On the other hand I'm 73 now and fortunately I never had actual need of definitely listing and counting "words".

The misuse of a single quote as a "pretty apostrophe" is easily solved in RegEx by "('|’)" or even "('|’|‘)".
But national and regional or group-creating stubborn specifics will surely soon be introduced to present us with new funny nonsense once in a while.
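
The apostrophe alternation above can be tried out as follows. This is a sketch in Python rather than Basic, using a character class in place of the "('|’)" alternation. The names `APOSTROPHES` and `words` are made up for illustration, and Python's `re` module may not behave exactly like the ICU engine in AOO/LibO.

```python
import re

# Straight apostrophe and right single quote (U+2019), both accepted inside a word
APOSTROPHES = "'\u2019"
WORD = re.compile(rf"\w+(?:[{APOSTROPHES}]\w+)*")

def words(text):
    """Return the words in text, keeping contractions and possessives whole."""
    return WORD.findall(text)

result = words("Mark\u2019s and Sammy\u2019s.")
quoted = words("\u2018Mark and Steven\u2019s party\u2019 said Jenny.")
```

An apostrophe only joins two parts when word characters follow it, so a closing single quote after "party" is not pulled into the word.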

(Editing:)
Lupp wrote:I actually cannot imagine a solution to the "hyphenation problem" based on pure syntax.
JeJe wrote:Counting all hyphenations as one word.
Yes, I thought of this one, but...
JeJe wrote:Which people might disagree with.
... was anticipated.
I would agree, however. And the example
JeJe wrote:a what-you-might-have-been-wondering-about topic
is exactly the one I would like to apply the agreement to.

Re: Cursor word empty line

Posted: Mon Mar 12, 2018 12:48 am
by JeJe
Here's another version of my count code in an attached Writer document, with a dialog showing the word list.
It should be easy to adapt for anyone else's count code.

(Just replace the word "EnumerateParagraphs" in the "LoadWordsNew" sub in the "WCM" module with the name of your function which starts the adding of words.
And add the words by calling addWord(YourWord) from within your function.)