Not able to read special characters by XWordCursor.getString

sumit7325
Posts: 6
Joined: Tue Apr 07, 2020 3:28 pm

Not able to read special characters by XWordCursor.getString

Post by sumit7325 »

Hi All,
I can successfully read a Docx file word by word using the code snippet below. The problem is that all special characters such as $, { and } are dropped from the strings returned by XWordCursor.getString.

Code: Select all

// Load the document (xCompLoader, sUrl and propertyValues are set up elsewhere).
XComponent xComp = xCompLoader.loadComponentFromURL(
        sUrl, "_blank", 0, propertyValues);

com.sun.star.text.XTextDocument xTextDocument =
        (com.sun.star.text.XTextDocument) UnoRuntime.queryInterface(
                com.sun.star.text.XTextDocument.class, xComp);

XText xText = xTextDocument.getText();

XSimpleText xSimpleText = UnoRuntime.queryInterface(
        XSimpleText.class, xText);
XTextCursor xTextCursor = xSimpleText.createTextCursor();

// Extend the cursor's selection to the end of the text.
xTextCursor.gotoEnd(true);

XTextRange xTextRange = UnoRuntime.queryInterface(
        XTextRange.class, xTextCursor);
String sString = xTextRange.getString();   // text covered by the cursor; not used below

// Create a word cursor at the start of that range.
XTextCursor textCursor = xTextRange.getText().createTextCursorByRange(xTextRange.getStart());
XWordCursor wordCursor = (XWordCursor)
        UnoRuntime.queryInterface(XWordCursor.class, textCursor);

wordCursor.gotoStart(false);     // go to start of text

int wordCount = 0;
String currWord;
do {
    wordCursor.gotoEndOfWord(true);   // select up to the end of the current word
    currWord = wordCursor.getString();
    if (currWord.length() > 0) {
        // System.out.println("<" + currWord + ">");
        wordCount++;
        System.out.println(currWord);
    }
} while (wordCursor.gotoNextWord(false));
The output of the code for the attached document is

Code: Select all

TestingFirstWord
Hello
test
name
employeeno
The expected output is

Code: Select all

TestingFirstWord
Hello
test
${name} 
${employeeno}
I also found a similar thread, http://openoffice.2283327.n4.nabble.com ... 11431.html, but could not get much information out of it.

Any help or clue is appreciated, thanks
Attachments
template.docx
(12.76 KiB) Downloaded 328 times
NeoOffice 2.2.3 with MacOS 10.4
User avatar
Zizi64
Volunteer
Posts: 11359
Joined: Wed May 26, 2010 7:55 am
Location: Budapest, Hungary

Re: Not able to read special characters by XWordCursor.getSt

Post by Zizi64 »

Maybe it depends on the definition of the "word" language unit: the parentheses (even the special parentheses) are not part of a human-language "word".
Therefore the expected output is more than "simple words", whereas the actual output consists of "real words".
Last edited by Zizi64 on Fri Apr 10, 2020 3:47 pm, edited 1 time in total.
Tibor Kovacs, Hungary; LO7.5.8 /Win7-10 x64Prof.
PortableApps/winPenPack: LO3.3.0-7.6.2;AOO4.1.14
Please, edit the initial post in the topic: add the word [Solved] at the beginning of the subject line - if your problem has been solved.
JeJe
Volunteer
Posts: 2779
Joined: Wed Mar 09, 2016 2:40 pm

Re: Not able to read special characters by XWordCursor.getSt

Post by JeJe »

Write your own function to do it how you want.

In Basic a simple split almost gives you the result you want (the exception is the double space)

sts = split("TestingFirstWord Hello test ${name} ${employeeno}"," ")
for i =0 to ubound(sts)
msgbox sts(i)
next
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
sumit7325
Posts: 6
Joined: Tue Apr 07, 2020 3:28 pm

Re: Not able to read special characters by XWordCursor.getSt

Post by sumit7325 »

Thank you for the quick reply,

Code: Select all

sts = split("TestingFirstWord Hello test ${name} ${employeeno}"," ")
for i =0 to ubound(sts)
msgbox sts(i)
next
In my case the separator can be a single space or a tab, which is why I am trying to extract the text word by word irrespective of the kind of whitespace.
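
For the space-or-tab case, a Java counterpart of the Basic split could look like the sketch below. It assumes the whole document text is already available as one string (for example from getString() on a range covering the document); the class name WhitespaceSplit and the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplit {

    /** Splits text on any run of whitespace (spaces, tabs, line breaks). */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        // "\\s+" treats one or more whitespace characters as a single separator.
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {   // an all-blank input yields one empty token
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a double space and a tab used as separators.
        String sample = "TestingFirstWord Hello  test\t${name} ${employeeno}";
        words(sample).forEach(System.out::println);
    }
}
```

Because the separator is a regular expression rather than a literal space, the double-space exception mentioned for the Basic split does not arise here.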
JeJe
Volunteer
Posts: 2779
Joined: Wed Mar 09, 2016 2:40 pm

Re: Not able to read special characters by XWordCursor.getSt

Post by JeJe »

There are some options for XBreakIterator which you could look at, but if the native function doesn't do what you want, writing the code yourself is a trivial task: go through the string character by character and, depending on what each character is, either start a new word or continue the current one.

https://www.openoffice.org/api/docs/com ... dType.html
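
A minimal Java sketch of that character-by-character approach follows. It is only an illustration: the class name CharScanTokenizer is invented, the input is assumed to be the document text as a plain string, and the rule "whitespace ends a word, everything else belongs to it" is one possible choice.

```java
import java.util.ArrayList;
import java.util.List;

final class CharScanTokenizer {

    /** Walks the text once and cuts it into tokens at whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // A separator closes the token in progress, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Any other character, including $, { and }, extends the token.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());   // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical sample text with mixed separators.
        tokenize("Hello  test\t${name}\n${employeeno}").forEach(System.out::println);
    }
}
```

The test inside the loop is the only place that defines what a "word" is, so it can be changed (for instance to also break at punctuation) without touching the rest.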
User avatar
Lupp
Volunteer
Posts: 3549
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Not able to read special characters by XWordCursor.getSt

Post by Lupp »

I don't know what you intend to do with your words, or whether the formatting matters. Either way, I feel sure that a WordCursor is the wrong tool in this case.
If you can bear stepping down to "stupid Basic" you can try what I would suggest (and probably ennoble it by moving to a different language / IDE). From my point of view the contained (primitive) Basic with its (powerful) bridge to the API is the means of choice for tasks of this simple kind.
Open your file (Actually .docx? Why?) with AOO or LibO. (I don't know about NeoOffice.)
Use Tools>Macros>Organize Macros>...Basic ... to create a Basic module (located in the document or elsewhere).
Insert the following code there.

Code: Select all

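Sub getExtendedWords()
REM Find every run of non-whitespace characters in the current document.
doc0    = ThisComponent
sd      = doc0.createSearchDescriptor
sd.SearchRegularExpression = True
sd.SearchString ="\S+"
myWords = doc0.FindAll(sd)
u       = myWords.Count - 1
REM Nothing found: leave before creating an output range of invalid size.
If u < 0 Then Exit Sub
REM Write the matches into the first column of a new spreadsheet, one per row.
doc1    = StarDesktop.loadComponentFromUrl("private:factory/scalc", "_blank",0, Array())
s1c1    = doc1.Sheets(0).Columns(0) 
outRg   = s1c1.getCellRangeByPosition(0,0,0,u)
outRgDA = outRg.getDataArray()
For j = 0 To u
  outRgDA(j)(0) = myWords(j).String
Next j
outRg.setDataArray(outRgDA)
End Sub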
...and run it.

The attached file contains the code and a demo. If you (after checking) give permission to execute the code, it's a matter of milliseconds.
Of course you can output the results in a different way. I did it this way because it is simple and gives a good overview.
Attachments
template_AOO.odt
(12.55 KiB) Downloaded 333 times
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
sumit7325
Posts: 6
Joined: Tue Apr 07, 2020 3:28 pm

Re: Not able to read special characters by XWordCursor.getSt

Post by sumit7325 »

Thank you, Lupp, for the response and the detailed explanation. My intention is to validate each and every word in a Docx template against a certain regular expression. For example, a user can upload a Docx template with placeholders like ${name}, ${employeeNo}, etc., and I have to validate it in two steps. First, I have to check whether the placeholder syntax is correct: something like $${name} or ${employee}}, or a missing bracket, counts as a wrong expression. Second, if the syntax is correct, I have to check the placeholder against an application-specific placeholder list. For example, if the user has added ${test} to the template, it is a valid expression, but I have to ignore it because it is not application-specific. I have to perform these operations in a Java-based web application, which is why I need to extract each and every word and stick to Java only :knock:
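
To make the two steps concrete, here is a rough Java sketch of how a single extracted token might be classified. Everything in it is an assumption for illustration: the class name PlaceholderCheck, the rule that a name consists of letters, digits and underscores, and the allowed-name list used in main.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderCheck {

    enum Result { NOT_A_PLACEHOLDER, MALFORMED, UNKNOWN, ACCEPTED }

    // Assumed syntax: "${", one or more word characters, "}" and nothing else.
    private static final Pattern WELL_FORMED = Pattern.compile("\\$\\{(\\w+)\\}");

    static Result check(String token, Set<String> allowedNames) {
        // Tokens without any placeholder character are ordinary words.
        if (token.indexOf('$') < 0 && token.indexOf('{') < 0 && token.indexOf('}') < 0) {
            return Result.NOT_A_PLACEHOLDER;
        }
        // Step 1: syntax. matches() requires the whole token to fit the pattern.
        Matcher m = WELL_FORMED.matcher(token);
        if (!m.matches()) {
            return Result.MALFORMED;
        }
        // Step 2: compare the name against the application-specific list.
        return allowedNames.contains(m.group(1)) ? Result.ACCEPTED : Result.UNKNOWN;
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("name", "employeeNo");   // hypothetical list
        for (String token : new String[] {"Hello", "${name}", "$${name}", "${employee}}", "${test}"}) {
            System.out.println(token + " -> " + check(token, allowed));
        }
    }
}
```

A token with trailing punctuation, such as "${name},", would be reported as malformed by this sketch, so the tokens may need trimming first depending on how the templates are written.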
User avatar
robleyd
Moderator
Posts: 5082
Joined: Mon Aug 19, 2013 3:47 am
Location: Murbko, Australia

Re: Not able to read special characters by XWordCursor.getSt

Post by robleyd »

Are you aware of https://poi.apache.org/ ? Just in case it might be helpful for you.
Cheers
David
OS - Slackware 15 64 bit
Apache OpenOffice 4.1.15
LibreOffice 24.2.2.2; SlackBuild for 24.2.2 by Eric Hameleers
sumit7325
Posts: 6
Joined: Tue Apr 07, 2020 3:28 pm

Re: Not able to read special characters by XWordCursor.getSt

Post by sumit7325 »

Thanks, robleyd, for the suggestion, but I have tried almost all the libraries out there, including Apache POI, DOCX4j, Documents4j, and some Python solutions as well, and under the hood all of them are using Apache OpenOffice only.
User avatar
Lupp
Volunteer
Posts: 3549
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Not able to read special characters by XWordCursor.getSt

Post by Lupp »

sumit7325 wrote:... my intention with words is, I have to validate each and every word in Docx template against a certain regular expression, for example, a user can upload a Docx template with some placeholders like ${name}, ${employeeNo}, etc, So I have to do 2 step validation first I have to validate whether expression syntax is correct like $${name}, ${employee}}, or missing bracket will consider as a wrong expression and 2 one if placeholder syntax is correct then I have to validate these syntactically valid placeholders to application-specific placeholder list for example if user have added ${test} in template even though it is a valid expression but I have to ignore it because it is not application-specific . So these operations I have to perform in a Java-based web application that is why I need to extract each and every word and stick to java only :knock:
Nothing of what you tell here is surprising to me - except the :knock: .
However, I cannot understand your lack of understanding of my proposal...
sumit7325 wrote:Thanks, robleyd for the suggestion but I have tried mostly all libraries out there including Apache POI, DOCX4j, Documents4j, and other python solutions also and under the hood all are using Apache open office only
...and your response to "robleyd".
You obviously are using some successor of OpenOffice.org. Otherwise the "code snippet" you started with would make no sense.
1. You opened your something.docx with it. (That was probably via the Java bridge, but that's of no importance here.)
2. You search for "words" using the term in a sense the WordCursor doesn't know.
== Therefore the WordCursor is the wrong tool.
3. In fact you search for "AttemptedInsertionOfAPlaceholder".
4. If an unattended run of software (based on java or avaj or whatever) shall do this,
== you need to tell your program what you are looking for, and
== you should try a regular expression describing this concept syntactically.
5. Having found the suspects this way you want to check them for syntactical correctness as placeholders.
(6. An attempt not being correct may require a response ... output ...)
7. The syntactically correct placeholders need to be checked against a semantic(kind of)/restrictive criterion.

Everything can be done by efficient means available via the services / interfaces and their methods provided by AOO and (probably even better) by LibreOffice (and probably also by NeoOffice). Your Java has a bridge to the API of whatever you are using. Through it, the in-memory representation of your TextDocument must be able to create a SearchDescriptor and the like...

As I would see it

Code: Select all

regExAttempt = "\$[^}\s]+(\}+)?"  REM Supposed attempt
and

Code: Select all

regExCorrect = "(?<=(^|\s))\$\{[^\{}\s]+\}(?=(\s|$))(?!\})"
are reasonable candidates when looking for what you need with the help of a SearchDescriptor ...
If I had your lists of acceptable and of mandatory placeholders I might create a "complete" solution in Basic using another hour or two (at most). Equally it must be feasible with any language / IDE rightly claiming to come with a sufficient bridge to "our" API. :knock:
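
For readers staying in Java, the two expressions can also be tried with java.util.regex on the plain document text, as in the sketch below. This is not the SearchDescriptor route itself: it assumes the text has already been read into a string, the class name PlaceholderScan is invented, and the office's own regular expression engine may differ from Java's in details.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderScan {

    // The two expressions from the post, escaped for Java string literals.
    static final Pattern ATTEMPT = Pattern.compile("\\$[^}\\s]+(\\}+)?");
    static final Pattern CORRECT =
            Pattern.compile("(?<=(^|\\s))\\$\\{[^\\{}\\s]+\\}(?=(\\s|$))(?!\\})");

    /** Returns every attempted placeholder that is not also a correct one. */
    static List<String> malformed(String text) {
        // Remember where the syntactically correct placeholders start.
        Set<Integer> correctStarts = new HashSet<>();
        Matcher ok = CORRECT.matcher(text);
        while (ok.find()) {
            correctStarts.add(ok.start());
        }
        // Every attempt that does not start at one of those offsets is suspect.
        List<String> bad = new ArrayList<>();
        Matcher attempt = ATTEMPT.matcher(text);
        while (attempt.find()) {
            if (!correctStarts.contains(attempt.start())) {
                bad.add(attempt.group());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Hypothetical template text mixing correct and broken placeholders.
        String text = "Hello ${name} $${name} ${employee}} ${test";
        malformed(text).forEach(System.out::println);
    }
}
```

The correct placeholders found by CORRECT would then go on to the semantic check against the application's own list.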