Not able to read special characters by XWordCursor.getString

Post by **sumit7325** » Thu Apr 09, 2020 5:29 am

Hi All,
I can successfully read a Docx file word by word using the code snippet below. The problem is that all special characters such as $, { and } are ignored by XWordCursor.getString.

Code:

XComponent xComp = xCompLoader.loadComponentFromURL(
        sUrl, "_blank", 0, propertyValues);

com.sun.star.text.XTextDocument xTextDocument =
        (com.sun.star.text.XTextDocument) UnoRuntime.queryInterface(
                com.sun.star.text.XTextDocument.class, xComp);

XText xText = xTextDocument.getText();

XSimpleText xSimpleText = UnoRuntime.queryInterface(
        XSimpleText.class, xText);
XTextCursor xTextCursor = xSimpleText.createTextCursor();

xTextCursor.gotoEnd(true);

XTextRange xTextRange = UnoRuntime.queryInterface(
        XTextRange.class, xTextCursor);
String sString = xTextRange.getString();

XTextCursor textCursor = xTextRange.getText().createTextCursorByRange(xTextRange.getStart());
XWordCursor wordCursor = (XWordCursor)
        UnoRuntime.queryInterface(XWordCursor.class, textCursor);

wordCursor.gotoStart(false);     // go to start of text

int wordCount = 0;
String currWord;
do {
    wordCursor.gotoEndOfWord(true);
    currWord = wordCursor.getString();
    if (currWord.length() > 0) {
        // System.out.println("<" + currWord + ">");
        wordCount++;
        System.out.println(currWord);
    }
} while (wordCursor.gotoNextWord(false));

The output of the code for the attached document is:

Code:

TestingFirstWord
Hello
test
name
employeeno

The expected output is:

Code:

TestingFirstWord
Hello
test
${name} 
${employeeno}

I also found a similar thread, http://openoffice.2283327.n4.nabble.com ... 11431.html, but could not get much information out of it.

Any help or clue is appreciated, thanks.

Post by **Zizi64** » Thu Apr 09, 2020 7:12 am

Maybe it depends on the definition of the "word" language unit: parentheses (even the special ones) are not part of a "word" in a human language.
Therefore the expected output is more than "simple words", while the actual output consists of "real words".

Post by **JeJe** » Thu Apr 09, 2020 8:07 am

Write your own function to do it how you want.

In Basic a simple split almost gives you the result you want (the exception is the double space)

sts = split("TestingFirstWord Hello test ${name} ${employeeno}"," ")
for i =0 to ubound(sts)
msgbox sts(i)
next

Post by **sumit7325** » Thu Apr 09, 2020 8:28 am

Thank you for the quick reply,

Code:

sts = split("TestingFirstWord Hello test ${name} ${employeeno}"," ")
for i =0 to ubound(sts)
msgbox sts(i)
next

In my case the separator can be a single space or a tab, which is why I am trying to extract word by word irrespective of the kind of whitespace.

Post by **JeJe** » Thu Apr 09, 2020 9:49 am

There are some options for XBreakIterator which you could look at, but it's a trivial task to write code that does what you want if the native function doesn't do it: go through a string and start a new word or not depending on what the character is.

https://www.openoffice.org/api/docs/com ... dType.html
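
Since the original question is Java-based, the character-by-character approach could look roughly like the following in Java. This is only a sketch: the class and method names (`WhitespaceTokenizer`, `tokens`) are made up for illustration, and it assumes the document text has already been fetched as one `String`, for example via `xTextRange.getString()` as in the first post.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceTokenizer {

    /** Splits text into tokens separated by any run of whitespace (space, tab, line break). */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current word, if there is one.
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Every other character, including $, { and }, belongs to the word.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [TestingFirstWord, Hello, test, ${name}, ${employeeno}]
        System.out.println(tokens("TestingFirstWord  Hello\ttest ${name}\n${employeeno}"));
    }
}
```

Because any run of whitespace ends a word, double spaces and tabs do not produce empty entries.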

Post by **Lupp** » Thu Apr 09, 2020 2:19 pm

I don't know what you intend to do with your words, or whether the formatting is of any meaning. Anyway, I feel sure that a WordCursor is the wrong tool in this case.
If you can bear stepping down to "stupid Basic", you can try what I would suggest (and probably ennoble it by moving to a different language / IDE). From my point of view the built-in (primitive) Basic with its (powerful) bridge to the API is the means of choice for tasks of this simple kind.
Open your file (Actually .docx? Why?) with AOO or LibO. (I don't know about NeoOffice.)
Use Tools>Macros>Organize Macros>...Basic ... to create a Basic module (located in the document or elsewhere).
Insert the following code there.

Code:

Sub getExtendedWords()
doc0    = ThisComponent
sd      = doc0.createSearchDescriptor
sd.SearchRegularExpression = True
sd.SearchString ="\S+"
myWords = doc0.FindAll(sd)
u       = myWords.Count - 1
doc1    = StarDesktop.loadComponentFromUrl("private:factory/scalc", "_blank",0, Array())
s1c1    = doc1.Sheets(0).Columns(0) 
outRg   = s1c1.getCellRangeByPosition(0,0,0,u)
outRgDA = outRg.getDataArray()
For j = 0 To u
  outRgDA(j)(0) = myWords(j).String
Next j
outRg.setDataArray(outRgDA)
End Sub

...and run it.

The attached file contains the code and a demo. If you (after checking) give permission to execute the code, it's a matter of milliseconds.
Of course you can output the results in a different way. I did it this way because it is simple and gives a good overview.
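
For a Java caller, the same `\S+` idea can be sketched with `java.util.regex` on plain text instead of a search descriptor. This assumes the text is already available as a `String` (the first post obtains one with `xTextRange.getString()`); the class name `ExtendedWords` is hypothetical. Java's `\S` may classify a few Unicode space characters differently from the office search engine, so treat it as an approximation rather than an exact equivalent of the macro.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtendedWords {

    // Same expression as the Basic macro: one or more non-whitespace characters.
    private static final Pattern NON_SPACE_RUN = Pattern.compile("\\S+");

    /** Returns every maximal run of non-whitespace characters, in document order. */
    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = NON_SPACE_RUN.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Hello, test, ${name}, ${employeeno}]
        System.out.println(find("Hello test ${name}\t${employeeno}"));
    }
}
```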

Post by **sumit7325** » Fri Apr 10, 2020 8:01 am

Thank you, Lupp, for the response and the detailed explanation. My intention is to validate every word in a Docx template against a regular expression. For example, a user can upload a Docx template containing placeholders such as ${name}, ${employeeNo}, etc. I have to validate in two steps. First, I check whether the placeholder syntax is correct: $${name}, ${employee}}, or a missing bracket count as wrong expressions. Second, if the syntax is correct, I check the placeholder against an application-specific placeholder list: if the user has added ${test} to the template, I have to ignore it even though it is a valid expression, because it is not application-specific. I have to perform these operations in a Java-based web application, which is why I need to extract every word and stick to Java only.
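
A minimal Java sketch of the two-step check described above, applied to one already extracted word. Everything named here is hypothetical (`PlaceholderValidator`, `classify`, the `Result` values), and the rule that a placeholder name consists of letters, digits and underscores is an assumption, not something the template format defines.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderValidator {

    enum Result { NOT_A_PLACEHOLDER, MALFORMED, UNKNOWN_NAME, VALID }

    // Assumption: a well-formed placeholder is exactly ${name} with a name
    // made of letters, digits and underscores.
    private static final Pattern WELL_FORMED = Pattern.compile("\\$\\{([A-Za-z0-9_]+)\\}");

    static Result classify(String token, Set<String> allowedNames) {
        // Words without any placeholder character are ordinary text.
        if (token.indexOf('$') < 0 && token.indexOf('{') < 0 && token.indexOf('}') < 0) {
            return Result.NOT_A_PLACEHOLDER;
        }
        // Step 1: syntax check on the whole token.
        Matcher m = WELL_FORMED.matcher(token);
        if (!m.matches()) {
            return Result.MALFORMED;
        }
        // Step 2: the name must be on the application-specific list.
        return allowedNames.contains(m.group(1)) ? Result.VALID : Result.UNKNOWN_NAME;
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("name", "employeeNo"));
        for (String token : new String[] {"Hello", "${name}", "$${name}", "${employee}}", "${test}"}) {
            System.out.println(token + " -> " + classify(token, allowed));
        }
    }
}
```

Separating the syntax check from the list lookup keeps the two kinds of error distinguishable, so a malformed placeholder can be reported while an unknown but well-formed one is simply ignored.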

Post by **robleyd** » Fri Apr 10, 2020 9:46 am

Are you aware of https://poi.apache.org/ ? Just in case it might be helpful for you.

Post by **sumit7325** » Fri Apr 10, 2020 11:09 am

Thanks, robleyd, for the suggestion, but I have tried almost all the libraries out there, including Apache POI, DOCX4j and Documents4j, as well as some Python solutions, and under the hood they all use Apache OpenOffice only.

Post by **Lupp** » Fri Apr 10, 2020 1:54 pm

sumit7325 wrote:... My intention is to validate every word in a Docx template against a regular expression. For example, a user can upload a Docx template containing placeholders such as ${name}, ${employeeNo}, etc. I have to validate in two steps. First, I check whether the placeholder syntax is correct: $${name}, ${employee}}, or a missing bracket count as wrong expressions. Second, if the syntax is correct, I check the placeholder against an application-specific placeholder list: if the user has added ${test} to the template, I have to ignore it even though it is a valid expression, because it is not application-specific. I have to perform these operations in a Java-based web application, which is why I need to extract every word and stick to Java only.

Nothing of what you tell here is surprising to me - except the [...].
However, I cannot understand your lack of understanding of my proposal...

sumit7325 wrote:Thanks, robleyd, for the suggestion, but I have tried almost all the libraries out there, including Apache POI, DOCX4j and Documents4j, as well as some Python solutions, and under the hood they all use Apache OpenOffice only.

...and your response to "robleyd".
You are obviously using some successor of OpenOffice.org. Otherwise the "code snippet" you started with would make no sense.
1. You opened your something.docx with it. (That was probably via the Java bridge, but that's of no meaning here.)
2. You search for "words" using the term in a sense the WordCursor doesn't know.
== Therefore the WordCursor is the wrong tool.
3. In fact you search for "AttemptedInsertionOfAPlaceholder".
4. If an unattended run of software (based on java or avaj or whatever) shall do this,
== you need to tell your program what you are looking for, and
== you should try a regular expression describing this concept syntactically.
5. Having found the suspects this way you want to check them for syntactical correctness as placeholders.
(6. An attempt not being correct may require a response ... output ...)
7. The syntactically correct placeholders need to be checked against a semantic(kind of)/restrictive criterion.

Everything can be done by efficient means available via services / interfaces and their methods provided by AOO and (even better probably) by LibreOffice (and probably also by NeoOffice). Your Java has a bridge to the API of whatever you are using. It must be able to make your RAM representation of a TextDocument create a SearchDescriptor and the like...

As I would see it

Code:

regExAttempt = "\$[^}\s]+(\}+)?"  REM Supposed attempt

and

Code:

regExCorrect = "(?<=(^|\s))\$\{[^\{}\s]+\}(?=(\s|$))(?!\})"

are reasonable candidates when looking for what you need with the help of a SearchDescriptor ...
If I had your lists of acceptable and of mandatory placeholders I might create a "complete" solution in Basic using another hour or two (at most). Equally it must be feasible with any language / IDE that rightly claims to come with a sufficient bridge to "our" API.
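
As a rough illustration of how the two expressions could be combined outside the office, here is a Java sketch that runs them with `java.util.regex` on plain text. It is not the SearchDescriptor route itself: it assumes the text was already fetched as a `String`, the class and method names (`PlaceholderScan`, `attempts`, `malformed`) are invented for the example, and `^` and `$` here refer to the whole string rather than to a paragraph, which should agree in the common case because line breaks count as whitespace.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderScan {

    // The two expressions from the post, escaped for Java string literals.
    static final Pattern ATTEMPT = Pattern.compile("\\$[^}\\s]+(\\}+)?");
    static final Pattern CORRECT =
            Pattern.compile("(?<=(^|\\s))\\$\\{[^\\{}\\s]+\\}(?=(\\s|$))(?!\\})");

    /** Everything that looks like an attempted placeholder. */
    static List<String> attempts(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ATTEMPT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** Attempts that do not start at the position of a syntactically correct placeholder. */
    static List<String> malformed(String text) {
        Set<Integer> correctStarts = new HashSet<>();
        Matcher c = CORRECT.matcher(text);
        while (c.find()) {
            correctStarts.add(c.start());
        }
        List<String> bad = new ArrayList<>();
        Matcher a = ATTEMPT.matcher(text);
        while (a.find()) {
            if (!correctStarts.contains(a.start())) {
                bad.add(a.group());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String text = "Hello ${name} $${bad} ${employee}} ${ok";
        System.out.println(attempts(text));
        System.out.println(malformed(text));
    }
}
```

The attempts that are not reported as malformed are the syntactically correct ones, which can then be compared with the application's own placeholder list.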
