Not able to read special characters by XWordCursor.getString

Postby sumit7325 » Thu Apr 09, 2020 5:29 am

Hi All,
I can successfully read a Docx file word by word using the code snippet below. The problem is that special characters such as $, { and } are dropped from the string returned by XWordCursor.getString.

Code:
// Load the document; xCompLoader, sUrl and propertyValues are set up beforehand.
XComponent xComp = xCompLoader.loadComponentFromURL(
        sUrl, "_blank", 0, propertyValues);

com.sun.star.text.XTextDocument xTextDocument =
        (com.sun.star.text.XTextDocument) UnoRuntime.queryInterface(
                com.sun.star.text.XTextDocument.class, xComp);

XText xText = xTextDocument.getText();

// Extend a text cursor's selection to the end of the text and read it as one string.
XSimpleText xSimpleText = UnoRuntime.queryInterface(
        XSimpleText.class, xText);
XTextCursor xTextCursor = xSimpleText.createTextCursor();
xTextCursor.gotoEnd(true);

XTextRange xTextRange = UnoRuntime.queryInterface(
        XTextRange.class, xTextCursor);
String sString = xTextRange.getString();

// Create a word cursor and move it to the start of the text.
XTextCursor textCursor = xTextRange.getText().createTextCursorByRange(xTextRange.getStart());
XWordCursor wordCursor = (XWordCursor)
        UnoRuntime.queryInterface(XWordCursor.class, textCursor);
wordCursor.gotoStart(false);

// Walk the document word by word and print every non-empty word.
int wordCount = 0;
String currWord;
do {
    wordCursor.gotoEndOfWord(true);
    currWord = wordCursor.getString();
    if (currWord.length() > 0) {
        wordCount++;
        System.out.println(currWord);
    }
} while (wordCursor.gotoNextWord(false));


The output of this code for the attached document is:

Code:
TestingFirstWord
Hello
test
name
employeeno


The expected output is:

Code:
TestingFirstWord
Hello
test
${name}
${employeeno}


I also found a similar thread, http://openoffice.2283327.n4.nabble.com/XWordCursor-gotoEndOfWord-misbehavior-td2811431.html, but could not get much information out of it.

Any help or clue is appreciated. Thanks.
Attachments
template.docx
(12.76 KiB) Downloaded 15 times
NeoOffice 2.2.3 with MacOS 10.4

Postby Zizi64 » Thu Apr 09, 2020 7:12 am

Maybe it depends on the definition of the "word" language unit: parentheses (even the special ones, such as braces) are not part of a "word" in human language.
So the output you expect consists of more than "simple words", whereas the actual output consists of "real words".
Last edited by Zizi64 on Fri Apr 10, 2020 3:47 pm, edited 1 time in total.
Tibor Kovacs, Hungary; LO6.1.6, 6.2.8 /Win7-10 x64Prof.
PortableApps/winPenPack: LO3.3.0-6.4.3;AOO4.1.7

Postby JeJe » Thu Apr 09, 2020 8:07 am

Write your own function to do it how you want.

In Basic, a simple Split almost gives you the result you want (the exception is the double space):

sts = Split("TestingFirstWord Hello test ${name} ${employeeno}", " ")
For i = 0 To UBound(sts)
    MsgBox sts(i)
Next
Openoffice 4.1.2
Windows 8

Postby sumit7325 » Thu Apr 09, 2020 8:28 am

Thank you for the quick reply,
Code:
sts = split("TestingFirstWord Hello test ${name} ${employeeno}"," ")
for i =0 to ubound(sts)
msgbox sts(i)
next

In my case the separator can be a single space or a tab, which is why I am trying to extract word by word irrespective of the kind of whitespace.

Postby JeJe » Thu Apr 09, 2020 9:49 am

There are some options for XBreakIterator which you could look at, but it's a trivial task to write code that does what you want if the native function doesn't: go through the string and start a new word or not, depending on what the character is.

https://www.openoffice.org/api/docs/com ... dType.html
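
The character-by-character approach described above might look roughly like this in Java. This is only a sketch: the class name WhitespaceWords and the method splitWords are made up for illustration, and it assumes that a "word" is any run of characters for which Character.isWhitespace returns false.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceWords {

    /** Splits text into runs of non-whitespace characters (hypothetical helper). */
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace closes the current word, if one was started.
                if (current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Any other character, including $, { and }, belongs to the word.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Sample text with a double space, a tab and a line break as separators.
        String text = "TestingFirstWord  Hello\ttest ${name}\n${employeeno}";
        for (String word : splitWords(text)) {
            System.out.println(word);
        }
    }
}
```

Because consecutive separators never start an empty word, the double space and the tab are handled the same way as a single space.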

Postby Lupp » Thu Apr 09, 2020 2:19 pm

I don't know what you intend to do with your words, or whether the formatting has any meaning. Anyway, I feel sure that a WordCursor is the wrong tool in this case.
If you can bear stepping down to "stupid Basic", you can try what I suggest below (and probably ennoble it by moving to a different language / IDE). From my point of view, the built-in (primitive) Basic with its (powerful) bridge to the API is the means of choice for simple tasks of this kind.
Open your file (Actually .docx? Why?) with AOO or LibO. (I don't know about NeoOffice.)
Use Tools>Macros>Organize Macros>...Basic ... to create a Basic module (located in the document or elsewhere).
Insert the following code there.
Code:
Sub getExtendedWords()
doc0    = ThisComponent
sd      = doc0.createSearchDescriptor
sd.SearchRegularExpression = True
sd.SearchString ="\S+"
myWords = doc0.FindAll(sd)
u       = myWords.Count - 1
doc1    = StarDesktop.loadComponentFromUrl("private:factory/scalc", "_blank",0, Array())
s1c1    = doc1.Sheets(0).Columns(0)
outRg   = s1c1.getCellRangeByPosition(0,0,0,u)
outRgDA = outRg.getDataArray()
For j = 0 To u
  outRgDA(j)(0) = myWords(j).String
Next j
outRg.setDataArray(outRgDA)
End Sub

...and run it.

The attached file contains the code and a demo. If you (after checking) give permission to execute the code, it's a matter of milliseconds.
Of course you can output the results in a different way. I did it this way because it is simple and gives a good overview.
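
For a Java-only setup, the same \S+ idea can probably be applied outside the office API as well, directly to the string that xTextRange.getString() returns in the first post. The following is a minimal sketch under that assumption; the class name ExtendedWords and the sample text are invented for illustration, and it uses java.util.regex instead of a SearchDescriptor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtendedWords {

    // Same expression as in the Basic macro: one or more non-whitespace characters.
    private static final Pattern NON_WHITESPACE_RUN = Pattern.compile("\\S+");

    /** Returns every run of non-whitespace characters found in the text. */
    static List<String> extendedWords(String documentText) {
        List<String> words = new ArrayList<>();
        Matcher matcher = NON_WHITESPACE_RUN.matcher(documentText);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in for the string obtained from the document (assumption).
        String documentText = "TestingFirstWord\nHello test\t${name} ${employeeno}";
        for (String word : extendedWords(documentText)) {
            System.out.println(word);
        }
    }
}
```

Unlike the macro, this works on plain text only, so any formatting of the found words is lost.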
Attachments
template_AOO.odt
(12.55 KiB) Downloaded 14 times
On Windows 10: LibreOffice 6.4 and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München

Postby sumit7325 » Fri Apr 10, 2020 8:01 am

Thank you, Lupp, for the response and the detailed explanation. My intention is to validate each and every word in a Docx template against a certain regular expression. For example, a user can upload a Docx template with placeholders like ${name}, ${employeeNo}, etc., so I have to do a two-step validation. First, I have to validate whether the expression syntax is correct: something like $${name} or ${employee}}, or a missing bracket, is considered a wrong expression. Second, if the placeholder syntax is correct, I have to check these syntactically valid placeholders against an application-specific placeholder list. For example, if the user has added ${test} to the template, I have to ignore it even though it is a valid expression, because it is not application-specific. I have to perform these operations in a Java-based web application, which is why I need to extract each and every word and have to stick to Java only :knock:

Postby robleyd » Fri Apr 10, 2020 9:46 am

Are you aware of https://poi.apache.org/ ? Just in case it might be helpful for you.
Cheers
David
Apache OpenOffice 420m2(Build:9821) - Slackware 14.2 - 64 bit
LibreOffice 6.0.7.3 - Slackware 14.2 - 64 bit
Apache OpenOffice 4.1.4 - Windows 7 Virtual machine

Postby sumit7325 » Fri Apr 10, 2020 11:09 am

Thanks, robleyd, for the suggestion, but I have tried almost all the libraries out there, including Apache POI, DOCX4j, Documents4j, and some Python solutions as well, and under the hood they all use Apache OpenOffice only.

Postby Lupp » Fri Apr 10, 2020 1:54 pm

sumit7325 wrote:... My intention is to validate each and every word in a Docx template against a certain regular expression. For example, a user can upload a Docx template with placeholders like ${name}, ${employeeNo}, etc., so I have to do a two-step validation. First, I have to validate whether the expression syntax is correct: something like $${name} or ${employee}}, or a missing bracket, is considered a wrong expression. Second, if the placeholder syntax is correct, I have to check these syntactically valid placeholders against an application-specific placeholder list. For example, if the user has added ${test} to the template, I have to ignore it even though it is a valid expression, because it is not application-specific. I have to perform these operations in a Java-based web application, which is why I need to extract each and every word and have to stick to Java only :knock:

Nothing of what you tell here is surprising to me - except the :knock: .
However, I cannot understand your lack of understanding of my proposal...
sumit7325 wrote:Thanks, robleyd, for the suggestion, but I have tried almost all the libraries out there, including Apache POI, DOCX4j, Documents4j, and some Python solutions as well, and under the hood they all use Apache OpenOffice only.
...and your response to "robleyd".
You are obviously using some successor of OpenOffice.org. Otherwise the "code snippet" you started with would make no sense.
1. You opened your something.docx with it. (That was probably via the Java bridge, but that's of no importance here.)
2. You search for "words" using the term in a sense the WordCursor doesn't know.
== Therefore the WordCursor is the wrong tool.
3. In fact, you are searching for an "AttemptedInsertionOfAPlaceholder".
4. If an unattended run of software (based on java or avaj or whatever) shall do this,
== you need to tell your program what you are looking for, and
== you should try a regular expression describing this concept syntactically.
5. Having found the suspects this way you want to check them for syntactical correctness as placeholders.
(6. An attempt not being correct may require a response ... output ...)
7. The syntactically correct placeholders need to be checked against a semantic(kind of)/restrictive criterion.

Everything can be done by efficient means available via the services / interfaces and their methods provided by AOO and (probably even better) by LibreOffice (and probably also by NeoOffice). Your Java has a bridge to the API of whatever you are using. It must be able to have the in-memory representation of your TextDocument create a SearchDescriptor and the like...

As I would see it
Code:
regExAttempt = "\$[^}\s]+(\}+)?"  REM Supposed attempt
and
Code:
regExCorrect = "(?<=(^|\s))\$\{[^\{}\s]+\}(?=(\s|$))(?!\})"
are reasonable candidates when looking for what you need with the help of a SearchDescriptor ...
If I had your lists of acceptable and of mandatory placeholders, I could create a "complete" solution in Basic in another hour or two (at most). Equally, it must be feasible with any language / IDE that rightly claims to come with a sufficient bridge to "our" API. :knock:
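
As a rough illustration of steps 3 to 7, the two expressions above could also be tried with java.util.regex on the plain text of the document. This is only a sketch under assumptions: the class PlaceholderCheck, the Status enum, the sample text and the allowed names "name" and "employeeno" are all hypothetical, and it works on a string rather than through a SearchDescriptor.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderCheck {

    enum Status { OK, UNKNOWN_NAME, BAD_SYNTAX }

    // The two expressions from the post, escaped for Java string literals.
    static final Pattern ATTEMPT = Pattern.compile("\\$[^}\\s]+(\\}+)?");
    static final Pattern CORRECT =
            Pattern.compile("(?<=(^|\\s))\\$\\{[^\\{}\\s]+\\}(?=(\\s|$))(?!\\})");

    /** Classifies every attempted placeholder; a repeated token keeps its last status. */
    static Map<String, Status> check(String text, Set<String> allowedNames) {
        // Remember where each syntactically correct placeholder starts and ends.
        Map<Integer, Integer> correctSpans = new HashMap<>();
        Matcher correct = CORRECT.matcher(text);
        while (correct.find()) {
            correctSpans.put(correct.start(), correct.end());
        }

        Map<String, Status> result = new LinkedHashMap<>();
        Matcher attempt = ATTEMPT.matcher(text);
        while (attempt.find()) {
            String token = attempt.group();
            Integer end = correctSpans.get(attempt.start());
            if (end == null || end != attempt.end()) {
                // Step 5/6: the attempt is not a well-formed placeholder.
                result.put(token, Status.BAD_SYNTAX);
            } else {
                // Step 7: strip "${" and "}" and compare with the allowed list.
                String name = token.substring(2, token.length() - 1);
                result.put(token,
                        allowedNames.contains(name) ? Status.OK : Status.UNKNOWN_NAME);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("name", "employeeno"));
        String text = "Hello ${name} $${name} ${employee}} ${test} ${broken";
        check(text, allowed).forEach((token, status) ->
                System.out.println(token + " -> " + status));
    }
}
```

Comparing match positions instead of re-testing each token in isolation keeps the lookbehind and lookahead of the second expression meaningful, since they depend on the characters around the placeholder.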