Read each word from DOCX

sumit7325
Posts: 6
Joined: Tue Apr 07, 2020 3:28 pm

Post by sumit7325 »

Hi All,
I am using Java UNO to load a DOCX document. Following the https://wiki.openoffice.org/wiki/API/Sa ... tStructure reference, I am able to read the paragraphs, but I cannot read each word in a paragraph. The sample code is below, and a sample DOCX template is attached for reference.

Code:

// Imports used by this fragment:
import com.sun.star.container.XEnumeration;
import com.sun.star.container.XEnumerationAccess;
import com.sun.star.lang.XServiceInfo;
import com.sun.star.text.XTextContent;
import com.sun.star.text.XTextRange;
import com.sun.star.uno.UnoRuntime;

// xParagraphEnumeration walks the top-level content of the document text
// (paragraphs and tables).
while (xParagraphEnumeration.hasMoreElements()) {
    XTextContent element = UnoRuntime.queryInterface(
            XTextContent.class, xParagraphEnumeration.nextElement());

    XServiceInfo xInfo = UnoRuntime.queryInterface(XServiceInfo.class, element);
    if (xInfo.supportsService("com.sun.star.text.TextTable")) {
        // Tables are only reported here, not traversed.
        System.out.println(xInfo);
    } else {
        // A paragraph can be enumerated into its text portions.
        XEnumerationAccess xParaEnumerationAccess = UnoRuntime.queryInterface(
                XEnumerationAccess.class, element);
        XEnumeration xTextPortionEnum = xParaEnumerationAccess.createEnumeration();

        while (xTextPortionEnum.hasMoreElements()) {
            XTextRange xTextRange = UnoRuntime.queryInterface(
                    XTextRange.class, xTextPortionEnum.nextElement());

            // This returns the whole line, e.g. "Hello test  ${name}".
            // I need each word separately: Hello, test, ${name}
            System.out.println(xTextRange.getString());
        }
    }
}
Any help is appreciated. Thanks.
Attachments
template.docx
NeoOffice 2.2.3 with MacOS 10.4
RoryOF
Moderator
Posts: 34611
Joined: Sat Jan 31, 2009 9:30 pm
Location: Ireland

Re: Read each word from DOCX

Post by RoryOF »

This code (in BASIC) shows how to get at the word structure of each paragraph:

Code:

Dim Doc As Object
Dim Enum1 As Object
Dim Enum2 As Object
Dim TextElement As Object
Dim TextPortion As Object
 
Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration
 
' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement
 
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration
    ' loop over all text portions of the paragraph
 
    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement
      MsgBox "'" & TextPortion.String & "'"
      TextPortion.String = Replace(TextPortion.String, "you", "U") 
      TextPortion.String = Replace(TextPortion.String, "too", "2")
      TextPortion.String = Replace(TextPortion.String, "for", "4") 
    Wend
 
  End If
Wend
Apache OpenOffice 4.1.15 on Xubuntu 22.04.4 LTS
JeJe
Volunteer
Posts: 2779
Joined: Wed Mar 09, 2016 2:40 pm

Re: Read each word from DOCX

Post by JeJe »

https://wiki.openoffice.org/wiki/Writer/API/Text_cursor

oCursor.gotoNextWord(bExtend)
oCursor.gotoPreviousWord(bExtend)
oCursor.gotoEndOfWord(bExtend)
oCursor.gotoStartOfWord(bExtend)
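
If you would rather not drive a word cursor, a simpler route may be to split the string you already get from xTextRange.getString() on whitespace. The following is only a minimal sketch in plain Java: WordSplitter and splitWords are hypothetical names, not part of the UNO API, and it assumes that "word" means any whitespace-separated token, so ${name} stays in one piece.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the UNO API.
class WordSplitter {

    // Splits the text of one portion into whitespace-separated words.
    static List<String> splitWords(String portionText) {
        List<String> words = new ArrayList<>();
        // "\\s+" treats any run of spaces, tabs or line breaks as one separator.
        for (String word : portionText.trim().split("\\s+")) {
            // A blank input yields a single empty token, which is skipped here.
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by xTextRange.getString().
        for (String word : splitWords("Hello test  ${name}")) {
            System.out.println(word);
        }
    }
}
```

Run as is, this prints Hello, test and ${name} on separate lines. Keep in mind that text portions are split wherever the formatting changes, so a word that is partly bold, for example, can arrive in two portions. In that case it is probably safer to split the string of the whole paragraph rather than of each portion.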
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)