Read each word from DOCX

Post by sumit7325 » Tue Apr 07, 2020 4:27 pm

Hi All,
I am using Java UNO to load a DOCX document. Following the https://wiki.openoffice.org/wiki/API/Samples/Java/Writer/TextDocumentStructure reference, I am able to read the paragraphs, but I cannot read each word within a paragraph. Below is my sample code; I have also attached a sample DOCX template for reference.

Code:
while (xParagraphEnumeration.hasMoreElements()) {
    // Each top-level element is either a paragraph or a table
    XTextContent element = (com.sun.star.text.XTextContent)
            UnoRuntime.queryInterface(
                    com.sun.star.text.XTextContent.class,
                    xParagraphEnumeration.nextElement());

    XServiceInfo xInfo = UnoRuntime.queryInterface(
            XServiceInfo.class, element);
    if (xInfo.supportsService("com.sun.star.text.TextTable")) {
        System.out.println(xInfo);
    } else {
        // A paragraph can be enumerated into its text portions
        XEnumerationAccess xParaEnumerationAccess =
                (com.sun.star.container.XEnumerationAccess)
                        UnoRuntime.queryInterface(
                                com.sun.star.container.XEnumerationAccess.class,
                                element);

        XEnumeration xTextPortionEnum =
                xParaEnumerationAccess.createEnumeration();

        while (xTextPortionEnum.hasMoreElements()) {
            com.sun.star.text.XTextRange xTextRange =
                    (com.sun.star.text.XTextRange) UnoRuntime.queryInterface(
                            com.sun.star.text.XTextRange.class,
                            xTextPortionEnum.nextElement());

            // This prints the whole text portion, e.g. "Hello test  ${name}";
            // I need each word separately: Hello, test, ${name}
            System.out.println(xTextRange.getString());
        }
    }
}

Any help is appreciated. Thanks.
Attachment: template.docx (11.96 KiB)

Re: Read each word from DOCX

Post by RoryOF » Tue Apr 07, 2020 4:44 pm

This code (in BASIC) shows how to enumerate the text portions of each paragraph. Note that a text portion is a run of identically formatted text, not necessarily a single word.
Code:
Dim Doc As Object
Dim Enum1 As Object
Dim Enum2 As Object
Dim TextElement As Object
Dim TextPortion As Object

Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration

' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration
    ' loop over all text portions of the paragraph

    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement
      MsgBox "'" & TextPortion.String & "'"
      TextPortion.String = Replace(TextPortion.String, "you", "U")
      TextPortion.String = Replace(TextPortion.String, "too", "2")
      TextPortion.String = Replace(TextPortion.String, "for", "4")
    Wend

  End If
Wend
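
Once a text portion (or a whole paragraph) has been read as a string, one possible way to get the individual tokens the question asks for is to split that string on whitespace in plain Java. The sketch below is only an illustration under that assumption: the class name WordSplitter and the method splitWords are hypothetical, not part of the UNO API, and the sample string is the one quoted in the question. Because a word can be split across two text portions when its formatting changes midway, it is probably safer to apply this to the paragraph's full string rather than to each portion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UNO: splits a string into
// whitespace-separated tokens.
class WordSplitter {

    // Splits on runs of whitespace and drops empty tokens, so repeated
    // spaces or an empty input do not produce empty "words".
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // In the code from the question, this string would come from
        // xTextRange.getString().
        for (String word : splitWords("Hello test  ${name}")) {
            System.out.println(word);
        }
    }
}
```

Splitting on whitespace keeps a placeholder such as ${name} together as one token, which matches the output asked for in the question. Punctuation attached to a word (for example a trailing comma) stays attached as well.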

Re: Read each word from DOCX

Post by JeJe » Tue Apr 07, 2020 5:28 pm

https://wiki.openoffice.org/wiki/Writer/API/Text_cursor

oCursor.gotoNextWord(bExtend)
oCursor.gotoPreviousWord(bExtend)
oCursor.gotoEndOfWord(bExtend)
oCursor.gotoStartOfWord(bExtend)

