Extracting the text from the table present in .docx file

vivekHavalad
Posts: 2
Joined: Sat Oct 10, 2020 10:29 pm

Extracting the text from the table present in .docx file

Post by vivekHavalad »

Hi all,
I am using Java UNO to load a .docx document. Following the reference at https://wiki.openoffice.org/wiki/API/Sa ... tStructure, I am not able to extract the text held in tables in the .docx file. My code is below for reference:

Code:

XTextTable table = Lo.createInstanceMSF(XTextTable.class,
 "com.sun.star.text.TextTable");
XTableRows rows = table.getRows();
for (int y=1; y < rows; y++) { 
  System.out.println("Table:" + rows[y]); 
}
Last edited by MrProgrammer on Sat Oct 10, 2020 10:55 pm, edited 1 time in total.
Reason: Moved from Writer forum to Macros and UNO API
OpenOffice 3.1 on mac pro
John_Ha
Volunteer
Posts: 9584
Joined: Fri Sep 18, 2009 5:51 pm
Location: UK

Re: Extracting the text from the table present in .docx file

Post by John_Ha »

Record a macro to do it and then hack the macro?
LO 6.4.4.2, Windows 10 Home 64 bit

See the Writer Guide, the Writer FAQ, the Writer Tutorials and Writer for students.

Remember: Always save your Writer files as .odt files. - see here for the many reasons why.
FJCC
Moderator
Posts: 9274
Joined: Sat Nov 08, 2008 8:08 pm
Location: Colorado, USA

Re: Extracting the text from the table present in .docx file

Post by FJCC »

I do not know Java, but it looks like you are creating a new table with the createInstance() method instead of reading one that already exists in the document. You are also trying to use the rows variable as if it were an integer, but getRows() returns a container of row objects, not a count of the rows. I recorded some Java code using the MRI extension. It gets a table that exists in the document by name and extracts its DataArray, which holds the content of each cell. Is that the sort of thing you are trying to do?

Code:

import com.sun.star.container.NoSuchElementException;
import com.sun.star.container.XNameAccess;
import com.sun.star.lang.WrappedTargetException;
import com.sun.star.sheet.XCellRangeData;
import com.sun.star.text.XTextTable;
import com.sun.star.text.XTextTablesSupplier;
import com.sun.star.uno.RuntimeException;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.uno.XComponentContext;

public static void snippet(XComponentContext xComponentContext, Object oInitialTarget)
{
	try
	{
		XTextTablesSupplier xTextTablesSupplier = UnoRuntime.queryInterface(
			XTextTablesSupplier.class, oInitialTarget);
		XNameAccess xNameAccess = xTextTablesSupplier.getTextTables();
		
		XTextTable xTextTable = UnoRuntime.queryInterface(
			XTextTable.class, xNameAccess.getByName("Table1"));
		
		XCellRangeData xCellRangeData = UnoRuntime.queryInterface(
			XCellRangeData.class, xTextTable);
		Object[][] oDataArray = xCellRangeData.getDataArray();
		
	}
	catch (NoSuchElementException e1)
	{
		// getByName
		e1.printStackTrace();
	}
	catch (WrappedTargetException e2)
	{
		// getByName
		e2.printStackTrace();
	}
	catch (RuntimeException e3)
	{
		// unchecked UNO error; any of the calls above may throw it
		e3.printStackTrace();
	}
}
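
The snippet above stops once oDataArray has been filled. As a rough sketch of what could follow, the plain-Java helper below walks such a two-dimensional array and turns every cell into a line of text. It needs no UNO classes; the class name DataArrayDump, the method describe and the sample values are made up for illustration, and it assumes each element is a String or a Double (or possibly null).

```java
import java.util.ArrayList;
import java.util.List;

final class DataArrayDump {
    /** Returns one descriptive line per cell, row by row. */
    static List<String> describe(Object[][] dataArray) {
        List<String> lines = new ArrayList<>();
        for (int row = 0; row < dataArray.length; row++) {
            for (int col = 0; col < dataArray[row].length; col++) {
                Object value = dataArray[row][col];
                // Assumption: cells hold String or Double; null is treated as empty.
                String text = (value == null) ? "" : value.toString();
                lines.add("row " + row + ", col " + col + ": " + text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Invented stand-in for the result of xCellRangeData.getDataArray().
        Object[][] sample = { {"Name", "Qty"}, {"Bolt", 12.0} };
        describe(sample).forEach(System.out::println);
    }
}
```

In the snippet, the call would be something like DataArrayDump.describe(oDataArray) placed right after getDataArray().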
OpenOffice 4.1 on Windows 10 and Linux Mint
If your question is answered, please go to your first post, select the Edit button, and add [Solved] to the beginning of the title.
Lupp
Volunteer
Posts: 3549
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: Extracting the text from the table present in .docx file

Post by Lupp »

If the .docx the OQ was talking about was successfully loaded by OOo, AOO or LibO, the usage of the API does not depend on the "format" of the source file. What is in RAM is simply a TextDocument, and the software supports all the respective services for it.

The information you (the OQ) found may be slightly misleading. It's true that TextTable objects are returned like paragraphs when the first-level objects contained in the text are enumerated. However, TextTable objects don't support the specific paragraph services. On the contrary, they contain cells, which in turn can enumerate paragraphs - and nested TextTable objects again.

All the TextTable objects contained in a TextDocument are also accessible by index from the .TextTables object of the document. Their cells are then accessible by their names, which are found in the sequence .CellNames of the table. Accessing them one by one using something like myCell = myTextTable.getCellByName(theCellNameIfound), you can either analyze them for paragraphs and additional TextContent (TextTable, TextFrame, ...) - or (e.g.) just ask them for their .String property.

You see: TextTable objects are really simple and handy. I love them. Joking aside:
The attached example demonstrates the usage of some of the mentioned objects and methods in Basic. Java can surely do everything better and simpler. (Sorry. A joke again.)
textTableStrings.odt (23.22 KiB)
(Two cells contain a second paragraph. As you can see, if you are only interested in the strings, you can also split them into per-paragraph parts without enumerating the text.)

The raw Basic code:

Code:

Sub getTextTableCellStrings
doc = ThisComponent
tTable = doc.TextTables(0)
cellNames = tTable.cellNames
u = Ubound(cellNames)
Dim tStrings(u, 1)
For j = 0 To u
  j_cName = cellNames(j)
  j_cell = tTable.getCellByName(j_cName)
  tStrings(j, 0) = j_cName
  tStrings(j, 1) = j_cell.String
  If InStr(j_cell.String, Chr(13) & Chr(10))<>0 Then Print "Internal new paragraph In cell: " & j_cName & "!"
Next j
Print "You can now evaluate the array ""tStrings"" for the contained strings." 
End Sub
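
For Java users, the per-paragraph split mentioned above might look like the sketch below. The Basic code only tests for CR+LF; I am not sure every platform or version uses that separator in the .String property, so this version splits on any line break. The class and method names are made up, and the input is an ordinary Java string standing in for what the cell would return.

```java
final class CellParagraphs {
    /**
     * Splits a cell string into its paragraphs.
     * "\\R" matches any line break and treats CR+LF as a single break;
     * the limit -1 keeps empty paragraphs at the end.
     */
    static String[] split(String cellString) {
        return cellString.split("\\R", -1);
    }

    public static void main(String[] args) {
        // Invented sample; in real use the string would come from the cell.
        for (String paragraph : split("first paragraph\r\nsecond paragraph")) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```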
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München

Re: Extracting the text from the table present in .docx file

Post by vivekHavalad »

Thank you all for the support, but I am now getting
com.sun.star.uno.RuntimeException: Table too complex
while doing

Code:

XCellRangeData xCellRangeData = UnoRuntime.queryInterface(
         XCellRangeData.class, xTextTable);

Re: Extracting the text from the table present in .docx file

Post by Lupp »

-1- I cannot say anything reliable concerning OOo 3.1.
-2- Text tables are generally a mess if you try to work with them via the API.
-3- I would suspect that every table which is not fully regular (n rows x m columns) counts as "too complex".
-4- Complex tables, as I understand the term, are tables having merged or split cells. Just for fun I have occasionally tried to work with that kind of table in text documents using the API. It's a nightmare.
-5- The XCellRangeData interface only has two methods (afaik): .getDataArray and .setDataArray(aDA), which are often used and needed when working with ranges in spreadsheets.

In spreadsheets, ranges with merged cells also keep their unchanged simple structure in the background, and splitting isn't supported at all. Text tables are a mess. Ah, yes, I already said that.

Re: Extracting the text from the table present in .docx file

Post by Lupp »

Completion (kind of)

I don't use Java.

Whatever programming language is used, the relevant work needs to be done directly on the basis of the OOo/AOO/LibO API.

Doing the rest with Basic, the following code creates an array with one pair (indices 0, 1) per TextTable occurring in the document. Index 0 holds the table name and index 1 holds an array for the cells of that TextTable, in which the cell names and the cell contents are stored in the obvious way (again indices 0, 1).

This should work regardless of whether the TextTable is "simple" or "complex".

Code:

Sub contentsFromTextTables()
tDoc = ThisComponent
textTables = tDoc.TextTables
countTt = textTables.Count
Dim contentsTt(1 To countTt, 0 To 1)
For j = 1 To countTt
  jTable = textTables(j - 1)
  contentsTt(j, 0) = jTable.Name
  jCellNames = jTable.CellNames : jCountCells = Ubound(jCellNames) + 1
  ReDim jContents( 1 To jCountCells, 0 To 1)
  For k = 1 To jCountCells
    kCellName = jCellNames(k - 1)
    jCell = jTable.getCellByName(kCellName)
    jContents(k, 0) = kCellName : jContents(k, 1) = jCell.String
  Next k
  contentsTt(j, 1) = jContents
Next j
REM Use and/or store the result now.
End Sub
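
Once the name/content pairs have been collected this way, they can be regrouped into rows without getDataArray. The Java sketch below is only an illustration in plain Java with no UNO calls. It assumes simple cell names made of letters followed by a row number (A1, B1, A2, ...) and puts any name that does not fit that pattern into a bucket with key 0. The class name CellRows, the method groupByRow and the sample data are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CellRows {
    // Assumption: a simple cell name is letters followed by digits, e.g. "B12".
    private static final Pattern SIMPLE_NAME = Pattern.compile("([A-Za-z]+)(\\d+)");

    /** Groups cell contents by the row number found in each cell name. */
    static Map<Integer, List<String>> groupByRow(Map<String, String> cells) {
        Map<Integer, List<String>> rows = new TreeMap<>();
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            Matcher m = SIMPLE_NAME.matcher(cell.getKey());
            // Names that do not match the simple pattern go to bucket 0.
            int row = m.matches() ? Integer.parseInt(m.group(2)) : 0;
            rows.computeIfAbsent(row, r -> new ArrayList<>()).add(cell.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented sample; in real use the pairs come from CellNames and .String.
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("A1", "Name");
        cells.put("B1", "Qty");
        cells.put("A2", "Bolt");
        cells.put("B2", "12");
        groupByRow(cells).forEach((row, values) -> System.out.println(row + ": " + values));
    }
}
```

Within a row, the contents keep the order in which the names were supplied, so the input map should preserve the order of the CellNames sequence.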