Extracting the text from the table present in .docx file

Post by **vivekHavalad** » Sat Oct 10, 2020 10:53 pm

Hi All,
I am using Java UNO to load a .docx document. I followed the reference at https://wiki.openoffice.org/wiki/API/Sa ... tStructure, but I am not able to extract the text that sits inside a table in the .docx file. Below is the code for reference:

Code: Select all

XTextTable table = Lo.createInstanceMSF(XTextTable.class,
 "com.sun.star.text.TextTable");
XTableRows rows = table.getRows();
for (int y=1; y < rows; y++) { 
  System.out.println("Table:" + rows[y]); 
}

Post by **John_Ha** » Sun Oct 11, 2020 1:09 am

Record a macro to do it and then hack the macro?

Post by **FJCC** » Sun Oct 11, 2020 1:12 am

I do not know Java, but it looks like you are creating a new table with the createInstance() method instead of fetching the one in the document. You are also trying to use the rows variable as if it were an integer, but getRows() returns a container of row objects, not a count of the rows. I recorded some Java code using the MRI extension. It gets by name a table that exists in the document and extracts its DataArray, which has the content of each cell. Is that the sort of thing you are trying to do?

Code: Select all

import com.sun.star.container.NoSuchElementException;
import com.sun.star.container.XNameAccess;
import com.sun.star.lang.WrappedTargetException;
import com.sun.star.sheet.XCellRangeData;
import com.sun.star.text.XTextTable;
import com.sun.star.text.XTextTablesSupplier;
import com.sun.star.uno.RuntimeException;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.uno.XComponentContext;

public static void snippet(XComponentContext xComponentContext, Object oInitialTarget)
{
	try
	{
		XTextTablesSupplier xTextTablesSupplier = UnoRuntime.queryInterface(
			XTextTablesSupplier.class, oInitialTarget);
		XNameAccess xNameAccess = xTextTablesSupplier.getTextTables();
		
		XTextTable xTextTable = UnoRuntime.queryInterface(
			XTextTable.class, xNameAccess.getByName("Table1"));
		
		XCellRangeData xCellRangeData = UnoRuntime.queryInterface(
			XCellRangeData.class, xTextTable);
		Object[][] oDataArray = xCellRangeData.getDataArray();
		
	}
	catch (NoSuchElementException e1)
	{
		// getByName
		e1.printStackTrace();
	}
	catch (WrappedTargetException e2)
	{
		// getByName
		e2.printStackTrace();
	}
	catch (RuntimeException e3)
	{
		// getByName
		e3.printStackTrace();
	}
}
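
Editor's sketch: the snippet above stops once oDataArray has been fetched. The array holds one inner array per table row, so turning it into text is plain Java. The following minimal example shows one way to do that. It deliberately uses no UNO classes so that it runs on its own; the names DataArrayText and toLines are made up for illustration, and it assumes every element is a String, a Double, or null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

class DataArrayText {
    /** Converts a row-major cell array into one tab-separated line per row. */
    static List<String> toLines(Object[][] dataArray) {
        List<String> lines = new ArrayList<>();
        for (Object[] row : dataArray) {
            StringJoiner joiner = new StringJoiner("\t");
            for (Object cell : row) {
                // Assumption: cells hold String or Double values; null becomes an empty field.
                joiner.add(cell == null ? "" : String.valueOf(cell));
            }
            lines.add(joiner.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the result of xCellRangeData.getDataArray().
        Object[][] sample = {
            {"Name", "Qty"},
            {"Bolt", 12.0}
        };
        toLines(sample).forEach(System.out::println);
    }
}
```

In the real snippet, toLines(oDataArray) would be called right after getDataArray().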

Post by **Lupp** » Sun Oct 11, 2020 1:43 am

If the .docx file the OQ mentioned was loaded successfully by OOo, AOO or LibO, the usage of the API does not depend on the "format" of the source file. What is in RAM is simply a TextDocument, and the software supports all the respective services for it.

The information you (the OQ) found may be slightly misleading. It's true that TextTable objects are returned like paragraphs if the first-level objects contained in the text get enumerated. However TextTable objects don't support specific paragraph services. To the contrary they contain cells which in turn can enumerate paragraphs - and interspersed TextTable objects again.

All the TextTable objects contained in a TextDocument are also accessible by index from the .TextTables object of the document. Their cells are then accessible by name, and the names are found in the .CellNames sequence of the table. Accessing them one by one using something like myCell = myTextTable.getCellByName(theCellNameIFound), you can either analyze them for paragraphs and additional TextContent (TextTable, TextFrame, ...) or, for example, just ask them for their .String property.

You see: TextTable objects are really simple and handy. I love them. Joking aside:
The attached example demonstrates the usage of some of the mentioned objects and methods in Basic. Java can surely do everything better and simpler. (Sorry. A joke again.)

textTableStrings.odt: (23.22 KiB) Downloaded 170 times

(There are 2 cells containing a second paragraph. You see: if you are only interested in strings, you can also split them into parts per paragraph without enumerating the text.)

The raw Basic code:

Code: Select all

Sub getTextTableCellStrings
doc = ThisComponent
tTable = doc.TextTables(0)
cellNames = tTable.cellNames
u = Ubound(cellNames)
Dim tStrings(u, 1)
For j = 0 To u
  j_cName = cellNames(j)
  j_cell = tTable.getCellByName(j_cName)
  tStrings(j, 0) = j_cName
  tStrings(j, 1) = j_cell.String
  If InStr(j_cell.String, Chr(13) & Chr(10))<>0 Then Print "Internal new paragraph In cell: " & j_cName & "!"
Next j
Print "You can now evaluate the array ""tStrings"" for the contained strings." 
End Sub
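
Editor's sketch: since the original question is about Java, the loop in the macro above could be ported roughly as follows. This is a sketch under assumptions, not tested against a running office. The TableView interface is a hypothetical stand-in so that the example runs by itself. In real UNO code, cellNames() would wrap XTextTable.getCellNames(), and cellText() would wrap XTextTable.getCellByName(name), with the returned cell queried for XText and then asked for getString().

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CellStrings {
    /** Hypothetical stand-in for the two table calls the macro relies on. */
    interface TableView {
        String[] cellNames();          // would wrap XTextTable.getCellNames()
        String cellText(String name);  // would wrap getCellByName(name) + XText.getString()
    }

    /** Collects cell name -> cell text in the order the names are reported. */
    static Map<String, String> collect(TableView table) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : table.cellNames()) {
            result.put(name, table.cellText(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake two-cell table so the sketch runs without an office instance.
        Map<String, String> fake = new LinkedHashMap<>();
        fake.put("A1", "Header");
        fake.put("B1", "Value");
        TableView table = new TableView() {
            public String[] cellNames() { return fake.keySet().toArray(new String[0]); }
            public String cellText(String name) { return fake.get(name); }
        };
        collect(table).forEach((name, text) -> System.out.println(name + " = " + text));
    }
}
```

A LinkedHashMap is used so that the cells come back in the same order as the names, which mirrors the tStrings array of the macro.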

Post by **vivekHavalad** » Mon Nov 02, 2020 3:16 pm

Thank you all for the support, but I am now getting this exception:

com.sun.star.uno.RuntimeException: Table too complex

It happens while executing:

Code: Select all

XCellRangeData xCellRangeData = UnoRuntime.queryInterface(
         XCellRangeData.class, xTextTable);

Post by **Lupp** » Mon Nov 02, 2020 4:10 pm

-1- I cannot tell anything reliable concerning OOo 3.1.
-2- Text tables are generally a mess if you try to work with them via the API.
-3- I would suspect every table not being fully simple (n rows x m columns) to be "too complex".
-4- Complex tables as I understand the term are tables having merged or split cells. Just for fun I tried sometimes to work with that kind of tables in text documents using the API. It's a nightmare.
-5- The XCellRangeData interface only knows two methods (afaik): .getDataArray and .setDataArray(aDA) which are often used and needed when working with ranges in spreadsheets.

In spreadsheets, ranges with merged cells also keep their unchanged simple structure in the background, and splitting isn't supported at all. Text tables are a mess. Ah, yes, I already said that.

Post by **Lupp** » Mon Nov 02, 2020 11:02 pm

Completion (kind of)

I don't use Java.

Whatever programming language is used, the relevant work needs to be done directly on the basis of the OOo/AOO/LibO API.

Doing the rest in Basic, the following code creates an array with one pair (indices 0 and 1) per TextTable in the document: the table name, and an array for the cells of that TextTable. The cell array in turn stores each cell's name and content, again at indices 0 and 1.

This should work independent of whether the TextTable is "simple" or "complex".

Code: Select all

Sub contentsFromTextTables()
tDoc = ThisComponent
textTables = tDoc.TextTables
countTt = textTables.Count
Dim contentsTt(1 To countTt, 0 To 1)
For j = 1 To countTt
  jTable = textTables(j - 1)
  contentsTt(j, 0) = jTable.Name
  jCellNames = jTable.CellNames : jCountCells = Ubound(jCellNames) + 1
  ReDim jContents( 1 To jCountCells, 0 To 1)
  For k = 1 To jCountCells
    kCellName = jCellNames(k - 1)
    jCell = jTable.getCellByName(kCellName)
    jContents(k, 0) = kCellName : jContents(k, 1) = jCell.String
  Next k
  contentsTt(j, 1) = jContents
Next j
REM Use and/or store the result now.
End Sub
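
Editor's sketch: taken together, the thread suggests trying the bulk read through the DataArray first and falling back to the cell-by-cell read when the table is reported as too complex. A minimal Java sketch of that pattern follows. The class TableTextFallback and both suppliers are hypothetical, and java.lang.RuntimeException stands in here for the com.sun.star.uno.RuntimeException reported above. In real code the first supplier would call getDataArray() and the second would loop over the cell names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class TableTextFallback {
    /**
     * Returns the result of the bulk read, or the result of the per-cell read
     * if the bulk read throws a RuntimeException.
     */
    static List<String> read(Supplier<List<String>> bulkRead,
                             Supplier<List<String>> perCellRead) {
        try {
            return bulkRead.get();
        } catch (RuntimeException e) {
            // Assumption: stands in for "com.sun.star.uno.RuntimeException: Table too complex".
            return perCellRead.get();
        }
    }

    public static void main(String[] args) {
        // Simulated complex table: the bulk read fails, the per-cell read succeeds.
        List<String> texts = read(
            () -> { throw new RuntimeException("Table too complex"); },
            () -> Arrays.asList("A1: merged header", "A2: body"));
        texts.forEach(System.out::println);
    }
}
```

Catching only around the bulk read keeps any failure of the per-cell path visible to the caller instead of silently swallowing it.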
