How do I find the ToC of a document, and extract its links?

The Application Programming Interface and the OASIS Open Document Format

Post by _savage » Tue Dec 05, 2017 3:15 am

Hi,

While walking all paragraphs of a Writer document and visiting their text content, I come across text portions which have Bookmarks whose name is something like “_Toc263250771”. However, when I walk the paragraphs and text portions that actually make up the ToC of the document itself, I see no indication or reference to these bookmarks.

How do I identify the actual ToC of the document? Where are these “_Toc…” bookmarks referenced? The paragraph’s style name (a string starting with “Content”) seems like a flimsy indicator.

Thanks!
Mac 10.11 using LO 5.3.6.1, Gentoo Linux using LO 5.3.4.2 headless.

Post by FJCC » Tue Dec 05, 2017 5:36 am

You can get at the ToC like this.
Code:
  oDocumentIndexes = ThisComponent.getDocumentIndexes()
  oObj1 = oDocumentIndexes.getByName("Table of Contents1")

However, I can't find any way to see the links from the table of contents to the actual headings in the document. I will look some more later and post again if I find something.
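The lookup by name only works while the index keeps its default name "Table of Contents1". As a minimal Python sketch (not part of the original posts), the ToC can instead be picked out by its index type: XDocumentIndex offers getServiceName(), and a table of contents reports "com.sun.star.text.ContentIndex". The function name find_tables_of_contents is made up for illustration, and `document` is assumed to be an already loaded Writer document.

```python
CONTENT_INDEX_SERVICE = "com.sun.star.text.ContentIndex"


def find_tables_of_contents(document):
    """Return every document index whose type is a table of contents.

    `document` is assumed to be a loaded Writer document, i.e. an object
    offering getDocumentIndexes() (XDocumentIndexesSupplier).
    """
    indexes = document.getDocumentIndexes()
    found = []
    for i in range(indexes.getCount()):
        index = indexes.getByIndex(i)
        # getServiceName() reports the service the index was created as,
        # so alphabetical or user-defined indexes are skipped here.
        if index.getServiceName() == CONTENT_INDEX_SERVICE:
            found.append(index)
    return found
```

This avoids relying on the paragraph style name or on the index name, both of which a user can change.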
AOO 3.4 or 4.1 on MS Windows XP ( before 2013-08-03) or Windows 7

Post by _savage » Tue Dec 05, 2017 12:18 pm

FJCC wrote:However, I can't find any way to see the links from the table of contents to the actual headings in the document. I will look some more later and post again if I find something.

Thank you, FJCC, I’m curious what you can dig up. I’ve spent quite some time today inspecting interfaces and services of the paragraphs and text portions of the ToC text itself, but I couldn’t find anything that indicated a reference to the “_Toc…” bookmarks.

Looking at Word, the index object you mention is this one, yes?

[Attachment: toc.jpg, ToC Index Object]

Post by _savage » Thu Dec 07, 2017 6:52 am

I poked around some more but can’t find much.

Code:
indices = document.getDocumentIndexes()
count = indices.getCount()  # There’s one “index”, the ToC in the document.
for i in range(count):
    index = indices.getByIndex(i)
    # index is a XDocumentIndex with index.Level, index.Name, etc.
    anchor = index.getAnchor()
    # anchor is a XTextRange for which I can find the document page it’s on.

Iterating over the paragraphs gives me XTextRanges as well. Still, I can’t map the ToC index to its paragraphs, or the paragraphs back to the ToC index.
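One possible way to go from a paragraph back to its index, sketched here under an assumption: Writer text ranges are documented to carry an optional, read-only "DocumentIndex" property that returns the enclosing index when the range lies inside one. Whether every paragraph object in a given LO version exposes it has not been verified in this thread, so the lookup is wrapped defensively. The helper names are hypothetical, and index.Name is used as in the snippet above.

```python
def index_of_paragraph(paragraph):
    """Return the document index enclosing `paragraph`, or None.

    Assumption: the object exposes the optional "DocumentIndex" property.
    Elements without it (text tables, for example) are treated as
    "not inside an index".
    """
    try:
        return paragraph.getPropertyValue("DocumentIndex")
    except Exception:
        return None


def paragraphs_in_index(document, index_name):
    """Collect the text of every body paragraph that sits inside the named index."""
    texts = []
    enum = document.getText().createEnumeration()
    while enum.hasMoreElements():
        element = enum.nextElement()
        index = index_of_paragraph(element)
        if index is not None and index.Name == index_name:
            texts.append(element.getString())
    return texts
```

Called as paragraphs_in_index(document, "Table of Contents1"), this should return only the lines that make up the ToC, if the assumption about the property holds.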

Post by Lupp » Thu Dec 07, 2017 12:29 pm

Did you dream of something like
Code:
Sub getTOClinks()
  ' Collect "entry text :: link URL" for every paragraph of the ToC.
  Dim lA(0) As String
  j = 0
  tD = ThisComponent
  tTOC = tD.GetDocumentIndexes.GetByName("Table of Contents1")
  ' A cursor spanning the ToC anchor allows enumerating its paragraphs.
  tCursor = tTOC.Anchor.Text.CreateTextCursorByRange(tTOC.Anchor)
  tEn = tCursor.CreateEnumeration
  Do While tEn.HasMoreElements
    oneElA = tEn.NextElement.Anchor
    hURL = oneElA.HyperLinkURL
    lA(j) = oneElA.String & " :: " & hURL
    If hURL = "" Then lA(j) = lA(j) & "<not linked>"
    j = j + 1
    Redim Preserve lA(j) As String
  Loop
  ' Drop the unused trailing slot.
  Redim Preserve lA(j-1)
  msg = Join(lA, Chr(10))
  MsgBox(msg)
End Sub

Don't blame me for the sloppy style, please. (And it was the first time in my life that I tampered with a TOC.)
On Windows 10: LibreOffice 5.4.2 and older versions, PortableOpenOffice 4.1.3 and older, StarOffice 5.2
---
Maybe we might! (Create a powerful UFO: United Free Office)
Lupp from München

Post by _savage » Thu Dec 07, 2017 1:13 pm

Lupp wrote:Did you dream of something like […] Don't blame me for the sloppy style, please. (And it was the first time in my life that I tampered with a TOC.)

Thanks Lupp! That’s a good step, but still doesn’t give me links to the “_Toc…” bookmarks in the text. Here is the Python equivalent of your code:

Code:
indices = document.getDocumentIndexes()
toc = indices.getByName("Table of Contents1")
toc_anchor = toc.getAnchor()
toc_cursor = toc_anchor.Text.createTextCursorByRange(toc_anchor)  # I think toc_anchor.Text = document.
par_enum = toc_cursor.createEnumeration()
while par_enum.hasMoreElements():
    par = par_enum.nextElement()  # The par’s Anchor has an empty link as well.
    portion_enum = par.createEnumeration()
    while portion_enum.hasMoreElements():
        portion = portion_enum.nextElement()
        print(portion.getString())
        print(portion.HyperLinkURL)

This prints the text of each ToC entry (so far so good), followed by an empty string for the portion’s link. So close… :(

I suspect that there aren’t any links, though. While the original DOCX file has clickable page numbers in its TOC, the same DOCX opened in OpenOffice does not. I checked the ToC index of the document (as per these instructions), and according to the “Structure” settings for all “Levels” there should be links in that ToC. Alas, there aren’t, which might explain the empty link strings in the code above.

After some more tinkering:

Code:
indices = document.getDocumentIndexes()
toc = indices.getByName("Table of Contents1")
toc.update()

Calling update() does two things:

  • It duplicates every entry in the ToC (not good).
  • It updates the links, so that the above code now finds the links attached to their respective text portions (hooray).
Now if that duplication could be avoided?
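This does not prevent the duplication inside the document, but the extracted results can be cleaned up afterwards. A minimal sketch, assuming the entries were collected as (text, url) tuples and that a duplicated ToC line carries exactly the same text and link as the original; the function name is hypothetical.

```python
def drop_duplicate_entries(entries):
    """Remove repeated (text, url) pairs, keeping the first-seen order.

    Caveat: two genuinely different headings with identical text and
    identical link would also be collapsed into one.
    """
    seen = set()
    unique = []
    for entry in entries:
        if entry not in seen:
            seen.add(entry)
            unique.append(entry)
    return unique
```

The pairs would come from the portion loop above, e.g. by appending (portion.getString(), portion.HyperLinkURL) to a list instead of printing.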

Post by Lupp » Thu Dec 07, 2017 3:01 pm

_savage wrote:...but still doesn’t give me links to the “_Toc…” bookmarks in the text.
I obviously don't understand. I can use the links extracted by the posted code to create working hyperlinks in the text to jump to the respective headings.
Analysing the text (drawpage objects aside) I get the same info, as far as I can see. Very raw:
Code:
Sub getBookMarkLinks()
  ' Most text portions carry no bookmark; errors raised for those are ignored.
  On Error Resume Next
  tText = ThisComponent.Text
  tTE = tText.CreateEnumeration
  Do While tTE.HasMoreElements
    oneTEl = tTE.NextElement
    oneEnum = oneTEl.CreateEnumeration
    Do While oneEnum.HasMoreElements
      oneSubEl = oneEnum.NextElement
      MsgBox(oneSubEl.BookMark.Anchor.String & " :: " & oneSubEl.BookMark.LinkDisplayName)
    Loop
  Loop
End Sub

Or do you want to get the links created automatically for jumping to the TOC entries?
Names are accessible via ThisComponent.Links.Headings, but I don't know how to get bookmark-like references. I even doubt if they exist.
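For the “_Toc…” bookmarks specifically, the document's bookmark collection can be read directly via getBookmarks() (XBookmarksSupplier) and matched against the link strings. The following Python sketch assumes that a ToC link has the plain form "#" followed by the bookmark name; links in other forms (for example with a "|outline" suffix) are not handled. Both function names are made up for illustration.

```python
def bookmark_targets(document):
    """Map each bookmark name to the text covered by its anchor."""
    bookmarks = document.getBookmarks()
    return {
        name: bookmarks.getByName(name).getAnchor().getString()
        for name in bookmarks.getElementNames()
    }


def resolve_link(url, targets):
    """Return (bookmark name, anchor text) for an internal '#name' link.

    Returns None for empty links, external links, and names that are
    not present among the document's bookmarks.
    """
    if not url.startswith("#"):
        return None
    name = url[1:]
    if name in targets:
        return name, targets[name]
    return None
```

With targets = bookmark_targets(document), each HyperLinkURL found in the ToC can be passed to resolve_link(url, targets) to see which bookmark, if any, it points at.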

Post by _savage » Thu Dec 07, 2017 8:39 pm

Lupp wrote:I obviously don't understand. I can use the links extracted by the posted code to create working hyperlinks in the text to jump to the respective headings. […]

I think my last post might have been somewhat unclear. At first it wasn’t working. But after I called the update() function on the ToC index object the links appeared all over the document.

However, all entries in the ToC are now duplicated! The same duplication happens when I update the ToC in the UI by right-clicking on it. Seems like a bug to me.

Lupp wrote:Or do you want to get the links created automatically for jumping to the TOC entries?

To save the call to update()? If that’s possible, yes.

Lupp wrote:Names are accessible via ThisComponent.Links.Headings, but I don't know how to get bookmark-like references. I even doubt if they exist.

Great, thanks, I’ll take a look at that.

Post by Lupp » Fri Dec 08, 2017 1:23 am

_savage wrote:However, all entries in the ToC are now duplicated! The same duplication happens when I update the ToC in the UI by right-clicking on it. Seems like a bug to me.
Cannot reproduce this with AOO 4.1.3 nor with recent versions of LibreOffice (5.4.3, 6.0.0.0beta1). Don't remember to have experienced this behaviour at any time.
May I suggest you create a signature to your account telling the Version of LibO / AOO and the OS you are working with?
_savage wrote:To save the call to update()? If that’s possible, yes.
Again, I don't understand. How? Besides, I don't need an extra update. The Sub works on the TOC as it is. (Of course your Sub will contain different and additional use of the results.)

Post by _savage » Fri Dec 08, 2017 1:48 am

Lupp wrote:Cannot reproduce this with AOO 4.1.3 nor with recent versions of LibreOffice (5.4.3, 6.0.0.0beta1). Don't remember to have experienced this behaviour at any time.

Take a look at this example document:

[Attachment: doubletoc.docx (37.56 KiB), update the ToC to duplicate its entries]

It works fine in Word; now open it in LO (I’m currently running 5.3.6.1 and 5.4.3.2). Freshly loaded, the ToC index object contains no links and only single entries. Then update it (either call update() on the object, or right-click in the UI) and all entries are duplicated. Here is a screenshot of the document’s ToC as loaded and after the update:

[Attachment: doubletoc.jpg, before and after ToC update]

PS: It seems that the duplication happens only at the top-level; headings nested below Heading 1 don’t get duplicated. I filed a bug report with LO.

Post by Lupp » Fri Dec 08, 2017 12:02 pm

_savage wrote:...and 5.4.3.2
Your signature tells 5.3.4.2 (at the moment).
attachment wrote:doubletoc.docx
What do you expect of docx? Ok, I have zero or less experience with documents loaded from docx. Why do you think you should use it? Errors and surprises are to be expected. (And even MS software is reported to not be able to open OOXML - strict or "transitional"? - in certain cases. See https://flosmind.wordpress.com/2017/05/18/libreoffice-can-open-xlsx-files-excel-cannot/ e.g.)

In fact your example produces the TOC-doubling effect for me in LibO V5.4.3 and in V6.0.0.0beta1, too, but not in AOO V4.1.3.

Anyway: Why tamper with documents in alien "formats"? Different formats can never contain the same real thing. They are limited to similar appearance and a certain degree of compatibility during work. There is a saying that LibO is "better" in docx than AOO. As you see, there are exceptions. Of course phenomena may depend on which software wrote the file. As I cannot test in reasonable detail with files written by 'MS Office', I will not comment again on issues related to such alien "formats".

The one correctly approved family of open document formats is ODF. To also accept OOXML under the "open" label was a political issue, and a terrible mistake, imo. It mainly is a means to fight free and open software by the well-known commercial "vendor".