How do I find the ToC of a document, and extract its links?

_savage
Posts: 187
Joined: Sun Apr 21, 2013 12:55 am

Post by _savage »

Hi,

While walking all paragraphs of a Writer document and visiting their text content, I come across text portions which have Bookmarks whose name is something like “_Toc263250771”. However, when I walk the paragraphs and text portions that actually make up the ToC of the document itself, I see no indication or reference to these bookmarks.

How do I identify the actual ToC of the document? Where are these “_Toc…” bookmarks referenced? The paragraph’s style name (a string starting with “Content”) seems like a flimsy indicator.
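For reference, the walk described above can be sketched roughly like this. It is only a minimal sketch, assuming `document` is a loaded Writer document obtained through the UNO bridge; the helper name `iter_bookmark_names` is made up for illustration. It relies on each text portion exposing `TextPortionType` and, for bookmark portions, the `Bookmark` and `IsStart` properties.

```python
def iter_bookmark_names(document):
    """Yield (paragraph_text, bookmark_name) for every bookmark found.

    Hypothetical helper: `document` is assumed to be a Writer document.
    """
    par_enum = document.getText().createEnumeration()
    while par_enum.hasMoreElements():
        par = par_enum.nextElement()
        # The enumeration also yields tables; only paragraphs have portions.
        if not par.supportsService("com.sun.star.text.Paragraph"):
            continue
        portion_enum = par.createEnumeration()
        while portion_enum.hasMoreElements():
            portion = portion_enum.nextElement()
            if portion.TextPortionType != "Bookmark":
                continue
            # A bookmark spanning text yields a start and an end portion.
            # Assumption: point bookmarks report IsStart as true, so each
            # bookmark is reported once.
            if portion.IsStart:
                yield par.getString(), portion.Bookmark.getName()
```

Calling `list(iter_bookmark_names(document))` would then give the paragraphs that carry the “_Toc…” bookmarks, which is the side of the association that is easy to find.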

Thanks!
Mac 10.14 using LO 7.2.0.2, Gentoo Linux using LO 7.2.3.2 headless.
FJCC
Moderator
Posts: 9248
Joined: Sat Nov 08, 2008 8:08 pm
Location: Colorado, USA

Post by FJCC »

You can get at the ToC like this.

Code: Select all

  oDocumentIndexes = ThisComponent.getDocumentIndexes()
  oObj1 = oDocumentIndexes.getByName("Table of Contents1")
However, I can't find any way to see the links from the table of contents to the actual headings in the document. I will look some more later and post again if I find something.
OpenOffice 4.1 on Windows 10 and Linux Mint
If your question is answered, please go to your first post, select the Edit button, and add [Solved] to the beginning of the title.

Post by _savage »

FJCC wrote:However, I can't find any way to see the links from the table of contents to the actual headings in the document. I will look some more later and post again if I find something.
Thank you, FJCC, I’m curious what you can dig up. I’ve spent quite some time today inspecting interfaces and services of the paragraphs and text portions of the ToC text itself, but I couldn’t find anything that indicated a reference to the “_Toc…” bookmarks.

Looking at Word, the index object you mention is this one, yes?
[Image: ToC Index Object (toc.jpg)]

Post by _savage »

I poked around some more but can’t find much.

Code: Select all

indices = document.getDocumentIndexes()
count = indices.getCount() # There’s one “index”, the ToC in the document.
for i in range(count):
    index = indices.getByIndex(i)                                                               
    # index is a XDocumentIndex with index.Level, index.Name, etc.
    anchor = index.getAnchor()                                                                  
    # anchor is a XTextRange for which I can find the document page it’s on.
Iterating over the paragraphs gives me XTextRanges as well. Still, I can’t associate from the ToC index to the paragraphs, or from the paragraphs to the ToC index.
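One way to pick out the ToC among the document indexes without relying on its name or on paragraph style names is to ask each index which service it supports. The sketch below assumes that a table of contents supports the `com.sun.star.text.ContentIndex` service; the helper name `find_tables_of_contents` is made up for illustration.

```python
def find_tables_of_contents(document):
    """Return all document indexes that look like a table of contents."""
    indices = document.getDocumentIndexes()
    tocs = []
    for i in range(indices.getCount()):
        index = indices.getByIndex(i)
        # Other index kinds (alphabetical, bibliography, user-defined, ...)
        # are expected to support different services.
        if index.supportsService("com.sun.star.text.ContentIndex"):
            tocs.append(index)
    return tocs
```

This only identifies the index objects; it does not yet relate their entries to the heading paragraphs.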
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Post by Lupp »

Did you dream of something like this?

Code: Select all

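Sub getTOClinks()
' Collect "entry text :: hyperlink URL" for every paragraph of the ToC.
Dim lA(0) As String
j = 0
tD = ThisComponent
tTOC = tD.GetDocumentIndexes.GetByName("Table of Contents1")
' A cursor spanning the ToC's anchor lets us enumerate its paragraphs.
tCursor = tTOC.Anchor.Text.CreateTextCursorByRange(tTOC.Anchor)
tEn = tCursor.CreateEnumeration
Do While tEn.HasMoreElements
  oneElA = tEn.NextElement.Anchor
  hURL = oneElA.HyperLinkURL
  lA(j) = oneElA.String & " :: " & hURL
  If hURL = "" Then lA(j) = lA(j) & "<not linked>"
  j = j + 1
  Redim Preserve lA(j) As String
Loop
' Drop the unused trailing slot (only if at least one entry was found).
If j > 0 Then Redim Preserve lA(j - 1)
msg = Join(lA, Chr(10))
MsgBox(msg)
End Sub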
Don't blame me for the sloppy style, please. (And it was the first time in my life that I tampered with a TOC.)
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München

Post by _savage »

Lupp wrote:Did you dream of something like […] Don't blame me for the sloppy style, please. (And it was the first time in my life that I tampered with a TOC.)
Thanks Lupp! That’s a good step, but still doesn’t give me links to the “_Toc…” bookmarks in the text. Here is the Python equivalent of your code:

Code: Select all

indices = document.getDocumentIndexes()                                                         
toc = indices.getByName("Table of Contents1")                                                   
toc_anchor = toc.getAnchor()                                                                        
toc_cursor = toc_anchor.Text.createTextCursorByRange(toc_anchor) # I think toc_anchor.Text = document.
par_enum = toc_cursor.createEnumeration()                                                       
while par_enum.hasMoreElements():                                                               
    par = par_enum.nextElement()  # The par’s Anchor has an empty link as well.
    portion_enum = par.createEnumeration()                                                      
    while portion_enum.hasMoreElements():                                                       
        portion = portion_enum.nextElement()                                                    
        print(portion.getString())                                                              
        print(portion.HyperLinkURL)                                                             
Which prints the ToC line for each entry (so far so good) and then nothing as the portion’s link. So close… :(

I suspect that there aren’t any links, though. While the original DOCX file has clickable page numbers in its ToC, the same DOCX opened in OpenOffice does not. I checked the ToC index of the document (as per these instructions), and judging by the “Structure” settings for all “Levels” there should be links in that ToC. Alas, there aren’t, which might explain the empty link strings in the code above.

After some more tinkering:

Code: Select all

indices = document.getDocumentIndexes()
toc = indices.getByName("Table of Contents1")
toc.update()
Calling update() does two things:
  • It duplicates every entry in the ToC (not good).
  • It updates the links, so that the above code now finds the links attached to their respective text portions (hooray).
Now if that duplication could be avoided?
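Once the portions do carry links, they can be tied back to the “_Toc…” bookmarks along these lines. This is a sketch under assumptions, not a confirmed recipe: it assumes internal links have the form `#` followed by the bookmark name (for example `#_Toc263250771`), and the helper name `toc_links` is made up. It does not call `update()` itself, so on a freshly loaded DOCX it may simply return an empty list.

```python
def toc_links(document, toc_name="Table of Contents1"):
    """Return (entry_text, bookmark_name_or_None) for each linked ToC portion."""
    toc = document.getDocumentIndexes().getByName(toc_name)
    anchor = toc.getAnchor()
    cursor = anchor.getText().createTextCursorByRange(anchor)
    bookmarks = document.getBookmarks()
    results = []
    par_enum = cursor.createEnumeration()
    while par_enum.hasMoreElements():
        par = par_enum.nextElement()
        portion_enum = par.createEnumeration()
        while portion_enum.hasMoreElements():
            portion = portion_enum.nextElement()
            url = portion.HyperLinkURL
            if not url:
                continue  # portion without a link
            # Assumption: an internal link is "#" plus the bookmark name.
            target = url[1:] if url.startswith("#") else None
            if target is not None and not bookmarks.hasByName(target):
                target = None  # link does not point at a known bookmark
            results.append((portion.getString(), target))
    return results
```

For every bookmark name returned, `document.getBookmarks().getByName(name).getAnchor()` gives the text range of the heading the entry points to.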

Post by Lupp »

_savage wrote:...but still doesn’t give me links to the “_Toc…” bookmarks in the text.
I obviously don't understand. I can use the links extracted by the posted code to create working hyperlinks in the text to jump to the respective headings.
Analysing the text (drawpage objects aside), I get the same info, as far as I can see. Very raw:

Code: Select all

Sub getBookMarkLinks()
On Error Resume Next
tText=ThisComponent.Text
tTE=tText.CreateEnumeration
Do While tTE.HasMoreElements
 oneTEl=tTE.NextElement
 oneEnum=oneTEl.CreateEnumeration
 Do While oneEnum.HasMoreElements
  oneSubEl=oneEnum.NextElement
  MsgBox(oneSubEl.BookMark.Anchor.String & " :: " & oneSubEl.BookMark.LinkDisplayName)
 Loop
Loop
End Sub
Or do you want to get the links created automatically for jumping to the TOC entries?
Names are accessible via ThisComponent.Links.Headings, but I don't know how to get bookmark-like references. I even doubt if they exist.
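In Python, the heading names could be listed roughly as follows. This is a sketch under assumptions: it assumes the document's link targets are grouped into families reachable through `getLinks()`, that the family for headings is called "Headings" (the name may be localized in non-English installations), and that each family again exposes `getLinks()`. The helper name `heading_link_names` is made up.

```python
def heading_link_names(document, family="Headings"):
    """Return the names of the link targets in the given family."""
    families = document.getLinks()
    # Assumption: the family name matches the UI language of the office.
    if not families.hasByName(family):
        return []
    targets = families.getByName(family).getLinks()
    return list(targets.getElementNames())
```

As noted above, these are names only; they are not bookmark-like references to text ranges.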

Post by _savage »

Lupp wrote:I obviously don't understand. I can use the links extracted by the posted code to create working hyperlinks in the text to jump to the respective headings. […]
I think my last post might have been somewhat unclear. At first it wasn’t working. But after I called the update() function on the ToC index object the links appeared all over the document.

However, all entries in the ToC are now duplicated! The same duplication happens when I update the ToC in the UI by right-clicking on it. Seems like a bug to me.
Lupp wrote:Or do you want to get the links created automatically for jumping to the TOC entries?
To save the call to update()? If that’s possible, yes.
Lupp wrote:Names are accessible via ThisComponent.Links.Headings, but I don't know how to get bookmark-like references. I even doubt if they exist.
Great, thanks, I’ll take a look at that.

Post by Lupp »

_savage wrote:However, all entries in the ToC are now duplicated! The same duplication happens when I update the ToC in the UI by right-clicking on it. Seems like a bug to me.
I cannot reproduce this with AOO 4.1.3 or with recent versions of LibreOffice (5.4.3, 6.0.0.0beta1). I don't remember ever having experienced this behaviour.
May I suggest you add a signature to your account stating the version of LibO / AOO and the OS you are working with?
_savage wrote:To save the call to update()? If that’s possible, yes.
Again, I don't understand. How? Besides, I don't need an extra update. The Sub works on the TOC as it is. (Of course your Sub will make different and additional use of the results.)

Post by _savage »

Lupp wrote:Cannot reproduce this with AOO 4.1.3 nor with recent versions of LibreOffice (5.4.3, 6.0.0.0beta1). Don't remember to have experienced this behaviour at any time.
Take a look at this example document:
Attachment: doubletoc.docx (Update ToC to duplicate its entries.)
It works fine in Word; now open it in LO (I’m currently running 5.3.6.1 and 5.4.3.2). Freshly loaded, the ToC index object contains no links and only single entries. Then update it (either call update() on the object, or right-click in the UI) and all entries are duplicated. Here is a screenshot of the document’s ToC as loaded and after the update:
[Image: Before and after ToC update]
PS: It seems that the duplication happens only at the top-level; headings nested below Heading 1 don’t get duplicated. I filed a bug report with LO.
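To check from a script which entries the update adds, the ToC text can be compared before and after the call. This is a minimal sketch; the helper names `toc_entry_texts` and `entries_added_by_update` are made up, and the comparison assumes the entry text itself (including page numbers) does not otherwise change during the update.

```python
from collections import Counter

def toc_entry_texts(toc):
    """Return the text of every paragraph inside the ToC."""
    anchor = toc.getAnchor()
    cursor = anchor.getText().createTextCursorByRange(anchor)
    texts = []
    par_enum = cursor.createEnumeration()
    while par_enum.hasMoreElements():
        texts.append(par_enum.nextElement().getString())
    return texts

def entries_added_by_update(toc):
    """Call toc.update() and return a Counter of the entries it added."""
    before = Counter(toc_entry_texts(toc))
    toc.update()
    after = Counter(toc_entry_texts(toc))
    # Counter subtraction keeps only entries that occur more often afterwards.
    return after - before
```

An empty result would mean the update added no entries; with the example document it should list the duplicated top-level entries.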
Last edited by _savage on Sat Dec 09, 2017 2:55 pm, edited 1 time in total.

Post by Lupp »

_savage wrote:...and 5.4.3.2
Your signature says 5.3.4.2 (at the moment).
attachment wrote:doubletoc.docx
What do you expect of docx? OK, I have zero or less experience with documents loaded from docx. Why do you think you should use it? Errors and surprises are to be expected. (Even MS software is reported to be unable to open OOXML, strict or "transitional", in certain cases. See https://flosmind.wordpress.com/2017/05/ ... el-cannot/ for example.)

In fact your example produces the TOC-doubling effect for me in LibO V5.4.3 and in V6.0.0.0beta1, too, but not in AOO V4.1.3.

Anyway: why tamper with documents in alien "formats"? Different formats can never contain the same real thing. They are limited to similar appearance and a certain degree of compatibility during work. There is a saying that LibO is "better" at docx than AOO. As you see, there are exceptions. Of course the phenomena may depend on which software wrote the file. As I cannot test in reasonable detail with files written by 'MS Office', I will not comment again on issues related to such alien "formats". The one correctly approved family of open document formats is ODF. Accepting OOXML under the "open" label as well was a political issue and a terrible mistake, imo. It is mainly a means for the well-known commercial "vendor" to fight free and open software.

Post by _savage »

Thanks, Lupp, for the conversation so far…

I’d like to pick up this conversation once more. Suppose that, while iterating over the paragraphs of a document, the current paragraph is one that was generated for the ToC, meaning that the following assertion is True:

Code: Select all

indices = document.getDocumentIndexes()
index = indices.getByName("Table of Contents1")
# … iterate over the paragraphs, and then:
assert index == par.DocumentIndex  # A par in the ToC has its DocumentIndex refer to the ToC index.
However, I fail to find any reference or association from this ToC entry paragraph to the original heading paragraph from which it was created. Similarly, I fail to find a reference from any of the heading paragraphs to their related ToC entry paragraphs.
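The assertion above can be turned into a small helper that collects all paragraphs belonging to a given index. This is a minimal sketch, assuming that paragraphs outside any index report an empty `DocumentIndex`; the helper name `paragraphs_of_index` is made up.

```python
def paragraphs_of_index(document, index):
    """Return the paragraphs whose DocumentIndex property refers to `index`."""
    found = []
    par_enum = document.getText().createEnumeration()
    while par_enum.hasMoreElements():
        par = par_enum.nextElement()
        # Skip tables and other non-paragraph content.
        if not par.supportsService("com.sun.star.text.Paragraph"):
            continue
        # Assumption: DocumentIndex is None for paragraphs outside an index.
        if par.DocumentIndex is not None and par.DocumentIndex == index:
            found.append(par)
    return found
```

This gives the ToC entry paragraphs as a group, but still says nothing about which heading each entry came from.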

Question: is there any kind of association available between a ToC entry paragraph and the heading paragraph it was created from?

And also: paragraphs in a ToC index have style names “Content n”, where n is the equivalent of the OutlineLevel of the associated heading paragraph (I am unsure whether this is a reliable constant, though). Other than in the style name, does a ToC entry paragraph keep its level value stored someplace else?
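Lacking a documented association, one fallback is a heuristic: read the level from the paragraph style name and match an entry to a heading of the same outline level whose text occurs in the entry. This is only a sketch of that heuristic, not an API-level link; the helper names are made up, the pattern accepts both "Content n" and "Contents n", and entries with identical titles at the same level cannot be told apart this way.

```python
import re

# Matches style names such as "Contents 1" or "Content 2".
_TOC_STYLE = re.compile(r"^Contents? (\d+)$")

def toc_entry_level(par):
    """Return the level encoded in a ToC paragraph's style name, or None."""
    match = _TOC_STYLE.match(par.ParaStyleName)
    return int(match.group(1)) if match else None

def guess_heading(entry_par, heading_pars):
    """Guess which heading a ToC entry was generated from (heuristic)."""
    level = toc_entry_level(entry_par)
    entry_text = entry_par.getString()
    for heading in heading_pars:
        title = heading.getString()
        # The entry text usually also holds a tab and a page number,
        # so test for containment rather than equality.
        if title and heading.OutlineLevel == level and title in entry_text:
            return heading
    return None
```

Here `heading_pars` would be the document paragraphs with a non-zero `OutlineLevel`, collected during the same paragraph walk.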