How do I find the ToC of a document, and extract its links?
Hi,
While walking all paragraphs of a Writer document and visiting their text content, I come across text portions which have Bookmarks whose name is something like “_Toc263250771”. However, when I walk the paragraphs and text portions that actually make up the ToC of the document itself, I see no indication or reference to these bookmarks.
How do I identify the actual ToC of the document? Where are these “_Toc…” bookmarks referenced? The paragraph’s style name (a string starting with “Content”) seems like a flimsy indicator.
Thanks!
Mac 10.14 using LO 7.2.0.2, Gentoo Linux using LO 7.2.3.2 headless.
Re: How do I find the ToC of a document, and extract its lin
You can get at the ToC like this.
However, I can't find any way to see the links from the table of contents to the actual headings in the document. I will look some more later and post again if I find something.
Code: Select all
oDocumentIndexes = ThisComponent.getDocumentIndexes()
oObj1 = oDocumentIndexes.getByName("Table of Contents1")
OpenOffice 4.1 on Windows 10 and Linux Mint
If your question is answered, please go to your first post, select the Edit button, and add [Solved] to the beginning of the title.
Re: How do I find the ToC of a document, and extract its lin
FJCC wrote: However, I can't find any way to see the links from the table of contents to the actual headings in the document. I will look some more later and post again if I find something.
Thank you, FJCC, I’m curious what you can dig up. I’ve spent quite some time today inspecting interfaces and services of the paragraphs and text portions of the ToC text itself, but I couldn’t find anything that indicated a reference to the “_Toc…” bookmarks.
Looking at Word then the index object you mention is this one, yes?
Re: How do I find the ToC of a document, and extract its lin
I poked around some more but can’t find much.
Iterating over the paragraphs gives me XTextRanges as well. Still, I can’t associate from the ToC index to the paragraphs, or from the paragraphs to the ToC index.
Code: Select all
indices = document.getDocumentIndexes()
count = indices.getCount()  # There’s one “index”, the ToC in the document.
for i in range(count):
    index = indices.getByIndex(i)
    # index is an XDocumentIndex with index.Level, index.Name, etc.
    anchor = index.getAnchor()
    # anchor is an XTextRange for which I can find the document page it’s on.
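One way to avoid hard-coding the name "Table of Contents1" might be to ask each index which service it implements. This is only a minimal sketch: it assumes a ToC reports the `com.sun.star.text.ContentIndex` service through `supportsService()`, and the helper name `find_tocs` is made up for illustration.

```python
TOC_SERVICE = "com.sun.star.text.ContentIndex"  # assumed service name of a ToC


def find_tocs(document):
    """Return all document indexes that claim to be a table of contents."""
    indices = document.getDocumentIndexes()
    tocs = []
    for i in range(indices.getCount()):
        index = indices.getByIndex(i)
        # Other index kinds (alphabetical, user-defined, ...) are skipped.
        if index.supportsService(TOC_SERVICE):
            tocs.append(index)
    return tocs
```

Called as `find_tocs(document)`, this should return a list with one entry for a document that has a single ToC, whatever that ToC happens to be named.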
Re: How do I find the ToC of a document, and extract its lin
Did you dream of something like
Don't blame me for the sloppy style, please. (And it was the first time in my life that I tampered with a TOC.)
Code: Select all
Sub getTOClinks()
    Dim lA(0) As String
    j = 0
    tD = ThisComponent
    tTOC = tD.GetDocumentIndexes.GetByName("Table of Contents1")
    ' Cursor spanning the whole ToC, used to enumerate its paragraphs
    tCursor = tTOC.Anchor.Text.CreateTextCursorByRange(tTOC.Anchor)
    tEn = tCursor.CreateEnumeration
    Do While tEn.HasMoreElements
        oneElA = tEn.NextElement.Anchor
        hURL = oneElA.HyperLinkURL
        ' One output line per ToC entry: visible text and link target
        lA(j) = oneElA.String & " :: " & hURL
        If hURL = "" Then lA(j) = lA(j) & "<not linked>"
        j = j + 1
        Redim Preserve lA(j) As String
    Loop
    Redim Preserve lA(j-1)
    msg = Join(lA, Chr(10))
    MsgBox(msg)
End Sub
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
Re: How do I find the ToC of a document, and extract its lin
Lupp wrote: Did you dream of something like […] Don't blame me for the sloppy style, please. (And it was the first time in my life that I tampered with a TOC.)
Thanks Lupp! That’s a good step, but still doesn’t give me links to the “_Toc…” bookmarks in the text. Here is the Python equivalent of your code:
Code: Select all
indices = document.getDocumentIndexes()
toc = indices.getByName("Table of Contents1")
toc_anchor = toc.getAnchor()
toc_cursor = toc_anchor.Text.createTextCursorByRange(toc_anchor)  # I think toc_anchor.Text = document.
par_enum = toc_cursor.createEnumeration()
while par_enum.hasMoreElements():
    par = par_enum.nextElement()  # The par’s Anchor has an empty link as well.
    portion_enum = par.createEnumeration()
    while portion_enum.hasMoreElements():
        portion = portion_enum.nextElement()
        print(portion.getString())
        print(portion.HyperLinkURL)
I suspect that there aren’t any links, though. While the original DOCX file has clickable page numbers in its ToC, the same DOCX opened in OpenOffice does not. I checked the ToC index of the document (as per these instructions), and based on the “Structure” settings for all “Levels” there should be links in that ToC. Alas, there aren’t, which might explain the empty link strings in the code above.
After some more tinkering:
Code: Select all
indices = document.getDocumentIndexes()
toc = indices.getByName("Table of Contents1")
toc.update()
Calling update() on the ToC has two effects:
- It duplicates every entry in the ToC (not good).
- It updates the links, so that the above code now finds the links attached to their respective text portions (hooray).
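Once update() has populated the links, each HyperLinkURL could in principle be mapped back to a bookmark. The following is a rough sketch, not a confirmed recipe: it assumes the URL is an internal target of the form "#name", optionally followed by a "|type" suffix, and that the name matches a bookmark returned by `document.getBookmarks()`. The helper names `link_target_name` and `resolve_toc_link` are made up for illustration.

```python
def link_target_name(url):
    """Strip the leading '#' and any '|type' suffix from an internal link URL."""
    if not url.startswith("#"):
        return None  # empty string or external link
    return url[1:].split("|", 1)[0]


def resolve_toc_link(document, url):
    """Return the text range the link points at, or None if no bookmark matches."""
    name = link_target_name(url)
    bookmarks = document.getBookmarks()
    if name is None or not bookmarks.hasByName(name):
        return None
    # A bookmark is a text content, so its anchor is the bookmarked range.
    return bookmarks.getByName(name).getAnchor()
```

With this, `resolve_toc_link(document, portion.HyperLinkURL)` would give the range of the “_Toc…” bookmark a ToC portion links to, provided the assumption about the URL format holds for the document at hand.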
Re: How do I find the ToC of a document, and extract its lin
_savage wrote: ...but still doesn’t give me links to the “_Toc…” bookmarks in the text.
I obviously don't understand. I can use the links extracted by the posted code to create working hyperlinks in the text to jump to the respective headings.
Analysing the text (drawpage objects aside) I get the same info, as far as I can see. Very raw:
Code: Select all
Sub getBookMarkLinks()
    On Error Resume Next
    tText = ThisComponent.Text
    tTE = tText.CreateEnumeration
    Do While tTE.HasMoreElements
        oneTEl = tTE.NextElement
        oneEnum = oneTEl.CreateEnumeration
        Do While oneEnum.HasMoreElements
            oneSubEl = oneEnum.NextElement
            MsgBox(oneSubEl.BookMark.Anchor.String & " :: " & oneSubEl.BookMark.LinkDisplayName)
        Loop
    Loop
End Sub
Names are accessible via ThisComponent.Links.Headings, but I don't know how to get bookmark-like references. I even doubt if they exist.
Re: How do I find the ToC of a document, and extract its lin
Lupp wrote: I obviously don't understand. I can use the links extracted by the posted code to create working hyperlinks in the text to jump to the respective headings. […]
I think my last post might have been somewhat unclear. At first it wasn’t working. But after I called the update() function on the ToC index object, the links appeared all over the document.
However, all entries in the ToC are now duplicated! The same duplication happens when I update the ToC in the UI by right-clicking on it. Seems like a bug to me.
Lupp wrote: Or do you want to get the links created automatically for jumping to the TOC entries?
To save the call to update()? If that’s possible, yes.
Lupp wrote: Names are accessible via ThisComponent.Links.Headings, but I don't know how to get bookmark-like references. I even doubt if they exist.
Great, thanks, I’ll take a look at that.
Re: How do I find the ToC of a document, and extract its lin
_savage wrote: However, all entries in the ToC are now duplicated! The same duplication happens when I update the ToC in the UI by right-clicking on it. Seems like a bug to me.
Cannot reproduce this with AOO 4.1.3 nor with recent versions of LibreOffice (5.4.3, 6.0.0.0beta1). I don't remember having experienced this behaviour at any time.
May I suggest you create a signature for your account telling the version of LibO / AOO and the OS you are working with?
_savage wrote: To save the call to update()? If that’s possible, yes.
Once again, I don't understand. How? In addition, I don't need an extra update. The Sub works on the TOC as it is. (Of course your Sub will contain different and additional use of the results.)
Re: How do I find the ToC of a document, and extract its lin
Lupp wrote: Cannot reproduce this with AOO 4.1.3 nor with recent versions of LibreOffice (5.4.3, 6.0.0.0beta1). Don't remember to have experienced this behaviour at any time.
Take a look at this example document:
Works fine in Word, open it in LO (I’m currently running 5.3.6.1 and 5.4.3.2). Freshly loaded, the ToC index object contains no links and only single entries. Then update it (either call update() on the object, or right-click in the UI) and all entries duplicate. Here is a screenshot of the document’s ToC loaded and after the update:
PS: It seems that the duplication happens only at the top-level; headings nested below Heading 1 don’t get duplicated. I filed a bug report with LO.
Last edited by _savage on Sat Dec 09, 2017 2:55 pm, edited 1 time in total.
Re: How do I find the ToC of a document, and extract its lin
_savage wrote: ...and 5.4.3.2
Your signature tells 5.3.4.2 (at the moment).
attachment wrote: doubletoc.docx
What do you expect of docx? OK, I have zero or less experience with documents loaded from docx. Why do you think you should use it? Errors and surprises expected. (And even MS software is reported not to be able to open OOXML, "strict" or "transitional", in certain cases. See https://flosmind.wordpress.com/2017/05/ ... el-cannot/ e.g.)
In fact your example produces the TOC-doubling effect for me in LibO V5.4.3 and in V6.0.0.0beta1, too, but not in AOO V4.1.3.
Anyway: Why tamper with documents in alien "formats"? Different formats can never contain the same real thing. They are limited to similar appearance and a certain degree of compatibility during work. There is a saying that LibO is "better" in docx than AOO. As you see, there are exceptions. Of course, phenomena may depend on which software wrote the file. As I cannot test in reasonable detail with files written by 'MS Office', I will not comment again on issues related to such alien "formats".
The one correctly approved family of open document formats is ODF. To also accept OOXML under the "open" label was a political issue, and a terrible mistake, imo. It mainly is a means to fight free and open software by the well-known commercial "vendor".
Re: How do I find the ToC of a document, and extract its lin
Thanks, Lupp, for the conversation so far…
I’d like to pick up this conversation once more. Suppose that, while iterating over the paragraphs of a document, the current paragraph is one that was generated for the ToC, meaning that the following assertion is True:
Code: Select all
indices = document.getDocumentIndexes()
index = indices.getByName("Table of Contents1")
...  # Iterate, and then:
assert index == par.DocumentIndex  # A par in the ToC has its DocumentIndex refer to the ToC index.
However, I fail to find any reference or association from this ToC entry paragraph to the original heading paragraph from which it was created. Similarly, I fail to find a reference from any of the heading paragraphs to their related ToC entry paragraphs.
Question: is there any kind of association available between a ToC entry paragraph and the heading paragraph it was created from?
And also: Paragraphs in a ToC index have names “Content n” where n is the equivalent of the OutlineLevel level of the associated heading paragraph (I am unsure whether this is a reliable constant though). Other than the name, does a ToC entry paragraph keep its level value stored someplace else?
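On the level question, one possible approach is to ask the index itself which paragraph style it uses per level, instead of trusting the style name alone. This is a sketch under assumptions: it supposes the index exposes properties named `ParaStyleLevel1` to `ParaStyleLevel10` and that the paragraph exposes `ParaStyleName`; the helper names are made up, and the trailing-number parse is only a fallback heuristic.

```python
import re


def toc_level_by_style(index, max_levels=10):
    """Map paragraph style names to ToC levels as configured on the index."""
    levels = {}
    for level in range(1, max_levels + 1):
        # ParaStyleLevel<n> is assumed to hold the style name used for level n.
        style = getattr(index, f"ParaStyleLevel{level}", None)
        if style:
            levels.setdefault(style, level)
    return levels


def toc_entry_level(par, index):
    """Return the ToC level of a paragraph, or None if it cannot be determined."""
    style = par.ParaStyleName
    level = toc_level_by_style(index).get(style)
    if level is None:
        # Fallback: trailing number in a style name such as "Contents 2".
        match = re.search(r"(\d+)$", style)
        level = int(match.group(1)) if match else None
    return level
```

For a paragraph `par` whose `DocumentIndex` is the ToC, `toc_entry_level(par, index)` would then give its level without depending on the default style names being left unchanged.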