[crosspost] What interface or service for utf encode decode

Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

[crosspost] What interface or service for utf encode decode

Post by Lupp »

Also posted here: https://ask.libreoffice.org/en/question ... de-decode/

Are there services / interfaces intended for encoding to / decoding from UTF (8 / 16)?
The software is obviously capable of doing it on import / export. How is this kind of task represented in the UNO API? I cannot find corresponding entries in https://api.libreoffice.org/docs/idl/re ... paces.html (for example).

The answer should be (nearly) the same for Apache OO and for LibreOffice.
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
Villeroy
Volunteer
Posts: 31269
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: [crosspost] What interface or service for utf encode dec

Post by Villeroy »

Some quick ideas:
It is certainly possible to open an encoded text file and save a differently encoded text file by means of MediaDescriptor properties "FilterName" and "FilterOptions". That is what might happen when you do it manually via File>Open and File>Save.
And then there are I/O streams which can be used to convert files. See http://www.openoffice.org/udk/common/ma ... reams.html As far as I understand that document, you can hook in your own converter but I can't see if it is possible to hook in any built-in converter.
I'm fairly sure there is a Python library for this.
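For the Python route, a minimal sketch, assuming Python 3 (the module layout differs in Python 2): the standard module urllib.parse percent-encodes and decodes through UTF-8. This is plain Python, not a UNO service, and the two helper names are made up for illustration.

```python
from urllib.parse import quote, unquote


def to_percent_utf8(text):
    """Percent-encode text via UTF-8; only letters, digits and "_.-~" stay plain."""
    return quote(text, safe="", encoding="utf-8")


def from_percent_utf8(text):
    """Decode %XX escapes as UTF-8; invalid byte sequences raise an error."""
    return unquote(text, encoding="utf-8", errors="strict")


print(to_percent_utf8("München"))         # M%C3%BCnchen
print(from_percent_utf8("M%C3%BCnchen"))  # München
```

Passing safe="" makes quote() escape reserved characters such as "/" and "&" as well. To keep some of them plain, list them in the safe argument.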
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04 with LibreOffice 6.0, latest OpenOffice and LibreOffice
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: [crosspost] What interface or service for utf encode dec

Post by Lupp »

@Villeroy:
Thanks a lot for the quick response.
However, I was looking for a conversion service that might also be usable independent of the context of importing from/exporting to files or streams.

Background:
I took an opportunity to learn a bit about some topics in a field I rarely touched (except superficially, of course).
The motivation came from a case where the "co-operation" of several components was opaque and hard for me to understand.
-1- The software of a web service provider, which interprets the URI it receives and also sends URIs back via the URL line of the browser.
-2- The browser (Firefox in my case), which shows lots of "doubtful" (needing encoding) characters as plain text and accepts them, too.
-3- The clipboard (of current Win 10), which has to take the complete URI for my purposes when a Copy command is executed.
-4- The text (in the Windows Editor, or in AOO/LibO Writer, a Calc cell, etc.) that should accept and insert content from the clipboard when a Paste command is executed.
The case in short (as far as I can):
I see everything in the URL line as plain text, including characters that are disallowed or reserved but used otherwise.
With Copy / Paste (or Paste Special...) I am offered no format but 'Unformatted text', whether in Editor or in LibO.
Accepting the insertion, I get all the critical characters (URI-reserved or code point > 127) in UTF-8 encoding in the %-escaped representation. For two-byte cases specifically, "ü" becomes %C3%BC, so "München" mutates to "M%C3%BCnchen".
Since I wanted to create and manipulate such URIs in Calc and re-enter them (send them to Firefox), often with a very long query component, I preferred to do it "in plain" as far as possible. But I also wanted to be able to insert percent-encoded characters where needed in specific cases.
I ended up writing functions for the purpose, which I find boring and annoying.

===Editing===
Example:
kurviger.de/?point=München&point=Београд
is an accepted entry and shown in the URL-line, but
https://kurviger.de/?point=M%C3%BCnchen ... 0%B0%D0%B4
is returned after Copy/Paste.
My initial puzzlement was about the fact that Firefox seemingly doesn't send both versions to the clipboard, marked as different formats.
(The long Query I get after a few additional steps:
https://kurviger.de/?point=M%C3%BCnchen ... %20Streets)
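
Working "in plain" on the query component and encoding only at the end could look roughly like this in Python 3. This is a sketch under assumptions: the helper names decode_query and with_query are invented here, and the list of pairs simply mirrors the two point= values from the example above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, quote


def decode_query(url):
    """Return the query component as a list of plain (name, value) pairs."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True, encoding="utf-8")


def with_query(url, pairs):
    """Rebuild url with pairs percent-encoded via UTF-8 (space becomes %20)."""
    parts = urlsplit(url)
    query = urlencode(pairs, quote_via=quote, encoding="utf-8")
    return urlunsplit(parts._replace(query=query))


plain = [("point", "München"), ("point", "Београд")]
encoded = with_query("https://kurviger.de/", plain)
print(encoded)                # percent-encoded form, as pasted from the clipboard
print(decode_query(encoded))  # back to the plain pairs
```

The plain pairs are what one would keep and edit in Calc cells; only the final string needs to be percent-encoded.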

By the way: "The Internet Explorer 11 (TM)" doesn't even try to interpret the Cyrillic letters. Or does it try and fail? Make your choice.

===Editing Again=== (For completeness and for fun)
If I go to the URL line in Firefox, but then select only a part of the URL/URI (just leaving out the first character, e.g.) and copy the selection to the clipboard, I get the unencoded text when pasting later from the clipboard into a Calc cell, for example.
This was tested with Firefox 6.1.0 (64 bit) on Win 10 up-to-date.
JeJe
Volunteer
Posts: 2764
Joined: Wed Mar 09, 2016 2:40 pm

Re: [crosspost] What interface or service for utf encode dec

Post by JeJe »

There is a Windows API function, WideCharToMultiByte.

https://docs.microsoft.com/en-gb/window ... omultibyte

There's a VB example here.

https://www.di-mgt.com.au/howto-convert ... -utf8.html

It may not be possible to get this to work from OOBasic as it doesn't support pointers but you may be able to run it some other way.
Windows 10, Openoffice 4.1.11, LibreOffice 7.4.0.3 (x64)
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: [crosspost] What interface or service for utf encode dec

Post by Lupp »

Thanks, JeJe!

I just don't like to rely on MS or to study VBA. I had also already written a Basic module with a few functions for related purposes, mainly to be sure I understood UTF-8, which I had never studied before. Nonetheless I would prefer to use UNO services, if available, over ingenious but inefficient Basic code manufactured by Lupp.

Part of the mentioned module:
(To allow for the theoretical maximum of 6 bytes for one CodePoint is, of course, useless luxury.)
The UTF-8 results are represented by two HEX digits per byte (most significant byte first) in percent-escaped form as used in URI.


REM  *****  BASIC  *****
REM V0.1 Wolfgang Jäger; about 2018-07-02; Errors expected!
Option Explicit

Global fa As Object

Sub createFa()
If IsNull(fa) Then
  fa = CreateUnoService("com.sun.star.sheet.FunctionAccess")
End If
End Sub

Function charToUtf8(pChar As String) 'For UniCode CodePoints < &H200000 only!
charToUtf8 = ":error:"
If pChar="" Then
  charToUtf8 = ""
  Exit Function
End If
Dim cPN As Long           'Asc() cannot return Long values; the spreadsheet function UNICODE is used instead.
If IsNull(fa) Then createFa()
'fa = CreateUnoService("com.sun.star.sheet.FunctionAccess")
cPN = fa.CallFunction("UNICODE", Array(pChar))
charToUtf8 = uniCodeToUtf8(cPN)
End Function

Function uniCodeToUtf8(ByVal pCode As Long)
uniCodeToUtf8 = ":error:"
If pCode<0 Then Exit Function
Dim k As Long, j As Long, topMax As Byte, b(-1 To 5) As Byte, r As String
Const delBits67 As Byte = &H3F, folMark As Byte = &H80
If pCode<&H80 Then                      'Only one byte, since <=&HFF and bit 7 = 0
  k = 0
  If pCode<&H10 Then
    r = "%0" & Hex(pCode)
  Else
    r = "%" & Hex(pCode)
  End If
Else
  topMax = &H7F                         'What is the largest "shift remainder" that fits into the MS byte,
  k = -1                                'given that the marker for the number of bytes is placed there?
  Do While (pCode>0) OR (b(k)>topMax)
    topMax = (topMax + 1 ) \&H02 -1
    k = k + 1
    b(k) = (delBits67 AND pCode)
    pCode = pCode \ &H40
  Loop
  r = ""
  For j = 0 To k - 1
    r= "%" & Hex(b(j) OR folMark) & r
  Next j
    r = "%" & Hex(b(k) OR (&HFF - (topMax * &H02 +1))) & r
End If
uniCodeToUtf8 = r
End Function
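
For comparison, the same shift-by-six-bits idea as a Python 3 sketch. It is not a translation of the Basic function: it stops at U+10FFFF (at most 4 bytes) instead of allowing 6, and the function name is invented for this example. Surrogate code points (U+D800 to U+DFFF) are not rejected here, which a strict encoder would do.

```python
def code_point_to_percent_utf8(code):
    """Encode one code point as UTF-8 and return it in %XX form."""
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code < 0x80:
        data = [code]                 # single byte, bit 7 is 0
    else:
        tail = []                     # continuation bytes, least significant first
        lead_max = 0x3F               # payload that still fits into the lead byte
        while code > lead_max:
            tail.append(0x80 | (code & 0x3F))  # 10xxxxxx
            code >>= 6
            lead_max >>= 1            # each extra byte costs one lead-byte bit
        # Lead marker: as many leading 1 bits as there are bytes in total.
        lead_mark = (0xFF00 >> (len(tail) + 1)) & 0xFF
        data = [lead_mark | code] + tail[::-1]
    return "".join("%{:02X}".format(b) for b in data)


print(code_point_to_percent_utf8(ord("ü")))  # %C3%BC
```

The result can be checked against Python's built-in encoder, chr(code).encode("utf-8"), for any non-surrogate code point.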