The REGEX function in LibreOffice

Discuss the spreadsheet application
Post Reply
eeigor
Posts: 214
Joined: Sun Apr 12, 2020 10:56 pm

The REGEX function in LibreOffice

Post by eeigor »

Syntax
REGEX (Text; Expression [; [Replacement] [; Flags|Occurrence]])

Flags: Optional. "g" replaces all matches of Expression in Text, not extracted. If there is no match, Text is returned unmodified.

The REGEX function, in my opinion, is not designed very well in terms of the last parameter and the "g" flag.
When enabled, all occurrences will be replaced if the replacement parameter is specified.
As for the return of the found one, if the number of the occurrence is equal to zero, then the function for some reason returns the original string (?), and if the number of the occurrence is specified, it will return it.
How to handle zero is, of course, a matter of taste. But why does the "g" flag here only work for replacement and not work for return? Either the first occurrence or the given one is returned. However, in my opinion, when the "g" (Global) flag is set, all occurrences should also be returned (now if the replacement parameter is not specified, the flag is ignored). Then, if such a function is placed in a cell, we will see the first value from the array of matches. But an array of values ​​can be concatenated into a string using a delimiter, placed in a range of cells, counted, etc.
Bad decision. In addition, the Flags parameter only processes one flag (when performing a replacement). But there could (and should) be more of them: "ismwx" according to specification. And they should work not only for replacement, but also for extraction. And we need to extract any match. Then why did they bring together two different parameters: "Flags|Occurrence"? Who forgot to put a comma?
Ubuntu 18.04 LTS • LibreOffice 7.5.3.2 Community
User avatar
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: There is smth wrong with the REGEX function

Post by Lupp »

Bugs (including enhancement requests and complaints about "bad decisions") concerning LibreOffice go to https://bugs.documentfoundation.org.
eeigor wrote:Who forgot to put a comma?
What do you mean?
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
NZ_LinuxStudent
Posts: 14
Joined: Tue Feb 23, 2021 1:10 am
Location: New Zealand

Re: There is smth wrong with the REGEX function

Post by NZ_LinuxStudent »

Introduction to Regular Expressions
https://wiki.openoffice.org/wiki/Docume ... ced/RegExp

How does Apache OpenOffice handle regular expressions as comparison?

THE BELOW is a Simple Guide to replace LibreOffice with OpenOffice, it includes information to then revert back to LibreOffice
Download the lastest installer (OpenOffice_4.1.9 en-US https://sourceforge.net/projects/openof ... z/download), change to Download directory and extract the files.
Navigate to extracted files and run these commands;
sudo apt-get update
sudo apt-get -y remove --purge libreoffice* libexttextcat-data* && sudo apt-get -y autoremove
sudo dpkg -i en-US/DEBS/*.deb
sudo dpkg -i en-US/DEBS/desktop-integration/*.deb

You should now be running OpenOffice4.1.9, how does OpenOfficeCalc handle this in comparison, better? same? ...

To remove OpenOffice and replace with LibreOffice again ;
sudo apt-get -y purge openoffice*.* && sudo apt-get autoremove
sudo apt-get -y install libreoffice libreoffice-gnome

Hopefully this is of some help, I am not well versed in the Regular Expressions but hopefully the above wiki, or comparison to ApacheOpenOffice is of some use :)
 Edit:  https://wiki.openoffice.org/wiki/Docume ... xpressions 
OpenOffice 4.5.0 on LMDE 4 Debbie
eeigor
Posts: 214
Joined: Sun Apr 12, 2020 10:56 pm

Re: There is smth wrong with the REGEX function

Post by eeigor »

Lupp wrote:Bugs (including enhancement requests and complaints about "bad decisions") concerning LibreOffice go to https://bugs.documentfoundation.org.
eeigor wrote:Who forgot to put a comma?
What do you mean?
@Lupp, first I need to figure out if I'm right.
Semicolon of course:
REGEX (Text; Expression [; [Replacement] [; Flags] [; Occurrence] ] )

Upd:
By the way, all flags (ismx) work if you insert them directly into a regular expression: "(?ismx)…" or "(?ismx:…)" if the corresponding option is enabled. Except for one (w), the meaning of which I would ask, if possible, to explain to me by examples.
Flag Settings - Description
i - Ignore case (case insensitive)
s - Make . match newline too (single-line, dot all)
m - Make begin/end {^, $} consider each line
x - Allow comment in regex
w - Make {\w, \W, \b, \B} follow Unicode rules

(w - UREGEX_UWORD or re.U (re.UNICODE) in Python). Am I right?

All work. Great! The use of the "w" flag remains unclear. The question is purely theoretical.
To set a flag, if using the REGEX function, you can simply add the flag directly to the expression.
For example, to search for a word based on unicode: "(?w)\w+" or "(?w:\w+)"

E. g.:
The quick ("brown") fox can’t jump 32.3 feet, right?
The||quick||(|"|brown|"|)||fox||can’t||jump||32.3||feet|,||right|?|
What is red does not perceive as one word if you use "(?w)\b\w+\b", but should...

=REGEX("The quick (""brown"") fox can’t jump 32.3 feet, right?";"(?w)\b\w+\b";;5)
returns "jump", not "can't". Why?
Ubuntu 18.04 LTS • LibreOffice 7.5.3.2 Community
NZ_LinuxStudent
Posts: 14
Joined: Tue Feb 23, 2021 1:10 am
Location: New Zealand

Re: There is smth wrong with the REGEX function

Post by NZ_LinuxStudent »

Is this helpful ?

Regular Expressions in Writer-> https://wiki.openoffice.org/wiki/Docume ... _in_Writer
Using regular expressions in functions -> https://wiki.openoffice.org/wiki/Docume ... _functions
Finding and replacing text and formatting -> https://wiki.openoffice.org/wiki/Docume ... formatting
OpenOffice 4.5.0 on LMDE 4 Debbie
Mountaineer
Posts: 310
Joined: Sun Sep 06, 2020 8:27 am

Re: There is smth wrong with the REGEX function

Post by Mountaineer »

Great! The use of the "w" flag remains unclear. The question is purely theoretical.
Long, long time ago,
before there was unicode to bind them all,
there was software - unaware of umlauts äöü accents âè or special characters ß

For reasons of speed or compatibility you can decide
to ignore the modern possibillities to quote
greek or russian text with their own writing,
so these parts of your text would not be recognized as words. Or regex can take the Character-Class of Unicode in account while searching.

J.

PS Yes - for english speaking people this seems purely theoretical...
OpenOffice 3.1 on Windows Vista
eeigor
Posts: 214
Joined: Sun Apr 12, 2020 10:56 pm

Re: There is smth wrong with the REGEX function

Post by eeigor »

But on the advice of @Lupp, I still posted a feature improvement request.
https://bugs.documentfoundation.org/sho ... ?id=140708

But words with an accent in a word are recognized with the "w" flag disabled (?-w)
=REGEX("А́ Е́ И́ О́ У́ Ы́ Э́ Ю́ Я́ а́ е́ и́ о́ у́ ы́ э́ ю́ я́";"(?-w)\b\w+\b";;2)
returns "Е́" (Cyrillic). Why?

Upd:
That's why I said that there is something wrong with the REGEX function :)
Ubuntu 18.04 LTS • LibreOffice 7.5.3.2 Community
User avatar
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: There is smth wrong with the REGEX function

Post by Lupp »

The RegEx engine used by LibreOffice, and (afaik) also by AOO is by ICU, and I don't think they support a (?w) flag or its negation. In fact I couldn't find something about a RegEx engine supporting a flag with the above assumed meaning. The (?w) I found is completely different.
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
Mountaineer
Posts: 310
Joined: Sun Sep 06, 2020 8:27 am

Re: There is smth wrong with the REGEX function

Post by Mountaineer »

In fact I couldn't find something about a RegEx engine supporting a flag with the above assumed meaning.
It's in the ICU-docs under Flag-options:
http://userguide.icu-project.org/string ... ag-Options

J.
OpenOffice 3.1 on Windows Vista
User avatar
Lupp
Volunteer
Posts: 3542
Joined: Sat May 31, 2014 7:05 pm
Location: München, Germany

Re: There is smth wrong with the REGEX function

Post by Lupp »

Lupp wrote:...a flag with the above assumed meaning.
As far as I can see the "w flag" isn't related at all to the original question. However, there was a post containing
eeiogor wrote:But words with an accent in a word are recognized with the "w" flag disabled (?-w)
=REGEX("А́ Е́ И́ О́ У́ Ы́ Э́ Ю́ Я́ а́ е́ и́ о́ у́ ы́ э́ ю́ я́";"(?-w)\b\w+\b";;2)
returns "Е́" (Cyrillic). Why?
Upd:
That's why I said that there is something wrong with the REGEX function
I couldn't see in what way this might be related to the following information:
ICU wrote:w UREGEX_UWORD Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.
Yes, there is doubtable behaviour/specification of the REGEX() function, mainly the (imo) bad idea to charge the absence of an arguemt on a parameter position with the task to pass relevant information. Anything else here looks fancy to me.
On Windows 10: LibreOffice 24.2 (new numbering) and older versions, PortableOpenOffice 4.1.7 and older, StarOffice 5.2
---
Lupp from München
Mountaineer
Posts: 310
Joined: Sun Sep 06, 2020 8:27 am

Re: There is smth wrong with the REGEX function

Post by Mountaineer »

Lupp wrote:
Lupp wrote:...a flag with the above assumed meaning.
As far as I can see the "w flag" isn't related at all to the original question. However, ...
Just a Side-Track, as eeoigor questiones in his second post on the w-flag. Unicode character-classes can have effect on word-boundaries, and probe to misunderstandings, especially if foreign letters exist also as a math symbol, like the omega.

J.

PS I didn't read the first post as a "question" anyway...
OpenOffice 3.1 on Windows Vista
eeigor
Posts: 214
Joined: Sun Apr 12, 2020 10:56 pm

Re: There is smth wrong with the REGEX function

Post by eeigor »

The "w" flag is redundant because the default for strings is Unicode matches, isn't it?
In Python the re.U flag still exists only for backward compatibility.
Ubuntu 18.04 LTS • LibreOffice 7.5.3.2 Community
Post Reply