[Solved] Greedy regex problem

Post by nienberg » Mon Sep 28, 2009 9:20 pm

I've been happily coding away, converting a fairly complex MS Word solution into Writer, until I discovered a problem with the way OOo finds text in a Writer document using regular expressions. My research shows it is well documented that the find is "greedy", and that is my problem. I am searching for text that is enclosed within tags (like HTML or XML tags). Here is an example:

Code:
Here is an example with <myTag>some tagged text</myTag> and some non-tagged text.  The problem is finding <myTag>individual instances of tagged text</myTag> that occur in the same paragraph without also finding all of the non-tagged text in between.

The basic regex that finds too much because it is greedy is:
Code:
<myTag>.*</myTag>

I've also tried this:
Code:
<myTag>[^<]*</myTag>

which partly solves the problem, but fails for the case where tags are nested.
Code:
Here is an example of <myTag>tagged text that also includes <anotherTag>some nested tags</anotherTag> inside it</myTag>.  This won't work with the negated regex above.
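
For comparison, here is a minimal sketch of the three behaviours discussed in this thread, run against shortened versions of the sample sentences above. It uses Python's re module, which is a different engine from the OOo search of that era, so it only illustrates the concepts. The lazy form .*? is supported in Python; the variable names are just for illustration.

```python
import re

# Shortened versions of the two sample sentences from the post.
flat = ("Here is an example with <myTag>some tagged text</myTag> and some "
        "non-tagged text. The problem is finding <myTag>individual instances "
        "of tagged text</myTag> that occur in the same paragraph.")
nested = ("Here is an example of <myTag>tagged text that also includes "
          "<anotherTag>some nested tags</anotherTag> inside it</myTag>.")

GREEDY = r"<myTag>.*</myTag>"      # longest match: runs to the LAST closing tag
NEGATED = r"<myTag>[^<]*</myTag>"  # stops at any '<', so nested tags break it
LAZY = r"<myTag>.*?</myTag>"       # shortest match: stops at the NEXT closing tag

greedy_hits = re.findall(GREEDY, flat)       # one oversized match
negated_hits = re.findall(NEGATED, flat)     # two separate matches
negated_nested = re.findall(NEGATED, nested) # no match because of <anotherTag>
lazy_flat = re.findall(LAZY, flat)           # two separate matches
lazy_nested = re.findall(LAZY, nested)       # one match including the nested tag

for name, hits in [("greedy", greedy_hits), ("negated", negated_hits),
                   ("negated/nested", negated_nested),
                   ("lazy", lazy_flat), ("lazy/nested", lazy_nested)]:
    print(name, len(hits))
```

Note that even the lazy pattern would pair the wrong tags if <myTag> were nested inside another <myTag>; it only handles nesting of different tags.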

Does anyone have any suggestions on how I might proceed? I realize this maybe isn't strictly a Macro/UNO API problem, since it can be demonstrated within Writer itself, but in the end I am using the regex in a Basic routine.

Thanks,
Last edited by nienberg on Wed Sep 30, 2009 6:17 pm, edited 2 times in total.
LibreOffice 3.4.3
Windows XP, Windows 7, and MacOS X 10.7
nienberg
 
Posts: 28
Joined: Mon Sep 21, 2009 9:23 pm
Location: Berkeley, CA

Re: Greedy regex problem

Post by Robert Tucker » Mon Sep 28, 2009 10:19 pm

Having found:

It would be nice to be able to switch the find/replace, as well as other places where regexps are used, to be either greedy or not greedy.

Come on, that usually is only a matter of adding [^x] (substitute x for whatever is appropriate) to the regexp. I'd say thats pretty low priority.

at:

http://wiki.services.openoffice.org/wik ... xpressions

and finding that:

Code:
<(myTag).*?\1>

will not work, I guess a solution in OpenOffice may be some way off.
LibreOffice 6.x.x on Fedora 31 and Ubuntu 19.10 (Dual Boot)
Robert Tucker
Volunteer
 
Posts: 1247
Joined: Mon Oct 08, 2007 1:34 am
Location: Manchester UK

Re: Greedy regex problem

Post by nienberg » Tue Sep 29, 2009 3:27 am

Right, and to make matters worse, now I realize that the search will not work across paragraphs, so if there is a paragraph break inside a set of tags, then nothing will be found. I guess I have to rethink this. Maybe I can work out a solution that manipulates the odt file directly using Perl or something. But I'm open to other suggestions.
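
To make the cross-paragraph limitation concrete, here is a small sketch in Python. It is an assumption-laden model, not Writer's actual behaviour: paragraphs are represented as a plain list of strings and joined with a newline, and the re.DOTALL flag is what lets the dot cross that newline in Python's engine.

```python
import re

# Hypothetical document: one tagged span that straddles a paragraph break.
paragraphs = [
    "Intro text <myTag>a tagged span that starts here",
    "and ends in the next paragraph</myTag> followed by plain text.",
]
LAZY = r"<myTag>(.*?)</myTag>"

# Searching paragraph by paragraph never sees both tags together.
per_paragraph = [m for p in paragraphs for m in re.findall(LAZY, p)]

# Joining the paragraphs is not enough on its own: '.' skips newlines by default.
text = "\n".join(paragraphs)
without_dotall = re.findall(LAZY, text)

# With DOTALL the dot also matches the newline, so the span is found.
across = re.findall(LAZY, text, flags=re.DOTALL)

print(per_paragraph, without_dotall, across)
```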

Thanks,

Re: Greedy regex problem

Post by Robert Tucker » Tue Sep 29, 2009 6:07 pm

In Writer, the AltSearch extension should work, it seems, since its Help screen says:

...subexpression of the type (.*)any or (.+)any are searched for, the shortest matching occurrence is found, contrary to the OOo standard search, which will find the longest matching occurrence. If it is necessary to preserve compatibility, you can delimit the whole search expression with an extra pair of parentheses: ((Mi)?ster). But this will, of course, lose you the chance to cite the subexpression...

It can also search across paragraphs.

Re: Greedy regex problem

Post by nienberg » Wed Sep 30, 2009 1:35 am

Wow! Just when I had given up hope. That looks very promising. I installed the extension and tested manually. It looks like the [::BigBlock::] option does exactly what I need. Now my final question is how to call it from a Basic macro program. I noticed that recording a macro to use the extension doesn't work (it records nothing). Should I be calling the sub directly? The comments in the code are not in English, so it will be a bit of a challenge, but I assume I could figure it out if that is the best approach.

Thanks very much for your help,

Re: Greedy regex problem

Post by Robert Tucker » Wed Sep 30, 2009 9:13 am

Afraid I'm not much into macro writing. Perhaps I would be thinking more of pulling AltSearch apart to find out how it interacted with OpenOffice to do what it does – not something I want to do on a whim!

Re: Greedy regex problem

Post by nienberg » Wed Sep 30, 2009 6:15 pm

It's a big complicated library with all the comments and variable names in Czech, but I agree that my best bet is to pick it apart until I understand how it works.

Thanks again for your help.

Re: [Solved] Greedy regex problem

Post by bugmenot111 » Thu Nov 25, 2010 1:59 pm

I have the same problem: non-greedy search is not allowed. Have you found any workaround? (I couldn't find anything useful in the AltSearch's source code)
OpenOffice 3.1 on Windows Vista
bugmenot111
 
Posts: 17
Joined: Fri Mar 26, 2010 12:14 pm

Re: [Solved] Greedy regex problem

Post by Villeroy » Fri Nov 26, 2010 12:11 am

bugmenot111 wrote:I have the same problem: non-greedy search is not allowed. Have you found any workaround? (I couldn't find anything useful in the AltSearch's source code)

Did he really write a regex extension in a language without regex support?
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04, no OpenOffice, LibreOffice 6.x
Villeroy
Volunteer
 
Posts: 27574
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: [Solved] Greedy regex problem

Post by nienberg » Mon Nov 29, 2010 5:04 am

My need was specifically to find tags with text between them, so after studying the AltSearch code I wrote a Basic subroutine that searches for the opening tag, then (starting at the location of the opening tag) finds the next closing tag, and then selects the text in between. So it doesn't use any of the regex capabilities at all, just the plain search capabilities.
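
The approach described above can be sketched roughly as follows. This is not the Basic subroutine from the post (which was not shared); it is a Python illustration on a plain string, and the function name find_tagged_spans and the sample text are made up for the example. In a Writer macro the two find steps would presumably map onto the document's search calls (findFirst for the opening tag, then findNext starting from that position for the closing tag), but that mapping is an assumption.

```python
def find_tagged_spans(text, open_tag="<myTag>", close_tag="</myTag>"):
    """Return (start, end, inner_text) for each open/close pair, left to right.

    Hypothetical helper: plain substring search, no regular expressions.
    """
    spans = []
    pos = 0
    while True:
        start = text.find(open_tag, pos)       # step 1: find the opening tag
        if start == -1:
            break
        inner_start = start + len(open_tag)
        end = text.find(close_tag, inner_start)  # step 2: next closing tag after it
        if end == -1:
            break                              # unmatched opening tag: stop
        spans.append((inner_start, end, text[inner_start:end]))  # step 3: select
        pos = end + len(close_tag)             # continue after this pair
    return spans


# Made-up sample with a nested tag and a paragraph break inside the second span.
sample = ("One <myTag>first</myTag> gap <myTag>has <anotherTag>nested"
          "</anotherTag> tags\nand a paragraph break</myTag> end.")
spans = find_tagged_spans(sample)
inner = [s[2] for s in spans]
print(inner)
```

Because it always pairs an opening tag with the nearest following closing tag, this behaves like a non-greedy match, and nested tags of a different name or line breaks inside the span do not get in the way. Like the lazy regex, it would mis-pair tags if <myTag> were nested inside itself.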

