[Rejected] Opinions requested: extending filter syntax (2)

Posted: Wed Jan 17, 2007 7:16 pm
by Wladimir Palant
The goal is to be able to create more specific filters without having to resort to regular expressions, since we cannot speed up processing for regular expressions. I had some success implementing regular expression features in simple filters. The syntax is similar but not quite the same, because we need to stay backward compatible. Here are the new features:

* Predefined character classes and special characters: "http://ad\d.server.com/*". Supported are \b, \B, \d, \D, \f, \n, \r, \s, \S, \t, \v, \w, \W, \c<X>, \x<hh>, \u<hhhh>. The meaning is the same as in regular expressions.
* Custom character sets: "ad[sv]" meaning "ads or adv". Anything that is allowed in regular expressions should be allowed here as well - negated character sets, character ranges etc.
* Quantifiers: "abc{2,3}" meaning "abcc or abccc" (variations like {3} or {1,} are allowed as well). Quantifiers *, + and ? from regular expressions are specified like this: "abc{+}" (this is the syntax that is not the same as with regular expressions).
* Escaping: any special character can be escaped using a backslash. "a\*c" will match "a*c" but not "abc".
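To make the four features above concrete, here is a minimal sketch of how such a filter could be translated into a regular expression. It is not the Adblock Plus code behind the test page; the function name filter_to_regex and the helper tables are hypothetical, and the sketch is simplified: character sets containing "]" are not handled, and escapes are passed through unchanged even though Python's re module does not understand every escape listed above (\c<X>, for example).

```python
import re

# Hypothetical sketch, not Adblock Plus code: translate the proposed
# filter syntax into a regular expression string.
SHORT_QUANTIFIERS = {"{+}": "+", "{*}": "*", "{?}": "?"}
NUMERIC_QUANTIFIER = re.compile(r"\{\d+(,\d*)?\}")


def filter_to_regex(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Escapes and predefined classes (\d, \w, \*) pass through as-is.
            out.append(text[i:i + 2])
            i += 2
        elif ch == "*":
            # A bare asterisk stays a wildcard, as in existing simple filters.
            out.append(".*")
            i += 1
        elif ch == "[" and text.find("]", i + 1) != -1:
            # Custom character set: copy everything up to the closing bracket.
            end = text.find("]", i + 1)
            out.append(text[i:end + 1])
            i = end + 1
        elif ch == "{":
            end = text.find("}", i)
            token = text[i:end + 1] if end != -1 else ""
            if token in SHORT_QUANTIFIERS:
                # {+}, {*} and {?} become the plain regex quantifiers.
                out.append(SHORT_QUANTIFIERS[token])
                i = end + 1
            elif NUMERIC_QUANTIFIER.fullmatch(token):
                # {3}, {1,} and {2,3} are already valid regex quantifiers.
                out.append(token)
                i = end + 1
            else:
                out.append(re.escape(ch))
                i += 1
        else:
            # Everything else is literal text; this is what makes "." safe.
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


pattern = filter_to_regex(r"http://ad\d.server.com/*")
print(pattern)
print(bool(re.search(pattern, "http://ad3.server.com/banner.gif")))
```

Note that the dot needs no escaping in the filter itself; the translator escapes it, which is the readability gain over writing the regular expression by hand.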

I put up a test page you can play with. It will show you the regular expression corresponding to a filter (comments and element hiding rules won't be recognized) and match it against test strings you enter. I will extend this test page later to make it show the shortcuts that can be used for a particular filter. This is the same code that should go into the next major release of Adblock Plus so please report any bugs you notice.

These filter features are targeted at advanced users, primarily filter list authors. They will also make the deregifier more useful. What do you think? Will this work, do I need to change anything?

Re: Opinions requested: extending filter syntax (2)

Posted: Wed Jan 17, 2007 7:40 pm
by chewey
To me this looks just like regular expressions, only a little less powerful.
How is this fundamentally different from regular expressions? Or is it
just supposed to be a subset of the whole regex syntax?

I don't get it - so I undoubtedly am missing the point.

What am I missing?

Posted: Wed Jan 17, 2007 7:59 pm
by Wladimir Palant
I forgot to mention two issues. First: with this syntax I can no longer enforce the validity of the resulting regular expression. Quantifiers are the problem here: you can have filters like "ads{+}{+}" or "{?}ads", and these will be marked as invalid, something that never happened to simple filters before. The other issue is that it will be difficult (or maybe even impossible) to make the redundancy checker support the new syntax. Telling whether "ad[sv]" and "ad\w" are redundant is hard.
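The two invalid examples share one pattern: a quantifier with nothing valid to repeat. A rough sketch of a check for that case follows. The function name and the rule itself are assumptions made for illustration (the actual validation may simply rely on the regular expression engine rejecting the result), and quantifiers inside character sets or after a "*" wildcard are ignored for brevity.

```python
import re

# Hypothetical check, not Adblock Plus code: flag a quantifier that appears
# at the start of a filter or directly after another quantifier.
QUANTIFIER = re.compile(r"\{(?:[+*?]|\d+(?:,\d*)?)\}")


def has_misplaced_quantifier(text: str) -> bool:
    # At the start of the filter there is nothing to repeat yet, so the
    # state begins as if a quantifier had just been seen.
    previous_was_quantifier = True
    i = 0
    while i < len(text):
        match = QUANTIFIER.match(text, i)
        if match:
            if previous_was_quantifier:
                return True
            previous_was_quantifier = True
            i = match.end()
        else:
            # An escaped character such as \{ counts as one literal unit.
            i += 2 if text[i] == "\\" else 1
            previous_was_quantifier = False
    return False


for candidate in ["ads{+}{+}", "{?}ads", "abc{2,3}", "abc{+}"]:
    print(candidate, has_misplaced_quantifier(candidate))
```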

@chewey: The fundamental difference is that you cannot have alternative paths. Something like /(foo|bar)/ is not possible, which gives ABP the chance to extract a shortcut - if there is plain text in the filter, you can be sure that this text has to be present in the matching address as well. So simple filters with the new syntax can use the optimized processing while regular expressions cannot. Once this syntax is supported I want to officially deprecate regular expressions (they will still be supported, however).
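Why a guaranteed piece of plain text helps can be sketched as a lookup table: each filter is stored under a fixed-length substring that every matching address must contain, and only addresses containing that substring are ever tested against the full pattern. The code below is an illustration under assumptions, not the Adblock Plus implementation; the names build_index and matches, the shortcut length of 8, and the sample filter are all hypothetical.

```python
import re

SHORTCUT_LEN = 8  # assumed shortcut length for this sketch


def build_index(filters):
    """Store each compiled pattern under a substring it is known to require.

    `filters` is a list of (required_text, regex) pairs; required_text must
    be plain text of at least SHORTCUT_LEN characters taken from the filter.
    """
    index = {}
    for required_text, regex in filters:
        if len(required_text) < SHORTCUT_LEN:
            raise ValueError("required text too short for a shortcut")
        index[required_text[:SHORTCUT_LEN]] = re.compile(regex)
    return index


def matches(index, url):
    # Slide a window over the address; the full pattern only runs when the
    # window hits a known shortcut.
    for start in range(len(url) - SHORTCUT_LEN + 1):
        candidate = index.get(url[start:start + SHORTCUT_LEN])
        if candidate is not None and candidate.search(url):
            return True
    return False


index = build_index([(".server.com/", r"http://ad\d\.server\.com/")])
print(matches(index, "http://ad3.server.com/x.gif"))
print(matches(index, "http://www.server.com/x.gif"))
print(matches(index, "http://example.org/"))
```

With /(foo|bar)/ no such key exists: an address can match while containing only "foo" or only "bar", so neither string is safe to index by.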

Posted: Thu Jan 18, 2007 12:00 am
by chewey
Wladimir Palant wrote: The fundamental difference is that you cannot have alternative paths. Something like /(foo|bar)/ is not possible, which gives ABP the chance to extract a shortcut - if there is plain text in the filter, you can be sure that this text has to be present in the matching address as well.
Ahhh, yes, I see. Thanks for the explanation.

Well - I don't see any harm in supporting this.
I guess a good part of the regular regexps can
be modified to use this "regexp light" syntax.

Posted: Thu Jan 18, 2007 2:33 am
by sheepy
Wouldn't it be easier to do it the other way around: allow specifying a shortcut for a regex?

Manual shortcut maintenance is required in that case, however.
Actually I'm alarmed by the rate at which new syntaxes are being introduced. :?


Still, I can see that this is a simple change to the parser, and I don't think compatibility will be a serious problem, so I'm not against it and will test it.

Posted: Thu Jan 18, 2007 7:43 am
by Peng
It sounds simpler to me to just parse regular expression filters to get shortcuts out of them — try to find eight characters in a row without any regexp special syntax in them.

Posted: Thu Jan 18, 2007 8:52 am
by sheepy
No, you can't, since it's possible that those characters are in a branch or are actually a string that must not match (e.g. negative look-ahead/look-behind).

If it were this simple, they'd be called simple expressions instead of regular expressions. :wink:

Posted: Thu Jan 18, 2007 11:37 am
by Peng
sheepy wrote: No, you can't, since it's possible that those characters are in a branch or are actually a string that must not match (e.g. negative look-ahead/look-behind).
Branch? As in parentheses? Well, obviously, you cut branches out.

So if you have the regexp:

/^https?:\/\/(serve[abc]|click\d)\.eviladserver\.com\//
... you reduce it to:

http, ://, .eviladserver.com/
... find the first of those that's 8 characters long and make the shortcut from it.

Not trivial, but it shouldn't be that hard.

(Remove the character before ? or *, but remember to keep the one before +: it has to be there at least once.)
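The last step of this proposal, choosing a shortcut from the leftover plain-text pieces, is the easy part and can be sketched as follows. The function name pick_shortcut is hypothetical, the fragment list is taken from the post as given, and the hard part (reducing an arbitrary regular expression to those fragments) is deliberately not attempted here.

```python
from typing import List, Optional


def pick_shortcut(fragments: List[str], length: int = 8) -> Optional[str]:
    """Return the first `length` characters of the first long-enough fragment."""
    for fragment in fragments:
        if len(fragment) >= length:
            return fragment[:length]
    # No fragment is long enough, so this filter gets no shortcut.
    return None


print(pick_shortcut(["http", "://", ".eviladserver.com/"]))  # prints .evilads
```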

Posted: Thu Jan 18, 2007 1:53 pm
by Wladimir Palant
Sorry Peng, it really isn't that simple: regular expression syntax is pretty complicated to parse. Also, the new syntax has an advantage over regular expressions: reading "http://ad\d.server.com/" is easier than reading "/http://ad\d\.server\.com/". The dot is the most problematic character when writing regular expressions. It is very common in addresses, and escaping it makes the filter hard to read (not to mention that forgetting to escape the dot is a very common mistake). Regular expressions also seem to encourage filters like "///([^/]+\.)?ad(ima?ge?|manager|se?rv.*|stream|v|vert.*|x)?s?-?\d*\.(?!.+\.edu|jp/|$)/", something I want to get rid of.

In the end, nobody is forced to use the new syntax. If it is only used by the deregifier and the (still theoretical) automatic filter generator, I am fine with that.

Posted: Thu Jan 18, 2007 2:18 pm
by Fox
Is it possible to use those, or something similar, to block these with one (simple) filter?
140x120.jpg
142x360.jpg
242x240.jpg
I mean some way to say that there must be 3 digits, an x and 3 digits again, followed by the .jpg extension, with no * between them.

Posted: Thu Jan 18, 2007 2:20 pm
by Wladimir Palant
That would be "\d\d\dx\d\d\d.jpg" or "\d{3}x\d{3}.jpg"
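For comparison, the second filter corresponds to the regular expression \d{3}x\d{3}\.jpg (the dot has to be escaped there). A quick check of that pattern against the three file names from the question, plus two made-up names that should not match, could look like this in Python; the test strings beyond the three from the post are assumptions added for illustration.

```python
import re

# Regular expression equivalent of the filter \d{3}x\d{3}.jpg
pattern = re.compile(r"\d{3}x\d{3}\.jpg")

names = [
    "140x120.jpg",
    "142x360.jpg",
    "242x240.jpg",
    "14x120.jpg",   # only two digits before the x
    "140x120.png",  # wrong extension
]

matched = [name for name in names if pattern.search(name)]
print(matched)
```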

Posted: Thu Jan 18, 2007 2:23 pm
by Fox
Thanks.

EDIT: Did I find a bug, or is the test page not ready?
This is my page: http://koti.mbnet.fi/foghorn/Fox/adblockingtest.htm
The pictures there are PNGs, so these are the filters:
\d\d\dx\d\d\d.png
\d{3}x\d{3}.png
and they don't match.

But these real regexps work in Adblock Plus:
/\d\d\dx\d\d\d\.png/
/\d{3}x\d{3}\.png/

Posted: Thu Jan 18, 2007 3:30 pm
by Dr. Evil
Sounds great :)

Posted: Thu Jan 18, 2007 3:49 pm
by Wladimir Palant
@Fox: If I go to the test page, enter "\d\d\dx\d\d\d.png" as the filter and "http://koti.mbnet.fi/foghorn/Fox/140x120.png" as test address it says "Matched". In the current Adblock Plus version this filter doesn't work of course - this feature hasn't been implemented yet.

Posted: Thu Jan 18, 2007 4:04 pm
by Fox
So I did enter the wrong address there :)