[Rejected] Opinions requested: extending filter syntax (2)

Wladimir Palant

Post by Wladimir Palant »

The goal is to be able to create more specific filters without having to resort to regular expressions, whose processing we cannot speed up. I had some success implementing regular expression features in simple filters. The syntax is similar but not quite the same, because we need to stay backward compatible. Here are the new features:

* Predefined character classes and special characters: "http://ad\d.server.com/*". Supported are \b, \B, \d, \D, \f, \n, \r, \s, \S, \t, \v, \w, \W, \c<X>, \x<hh>, \u<hhhh>. The meaning is the same as in regular expressions.
* Custom character sets: "ad[sv]" meaning "ads or adv". Anything that is allowed in regular expressions should be allowed here as well - negated character sets, character ranges etc.
* Quantifiers: "abc{2,3}" meaning "abcc or abccc" (variations like {3} or {1,} are allowed as well). The quantifiers *, + and ? from regular expressions have to be wrapped in braces, e.g. "abc{+}" (this is where the syntax differs from regular expressions).
* Escaping: any special character can be escaped using a backslash. "a\*c" will match "a*c" but not "abc".
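
To make the four features above concrete, here is a minimal sketch of how such a filter could be translated into a regular expression. It is not the code behind the test page: the function name `filter_to_regex` is made up for illustration, `\c<X>`, `\x<hh>` and `\u<hhhh>` are left out, and character sets containing a literal "]" are not handled.

```python
import re

# Escapes that keep their regular-expression meaning in the proposal.
# \c<X>, \x<hh> and \u<hhhh> are left out of this sketch.
CLASS_ESCAPES = set("bBdDfnrsStvwW")
# {+}, {*}, {?} or a counted quantifier such as {3}, {1,} or {2,3}.
BRACE_QUANTIFIER = re.compile(r"\{(?:([+*?])|(\d+(?:,\d*)?))\}")


def filter_to_regex(filter_text):
    """Translate a filter in the proposed syntax into regex source (hypothetical helper)."""
    out = []
    i = 0
    while i < len(filter_text):
        ch = filter_text[i]
        quantifier = BRACE_QUANTIFIER.match(filter_text, i)
        if ch == "\\" and i + 1 < len(filter_text):
            nxt = filter_text[i + 1]
            # \d, \w, ... pass through; any other escaped character is a literal.
            out.append("\\" + nxt if nxt in CLASS_ESCAPES else re.escape(nxt))
            i += 2
        elif ch == "*":
            out.append(".*")  # the existing simple-filter wildcard
            i += 1
        elif ch == "[" and "]" in filter_text[i + 1:]:
            end = filter_text.index("]", i + 1)
            out.append(filter_text[i:end + 1])  # character set copied as-is
            i = end + 1
        elif quantifier:
            short, counted = quantifier.groups()
            out.append(short if short else "{" + counted + "}")
            i = quantifier.end()
        else:
            out.append(re.escape(ch))  # e.g. "." matches only a literal dot
            i += 1
    return "".join(out)


def filter_matches(filter_text, address):
    """Simple filters match anywhere in the address, so search() is used."""
    return re.search(filter_to_regex(filter_text), address) is not None
```

With this sketch, `filter_matches("a\\*c", "a*c")` is true while `filter_matches("a\\*c", "abc")` is not, and `filter_to_regex("abc{+}")` gives `abc+`.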

I put up a test page you can play with. It will show you the regular expression corresponding to a filter (comments and element hiding rules won't be recognized) and match it against test strings you enter. I will extend this test page later to make it show the shortcuts that can be used for a particular filter. This is the same code that should go into the next major release of Adblock Plus so please report any bugs you notice.

These filter features are targeted at advanced users, primarily filter list authors. They will also make the deregifier more useful. What do you think? Will this work, do I need to change anything?
chewey
Posts: 501
Joined: Wed Jun 14, 2006 10:34 pm
Location: somewhere in Europe

Re: Opinions requested: extending filter syntax (2)

Post by chewey »

This looks just like a slightly less powerful version of regular
expressions to me. How is this fundamentally different from regular
expressions? Or is it just supposed to be a subset of the whole regex syntax?

I don't get it - so I undoubtedly am missing the point.

What am I missing?
Wladimir Palant

Post by Wladimir Palant »

I forgot to mention two issues. First: with this syntax I can no longer guarantee that the resulting regular expression is valid. Quantifiers are the problem here: you can have filters like "ads{+}{+}" or "{?}ads", and these will be marked as invalid, something that never happened to simple filters before. Second: it will be difficult (or maybe even impossible) to make the redundancy checker support the new syntax. Telling whether "ad[sv]" and "ad\w" are redundant is hard.
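
A rough sketch of what such a validity check could look like, assuming the brace quantifier forms from the first post. The helper name `has_dangling_quantifier` is hypothetical. It checks the filter structurally instead of compiling the result, since regular expression engines differ in what they accept, and it only covers a quantifier that has nothing to repeat (at the start of the filter, after another quantifier, or after the `*` wildcard).

```python
import re

# {+}, {*}, {?} or a counted quantifier such as {3}, {1,} or {2,3}.
QUANTIFIER = re.compile(r"\{(?:[+*?]|\d+(?:,\d*)?)\}")


def has_dangling_quantifier(filter_text):
    """Return True if some quantifier has nothing in front of it to repeat."""
    i = 0
    can_quantify = False  # whether the previous token may take a quantifier
    while i < len(filter_text):
        found = QUANTIFIER.match(filter_text, i)
        if found:
            if not can_quantify:
                return True
            can_quantify = False  # a second quantifier in a row is rejected
            i = found.end()
        elif filter_text[i] == "\\" and i + 1 < len(filter_text):
            can_quantify = True   # escaped character or class such as \d
            i += 2
        elif filter_text[i] == "*":
            can_quantify = False  # the wildcard is already a repetition
            i += 1
        else:
            can_quantify = True   # plain character or end of a character set
            i += 1
    return False
```

Both problem filters from this post, "ads{+}{+}" and "{?}ads", are flagged by this check, while "abc{+}" and "abc{2,3}" pass.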

@chewey: The fundamental difference is that you cannot have alternative paths. Something like /(foo|bar)/ is not possible, which gives ABP the chance to extract a shortcut: if there is plain text in the filter, you can be sure that this text has to be present in the matching address as well. So simple filters with the new syntax can use the optimized processing while regular expressions cannot. Once this syntax is supported I want to officially deprecate regular expressions (they will still be supported, however).
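
To illustrate why a guaranteed piece of plain text helps, here is a small sketch of a shortcut lookup. It is an assumption about how such an index might work, not Adblock Plus code: the class `ShortcutIndex`, its methods and the fixed length of eight characters are all made up for the example.

```python
import re

SHORTCUT_LEN = 8  # assumed length for this example


class ShortcutIndex:
    """Hypothetical index: look filters up by a literal substring before matching."""

    def __init__(self):
        self.by_shortcut = {}  # shortcut text -> compiled filters containing it
        self.unindexed = []    # filters without a usable shortcut (always tested)

    def add(self, regex_source, shortcut=None):
        compiled = re.compile(regex_source)
        if shortcut is not None and len(shortcut) == SHORTCUT_LEN:
            self.by_shortcut.setdefault(shortcut, []).append(compiled)
        else:
            self.unindexed.append(compiled)

    def matches(self, address):
        # A filter can only match if its shortcut occurs in the address,
        # so only substrings of the address are used as lookup keys.
        for start in range(len(address) - SHORTCUT_LEN + 1):
            key = address[start:start + SHORTCUT_LEN]
            for compiled in self.by_shortcut.get(key, ()):
                if compiled.search(address):
                    return True
        # Everything else has to be tested one by one.
        return any(c.search(address) for c in self.unindexed)


index = ShortcutIndex()
# Regex form of "http://ad\d.server.com/"; ".server." must appear in any match.
index.add(r"http://ad\d\.server\.com/", shortcut=".server.")
# With alternatives, no single piece of text is guaranteed to appear.
index.add(r"(foo|bar)")
```

The first filter is only ever tested against addresses that contain ".server.", while the second one has to be run against every address.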
chewey
Posts: 501
Joined: Wed Jun 14, 2006 10:34 pm
Location: somewhere in Europe

Post by chewey »

Wladimir Palant wrote: The fundamental difference is that you cannot have alternative paths. Something like /(foo|bar)/ is not possible, which gives ABP the chance to extract a shortcut: if there is plain text in the filter, you can be sure that this text has to be present in the matching address as well.
Ahhh, yes, I see. Thanks for the explanation.

Well - I don't see any harm in supporting this.
I guess a good part of the regular regexps can
be modified to use this "regexp light" syntax.
sheepy
Posts: 147
Joined: Sat Jun 17, 2006 8:44 pm

Post by sheepy »

Wouldn't it be easier to do it the other way around and allow specifying a shortcut for a regex?

Manual shortcut maintenance is required in that case, however.
Actually I'm alarmed by the rate new syntaxes are being introduced. :?


Still, I can see that this is a simple change to the parser, and I don't think compatibility will be a serious problem, so I'm not against it and will test it.
Peng
Posts: 518
Joined: Fri Jun 09, 2006 8:14 pm
Location: Central Florida

Post by Peng »

It sounds simpler to me to just parse regular expression filters to get shortcuts out of them — try to find eight characters in a row without any regexp special syntax in them.
Matt Nordhoff
sheepy
Posts: 147
Joined: Sat Jun 17, 2006 8:44 pm

Post by sheepy »

No, you can't: those characters may be inside a branch, or may actually be a string that must not match (e.g. negative lookahead/lookbehind).

If it were that simple, they would be called simple expressions instead of regular expressions. :wink:
Peng
Posts: 518
Joined: Fri Jun 09, 2006 8:14 pm
Location: Central Florida

Post by Peng »

sheepy wrote: No, you can't: those characters may be inside a branch, or may actually be a string that must not match (e.g. negative lookahead/lookbehind).
Branch? As in parentheses? Well, obviously, you cut branches out.

So if you have the regexp:

/^https?:\/\/(serve[abc]|click\d)\.eviladserver\.com\//
... you reduce it to:

http, ://, .eviladserver.com/
... find the first of those that's 8 characters long and make the shortcut from it.

Not trivial, but it shouldn't be that hard.

(Remove the character before ? or *, but remember to keep the one before +: it has to be there at least once.)
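
The reduction described above can be sketched roughly as follows. This is only an illustration of the heuristic with hypothetical helper names (`literal_runs`, `pick_shortcut`), not a full regular expression parser: it skips everything inside parentheses, gives up on a top-level "|", and ignores corner cases such as a "]" inside a character set. The example uses the regexp from this post with the dot before "com" escaped.

```python
def literal_runs(body):
    """Split a regex body into runs of text that every match must contain."""
    runs, current = [], []
    depth = 0             # parenthesis nesting; text inside branches is skipped
    last_literal = False  # was the previous token a top-level literal character?
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        literal = None
        if ch == "\\" and i + 1 < n:
            if not body[i + 1].isalnum():
                literal = body[i + 1]  # escaped punctuation such as \. or \/
            i += 2                     # \d, \w, \b ... only end the current run
        elif ch == "[":
            i += 1
            while i < n and body[i] != "]":
                i += 2 if body[i] == "\\" else 1
            i += 1                     # skip the whole character set
        elif ch in "()":
            depth += 1 if ch == "(" else -1
            i += 1
        elif ch == "|":
            if depth == 0:
                return []              # top-level alternatives: nothing is guaranteed
            i += 1
        elif ch in "?*{":
            if last_literal and current:
                current.pop()          # the quantified character may be absent
            if ch == "{":
                end = body.find("}", i)
                i = end + 1 if end != -1 else n
            else:
                i += 1
        elif ch in "+^$.":
            i += 1                     # "+" keeps the character before it
        else:
            literal = ch
            i += 1
        if literal is not None and depth == 0:
            current.append(literal)
            last_literal = True
        else:
            if current:
                runs.append("".join(current))
                current = []
            last_literal = False
    if current:
        runs.append("".join(current))
    return runs


def pick_shortcut(body, length=8):
    """Take the shortcut from the first run that is long enough, else None."""
    for run in literal_runs(body):
        if len(run) >= length:
            return run[:length]
    return None


runs = literal_runs(r"^https?:\/\/(serve[abc]|click\d)\.eviladserver\.com\/")
```

For the example regexp the runs come out as "http", "://" and ".eviladserver.com/", matching the list above.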
Last edited by Peng on Fri Jan 19, 2007 5:26 am, edited 1 time in total.
Matt Nordhoff
Wladimir Palant

Post by Wladimir Palant »

Sorry Peng, it really isn't that simple: regular expression syntax is pretty complicated to parse. Also, the new syntax has an advantage over regular expressions: reading "http://ad\d.server.com/" is easier than reading "/http://ad\d\.server\.com/". The dot is the most problematic character when writing regular expressions: it is very common in addresses, and escaping it makes the filter hard to read (not to mention that forgetting to escape the dot is a very common mistake). Regular expressions also seem to encourage filters like "///([^/]+\.)?ad(ima?ge?|manager|se?rv.*|stream|v|vert.*|x)?s?-?\d*\.(?!.+\.edu|jp/|$)/", something I want to get rid of.

In the end, nobody is forced to use the new syntax. If it is only used by the deregifier and the (still theoretical) automatic filter generator, I am fine with that.
Fox
Posts: 300
Joined: Sat Jun 10, 2006 3:05 pm
Location: Finland

Post by Fox »

Is it possible to use this syntax to block all of these with one (simple) filter?
140x120.jpg
142x360.jpg
242x240.jpg
I mean some way to say that there must be 3 digits, an x, 3 digits again, and then the .jpg extension, with no * in between.
Wladimir Palant

Post by Wladimir Palant »

That would be "\d\d\dx\d\d\d.jpg" or "\d{3}x\d{3}.jpg"
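
For comparison, a quick check of the regular expression equivalent of the second filter against the three file names. This is only a sketch: the helper name `is_size_image` is made up, and the dot is escaped here because in a regular expression, unlike in a simple filter, an unescaped dot matches any character.

```python
import re

# Regular-expression form of the filter "\d{3}x\d{3}.jpg" (dot escaped).
SIZE_IMAGE = re.compile(r"\d{3}x\d{3}\.jpg")


def is_size_image(address):
    """True if the address contains three digits, an x, three digits and .jpg."""
    return SIZE_IMAGE.search(address) is not None


for name in ("140x120.jpg", "142x360.jpg", "242x240.jpg", "logo.jpg", "14x120.jpg"):
    print(name, is_size_image(name))
```

The pattern is not anchored, so it matches anywhere in an address, just as a simple filter does.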
Fox
Posts: 300
Joined: Sat Jun 10, 2006 3:05 pm
Location: Finland

Post by Fox »

Thanks.

EDIT: Did I find a bug, or is the test page not ready?
This is my page: http://koti.mbnet.fi/foghorn/Fox/adblockingtest.htm
The pictures there are PNGs, so these are the filters:
\d\d\dx\d\d\d.png
\d{3}x\d{3}.png
and they don't match.

But these real regexps work in Adblock Plus:
/\d\d\dx\d\d\d\.png/
/\d{3}x\d{3}\.png/
Dr. Evil
Posts: 194
Joined: Fri Sep 08, 2006 3:51 pm

Post by Dr. Evil »

sounds great :)
Wladimir Palant

Post by Wladimir Palant »

@Fox: If I go to the test page, enter "\d\d\dx\d\d\d.png" as the filter and "http://koti.mbnet.fi/foghorn/Fox/140x120.png" as test address it says "Matched". In the current Adblock Plus version this filter doesn't work of course - this feature hasn't been implemented yet.
Fox
Posts: 300
Joined: Sat Jun 10, 2006 3:05 pm
Location: Finland

Post by Fox »

So I did enter the wrong address there :)
Locked