[Rejected] Opinions requested: extending filter syntax (2)

Postby Wladimir Palant » Wed Jan 17, 2007 8:16 pm

The goal is to be able to create more specific filters without having to resort to regular expressions, since we cannot speed up processing for regular expressions. I had some success implementing features from regular expressions in simple filters. The syntax is similar but not quite the same, because we need to stay backward compatible. Here are the new features:

* Predefined character classes and special characters: "http://ad\d.server.com/*". Supported are \b, \B, \d, \D, \f, \n, \r, \s, \S, \t, \v, \w, \W, \c<X>, \x<hh>, \u<hhhh>. The meaning is the same as in regular expressions.
* Custom character sets: "ad[sv]" meaning "ads or adv". Anything that is allowed in regular expressions should be allowed here as well - negated character sets, character ranges etc.
* Quantifiers: "abc{2,3}" meaning "abcc or abccc" (variations like {3} or {1,} are allowed as well). Quantifiers *, + and ? from regular expressions are specified like this: "abc{+}" (this is the syntax that is not the same as with regular expressions).
* Escaping: any special character can be escaped using a backslash. "a\*c" will match "a*c" but not "abc".
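
A minimal sketch of how a filter in this syntax could be translated into a regular expression. It is written in Python rather than the extension's JavaScript, and it is not the Adblock Plus code: the name `filter_to_regex`, the naive search for the closing "]" and the treatment of a malformed "{" as plain text are all assumptions made for illustration. Python's `re` module also has no \c<X> escape, so that one is not covered.

```python
import re

# Quantifier shorthands from the proposal: {+}, {*} and {?}.
SHORTHAND = {"{+}": "+", "{*}": "*", "{?}": "?"}
# Counted quantifiers such as {3}, {1,} or {2,3}.
COUNTED = re.compile(r"\{\d+(,\d*)?\}")


def filter_to_regex(flt: str) -> str:
    """Translate an extended simple filter into regular expression source."""
    out = []
    i = 0
    while i < len(flt):
        ch = flt[i]
        if ch == "\\" and i + 1 < len(flt):
            # \d, \w, \* and friends: copy the backslash and the next character.
            out.append(flt[i:i + 2])
            i += 2
        elif ch == "[" and "]" in flt[i + 1:]:
            # Custom character set: copied verbatim (naive search for "]").
            end = flt.index("]", i + 1)
            out.append(flt[i:end + 1])
            i = end + 1
        elif ch == "{" and "}" in flt[i + 1:]:
            end = flt.index("}", i + 1)
            body = flt[i:end + 1]
            if body in SHORTHAND:
                out.append(SHORTHAND[body])
                i = end + 1
            elif COUNTED.fullmatch(body):
                out.append(body)
                i = end + 1
            else:
                out.append(re.escape(ch))  # not a quantifier: literal brace
                i += 1
        elif ch == "*":
            out.append(".*")  # simple-filter wildcard
            i += 1
        else:
            out.append(re.escape(ch))  # plain text, e.g. "." becomes "\."
            i += 1
    return "".join(out)


print(filter_to_regex("abc{+}"))  # abc+
```

Plain text goes through `re.escape`, so the dot in an address never has to be escaped by hand, while everything that starts with a backslash is passed to the regular expression engine unchanged.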

I put up a test page you can play with. It will show you the regular expression corresponding to a filter (comments and element hiding rules won't be recognized) and match it against test strings you enter. I will extend this test page later to make it show the shortcuts that can be used for a particular filter. This is the same code that should go into the next major release of Adblock Plus so please report any bugs you notice.

These filter features are targeted at advanced users, primarily filter list authors. They will also make the deregifier more useful. What do you think? Will this work, do I need to change anything?
Wladimir Palant
ABP Developer
 
Posts: 8398
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: Opinions requested: extending filter syntax (2)

Postby chewey » Wed Jan 17, 2007 8:40 pm

To me this looks just like regular expressions, only a little less powerful.
How is this fundamentally different from regular expressions? Or is it
just supposed to be a subset of the whole regex syntax?

I don't get it - so I undoubtedly am missing the point.

What am I missing?
chewey
 
Posts: 501
Joined: Wed Jun 14, 2006 10:34 pm
Location: somewhere in Europe

Postby Wladimir Palant » Wed Jan 17, 2007 8:59 pm

I forgot to mention two issues. First: with this syntax I can no longer enforce the validity of the resulting regular expression. Quantifiers are the problem here: you can have filters like "ads{+}{+}" or "{?}ads", and these will be marked as invalid, something that never happened to simple filters before. The other thing is that it will be difficult (or maybe even impossible) to make the redundancy checker support the new syntax. Telling whether "ad[sv]" and "ad\w" are redundant is hard.
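
The leading-quantifier case can be illustrated with a small Python sketch, assuming "{?}" is translated to a bare "?". The helper name `compiles` is made up for this example. Only "{?}ads" is shown, because engines differ on doubled quantifiers: Python 3.11 and later, for instance, read "++" as a possessive quantifier, which says nothing about the JavaScript engine ABP runs on.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the regular expression source can be compiled."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles("ads?"))  # filter "ads{?}": True
print(compiles("?ads"))  # filter "{?}ads": False, there is nothing to repeat
```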

@chewey: The fundamental difference is that you cannot have alternative paths. Something like /(foo|bar)/ is not possible, which gives ABP the chance to extract a shortcut: if there is plain text in the filter, you can be sure that this text has to be present in the matching address as well. So simple filters with the new syntax can use the optimized processing while regular expressions cannot. Once this syntax is supported I want to officially deprecate regular expressions (they will still be supported, however).
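
A rough Python sketch of why plain text makes a safe shortcut and why alternatives do not. The shortcut is picked by hand here, and the names `simple`, `shortcut` and `matches_simple` are invented for the example; how ABP actually selects and stores shortcuts is not shown in this thread.

```python
import re

# Regexp equivalent of the simple filter "http://ad\d.server.com/*".
simple = re.compile(r"http://ad\d\.server\.com/.*")
# Plain text copied by hand from that filter; every match must contain it.
shortcut = ".server.com/"


def matches_simple(url: str) -> bool:
    if shortcut not in url:
        return False  # cheap rejection, the regexp is never run
    return simple.search(url) is not None


# With alternatives, no single piece of plain text is required.
alternatives = re.compile(r"(foo|bar)")
url = "http://example.com/bar.gif"
print(alternatives.search(url) is not None, "foo" in url)  # True False
```

The substring test can only reject addresses the regular expression would reject anyway, so it never changes the result, it only avoids work.
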
Wladimir Palant
ABP Developer
 
Posts: 8398
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Postby chewey » Thu Jan 18, 2007 1:00 am

Wladimir Palant wrote: The fundamental difference is that you cannot have alternative paths. Something like /(foo|bar)/ is not possible which gives ABP the chance to extract a shortcut - if there is plain text in the filter, you can be sure that this text has to be present in the matching address as well.

Ahhh, yes, I see. Thanks for the explanation.

Well - I don't see any harm in supporting this.
I guess a good part of the regular regexps can
be modified to use this "regexp light" syntax.
chewey
 
Posts: 501
Joined: Wed Jun 14, 2006 10:34 pm
Location: somewhere in Europe

Postby sheepy » Thu Jan 18, 2007 3:33 am

Wouldn't it be easier to do it the other way round and allow specifying a shortcut for a regex?

Manual shortcut maintenance is required in that case, however.
Actually, I'm alarmed by the rate at which new syntaxes are being introduced. :?


Still, I can see that this is a simple change to the parser, and I don't think compatibility will be a serious problem, so I'm not against it and will test it.
sheepy
 
Posts: 147
Joined: Sat Jun 17, 2006 8:44 pm

Postby Peng » Thu Jan 18, 2007 8:43 am

It sounds simpler to me to just parse regular expression filters to get shortcuts out of them: try to find eight characters in a row without any regexp special syntax in them.
Matt Nordhoff
Peng
 
Posts: 518
Joined: Fri Jun 09, 2006 8:14 pm
Location: Central Florida

Postby sheepy » Thu Jan 18, 2007 9:52 am

No, you can't, since it's possible that those characters are in a branch or are actually a string that must not match (e.g. negative lookahead/lookbehind).

If it were this simple, it'd be a simple expression instead of a regular expression. :wink:
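
The negative lookahead case can be shown in a few lines of Python. The pattern and both addresses are made up for the example; the point is only that the text inside the lookahead appears in the pattern but must be absent from a matching address.

```python
import re

# Hypothetical filter: match /ads/ unless it is followed by "trusted.example".
pattern = re.compile(r"/ads/(?!trusted\.example)")

blocked = "http://site.example/ads/banner.gif"
allowed = "http://site.example/ads/trusted.example/logo.gif"

print(pattern.search(blocked) is not None)  # True
print("trusted.example" in blocked)         # False
print(pattern.search(allowed) is not None)  # False
```

Using "trusted.example" as a shortcut here would skip exactly the addresses the filter is meant to match.
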
sheepy
 
Posts: 147
Joined: Sat Jun 17, 2006 8:44 pm

Postby Peng » Thu Jan 18, 2007 12:37 pm

sheepy wrote: No you can't, since it's possible those characters are in a branch or are actually string to not match (e.g. negative look ahead/behind).


Branch? As in parentheses? Well, obviously, you cut branches out.

So if you have the regexp:

Code:
/^https?:\/\/(serve[abc]|click\d)\.eviladserver.com\//


... you reduce it to:

Code:
http, ://, .eviladserver.com/


... find the first of those that's 8 characters long and make the shortcut from it.

Not trivial, but it shouldn't be that hard.

(Remove the character before ? or *, but remember to keep the one before +: it has to be there at least once.)
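
A rough Python sketch of this reduction, under several simplifying assumptions: parentheses are skipped wholesale, a top-level "|" gives up entirely, "]" inside a character set is not handled, and the function names are invented. One difference from the list above: the unescaped dot in "eviladserver.com" is a wildcard in a regexp, so a strict reading has to split there.

```python
from typing import List, Optional

# Extra characters consumed by \xhh, \uhhhh and \cX escapes.
ESCAPE_ARGS = {"x": 2, "u": 4, "c": 1}


def regex_literal_runs(regex: str) -> List[str]:
    """Return runs of plain text that every match must contain (conservative)."""
    runs: List[str] = []
    cur: List[str] = []
    depth = 0
    i = 0

    def flush() -> None:
        if cur:
            runs.append("".join(cur))
            cur.clear()

    while i < len(regex):
        ch = regex[i]
        if ch == "\\" and i + 1 < len(regex):
            nxt = regex[i + 1]
            if depth == 0 and not nxt.isalnum():
                cur.append(nxt)  # escaped punctuation such as \/ or \.
                i += 2
            else:
                flush()  # \d, \w, ... or anything inside a group
                i += 2 + ESCAPE_ARGS.get(nxt, 0)
        elif ch == "[":
            flush()
            end = regex.find("]", i + 1)  # naive: ignores "\]" inside the set
            i = len(regex) if end == -1 else end + 1
        elif ch == "(":
            flush()
            depth += 1
            i += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
            i += 1
        elif depth > 0:
            i += 1  # "cut branches out": skip everything inside parentheses
        elif ch == "|":
            return []  # top-level alternative: nothing is guaranteed
        elif ch in "?*{":
            if cur:
                cur.pop()  # the quantified character may be absent
            flush()
            end = regex.find("}", i + 1) if ch == "{" else i
            i = len(regex) if end == -1 else end + 1
        elif ch in "+.^$":
            flush()  # "+" keeps its character, the others are not plain text
            i += 1
        else:
            cur.append(ch)
            i += 1
    flush()
    return runs


def first_shortcut(regex: str, size: int = 8) -> Optional[str]:
    """First plain-text run with at least `size` characters, if any."""
    return next((run for run in regex_literal_runs(regex) if len(run) >= size), None)


example = r"^https?:\/\/(serve[abc]|click\d)\.eviladserver.com\/"
print(regex_literal_runs(example))  # ['http', '://', '.eviladserver', 'com/']
print(first_shortcut(example))      # .eviladserver
```

Even this small version needs special cases for escapes, sets, groups and each quantifier, and it still ignores a good part of the regexp syntax.
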
Last edited by Peng on Fri Jan 19, 2007 6:26 am, edited 1 time in total.
Matt Nordhoff
Peng
 
Posts: 518
Joined: Fri Jun 09, 2006 8:14 pm
Location: Central Florida

Postby Wladimir Palant » Thu Jan 18, 2007 2:53 pm

Sorry Peng, it really isn't that simple: regular expression syntax is pretty complicated to parse. Also, the new syntax has an advantage over regular expressions: reading "http://ad\d.server.com/" is easier than reading "/http://ad\d\.server\.com/". The dot is the most problematic character when writing regular expressions. It is very common in addresses, and escaping it makes the filter hard to read (not to mention that forgetting to escape the dot is a very common mistake). Regular expressions also seem to encourage filters like "///([^/]+\.)?ad(ima?ge?|manager|se?rv.*|stream|v|vert.*|x)?s?-?\d*\.(?!.+\.edu|jp/|$)/", something I want to get rid of.
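
The forgotten-dot mistake is easy to demonstrate. A short Python sketch with a made-up lookalike address: an unescaped dot matches any character, so the pattern matches more than intended.

```python
import re

unescaped = re.compile(r"http://ad\d.server.com/")    # dots left unescaped
escaped = re.compile(r"http://ad\d\.server\.com/")    # what was meant

lookalike = "http://ad1xserver-com/"  # hypothetical address without any dots

print(unescaped.search(lookalike) is not None)  # True
print(escaped.search(lookalike) is not None)    # False
```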

In the end, nobody is forced to use the new syntax. If it is only used by the deregifier and the (still theoretical) automatic filter generator, I am fine with that.
Wladimir Palant
ABP Developer
 
Posts: 8398
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Postby Fox » Thu Jan 18, 2007 3:18 pm

Is it possible to use those, or something similar, to block these with one (simple) filter?
140x120.jpg
142x360.jpg
242x240.jpg
I mean some way to say that there must be 3 digits, an x, 3 digits again and then the .jpg extension, with no * between them.
Fox
 
Posts: 300
Joined: Sat Jun 10, 2006 3:05 pm
Location: Finland

Postby Wladimir Palant » Thu Jan 18, 2007 3:20 pm

That would be "\d\d\dx\d\d\d.jpg" or "\d{3}x\d{3}.jpg"
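
The second form can be checked against the three file names with its regular expression equivalent. A small Python sketch; the two non-matching names are added here for contrast and are not from the thread.

```python
import re

# Regexp equivalent of the simple filter "\d{3}x\d{3}.jpg"
# (in a simple filter the dot is plain text, so it is escaped here).
size_jpg = re.compile(r"\d{3}x\d{3}\.jpg")

names = ["140x120.jpg", "142x360.jpg", "242x240.jpg", "14x120.jpg", "140x120.png"]
results = {name: size_jpg.search(name) is not None for name in names}
print(results)
```

The first three names match; "14x120.jpg" has too few digits and "140x120.png" has the wrong extension.
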
Wladimir Palant
ABP Developer
 
Posts: 8398
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Postby Fox » Thu Jan 18, 2007 3:23 pm

Thanks.

EDIT: Did I find a bug, or is the test page not ready?
This is my page: http://koti.mbnet.fi/foghorn/Fox/adblockingtest.htm
It has PNG pictures, so these are the filters:
\d\d\dx\d\d\d.png
\d{3}x\d{3}.png
and they don't match.

But these real regexps work in Adblock Plus:
/\d\d\dx\d\d\d\.png/
/\d{3}x\d{3}\.png/
Fox
 
Posts: 300
Joined: Sat Jun 10, 2006 3:05 pm
Location: Finland

Postby Dr. Evil » Thu Jan 18, 2007 4:30 pm

sounds great :)
Dr. Evil
 
Posts: 194
Joined: Fri Sep 08, 2006 3:51 pm

Postby Wladimir Palant » Thu Jan 18, 2007 4:49 pm

@Fox: If I go to the test page, enter "\d\d\dx\d\d\d.png" as the filter and "http://koti.mbnet.fi/foghorn/Fox/140x120.png" as the test address, it says "Matched". In the current Adblock Plus version this filter doesn't work, of course, because the feature hasn't been implemented yet.
Wladimir Palant
ABP Developer
 
Posts: 8398
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Postby Fox » Thu Jan 18, 2007 5:04 pm

So I did enter the wrong address there :)
Fox
 
Posts: 300
Joined: Sat Jun 10, 2006 3:05 pm
Location: Finland
