[Rejected] Opinions requested: extending filter syntax (2)
The goal is to be able to create more specific filters without having to resort to regular expressions, since we cannot speed up processing for regular expressions. I had some success implementing regular expression features in simple filters. The syntax is similar but not quite the same, because we need to stay backward compatible. Here are the new features:
* Predefined character classes and special characters: "http://ad\d.server.com/*". Supported are \b, \B, \d, \D, \f, \n, \r, \s, \S, \t, \v, \w, \W, \c<X>, \x<hh>, \u<hhhh>. The meaning is the same as in regular expressions.
* Custom character sets: "ad[sv]" meaning "ads or adv". Anything that is allowed in regular expressions should be allowed here as well - negated character sets, character ranges etc.
* Quantifiers: "abc{2,3}" meaning "abcc or abccc" (variations like {3} or {1,} are allowed as well). The quantifiers *, + and ? from regular expressions are written like this: "abc{+}" (this is where the syntax differs from regular expressions).
* Escaping: any special character can be escaped using a backslash. "a\*c" will match "a*c" but not "abc".
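To make the list above concrete, here is a minimal sketch of how such a filter could be turned into a regular expression. It is not the Adblock Plus code: `filter_to_regex` and `SHORT_QUANTIFIERS` are hypothetical names, it is written in Python rather than JavaScript, it assumes "*" keeps its usual wildcard meaning, and it ignores anchors, filter options and the \c<X> escape (which Python's re module does not have).

```python
import re

# Hypothetical mapping for the short quantifier forms described above.
SHORT_QUANTIFIERS = {"{+}": "+", "{*}": "*", "{?}": "?"}


def filter_to_regex(text):
    """Translate a filter in the proposed syntax into regex source (sketch)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Character class (\d, \w, ...) or escaped special character:
            # both characters are passed through unchanged.
            out.append(text[i:i + 2])
            i += 2
        elif ch == "[":
            # Custom character set: copy verbatim up to the closing bracket.
            j = i + 1
            while j < len(text) and text[j] != "]":
                j += 2 if text[j] == "\\" else 1
            if j >= len(text):
                raise ValueError("unterminated character set")
            out.append(text[i:j + 1])
            i = j + 1
        elif ch == "{":
            # Quantifier: {+}, {*}, {?} become +, *, ?; {2,3} etc. stay as-is.
            end = text.find("}", i)
            if end == -1:
                raise ValueError("unterminated quantifier")
            token = text[i:end + 1]
            out.append(SHORT_QUANTIFIERS.get(token, token))
            i = end + 1
        elif ch == "*":
            # Assumed: the simple-filter wildcard matches any run of characters.
            out.append(".*")
            i += 1
        else:
            # Everything else is plain text, so the dot needs no escaping
            # in the filter itself.
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


print(filter_to_regex("abc{+}"))  # abc+
print(filter_to_regex("ad[sv]"))  # ad[sv]
```

Note that the dot in "http://ad\d.server.com/*" ends up escaped in the output, so the filter author never has to write "\.".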
I put up a test page you can play with. It will show you the regular expression corresponding to a filter (comments and element hiding rules won't be recognized) and match it against test strings you enter. I will extend this test page later to make it show the shortcuts that can be used for a particular filter. This is the same code that should go into the next major release of Adblock Plus so please report any bugs you notice.
These filter features are targeted at advanced users, primarily filter list authors. They will also make the deregifier more useful. What do you think? Will this work, do I need to change anything?
Re: Opinions requested: extending filter syntax (2)
This looks just like regular expressions, only a little less powerful, to me.
How is this fundamentally different from regular expressions? Or is it
just supposed to be a subset of the whole regex syntax?
I don't get it - so I undoubtedly am missing the point.
What am I missing?
I forgot to mention two issues. First: with this syntax I can no longer enforce the validity of the resulting regular expression. Quantifiers are the problem here: you can have filters like "ads{+}{+}" or "{?}ads", and these will be marked as invalid, something that never happened to simple filters before. The other thing is that it will be difficult (or maybe even impossible) to make the redundancy checker support the new syntax. Telling whether "ad[sv]" and "ad\w" are redundant is hard.
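As an illustration of the first issue, the two problem filters could be caught by a structural check like the one below. This is only a sketch under my own assumptions: `has_dangling_quantifier` is a hypothetical name, and the rule it applies (a quantifier must directly follow something that can be repeated) is a simplification, not how Adblock Plus decides validity.

```python
def has_dangling_quantifier(text):
    """Return True if a {...} quantifier has nothing valid to repeat (sketch)."""
    quantifiable = False  # can the previous token take a quantifier?
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{":
            if not quantifiable:
                return True  # quantifier at the start or after another one
            end = text.find("}", i)
            if end == -1:
                return True  # unterminated quantifier
            i = end + 1
            quantifiable = False
        elif ch == "*":
            # Assumed: the wildcard already repeats, so it cannot be quantified.
            i += 1
            quantifiable = False
        elif ch == "\\" and i + 1 < len(text):
            # \b and \B are position assertions, everything else is a character.
            quantifiable = text[i + 1] not in "bB"
            i += 2
        elif ch == "[":
            # Skip the whole character set; it counts as one character.
            i += 1
            while i < len(text) and text[i] != "]":
                i += 2 if text[i] == "\\" else 1
            i += 1
            quantifiable = True
        else:
            i += 1
            quantifiable = True
    return False


print(has_dangling_quantifier("ads{+}{+}"))  # True
print(has_dangling_quantifier("{?}ads"))     # True
print(has_dangling_quantifier("abc{2,3}"))   # False
```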
@chewey: The fundamental difference is that you cannot have alternative paths. Something like /(foo|bar)/ is not possible, which gives ABP the chance to extract a shortcut: if there is plain text in the filter, you can be sure that this text has to be present in the matching address as well. So simple filters with the new syntax can use the optimized processing while regular expressions cannot. Once this syntax is supported I want to officially deprecate regular expressions (they will still be supported, however).
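The shortcut idea can be sketched as follows: without alternatives, every run of plain text in a filter must appear in any matching address, so one of those runs can serve as a cheap pre-check. The function name `literal_runs` is hypothetical, and dropping the character in front of every quantifier is my own conservative assumption (it may be optional or repeated); the real shortcut selection in Adblock Plus may work differently.

```python
def literal_runs(text):
    """Collect the plain-text runs of a new-syntax filter (sketch)."""
    runs, current = [], []

    def flush():
        if current:
            runs.append("".join(current))
            current.clear()

    # Extra characters consumed by \x<hh>, \u<hhhh> and \c<X>.
    extra = {"x": 2, "u": 4, "c": 1}
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt.isalnum():
                flush()              # \d, \w, ... are not plain text
                i += 2 + extra.get(nxt, 0)
            else:
                current.append(nxt)  # escaped special character is literal
                i += 2
        elif ch == "[":
            flush()                  # a character set is not plain text
            i += 1
            while i < len(text) and text[i] != "]":
                i += 2 if text[i] == "\\" else 1
            i += 1
        elif ch == "{":
            # The quantified character may repeat or be absent: drop it.
            if current:
                current.pop()
            flush()
            end = text.find("}", i)
            i = len(text) if end == -1 else end + 1
        elif ch == "*":
            flush()                  # wildcard ends the current run
            i += 1
        else:
            current.append(ch)
            i += 1
    flush()
    return runs


runs = literal_runs(r"http://ad\d.server.com/*")
print(runs)                # ['http://ad', '.server.com/']
print(max(runs, key=len))  # .server.com/
```

An address that does not contain the chosen run cannot match the filter, so the expensive regular expression test can be skipped for it.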
Wladimir Palant wrote: The fundamental difference is that you cannot have alternative paths. Something like /(foo|bar)/ is not possible which gives ABP the chance to extract a shortcut - if there is plain text in the filter, you can be sure that this text has to be present in the matching address as well.
Ahhh, yes, I see. Thanks for the explanation.
Well - I don't see any harm in supporting this.
I guess a good part of the regular regexps can
be modified to use this "regexp light" syntax.
Won't it be easier to do it the other way, to allow specifying a shortcut for a regexp?
Manual shortcut maintenance is required in that case, however.
Actually I'm alarmed by the rate at which new syntaxes are being introduced.
Still, I can see that this is a simple change to the parser, and I don't think compatibility will be a serious problem, so I'm not against it and will test it.
sheepy wrote: No you can't, since it's possible those characters are in a branch or are actually string to not match (e.g. negative look ahead/behind).
Branch? As in parentheses? Well, obviously, you cut branches out.
So if you have the regexp:
/^https?:\/\/(serve[abc]|click\d)\.eviladserver.com\//
you can pull out these pieces of plain text:
http, ://, .eviladserver.com/
Not trivial, but it shouldn't be that hard.
(Remove the character before ? or *, but remember to keep the one before +: it has to be there at least once.)
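A rough sketch of the procedure described in this post, for a deliberately small subset of regular expression syntax: cut out everything in parentheses, drop the character before ? or *, keep the one before +. The name `regex_literal_runs` is hypothetical, the input is the regexp source without the surrounding slashes, and many regexp features are not handled at all.

```python
def regex_literal_runs(source):
    """Extract plain-text runs that every match must contain (sketch)."""
    runs, current = [], []

    def flush():
        if current:
            runs.append("".join(current))
            current.clear()

    # Extra characters consumed by \x<hh>, \u<hhhh> and \c<X>.
    extra = {"x": 2, "u": 4, "c": 1}
    i, depth = 0, 0
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source):
            nxt = source[i + 1]
            if depth == 0:
                if nxt.isalnum():
                    flush()              # \d, \w, \1, ... are not plain text
                else:
                    current.append(nxt)  # \. or \/ is a literal character
            i += 2 + extra.get(nxt, 0)
        elif ch == "[":
            if depth == 0:
                flush()
            i += 1
            while i < len(source) and source[i] != "]":
                i += 2 if source[i] == "\\" else 1
            i += 1
        elif ch == "(":
            if depth == 0:
                flush()
            depth += 1                   # start of a branch: cut it out
            i += 1
        elif ch == ")":
            depth -= 1
            i += 1
        elif depth > 0:
            i += 1                       # inside a branch: ignore
        elif ch == "|":
            return []                    # top-level alternative: no guarantee
        elif ch in "?*{":
            if current:
                current.pop()            # preceding character may be absent
            flush()
            if ch == "{":
                end = source.find("}", i)
                i = len(source) if end == -1 else end
            i += 1
        elif ch in "+^$.":
            flush()                      # keeps the character before "+"
            i += 1
        else:
            current.append(ch)
            i += 1
    flush()
    return runs


print(regex_literal_runs(r"^https?:\/\/(serve[abc]|click\d)\.eviladserver.com\/"))
# ['http', '://', '.eviladserver', 'com/']
```

With this strict reading the last piece is split in two, because the dot before "com" is not escaped in the regexp above and therefore matches any character. With that dot escaped the result is http, ://, .eviladserver.com/ as listed in the post.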
Last edited by Peng on Fri Jan 19, 2007 5:26 am, edited 1 time in total.
Matt Nordhoff
Sorry Peng, it really isn't that simple: regular expression syntax is pretty complicated to parse. Also, the new syntax has an advantage over regular expressions: reading "http://ad\d.server.com/" is easier than reading "/http://ad\d\.server\.com/". The dot is the most problematic part of writing regular expressions. It is very common in addresses, and escaping it makes the filter hard to read (not to mention that forgetting to escape the dot is a very common mistake). Regular expressions also seem to encourage filters like "///([^/]+\.)?ad(ima?ge?|manager|se?rv.*|stream|v|vert.*|x)?s?-?\d*\.(?!.+\.edu|jp/|$)/", something I want to get rid of.
In the end nobody is forced to use the new syntax. If it is only used by the deregifier and the (still theoretical) automatic filter generator, I am fine with it.
Thanks.
EDIT: Did I find a bug, or is the test page not ready?
This is my page: http://koti.mbnet.fi/foghorn/Fox/adblockingtest.htm
There are PNG pics there, so these are the filters:
\d\d\dx\d\d\d.png
\d{3}x\d{3}.png
and they don't match.
But these real regexps work in Adblock Plus:
/\d\d\dx\d\d\d\.png/
/\d{3}x\d{3}\.png/
@Fox: If I go to the test page, enter "\d\d\dx\d\d\d.png" as the filter and "http://koti.mbnet.fi/foghorn/Fox/140x120.png" as test address it says "Matched". In the current Adblock Plus version this filter doesn't work of course - this feature hasn't been implemented yet.