Page 2 of 3

Re: [Roadmap] Faster filter matching algorithm

Posted: Mon Oct 04, 2010 10:11 am
by Wladimir Palant
A note so this doesn't confuse anybody: the change that just landed is a different thing, even though it also improves filter matching performance. This change doesn't affect filters, which is why it could land for Adblock Plus 1.3. In Adblock Plus 1.4 we will improve performance once more.

Re: [Roadmap] Faster filter matching algorithm

Posted: Tue Oct 05, 2010 11:15 pm
by Michael
Another advantage of the new algorithm is that, because each filter must contain a string of at least three characters, forums can more easily be searched for the origins of a suggestion, as many boards impose a minimum on the number of consecutive characters that they index.

Re: [Roadmap] Faster filter matching algorithm

Posted: Thu Oct 14, 2010 8:43 am
by Michael
Wladimir, how does the algorithm minimise the overlap of keywords when selecting them from filters?

Re: [Roadmap] Faster filter matching algorithm

Posted: Thu Oct 14, 2010 8:52 am
by Wladimir Palant
Yes, it tries to.

Re: [Roadmap] Faster filter matching algorithm

Posted: Thu Oct 14, 2010 9:57 pm
by Michael
Michael wrote:Wladimir, how does the algorithm minimise the overlap of keywords when selecting them from filters?
I am currently writing a blog post in preparation for the algorithm change and am therefore developing a tool to demonstrate the usage of each keyword within a subscription; however, while I was able to locate what I think is the relevant function, I was unable to determine exactly how it works.

Code:

let candidates = text.toLowerCase().match(/[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g);
if (!candidates)
    return null;
let hash = this.shortcutHash;
let result = null;
let resultCount = Infinity;
let resultLength = 0;
for (let i = 0, l = candidates.length; i < l; i++)
{
    let candidate = candidates[i].substr(1);
    let count = candidate in hash ? hash[candidate].filterCount : 0;
    if (count < resultCount || (count == resultCount && candidate.length > resultLength))
    {
        result = candidate;
        resultCount = count;
        resultLength = candidate.length;
    }
}
return result;

Re: [Roadmap] Faster filter matching algorithm

Posted: Fri Oct 15, 2010 7:46 am
by Wladimir Palant

Code:

    let count = candidate in hash ? hash[candidate].filterCount : 0;
    if (count < resultCount || (count == resultCount && candidate.length > resultLength))
This will prefer keywords that have fewer filters associated with them, ideally none. If the number of associated filters is the same as for the keyword we looked at before, the longer keyword wins.
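The selection rule can be sketched as a stand-alone function (a simplified illustration, not the actual Adblock Plus source; the hash shape with a filterCount property follows the snippet quoted above):

```javascript
// Illustrative sketch: pick the candidate keyword with the fewest associated
// filters; on a tie, prefer the longer keyword.
// `hash` maps keyword -> { filterCount: number } (assumed shape).
function selectKeyword(candidates, hash) {
  let result = null;
  let resultCount = Infinity;
  let resultLength = 0;
  for (const candidate of candidates) {
    const count = candidate in hash ? hash[candidate].filterCount : 0;
    if (count < resultCount ||
        (count === resultCount && candidate.length > resultLength)) {
      result = candidate;
      resultCount = count;
      resultLength = candidate.length;
    }
  }
  return result;
}
```

For example, with candidates ["ads", "banner"] and a hash where only "banner" already has a filter, "ads" wins because it is still unused; with an empty hash, both counts are zero and the longer "banner" wins the tie.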

Re: [Roadmap] Faster filter matching algorithm

Posted: Tue Nov 23, 2010 9:24 pm
by Wladimir Palant
This change just landed and will be in the next development build: https://hg.adblockplus.org/adblockplus/rev/65b0168d6a36

Please note that it will likely be reverted before the Adblock Plus 1.3.5 release - this change is planned for Adblock Plus 1.4, as subscription authors need time to adjust.

Re: [Roadmap] Faster filter matching algorithm

Posted: Fri Nov 26, 2010 7:57 pm
by lyricconch
1) You should not put the logic in the regexp; that makes it slow, and TraceMonkey (TM) cannot optimize it.

Code:

let candidates = text.toLowerCase().match(/[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g);
// SLOW 1: the regex engine needs backtracking to find the matching text
// (because of {3,} combined with the lookahead). Just test with
// "/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa*" and you will see why.
if (!candidates) return null;
for (let i = candidates.length - 1; i >= 0; i--) {
    let t = candidates[i].substr(1);
    // SLOW 2: array access becomes slow when the array length exceeds 256
    // (I have checked the source code of jsarray.cpp).
    // SLOW 3: the unnecessary substr() causes heap allocation/deallocation,
    // which is much more expensive than integer operations.
    ....
}
That's slow; it should be changed to:

Code:

let candidates = text.toLowerCase().match(/[a-z0-9%*]+/g);
// Now the regex engine does not need any backtracking.
if (!candidates) return null;
for each (let t in candidates)
    if (t.indexOf("*") === -1 && t.length >= 3) {
        // indexOf() does not cause any heap allocation/deallocation.
        // TraceMonkey likes your code better (yours gets 1 trace, this gets 4),
        // but once TraceMonkey has set up the trace tree, this will be much
        // faster than yours.
        ....
    }
2) Is it necessary to write a new filter type to support "an array of filters"?
Most filters have a unique keyword ("uniqueID"), and an array instance costs 16 bytes plus an additional 4 bytes per element (for length < 256).
You would also have to execute an unnecessary array loop (which is always slow) for each uniqueID filter (80%+ of them).
I think this would be better:

Code:

let filtersfirst = [];             // stores all the uniqueID filters
let filtersarray = [filtersfirst]; // stores the Group filters
for each (let t in candidates)
    if (t.indexOf("*") === -1 && t.length >= 3 && t in hash) {
        var r = hash[t];
        if (r.length)              // Group type
            filtersarray.push(r);
        else                       // uniqueID type
            filtersfirst.push(r);
    }
for each (let filters in filtersarray)
    for each (let f in filters) {
        f.match(.......);
    }
3) I am writing a general-purpose URL matcher now, like this:

Code:

Rule.prototype = {
// properties:
     source: "", // source of the rule, example:
// ||adblockplug.org^^posting.php^^sid=**    abp.i2;$Ip?$Stat&clength>10000&$ReqProxy#hits:3;since:20101101;
     hash: "",  // used for the fast check, e.g.: ||adblockplug.org^^posting.php^^sid=**
     regex: "", // used for matching, e.g.: /https?:\/+[^\/]*\badblockplus.org\b.+\bposting\.php\b.+\bsid=([^&]*)/
     flag: "",  // 0x10=ignorecase, 0x20=lazyinit, 0x40=cachedresult, b0-b3=group (to support whitelists)... e.g.: 0x12
     result: "", // returned to the caller on a match, for more information, e.g.: "abp"
     extend: function(ctx, ret) [ret, this.result], // extends the result with additional information,
// e.g.: function(ctx, reexec) return [reexec, this.result, ctx.providers.Ip(ret, this)]
     assert: function(ctx, ret) ret[0] ? ret : null, // asserted after a fit, can be used for callbacks,
// e.g.: function(ctx, retobj) { if (ctx.handles.Hit(this) && ctx.info.contentlength > 10000 && ctx.actions.ReqProxy(retobj[3])) return retobj }
     extra: {},  // extra data, e.g.: { hits: 3, since: 20101101 }
// methods:
     exec: function(ctx) {.....},  // powerful version
     test: function(ctx) {.....}   // fast version
}
4) Some keywords' filterCount should be adjusted before any filters are added, such as
http, https, ..., htm, ...; e.g. set them to 10. This is my Matcher.append:

Code:

Matcher.prototype.append=
function append(rule){
    var {hashes}= this;
    var rule= new Rule(rule);
    var keys= rule.hashes.match(/[%*a-z0-9]+/g); //1200ms/3600r
    var vchoice=3, hashed=null, vkey=0, key="", choice="*";
//logw(120,rule.text,rule.hashes,keys)
    for each(key in keys) {
        var vkey= key.length;
        if(vkey<vchoice||key.indexOf("*")!==-1) continue;
        if(key in hashes) {
            vkey-= hashes[key].length;
            if(vkey>=vchoice) vchoice=vkey,choice=key;
            else if(key.length>choice.length) choice=key;
        } else if(vkey>=8){
            choice=key;
            break;
        } else {
            vchoice=vkey;
            choice=key; 
        }
    }
    if(!(hashed=hashes[choice])) hashes[choice]= rule;
    else if(hashed.length>1) hashed.push(rule);
    else hashes[choice]= [hashed,rule];
    return choice;
}

Re: [Roadmap] Faster filter matching algorithm

Posted: Fri Nov 26, 2010 8:24 pm
by Wladimir Palant
I only had time to check your first suggestion so far. Please note that we aren't going to deal with pathological cases here - we are applying that regexp to filters. Your code will handle filters like "foobar" incorrectly - these are equivalent to "*foobar*", so "foobar" shouldn't be accepted as a keyword. I tried to fix this, but moving the logic out of the regexp made things significantly slower for me, not faster.
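The difference is easy to check directly (a sketch using the two regexps from this thread):

```javascript
// The original regexp demands a non-keyword character before the candidate
// and (via lookahead) after it, so a plain filter like "foobar" - which is
// equivalent to "*foobar*" - yields no keyword at all:
const strict = /[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g;
console.log("foobar".match(strict));      // null
console.log("&ad_banner=".match(strict)); // ["_banner"] -> substr(1) gives "banner"

// The simplified regexp alone would wrongly accept "foobar" as a keyword,
// which is why extra checks are needed in the loop:
const simple = /[a-z0-9%*]+/g;
console.log("foobar".match(simple));      // ["foobar"]
```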

Re: [Roadmap] Faster filter matching algorithm

Posted: Fri Nov 26, 2010 8:35 pm
by lyricconch
It can be changed to ("*" + filter.text + "*").match(/[a-z0-9%*]+/g).
In my case I have another function to generate the hash (it works on both regexp and simple rules),
so I do not use filter.text directly...

In my tests, /[a-z0-9%*]+/ followed by indexOf is faster than putting the logic in the regexp.
I have built a debug version of Firefox from hg.mozilla.org/mozilla-central
and got js.exe and xpcshell.exe for development.
I am using xpcshell.exe for module testing and js.exe for code analysis,
trying to find out how the trace JIT and method JIT optimize execution, to get the best performance.

Re: [Roadmap] Faster filter matching algorithm

Posted: Sat Nov 27, 2010 12:56 am
by Wladimir Palant
lyricconch wrote:array access will be slow when arraylength>256
Maybe - but a filter with more than 256 possible keywords is very unlikely. At least for the regular case a |for each| loop definitely seems slower (around 25% slower). I tried using regexp.exec() to get the matches one by one, that is also slower (faster than |for each| however).
it can change to ("*"+filter.text+"*").match(/[a-z0-9%*]+/g)
I tried that. With the original regexp, initializing 3700 filters took 66ms; with this regexp it takes 69ms. This is probably because the regexp returns more results, and iterating through them takes longer. The additional concatenation wastes another 10ms.

I tried the "optimal" variant - changed the regexp into /[a-z0-9%]{3,}/g and removed the substr() call. This saves 2ms, so that's the most I could gain by moving the logic out of the regexp. The additional logic that would be required in the loop will definitely need more time than that, so this doesn't sound like something that will work out.

Please note that I am doing my measurements in a Firefox 4 nightly, that's the environment that we should primarily focus on for optimizations (very soon most users will run that version).
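For reference, with the "*" padding and the in-loop checks applied, the two approaches discussed above do select the same candidates; the padding makes any token touching the ends of the filter contain "*" and get rejected, mirroring the lookahead requirement. A functional sketch only, not a benchmark:

```javascript
const strict = /[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g;
const simple = /[a-z0-9%*]+/g;

// Candidates via the original single-regexp approach; each match carries its
// leading separator character, which substr(1) strips off.
function strictCandidates(text) {
  const matches = text.toLowerCase().match(strict);
  return matches ? matches.map(m => m.substr(1)) : [];
}

// Candidates via the pad-then-filter approach suggested in this thread.
function simpleCandidates(text) {
  const tokens = ("*" + text.toLowerCase() + "*").match(simple);
  return tokens ? tokens.filter(t => t.indexOf("*") < 0 && t.length >= 3) : [];
}

console.log(strictCandidates("&banner&ads&")); // ["banner", "ads"]
console.log(simpleCandidates("&banner&ads&")); // ["banner", "ads"]
console.log(strictCandidates("foobar"));       // []
console.log(simpleCandidates("foobar"));       // []
```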

Re: [Roadmap] Faster filter matching algorithm

Posted: Sat Nov 27, 2010 1:09 am
by Wladimir Palant
lyricconch wrote: var vkey= key.length;
if(vkey<vchoice||key.indexOf("*")!==-1) continue;
if(key in hashes) {
vkey-= hashes[key].length;
if(vkey>=vchoice) vchoice=vkey,choice=key;
I don't think that this will create a sufficiently even distribution. I tried a bunch of approaches, including one very similar to this - they all produced slower matching times than what I have implemented now (take the keyword that has the lowest number of conflicts; if the number of conflicts is the same, take the longest keyword).

Re: [Roadmap] Faster filter matching algorithm

Posted: Sat Nov 27, 2010 1:16 am
by Wladimir Palant
lyricconch wrote:2) Is it necessary to write a new filter type to support "an array of filters"?
Most filters have a unique keyword ("uniqueID"), and an array instance costs 16 bytes plus an additional 4 bytes per element (for length < 256).
You would also have to execute an unnecessary array loop (which is always slow) for each uniqueID filter (80%+ of them).
I think you misread my code. I implemented the FilterGroup class exactly so that I don't need an array for each keyword. Normally hash[keyword] will be a single filter. Only when there is more than one filter for the same keyword is hash[keyword] set to a FilterGroup, and that filter group will loop through its filters to check for matches. This doesn't happen if the keyword is unique, so we still have optimal performance in that scenario - and we don't need to check whether we are dealing with a single filter or a filter group (that check costs time as well).
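The scheme can be sketched like this (class and method names here are illustrative, not the actual Adblock Plus identifiers):

```javascript
// Sketch of the single-filter vs. group scheme: hash[keyword] holds a lone
// filter directly; only on a collision is it replaced by a FilterGroup whose
// matches() loops over its members.
class FilterGroup {
  constructor(filters) { this.filters = filters; }
  matches(url) { return this.filters.some(f => f.matches(url)); }
  add(filter) { this.filters.push(filter); }
}

class Matcher {
  constructor() { this.hash = Object.create(null); }
  add(keyword, filter) {
    const existing = this.hash[keyword];
    if (!existing)
      this.hash[keyword] = filter;    // common case: no array allocated at all
    else if (existing instanceof FilterGroup)
      existing.add(filter);
    else
      this.hash[keyword] = new FilterGroup([existing, filter]);
  }
  matchesByKeyword(keyword, url) {
    const entry = this.hash[keyword];
    // Both a single filter and a FilterGroup expose matches(), so the fast
    // path needs no type check.
    return entry ? entry.matches(url) : false;
  }
}
```

The key point is that a single filter and a FilterGroup share the same matches() interface, so the lookup code never has to distinguish between them.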

Re: [Roadmap] Faster filter matching algorithm

Posted: Sat Nov 27, 2010 8:07 am
by lyricconch
Oh, I found out why we get different results with the same code:
we are using different test samples.
Most EasyList filters are "short" filters (meaning a filter has fewer than 3 keywords)
and are nicely written (meaning each "token" is longer than 3 characters; most of them are "trunks"),
but most of my filters are "long" filters (most of them match "by path")
and are not nicely written (because most of them are generated from regexps, they contain many "fragments").

My Matcher.append is almost the same as yours, but it breaks out of the loop when it finds a unique keyword with length >= 8;
it adjusts the "keyword priority" by checking the filter count in hash[keyword], to avoid accumulating a big array under a single keyword
(for example, the keyword "google" has vkey=6, but if hash["google"] already holds 4 filters, it is adjusted to 6-4=2, the same priority as a unique keyword "at" with vkey=2; compare hash["http"]=[] growing to hash["http"].length=256).

Sorry, I misread... yours is better.

Re: [Roadmap] Faster filter matching algorithm

Posted: Sat Nov 27, 2010 1:27 pm
by Wladimir Palant
It is pretty hard to make this code faster... I found one slight optimization however, barely measurable: https://hg.adblockplus.org/adblockplus/rev/92698e4d90e3