[Done] Faster filter matching algorithm

Various discussions related to Adblock Plus development

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Mon Oct 04, 2010 10:11 am

A note so this doesn't confuse anybody: the change that just landed is a different thing even though it also improves filter matching performance. This change doesn't impact filters which is why it could land for Adblock Plus 1.3. In Adblock Plus 1.4 we will improve performance once more.
Wladimir Palant
ABP Developer
 
Posts: 8398
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: [Roadmap] Faster filter matching algorithm

Postby Michael » Tue Oct 05, 2010 11:15 pm

Another advantage of the new algorithm is that, because a filter must contain a string of at least three characters, forums can be searched more easily for the origins of a filter suggestion, as many boards place limits on the minimum number of consecutive characters that they index.
Michael
 
Posts: 1361
Joined: Sat Dec 19, 2009 1:29 pm

Re: [Roadmap] Faster filter matching algorithm

Postby Michael » Thu Oct 14, 2010 8:43 am

Wladimir, how does the algorithm minimise the overlap of keywords when selecting them from filters?

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Thu Oct 14, 2010 8:52 am

Yes, it tries to.

Re: [Roadmap] Faster filter matching algorithm

Postby Michael » Thu Oct 14, 2010 9:57 pm

Michael wrote:Wladimir, how does the algorithm minimise the overlap of keywords when selecting them from filters?

I am currently writing a blog post in preparation for the algorithm change and am therefore developing a tool to demonstrate the usage of each keyword within a subscription; however, while I was able to locate the area where I believe the relevant function resides, I was unable to determine exactly how it works.

Code: Select all
let candidates = text.toLowerCase().match(/[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g);
if (!candidates)
    return null;
let hash = this.shortcutHash;
let result = null;
let resultCount = Infinity;
let resultLength = 0;
for (let i = 0, l = candidates.length; i < l; i++)
{
    let candidate = candidates[i].substr(1);
    let count = candidate in hash ? hash[candidate].filterCount : 0;
    if (count < resultCount || (count == resultCount && candidate.length > resultLength))
    {
        result = candidate;
        resultCount = count;
        resultLength = candidate.length;
    }
}
return result;

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Fri Oct 15, 2010 7:46 am

Code: Select all
    let count = candidate in hash ? hash[candidate].filterCount : 0;
    if (count < resultCount || (count == resultCount && candidate.length > resultLength))

This will prefer keywords that have fewer filters associated with them, ideally none. If the number of associated filters is the same as for the keyword we looked at before, the longer keyword wins.
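The rule described above can be sketched as follows. This is a minimal illustration with a plain keyword-to-count map and hypothetical names, not the actual Adblock Plus source:

```javascript
// Among the candidate keywords, pick the one with the fewest filters
// already associated with it; on a tie, prefer the longer keyword.
function pickKeyword(candidates, filterCountByKeyword)
{
  let result = null;
  let resultCount = Infinity;
  let resultLength = 0;
  for (let candidate of candidates)
  {
    let count = filterCountByKeyword[candidate] || 0;
    if (count < resultCount || (count == resultCount && candidate.length > resultLength))
    {
      result = candidate;
      resultCount = count;
      resultLength = candidate.length;
    }
  }
  return result;
}

// "ads" and "banner" are both unused (count 0), so the longer
// "banner" wins over "doubleclick", which already has 5 filters.
console.log(pickKeyword(["ads", "banner", "doubleclick"],
                        {doubleclick: 5}));
// → "banner"
```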

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Tue Nov 23, 2010 10:24 pm

This change just landed and will be in the next development build: https://hg.adblockplus.org/adblockplus/rev/65b0168d6a36

Please note that it will likely be reverted before the Adblock Plus 1.3.5 release - this change is planned for Adblock Plus 1.4, as subscription authors need time to adjust.

Re: [Roadmap] Faster filter matching algorithm

Postby lyricconch » Fri Nov 26, 2010 8:57 pm

1) The logic should not be processed inside the regexp - that is slow, and TraceMonkey cannot optimize it:
Code: Select all
let candidates = text.toLowerCase().match(/[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g);
// SLOW1: the regex engine needs to backtrack to find the matching text (because of {3,} followed by the lookahead).
// Just test with "/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa*" and you will see why.
if (!candidates) return null;
for (let i = candidates.length - 1; i >= 0; i--) {
    let t = candidates[i].substr(1);
// SLOW2: array access becomes slow when the array length exceeds 256 (I have checked the source code of jsarray.cpp).
// SLOW3: the unnecessary substr() causes an alloc/free, which is much more expensive than integer operations.
    ....
}


That is slow; it should be changed to:
Code: Select all
let candidates = text.toLowerCase().match(/[a-z0-9%*]+/g);
// Now the regex engine does not need to backtrack at all.
if (!candidates) return null;
for each (let t in candidates)
if (t.indexOf("*") === -1 && t.length >= 3) {
// indexOf() does not cause any heap memory alloc/free.
// TraceMonkey likes your code better (your code gets 1 trace, this gets 4),
// but once TraceMonkey has set up its trace tree, this will be much faster than yours.
    ....
}

2) Is it necessary to write a new filter type to support "an array of filters"?
Most filters have a unique keyword, an array instance costs 16 bytes plus an additional 4 bytes per element (length < 256),
and you have to execute an unnecessary array loop (which is always slow) for each unique-keyword filter (80%+).
I think this would be better:
Code: Select all
let filtersfirst = [];               // stores all filters with a unique keyword
let filtersarray = [filtersfirst];   // stores group filters
for each (let t in candidates)
if (t.indexOf("*") === -1 && t.length >= 3 && t in hash) {
    var r = hash[t];
    if (r.length)   // group type
        filtersarray.push(r);
    else            // unique-keyword type
        filtersfirst.push(r);
}
for each (let filters in filtersarray)
for each (let f in filters) {
    f.match(.......);
}

3) I am writing a general-purpose URL matcher now, like this:
Code: Select all
Rule.prototype = {
// properties:
    source: "",  // source of the rule, example:
// ||adblockplug.org^^posting.php^^sid=**    abp.i2;$Ip?$Stat&clength>10000&$ReqProxy#hits:3;since:20101101;
    hash: "",    // used for the fast check, e.g.: ||adblockplug.org^^posting.php^^sid=**
    regex: "",   // used for matching, e.g.: /https?:\/+[^\/]*\badblockplus.org\b.+\bposting\.php\b.+\bsid=([^&]*)/
    flag: "",    // 0x10=ignorecase, 0x20=lazyinit, 0x40=cachedresult, b0-b3=group (to support whitelists), e.g.: 0x12
    result: "",  // returned to the caller on a match for more information, e.g.: "abp"
    extend: function(ctx, ret) [ret, this.result],  // extends the result with additional information,
// e.g.: function(ctx, reexec) [reexec, this.result, ctx.providers.Ip(ret, this)]
    assert: function(ctx, ret) ret[0] ? ret : null, // asserted after a fit, can be used for callbacks,
// e.g.: function(ctx, retobj) { if (ctx.handles.Hit(this) && ctx.info.contentlength > 10000 && ctx.actions.ReqProxy(retobj[3])) return retobj }
    extra: {},   // extra data, e.g.: {hits: 3, since: 20101101}
// methods:
    exec: function(ctx) {.....},  // powerful version
    test: function(ctx) {.....}   // fast version
}


4) Some keywords' filterCount should be adjusted before any filter is added, such as
http, https, ..., htm, ..., e.g. set them to 10. This is my Matcher.append:
Code: Select all
Matcher.prototype.append=
function append(rule){
    var {hashes}= this;
    var rule= new Rule(rule);
    var keys= rule.hashes.match(/[%*a-z0-9]+/g); //1200ms/3600r
    var vchoice=3, hashed=null, vkey=0, key="", choice="*";
//logw(120,rule.text,rule.hashes,keys)
    for each(key in keys) {
        var vkey= key.length;
        if(vkey<vchoice||key.indexOf("*")!==-1) continue;
        if(key in hashes) {
            vkey-= hashes[key].length;
            if(vkey>=vchoice) vchoice=vkey,choice=key;
            else if(key.length>choice.length) choice=key;
        } else if(vkey>=8){
            choice=key;
            break;
        } else {
            vchoice=vkey;
            choice=key;
        }
    }
    if(!(hashed=hashes[choice])) hashes[choice]= rule;
    else if(hashed.length>1) hashed.push(rule);
    else hashes[choice]= [hashed,rule];
    return choice;
}
lyricconch
 
Posts: 14
Joined: Fri Apr 16, 2010 8:57 am

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Fri Nov 26, 2010 9:24 pm

I only had time to check your first suggestion so far. Please note that we aren't going to deal with pathological cases here - we are applying that regexp to filters. Your code will handle filters like "foobar" incorrectly - these are equivalent to "*foobar*", so "foobar" shouldn't be accepted as a keyword. I tried to fix this, but moving the logic out of the regexp made things significantly slower for me, not faster.
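The "foobar" case can be demonstrated directly. This is a small illustration, not code from either patch: the original regexp only accepts keywords that are delimited on both sides, so a plain filter text like "foobar" yields no keyword, while the simplified regexp wrongly accepts it:

```javascript
// Original: a keyword must be bounded by non-keyword characters on both sides.
let bounded = /[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g;
// Simplified variant without boundary logic.
let simple = /[a-z0-9%*]+/g;

console.log("foobar".match(bounded));  // null - correctly rejected
console.log("foobar".match(simple));   // ["foobar"] - wrongly accepted as a keyword

// lyricconch's later workaround: wrap the text in "*" and drop tokens that
// contain "*" - the wildcard then "contaminates" the unbounded token,
// so "foobar" is rejected again.
let tokens = ("*" + "foobar" + "*").match(simple)
    .filter(t => t.indexOf("*") < 0 && t.length >= 3);
console.log(tokens);  // []
```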

Re: [Roadmap] Faster filter matching algorithm

Postby lyricconch » Fri Nov 26, 2010 9:35 pm

It can be changed to ("*" + filter.text + "*").match(/[a-z0-9%*]+/g).
In my case I have another function to generate the hash (it works on both regexp and simple rules),
so I do not use filter.text directly....

In my tests, /[a-z0-9%*]+/ followed by indexOf() is faster than putting the logic into the regexp.
I have built a debug version of Firefox from hg.mozilla.org/mozilla-central
and got js.exe and xpcshell.exe for development.
I am using xpcshell.exe for module testing and js.exe for code analysis,
trying to find out how the trace JIT and method JIT optimize execution, to get the best performance.

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Sat Nov 27, 2010 1:56 am

lyricconch wrote:array access will be slow when arraylength>256

Maybe - but a filter with more than 256 possible keywords is very unlikely. At least for the regular case a |for each| loop definitely seems slower (around 25% slower). I tried using regexp.exec() to get the matches one by one, that is also slower (faster than |for each| however).
lyricconch wrote:it can change to ("*"+filter.text+"*").match(/[a-z0-9%*]+/g)

I tried that. With the original regexp, initializing 3700 filters took 66ms; this regexp takes 69ms. This is probably because the regexp returns more results, and iterating through them takes longer. And the additional concatenation wastes another 10ms.

I tried the "optimal" variant - changed the regexp into /[a-z0-9%]{3,}/g and removed the substr() call. This saves 2ms, so that's the most I can gain by moving the logic out of the regexp. The additional logic that will be required in the loop will definitely need more time, doesn't sound like something that will work out.
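A comparison of this kind could be timed roughly as follows. This is a sketch with synthetic sample filters, not the actual 3700-filter list behind the numbers above, so absolute timings will differ:

```javascript
// Synthetic sample filter texts, stand-ins for a real subscription.
let filters = [];
for (let i = 0; i < 3700; i++)
  filters.push("||example" + i + ".com/ads/banner" + i + ".gif");

function timeRun(label, extractCandidates)
{
  let start = Date.now();
  for (let text of filters)
    extractCandidates(text.toLowerCase());
  console.log(label + ": " + (Date.now() - start) + "ms");
}

// Original: boundary checks inside the regexp, substr(1) to drop the delimiter.
timeRun("bounded regexp", text =>
  (text.match(/[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g) || []).map(c => c.substr(1)));

// The "optimal" variant described above: no boundary logic, no substr() call.
timeRun("simple regexp", text =>
  text.match(/[a-z0-9%]{3,}/g) || []);
```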

Please note that I am doing my measurements in a Firefox 4 nightly, that's the environment that we should primarily focus on for optimizations (very soon most users will run that version).

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Sat Nov 27, 2010 2:09 am

lyricconch wrote: var vkey= key.length;
if(vkey<vchoice||key.indexOf("*")!==-1) continue;
if(key in hashes) {
vkey-= hashes[key].length;
if(vkey>=vchoice) vchoice=vkey,choice=key;

I don't think that this will create an equal enough distribution. I tried a bunch of approaches including one that was very similar to this one - they all produced slower matching times than what I implemented now (take the keyword that has the lowest number of conflicts, if the number of conflicts is the same take the longest keyword).

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Sat Nov 27, 2010 2:16 am

lyricconch wrote:2) Is it necessary to write a new filter type to support "an array of filters"?
Most filters have a unique keyword, an array instance costs 16 bytes plus an additional 4 bytes per element (length < 256),
and you have to execute an unnecessary array loop (which is always slow) for each unique-keyword filter (80%+).

I think you misread my code. I implemented this FilterGroup class exactly so that I don't need arrays for each keyword. Normally hash[keyword] will be a filter. Only when you have more than one filter on the same keyword hash[keyword] is set to a FilterGroup. And that filter group will loop through the filters to check for matches. But that doesn't happen if the keyword is unique, we still have optimal performance in that scenario - and we don't need to check whether we deal with a single filter or a filter group (this costs time as well).
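The design described here can be sketched like this. Class and method names are hypothetical, not the actual ABP source; the point is that a single filter and a filter group expose the same interface, so the lookup code never checks which one it got:

```javascript
function Filter(text)
{
  this.text = text;
}
Filter.prototype.matches = function(location)
{
  return location.indexOf(this.text) >= 0;
};

function FilterGroup(filters)
{
  this.filters = filters;
}
FilterGroup.prototype.matches = function(location)
{
  // Only keywords shared by several filters pay the cost of this loop.
  return this.filters.some(f => f.matches(location));
};

// hash[keyword] is a plain Filter for unique keywords and a
// FilterGroup only when several filters share the keyword.
let hash = {
  unique: new Filter("unique/banner"),
  shared: new FilterGroup([new Filter("shared/a"), new Filter("shared/b")])
};

// Caller code is identical in both cases - no type check needed:
console.log(hash["unique"].matches("http://example.com/unique/banner.gif"));  // true
console.log(hash["shared"].matches("http://example.com/shared/b.gif"));       // true
```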

Re: [Roadmap] Faster filter matching algorithm

Postby lyricconch » Sat Nov 27, 2010 9:07 am

Oh, I found out why we get different results with the same code:
we are using different test samples.
Most EasyList filters are "short" filters (meaning a filter has fewer than 3 keywords)
and are nicely written (meaning each "token" is longer than 3 characters; most of them are "trunks"),
but most of my filters are "long" filters (most of them match "by path")
and are not nicely written (because most of them are generated from regexps, they contain many "fragments").

My Matcher.append is almost the same as yours, but it breaks out of the loop when it finds a unique keyword with length >= 8;
it adjusts the "keyword priority" by checking the filter count in hash[keyword], to avoid accumulating a big array for one keyword.
(For example, the keyword "google" has vkey=6, but if hash["google"] holds 4 filters it is adjusted to 6-4=2, the same priority as the unique keyword "at" (vkey=2). (hash["http"]=[]; hash["http"].length=256))

Sorry, I misread.... yours is better....

Re: [Roadmap] Faster filter matching algorithm

Postby Wladimir Palant » Sat Nov 27, 2010 2:27 pm

It is pretty hard to make this code faster... I found one slight optimization however, barely measurable: https://hg.adblockplus.org/adblockplus/rev/92698e4d90e3
