Bug in Matcher.findKeyword?

Various discussions related to Adblock Plus development
Post Reply
deNULL
Posts: 1
Joined: Thu Feb 25, 2016 3:58 pm

Bug in Matcher.findKeyword?

Post by deNULL »

Hello, I'm having some troubles adding a new filter to my subscription file. When I add the line "agentguru.ru$third-party" (without quotes) it doesn't work as expected — scripts from the domain agentguru.ru are not blocked.

After digging into code, I found this line in Matcher.findKeyword():

Code: Select all

let candidates = text.toLowerCase().match(/[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])/g);
(https://hg.adblockplus.org/adblockplusc ... er.js#l125)

The regular expression looks wrong: asterisks are probably meant to be outside of square brackets, like /[^a-z0-9%]*[a-z0-9%]{3,}(?=[^a-z0-9%]*)/g. In its current form this expression fails to capture any keywords from url unless it starts with a special character.

When I fix this part of code (in Chrome DevTools), AdBlock starts working correctly, blocking scripts from the domain mentioned above.
User avatar
mapx
Posts: 21940
Joined: Thu Jan 06, 2011 2:01 pm

Re: Bug in Matcher.findKeyword?

Post by mapx »

you should use

Code: Select all

||agentguru.ru^$third-party
en/filters#anchors
lewisje
Posts: 2743
Joined: Mon Jun 14, 2010 12:07 pm

Re: Bug in Matcher.findKeyword?

Post by lewisje »

Actually, that regex is correct; it's used in the faster filter-matching algorithm introduced in 2010: forum/viewtopic.php?t=6118

The regex means "match all substrings that start with a character other than a letter, a digit, %, or *, followed by at least three characters that are letters, digits, or %, that are followed by a character other than a letter, a digit, %, or * (but this last character is not part of the matched substring)."

Anyway, the reason your original filter didn't work out so well was that it had no keyword in it: A URL pattern that begins with a character other than * or | has an implicit *, and same for ending (except if both beginning and ending characters are /, in which case it's a regex filter); a keyword in a URL pattern is a sequence of letters, digits, and percent signs (for percent-encoding) that is at least three characters long and cannot be a subset of a longer such sequence in a URL (that means it is delimited on both ends by the URL-end character |, by the "all-subdomains, both protocols" initial sequence ||, by the separator-wildcard ^, or by a non-special character that is not a letter, a digit, or %), and a blocking or whitelist filter from which a keyword can be extracted can be used very quickly.

The only difference between my description and what the regex catches is that the regex also captures the special character at the start of the keyword, and this is only because JavaScript regexes do not support lookbehind (the ?= sequence is lookahead, and the ?! sequence is negative lookahead), and it's less efficient to make an inner capturing group just to get the part formally known as the "keyword"; later on, the substr method on strings is used to disregard that excess character: https://hg.adblockplus.org/adblockplusc ... er.js#l134

Still, your unoptimizable filter should have been used somehow, just regarded as "slow" (snail icon) in the ABP filter list.
There's a buzzin' in my brain I really can't explain; I think about it before they make me go to bed.
Post Reply