Recognizing third-party content · 2007-01-31 16:07 by Wladimir Palant

I have done all the preparation work, so now I can finally implement the $third-party filter option, which restricts filters to third-party or same-party content. This would be used for filters like */banners/*$third-party: if some webmaster is crazy enough to call the directory with site logos “banners”, those still won’t be blocked. Such a filter will only block something coming from a “banners” directory on a different server.

That’s the theory at least. However, recognizing what is third-party and what isn’t turned out to be a difficult task, and efficiency concerns (the third-party check will have to be done for every address) don’t make it easier. Usually something is considered third-party if it comes from a different second-level domain, e.g. bugzilla.mozilla.org and addons.mozilla.org are same-party while mozilla.org and adblockplus.org are not. To recognize the second-level domain part, Firefox (Gecko) usually follows the one dot rule: the second-level domain is the ending of the server name that contains at most one dot. Unfortunately, this treats “co.uk” as a second-level domain, so all sites registered under co.uk would count as same-party.
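
To make the one dot rule concrete, here is a minimal sketch in Python (not the language Adblock Plus is written in). The function names `base_domain_one_dot` and `is_third_party` are made up for illustration, and the sketch ignores IP addresses and ports.

```python
def base_domain_one_dot(host: str) -> str:
    """Return the ending of the server name that contains at most one dot."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_third_party(request_host: str, document_host: str) -> bool:
    """A request is third-party if its base domain differs from the page's."""
    return base_domain_one_dot(request_host) != base_domain_one_dot(document_host)


if __name__ == "__main__":
    print(is_third_party("bugzilla.mozilla.org", "addons.mozilla.org"))  # False
    print(is_third_party("mozilla.org", "adblockplus.org"))              # True
    # The weakness of the rule: both hosts reduce to "co.uk".
    print(is_third_party("shop.example.co.uk", "ads.other.co.uk"))       # False
```

The last call shows the problem described above: two unrelated co.uk sites are wrongly treated as same-party.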

Now Gecko 1.9 has a new mechanism for recognizing top-level domains: the Effective TLD Service. Its database of top-level domains isn’t complete yet, but it is already good enough. So one could use it to find the top-level domain and then go to the next dot to the left, which marks where the second-level domain begins. Of course, this would only work in Firefox 3 and other browsers based on Gecko 1.9; in older browsers the one dot rule will have to do.
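
The idea of finding the effective top-level domain first and then taking one more label can be sketched as follows. This is Python for illustration only and does not use the actual Gecko service. The suffix set `KNOWN_SUFFIXES` is a tiny hypothetical stand-in for the real database, and `base_domain` is a made-up name.

```python
from typing import Optional

# Hypothetical, heavily truncated stand-in for the effective TLD database.
KNOWN_SUFFIXES = frozenset({"com", "org", "uk", "co.uk"})


def base_domain(host: str, suffixes=KNOWN_SUFFIXES) -> Optional[str]:
    """Return the effective TLD plus one more label, or None if the host
    is itself an effective TLD. Unknown endings use the one dot rule."""
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate ending first so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the whole host is an effective TLD
            return ".".join(labels[i - 1:])
    # No known suffix matched: fall back to the one dot rule.
    return ".".join(labels[-2:])


if __name__ == "__main__":
    print(base_domain("www.example.co.uk"))     # example.co.uk
    print(base_domain("bugzilla.mozilla.org"))  # mozilla.org
```

With this approach two different co.uk sites get different base domains, so requests between them are correctly seen as third-party.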

Yet there are more issues. For reasons I don’t know, the Effective TLD Service requires the server name to be encoded in UTF-8. Adblock Plus has it in UTF-16, however. So to use the service properly, the server name would need to be converted to UTF-8, a fun way to waste CPU time. One can go without converting, of course, but that might cause wrong results with some international domain names (fortunately there are no international TLDs yet). Finally, a third alternative would be to look for non-ASCII characters in the server name and fall back to the one dot rule if there are any. Right now I am a little undecided about which solution I should choose. Update: looking into this more, this last issue isn’t as critical as I first thought. However, bug 368989 is a showstopper at the moment.
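
The third alternative (check for non-ASCII characters and fall back to the one dot rule) could look roughly like this. It is a Python sketch under assumptions: `tld_lookup` stands for whatever function wraps the Effective TLD Service, and both it and the other names here are hypothetical.

```python
from typing import Callable


def one_dot_rule(host: str) -> str:
    """Ending of the server name that contains at most one dot."""
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def base_domain_with_fallback(host: str, tld_lookup: Callable[[str], str]) -> str:
    """Use the TLD-based lookup only for pure-ASCII server names.

    For ASCII names the encoding question does not matter, so no conversion
    is needed. Names with non-ASCII characters skip the lookup entirely and
    use the one dot rule instead.
    """
    if host.isascii():
        return tld_lookup(host)
    return one_dot_rule(host)
```

The ASCII check is a single cheap pass over the string, which fits the efficiency concern, at the price of less accurate results for international domain names.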

Comment [4]

  1. chewey · 2007-01-31 23:53 · #

    To me, this sounds like we should at least wait for Gecko 1.9 to be in official builds. $third-party would definitely be nice to have, but according to your description doesn’t seem to be worth the hassle at the moment.

  2. Matt Nordhoff (Peng) · 2007-02-01 18:15 · #

    Option 4: Try to get the Effective TLD Service to support UTF-16? Easier said than done?

    Reply from Wladimir Palant:

    I guess it can be done; nothing is using the Effective TLD Service yet (not counting my patch in bug 368700, which has to convert to UTF-8 explicitly). But I would first need to find out why UTF-8 was used in the first place, probably because the rules database is stored in UTF-8.

    Reply from Wladimir Palant:

    Ok, I completely missed something: Gecko usually stores URLs internally in UTF-8. Thanks to Boris Zbarsky for clarifying. However, this brings up bug 368989 so I’ll have to see how this will be solved.

  3. pirlouy · 2008-05-14 20:49 · #

    Any news on this? Any news on AB+ development, in fact?
    I guess you’re busy, and some demotivation is probably inevitable. Can we say that no AB+ enhancement is planned for the next weeks/months?

    Reply from Wladimir Palant:

    I definitely want to start working on this one again. But, as usual, I don’t know when I will get time.

  4. Pete · 2009-01-19 14:21 · #

    Very interested to read your notes.

    I’ve written a FF browser add-on which includes a feature to check pages for links to a list of unwanted third-party domains, not unlike Adblock in some ways. It also checks and logs unexpected browser status changes (such as redirects).

    Perhaps it would be possible to use the browser status change events to suppress third party content?

    You could start with a basic method of operation, i.e. a simplistic suppression of outgoing browser requests for images or JavaScript if host != last request host, and perhaps add more sophisticated features later?

    Other areas that need deeper consideration would probably include IFRAME content, where the source address of the IFRAME content does not match the source address of the page.

    Pete.

    Reply from Wladimir Palant:

    The feature discussed here has been implemented in Adblock Plus 1.0. See also http://adblockplus.org/blog/third-party-javascript-yes-it-is-a-security-risk for one of its possible use cases.
