Filtering HTML code in Adblock Plus · 2008-09-11 18:54 by Wladimir Palant

Henrik Gemal blogged about a new feature in Firefox: extensions can now inspect and modify the response to an HTTP request before it reaches the requester. And the best news is that it is coming to Firefox 3.0.3 as well, so extension developers don’t need to wait a year before this feature can be used. Obviously, Firebug and Firekeeper developers want this: the former to display the response, the latter to prevent a malicious response from ever reaching the requester. However, it could be useful for Adblock Plus as well.

Now that we have a relatively simple and supported way to change the response coming from a server, filtering HTML code in Adblock Plus is worth considering. The filter rules could look like http://example.com/bad/*.html$filter=/window\.open\(.*?\)/. This filter would apply only to pages matching http://example.com/bad/*.html and would remove everything matching the regular expression /window\.open\(.*?\)/. This could be used to get rid of inline JavaScript doing something bad, like opening pop-ups. Or to remove parts of a page where element hiding hits its limits.
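
A minimal sketch of how such a rule might be parsed and applied, in plain JavaScript outside of any extension code. The function names parseHtmlFilter and applyHtmlFilter are hypothetical, and the URL part is treated as a simple wildcard pattern matched anywhere in the address, which is a simplification of the real Adblock Plus filter syntax.

```javascript
// Hypothetical helper: split "<url pattern>$filter=/<regexp>/" into its parts.
function parseHtmlFilter(rule) {
  const marker = "$filter=";
  const pos = rule.indexOf(marker);
  if (pos < 0) return null;

  const urlPattern = rule.slice(0, pos);
  const body = rule.slice(pos + marker.length);
  if (body.length < 2 || body[0] !== "/" || body[body.length - 1] !== "/") {
    return null;
  }

  // Assumption: "*" is the only wildcard; everything else is matched literally.
  const escaped = urlPattern
    .replace(/[.+?^${}()|[\]\\\/]/g, "\\$&")
    .replace(/\*/g, ".*");

  return {
    url: new RegExp(escaped),
    remove: new RegExp(body.slice(1, -1), "g"),
  };
}

// Remove all matches from the page source, but only on matching addresses.
function applyHtmlFilter(filter, url, html) {
  return filter.url.test(url) ? html.replace(filter.remove, "") : html;
}

const rule = String.raw`http://example.com/bad/*.html$filter=/window\.open\(.*?\)/`;
const filter = parseHtmlFilter(rule);
const page = `<a onclick="window.open('http://ads.example/')">x</a>`;

console.log(applyHtmlFilter(filter, "http://example.com/bad/page.html", page));
// prints: <a onclick="">x</a>
```

Note that this sketch works on the complete page source at once, which is not how the response actually arrives in the browser.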

However, there are still open questions. One is performance: how much will filtering slow down page load? Probably quite a bit, so the expression before the $filter part should be as specific as possible to prevent unnecessary filtering of web pages. Then there is the problem that web page content arrives in chunks, so the text to be removed may be split across chunk boundaries. How can one apply a regular expression to it? So far I see only one solution: specifying two regular expressions, one of which marks the beginning of text to be removed and one the end of it. And the beginning marker shouldn’t span too much text so that it can always be found by looking at two consecutive chunks.
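
The two-marker idea can be sketched roughly as follows. This is only an illustration under several assumptions: the markers are plain strings rather than regular expressions, the markers themselves are removed along with the text between them, and createChunkFilter is a made-up name. The filter holds back the last few characters of each chunk, since they might be the beginning of a marker that continues in the next chunk.

```javascript
// Hypothetical streaming filter: removes everything from startMarker
// through endMarker, even if either marker is split across chunks.
function createChunkFilter(startMarker, endMarker) {
  let pending = "";     // held-back tail that may contain a partial marker
  let removing = false; // true while we are between the two markers

  return {
    // Feed one chunk, get back the text that is safe to pass on.
    push(chunk) {
      let text = pending + chunk;
      let out = "";
      pending = "";
      for (;;) {
        const marker = removing ? endMarker : startMarker;
        const pos = text.indexOf(marker);
        if (pos >= 0) {
          if (!removing) out += text.slice(0, pos);
          text = text.slice(pos + marker.length);
          removing = !removing;
          continue;
        }
        // No complete marker: keep back up to marker.length - 1 characters.
        const keep = Math.min(text.length, marker.length - 1);
        const safe = text.length - keep;
        if (!removing) out += text.slice(0, safe);
        pending = text.slice(safe);
        return out;
      }
    },
    // Call at the end of the response to flush the held-back tail.
    // Assumption: an unterminated removal drops the rest of the page.
    finish() {
      const rest = removing ? "" : pending;
      pending = "";
      return rest;
    },
  };
}

const chunkFilter = createChunkFilter("<script", "</script>");
let result = "";
for (const chunk of ["abc<scr", "ipt>x</script>y"]) {
  result += chunkFilter.push(chunk);
}
result += chunkFilter.finish();
console.log(result); // prints: abcy
```

The price of this approach is that output is always delayed by a few characters, and regular expression markers would need a bound on how much text they can span to make the same trick work.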

Then there is the user interface for this feature. Ideally, matching filters should be displayed in the list of blockable items, so that the user can see when something is removed from the current page. But for that it is necessary to associate an HTTP request with a tab (a content window) in Firefox. I know there is a hack to do it, but is there a proper way as well? Finally, what kind of help can be offered when such a filter needs to be created? The problem is that the DOM tree of a web page has no connection to its source code, so a selection made in a live web page cannot be mapped back to the source. The only option that I see right now is displaying the source code (that might already be a problem if it wasn’t cached) and letting the user select markers for the beginning and end of the code to remove. At least it should be possible to show a preview.

Any comments, ideas?

Comment [12]

  1. Jake · 2008-09-11 20:32 · #

    Great work. I’d like to see features like the RIP add-on: being able to match IDs or certain classes and kill them, so the content won’t be loaded.

    I can see this being very powerful.

    One thing that has always, always bugged me:

    Blocking ads from

    ads.adcompany.com
    www.adcompany.com
    cdn.adcompany.com
    adcompany.com

    does this need TWO entries?

    .adcompany.com/ and
    adcompany.com/* ?

    Reply from Wladimir Palant:

    Have you seen the Element Hiding Helper extension? The element hiding rules can already hide everything that RIP can hide, and element hiding works immediately rather than waiting for the page to load. This is about going further and dealing with cases where RIP/EHH won’t help you.

  2. malte · 2008-09-11 23:06 · #

    Why regular expressions? Simple string matching should do it as well. Rules could look something like this: |http://evil-filesharer-site.fake/|$start-source-filter=onclick=,end-source-filter=href=
    (Though start-source-filter and end-source-filter are a bit long, but this isn’t about the naming.)

    The strings can be parsed much more easily, so you could also delay handing over (parts of) a chunk as long as you can’t be sure that it won’t match a filter.

    But there is at least one other problem: how do you determine the channels you’re gonna watch? It’s hard to know whether a url typed into the location bar is an image (or some other binary data you don’t want to listen to) or a web page. And surely people are gonna write bad filters.

    Besides, there’s even more you could do with that functionality. It should be possible not only to remove but also to add or replace content. (e.g. to replace the adblock detection script with one that always says that the ads have fully loaded)

    Reply from Wladimir Palant:

    The example is a regular expression simply because we will certainly see variations in the HTML code that needs to be removed, and it will be difficult to handle these variations with plain strings. But you are right, strings can be handled much more easily.

    As to which channels to apply this to – I can look at the MIME type of the response and ignore “image/*” and “application/octet-stream”. At least that’s one possibility.

    As to removing/adding content – I forgot to mention that in the article. I definitely don’t want that functionality because it would give subscription authors a way to XSS any web site. And while I am certain that subscription authors are generally trustworthy, I don’t want to rely on that (and I also don’t want to require HTTPS to be used for subscription downloads).

  3. Robert Wetzlmayr · 2008-09-11 23:15 · #

    May I ask a layman’s question not very related to Adblock Plus but to the security topic in general:

    Will this introspection capability increase the abuse potential of a fictitious malicious FF extension with respect to phishing, keylogging et cetera? How tight are the quality control procedures at addons.mozilla.org?

    Reply from Wladimir Palant:

    A malicious Firefox extension could already do all these things – maybe it was a little more complicated but it was possible. In general, once you have a malicious extension installed you have already lost; attempting to restrict it after that is pointless.

    As to quality control procedures at addons.mozilla.org – good question. They used to be pretty non-existent. My impression is that reviewers now try to find common security vulnerabilities, yet I wouldn’t rely on AMO’s quality control too much.

  4. Boris · 2008-09-12 00:07 · #

    Is the “hack” you refer to looking at the notification callbacks on the loadgroup?

    Reply from Wladimir Palant:

    Yes, I think that was it – from the code you can see that you will (usually) get a docshell this way, but it isn’t documented anywhere, meaning that it is an implementation detail.

  5. timeless · 2008-09-12 17:01 · #

    i’d say it isn’t worth it.

    (function (){
    a={};
    var x=a.parent;
    var b='';
    for each (c in [111, 112, 101, 110]) b+=String.fromCharCode(c);
    x[b]("http://mozilla.com");
    })()

    this is what noscript is for. between noscript and the popup blocker, it was a royal pain to write this testcase.

    Reply from Wladimir Palant:

    Yes, obfuscation, sure – but I didn’t mean to have generic rules with this feature; it is meant to “fix” specifically the few sites where everything else fails. Obfuscation won’t help if the filter is targeted at this particular kind of obfuscation. NoScript cannot be recommended (if only because it breaks too many sites, but also because of quirkiness in the user interface), and not all the annoyances are JavaScript-related. The popup blocker fails if the site abuses an authentic click event (some do).

  6. Verb · 2008-09-12 17:59 · #

    Nice, hope to see ABP benefit from this.

  7. Verb · 2008-09-12 18:00 · #

    And thank you for updating your readers with useful posts :)

  8. Ares2 · 2008-09-13 02:11 · #

    Sounds like a great addition, especially for Anti-ABP-Scripts.

    So theoretically it’s possible to filter for example parts of an external Javascript file as well this way?

    Reply from Wladimir Palant:

    Theoretically – yes. Practically, however, this feature should only be used as a last resort. The way it looks right now, performance will be an issue.

  9. Anonymous Coward · 2008-09-18 06:24 · #

    If you aren’t using Privoxy then you lose. The Adblock extension has always been a farce, and if you use any other browser it isn’t available. Privoxy does a far better job and doesn’t slow down your browser.

    Reply from Wladimir Palant:

    You chose the right name :)

  10. Archaeopteryx · 2008-09-26 02:04 · #

    I’d like it if Adblock Plus would modify the stream so that ScrapBook archives the pages the way I see them (no additional stuff like images that are not shown, unused styles, etc., which waste my disk space).

  11. Volvox · 2008-10-09 23:12 · #

    I do think the Privoxy and Proxomitron filter syntaxes could be useful for study as they have had this sort of thing for some time now—

    http://www.privoxy.org/user-manual/filter-file.html

    http://www.sankey.ws/proxlang.html

    A few notes regarding Proxomitron:

    “Scope bounds” set the size (in characters) of the look-ahead needed to match a given filter, though this would seem even more inelegant than the two-marker solution you propose above.

    Filters can be set to act on JS and XML files as well as HTML (the existing $tags would seem adequate to this end, although one might wish to add an $html tag for certain special cases).

    There is provision made for matching certain common synonymous expressions (eg foo=bar vs foo=‘bar’ vs foo=“bar”). This may not be necessary in AB+ as many cases involving these would use element hiding instead.

    A set of pseudo-characters match the start of parsing for <html> <head> <body> et al even when these are not explicitly marked in the html code. I suspect something along these lines would prove quite useful.

  12. Two-RegEx "Solution" · 2010-09-17 10:21 · #

    If you decide to “begin matching” at “<script”, then it won’t match in the 1/100 or so times that “<scr” and “ipt” happen to be sent in different chunks.

    P.S. The “Name” field (at least) of this comment form has bad filtering applied to it. Quote characters turn into &quot;, which gets turned into &amp;quot;, which gets turned into &amp;amp;quote;, and so on, when previewing.

    Reply from Wladimir Palant:

    Quoting the article: “And the beginning marker shouldn’t span too much text so that it can always be found by looking at two consecutive chunks.”

    Anyway, why are you commenting on a blog post that is two years old? The current discussion on this topic is in https://adblockplus.org/forum/viewtopic.php?f=4&t=5977

Commenting is closed for this article.