The problem with the original proposal is that it is too general. This makes it hard to optimize performance, but it is also a security problem. A filter subscription could in theory remove parts of the page in such a way that what remains is malicious JavaScript code, e.g. code that steals the user's password. Of course, all filter subscription maintainers are nice and responsible people - but do they also keep their web servers secure? If one of these servers is hacked, or an attacker simply manipulates subscription data while it is being downloaded (most subscriptions don't use HTTPS), we might have a problem.
Which is why a reduced solution would be good. And here is one: instead of removing arbitrary parts of the page, why not remove only entire HTML/XML blocks? This would be similar to element hiding - except that things would really be removed, meaning that this approach could be applied to inline scripts and XML data. Here is what a filter might look like:
http://example.com/bad/*.html$htmlcut=script:not([src])
http://example.com/bad/*.html$htmlcut=div#foobar
* Checking parent or sibling elements isn't possible; all selector parts must refer to the element to be removed.
* Tag name is mandatory (I think this is required for performance reasons).
* Removing elements that don't have a closing tag (allowed for <P> or <LI> in HTML) isn't possible.
* Attribute selectors like [foo], [foo="bar"], [foo*="bar"] or [foo~"^ba+r"] are allowed (yes, that last one is a regular expression - we can do more than CSS usually allows). #foo is equivalent to [id="foo"] and .bar is equivalent to [class~"\bbar\b"].
* Negating selectors is allowed - :not(#foo) is ok.
* Looking for some text inside the element (by regular expression) is allowed - :text(\badblock\b)
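To make the rules above a bit more concrete, here is a rough Python sketch of how a single element could be tested against such a selector. Nothing here is an existing implementation: the names `AttrTest`, `Selector`, `id_test` and `class_test` are made up for illustration, the selector is built by hand instead of being parsed from the filter text, and the element is assumed to arrive as a tag name, an attribute dictionary and its inner text.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttrTest:
    """One attribute condition (hypothetical helper, not an existing API)."""
    name: str
    op: str = ""     # "" = attribute present, "=" exact, "*=" substring, "~" regex
    value: str = ""

    def matches(self, attrs):
        if self.name not in attrs:
            return False
        actual = attrs[self.name]
        if self.op == "":
            return True
        if self.op == "=":
            return actual == self.value
        if self.op == "*=":
            return self.value in actual
        if self.op == "~":
            return re.search(self.value, actual) is not None
        raise ValueError(f"unknown operator: {self.op!r}")


@dataclass
class Selector:
    """All conditions refer to the element itself, never to parents or siblings."""
    tag: str                                      # mandatory tag name
    attr_tests: list = field(default_factory=list)
    negated: list = field(default_factory=list)   # conditions inside :not(...)
    text: Optional[str] = None                    # regex from :text(...)

    def matches(self, tag, attrs, text=""):
        if tag.lower() != self.tag.lower():
            return False
        if not all(t.matches(attrs) for t in self.attr_tests):
            return False
        if any(t.matches(attrs) for t in self.negated):
            return False
        if self.text is not None and re.search(self.text, text) is None:
            return False
        return True


def id_test(value):
    # #foo is treated as [id="foo"]
    return AttrTest("id", "=", value)


def class_test(value):
    # .bar is treated as [class~"\bbar\b"]
    return AttrTest("class", "~", r"\b" + re.escape(value) + r"\b")


# script:not([src]) - matches inline scripts only
inline_script = Selector("script", negated=[AttrTest("src")])
print(inline_script.matches("script", {}, "alert(1)"))       # True
print(inline_script.matches("script", {"src": "ads.js"}))    # False

# div#foobar
foobar = Selector("div", attr_tests=[id_test("foobar")])
print(foobar.matches("div", {"id": "foobar"}))               # True
```

Note that the `\b`-based class test also matches `bar` inside a class name like `foo-bar`, since a hyphen counts as a word boundary. Whether that is desirable is an open question.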
It would be nice to extend this to CSS and JavaScript somehow. In the case of CSS the idea would probably be to kill off a selector. Maybe like this:
http://example.com/bad/*.css$csscut=#ad
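What "killing off a selector" would mean exactly isn't settled, so the following Python sketch is only one possible reading: drop the given selector from every selector list and drop the whole rule if nothing is left. The function name `css_cut` is hypothetical, selectors are compared as exact strings, and the regex only copes with flat stylesheets (no @media blocks, no comments, no braces or commas inside strings or inside :not()).

```python
import re

# Very naive rule splitter: "selector list { declarations }" with no nesting.
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def css_cut(css, selector):
    """Return the stylesheet with `selector` removed from every rule."""
    out = []
    for match in RULE.finditer(css):
        selectors = [s.strip() for s in match.group(1).split(",")]
        kept = [s for s in selectors if s != selector]
        if kept:  # a rule that loses all its selectors disappears entirely
            out.append(", ".join(kept) + " {" + match.group(2).strip() + "}")
    return "\n".join(out)


css = "#ad, .box { display: block } #ad { width: 10px } p { margin: 0 }"
print(css_cut(css, "#ad"))
# .box {display: block}
# p {margin: 0}
```

A real implementation would need a proper CSS tokenizer, for the same reason that parsing HTML with regular expressions is fragile.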
Not sure yet how to deal with JavaScript in a sane way... And, of course, parsing HTML is still very hard - not sure whether the solution will be precise enough.