[Rejected] Filtering HTML code

Wladimir Palant · Post by **Wladimir Palant** » Fri Aug 27, 2010 5:54 pm

I have a realistic idea of how to implement this: blog/filtering-html-code-in-adblock-plus. Mind you, this is not final, just something I am currently thinking about - and I would like to hear your opinion on it.

The problem with the original proposal is that it is too general. This makes it hard to optimize performance, but it is also a security problem. A filter subscription could in theory remove parts of the page in such a way that the result is malicious JavaScript code, e.g. code that steals the user's password. Of course, all filter subscription maintainers are nice and responsible people - but do they also keep their web servers secure? If one of these servers is hacked, or an attacker simply manipulates subscription data while it is being downloaded (most subscriptions don't use HTTPS), we might have a problem.

Which is why a reduced solution would be good. And here is one: instead of removing generic parts of the page, why not remove only entire HTML/XML blocks? This would be similar to element hiding - except that things would really be removed, meaning that this approach could be applied to inline scripts and XML data. Here is what a filter might look like:

Code: Select all

http://example.com/bad/*.html$htmlcut=script:not([src])
http://example.com/bad/*.html$htmlcut=div#foobar

What follows "cut" there looks like a CSS selector - but it isn't one. Applying CSS selectors would require parsing the entire document, which would be way too slow. So we would have a simplified form of CSS:

* Checking parent or sibling elements isn't possible, all selector parts should refer to the element to be removed.
* Tag name is mandatory (I think this is required for reasons of performance).
* Removing elements that don't have a closing tag (allowed for <P> or <LI> in HTML) isn't possible.
* Attribute selectors like [foo], [foo="bar"], [foo*="bar"] or [foo~"^ba+r"] are allowed (yes, that last one is a regular expression - we can do more than CSS usually allows). #foo is equivalent to [id="foo"] and .bar is equivalent to [class~"\bbar\b"].
* Negating selectors is allowed - :not(#foo) is ok.
* Looking for some text inside the element (by regular expression) is allowed - :text(\badblock\b)
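
To make the restrictions above more concrete, here is a minimal Python sketch of how such a simplified selector could be checked against a single start tag without looking at the rest of the document. It is only an illustration under assumptions, not how Adblock Plus does or would do it: the names `parse_selector` and `matches` are hypothetical, `.bar` is assumed to be the class shorthand, the attributes are assumed to be already extracted into a dictionary, and `:text(...)` is left out because it needs the element's content.

```python
import re
from typing import Callable, Dict, List, Tuple

Attrs = Dict[str, str]
Test = Callable[[Attrs], bool]

# One selector part: #id, .class, [attr], [attr="v"], [attr*="v"], [attr~"regex"]
PART = re.compile(
    r'#(?P<id>[\w-]+)'
    r'|\.(?P<cls>[\w-]+)'
    r'|\[(?P<attr>[\w-]+)(?:(?P<op>\*=|=|~)"(?P<val>[^"]*)")?\]'
)

def compile_part(m) -> Test:
    """Turn one matched selector part into a test on the tag's attributes."""
    if m.group("id"):
        wanted = m.group("id")
        return lambda a: a.get("id") == wanted
    if m.group("cls"):
        # .bar is treated as [class~"\bbar\b"], as described in the list above
        word = re.compile(r"\b%s\b" % re.escape(m.group("cls")))
        return lambda a: bool(word.search(a.get("class", "")))
    name, op, val = m.group("attr"), m.group("op"), m.group("val")
    if op is None:
        return lambda a: name in a
    if op == "=":
        return lambda a: a.get(name) == val
    if op == "*=":
        return lambda a: name in a and val in a[name]
    pattern = re.compile(val)  # op == "~": the value is a regular expression
    return lambda a: name in a and bool(pattern.search(a[name]))

def parse_selector(selector: str) -> Tuple[str, List[Test]]:
    """Split 'tag' + parts into a tag name and a list of attribute tests."""
    m = re.match(r"[A-Za-z][\w-]*", selector)
    if not m:
        raise ValueError("tag name is mandatory")
    tag, pos = m.group(0).lower(), m.end()
    tests: List[Test] = []
    while pos < len(selector):
        negate = selector.startswith(":not(", pos)
        if negate:
            pos += len(":not(")
        part = PART.match(selector, pos)
        if not part:
            raise ValueError("unsupported selector part at offset %d" % pos)
        test = compile_part(part)
        pos = part.end()
        if negate:
            if not selector.startswith(")", pos):
                raise ValueError("unclosed :not(")
            pos += 1
            tests.append(lambda a, t=test: not t(a))
        else:
            tests.append(test)
    return tag, tests

def matches(selector: str, tag: str, attrs: Attrs) -> bool:
    """True if the start tag (name + attributes) satisfies the selector."""
    wanted_tag, tests = parse_selector(selector)
    return tag.lower() == wanted_tag and all(t(attrs) for t in tests)
```

For example, `matches('script:not([src])', 'script', {})` accepts an inline script, while the same selector rejects a script tag that carries a `src` attribute. Because every part refers only to the element itself, the decision can be made as soon as the start tag has been read.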

It would be nice to extend this to CSS and JavaScript somehow. In case of CSS the idea would probably be to kill off a selector. Maybe like this:

Code: Select all

http://example.com/bad/*.css$csscut=#ad

Here "#ad" would be a real CSS selector - the one to remove.
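
One possible reading of "killing off a selector", sketched in Python: drop the selector from every rule's selector list and drop the rule entirely if nothing is left. The function name `css_cut` is hypothetical, and the regular expression is a deliberate simplification that assumes flat rules (no @media blocks, no braces inside comments or strings).

```python
import re

# A flat CSS rule: selector list, then a declaration block without nested braces.
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")

def css_cut(stylesheet: str, unwanted: str) -> str:
    """Remove `unwanted` from each rule's selector list; drop rules left empty."""
    def rewrite(m):
        selectors = [s.strip() for s in m.group(1).split(",")]
        kept = [s for s in selectors if s != unwanted]
        if not kept:
            return ""  # the rule applied only to the unwanted selector
        return "%s {%s}" % (", ".join(kept), m.group(2))
    return RULE.sub(rewrite, stylesheet)
```

With this reading, a rule like `#ad, .box { margin: 0 }` would keep applying to `.box`, while a rule that only targets `#ad` would disappear from the stylesheet.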

Not sure yet how to deal with JavaScript in a sane way... And, of course, parsing HTML is still very hard - not sure whether the solution will be precise enough.

Dr. Evil · Post by **Dr. Evil** » Sat Aug 28, 2010 11:10 pm

The way I understand it, we'd use such filters as a last resort, when nothing else helps... However, with those restrictions in place, I believe that a site could relatively easily circumvent any blocking of that kind with JavaScript:

Code: Select all

<div id="content">... content...</div>
<style>#content { display: none; } /* some other important rules we don't want to remove */</style>
<script>
document.write("<style>#content { display: block; }</style>"); // only show the page content when this script runs
document.write("... ads ...");
</script>

EDIT: no scripting needed!

Code: Select all

<div id="content">... content...</div>
<style>#content { display: none; } /* some other important rules we don't want to remove */</style>
<div><style>#content { display: block; }</style>
... ads ...
</div>

Also, getting every quirk in HTML parsing right (i.e. the same way as Firefox) seems nearly impossible to me...

Wladimir Palant · Post by **Wladimir Palant** » Sun Aug 29, 2010 2:41 pm

Yes, this feature is meant as a last resort - it would be a hit on performance even with all the optimizations in place. And - yes, a website that *really* wants to make things messy will still be able to do so. So the question is: do you think that such a feature would still be useful?

Michael · Post by **Michael** » Mon Aug 30, 2010 4:22 pm

I can certainly envisage that disabling the script tag would be a useful option for several websites; however, in order to judge whether or not the syntax would be used in subscriptions, the order of magnitude of the performance cost needs to be established. What exactly does slowing down browsing "quite a bit" mean?

Wladimir Palant · Post by **Wladimir Palant** » Mon Aug 30, 2010 7:00 pm

It means - on pages where this feature isn't used there won't be any change. However, on pages where it is used there might be a slowdown that is noticeable. I doubt that we are talking about seconds here - but 500 ms could happen I guess.

Michael · Post by **Michael** » Mon Aug 30, 2010 7:59 pm

I would certainly suggest that filters removing inline JavaScript in anti-Adblock situations could be used in EasyList, although any other application would probably be too trivial because of efficiency issues; however, because of the potential benefits I would welcome the suggested syntax.

Guest · Post by **Guest** » Tue Aug 31, 2010 9:17 pm

Wladimir Palant wrote: Yes, this feature is meant as a last resort - it would be a hit on performance even with all the optimizations in place. And - yes, a website that *really* wants to make things messy will still be able to do so. So the question is: do you think that such a feature would still be useful?

Well, there are a few cases, where it would be useful. But I don't think those are enough to justify the effort and the possible performance problems. Imho, if there were a way to block inline scripts (including event handlers) from running, we'd have a solution for 99% of those cases.

Wladimir Palant · Post by **Wladimir Palant** » Wed Sep 01, 2010 8:06 am

Guest, see example above:

Code: Select all

http://example.com/bad/*.html$htmlcut=script:not([src])

This would prevent all inline scripts from running. Of course, if we are looking into preventing selected pages from running JavaScript altogether, there might be simpler solutions. The approach above has the advantage of being more flexible: it would be possible to remove only some script blocks.
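
To illustrate the effect of that filter on the page source, here is a deliberately naive Python sketch that cuts every script block without a `src` attribute. The function name `cut_inline_scripts` is hypothetical, and the regular expression assumes well-behaved markup (no ">" inside attribute values, no "</script>" inside script strings), which is exactly the kind of assumption that makes real HTML filtering hard.

```python
import re

# A whole script block: start tag with its attributes, content, closing tag.
SCRIPT = re.compile(r"<script\b([^>]*)>.*?</script\s*>", re.I | re.S)
# A bare 'src' attribute name (not 'data-src'); naive, it could also match
# the word inside another attribute's value.
HAS_SRC = re.compile(r"(?<![\w-])src(?![\w-])", re.I)

def cut_inline_scripts(html: str) -> str:
    """Remove <script> blocks that have no src attribute; keep external ones."""
    def decide(m):
        return m.group(0) if HAS_SRC.search(m.group(1)) else ""
    return SCRIPT.sub(decide, html)
```

Applied to `<p>hi</p><script>alert(1)</script><script src="a.js"></script>`, this keeps the paragraph and the external script and drops only the inline block.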

ComodoF · Post by **ComodoF** » Wed Sep 22, 2010 8:41 pm

Sry for the noob question, does this mean that you will be able in the future to kill those very annoying self opening pop-unders??

If so this is really a fantastic improvement

thx

Wladimir Palant · Post by **Wladimir Palant** » Wed Sep 22, 2010 8:49 pm

Yes, that should be possible.

neko2sonic · Post by **neko2sonic** » Fri Dec 17, 2010 1:43 am

Love this idea. More control is always nice. Is this something you still think you might implement?

Sounds like it would be useful for blocking videos loaded via RTMP like the following:

Code: Select all

http://www.fox17.com/template/cgi-bin/wcm/wcm_video.pl?pop=ads&loc=wztv&v=5660&f=top_stories&338079198

Also, I saw that you'd like to support CSS and JavaScript along with the planned HTML and XML, but what about JSON?

Code: Select all

http://arkansasmatters.com/libraries/nxd/ajax/?data=get_video&ext=lib_video&vid_id=562906&bw=undefined

Wladimir Palant · Post by **Wladimir Palant** » Tue Dec 13, 2011 8:45 am

Off-topic question has been moved into a separate topic: forum/viewtopic.php?f=1&t=8973. Please refrain from off-topic questions, especially in the Future Development forum.

Lain_13 · Post by **Lain_13** » Thu Dec 15, 2011 12:48 am

Another proposal: removing parents by children.

Syntax: cut=text to search[|tag_name][|levels_to_skip]

[ ] - this part may be omitted; the [ and ] themselves are not part of the real filter.
If | has to appear inside the text to search for, it must be written as ||.

How it works:
1. Search for 'text to search' in consecutive pairs of incoming chunks.
If it is found and no additional parameters are specified, go back and search for the first < tag >. If the text is located inside a tag's attributes, then it has to be this tag (so the backward search should probably look for < tag instead of < tag > ).
2. If 'tag_name' is specified, search for that specific tag instead of any tag (usually it will be something like 'script', 'table' or 'div').
3. If 'levels_to_skip' is specified (it must be a numeric value like 1, 2, 3, ...), count the first located tag as 1 and go further back in search of previous tags (of the same name, if tag_name is specified).
4. Search for the matching closing tag and cut everything from the starting tag to this one.
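
The steps above could be sketched roughly as follows in Python. This is one interpretation, not a reference implementation: the function name `cut_parent` is hypothetical, it works on a complete string rather than on incoming chunks, it assumes well-formed nesting, the list of void tags is incomplete, and `levels_to_skip` is read as "how many enclosing tags to skip before cutting" (0 cuts the first one found).

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)[^>]*>")
VOID = {"br", "img", "input", "hr", "meta", "link"}  # incomplete on purpose

def cut_parent(html, text, tag_name=None, levels_to_skip=0):
    """Cut the element enclosing `text`, optionally a specific or outer tag."""
    hit = html.find(text)
    if hit < 0:
        return html
    # Steps 1-3: walk backwards from the hit and collect tags that are still
    # open at that point (a tag containing the hit in its attributes counts).
    ancestors, depth = [], 0
    before = [t for t in TAG.finditer(html) if t.start() <= hit]
    for m in reversed(before):
        if m.group(2).lower() in VOID or m.group(0).endswith("/>"):
            continue
        if m.group(1):          # closing tag: an already finished element
            depth += 1
        elif depth > 0:         # start tag of such a finished element
            depth -= 1
        else:                   # start tag that is still open at the hit
            ancestors.append(m)
    candidates = [a for a in ancestors
                  if tag_name is None or a.group(2).lower() == tag_name]
    if len(candidates) <= levels_to_skip:
        return html
    start = candidates[levels_to_skip]
    # Step 4: find the closing tag that balances the chosen start tag.
    name, depth = start.group(2).lower(), 1
    for m in TAG.finditer(html, start.end()):
        if m.group(2).lower() != name or m.group(0).endswith("/>"):
            continue
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return html[:start.start()] + html[m.end():]
    return html  # no closing tag found: leave the document untouched
```

Note that the backward walk already has to track which elements were closed before the hit, otherwise a finished sibling would be mistaken for a parent. That bookkeeping goes beyond the steps as written and hints at how much of a parser such a filter would need.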

I think something like this will be more useful than pseudo-CSS and shouldn't be exploitable to do malicious things.

Wladimir Palant · Post by **Wladimir Palant** » Thu Dec 15, 2011 6:46 am

I think that the most likely solution to the "remove parents by children" problem will be this one: http://www.w3.org/TR/selectors4/#subject. I don't know when Mozilla plans to implement it but I guess that it isn't too far off.

The problem when working with HTML code directly is that you don't have a DOM tree - the document hasn't been parsed yet, and determining parent/child relations and the like is far from trivial. Generally, I have doubts that the proposal here will ever be implemented. We already tried using the API required for HTML code filtering, in a non-modifying way (for redirect tracking). The side-effects were quite considerable, so in Adblock Plus 2.0 we no longer use it. I'm anything but sure that I want to open that can of worms again.

Esmeralda · Post by **Esmeralda** » Thu Jan 19, 2012 11:14 pm

I hope that you can figure it out eventually. It's a great idea if you can make it work...
