[Rejected] Filtering HTML code

Postby Wladimir Palant » Fri Aug 27, 2010 5:54 pm

I got a realistic idea on how to implement this: https://adblockplus.org/blog/filtering-html-code-in-adblock-plus. Mind you, this is not final, just something I am currently thinking about - and would like to hear your opinion about it.

The problem with the original proposal is that it is too general. This makes it hard to optimize performance, but it is also a security problem. A filter subscription could in theory remove parts of the page in such a way that the result would be some malicious JavaScript code, e.g. code that steals the user's password. Of course, all filter subscription maintainers are nice and responsible people - but do they also keep their web servers secure? If one of these servers is hacked, or an attacker simply manipulates subscription data while it is being downloaded (most subscriptions don't use HTTPS), we might have a problem.

Which is why a reduced solution would be good. And here is one: instead of removing generic parts of the page, why not remove only entire HTML/XML blocks? This would be similar to element hiding - except that things would really be removed, meaning that this approach could be applied to inline scripts and XML data. Here is what a filter might look like:

Code:
http://example.com/bad/*.html$htmlcut=script:not([src])
http://example.com/bad/*.html$htmlcut=div#foobar


What follows "htmlcut=" there looks like a CSS selector - but it isn't one. Applying real CSS selectors would require parsing the entire document, which would be way too slow. So we would have a simplified form of CSS:

* Checking parent or sibling elements isn't possible, all selector parts should refer to the element to be removed.
* Tag name is mandatory (I think this is required for reasons of performance).
* Removing elements that don't have a closing tag (allowed for <P> or <LI> in HTML) isn't possible.
* Attribute selectors like [foo], [foo="bar"], [foo*="bar"] or [foo~"^ba+r"] are allowed (yes, that last one is a regular expression - we can do more than CSS usually allows). #foo is equivalent to [id="foo"] and .bar is equivalent to [class~"\bbar\b"].
* Negating selectors is allowed - :not(#foo) is ok.
* Looking for some text inside the element (by regular expression) is allowed - :text(\badblock\b)

It would be nice to extend this to CSS and JavaScript somehow. In case of CSS the idea would probably be to kill off a selector. Maybe like this:

Code:
http://example.com/bad/*.css$csscut=#ad


Here "#ad" would be a real CSS selector - the one to remove.
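
A rough Python sketch of what "killing off a selector" could mean for a stylesheet, under strong simplifying assumptions: only flat rules are handled (no @media blocks, comments or strings containing braces), selectors are compared as exact strings, and the function name css_cut is invented for this example.

```python
import re

# A flat rule: selector list, then a declaration block without nested braces.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def css_cut(css, selector):
    """Drop `selector` from every flat rule; drop the whole rule if nothing is left."""
    def rewrite(match):
        selectors = [s.strip() for s in match.group(1).split(",")]
        kept = [s for s in selectors if s != selector]
        if not kept:
            return ""  # the rule applied only to the removed selector
        return ", ".join(kept) + " {" + match.group(2) + "}"

    return RULE_RE.sub(rewrite, css)


print(css_cut("#ad, .box { color: red; } #ad { display: block; }", "#ad"))
```

A rule that lists other selectors next to "#ad" keeps its declarations for those, so only the styling of the targeted selector goes away.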

Not sure yet how to deal with JavaScript in a sane way... And, of course, parsing HTML is still very hard - not sure whether the solution will be precise enough.
Wladimir Palant
ABP Developer
 
Posts: 8397
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: Filtering HTML code

Postby Dr. Evil » Sat Aug 28, 2010 11:10 pm

The way I understand it, we'd use such filters as a last resort, when nothing else helps... However, with those restrictions in place, I believe that a site could relatively easily circumvent any blocking of that kind with JavaScript:
Code:
<div id="content">... content...</div>
<style>#content { display: none; } /* some other important rules we don't want to remove */</style>
<script>
document.write("<style>#content { display: block; }</style>"); // only show the page content when this script runs
document.write("... ads ...");
</script>


EDIT: no scripting needed!
Code:
<div id="content">... content...</div>
<style>#content { display: none; } /* some other important rules we don't want to remove */</style>
<div><style>#content { display: block; }</style>
... ads ...
</div>


Also, getting every quirk in HTML parsing right (i.e. the same way as Firefox) seems nearly impossible to me...
Dr. Evil
 
Posts: 194
Joined: Fri Sep 08, 2006 3:51 pm

Re: Filtering HTML code

Postby Wladimir Palant » Sun Aug 29, 2010 2:41 pm

Yes, this feature is meant as a last resort - it would be a hit on performance even with all the optimizations in place. And - yes, a website that *really* wants to make things messy will still be able to do so. So the question is: do you think that such a feature would still be useful?
Wladimir Palant
ABP Developer
 
Posts: 8397
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: Filtering HTML code

Postby Michael » Mon Aug 30, 2010 4:22 pm

I can certainly envisage that disabling the script tag would be a useful option for several websites; however, in order to judge whether or not the syntax would be used in subscriptions, the order of magnitude of the performance cost needs to be established. What exactly does slowing down browsing "quite a bit" mean?
Michael
 
Posts: 1361
Joined: Sat Dec 19, 2009 1:29 pm

Re: Filtering HTML code

Postby Wladimir Palant » Mon Aug 30, 2010 7:00 pm

It means - on pages where this feature isn't used there won't be any change. However, on pages where it is used there might be a slowdown that is noticeable. I doubt that we are talking about seconds here - but 500 ms could happen I guess.
Wladimir Palant
ABP Developer
 
Posts: 8397
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: Filtering HTML code

Postby Michael » Mon Aug 30, 2010 7:59 pm

I would certainly suggest that filters removing inline JavaScript in anti-Adblock situations could be used in EasyList, although any other application would probably be too trivial because of efficiency issues; however, because of the potential benefits I would welcome the suggested syntax.
Michael
 
Posts: 1361
Joined: Sat Dec 19, 2009 1:29 pm

Re: Filtering HTML code

Postby Guest » Tue Aug 31, 2010 9:17 pm

Wladimir Palant wrote: Yes, this feature is meant as a last resort - it would be a hit on performance even with all the optimizations in place. And - yes, a website that *really* wants to make things messy will still be able to do so. So the question is: do you think that such a feature would still be useful?
Well, there are a few cases where it would be useful. But I don't think those are enough to justify the effort and the possible performance problems. IMHO, if there were a way to block inline scripts (including event handlers) from running, we'd have a solution for 99% of those cases.
Guest
 

Re: Filtering HTML code

Postby Wladimir Palant » Wed Sep 01, 2010 8:06 am

Guest, see example above:

Code:
http://example.com/bad/*.html$htmlcut=script:not([src])


This would prevent all inline scripts from running. Of course, if we are looking into preventing selected pages from running JavaScript altogether there might be simpler solutions. The approach above has the advantage of being more flexible: it would be possible to remove only some script blocks.
Wladimir Palant
ABP Developer
 
Posts: 8397
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: Filtering HTML code

Postby ComodoF » Wed Sep 22, 2010 8:41 pm

Sorry for the noob question: does this mean that in the future you will be able to kill those very annoying self-opening pop-unders?

If so this is really a fantastic improvement

thx
ComodoF
 

Re: Filtering HTML code

Postby Wladimir Palant » Wed Sep 22, 2010 8:49 pm

Yes, that should be possible.
Wladimir Palant
ABP Developer
 
Posts: 8397
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: Filtering HTML code

Postby neko2sonic » Fri Dec 17, 2010 2:43 am

Love this idea. More control is always nice. Is this something you still think you might implement?

Sounds like it would be useful for blocking videos loaded via RTMP like the following:

Code:
http://www.fox17.com/template/cgi-bin/wcm/wcm_video.pl?pop=ads&loc=wztv&v=5660&f=top_stories&338079198


Also, I saw that you'd like to support CSS and JavaScript along with the planned HTML and XML, but what about JSON?

Code:
http://arkansasmatters.com/libraries/nxd/ajax/?data=get_video&ext=lib_video&vid_id=562906&bw=undefined
neko2sonic
 
Posts: 10
Joined: Wed Jul 21, 2010 2:37 pm

Re: Filtering HTML code

Postby Wladimir Palant » Tue Dec 13, 2011 9:45 am

Off-topic question has been moved into a separate topic: viewtopic.php?f=1&t=8973. Please refrain from off-topic questions, especially in the Future Development forum.
Wladimir Palant
ABP Developer
 
Posts: 8397
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: Filtering HTML code

Postby Lain_13 » Thu Dec 15, 2011 1:48 am

Another proposal: removing parents by their children.

Syntax: cut=text to search[|tag_name][|levels_to_skip]

[ ] - this part may be omitted; the [ and ] themselves are not part of the real filter.
If | has to appear inside the text to search then it must be written as ||.

How it works:
1. Search for 'text to search' in consecutive pairs of incoming chunks.
If it is located and no additional parameters are specified then go back and search for the first <tag>. If the text is located inside a tag's attributes then it has to be that tag (so, probably back-search for <tag instead of <tag>).
2. If 'tag_name' is specified then, instead of searching for any tag, search for that specific tag (usually it will be things like 'script', 'table' or 'div').
3. If 'levels_to_skip' is specified (it must be a numeric value like 1, 2, 3, ...) then count the first located tag as 1 and go further back in search of previous tags (of the same name, if tag_name is specified).
4. Search for the appropriate closing tag and cut the part from the starting tag to this one.
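
The steps above can be sketched in Python roughly as follows. This is one possible reading of the proposal and nothing more: the names parse_cut and cut are invented, the whole document is assumed to be available as one string (the chunk-by-chunk search from step 1 is skipped), and tags are found with naive regular expressions that do not understand comments, scripts or attribute values containing ">".

```python
import re

OPEN_TAG_RE = re.compile(r"<([A-Za-z][\w-]*)")


def parse_cut(option):
    """Split 'text[|tag_name][|levels_to_skip]'; '||' stands for a literal '|'."""
    parts = [p.replace("\x00", "|") for p in option.replace("||", "\x00").split("|")]
    text, tag, levels = parts[0], None, 1
    for extra in parts[1:]:
        if extra.isdigit():
            levels = int(extra)
        elif extra:
            tag = extra.lower()
    return text, tag, levels


def cut(html, option):
    """Remove the element selected by `option`, or return html unchanged."""
    text, tag, levels = parse_cut(option)
    hit = html.find(text)
    if hit < 0:
        return html
    # Steps 1-2: opening tags starting at or before the hit, optionally of one name.
    # This also covers text found inside a tag's own attributes.
    opens = [m for m in OPEN_TAG_RE.finditer(html)
             if m.start() <= hit and (tag is None or m.group(1).lower() == tag)]
    # Step 3: the nearest tag counts as 1, the one before it as 2, and so on.
    if levels < 1 or len(opens) < levels:
        return html
    start = opens[-levels]
    # Step 4: find the matching closing tag, counting nested tags of the same name.
    name = re.escape(start.group(1))
    token = re.compile(r"<(/?)" + name + r"(?=[\s/>])[^>]*>", re.IGNORECASE)
    depth = 0
    for m in token.finditer(html, start.start()):
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return html[:start.start()] + html[m.end():]
    return html  # no closing tag found, e.g. <br> or <img>


page = '<div id="a"><p>keep</p></div><div id="b"><script>showAds()</script></div>'
print(cut(page, "showAds|script"))
print(cut(page, "showAds|div"))
print(cut(page, "showAds|div|2"))
```

The last call removes the first div, which is a preceding element rather than an ancestor of the script: counting tags backwards through the text is not the same as walking up a parsed tree.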

I think something like this would be more useful than pseudo-CSS and shouldn't be exploitable to do malicious things.
Lain_13
 
Posts: 114
Joined: Fri Dec 18, 2009 6:24 pm
Location: Wonderful World, Ubuntu Linux

Re: Filtering HTML code

Postby Wladimir Palant » Thu Dec 15, 2011 7:46 am

I think that the most likely solution to the "remove parents by children" problem will be this one: http://www.w3.org/TR/selectors4/#subject. I don't know when Mozilla plans to implement it but I guess that it isn't too far off.

The problem when working with HTML code directly is that you don't have a DOM tree - the document hasn't been parsed yet, and determining parent/child relations and similar is far from trivial. Generally, I have doubts that the proposal here will ever be implemented. We already tried using the API required for HTML code filtering, in a non-modifying way (for redirect tracking). The side-effects were quite considerable, so in Adblock Plus 2.0 we no longer use it. I'm anything but sure that I want to open that can of worms again.
Wladimir Palant
ABP Developer
 
Posts: 8397
Joined: Fri Jun 09, 2006 6:59 pm
Location: Cologne, Germany

Re: Filtering HTML code

Postby Esmeralda » Fri Jan 20, 2012 12:14 am

I hope that you can figure it out eventually. It's a great idea if you can make it work...
Esmeralda
 
Posts: 1
Joined: Fri Jan 20, 2012 12:10 am

