The ABP Crawler

fhd
Posts: 119
Joined: Mon Sep 03, 2012 5:29 pm

The ABP Crawler

Post by fhd »

We've been working on a Firefox extension that loads a specified list of sites and records every resource requested by each of those, and whether or not that resource was blocked by ABP.
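
Roughly, the mechanism looks like this (a sketch only, not the actual extension code; matchesAnyFilter() and recordObservation() are placeholder names for whatever the crawler really uses):

    // Rough sketch: a content policy hook sees every resource request,
    // asks for a blocking verdict and records it.
    const Ci = Components.interfaces;

    var observingPolicy = {
      shouldLoad: function(contentType, contentLocation, requestOrigin, node, mimeType, extra) {
        var blocked = matchesAnyFilter(contentLocation.spec);  // placeholder: ask ABP for the verdict
        recordObservation({                                    // placeholder: write to the job log
          site: requestOrigin ? requestOrigin.spec : null,
          url: contentLocation.spec,
          blocked: blocked
        });
        return Ci.nsIContentPolicy.ACCEPT;  // observe only, never block anything ourselves
      }
    };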

The idea is to regularly run this tool on our infrastructure, in order to provide data that helps filter list authors get rid of duplicate and obsolete filters.

The current version is pretty much a proof-of-concept that works as outlined above. It's probably not very useful in this state, so: What kind of data would you like to get from this?
MonztA
ABP Developer
Posts: 3957
Joined: Mon Aug 14, 2006 12:18 am
Location: Germany

Re: The ABP Crawler

Post by MonztA »

Would it be possible to use Fanboy's repository as well?
fhd wrote:So, what kind of data would you like to get from this?
So the EasyList authors would then get an e-mail with 0-hit filters once the crawler has run through all sites?
Crits
Posts: 394
Joined: Mon Jan 16, 2012 11:54 am
Location: France

Re: The ABP Crawler

Post by Crits »

Some ideas, more or less complicated to implement:

- Certain ads and advertising campaigns can be periodic: for example, the filters for one website may only hit during the first half of the week and not during the second half. I've already encountered this case on some websites.
The ABP crawler should keep track of the 0-hit filters and only report them as obsolete if they didn't hit over 2 or 3 crawls done at different times (see the sketch after this list). Even for non-periodic ads, this would reduce the margin of error.

- Associate each 0-hit filter with the corresponding URL in the filter list changelog commit, so that the list maintainers can double-check whether the filter is really obsolete.

- Treat generic filters and third-party advert filters separately in the output list of 0-hit filters, as these filters may still be useful even if they don't hit.
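
A minimal sketch of the first idea, assuming the analysis keeps a per-filter hit history (the hitHistory structure below, mapping filter text to per-crawl hit counts, is an assumed format):

    // Only flag a filter as obsolete once it has had zero hits in several
    // crawls run at different times.
    function obsoleteFilters(hitHistory, requiredMisses) {
      var result = [];
      for (var filter in hitHistory) {
        var recent = hitHistory[filter].slice(-requiredMisses);
        var allZero = recent.length >= requiredMisses &&
                      recent.every(function(hits) { return hits == 0; });
        if (allZero)
          result.push(filter);
      }
      return result;
    }

    // e.g. obsoleteFilters(history, 3) only flags filters with no hits
    // in the last three crawls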
Author of Liste FR, an ad-blocking subscription for French websites
fhd
Posts: 119
Joined: Mon Sep 03, 2012 5:29 pm

Re: The ABP Crawler

Post by fhd »

MonztA wrote:Would it be possible to also use Fanboy's repository as well?
Yes, the crawler can load arbitrary sites. We have a script that scans the EasyList commit log for sites, and since Fanboy's repository seems to follow a similar commit convention, that should work as well. The same goes for the script that extracts the filters.
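
For illustration, the scanning boils down to something like this (assuming, as the EasyList convention suggests, that commit messages mention the affected site in parentheses - the exact format per repository is an assumption):

    // Rough sketch of the log scanning, assuming each commit message ends
    // with the affected site in parentheses, e.g.
    // "M: ||adserver.example^ (example.com)".
    function extractSites(logMessages) {
      var sites = {};
      logMessages.forEach(function(message) {
        var match = /\(([^)]+)\)\s*$/.exec(message);
        if (match)
          sites[match[1]] = true;  // deduplicate sites across commits
      });
      return Object.keys(sites);
    }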
Crits wrote:Some ideas, more or less complicated to implement: […]
That all sounds pretty useful and doable to me.
famlam
Posts: 59
Joined: Sat Aug 07, 2010 2:06 pm

Re: The ABP Crawler

Post by famlam »

Which of the following will it be able to handle?
1. ads that show up once every 10 refreshes or so (if the site, for example, randomly rotates its ads)
2. localized ads: some ads don't show up unless you are in the US / UK / Netherlands / Germany / NameYourCountry
3. ads that show up after you started / finished / are watching / paused a video (pretty much all of the object_subrequest ads)
4. ads that show up after a specific user action, e.g. you first have to fill in a route at a route planner before the ad (and the route :) ) shows up, without the URL changing
5. ads that are based upon the referrer (very unlikely this is possible)
6. ads on pages that no longer exist, but similar ones do (e.g., the crawler automatically visits all first-party links on a page up to a specific depth, so in case of a 404/301 it'll automatically scan that page, find the link to the homepage, on the homepage there will probably be a link to the video index, and on the index there will be a link to another video)
7. ads that appear after a cookie is set (e.g., second visit on a page)
8. ads that require JS to be disabled
9. ads that are activated after a timeout (e.g., you have to wait 5 minutes on a page; parsing the JS isn't a solution here, since neither the image it loads after the timeout nor the CSS selector is necessarily hardcoded). One way to handle this case is sketched below.
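
The timeout case at least seems solvable without parsing any JS: keep each page open for a configurable extra period after load and keep recording. A rough sketch (the browser element and the finishTrial() routine are assumptions about the crawler's internals):

    const EXTRA_WAIT_MS = 5 * 60 * 1000;  // e.g. five extra minutes per page

    browser.addEventListener("load", function onLoad() {
      browser.removeEventListener("load", onLoad, true);
      // observations keep accumulating while we wait
      window.setTimeout(finishTrial, EXTRA_WAIT_MS);
    }, true);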
Also, will it be possible to
1. report domain-specific filters for which at least one of the domains no longer exists / is *permanently* redirected / is parked / ... (e.g., if you have @@||ads.adserver.ad^$domain=yoursite.com|allshopsuk.co.uk|anothersite, it'll tell you that allshopsuk.co.uk is parked); see the sketch after this list
2. since not all URLs are in the repository, allow adding your own URLs to be checked as well. For example, for this commit we'd need a custom URL: https://hg.adblockplus.org/easylist/rev/a5f3ec510844 (just a randomly picked one). Also, if you have http://videosite.com/video/123456ABCDE and that particular video doesn't exist anymore, it shouldn't report that all the time; instead you should be able to instruct it to visit video /EDCBA654321 instead, or, if videosite.com switched domains, to send it to site.com/video instead. And what about the initial commit of the repository (https://hg.adblockplus.org/easylist/rev/2), or filters that were moved from, say, EasyList Germany to EasyList and thus appear as a URL in the other repository only?
3. support lists that are not in hg, perhaps by letting their authors submit 1. a file with all URLs where an ad appeared and 2. the URL of their list (my Dutch list, for example)
4. ignore reported filters, if we're sure they still exist
Also, I agree with Crits that generic filters (no domain mentioned, except perhaps excluded domains) should not be reported over and over again.
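
For the first point, extracting the domains to check seems straightforward; a sketch (checkDomain() is a placeholder for the hard part - deciding whether a domain is dead, parked or permanently redirected - and filters is the assumed list of filter texts):

    function domainsOfFilter(filterText) {
      var match = /\$.*\bdomain=([^,]+)/.exec(filterText);
      if (!match)
        return [];
      return match[1].split("|").filter(function(domain) {
        return domain[0] != "~";  // skip excluded (~) domains
      });
    }

    filters.forEach(function(filter) {
      domainsOfFilter(filter).forEach(function(domain) {
        checkDomain(domain, filter);  // placeholder: report filter if domain is gone
      });
    });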
Crits
Posts: 394
Joined: Mon Jan 16, 2012 11:54 am
Location: France

Re: The ABP Crawler

Post by Crits »

Also, it should be possible to create a list of filters that the ABP crawler should not check, as some filters match requests that the crawler may never record (e.g. requests sent after manually starting a video, or after clicking a button that executes some JavaScript, ...).

EDIT: famlam's answer is more complete than mine on this particular subject; I just hadn't had time to read it yet.
Author of Liste FR, an ad-blocking subscription for French websites
eric@@@Z
Posts: 1
Joined: Wed Dec 05, 2012 6:47 pm

Re: The ABP Crawler

Post by eric@@@Z »

I've just started working on the crawler. I've spent the last few days wrapping my head around everything and mulling over how to proceed. These are my initial thoughts. The crawler supports a larger workflow:
  • Gather observations
  • Analyze observations
  • Update filter lists
At this point, I'm planning an initial release that just gathers observations. The output would be a record of the job: all the pages visited, all the resources loaded, and whether each was blocked. This is a concrete unit of functionality that provides some immediate utility, and I'd rather get it out early than wait until we've figured out exactly how the analytics will work.
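
For illustration, one record of that output could look something like this (field names are illustrative, nothing is fixed yet; whether to include the matching filter is an open question):

    var observation = {
      page: "http://example.com/",                     // the page the crawler visited
      resource: "http://ads.example.net/banner.js",    // a resource that page loaded
      blocked: true,                                   // ABP's verdict
      filter: "||ads.example.net^"                     // the filter that matched, if any
    };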

Later on, the output of multiple jobs can be compiled into a database of observations. Such a structure would provide, for example, the information needed to detect randomized ad sources that don't show up deterministically in any single job. The first version won't do this, but it's designed with an eye to doing it eventually. The overall data hierarchy looks something like this:
  1. Observation database. This is likely a per-group database to support the maintenance of a single list, though this isn't really a technical requirement. (Not in initial version.)
  2. Session. One job run of the crawler generates a session. We can also conceive of manual sessions that generate recorded observations.
  3. Trial. The simplest trial is to load a single URL and to watch all the loaded assets. More complicated trials, though, such as watching after simulated user action or waiting for timeouts, enter the system as other kinds of trials.
  4. Observation. The simplest observation is to see an asset load and to see whether it's blocked. Other kinds of observations could be recorded, though, such as setting a tracking cookie.
The goal with all this is a system we can develop incrementally to adapt to changing circumstances. The initial release will be simple, but built with an eye to expanding it later to observe the non-simple behaviors of web sites.
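
To make that concrete, a session could nest trials and observations roughly like this (again, illustrative names only; the real schema is open):

    var session = {
      started: "2012-12-05T18:00:00Z",
      source: "crawler",                 // or "manual"
      trials: [{
        kind: "load-url",                // later also "user-action", "timeout", ...
        url: "http://example.com/",
        observations: [{
          kind: "resource-load",         // later also "cookie-set", ...
          resource: "http://ads.example.net/banner.js",
          blocked: true
        }]
      }]
    };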
Wladimir Palant

Re: The ABP Crawler

Post by Wladimir Palant »

@eric: I think that you got the overall structure pretty well there. Only two notes:
  • We probably want one large database rather than per-group databases. While each filter list team should only access the data for its own filter list, we might have situations where evaluating all the data is necessary. Unless, of course, you didn't mean to rule out evaluation across databases.
  • The observations we are interested in aren't quite as trivial as the current state of the crawler suggests - we need a bunch of context information as well. Most importantly, we need the URLs of all parent frames, because exception rules might apply to them. We also need to know the type of each request, something the current version doesn't capture. A rough sketch of capturing both follows below.
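
Roughly (a sketch only, assuming chrome-privileged code that may walk the frame hierarchy; recordObservation() stands in for whatever the crawler uses to store observations):

    var contextAwarePolicy = {
      shouldLoad: function(contentType, contentLocation, requestOrigin, node, mimeType, extra) {
        var parentFrames = [];
        var wnd = node && node.ownerDocument ? node.ownerDocument.defaultView : null;
        while (wnd && wnd != wnd.parent) {
          parentFrames.push(wnd.parent.location.href);  // exception rules might apply here
          wnd = wnd.parent;
        }
        recordObservation({
          url: contentLocation.spec,
          contentType: contentType,       // e.g. TYPE_SCRIPT, TYPE_OBJECT_SUBREQUEST
          parentFrames: parentFrames
        });
        return Components.interfaces.nsIContentPolicy.ACCEPT;
      }
    };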