Downloading a file regularly - how hard can it be? · 2010-05-11 11:06 by Wladimir Palant

I’ve been playing a little with numbers to find a way to get filter subscription downloads distributed equally throughout the week. You probably guessed it from the title of this post, getting the desired behavior isn’t quite easy.

What’s the problem?

Adblock Plus downloads filter subscription in regular intervals. The exact length of the interval is determined by the subscription author, in case of EasyList we are talking about 5 days. This interval seems to be a good compromise between getting updates to the users fast and not causing unnecessarily much traffic/server load. Right now it is implemented in the simplest possible way: Every time Adblock Plus downloads a filter subscription it sets a new expiration time for it (5 days in the future in this case). It then checks regularly whether any subscriptions have an expiration time which lies in the past, this subscription will be downloaded then.

This sounds pretty good in theory. Regardless of the weekday when your first download happens, over time you will download on each day with equal probability (unlike with a download interval of 7 days for example where you will always download on the same weekday). That is, until you remember that computers aren’t switched on all the time. In particular, about 10% of Adblock Plus users seem to be office workers — their computer is always switched off during weekends. So if the determined expiration time happens to be on a weekend the actual download will happen on the following Monday when the computer (and browser) is switched on.

The situation is actually even worse as the following table should make clear:

First download Second download Final state
Mon Tue Wed Thu Fri Mon Tue Wed Thu Fri
Mon 1         1        
Tue 1         1        
Wed 1         1        
Thu   1       1        
Fri     1     1        
Total 3 1 1     5        

Already the second download can no longer happen on a Thursday or Friday. Starting with the third download it is always a Monday. And that’s an effect that is very visible on the download servers — there is a significant downloads peak on Monday, especially around 8 AM in the densely populated timezones.

Spreading it a little

If the fixed download interval creates such nasty effects, what about randomizing the interval a little? Let’s say, instead of using a fixed 5 days interval we would choose something between 3 and 7 days randomly. Each interval length is chosen with probability 0.2 (in other words, 20%). What we get then:

First download Second download Final state
Mon Tue Wed Thu Fri Mon Tue Wed Thu Fri
Mon 0.6     0.2 0.2 0.51 0.1 0.08 0.16 0.15
Tue 0.6 0.2     0.2 0.51 0.1 0.08 0.16 0.15
Wed 0.6 0.2 0.2     0.51 0.1 0.08 0.16 0.15
Thu 0.4 0.2 0.2 0.2   0.51 0.1 0.08 0.16 0.15
Fri 0.2 0.2 0.2 0.2 0.2 0.51 0.1 0.08 0.16 0.15
Total 2.4 0.8 0.6 0.6 0.6 2.55 0.5 0.4 0.8 0.75

This is clearly better since it spreads half of the downloads to weekdays other than Monday. One more tweak I could think of: don’t decide about randomization immediately but wait three days first and decide then whether you want to wait another 0, 1, 2, 3, or 4 days. This idea looks like it should produce the same result but the important difference is: if the three days are over on a Saturday we still won’t decide about the additional delay until Monday. What we get then:

First download Second download Final state
Mon Tue Wed Thu Fri Mon Tue Wed Thu Fri
Mon 0.6     0.2 0.2 0.41 0.12 0.09 0.18 0.2
Tue 0.6 0.2     0.2 0.41 0.12 0.09 0.18 0.2
Wed 0.2 0.2 0.2 0.2 0.2 0.41 0.12 0.09 0.18 0.2
Thu 0.2 0.2 0.2 0.2 0.2 0.41 0.12 0.09 0.18 0.2
Fri 0.2 0.2 0.2 0.2 0.2 0.41 0.12 0.09 0.18 0.2
Total 1.8 0.8 0.6 0.8 1.0 2.05 0.6 0.45 0.9 1

So this is another small improvement. The question is however, can we get rid of that Monday peak altogether?

Counting differently

The only way I could think of was counting differently — ignore the days where the browser was never switched on, only count the days where we checked whether we should download. In a way, this continues the previous idea of splitting the entire expiration time into two stages, only that now we have as many stages as we have days. Since we no longer count the weekend days the result with a fixed download interval is very simple:

First download Second download Final state
Mon Tue Wed Thu Fri Mon Tue Wed Thu Fri
Mon 1         1        
Tue   1         1      
Wed     1         1    
Thu       1         1  
Fri         1         1
Total 1 1 1 1 1 1 1 1 1 1

The “Total” row looks very nice, it’s exactly what we would like to have. However, our download interval matches the number of days now, with the effect that people always download on the same weekday (and those currently downloading on a Monday will continue downloading on a Monday). The obvious solution is to randomize the download interval again, let’s take equal probabilities for 4, 5, and 6 days this time:

First download Second download Final state
Mon Tue Wed Thu Fri Mon Tue Wed Thu Fri
Mon 0.33 0.33     0.33 0.2 0.2 0.2 0.2 0.2
Tue 0.33 0.33 0.33     0.2 0.2 0.2 0.2 0.2
Wed   0.33 0.33 0.33   0.2 0.2 0.2 0.2 0.2
Thu     0.33 0.33 0.33 0.2 0.2 0.2 0.2 0.2
Fri 0.33     0.33 0.33 0.2 0.2 0.2 0.2 0.2
Total 1 1 1 1 1 1 1 1 1 1

There is only one concern left now. What if somebody takes an extended vacation? His filter list will be very outdated yet Adblock Plus will ignore that and wait several days before a download. My first idea for the fix was counting “missing days” with a small factor like 0.25 instead of ignoring them completely. What we get then:

First download Second download Final state
Mon Tue Wed Thu Fri Mon Tue Wed Thu Fri
Mon 0.5 0.17     0.33 0.31 0.19 0.16 0.14 0.19
Tue 0.5 0.33 0.17     0.31 0.19 0.16 0.14 0.19
Wed 0.17 0.33 0.33 0.17   0.31 0.19 0.16 0.14 0.19
Thu   0.17 0.33 0.33 0.17 0.31 0.19 0.16 0.14 0.19
Fri 0.17   0.17 0.33 0.33 0.31 0.19 0.16 0.14 0.19
Total 1.33 1 1 0.83 0.83 1.55 0.95 0.8 0.7 0.95

This caused the Monday peak to reappear, probably not a good idea after all. It should be better to set an additional “conventional” expiration time, if that is reached the download will be triggered even if the computer was never switched on since the previous download.

Summary

To sum up, here is the algorithm I think to implement in Adblock Plus:

Tags:

Comment [4]

  1. ecjs · 2010-05-11 23:48 · #

    Markov chain in a finite state space. The final state matrix is equal “a rank-one matrix in which each row is the stationary distribution”. What a nice application!

    And what about keeping the system as it is, but adding the computation of the remainder of the division by seven of the number of days between the day softExpiration lies in the past and the day of the last download? If it is equal to zero, then delay the download for at least one day, unless hardExpiration lies in the past. Just an idea, but no stochastic theory there.

    Reply from Wladimir Palant:

    No, this isn’t really helping. E.g. if we delay by 1 or 2 days randomly we will move some downloads that would normally happen on a Monday to Tuesday and Wednesday – but the next download for these will be Monday again (remainder isn’t zero this time). So we will have alternating downloads for Monday and Tuesday/Wednesday but none for Thursday or Friday.

    First download Second download Final state
    Mon Tue Wed Thu Fri Mon Tue Wed Thu Fri
    Mon   0.5 0.5     0.5 0.25 0.25    
    Tue 1         0.5 0.25 0.25    
    Wed 1         0.5 0.25 0.25    
    Thu   1       0.5 0.25 0.25    
    Fri     1     0.5 0.25 0.25    
    Total 2 1.5 1.5     2.5 1.25 1.25    
  2. ecjs · 2010-05-11 23:50 · #

    By “the day softExpiration lies in the past”, I mean the day Adblock Plus finds out that softExpiration date lies in the past, and tries so to download the updated file.

  3. Nathaniel Kofalt · 2010-05-12 00:30 · #

    An elegant solution to a deceptively complex problem! Thanks for the myriad details, it was a very interesting read.

  4. zeus · 2010-05-19 06:41 · #

    How about a custom http header returned that adblock recognizes that says check back on x.

    Still causes some bandwidth usage but very simple to implement.

    Reply from Wladimir Palant:

    No custom header necessary, a “304 Not Modified” response will do – and is already recognized by the client. However, this doesn’t really solve the problem because EasyList is served via HTTPS now. And serving an empty response via HTTPS only helps reduce bandwidth and CPU load nominally (SSL handshake still needs to happen).

Commenting is closed for this article.