On fluctuations in performance testing results · 2011-04-08 10:25 by Wladimir Palant

Yesterday I concluded that (with all bugs fixed) the results of Mozilla’s add-on performance measurements shouldn’t fluctuate by more than 2% of the Firefox startup time. Sorry, I was wrong. Later that day I noticed Read It Later, currently #43 on that list, supposedly causing 4% slower Firefox startup times. Yet this extension was definitely disabled during performance testing due to bug 648229. Does this mean that each disabled add-on causes a 4% performance impact? Definitely not: disabled add-ons have no measurable effect on performance. So digging up the raw numbers for that add-on was a good idea. Here they come:

Test run                       Reference time, no extensions (ms)   Read It Later 2.1.1 (ms)   Change
Windows 7 on March 26th        548.89                               549.00                     +0.0%
Windows 7 on April 2nd         541.89                               617.68                     +14.0%
Windows XP on March 26th       399.79                               399.63                     -0.0%
Windows XP on April 2nd        401.21                               402.79                     +0.4%
Mac OS X on March 26th         694.79                               690.58                     -0.6%
Mac OS X on April 2nd          699.58                               699.58                     +0.0%
Fedora Linux on March 26th     498.37                               494.05                     -0.9%
Fedora Linux on April 2nd      495.95                               511.63                     +3.2%

Most results indeed show something very close to the reference time, which makes sense. However, on Fedora this extension supposedly caused a 3% slowdown during the second run. On Windows 7 it even got to 14%. Here are the individual measurements (in milliseconds) that this Windows 7 average consists of:

845 552 530 541 543 535 554 549 552 535 550 603 565 598 703 720 809 753 826 718

The first measurement is always significantly higher and is ignored for the average, as I already mentioned. The other measurements, however, start out pretty close to the reference value (as expected), and then something changed and the numbers got about 200 ms higher. Had this happened at the beginning of the test run, it would have increased the extension’s score on Windows 7 by 35%! Even if the measurements on all other platforms were correct (Fedora’s wasn’t), it would translate into 9% more in the overall score. In fact, I suspect that this is exactly what happened to the add-on that was tested right after it.
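
The arithmetic behind the Windows 7 figure can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes that the reported value is the plain mean of all measurements after the first one, compared against the reference time from the same test run.

```python
from statistics import mean

# Individual Windows 7 measurements from April 2nd (ms), as listed above.
RUN_MS = [845, 552, 530, 541, 543, 535, 554, 549, 552, 535,
          550, 603, 565, 598, 703, 720, 809, 753, 826, 718]
REFERENCE_MS = 541.89  # reference time without extensions, same test run


def average_startup(samples):
    """Mean startup time, ignoring the first (always slower) measurement."""
    return mean(samples[1:])


def slowdown_percent(average, reference):
    """Relative slowdown of `average` compared to `reference`, in percent."""
    return (average / reference - 1) * 100


avg = average_startup(RUN_MS)
# Prints: 617.68 ms, +14.0%
print(f"{avg:.2f} ms, {slowdown_percent(avg, REFERENCE_MS):+.1f}%")
```

The result matches the table row for Windows 7 on April 2nd, so the 14% really is nothing more than the late, slow measurements pulling up the average.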

Nils Maier thinks that such fluctuations have something to do with other jobs running on the same machine, particularly ones doing heavy I/O. I am no expert on this, so I can neither agree nor disagree. Dear AMO, please clear this up before you start stigmatizing add-ons as “slow”.

Update: I looked into the results of the other add-ons tested on the same machine (talos-r3-w7-020). Both the add-on tested before Read It Later (Flagfox) and the one tested after it (SimilarWeb) show the same irregularities: the individual measurements were at first pretty close to the test results from the previous week and became significantly higher towards the end of the test. This brought Flagfox 14% more on Windows 7 (4% more in the overall score), and SimilarWeb got 8% more (2% more in the overall score). In the case of SimilarWeb these 2% were enough to push it into the list of Top 10 worst offenders.
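
One way to spot this kind of mid-run shift is to compare the average of the early measurements with the average of the late ones. The sketch below does that for the Read It Later run on Windows 7. The split point (the middle of the run) and the function name are arbitrary choices for illustration; nothing like this is known to be part of the actual test harness.

```python
from statistics import mean

# Read It Later on Windows 7, April 2nd (ms), including the first measurement.
RUN_MS = [845, 552, 530, 541, 543, 535, 554, 549, 552, 535,
          550, 603, 565, 598, 703, 720, 809, 753, 826, 718]


def drift_ratio(samples):
    """Ratio of the late-run average to the early-run average.

    The first measurement is dropped, then the rest is split in the middle
    (an assumption made for this sketch). A ratio close to 1.0 means the
    run was stable; a clearly higher value means the machine got slower
    while the test was running.
    """
    usable = samples[1:]
    middle = len(usable) // 2
    return mean(usable[middle:]) / mean(usable[:middle])


ratio = drift_ratio(RUN_MS)
print(f"late measurements are {ratio:.2f}x the early ones")
```

For this run the late measurements come out roughly a quarter slower than the early ones, which an extension that was disabled the whole time cannot explain.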

Update2: Nils wrote a script to check the standard deviation of the performance measurements; you can see his script and the results here: https://gist.github.com/909583. I could reproduce his results with my own script, checking the logs for all platforms. Out of 100 add-ons, twelve weren’t tested at all, most likely because the download failed (different download packages for different platforms). Four add-ons were only tested partially (NoScript and Adblock Plus only on Windows 7/XP, BetterPrivacy only on OS X and Fedora, Web of Trust only on Windows 7/XP and Fedora). In all four cases the test timed out because the browser couldn’t be closed, most likely because of first-run pages. In addition, Nils found five add-ons with a negative impact. These were definitely tested in a disabled state, but there are likely many more.

As to the remaining add-ons, measurements for 16 of them show a high standard deviation (more than 10% of the average startup time). Given that these irregularities only appear on one or two platforms, it should be safe to exclude the extension itself as the source of the deviations. One extreme case seems to be Tree Style Tab, which was tested in a disabled state, yet one of the measurements on Windows 7 was a whopping 500% above reference time. A similar scenario happened with Forecastfox on OS X: it had multiple outliers, with one of them being 110% above reference time. Both add-ons didn’t make the list because they got a negative score on at least one other platform, so their results were ignored (this happened to most extensions that were tested in a disabled state). The add-ons with high deviations that made the list are (starting with the highest standard deviation): Read It Later, StumbleUpon, RSS Ticker, FastestFox, Flagfox, Download YouTube Videos (both Windows 7 and Fedora results are suspicious), Personas Plus, SimilarWeb (also both Windows 7 and Fedora suspicious), CoolPreviews.
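
A check of this kind is short. The following is a minimal sketch and not Nils’s script (that one is linked above): it assumes the sample standard deviation, compares it with 10% of the run’s own average, and ignores the first measurement. The function name, the threshold parameter and the second, stable series are hypothetical.

```python
from statistics import mean, stdev

# Read It Later on Windows 7, April 2nd (ms), including the first measurement.
SUSPICIOUS_RUN_MS = [845, 552, 530, 541, 543, 535, 554, 549, 552, 535,
                     550, 603, 565, 598, 703, 720, 809, 753, 826, 718]
# Made-up series showing what a well-behaved run might look like.
STABLE_RUN_MS = [850, 549, 551, 547, 550, 548, 552, 546, 553, 550]


def has_high_deviation(samples, threshold=0.10):
    """True if the standard deviation exceeds `threshold` times the average.

    Assumptions of this sketch: the first measurement is dropped and the
    sample (not population) standard deviation is used.
    """
    usable = samples[1:]
    return stdev(usable) > threshold * mean(usable)


for name, run in [("suspicious", SUSPICIOUS_RUN_MS), ("stable", STABLE_RUN_MS)]:
    print(name, has_high_deviation(run))
```

With these inputs only the Read It Later run is flagged. Whether 10% is the right threshold is a judgment call; a run that drifts slowly can stay below it and still produce a misleading average.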


Comment [18]

  1. yamaban · 2011-04-08 13:23 · #

    There’s a good one-liner in German: “Wer viel mißt, mißt viel Mist”, which translates loosely into: “He who measures much, measures much shit”.

    Sincerely said: These ‘irregularities’ make the whole measurements suspect.
    For each measurement there should be noted how much cpu/io/net load was there during the test run.
    Without these numbers, AMO’s tests can’t be taken for more than a trend.

    Sad really, because the intention behind these tests (to make FX a better experience) is more than just valid.

    As you’ve said before, AMO could make some things better NOW, with just a little work, such as providing developers a link to the test protocols, fixing “bug 648229” or at least announcing if an add-on can’t be tested due to this (truth in advertising).

    One thing is for sure: slow or not (in my opinion it’s fast enough for me), I don’t like the thought of browsing without Adblock Plus.

    PS: Even with subscriptions to EasyList_Germany + EasyPrivacy+EasyList + over 380 personal filters, the browsing isn’t slower than without. What is slower is startup and opening a new window, by about 10-15% of the time without. Acceptable for me (Nightly, Linux 64bit, Nettop w/ Atom 330/1.6 GHz/ION + 4GB ram).

  2. Nils Maier · 2011-04-08 17:07 · #

    It should be noted that high stddev values are not necessarily (but still may be) due to external influences in tests.
    It is, however, also possible that addons have largely non-deterministic operations in the startup path, such as network reads with varying network latencies, executing certain code only each x-runs and so on. (That doesn’t apply to addons that were disabled during tests, of course)

  3. Michael Kaply · 2011-04-08 18:57 · #

    Have you noticed that Mozilla and the AMO team are staying completely silent on this issue?

    They released the numbers and then stepped away.

    I know that they are partying in Las Vegas, but they could at least say something.

    Very aggravating.

    Reply from Wladimir Palant:

    Yes, I noticed. So far I only got a personal opinion from some Mozilla employees, nothing “official”. I sent a mail to Justin Scott because of that, I hope he will reply.

  4. Jorge · 2011-04-08 22:25 · #

    If there are bugs being filed, I’d love to know about them so I can keep track of their progress. That’s the best way to move things forward and reach a level of reliability that is more acceptable. We hope to iterate on this new system quickly and your help and feedback is much appreciated.
    I’m not sure what kind of response you’re expecting from us. I’d love to talk about it, so please ping me on IRC or send me an email.

    Reply from Wladimir Palant:

    There you go – CC’ed on all the bugs filed so far. I didn’t file one yet for add-ons being installed unpacked.

    PS: I think there is a whole lot that Mozilla could do. Email sent.

  5. smo · 2011-04-08 23:06 · #

    + on 3

  6. Michael Kaply · 2011-04-08 23:46 · #

    The response we’re expecting from you is:

    “We screwed up. We released performance numbers that were basically incorrect and using invalid methodology. We threw lots of add-on authors under the bus. We’ll do better next time.”

  7. Jorge · 2011-04-09 01:02 · #

    We will be posting some rectifications in the Add-ons blog very soon and will be updating the page to clarify a few things for people who are arriving to it directly (like most of the media).

    We don’t think that the list of top slow add-ons is incorrect. Those add-ons are slow, and the only exceptional case I know of is FlashGot and the skewed Mac OS results. This should be fixed soon. The list is incomplete due to testing failures that should be fixed soon. I also think there needs to be a lower bound to the list because right now we’re showing a number of add-ons with very low overhead that just happen to be in the top 50 of the 100 we’re testing.

    Like I said before, the best approach is to file bugs and help improve the system. It is definitely a priority for us to have reliable results and anything you can do to report errors is really appreciated.

    Reply from Wladimir Palant:

    If you show some add-ons but omit others without making it clear, this is extremely unfair. My guess is that no more than 40 test results are even approximately correct. My testing indicates that even the most simple extension causes a 3% slowdown; with increased complexity of the add-on, the amount of unavoidable slowdown should also increase. This means that anything on your list below 3% was most likely either tested disabled or was affected by other issues. And FlashGot is by no means the only extension that got a higher score, though it is probably the one with the most significant deviation. For example, I guess that SimilarWeb scored around 4% higher than it should have.

  8. Ken Saunders · 2011-04-09 06:30 · #

    Wladimir, I applaud your efforts for making things as clear, accurate, and fair as possible on your end, and for standing up for developers.

    For the record, I really couldn’t care less about startup time. Seriously, I just don’t care. When I need a faster Firefox, I have a profile for it. My add-ons come first though, and I take any weight they bring.

    “Those add-ons are slow”
    Jorge, there’s an issue with that line, a serious one, and it’s the same way AMO is classifying add-ons. Do they cause Firefox to run slowly, use more resources, other? Or do they just increase startup time? As far as I know according to the list, it’s just startup time and so it isn’t fair to classify them the way that is currently being done.

    “Add-ons provide many useful features and functions, but they can also cause Firefox to become slower. Some add-ons can even slow Firefox to a crawl and make it difficult to use for regular web browsing”

    “The following add-ons have the most impact on how long it takes Firefox to start up.”

    So what’s the message here?

    I’m well aware of the fact that the majority of users don’t know that add-ons can have an impact on Firefox performance, they always blame Firefox, I get that, but the messaging needs to be changed.
    The media is calling them slow add-ons, or add-ons that slow down Firefox. Right now, there are only numbers showing that they slow down startup time. I’m sure that there will be other performance test reports later, but that isn’t the case right now.
    I’ve read comments from Firebug users on one tech news site, and the majority of them just don’t care about startup time. They care about their add-on. I’m certain the same goes for Adblock Plus and others.

    By the way, I’m fine with 11,000 or 1.2 million Easy List filters.

  9. Dave Garrett · 2011-04-09 08:36 · #

    When the perf page first launched I saw Flagfox getting listed at 12%, then later it was up to 14%, now it’s apparently at 16%. I haven’t released any updates since the launch of this page, so it’s clearly giving bad results for some reason. Not only would I hope the accuracy of this thing gets fixed, but I’d really like a detailed profile to be generated and emailed to us so we can actually investigate and fix the problems this thing is trying to warn us about.

    I also filed a bug regarding the wording of the page: it erroneously implies they’re the worst startup time addons, and they’re not. The blog post about its launch stated it’s only out of the top 100 used extensions, but the page itself just says “add-ons have the most impact”. https://bugzilla.mozilla.org/show_bug.cgi?id=647398

  10. Nils Maier · 2011-04-09 18:11 · #

    “I’m not sure what kind of response you’re expecting from us. I’d love to talk about it, so please ping me on IRC or send me an email.”

    I raised lots and lots of issues before you went public, showed that the “blessed test run” in fact was utterly wrong, asked about and questioned the methodology in general, etc.
    Wladimir now added a whole bunch of different issues on top of that.
    I always CC’ed at least amo-editor-internal, and often CC’ed you directly as well.
    I consider many of the issues we raised to be proven errors and problems by now.

    We are not talking about minor inconsistencies here… There are pretty grave problems present (from implementation errors to systematic accuracy issues), most of which should have either popped up during internal testing or community feedback.
    At least noticing that out of 100 add-ons not even half produced reasonable results should have been a clear indication that there is something fishy going on here.

    The actions I would have expected, before mozilla made the announcement, would have included:

    - Verify all components of the system, incl. methodology, implementation and interpretation of the test results.

    - Document the methodology and interpretation, so that “outsiders” may peer-review it

    - Get and address feedback from the add-on community!

    - Document more clearly what the tests are about in the first place and what add-ons were tested, so that add-on authors and users can understand what this is all about and can make an informed decision.

    - Make available the test data in a usable fashion, either for authors to verify the results or to check where potential issues are (such as non-determinism, or the OSX Flashgot stuff for example)

    - Make available tools to add-on authors so that they can test their add-ons themselves (and test new, optimized code)

    And let’s not forget another, very important issue here:
    There are virtually no tools or documentation helping developers to rectify startup perf issues. The only actual documentation is the “Best Practices” article, which is nothing more than a stub, and I know that because I wrote a major part of that stub.
    The first email Justin sent spoke of a 2 week grace period where authors are expected to improve things. Even if the results were accurate, authors would have lacked knowledge, tools, documentation and time.

    Since it is too late for all this now, I expect mozilla to retract the results and apologize for going public with results of questionable accuracy and interpretation, and to hold off on rebooting this campaign until the tests are fixed or of proven accuracy, with sane interpretation, and viable tools and documentation are available to authors.

  11. Jorge · 2011-04-10 00:35 · #

    @Ken: you’re obviously not the typical Firefox user. Most users do care a lot about performance, and startup time is not only one of the top complaints from Firefox users but also one of the top performance bottlenecks affected by add-ons, if not the top one. Resource usage at runtime is important, but testing this reliably is much harder. We decided to begin with startup time and I really believe that it will make a tangible difference in the months to come.

    @Dave: thanks.

    @Nils: I’m sorry our methodology and messaging are not up to your standards, but we’re not retracting this initiative just because of that. Several bugs are being filed around this, and I recommend you add your own if you feel there’s anything else missing, including the communication and policy fronts.

  12. Nils Maier · 2011-04-10 02:28 · #


    To say it more bluntly: Your (as in mozilla) methodology, testing/QA and messaging was absolutely horrible, and continues to be horrible. You seriously failed here, miserably. You acted irresponsibly and were grossly negligent. You might be affecting revenues of some add-on based businesses already, because you’re actively misinforming and ill-advising users. I already read in some German forums that users are actively using the numbers to decide which add-ons to uninstall and recommending to other users to consult the list.

    Now act like grown-ups and retract the proven wrong, inaccurate, meaningless and misleading results and apologize for the mess you created.

    It should be noted that I didn’t ask for the campaign/initiative to be abandoned, but for it to be postponed again until you got your stuff straight.
    I applaud the idea of add-on performance tests in general, but you have to do it right.

  13. ancestor · 2011-04-10 06:20 · #

    Your work investigating the issues is admirable and so is your composure reporting everything. Thanks on behalf of other add-on authors.

    “I’m sorry our methodology and messaging are not up to your standards, but we’re not retracting this initiative just because of that.”

    Either you are being intentionally condescending here or you honestly don’t appreciate the weight of the presented criticism. Please read Nils’s comment again, it is not about anyone’s personal standards but about standards and values that are widely respected, even required, in the Mozilla community.

  14. Jorge · 2011-04-11 20:42 · #

    @ancestor: I understand the arguments very well, and my conclusion is that the test results are still valuable for users to make good decisions about the add-ons they have installed. If people are uninstalling them and getting a performance boost, and that’s what they wanted, then I think that’s a good thing.

    I wasn’t trying to be condescending. My point was that we can’t hold off on launching this initiative until we have satisfied every developer’s concerns. His input is very valuable, but I would be happier if he used all that energy trying to improve things instead of just trying to stop them.

    The bugs in the system are filed and have a very high priority for us. However, I’m pretty sure that even after we fix all of these problems, there will still be complaints coming from developers featured on the slow list.

    Reply from Wladimir Palant:

    Jorge, 0.2s is not something a user can really perceive – so if you convince them that the add-on slows down Firefox startup by 50% then that’s exactly what they will notice after uninstalling that add-on (placebo effect). One would think that after all the irrational bashing Firefox got when somebody convinced users that it was “slow and bloated” Mozilla would be careful not to put others in their own ecosystem into the same position.

  15. Johan Sundström · 2011-04-12 00:19 · #

    What I find most ridiculous of all about these perf tests is that actually browsing web pages is much slower with a naked Firefox installation, than it is with AdBlock Plus and the EasyList subscription, based on all the crud (flash, ads, et cetera) not being loaded or rendered thanks to AdBlock Plus.

    But the test only accounts for a boundary condition that, in the case of AdBlock Plus, doesn’t really matter in the larger perspective, even if it did affect startup to the extent postulated.

    I agree that users are getting confused by the information, especially if they were to remove AdBlock Plus due to start-up slow-down. They’d be cheating themselves of speed boosts and other improved performance in their daily browsing.

  16. ancestor · 2011-04-12 04:06 · #

    “I understand the arguments very well, and my conclusion is that the test results are still valuable for users to make good decisions about the add-ons they have installed. If people are uninstalling them and getting a performance boost, and that’s what they wanted, then I think that’s a good thing.”

    Yes, that is the good part. There is also the other one: users making bad decisions because of incorrect and incomplete data, developers having their hard-earned reputation unfairly damaged. Are you sure the tradeoff is positive? Because if Wladimir’s investigation is right, it doesn’t seem like it.

    Of course it’s not realistic to “hold off on launching this initiative until you have satisfied every developer’s concerns” but that’s a total straw man. What we are talking about is fixing some glaring, show-stopping problems. To take the most spectacular one, according to Wladimir’s investigation, approximately half of all add-ons are effectively not tested at all. Let me say this again: the test does not work half of the time! How on earth did this not get caught? Are you seriously insisting that bugs of this magnitude should not be blockers? This is a straight-up comical problem to let slip into production.

    This project personally affects the developer community so it should have been carried out very carefully. Instead, it was rushed and botched. It’s not an opinion, it’s a quantifiable, technical fact. It is frustrating that Mozilla won’t acknowledge it and say: this is not up to par with our standards, we screwed up.

    Reply from Wladimir Palant:

    There is an explanation why that particular problem wasn’t noticed: there was only one test run with Firefox 4 before the announcement (on March 26th) and nobody checked the results too closely apparently. Before that they were testing with Firefox 3.6 and pretty much all the add-ons were explicitly compatible.

    Still, it is disappointing to see how bad this data is – and how the announcement failed to even mention that “we are still working on fixing some remaining issues” (like numerous test crashes that somebody must have noticed).

  17. Tim · 2011-04-14 14:33 · #

    The take home message (for and from an average FF user):
    1) Always use “Adblock Plus” (and Firebug if you are a WebDev).
    2) Ignore the incompetents of jorge@mozilla.org and https://addons.mozilla.org/en-US/firefox/performance/

  18. ultravioletu · 2011-04-15 10:52 · #

    I cannot believe what I’m reading here.

    @Jorge: your product is a web BROWSER, not a startup program.

    I would like to see the effect of addons on the browsing experience, not on the startup. How many times does a user start the browser?

    Or, if you think my previous question is stupid: how much does the extra 0.1 s add to the perceived slowness of the overall user experience compared to the delay one gets after typing a URL and pressing ENTER? Btw, why is nobody taking a look at the size and performance of urlclassifier3.sqlite?

    To put it in the “car” metaphor, what this test did was compare the times of locking/unlocking the doors to determine which model is faster.

    I also welcome performance measurements to identify and help sort out performance problems in addons, but they have to be realistic.

    Reply from Wladimir Palant:

    People are looking into Firefox performance. And it is improving (particularly the location bar got a lot faster for me with Firefox 4).

    Most people don’t keep the browser open all the time; they start it regularly. So startup time is relevant. The idea is to measure run-time performance as well later on, but that is a lot more complicated.
