Friday Post 2: Rocketbooming
Posted in Uncategorized / October 27th, 2006
Ze Frank is in the middle of a nerd fight about download stats, and it all boils down to Rocketbooming. I’m not going to get into whether Rocketboom’s reaction to this is justified, because the baseline issue is actually more interesting.
It’s about the metrics, really. Here’s the thing: when I wrote the blurb for the back cover of the Crow book, I said 225,000 people had downloaded the Pig book. I got that number by looking at the download stats of the original files (in all languages), and running a few filters. We started at something approaching 375,000. Take out bots and other such nonsense, and it drops a good amount. But then you look at the downloads and you see that someone downloaded it three times in the space of a minute… maybe their PDF viewer was wonky, they hit reload a few times to make it work. Whatever it was, I tried to filter by IP based on timeframe, so that I could arrive at actual downloads. After a lot of that, I trimmed the number down to about 225,000, which is still fantastically large.
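For the curious, the IP-and-timeframe filtering described above can be sketched roughly like this. It is a minimal illustration, not the actual filter that produced the 225,000 figure: the function name, the (ip, timestamp) input shape, and the five-minute window are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def count_distinct_downloads(events, window=timedelta(minutes=5)):
    """Count downloads, ignoring repeats from one IP inside `window`.

    `events` is an iterable of (ip, timestamp) pairs. The five-minute
    default window is an illustrative assumption, not a measured value.
    """
    last_counted = {}  # ip -> timestamp of the last download we counted
    total = 0
    # Walk the log in time order so "repeat" has a clear meaning.
    for ip, ts in sorted(events, key=lambda event: event[1]):
        previous = last_counted.get(ip)
        if previous is None or ts - previous > window:
            total += 1
            last_counted[ip] = ts
    return total


start = datetime(2006, 10, 27, 12, 0, 0)
sample = [
    ("10.0.0.1", start),
    ("10.0.0.1", start + timedelta(seconds=20)),  # reload, ignored
    ("10.0.0.1", start + timedelta(seconds=40)),  # reload, ignored
    ("10.0.0.2", start + timedelta(seconds=30)),
    ("10.0.0.1", start + timedelta(days=1)),      # a genuine return visit
]
print(count_distinct_downloads(sample))  # 3
```

The window is the judgment call: too short and the reload-happy PDF viewer still counts three times, too long and a real second download the next morning vanishes.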
I know if I wrote my initial stats (375,000) on the Crow book, it’d be much more impressive, and people would think I was super-1337. But I know in my heart that’s not a valid number (even 225,000 is iffy to me), and all it does is set up a standard that is unhealthy. People have to be dishonest or suffer the consequences.
Let’s say I, the author of the Crow book, am a different person. That book’s been downloaded less than 50,000 times (with filtering). Without filtering, about 70,000. So as an author of a similarly-targeted product, what do I announce as my stats? 70,000? If I don’t, I’m a small-fry, and advertisers will shy away from me (not that I’m looking for advertisers, but y’know). But if I say 70,000 and some advertiser wants to see results for that kind of audience, they won’t get it, because my actual number is likely only 50,000! I’m setting myself up for a fall, but I don’t have much choice because my competitor (the Pig book) is going around yelling huge numbers from the rooftops.
The thing is, Rocketboom needs to filter down their numbers not because they’re being dishonest (I don’t think you’d call it dishonest anyway). They have to do it because they’re pushing metrics in the wrong direction… we have the power and intelligence on the web to at least improve on the TV model, which is a lot of silly guesswork and extrapolation. We should be able to say: our downloads are X, and our likely real audience is X-Y. You don’t get numbers that compete with Grey’s Anatomy, but you get a better ROI. Advertisers will get more actual bang for the buck because your bang is reasonable. And if Rocketboom tapers their numbers, it will put less pressure on their competitors to fudge numbers to appear to be in the same league.
What will happen in this kind of arms race is that one day, a vidcast that has a so-so audience will distribute it via every possible outlet, claim 1 million downloads a day (of which 10,000 are actually watched) and push the overall value per download on the web so low that Rocketboom, with their 300,000 downloads a day, will start to lose money. And then they’ll have to inflate their stats, and so on, until the only people that can play professionally are the ones that can sign deals with big distributors to help boost their download stats.
Transparency and admissions of imperfection are key to internet life, and it helps everyone to admit that their download stats are flawed. If we trim them back and try to present REALISTIC numbers rather than “competing with TV” numbers, advertisers will end up a lot happier.
Tags: crowwhocouldfly, pigandthebox, rocketboom, statistics, zefrank



October 28th, 2006 at 2:29 am
Massaging web statistics into useful data isn’t even a black art; it’s little more than a one-line shell script and a bit of guesswork to draw some conclusions.
My suggestion is this:
grep -i "myfile" /var/log/apache/access_log | grep -Eo "^([0-9]{1,3}\.){3}[0-9]{1,3}" | sort --unique | wc -l
… then count your blessings that you have anything even approaching this level of accuracy for the number of people that have seen myfile–you’re still talking about a head count many orders of magnitude more accurate than you get with something like Nielsen ratings, all in about the space of a minute for a moderately large (~250MB) web log.
My two cents.
October 28th, 2006 at 11:33 pm
I was reading an article about how one counts views of videos, and it seems to me they’re purposely inflating results. Exactly like you said… uniques only, snag the lot. You’ll maybe lose a few people sharing an IP, but I’d say that’s probably offset by unused downloads.
The thing that’s key is that this is a pull medium… you wait for people to request a file, and you deliver it. That alone should make it easier to count relatively precise numbers. For things like Revver, you can even deduce better numbers by how many ads were served (which happens at the end of the video). Nielsen ratings are hocus pocus next to this.
But they’re counting distribution channels where downloads are automatic. If I sign a deal with some hardware manufacturer to have my podcast download automatically on every unit sold, that’s GREAT for me, but I can’t reasonably count all those downloads as eyeballs. Sure, they were downloaded, but there’s a difference between a pull and an intentional pull.
What we need is a standard methodology for counting views. Like your grep, minus some standard percentage for what we assume are the non-views. It’s downsizing the audience, but if the accepted standard isn’t to downsize somewhat, some fool will spam half the universe with an embedded flash video and claim an audience of several million. Cause it’s just as valid.
My computer’s nearly recovered. I wanted to remember to partition my HD this time so I could once again dual-boot my machine, and (smacks head) I forgot. All I need to do is find my Illustrator CD so I can fix the Swedish Pig book and draw some pandas.
Stupid computers.
October 29th, 2006 at 12:04 am
Glad to hear the computer’s back up and running (well, nearly). I’ve been having problems with BSODs the past few days myself, so I’m reformatting and reinstalling the antichrist-cum-operating system on one box. Thankfully, I have many more where that came from.
I agree with your points about standardisation regarding downloads or “views”–problem is, it’s more complicated than any standardised math formula is going to be capable of explaining clearly (see under U.S. electoral college to see what I mean). For example, you mention shared IP addresses–what about viewers who don’t hit the web server at all from any IP address, because there’s a content cache server between them and your data? This used to be common practice overseas where bandwidth to the continent is expensive, but now most major ISPs do it to some extent too (think about the advantage of hosting Microsoft’s Patch Tuesday files on a LAN of >1M customers as opposed to everyone rushing to download the same files within a few hours’ window).
That’s why I stand by my best bet, which is to just state a few hard statistics like hits and unique hits, then state a margin of error of perhaps ±10% to account for false positives and false negatives. If you want harder numbers for distribution then BitTorrent might be another idea, as trackers can produce more reliable statistics (including how many people actually completed the download as opposed to just hitting the web server then aborting 5 seconds later because it was too slow or uninteresting)–but again, I remind you of the statistics of electoral results. If they can’t get it perfect, the chances of anyone else doing it seem pretty slim.
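As a rough sketch of what that kind of report might look like, here is one way to turn a list of requesting IPs into hits, unique hits, and a plus-or-minus range. The function name and the shape of the output are made up for illustration, and the 10% margin is just the guess from above, not a measured error rate.

```python
def summarize_downloads(ips, margin=0.10):
    """Report raw hits, unique hits, and a +/- margin around the uniques.

    `ips` is a list of requesting IP addresses, one per hit. The 10%
    default margin is an assumed figure, not a measured error rate.
    """
    hits = len(ips)
    unique_hits = len(set(ips))
    return {
        "hits": hits,
        "unique_hits": unique_hits,
        "low": round(unique_hits * (1 - margin)),
        "high": round(unique_hits * (1 + margin)),
    }


# 20 distinct addresses, the first five of which each hit twice.
log_ips = [f"192.0.2.{n}" for n in range(20)] + [f"192.0.2.{n}" for n in range(5)]
print(summarize_downloads(log_ips))
```

Publishing the range alongside the raw hit count keeps the headline number honest without pretending the unique count is exact.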