
Search Analytics

Conversations with your customers


Where does the Long Tail begin?

And, for that matter, what constitutes the middle torso? We've been wondering if there is an accepted, standard way to slice up the friendly Zipf distribution into a fat head, a middle torso, and a long tail. By percentiles? By query frequency thresholds? One could argue that queries that occur only once constitute the long tail, but it's equally plausible to include twosies (queries that occur twice) as well.

You may not have realized this, but taking a statistics class in grad school fifteen years ago entitles one to a lifetime of free consulting with the hapless stats instructor. So I contacted my old friend and professor, Joe Janes, now Associate Dean of the University of Washington iSchool, for his opinion. What follows are Joe's thoughts; please share yours in the comment section below:

Well, this gave me a chance to stretch my statistical muscles, which have been dormant of late. The short answer, so far as I know and can tell, is that there's no firm definition. I liked the Wikipedia entry on Zipf, which I imagine you've already seen, and which has some interesting, more detailed external links. I think given the loosey-goosey nature of the "long tail" discussion, there aren't any firm cutoffs for what constitutes the long tail, middle torso, bald spot, etc.

In such cases, I would fall back on old friends like percentiles and their cousins, deciles and quartiles and such. One might look at a Zipf/long tail diagram (the cumulative distribution function on the Zipf wiki page, for example, viewed upside down) and arbitrarily say that the 'tail' begins at a k of about 2 for most of the curves.

From that, and a little integral calculus, or a simulator, you could say that 10% (or 20%, or 5% or whatever) of the curve lies to the right of that point, and that's what you think of as the tail. You could do much the same for any section of the distribution.
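To make that calculation concrete, here is a minimal sketch in Python. It assumes a finite, discrete Zipf distribution in which the item at rank r has weight 1/r^s; the exponent `s`, the number of distinct items `n_items`, and both function names are illustrative assumptions, not anything taken from the spreadsheet or the Wikipedia plot. It sums the weights directly rather than using calculus, which is effectively what a simulator would do.

```python
def zipf_weights(n_items, s=1.0):
    """Unnormalized Zipf weights 1/r**s for ranks 1..n_items (s is an assumed exponent)."""
    return [1.0 / rank ** s for rank in range(1, n_items + 1)]


def zipf_tail_share(cutoff_rank, n_items, s=1.0):
    """Fraction of the total mass held by ranks strictly greater than cutoff_rank."""
    weights = zipf_weights(n_items, s)
    return sum(weights[cutoff_rank:]) / sum(weights)


def rank_for_tail_share(target_share, n_items, s=1.0):
    """Smallest cutoff rank such that the ranks beyond it hold at most target_share."""
    weights = zipf_weights(n_items, s)
    total = sum(weights)
    cumulative = 0.0
    for rank, weight in enumerate(weights, start=1):
        cumulative += weight
        if 1.0 - cumulative / total <= target_share:
            return rank
    return n_items


if __name__ == "__main__":
    # Hypothetical example: 10,000 distinct queries, exponent 1.
    for share in (0.05, 0.10, 0.20):
        cutoff = rank_for_tail_share(share, n_items=10_000)
        print(f"tail holding {share:.0%} of the curve starts after rank {cutoff}")
```

Going the other way works too: pick a cutoff rank, and `zipf_tail_share` reports what fraction of the curve lies to its right. Both answers shift with `s` and `n_items`, so the numbers only mean something once those are fitted to your own data.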

This is how inferential statistics is done (the old 5% likelihood of being wrong thing), or confidence intervals in surveys and polls. It's trickier with a distribution like this, with multiple parameters, as opposed to the normal distribution, which has only two parameters and a standardized version.

I couldn't find a good simulator but I think that's what your spreadsheet is doing, so that may serve your purposes.

This may be a long-winded way of saying "I don't know" or "You've already got it right"...or it might actually be helpful.

Bottom line: if you create or propose "standards" for this sort of thing, I don't think anybody will call you wrong, from a strictly statistical perspective, unless what you proposed was truly wacky (like the tail only captured 0.5% of the curve or 95% or whatever).

Excellent advice, even if the answer is still a bit elusive. Many thanks, Joe!
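If you'd like to try a few candidate cutoffs on your own query log, here is a rough Python sketch of the two approaches discussed above: slicing by percentile of total query volume, and slicing by a raw frequency threshold. The function names, the 50% head and 20% tail defaults, and the sample counts are all assumptions chosen for illustration; nothing here is a proposed standard.

```python
def segment_by_volume(query_counts, head_share=0.5, tail_share=0.2):
    """Split queries into head/torso/tail by cumulative share of total volume.

    query_counts maps each query string to how many times it was searched.
    A query's segment depends on the share of volume already covered by the
    more frequent queries ranked ahead of it.
    """
    segments = {"head": [], "torso": [], "tail": []}
    total = sum(query_counts.values())
    if total == 0:
        return segments
    ranked = sorted(query_counts.items(), key=lambda item: item[1], reverse=True)
    running = 0
    for query, count in ranked:
        share_before = running / total
        if share_before < head_share:
            segments["head"].append(query)
        elif share_before < 1.0 - tail_share:
            segments["torso"].append(query)
        else:
            segments["tail"].append(query)
        running += count
    return segments


def tail_by_frequency(query_counts, max_count=1):
    """Alternative rule: the tail is every query seen at most max_count times."""
    return [query for query, count in query_counts.items() if count <= max_count]


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    sample = {"a": 40, "b": 20, "c": 15, "d": 10, "e": 5, "f": 5, "g": 5}
    print(segment_by_volume(sample))
    print(tail_by_frequency({"x": 3, "y": 1, "z": 2}, max_count=2))
```

Setting `max_count=1` gives the "singletons only" tail, and `max_count=2` adds the twosies. Running both rules on the same log is a quick way to see how much the choice of definition actually changes what lands in the tail.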

So, where would you draw the lines to segment your Zipf distribution?

—Lou Rosenfeld
