The Middle Torso
Search queries, when counted and sorted by frequency, fall into a hockey-stick-shaped Zipf distribution (you can see an example from Michigan State University in our Chapter One draft). A handful of the most common queries generally account for a surprisingly large portion of all queries. Information architects are especially interested in these "short head" queries, which help us determine efficient ways to match users with the content they need (via such options as Best Bets). Conversely, the "long tail" (queries that show up infrequently or perhaps only once) is especially rich to exploit in ecommerce settings, as Chris Anderson demonstrates in the blog for his forthcoming book.
What we've not heard much about is the "middle torso," to coin a phrase: queries that fall in the middle area, neither terribly frequent nor freakishly rare. So far we've not encountered any discussion of the torso. Maybe it's just not that meaningful? But intuition suggests that the middle torso has potential and is worth exploring further. Is the middle torso the wild west, where queries crash into each other on their way up or down? Or are these queries fairly stable? Is the torso the off-season home for queries that are only hot at certain times of the year? What trends, if any, do these queries exhibit?
Do middle torso queries differ in interesting ways from frequent and rare queries? We've seen some examples of logs where frequent queries are for known items, while long tail queries are topical; do middle torso queries share some similarly distinguishing characteristic?
We'd love to hear your thoughts, whether you have concrete experience analyzing middle torso queries, or just enjoy speculating (like we do).
Comments
The naming convention for the power curve seems to be not quite firm. I have heard it called the spike, inflection, flats, and tail. I played around with it at one client discussion (they were having a tough time understanding) as a giraffe with a head, long widening neck, body, and tail, which seemed to work.
There is a lot of focus on the head and tail, but the neck and body (torso) do not get much attention, at least publicly. The head and upper part of the neck are really easy to deal with. The lower neck and torso are where many who care about search are working to improve solutions. I don't know of much public research on this area, but it is a constant question. The torso starts relying on small group ontologies, as there is quite often a breadth of terms used for the objects, or the sub-sets of groups sit in this torso. Information architects may be some of those down in the torso, as they are (still) a relatively small group of practitioners with a roughly homogeneous set of terms used by portions of the whole group. Items of interest, when looking at the whole, rarely make it to the neck, but within the IS fields some of the terms and objects pointed to will make it to the neck of the search.
Work on small group ontologies starts to get at this torso area.
The tail is where some of the fun hard problems reside, which may be the attraction and the reason for the attention. Until recently we have lacked the resources to easily examine the tail. The tail often uses collaborative filtering, where individuals' terms and preferences, as they relate to objects, can be matched. The task becomes building matching engines that focus on the object and references to it, and that match on the terms (metadata) that are not popular.
Posted by: vanderwal | June 30, 2006 10:17 AM
The difficulty with these not-very-frequent-but-not-infrequent occurrences is their relevance in the context of the content you are offering. I couldn't generalize, but if you're specifically thinking of search log analysis, this grey area in the curve may provide good insight about seasonal data or content popularity at a particular time, which, depending on the goal of your site, can help you determine points of access to "archived" information (i.e., if your goal is to help users go "deep" into content that's not the newest or most popular). It can help create a content hierarchy as well as navigation (secondary or contextual) to pieces of content that aren't necessarily frequent (in the overall analysis) or "relevant" (through card sorting or whatever method is used to group larger buckets of content).
But what IS this portion? How MUCH is the neck/body?
Really depends where you draw the line for the "head" and the "tail". Ask a giraffe...
Posted by: Livia Labate | June 30, 2006 08:36 PM
Our organization reviews the search metrics on a monthly basis. The long tail of search strings is more entertaining than anything else. We have started to collect the food groups people have searched for: Salmon, Salt, Coka-Cola, etc. Not sure what people expected to find but we get a kick out of it.
Turning the question around to look at the tail of missed search queries: value comes from missed hits. What we are looking at here is the head of the tail of missed hits, which indicates we are missing content somewhere. For example, suppose 15 people search for “Web Service Customer Account Information” and no results are presented; then maybe we need to create the service since ROI is only about 2.
Maybe we have two tails, one for hits and one for misses; both can be revealing.
Posted by: RTodd | July 2, 2006 11:26 AM
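RTodd's "two tails" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log is taken to be a sequence of (query, result_count) pairs, which is a hypothetical format rather than anything described above, and a "miss" is assumed to mean a search that returned zero results.

```python
from collections import Counter


def rank_hits_and_misses(log_entries):
    """Build separate frequency rankings for hits and misses.

    log_entries: iterable of (query, result_count) pairs.
    This log format is an assumption made for illustration.
    Returns (hits, misses), each a list of (query, count) pairs,
    most frequent first.
    """
    hits, misses = Counter(), Counter()
    for query, result_count in log_entries:
        # Light normalization so "Salmon" and "salmon " count as one query.
        key = query.strip().lower()
        if result_count > 0:
            hits[key] += 1
        else:
            misses[key] += 1
    return hits.most_common(), misses.most_common()


# Made-up sample data, for illustration only.
sample_log = [
    ("Salmon", 0),
    ("salmon ", 0),
    ("account info", 12),
    ("salt", 0),
    ("account info", 9),
]
hit_ranking, miss_ranking = rank_hits_and_misses(sample_log)
print(hit_ranking)
print(miss_ranking)
```

The top of the miss ranking is the "head of the tail of missed hits" in RTodd's sense: the zero-result queries that recur often enough to suggest missing content.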
Livia asks
Really depends where you draw the line for the "head" and the "tail".
I've been drawing the line so that the volume "to the left" of the line equals the volume "to the right" of the line. In a Zipf curve, this means that the line will be pretty close to the left -- if you have 1000 queries, but 100 of those queries make up half of the total search volume, that's where you draw the line.
This is wholly unscientific but provides an interesting look at where the balance is.
The more mathematically minded among us may take a look at the graph and identify the spot where the slope of the line is equal to -1, as the curve changes from the head to the tail. (This may in fact be the same spot as the half-and-half, though I'm not sure as I haven't tried this.)
Lou's suggestion of a "middle torso" makes me think you could draw two lines, splitting your graph into thirds, with each part representing an equal amount of search volume. The head would be small, the torso would be a bit larger, and the tail would be the longest, each section representing 1/3 of the total search volume. I haven't tried this yet but this post may inspire me to do so...
Posted by: Jeff Lash | July 7, 2006 08:14 AM
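Jeff's equal-volume split (halves, or thirds for head, torso, and tail) is easy to try. The following is a minimal Python sketch, not a tested method from the discussion: the function name `split_by_volume` and the rule of assigning each query to the segment in which its volume starts are both assumptions made for illustration, and the sample counts are synthetic.

```python
def split_by_volume(query_counts, parts=3):
    """Split ranked queries into `parts` segments of roughly equal volume.

    query_counts: mapping of query string -> number of searches.
    Returns a list of `parts` lists of (query, count) pairs,
    with the most frequent queries in the first segment.
    """
    ranked = sorted(query_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(count for _, count in ranked)
    segments = [[] for _ in range(parts)]
    if total == 0:
        return segments
    running = 0  # search volume accounted for by earlier (more frequent) queries
    for query, count in ranked:
        # A query belongs to the segment in which its volume begins.
        index = min(parts - 1, running * parts // total)
        segments[index].append((query, count))
        running += count
    return segments


# Synthetic Zipf-like counts (the query at rank r gets 1000 // r searches).
zipf_like = {f"q{rank}": 1000 // rank for rank in range(1, 1001)}
head, torso, tail = split_by_volume(zipf_like, parts=3)
print(len(head), len(torso), len(tail))
```

Calling it with `parts=2` gives the half-and-half line Jeff describes. Because the queries are sorted by frequency, each later segment needs more distinct queries to reach the same volume, which matches his expectation of a small head, a somewhat larger torso, and a long tail.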
When we look at weekly searches (both hits and misses), the long tail phenomenon seems to be more prominent in the misses. Intuitively this seems right. But especially for the misses, it would be important to focus on the long tail. We are currently trying to categorize some of the misses into different groups. A single use of a keyword that results in a miss might not look like a problem, but when you look at these categories you may find a high occurrence of a particular category in the misses. This is especially true for sites that sell a large number of different products and combinations.
Posted by: Ash | July 20, 2006 06:22 PM
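Ash's point, that one-off misses can add up once they are grouped, can be illustrated with a small Python sketch. Everything specific here is hypothetical: the category names, the keyword rules, and the sample misses are invented for illustration and are not Ash's actual categories.

```python
from collections import Counter

# Hypothetical rules: category name -> substrings that signal it.
CATEGORY_KEYWORDS = {
    "pricing": ("price", "cost", "discount"),
    "accessories": ("cable", "charger", "case"),
}


def categorize(query, rules=CATEGORY_KEYWORDS):
    """Return the first category whose keywords appear in the query."""
    text = query.lower()
    for category, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def miss_counts_by_category(missed_queries, rules=CATEGORY_KEYWORDS):
    """Count zero-result queries per category."""
    return Counter(categorize(query, rules) for query in missed_queries)


# Made-up misses; each query occurs only once in this sample.
sample_misses = [
    "usb cable 3m",
    "Phone Charger",
    "bulk discount",
    "leather case",
    "salmon",
]
print(miss_counts_by_category(sample_misses).most_common())
```

Simple substring matching is crude (a rule for "cost" would also match "costume"), so in practice the rules would need manual review, but even a rough grouping shows whether a category of misses is larger than any single query suggests.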
Ash, can you share any examples of what types of categories you're coming up with for your long tail misses?
Posted by: Lou Rosenfeld | July 20, 2006 06:40 PM