Sample chapter: search analytics and metadata
We're publishing a fairly complete version of our chapter on how you can use internal search analytics to develop and improve metadata (318 KB PDF file). We'd naturally love to hear what you think, especially if we're missing anything obvious, or if you can come up with better examples for us to use to illustrate some of these concepts. (Examples are always the hardest part!) Many thanks.
Comments
I have a lot of comments on this chapter, so I'll start at the beginning.
First, lose the squirrels. There is some big metaphor about hoarding that is not used in the rest of the chapter. If it isn't central to the chapter, don't start with it. In fact, don't use it.
The second page is a sales pitch. Lose that, too. Anybody can say "it costs millions". We used to say that businesses lost "31.5 million dollars" due to poor search every year. Or maybe it was "billion". I don't remember, and nobody believes those numbers.
The chapter really starts at the paragraph where you say, "But here's the point". Big hint, that opening. Everything before that is beside the point.
Instead of the sales pitch, here are two concrete things you can use which will help your readers size and cost per-page metadata (which is insanely expensive).
1. I was consulting with a major telecom company, and the CEO had decided that every page in their intranet should have metadata. I made a quick mental calculation of the effort required. They had around four million pages. Assuming six minutes per page (too short, but handy for calculations), that would take 400,000 hours, or 10,000 person-weeks. It would mean 100 people working for two years, easily a twenty-million-dollar job.
2. I've only found two published accounts of cataloging effort and cost. The most detailed is from a public library which was given a collection of 6000 jazz records. I'll find the reference for you, but they published a nice table of the level of information and the time and cost. The other published information was in The Whole Library Handbook, mentioning that full LoC cataloging for a book takes about an hour. Or maybe they said an hour and a half.
For estimation, I'd start at $5-20 per page for the basic stuff: title, date, and author.
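The telecom arithmetic above can be sketched in a few lines so readers can plug in their own numbers. The page count and the six minutes per page come from the comment; the 40-hour week, the 50 working weeks per year, and the $50 per hour loaded rate are assumptions chosen only because they reproduce the quoted totals, not figures from the original engagement.

```python
# Back-of-the-envelope sizing for per-page metadata.
# Assumptions (not from the source): 40-hour weeks, 50 working weeks
# per year, and a $50/hour loaded labor rate.

def metadata_effort(pages, minutes_per_page, hourly_rate=50.0,
                    hours_per_week=40, staff=100, weeks_per_year=50):
    """Return rough effort and cost figures for tagging every page."""
    hours = pages * minutes_per_page / 60
    person_weeks = hours / hours_per_week
    return {
        "hours": hours,
        "person_weeks": person_weeks,
        "years_with_staff": person_weeks / staff / weeks_per_year,
        "total_cost": hours * hourly_rate,
        "cost_per_page": hours * hourly_rate / pages,
    }

estimate = metadata_effort(pages=4_000_000, minutes_per_page=6)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the per-page cost lands at the bottom of the $5-20 range, which fits the remark that six minutes per page is too short.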
At the top of page 7, you suggest tagging pages with alternate synonyms. I wouldn't suggest per-page metadata for that. Too danged expensive compared to synonyms in the search engine.
More to come...
Posted by: Walter Underwood | October 2, 2007 12:41 AM
Great comments; keep 'em coming, Walter!
Posted by: Lou Rosenfeld | October 2, 2007 07:29 AM
The costing on the jazz collection is in the article "How Much Will It Cost?" in this issue of the Ohio Library Council's TechKnow newsletter:
http://www.olc.org/pdf/Techknow5.03.pdf
It makes more sense if you know enough MARC to interpret stuff like "As above with 246 Varying Form of Title, 518 recording session info." I'd reproduce that chart in the book, translated to English, of course.
I found The Whole Library Handbook in the 025 stacks of my local library. I like to think that I made a reference librarian happy by asking where to find info on cataloging.
Posted by: Walter Underwood | October 2, 2007 12:07 PM
When I saw the chapter title, I expected to hear "titles are the most important metadata" and "use your search logs to find vocabulary your searchers use". That is my fundamental advice for metadata, but I'm not sure it is in the chapter.
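As a rough illustration of the search-log advice, here is a minimal sketch that counts the terms searchers actually type. The sample queries and the function name are hypothetical, and a real log would need its own parsing; this only shows the counting step.

```python
import re
from collections import Counter

def query_term_counts(queries):
    """Count how often each lowercase word appears across logged queries."""
    counts = Counter()
    for query in queries:
        # Split on anything that is not a letter or digit.
        counts.update(re.findall(r"[a-z0-9]+", query.lower()))
    return counts

# Hypothetical log excerpt, one query per entry.
sample_log = [
    "vacation policy",
    "Vacation request form",
    "holiday schedule",
    "vacation days",
]

counts = query_term_counts(sample_log)
top_terms = counts.most_common(3)
print(top_terms)
```

The most frequent terms are candidates for titles and other metadata vocabulary, since they are the words searchers already use.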
I was surprised that so much of the chapter was about synonyms. I don't think of those as metadata. They are really linguistic categories, maybe corpus-specific, but not document-specific. In search engines, synonym implementations are independent of metadata management or categories. It is common to have synonyms without the other features, and the implementation of synonyms is way up in query processing or way down in indexing, independent of specific documents.
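To make the distinction concrete, here is a minimal sketch of synonym handling at query time, assuming a small hand-built synonym list. The groups and function names are hypothetical and do not reflect any particular search engine's API. The point is that the synonym table is keyed by term and never refers to an individual document.

```python
# Hypothetical synonym groups; every term in a group matches the others.
SYNONYM_GROUPS = [
    {"vacation", "holiday", "pto"},
    {"laptop", "notebook"},
]

def build_synonym_table(groups):
    """Map each term to the full set of terms in its group."""
    table = {}
    for group in groups:
        for term in group:
            table[term] = set(group)
    return table

def expand_query(query, table):
    """Return, for each query term, the sorted list of terms to match."""
    return [sorted(table.get(term, {term})) for term in query.lower().split()]

table = build_synonym_table(SYNONYM_GROUPS)
expanded = expand_query("Vacation policy", table)
print(expanded)  # [['holiday', 'pto', 'vacation'], ['policy']]
```

Adding a synonym here is one change to one table, which is why it costs so much less than tagging alternate terms on every page.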
I think that treating synonyms with metadata is confusing, but if you want to, I'd organize it this way.
That's about it.
Wunder
Posted by: Walter Underwood | November 6, 2007 11:34 PM