Animefaniac: Search evaluation at Google

From the Official Google Blog

9/15/2008 03:04:00 PM

This series of posts has described Google's search quality efforts in areas such as ranking and search UI. Now I'll describe search evaluation. Simply put, search evaluation is the process of measuring the quality of our search results and our users' experience with search.

Let me introduce myself. I'm Scott Huffman, an engineering director responsible for leading search evaluation, working with a talented team of statisticians and software engineers. I've been here since 2005, and have been working on search in one form or another for the past fourteen years or so.

When I'm interviewing folks interested in joining the search evaluation team, I often use this scenario to describe what we do: Imagine a Google ranking engineer bursts into your office. "I have a great idea for improving our search results!" she exclaims. "It's simple: Whenever a page's title starts with the letter T, move it up in the results three slots." This engineer comes armed with several example search queries where, lo and behold, this idea actually improves the results significantly.

Now, you and I may think that this "letter T" hack is really a silly idea, but how can we know for sure? Search evaluation is charged with answering such questions. This hack hasn't really come up, but we are constantly evaluating everything, which can include:

- proposed improvements to segmentation of Chinese queries

- new approaches to fight spam

- techniques for improving how we handle compound Swedish words

- changes to how we handle links and anchortext

- and everything in between

As Udi mentioned in his initial post on search quality, in 2007 we launched over 450 improvements to Google search, and every one of them went through a comprehensive evaluation process.

Not surprisingly, we take search evaluation very seriously. Precise evaluation enables our teams to know "which way is up". One of our tenets in search quality is to be very data-driven in our decision-making. We try hard not to rely on anecdotal examples, which are often misleading in search (where decisions can affect hundreds of millions of queries a day). Meticulous, statistically meaningful evaluation gives us the data we need to make real search improvements.

Evaluating search is difficult for several reasons.

First, understanding what a user really wants when they type a query -- the query's "intent" -- can be very difficult. For highly navigational queries like [ebay] or [orbitz], we can guess that most users want to navigate to the respective sites. But how about [olympics]? Does the user want news, medal counts from the recent Beijing games, the IOC's homepage, historical information about the games, ... ? This same exact question, of course, is faced by our ranking and search UI teams. Evaluation is the other side of that coin.

Second, comparing the quality of search engines (whether Google versus our competitors, Google versus Google a month ago, or Google versus Google plus the "letter T" hack) is never black and white. It's essentially impossible to make a change that is 100% positive in all situations; with any algorithmic change you make to search, many searches will get better and some will get worse.

Third, there are several dimensions to "good" results. Traditional search evaluation has focused on the relevance of the results, and of course that is our highest priority as well. But today's search-engine users expect more than just relevance. Are the results fresh and timely? Are they from authoritative sources? Are they comprehensive? Are they free of spam? Are their titles and snippets descriptive enough? Do they include additional UI elements a user might find helpful for the query (maps, images, query suggestions, etc.)? Our evaluations attempt to cover each of these dimensions where appropriate.

Fourth, evaluating Google search quality requires covering an enormous breadth. We cover over a hundred locales (country/language pairs) with in-depth evaluation. Beyond locales, we support search quality teams working on many different kinds of queries and features. For example, we explicitly measure the quality of Google's spelling suggestions, universal search results, image and video searches, related query suggestions, stock oneboxes, and many, many more.

To get at these issues, we employ a variety of evaluation methods and data sources:

- Human evaluators. Google makes use of evaluators in many countries and languages. These evaluators are carefully trained and are asked to evaluate the quality of search results in several different ways. We sometimes show evaluators whole result sets by themselves or "side by side" with alternatives; in other cases, we show evaluators a single result at a time for a query and ask them to rate its quality along various dimensions.

- Live traffic experiments. We also make use of experiments, in which small fractions of queries are shown results from alternative search approaches. Ben Gomes talked about how we make use of these experiments for testing search UI elements in his previous post. With these experiments, we are able to see real users' reactions (clicks, etc.) to alternative results.
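
To make the "side by side" idea concrete, here is a minimal Python sketch of how such ratings could be tallied. It is an illustration only, not a description of Google's tooling: the rating labels (`"experiment"`, `"baseline"`, `"tie"`), the function names, and the choice of an exact sign test are all assumptions made for the example.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test on win/loss counts (ties already excluded)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Chance of a split at least this lopsided if either side were equally
    # likely to be preferred on any given query.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize_side_by_side(ratings):
    """ratings: one label per rated query, naming the preferred side."""
    ratings = list(ratings)
    wins = ratings.count("experiment")
    losses = ratings.count("baseline")
    ties = ratings.count("tie")
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "p_value": sign_test_p_value(wins, losses),
    }
```

A tally like this keeps both the wins and the losses visible, which matters because almost any change makes some searches better and some worse.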

Clearly, we can never measure anything close to all the queries Google will get in the future. Every day, in fact, Google gets many millions of queries that we have never seen before, and will never see again. Therefore, we measure statistically, over representative samples of the query-stream. The "letter T" hack probably does improve a few queries, but over a representative sample of queries it affects, I'm confident it would be a big loser.
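
As a rough sketch of what measuring over a representative sample can look like, the Python below draws query occurrences uniformly from a log (so frequent queries show up in proportion to their traffic) and reports a mean quality change with an approximate 95% confidence interval. The log format, the per-query "delta" scores, and the normal approximation are assumptions for illustration, not details from this post.

```python
import random
from math import sqrt
from statistics import mean, stdev


def sample_query_stream(query_log, k, seed=0):
    """Uniformly sample k query occurrences (without replacement) from a log.

    Sampling occurrences rather than distinct queries keeps the sample
    proportional to real traffic.
    """
    rng = random.Random(seed)
    return rng.sample(query_log, k)


def mean_with_ci(deltas, z=1.96):
    """Mean per-query change plus a normal-approximation confidence interval.

    deltas: hypothetical per-query scores (experiment minus baseline);
    needs at least two values.
    """
    m = mean(deltas)
    half_width = z * stdev(deltas) / sqrt(len(deltas))
    return m, m - half_width, m + half_width
```

If the interval comfortably straddles zero, a handful of impressive examples (like the ones the "letter T" engineer brought along) is not evidence of an improvement.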

One of the key skills of our evaluation team is experimental design. For each proposed search improvement, we generate an experiment plan that will allow us to measure the key aspects of the change. Often, we use a combination of human and live traffic evaluation. For instance, consider a proposed improvement to Google's "related searches" feature to increase its coverage across several locales. Our experiment plan might include live traffic evaluation in which we show the updated related search suggestions to users and measure click-through rates in each locale and break these down by position of each related search suggestion. We might also include human evaluation, in which for a representative sample of queries in each locale, we ask evaluators to rate the appropriateness, usefulness, and relevance of each individual related search suggestion. Including both types of evaluation allows us to understand the overall behavioral impact on users (via the live traffic experiment), and measure the detailed quality of the suggestions in each locale along multiple dimensions (via the human evaluation experiment).
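
The live traffic half of that plan boils down to a grouped click-through rate. The Python below is a minimal sketch under assumed inputs: each event is a hypothetical `(locale, position, clicked)` tuple recording one impression of a related search suggestion. The real logging format is not described here.

```python
from collections import defaultdict


def ctr_by_locale_and_position(events):
    """Click-through rate for each (locale, position) pair.

    events: iterable of (locale, position, clicked) tuples, one per
    suggestion impression.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for locale, position, was_clicked in events:
        key = (locale, position)
        shown[key] += 1
        clicked[key] += int(was_clicked)
    return {key: clicked[key] / shown[key] for key in shown}
```

Running this separately on the control and experiment arms, then comparing matching `(locale, position)` cells, gives the per-locale, per-position breakdown described above.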

Choosing an appropriate sample of queries to evaluate can be subtle. When evaluating a proposed search improvement, we consider not only whether a given query's results are changed by the proposal, but also how much impact the change is likely to have on users. For instance, a query whose first three results are changed is likely much higher impact than one for which results 9 and 10 are swapped. In Amit Singhal's previous post on ranking, he discussed synonyms. Recently, we evaluated a proposed update to make synonyms more aggressive in some cases. On a flat (non-impact-weighted) sample of affected queries, the change appeared to be quite positive. However, using an evaluation of an impact-weighted sample, we found that the change went much too far. For example, in Chinese, it synonymized "small" (小) and "big" (大)... not a good idea!
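
One way to picture the difference between a flat and an impact-weighted sample is the Python sketch below. The weighting formula is an assumption chosen for illustration (a position discount of 1/log2(position + 1), borrowed from the DCG metric), not the weighting we actually use; the function names and the `changes` structure are likewise hypothetical.

```python
import math
import random


def impact_weight(before, after, depth=10):
    """Score how visible a change is: differences near the top count more."""
    weight = 0.0
    for pos in range(depth):
        old = before[pos] if pos < len(before) else None
        new = after[pos] if pos < len(after) else None
        if old != new:
            # Position 1 contributes 1.0; position 10 contributes about 0.29.
            weight += 1.0 / math.log2(pos + 2)
    return weight


def impact_weighted_sample(changes, k, seed=0):
    """Sample k queries with replacement, proportionally to impact weight.

    changes: dict mapping query -> (results_before, results_after).
    """
    rng = random.Random(seed)
    weighted = [(q, impact_weight(b, a)) for q, (b, a) in changes.items()]
    weighted = [(q, w) for q, w in weighted if w > 0]
    if not weighted:
        return []
    queries, weights = zip(*weighted)
    return rng.choices(queries, weights=weights, k=k)
```

Under this scheme a query whose first three results change gets several times the weight of one where only results 9 and 10 are swapped, so it is correspondingly more likely to be rated.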

We're serious about search evaluation because we are serious about giving you the highest quality search experience possible. Rather than guess at what will be useful, we use a careful data-driven approach to make sure our "great ideas" really are great for you. In this environment, the "letter T" hack never had a chance.

Posted by Scott Huffman, Engineering Director

Tuesday, September 16, 2008
