| US 7,590,626 B2 | ||
| Distributional similarity-based models for query correction | ||
| Mu Li, Beijing (China); and Ming Zhou, Beijing (China) | ||
| Assigned to Microsoft Corporation, Redmond, Wash. (US) | ||
| Filed on Oct. 30, 2006, as Appl. No. 11/589,557. | ||
| Prior Publication US 2008/0104056 A1, May 01, 2008 | ||
| Int. Cl. G06F 17/30 (2006.01) | ||
| U.S. Cl. 707—3 [707/4; 707/5; 704/9; 704/238; 704/258; 704/275; 705/10; 706/52; 717/205; 717/257] | 9 Claims |

| 1. A method comprising:
receiving an input search query;
identifying a set of candidate word sequences;
a processor determining a distributional similarity between a word of the input search query and a term of one of the candidate word sequences using a query log of logged search queries by:
  identifying a set of all co-occurrence words that co-occur with the word of the input search query in at least one logged search query in the query log and that also co-occur with the term of one of the candidate word sequences in at least one logged search query in the query log;
  for each co-occurrence word in the set of identified co-occurrence words:
    determining the number of logged search queries in which the co-occurrence word and the word of the input search query appeared together in the query log to form a first co-occurrence frequency; and
    determining the number of logged search queries in which the co-occurrence word and the term of the candidate word sequence appeared together in the query log to form a second co-occurrence frequency;
  using the first and second co-occurrence frequencies for each co-occurrence word in the set of all co-occurrence words to determine the distributional similarity using a metric from a set of metrics consisting of a confusion probability metric and a cosine metric, wherein the confusion probability metric is calculated by taking a sum over all co-occurrence words in the set of all co-occurrence words where each summand in the sum is computed based at least on a product of a probability of the word of the input search query given the co-occurrence word and a probability of the term of the candidate word sequence given the co-occurrence word and wherein the cosine metric is calculated by determining the cosine of an angle between a first vector for the word of the input search query and a second vector for the term of the candidate word sequence;
using the distributional similarity to determine an error model probability that describes the probability of the input search query given the candidate word sequence associated with the distributional similarity;
using the error model probability to determine a probability of the candidate word sequence associated with the distributional similarity given the input search query; and
using the probability of the candidate word sequence associated with the distributional similarity given the input search query to select a candidate word sequence as a corrected word sequence for the search query.
|
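The distributional-similarity step recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. All function names, the whitespace tokenization, and the toy query log are hypothetical. Two simplifications are assumptions of this sketch: each conditional probability is estimated as a co-occurrence frequency divided by the total co-occurrence count of the context word, and each summand of the confusion probability is just the product of the two conditional probabilities, whereas the claim only requires that the summand be "based at least on" that product. The cosine vectors are built over the shared co-occurrence words only.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_counts(query_log):
    """For each word, count the logged queries in which it appears with each other word."""
    counts = defaultdict(Counter)
    for query in query_log:
        words = set(query.split())  # assumption: whitespace tokens, one count per query
        for a, b in permutations(words, 2):
            counts[a][b] += 1
    return counts


def shared_cooccurrence_words(counts, query_word, candidate_term):
    """Words that co-occur with both the query word and the candidate term."""
    return set(counts[query_word]) & set(counts[candidate_term])


def confusion_probability(counts, query_word, candidate_term):
    """Sum over shared words c of P(query_word | c) * P(candidate_term | c).

    Assumption: P(w | c) is estimated as f(w, c) / sum over w' of f(w', c).
    """
    total = 0.0
    for c in shared_cooccurrence_words(counts, query_word, candidate_term):
        context_total = sum(counts[c].values())
        p_query = counts[c][query_word] / context_total
        p_candidate = counts[c][candidate_term] / context_total
        total += p_query * p_candidate
    return total


def cosine_similarity(counts, query_word, candidate_term):
    """Cosine of the angle between the two co-occurrence frequency vectors."""
    shared = shared_cooccurrence_words(counts, query_word, candidate_term)
    if not shared:
        return 0.0
    first = [counts[query_word][c] for c in shared]       # first co-occurrence frequencies
    second = [counts[candidate_term][c] for c in shared]  # second co-occurrence frequencies
    dot = sum(x * y for x, y in zip(first, second))
    norm = math.sqrt(sum(x * x for x in first)) * math.sqrt(sum(y * y for y in second))
    return dot / norm


# Hypothetical toy query log, for illustration only.
toy_log = [
    "britney spears lyrics",
    "britny spears lyrics",
    "britney spears photos",
    "britny spears",
]
toy_counts = cooccurrence_counts(toy_log)
toy_confusion = confusion_probability(toy_counts, "britny", "britney")
toy_cosine = cosine_similarity(toy_counts, "britny", "britney")
```

In this toy log the misspelling and its candidate correction share the context words "spears" and "lyrics", so both metrics come out non-zero. The remaining steps of the claim (mapping the similarity to an error model probability and ranking candidates by the probability of the candidate given the query) are not sketched here, because the claim does not specify that mapping.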