US 7,590,626 B2
Distributional similarity-based models for query correction
Mu Li, Beijing (China); and Ming Zhou, Beijing (China)
Assigned to Microsoft Corporation, Redmond, Wash. (US)
Filed on Oct. 30, 2006, as Appl. No. 11/589,557.
Prior Publication US 2008/0104056 A1, May 01, 2008
Int. Cl. G06F 17/30 (2006.01)
U.S. Cl. 707/3  [707/4; 707/5; 704/9; 704/238; 704/258; 704/275; 705/10; 706/52; 717/205; 717/257] 9 Claims
 
1. A method comprising:
receiving an input search query;
identifying a set of candidate word sequences;
a processor determining a distributional similarity between a word of the input search query and a term of one of the candidate word sequences using a query log of logged search queries by:
identifying a set of all co-occurrence words that co-occur with the word of the input search query in at least one logged search query in the query log and that also co-occur with the term of one of the candidate word sequences in at least one logged search query in the query log;
for each co-occurrence word in the set of identified co-occurrence words:
determining the number of logged search queries in which the co-occurrence word and the word of the input search query appeared together in the query log to form a first co-occurrence frequency; and
determining the number of logged search queries in which the co-occurrence word and the term of the candidate word sequence appeared together in the query log to form a second co-occurrence frequency;
using the first and second co-occurrence frequencies for each co-occurrence word in the set of all co-occurrence words to determine the distributional similarity using a metric from a set of metrics consisting of a confusion probability metric and a cosine metric,
wherein the confusion probability metric is calculated by taking a sum over all co-occurrence words in the set of all co-occurrence words where each summand in the sum is computed based at least on a product of a probability of the word of the input search query given the co-occurrence word and a probability of the term of the candidate word sequence given the co-occurrence word, and
wherein the cosine metric is calculated by determining the cosine of an angle between a first vector for the word of the input search query and a second vector for the term of the candidate word sequence;
using the distributional similarity to determine an error model probability that describes the probability of the input search query given the candidate word sequence associated with the distributional similarity;
using the error model probability to determine a probability of the candidate word sequence associated with the distributional similarity given the input search query; and
using the probability of the candidate word sequence associated with the distributional similarity given the input search query to select a candidate word sequence as a corrected word sequence for the search query.
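
The co-occurrence counting and the two similarity metrics recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the whitespace tokenization, the toy query log, and the estimate of the probability of a word given a co-occurrence word (joint query count divided by the number of logged queries containing the co-occurrence word) are all assumptions made for illustration. The claim only requires each summand of the confusion probability to be based at least on the product of the two conditional probabilities, so the sketch uses the bare product with no further weighting. The mapping from similarity to an error model probability is not specified in the claim and is left out.

```python
from collections import Counter
from math import sqrt


def cooccurrence_counts(query_log, target):
    """For each word, count the logged queries that contain both it and `target`."""
    counts = Counter()
    for query in query_log:
        words = set(query.split())  # assumption: whitespace tokenization
        if target in words:
            counts.update(words - {target})
    return counts


def query_frequency(query_log):
    """Count, for each word, the number of logged queries that contain it."""
    freq = Counter()
    for query in query_log:
        freq.update(set(query.split()))
    return freq


def cosine_similarity(first, second):
    """Cosine of the angle between the two frequency vectors,
    restricted to the shared co-occurrence words."""
    shared = first.keys() & second.keys()
    dot = sum(first[c] * second[c] for c in shared)
    norm = sqrt(sum(first[c] ** 2 for c in shared)) * sqrt(
        sum(second[c] ** 2 for c in shared)
    )
    return dot / norm if norm else 0.0


def confusion_probability(first, second, word_freq):
    """Sum over shared co-occurrence words c of P(word | c) * P(term | c).
    Assumption: P(x | c) is estimated as count(x, c) / count(c)."""
    shared = first.keys() & second.keys()
    return sum(
        (first[c] / word_freq[c]) * (second[c] / word_freq[c]) for c in shared
    )


# Hypothetical toy query log; "acrobatt" is the query word, "acrobat" the candidate term.
query_log = [
    "adobe acrobat reader",
    "adobe acrobat download",
    "adobe acrobat reader",
    "adobe acrobatt reader",
    "acrobatt download",
    "free reader",
]

first = cooccurrence_counts(query_log, "acrobatt")   # first co-occurrence frequencies
second = cooccurrence_counts(query_log, "acrobat")   # second co-occurrence frequencies
word_freq = query_frequency(query_log)

cosine = cosine_similarity(first, second)
confusion = confusion_probability(first, second, word_freq)
print(f"cosine={cosine:.4f} confusion={confusion:.4f}")
```

Restricting both vectors to the shared co-occurrence words follows the claim's wording, which builds the set from words that co-occur with both the query word and the candidate term. A system could instead build vectors over all co-occurrence words; that would be a different design choice from the one sketched here.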