US 7,516,194 B1
Method for downloading high-volumes of content from the internet without adversely effecting the source of the content or being detected
Nick Lamkins, Portland, Oreg. (US); Rick O'Brien, Eugene, Oreg. (US); and James Dirksen, Portland, Oreg. (US)
Assigned to Microsoft Corporation, Redmond, Wash. (US)
Filed on May 21, 2003, as Appl. No. 10/443,110.
Claims priority of provisional application 60/382779, filed on May 21, 2002.
Int. Cl. G06F 15/16 (2006.01)
U.S. Cl. 709/218 [709/217; 709/219]
22 Claims
 
1. A system for downloading a plurality of documents from a plurality of content servers, said content servers being linked to a plurality of routers that each have a different network address, said system comprising:
a plurality of pullers;
a director for:
creating a list of URLs of the plurality of documents to be downloaded from the plurality of content servers, each of the plurality of said documents being identified by a different URL; and
assigning a portion of the list of URLs to each of the pullers such that each portion assigned to a particular puller includes all documents to be retrieved from a single content server, wherein no two pullers initiate requests to adjacent URLs, and wherein adjacent URLs identify documents located on the same content server;
wherein each of the plurality of pullers is responsive to the director for:
receiving the assigned portion of the list of URLs;
queuing requests to retrieve documents identified by the received portion of the list of URLs wherein the requests having different URLs are queued by the puller;
determining if the URL of a first queued request is adjacent to the URL of a document being currently downloaded;
if the URL of the first queued request is adjacent to the URL of a document being currently downloaded, waiting until the currently downloading document has been received before initiating the first queued request to avoid overlapping requests to the content server;
if the URL of the queued request is not adjacent to the URL of a document being currently downloaded, initiating the first queued request; and
a proxy gateway responsive to each of the pullers for receiving the initiated requests to retrieve documents, and for retrieving documents corresponding to the list of URLs from the content servers via the routers.
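
The director and puller steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that "same content server" can be approximated by comparing URL hostnames, and that whole host groups are dealt to pullers round-robin. The names `host_of`, `adjacent`, `assign_portions`, and `Puller` are hypothetical, and the proxy gateway and routers are not modeled.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def host_of(url):
    """Return the hostname of a URL (assumed stand-in for 'content server')."""
    return urlsplit(url).hostname or ""


def adjacent(url_a, url_b):
    """URLs are 'adjacent' when they identify documents on the same server."""
    return host_of(url_a) == host_of(url_b)


def assign_portions(urls, n_pullers):
    """Director step: give every URL of a given host to exactly one puller.

    Because a host's URLs never span two pullers, no two pullers can
    initiate requests to adjacent URLs.
    """
    by_host = defaultdict(list)
    for url in urls:
        by_host[host_of(url)].append(url)
    portions = [[] for _ in range(n_pullers)]
    # Assumption: round-robin over hosts in sorted order; any policy that
    # keeps a host's URLs together would satisfy the same constraint.
    for i, host in enumerate(sorted(by_host)):
        portions[i % n_pullers].extend(by_host[host])
    return portions


class Puller:
    """Puller step: queue requests and avoid overlapping same-server requests."""

    def __init__(self, portion):
        self.queue = deque(portion)   # requests waiting to be initiated
        self.in_flight = set()        # URLs currently being downloaded

    def next_request(self):
        """Return the first queued URL if it may start now, else None (wait)."""
        if not self.queue:
            return None
        first = self.queue[0]
        if any(adjacent(first, active) for active in self.in_flight):
            return None               # adjacent download still running: wait
        self.in_flight.add(self.queue.popleft())
        return first

    def finished(self, url):
        """Record that a download completed, unblocking adjacent requests."""
        self.in_flight.discard(url)
```

In this sketch a caller would hand each initiated URL to the proxy gateway and call `finished` once the document arrives. A request to a different host is allowed to start while another download is still in progress, whereas a second request to the same host waits.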