Artificial Intelligence Patent Dataset

To assist researchers and policymakers focusing on the determinants and impacts of artificial intelligence (AI) invention, OCE released two data files, collectively called the Artificial Intelligence Patent Dataset (AIPD). The first data file identifies United States (U.S.) patents issued between 1976 and 2020 and pre-grant publications (PGPubs) published through 2020 that contain one or more of several AI technology components (including machine learning, natural language processing, computer vision, speech, knowledge processing, AI hardware, evolutionary computation, and planning and control). OCE generated this data file using a machine learning (ML) approach that analyzed patent text and citations to identify AI in U.S. patent documents (Abood and Feltenberger 2018; Toole et al. 2020). OCE’s approach is based on the methodology of Abood and Feltenberger (2018), but also includes an analysis of patent claims to better identify AI contained in the technical and legal scope of the invention. The second data file contains the patent documents used to train the ML models.

A working paper describing these data is available at SSRN and as a published version in the Journal of Technology Transfer. Users are requested to cite this documentation when using these data: Giczy, A.V., Pairolero, N.A. & Toole, A.A. Identifying artificial intelligence (AI) invention: a novel AI patent dataset. J Technol Transf (2021). https://doi.org/10.1007/s10961-021-09900-2

This effort was made possible through cross-business-unit collaboration among OCE, the Office of Policy and International Affairs, the Patents Business Unit, and the Office of the Chief Information Officer. The AIPD was used in the USPTO report “Inventing AI: Tracing the diffusion of artificial intelligence with U.S. patents.”

For questions, please email EconomicsData@uspto.gov.

Release notes: The AIPD was updated on August 2, 2021, to fix a minor issue affecting the 2019 and 2020 “vision” and “any_ai” predictions.

Data files

Download full set of 2020 data files [.dta format (512 MB)] [.tsv format (1.03 GB)]

Download individual data files:

File Name: 2020* files
ai_model_predictions: [.dta format (496 MB)] [.tsv format (1.02 GB)]
ai_model_training_doc_seedgroups: [.dta format (16.2 MB)] [.tsv format (14.3 MB)]

* Note: the 2020 .dta files are saved in the Stata-14 format.
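
Because the .tsv version of the predictions file is about 1 GB, it may be more practical to stream it row by row than to load it all at once. The following is a minimal Python sketch of that idea, using only the standard library. It rests on assumptions that are not confirmed by this page: the column name "any_ai" is borrowed from the release notes as a placeholder (the actual column headers may differ), the flag is assumed to be encoded as the string "1", and the helper count_flagged is a hypothetical name introduced for illustration. The sample data is invented so that the block runs without the download.

```python
import csv
import io


def count_flagged(tsv_file, flag_column, flag_value="1"):
    """Stream a tab-separated file and count rows where flag_column == flag_value.

    tsv_file is any open text file object; rows are read one at a time,
    so memory use stays small even for a file of about 1 GB.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # fieldnames is None for an empty file; otherwise it is the header row.
    if reader.fieldnames is None or flag_column not in reader.fieldnames:
        raise KeyError(f"column {flag_column!r} not found in header")
    flagged = 0
    total = 0
    for row in reader:
        total += 1
        if row[flag_column] == flag_value:
            flagged += 1
    return flagged, total


# Tiny in-memory stand-in for the real file. The column names and values
# here are assumptions for illustration, not the dataset's documented schema.
sample = io.StringIO("doc_id\tany_ai\nA\t1\nB\t0\nC\t1\n")
flagged, total = count_flagged(sample, "any_ai")
print(f"{flagged} of {total} sample rows flagged")
```

To try this on the downloaded data, open the file with open(path, newline="", encoding="utf-8") and pass the resulting file object in place of the sample, after checking the header row for the actual column names. The path and the encoding are likewise assumptions to verify against the download.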