World Intellectual Property Organization

Datasets for automatic text categorization in the IPC

This is the homepage of the datasets for automatic text categorization in the IPC. These datasets are primarily intended to encourage research and development on artificial intelligence applied to automatic text categorization in the International Patent Classification (IPC).

Each dataset consists of excerpts of patent document information in XML format, including in particular the documents' IPC symbol(s) (see the related specifications for each dataset).

Under the following conditions of use, WIPO freely provides these datasets in exchange for feedback on related research, which could potentially benefit the IPC.

The datasets proposed by WIPO for research on automated categorization are made available under the following conditions:

  1. WIPO is the sole distributor of the data collections.
  2. You, the user, may only grant access to the data to persons working under your supervision and control. You ensure that such persons comply with the conditions of this agreement.
  3. The display, reproduction, transmission, distribution, sale, or publication of the data is prohibited.
  4. Small excerpts of the data may be displayed to others or published in a scientific or technical context, solely for the purpose of describing the research and development or related issues.
  5. These datasets will be used for non-commercial research and development purposes only.
  6. Users of the WIPO datasets will communicate to WIPO the results of experiments performed with the data. WIPO can make internal use of all results communicated to it at its own discretion. WIPO will not disclose these results to third parties without express permission from the author.
  7. Results can be published in academic journals provided that the origin of the data is indicated. The data collections concerned, i.e., the WIPO-en-alpha, WIPO-de-alpha, WIPO-en-gamma, WIPO-fr-gamma, WIPO-en-delta and WIPO-fr-delta datasets, will be referred to in print respectively as:
    • "WIPO-en-alpha dataset, World Intellectual Property Office, Geneva, Switzerland, 2002".   
    • "WIPO-de-alpha dataset, World Intellectual Property Office, Geneva, Switzerland, 2003".
    • "WIPO-en-gamma dataset, World Intellectual Property Office, Geneva, Switzerland, 2015".
    • "WIPO-fr-gamma dataset, World Intellectual Property Office, Geneva, Switzerland, 2015".
    • "WIPO-en-delta dataset, World Intellectual Property Office, Geneva, Switzerland, 2018".   
    • "WIPO-fr-delta dataset, World Intellectual Property Office, Geneva, Switzerland, 2018".   
    Electronic copies of published articles will be sent to the contact indicated below, with a full bibliographic reference.
  8. You, the user, agree to delete the data from any media on which it has been stored, if WIPO requires you to do so for legal or regulatory reasons.

To track interest in these datasets, WIPO asks users to register for access.

WIPO-delta dataset collection

This product contains data sourced from EPO databases, © European Patent Organisation.

WIPO-delta datasets are organized per language. The documentation of WIPO-delta datasets includes:

  • Specifications of WIPO-delta datasets
  • XML Schema
  • WIPO-en-delta dataset (2019): English collection of around 55 million excerpts of patent documents taken from DOCDB XML, week 6 of 2019. View the 2019 taxonomy in English.
  • WIPO-fr-delta dataset (2019): French collection of around 5 million excerpts of patent documents taken from DOCDB XML, week 6 of 2019. View the 2019 taxonomy in French.

 

Legacy datasets (2002-2009)

WIPO-alpha

  • Full details about the content of the WIPO-alpha dataset are found in the WIPO-alpha readme
  • WIPO-en-alpha dataset (2002): English collection of around 75,000 excerpts of patent documents. View the 2000 taxonomy in English
  • WIPO-de-alpha dataset (2003): German collection of around 110,000 excerpts of patent documents. View the taxonomy in German

WIPO-gamma

This product contains data sourced from EPO databases, © European Patent Organisation.

  • Specifications of WIPO-gamma datasets
  • WIPO-en-gamma dataset (2009): English collection of around 1.1 million excerpts of patent documents. View the 2009 taxonomy in English
  • WIPO-fr-gamma dataset (2009): French collection of around 840,000 excerpts of patent documents. View the 2009 taxonomy in French

Contact

Research results, questions and comments are to be sent to the following e-mail address: patrick.fievet@wipo.int
