Training sets for automated text categorization in the IPC
January 2009

This is the homepage of the WIPO automated categorization datasets. We provide information about the datasets' contents, explain how to get access, and link to relevant background sources.

We are currently freely distributing two collections of XML documents that have been manually classified in a complex hierarchical taxonomy known as the International Patent Classification (IPC):

  1. WIPO-alpha: containing over 75,000 patent documents in English.

  2. WIPO-de: containing over 110,000 patent documents in German.
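To illustrate what "hierarchical" means here, the following minimal sketch splits an IPC symbol into its section, class, subclass and main group levels. It assumes the symbol is written in the common printed form, such as "A01B 1/02"; the XML records in the collections may encode symbols differently, and the names `IpcLevels` and `split_ipc_symbol` are hypothetical, chosen only for this example.

```python
import re
from typing import NamedTuple


class IpcLevels(NamedTuple):
    """The four upper levels of the IPC hierarchy for one symbol."""
    section: str     # one letter, A to H
    ipc_class: str   # section plus two digits
    subclass: str    # class plus one letter
    main_group: str  # subclass plus group number, written with "/00"


# Assumed input form: "A01B 1/02" (the space before the group is optional).
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def split_ipc_symbol(symbol: str) -> IpcLevels:
    """Return the ancestors of an IPC symbol, from section to main group."""
    match = _IPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised IPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, group, _subgroup = match.groups()
    ipc_class = section + class_digits
    subclass = ipc_class + subclass_letter
    # A main group is denoted by the group number followed by "/00".
    main_group = f"{subclass} {group}/00"
    return IpcLevels(section, ipc_class, subclass, main_group)


if __name__ == "__main__":
    print(split_ipc_symbol("A01B 1/02"))
```

A categorizer can be trained and evaluated at any of these levels; predicting the subclass, for example, is a coarser task than predicting the main group.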

These data collections are made available to the community for research purposes. In this way, we specifically aim to encourage research into the automated categorization of patent documents. Users of the collections are requested to communicate their results to WIPO.

Full details about the content of the WIPO-alpha dataset are given in the WIPO-alpha readme.

If you have any questions or comments, feel free to get in touch with us by email: patrick.fievet@wipo.int.