Readme Information for WIPO-alpha Autocategorization Training Set

June 2009

World Intellectual Property Organization
34, Chemin des Colombettes
CH-1211 Genève 20
Switzerland

claims@wipo.int

This document provides information about the WIPO-alpha and WIPO-de collections of patent documents made publicly available for research into automated categorization. It gives links to relevant online resources and short descriptions of the taxonomy, the documents in the collections, and suggested categorization tasks.

 I.        Abbreviations

ASCII     American Standard Code for Information Interchange
CD-ROM    Compact Disk - Read Only Memory
CLAIMS    Classification Automated Information System
DTD       Document Type Definition
DVD       Digital Versatile Disk
FTP       File Transfer Protocol
IPC       International Patent Classification
ISO       International Organization for Standardization
OCR       Optical Character Recognition
PCT       Patent Cooperation Treaty
WIPO      World Intellectual Property Organization
XML       Extensible Markup Language

 

1                        Introduction



1.1                 Background

The World Intellectual Property Organization (WIPO) is an international organization dedicated to promoting the use and protection of works of the human spirit. With headquarters in Geneva, Switzerland, WIPO is one of the 16 specialized agencies of the United Nations system of organizations.

The International Patent Classification, which is commonly referred to as the IPC, is a standard taxonomy developed and administered by WIPO for classifying patents and patent applications. The IPC covers all areas of technology and is currently used by the industrial property offices of more than 90 countries.

Patent classification is indispensable for the retrieval of patent documents in the search for prior art. Such retrieval is crucial to patent-issuing authorities, potential inventors, research and development units, and others concerned with the application or development of technology.

In 2006, the International Patent Classification system was divided into a stable core level of classification and an advanced, dynamic set of categories that is updated more frequently.

The CLAIMS project at WIPO provided an automated classification assistance system, IPCCAT (available at http://www.wipo.int/ipccat/ ), for categorizing patent applications, initially in IPC 7 and now in IPC 2009.01. The system facilitates the attribution of IPC codes to patent applications, particularly in small patent offices, and promotes the use of the IPC in member States. It should support the classification of patent documents in various languages.

In order to promote the use of the IPC in research into automated categorization, both in the academic community and for commercial partners, WIPO is making available an extensive multilingual collection of patent documents.

 The first collection, published in late 2002 and known as WIPO-alpha, consists of English-language and German-language documents.

WIPO also provides additional information about a second collection of German-language patent documents, known as WIPO-de.

A new dataset

 

1.2                 Motivation for automated categorization

The use of patent documents and the IPC for research into automated categorization is interesting for the following reasons:

1.      The IPC covers a huge range of topics and uses a diverse technical and scientific vocabulary. A large proportion of the documents are concerned with chemistry, mechanics, and electronics.

2.      The IPC is a complex, hierarchical taxonomy, under which 60 million documents have been classified worldwide.

3.      Domain experts in national patent offices currently classify patent documents fully manually. These experts have an intimate knowledge of the IPC system.

4.      Patent documents are often available in several languages. Professional translators have already performed large numbers of translations manually.

 

1.3                 Availability

Information about the WIPO collections for automated categorization research is available online at http://www.wipo.int/classifications/ipc/en/ITsupport/Documentation/categorization.html.

The document collections are made available to the general public in accordance with the conditions of use detailed below. To request access, an application form should be completed at http://www.wipo.int/classifications/ipc/en/ITsupport/Documentation/categorization.html.

Online access to the WIPO-alpha collection is free of charge via an FTP server located at WIPO headquarters in Geneva, Switzerland. Details about the FTP location and password are transmitted by email after registration.

 

2                        Conditions of Use

The WIPO-alpha collection of documents for automated categorization is made available under the following conditions:

1.      WIPO is the sole distributor of the data collections.

2.      You, the user, may only grant access to the data to persons working under your supervision and control. You ensure that such persons comply with the conditions of this agreement.

3.      The display, reproduction, transmission, distribution, sale, or publication of the data is prohibited.

4.      Small excerpts of the data may be displayed to others or published in a scientific or technical context, solely for the purpose of describing the research and development or related issues.

5.      The WIPO training sets will be used for non-commercial research and development purposes only.

6.      Users of the WIPO training sets will communicate to WIPO the results of experiments performed with the data. WIPO can make internal use of all results communicated to it at its own discretion. WIPO will not disclose these results to third parties without express permission from the author.

7.      Results can be published in academic journals provided that the origin of the data is indicated. The WIPO-alpha data collection will be referred to in print as: "WIPO-alpha dataset, World Intellectual Property Office, Geneva, Switzerland, 2002". Electronic copies of published articles will be sent to WIPO with a full bibliographic reference.

8.      You, the user, agree to delete the data from any media on which it has been stored, if WIPO requires you to do so for legal or regulatory reasons.

 

3                        International Patent Classification


3.1                 Taxonomy

The International Patent Classification (IPC) is a complex hierarchical classification system comprising sections, classes, subclasses and groups (main groups and subgroups). The seventh edition of the IPC consisted of 8 sections, about 120 classes, about 630 subclasses, and approximately 69,000 groups.

Every category in the IPC is indicated by a symbol and has a title. The IPC divides all technological fields into eight sections designated by one of the capital letters A to H. Each section is subdivided into classes, whose symbols consist of the section symbol followed by a two-digit number, such as A01. In turn, each class is divided into several subclasses, whose symbols consist of the class symbol followed by a capital letter, for example, A01B, as illustrated in Table 1.

A        Section
01       Class
B        Subclass
1/00     Group: main group
or
1/24     Group: subgroup

Table 1: IPC symbol example

Each subclass is broken down into groups, which are either main groups or subgroups. Main group symbols consist of the subclass symbol followed by a one- to three-digit number, an oblique stroke and the number 00, for example, A01B 1/00. Subgroups form subdivisions under the main groups. Each subgroup symbol includes the subclass symbol followed by the one- to three-digit number of its main group, the oblique stroke and a number of at least two digits other than 00, for example, A01B 1/02. Table 2 shows a portion of the IPC at the start of Section A.

IPC taxonomy sample (levels shown, in order: section, subsection, class, subclass with its references)

A          SECTION A — HUMAN NECESSITIES

             AGRICULTURE

A01     AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING

A01B   SOIL WORKING IN AGRICULTURE OR FORESTRY; PARTS, DETAILS, OR ACCESSORIES OF AGRICULTURAL MACHINES OR IMPLEMENTS, IN GENERAL (making or covering furrows or holes for sowing, planting or manuring A01C 5/00; machines for harvesting root crops A01D; mowers convertible to soil working apparatus or capable of soil working A01D 42/04; mowers combined with soil working implements A01D 43/12; soil working for engineering purposes E01, E02, E21)

Table 2: Portion of the IPC classification at the start of Section A
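The symbol syntax described before Table 2 can be checked mechanically. The following Python sketch is only an illustration: the function name ipc_level and the regular expression are assumptions made here, not part of any WIPO tool, and the sketch checks the form of a symbol only, not whether the category exists in a given IPC edition.

```python
import re

# Section letter A-H, then optionally a two-digit class, a subclass letter,
# and a group written as "<1-3 digits>/<two or more digits>".
_IPC_SYMBOL = re.compile(
    r"^([A-H])(?:(\d{2})(?:([A-Z])(?:\s+(\d{1,3})/(\d{2,}))?)?)?$"
)


def ipc_level(symbol: str) -> str:
    """Return the hierarchical level named by a well-formed IPC symbol."""
    match = _IPC_SYMBOL.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a well-formed IPC symbol: {symbol!r}")
    _section, ipc_class, subclass, main, sub = match.groups()
    if ipc_class is None:
        return "section"
    if subclass is None:
        return "class"
    if main is None:
        return "subclass"
    # Main groups always end in /00; anything else is a subgroup.
    return "main group" if sub == "00" else "subgroup"


for example in ["A", "A01", "A01B", "A01B 1/00", "A01B 1/02"]:
    print(example, "->", ipc_level(example))
```

Note that the level of a subgroup below its main group (its number of dots) cannot be read from the symbol, so this check stops at "subgroup".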

The title of a subgroup is preceded by one or more dots indicating the hierarchical position of that subgroup, i.e. indicating that each subgroup forms a subdivision of the nearest group above it having one dot less. The hierarchy of subgroups can only be deduced by examining the titles and number of dots of all neighbouring subgroups and cannot be deduced from the symbols alone. An example is shown in Table 3.

IPC level            Symbol        Title
Section              B             PERFORMING OPERATIONS; TRANSPORTING
Class                B64           AIRCRAFT; AVIATION; COSMONAUTICS
Subclass             B64C          AEROPLANES; HELICOPTERS
Main group           B64C 25/00    Alighting gear
One-dot subgroup     B64C 25/02    • Undercarriages
Two-dot subgroup     B64C 25/08    • • non-fixed, e.g., jettisonable
Three-dot subgroup   B64C 25/10    • • • retractable, foldable, or the like
Four-dot subgroup    B64C 25/18    • • • • Operating mechanisms

Table 3: IPC taxonomy extract
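Because the subgroup hierarchy is encoded by dots rather than by the symbols, it has to be reconstructed by scanning the groups in scheme order. The Python sketch below shows one way to do this; the function name assign_parents and the input layout (a list of symbol and dot-count pairs, with main groups given zero dots) are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def assign_parents(groups: List[Tuple[str, int]]) -> List[Optional[str]]:
    """For each (symbol, dot_count) in scheme order, return its parent symbol.

    A main group has zero dots and no parent within the subclass. Each
    subgroup is attached to the nearest preceding group with one dot less.
    """
    parents: List[Optional[str]] = []
    last_seen: Dict[int, str] = {}  # most recent symbol at each dot depth
    for symbol, dots in groups:
        if dots == 0:
            parents.append(None)
        else:
            if dots - 1 not in last_seen:
                raise ValueError(f"{symbol} has no group with {dots - 1} dot(s) above it")
            parents.append(last_seen[dots - 1])
        last_seen[dots] = symbol
        # Deeper entries belong to the previous branch and are no longer valid.
        for depth in [d for d in last_seen if d > dots]:
            del last_seen[depth]
    return parents


# The groups of Table 3, in scheme order, with their dot counts.
table3 = [
    ("B64C 25/00", 0),
    ("B64C 25/02", 1),
    ("B64C 25/08", 2),
    ("B64C 25/10", 3),
    ("B64C 25/18", 4),
]
for (symbol, _dots), parent in zip(table3, assign_parents(table3)):
    print(symbol, "<-", parent)
```

The input must contain all neighbouring groups of the subclass in their published order; skipping entries can attach a subgroup to the wrong parent.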

Detailed information about the IPC taxonomy is available online (http://www.wipo.int/classifications/ipc/en/general/).

Patent categorization in the IPC is complicated by the following factors:

1.      References: Many IPC categories contain references and notes, which serve to guide the classification procedure, as illustrated in Table 2. There are two main types of references:

o        Limitation of scope: These references serve to restrict the patents classified in the category, and indicate related categories where some patents should be preferably placed.

o        Guidance: Lists related categories where similar patents are placed.

The references may list similar categories that are distant in the IPC hierarchy. In Table 2, for example, a reference to class E01 exists in subclass A01B. The IPC thus contains a multitude of hyperlinks.

2.      Placement rules: Patent classification is governed by additional placement rules. In certain parts of the IPC, a last-place rule governs the classification of documents relating to two categories at the same hierarchical level (see for example note (2) of class C07), and indicates that the second of two categories should always be selected if two are found to concord with the subject of the patent application. In other parts of the IPC, different specific rules hold (see for example note (5) of subclass B32B, where a first-place rule holds).

3.      Secondary codes: Most patents are assigned not just a single IPC code but also a set of secondary (i.e. supplementary) classification codes relating to other aspects expressed in the patent.

A catchword index that indicates relevant IPC categories for over 20,000 words has been developed to guide human categorization. The catchword index is also available online and for download in XML format.

 

3.2                 Editions

Information about various editions of the IPC is available under http://www.wipo.int/classifications/ipc/en/ipc_editions.html

 

3.3                 Languages

The IPC exists and is published online in its two authentic versions, English (http://www.wipo.int/classifications/ipc/ipc8/?lang=en ) and French (http://www.wipo.int/classifications/ipc/ipc8/?lang=fr&menulang=FR ).

Complete texts of the IPC are also prepared and published in other languages. Some of them are accessible through the IPC bridge from the above WIPO official internet publication.

 

3.4                 Downloads

The IPC taxonomy is available for download in a variety of formats (e.g. XML) under http://www.wipo.int/classifications/ipc/en/support/.

 

4                        Collection structure



4.1                 Document description

The documents in the WIPO-alpha collection consist of patent applications submitted to WIPO under the Patent Cooperation Treaty (PCT). This scheme enables patent applications to be submitted simultaneously in a large number of countries. Not all the documents in the collection became or will become granted patents.

A patent application includes a title, a list of inventors, a list of applicant companies or individuals, an abstract, a claims section, and a long description. Accompanying figures may also be present, but these are not retained in the WIPO-alpha collection. The documents in the WIPO-alpha collection are in English.

IPC codes have been attributed to the patent applications in the WIPO-alpha collection by national or regional patent offices. The syntax of the attributed IPC symbols has been validated in this collection.

The WIPO-alpha collection consists primarily of documents classified under IPC edition 7. Some documents classified under edition 6 are included, provided their IPC codes are still valid in IPC edition 7. The documents were published between 1998 and 2002.

Because patent applications are sometimes republished several times, following corrections, the WIPO-alpha collection may contain a residual number of duplicate or very similar records. Efforts have been made to remove such duplication.

The documents have been converted to electronic form by optical character recognition (OCR). Because the resulting text has undergone extensive automatic and manual checking, few OCR errors are expected to remain.

The documents originate from a variety of countries, indicated by the first two characters of the patent application number, according to the following standard: http://www.wipo.int/export/sites/www/standards/en/pdf/03-03-01.pdf.

About WIPO-de collection:

The documents in the WIPO-de collection consist of German-language patent documents provided by the German Patent Office (Deutsches Patent- und Markenamt, http://www.dpma.de) and extracted from the DEPAROM collection (http://www.deparom.de).

Accompanying figures may also have been present, but these are not retained in the WIPO-de collection. The documents in the WIPO-de collection are in German.

The German Patent Office has attributed IPC codes to the patent applications in the WIPO-de collection.

Few of the German-language documents in WIPO-de are expected to have English-language translations in WIPO-alpha.

 

4.2                 Record format

Sample documents from the WIPO-alpha and WIPO-de collections are provided in Tables 4 and 5. Documents are provided in ASCII format with XML markup.

The reference information in the <record> tag contains the country of origin (equal to "WO" for the PCT documents in WIPO-alpha), a document reference number, the publication type (A1 for patent applications published with a search report, A2 for patent applications published or republished without a search report, A3 for patents with a late publication of the search report), an application number and a publication number. Priority numbers in a <prs> tag indicate patent publication numbers and dates in various countries.

A patent application always has a main IPC code, in the <ipcs> tag, and can have several additional secondary codes indicating secondary aspects or technologies of the patent application. The IPC codes are indicated in a fixed-format without spaces and with zero padding to three digits of the group number (for example A01B 1/00 is indicated as A01B00100). The subclass code is obtained by retaining the first four characters of the IPC symbol (A01B in this example), while the main-group is obtained by retaining the first 7 characters of the IPC symbol (A01B001). The IPC edition used for the classification is indicated as an attribute.
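The fixed-format codes can be split by position, as the following Python sketch illustrates. The names IpcCode, parse_ipc and to_display are assumptions made here. The sketch also assumes that everything after the seventh character is the subgroup number; the text above only shows two-digit subgroups, so longer subgroup numbers are handled by assumption.

```python
from typing import NamedTuple


class IpcCode(NamedTuple):
    section: str     # e.g. "A"
    ipc_class: str   # e.g. "A01"
    subclass: str    # first four characters, e.g. "A01B"
    main_group: str  # first seven characters, e.g. "A01B001"
    subgroup: str    # remaining digits, e.g. "00" (assumed layout)


def parse_ipc(code: str) -> IpcCode:
    """Split a fixed-format IPC code such as 'A01B00100' by position."""
    code = code.strip()
    if len(code) < 9 or not code[4:].isdigit():
        raise ValueError(f"unexpected IPC code format: {code!r}")
    return IpcCode(code[0], code[:3], code[:4], code[:7], code[7:])


def to_display(code: str) -> str:
    """Convert 'A01B00100' back to the printed form 'A01B 1/00'."""
    parts = parse_ipc(code)
    main_number = int(code.strip()[4:7])  # drops the zero padding
    return f"{parts.subclass} {main_number}/{parts.subgroup}"


print(parse_ipc("A01B00100"))
print(to_display("D01D00106"))
```

Truncating to the first four or seven characters is enough to obtain subclass-level or main-group-level labels for a categorization task.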

Inventors and applicant companies are given in <ins> and <pas> tags respectively.

In the WIPO-alpha dataset, the title, abstract, claims, and full description are provided in English, in <tis>, <abs>, <cls>, <txts> tags respectively. The claims section is set in a special legalistic language and serves to determine the exact scope of the future granted patent.

In the WIPO-de dataset, the title, claims, and full description are provided in German, in <tis>, <cls>, <txts> tags respectively.

<?xml version="1.0" encoding="iso-8859-1"?>

<!DOCTYPE record SYSTEM "../../../../ipctraining.dtd">

<record cy="WO" an="US0024942" pn="WO012006320010322" dnum="0120063" kind="A1">

<prs>

<pr prn="US19990914 60/153,825"/>

</prs>

<ipcs ed="7" mc="D01D00106">

<ipc ic="D01F00110"></ipc>

<ipc ic="D01F00604"></ipc>

<ipc ic="D01F00606"></ipc>

</ipcs>

<ins>

<in>COOK, Michael, Charles</in>

<in>MCDOWALL, Debra, Jean</in>

<in>STANO, Dana, Elizabeth</in>

<in>POWERS, Michael, David</in>

<in>MARMON, Samuel, Edward</in>

</ins>

<pas>

<pa>KIMBERLY-CLARK WORLDWIDE, INC.</pa>

</pas>

<tis>

<ti xml:lang="EN">METHOD OF FORMING A TREATED FIBER AND A TREATED FIBER FORMED THEREFROM

</ti>

</tis>

<abs>

<ab xml:lang="EN">The present disclosure is directed to a method of forming a treated fiber. A molten polymer is delivered to a fiber spinning assembly adapted to form and distribute polymer streams. At least one treatment is applied in a liquid state to at least one region on the surface of at least one molten polymer stream within the fiber spinning assembly. A substantial portion of the treatment remains on the surface of the resulting fiber within the treated region. One or more regions on the surface of the molten polymer may be treated with one or multiple treatments. The degree of coverage may vary from little coverage to complete coverage of the fiber surface. The treated regions may be in contact with one another or may be separate and distinct. A nonwoven web may be produced with selectively treated fiber regions by designing one or more fiber spinning assemblies to treat selected fibers or to apply multiple treatments. The regions of the nonwoven web may vary in treatment type, amount, or degree of coverage.</ab>

</abs>

<cls>

<cl xml:lang="EN">We claim:

1. A method of forming a treated fiber comprising:

a) providing a molten polymer;

b) delivering said molten polymer to a fiber spinning assembly adapted to

form and distribute a stream of said molten polymer; andc) applying a treatment in a liquid state to at least one region on the surface

of said molten polymer stream within said fiber spinning assembly,

such that a substantial portion of said treatment remains on the surface of the resulting fiber within said treated region.2. The method of claim 1, wherein said treatment has a boiling point of at least about300°F(149°C.

[… abridged …]</cl>

</cls>

<txts>

<txt xml:lang="EN"> METHOD OF FORMING A TREATED FIBER AND

A TREATED FIBER FORMED THEREFROM

Field of the Invention

The present invention relates to a treated fiber and a method of forming a treated fiber. Such treated fibers find many applications, for example, in nonwoven fabrics, yarns, carpets, and otherwise where fibers having one or more modified properties are desired. Background of the Invention

Nonwoven fabrics are finding increasing use in various applications, including personal care absorbent articles such as diapers, training pants, incontinence garments, mattress pads, wipers, and feminine care products (e. g., sanitary napkins), medical applications such as surgical drapes, gowns, wound care dressings, and facemasks, articles of clothing or portions thereof including industrial workwear and lab coats,household and industrial operations including liquid and air filtration, and the like. It is often desirable to modify the properties of the nonwoven fabric to perform a function or meet a requirement for a particular application.

[… abridged …]

Having thus described the invention in detail, it should be apparent that various modifications can be made in the present invention without departing from the spirit and scope of the following claims.</txt>

</txts>

</record>

Table 4: WIPO-alpha XML record structure, with abridged content

<record cy="DE" an="" date="20010214" pn="DE 10106849 A1 20020919" dnum="10106849" kind="A1">

<ipcs ed="7" mc="A01B02304">

</ipcs>

<ins>

<in>Pokriefke, Michael, 04229 Leipzig, DE</in>

</ins>

<pas>

<pa>Amazonen-Werke H. Dreyer GmbH Co. KG, 49205 Hasbergen, DE</pa>

</pas>

<tis>

<ti xml:lang="DE">

Scheibenegge

</ti>

</tis>

<cls>

<cl xml:lang="DE">

1. Scheibenegge mit einem Rahmen, an welchem die Scheiben zumindest teilweise versetzt zu einer quer zur Zugrichtung der Scheibenegge verlaufenden Geraden angeordnet sind, dadurch gekennzeichnet , dass der Rahmen einen vorderen und einen hinteren Träger (6 , 7 ), an welchem die Scheiben angeordnet sind, aufweist, dass jeder Träger aus zumindest

zwei mittels zumindest eines in Fahrtrichtung (4 ) gekröpften Zwischenträgers (8 ) miteinander verbundenen Teilträgern (6 ', 6 ", 7 ', 7 ", 7 ''') besteht.

2. Scheibenegge nach Anspruch 1, dadurch gekennzeichnet, dass der vordere Träger (6 ) zickzackförmige ausgebildet ist, dass der hintere Träger (7 ) einen weiteren Knick aufweist.

3. Scheibenegge nach einem oder mehreren der vorstehenden Ansprüche, dadurch gekennzeichnet, dass die Zwischenträger (8 ) an dem vorderen Träger(6 ), quer zur Fahrtrichtung (4 ) gesehen, an einer anderen Stelle als bei dem hinteren Träger (7 ) angeordnet sind.

4. Scheibenegge nach einem oder mehreren der vorstehenden Ansprüche, dadurch gekennzeichnet, dass zumindest an einigen Zwischenträgern (8 ) zumindest eine Scheibe (2 ) angeordnet ist.

5. Scheibenegge nach einem oder mehreren der vorstehenden Ansprüche, dadurch gekennzeichnet, dass hinter der Scheibenegge eine die Scheibenegge in der Eindringtiefe in den Boden führende Nachlaufwalze (10 ) angeordnet ist.

</cl>

</cls>

<txts>

<txt xml:lang="DE">

Die Erfindung betrifft eine Scheibenegge gemäß des Oberbegriffes des Patentanspruches 1. Derartige Scheibeneggen weisen z. B. in der Praxis X- oder Vförmig angeordnete Scheibeneggenabschnitte auf, die von der Arbeitsweise her ruhig laufen und eine zufriedenstellende Arbeit verrichten, jedoch in Fahrtrichtung sehr lang bauen. In der EP 04.81 538 B1 ist eine Scheibenegge beschrieben, bei welcher die Scheiben in zwei hintereinander verlaufenden Querreihen, die quer zur Fahrtrichtung und parallel zueinander verlaufen, angeordnet sind. Eine derartige Scheibenegge baut sehr kurz, hat jedoch den Nachteil, wenn sie in Arbeitsrichtung so eingesetzt wird, dass die sich auf dem Feld befindlichen Fahrspuren in Querrichtung überfahren werden, zum "Springen" neigt. Der Erfindung liegt die Aufgabe zugrunde, eine kurzbauende Scheibenegge zu schaffen, die die Vorteile einer X- oder Vförmig angeordneten Scheibenegge im Hinblick auf eine ruhige Arbeitslage aufweist.

Diese Aufgabe wird erfindungsgemäß dadurch gelöst, dass jeder Träger aus zumindest zwei mittels zumindest eines in Fahrtrichtung gekröpften Zwischenträgers miteinander verbundenen Teilträgern besteht. Infolge dieser Maßnahmen sind die in einer Querreihe angeordneten Scheiben, welche schräg zur Fahrtrichtung zum Boden angestellt sind, nicht mehr auf einer Geraden in einer Querreihe nebeneinander angeordnet. Somit sind die einzelnen Scheiben teilweise versetzt in Fahrtrichtung angeordnet, so dass die Scheiben in einer Querreihe nicht auf einer quer zur Fahrtrichtung verlaufenden Linie, sondern versetzt zu dieser Linie im Boden zum Eingriff kommen.

In einer Ausführungsform ist vorgesehen, dass der vordere Träger zick-zack-förmig ausgebildet ist und der hintere Träger einen weiteren Knick aufweist. Auch kann der Zwischenträger an dem vorderen Träger, quer zur

Fahrtrichtung gesehen, an einer anderen Stelle als bei dem hinteren Träger angeordnet sein. Vorteilhaft hat sich bei der Anordnung der Scheiben ausgewirkt, dass zumindest an einigen Zwischenträgern zumindest eine Scheibe angeordnet ist. Eine genaue Tiefenführung der Scheibenegge lässt sich dadurch erreichen, dass hinter der Scheibenegge eine die Scheibenegge in der Eindringtiefe in den Boden führenden Nachlaufwalze angeordnet ist. Weitere Einzelheiten der Erfindung sind der Beispielsbeschreibung und den Zeichnungen zu entnehmen.

[...]

</txt>

</txts>

</record>

Table 5: WIPO-de XML record structure, with abridged content
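Records such as those in Tables 4 and 5 can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree; the function name read_record and the dictionary keys are assumptions chosen for illustration, and the embedded sample is abridged from Table 4. Elements that are absent from a record (for example abstracts in WIPO-de) simply yield empty results.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_record(xml_bytes: bytes) -> dict:
    """Extract the main fields of one <record> into a plain dictionary."""
    record = ET.fromstring(xml_bytes)
    ipcs = record.find("ipcs")

    def by_language(path: str) -> dict:
        # Map each xml:lang value to the stripped text of the element.
        return {el.get(XML_LANG): (el.text or "").strip()
                for el in record.findall(path)}

    application_number = record.get("an") or ""
    return {
        "country": record.get("cy"),
        "kind": record.get("kind"),
        "application_number": application_number,
        # The first two characters of the application number give the origin.
        "origin": application_number[:2] or None,
        "main_ipc": ipcs.get("mc"),
        "secondary_ipcs": [ipc.get("ic") for ipc in ipcs.findall("ipc")],
        "inventors": [el.text for el in record.findall("ins/in")],
        "applicants": [el.text for el in record.findall("pas/pa")],
        "titles": by_language("tis/ti"),
        "abstracts": by_language("abs/ab"),
        "claims": by_language("cls/cl"),
        "descriptions": by_language("txts/txt"),
    }


# Abridged from the WIPO-alpha sample record shown in Table 4.
SAMPLE = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<record cy="WO" an="US0024942" pn="WO012006320010322" dnum="0120063" kind="A1">
<ipcs ed="7" mc="D01D00106">
<ipc ic="D01F00110"></ipc>
<ipc ic="D01F00604"></ipc>
</ipcs>
<ins><in>COOK, Michael, Charles</in></ins>
<pas><pa>KIMBERLY-CLARK WORLDWIDE, INC.</pa></pas>
<tis><ti xml:lang="EN">METHOD OF FORMING A TREATED FIBER AND A TREATED FIBER FORMED THEREFROM
</ti></tis>
</record>"""

fields = read_record(SAMPLE)
print(fields["main_ipc"], fields["secondary_ipcs"], fields["origin"])
```

Passing the raw bytes rather than a decoded string lets the parser honour the iso-8859-1 encoding declared in the file.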

 

The XML tags are explained in the document DTD in Table 6. It should be noted that this DTD has been designed for all future WIPO collections, and not all tags are used in the WIPO-alpha dataset.

<!-- ## DTD for WIPO classification dataset ## -->
<!-- This file is: ipctraining.dtd -->

<!-- a record contains information about a single patent or patent application
     prs:     collection of priority numbers
     ins:     collection of inventors
     pas:     collection of applicants
     ipcs:    collection of IPC codes
     uscs:    collection of US codes
     tis:     collection of titles in several languages
     abs:     collection of abstracts in several languages
     cls:     collection of claims in several languages
     txts:    collection of full-text versions in several languages
-->
<!ELEMENT record (prs?, ipcs, uscs?, ins, pas, tis, abs, cls?, txts?)>

<!-- the attributes of a record are:
     cy:      country of origin
     kind:    publication type
     dnum:    document number
     an:      application number
     pn:      publication number
-->
<!ATTLIST record
   cy         CDATA   #REQUIRED
   kind       CDATA   #REQUIRED
   dnum       CDATA   #REQUIRED
   an         CDATA   #REQUIRED
   pn         CDATA   #REQUIRED
>

<!-- pr:      contains a priority number in a prn attribute -->
<!ELEMENT prs (pr*)>
<!ELEMENT pr EMPTY>
<!ATTLIST pr
   prn        CDATA   #REQUIRED
>

<!-- ipcs:    contains the IPC codes
     mc:      main IPC classification
     ed:      edition of the IPC used
     ipc:     contains additional IPC codes
     ic:      supplementary IPC codes -->
<!ELEMENT ipcs (ipc*)>
<!ATTLIST ipcs
   ed         CDATA   #REQUIRED
   mc         CDATA   #REQUIRED
>
<!ELEMENT ipc EMPTY>
<!ATTLIST ipc
   ic         CDATA   #REQUIRED
>

<!-- uscs:    contains the US patent classification codes
     mc:      main US classification
     usc:     additional US codes
     type:    type of additional US code
     uc:      value of additional US code -->
<!ELEMENT uscs (usc+)>
<!ATTLIST uscs
   mc         CDATA   #REQUIRED
>
<!ELEMENT usc EMPTY>
<!ATTLIST usc
   type       CDATA   #REQUIRED
   uc         CDATA   #REQUIRED
>

<!-- in: contains the name of an inventor -->
<!ELEMENT ins (in+)>
<!ELEMENT in (#PCDATA)>

<!-- pa: contains the name of an applicant company -->
<!ELEMENT pas (pa+)>
<!ELEMENT pa (#PCDATA)>

<!-- ti: contains a title with language indicated in xml:lang attribute -->
<!ELEMENT tis (ti+)>
<!ELEMENT ti (#PCDATA)>
<!ATTLIST ti
   xml:lang   CDATA   #REQUIRED
>

<!-- ab: contains an abstract with language indicated in xml:lang attribute -->
<!ELEMENT abs (ab+)>
<!ELEMENT ab (#PCDATA)>
<!ATTLIST ab
   xml:lang   CDATA   #REQUIRED
>

<!-- cl: contains the claims section with language indicated in xml:lang attribute -->
<!ELEMENT cls (cl+)>
<!ELEMENT cl (#PCDATA)>
<!ATTLIST cl
   xml:lang   CDATA   #REQUIRED
>

<!-- txt: contains the full-text description with language indicated in xml:lang attribute -->
<!ELEMENT txts (txt+)>
<!ELEMENT txt (#PCDATA)>
<!ATTLIST txt
   xml:lang   CDATA   #REQUIRED
>

Table 6: DTD for WIPO categorization datasets

The language of each document field is indicated in an xml:lang attribute using ISO 639-1 language codes (www.loc.gov/standards/iso639-2/englangn.html).

4.3                 File structure

The WIPO-alpha collection consists of two randomly-split non-overlapping sub-collections of patent applications, which are named training collection and test collection.

The training collection consists of documents roughly evenly spread across the IPC main groups, subject to the restriction that each subclass contains between 20 and 2000 documents. This collection is primarily designed for training automated categorizers.

The test collection consists of documents distributed roughly according to the frequency of a typical year's patent applications (year 2001 was used for this purpose), subject to the restriction that each subclass contains between 10 and 1000 documents. This collection is primarily designed for testing automated categorizers trained earlier. All documents in the test collection also have attributed IPC symbols, so there is no blind data.

As the documents in the two collections are indistinguishable, it is in principle possible to train and test categorizers by mixing the collections, although this is discouraged.

The training and test collections are organised in a set of directories that reflects the IPC taxonomy and the main IPC symbol of each patent document. The file hierarchy is as follows:

·  One directory per IPC section
   o  One directory per IPC class
      §  One directory per IPC subclass
         ·  One directory per IPC main group
            o  One file per patent application

Documents have been placed once in this hierarchy on the basis of their main IPC symbol only. It is thus expected that the main IPC symbol will be used to place documents for training.
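Since each file sits under the directories of its main IPC symbol, a collection can be inventoried by walking the tree. The Python sketch below counts documents per subclass by reading the mc attribute of each record rather than relying on directory names, which are not specified here. The function name count_by_subclass and the .xml file extension are assumptions.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter


def count_by_subclass(root_dir: str) -> Counter:
    """Count documents per IPC subclass under root_dir, using the main code."""
    counts: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # Assumption: one record per file, stored with an .xml extension.
            if not name.lower().endswith(".xml"):
                continue
            record = ET.parse(os.path.join(dirpath, name)).getroot()
            ipcs = record.find("ipcs")
            if ipcs is None or not ipcs.get("mc"):
                continue  # skip records without a main IPC code
            # The subclass is the first four characters of the main code.
            counts[ipcs.get("mc")[:4]] += 1
    return counts
```

A call such as count_by_subclass("path/to/training") (a placeholder path) returns a Counter keyed by subclass, which can be compared with the per-subclass limits stated above.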

Because of the desire to distribute documents evenly across the taxonomy, not all IPC subclasses have been included in WIPO-alpha. If very few documents have a main IPC symbol in a given subclass, this subclass has not been included in WIPO-alpha. The main IPC symbols of the WIPO-alpha collection are distributed in 114 classes and 451 subclasses. Secondary IPC symbols may contain additional classes and subclasses. The list of classes of the main IPC symbols in WIPO-alpha is provided in the Appendix. Statistics about the document distribution are indicated in Table 7.

WIPO-alpha Collection    Number of documents    Average number of documents per class    Median number of documents per class
Training collection      46,324                 406                                      213
Test collection          28,926                 253                                      78

Table 7: WIPO-alpha document distribution statistics

 

WIPO-de collection:

The file structure of the WIPO-de collection is different from that used for WIPO-alpha. To avoid distributing huge numbers of small files, the patent documents are distributed in large files that aggregate all records in an IPC main group.

The documents are organised in a set of directories that reflects the IPC taxonomy and the main IPC symbol of each patent document. The file hierarchy is as follows:

·  One directory per IPC section
   o  One directory per IPC class
      §  One directory per IPC subclass
         ·  One file per IPC main group aggregating all documents in the group

Note that contrary to WIPO-alpha, the training and test collections are distributed together in a single folder tree. Documents have been placed once in the file hierarchy on the basis of their main IPC symbol only.

Together with the patent documents, WIPO provides catalogue files indicating a suggested split of the documents into training and test collections, both for IPC class and IPC subclass categorization tasks. It is expected that the main IPC symbol will be used to place documents for training.

The catalogue files give the document file name, the patent document number, and the list of all IPC symbols for that document, with the main IPC symbol indicated first. An extract from the catalogue file for class-level training is shown in Table 8.

A/01/B/A01B001.xml DE10002864 A01 B25

A/01/B/A01B001.xml DE10017022 A01

A/01/B/A01B001.xml DE10020232 A01

A/01/B/A01B001.xml DE10020472 A01

A/01/B/A01B001.xml DE10021386 A01

A/01/B/A01B001.xml DE10033896 A01 E01

Table 8: Extract of the IPC class-level training catalogue
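Catalogue lines of the kind shown in Table 8 can be split into their fields as in the Python sketch below. It assumes the fields are separated by whitespace, as they appear in the extract; the names CatalogueEntry and parse_catalogue_line are chosen for illustration.

```python
from typing import List, NamedTuple


class CatalogueEntry(NamedTuple):
    file_name: str                # e.g. "A/01/B/A01B001.xml"
    document_number: str          # e.g. "DE10002864"
    main_symbol: str              # first symbol listed
    secondary_symbols: List[str]  # any further symbols


def parse_catalogue_line(line: str) -> CatalogueEntry:
    """Split one catalogue line into file name, document number and symbols."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least three fields: {line!r}")
    return CatalogueEntry(fields[0], fields[1], fields[2], fields[3:])


entry = parse_catalogue_line("A/01/B/A01B001.xml DE10002864 A01 B25")
print(entry.main_symbol, entry.secondary_symbols)
```

Because several documents share one aggregated file, the document number is needed together with the file name to locate a record.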

The rules used for constituting the training and test collections are as follows:

§         The test and training documents are selected by patent application date, with earlier documents used for training and later documents used for testing.

§         If more than 450 documents are available in a category, a 2/1 training/test ratio is used, and WIPO limits the number of training and test documents to 1000 and 500 per category respectively. If between 300 and 450 documents are available, WIPO takes 300 training documents, with the remainder taken as test documents. If fewer than 300 documents are available, WIPO uses 10 documents for testing and all others for training.

§         Because of the desire to distribute documents evenly across the taxonomy, not all IPC subclasses have been included in the catalogue files. If very few documents have a main IPC symbol in a given subclass, this subclass has not been included in the training and test catalogues. However, patent documents for infrequent categories are present in WIPO-de.

It should be noted that these rules are not the same as a global filtering on application dates, with all documents before a single given date taken as training documents and those after that date taken as test documents. Nevertheless, the scenario used here ensures that a sufficient number of training documents is obtained in each category.
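The size rules above can be sketched as a small function that returns the number of training and test documents for a category. This is an illustration only, not the procedure WIPO actually ran: the function name is hypothetical, the bounds 300 and 450 are assumed to be inclusive, the 2/1 ratio is assumed to round the training share down, and categories with fewer than 10 documents are assumed to go entirely to testing.

```python
from typing import Tuple


def split_counts(available: int) -> Tuple[int, int]:
    """Return (training, test) document counts for one category."""
    if available < 0:
        raise ValueError("available must be non-negative")
    if available > 450:
        # 2/1 ratio, capped at 1000 training and 500 test documents.
        training = min(1000, (2 * available) // 3)
        test = min(500, available - training)
    elif available >= 300:
        # Assumption: the 300-450 range is inclusive at both ends.
        training = 300
        test = available - training
    else:
        # Assumption: very small categories give at most 10 test documents.
        test = min(10, available)
        training = available - test
    return training, test


for n in (3000, 600, 400, 45):
    print(n, split_counts(n))
```

In a full implementation, the documents of each category would first be sorted by application date, so that the earliest documents fill the training quota and the later ones the test quota.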

Note that the test collections at class and subclass level are different.

Overall, the WIPO-de collection contains 117,246 documents. The main IPC symbols of the WIPO-de catalogues are distributed in 120 classes and 598 subclasses. Secondary IPC symbols may contain additional classes and subclasses.

Statistics of the document distributions proposed in the catalogue files for training and testing are indicated in Table 9. The numbers of documents in the WIPO-de catalogue files are similar to those in WIPO-alpha at class level, and about 56% larger at subclass level.

IPC depth   Collection   Number of    Number    Maximum docs   Minimum docs   Average docs   Median docs
                         categories   of docs   per category   per category   per category   per category
Class       Training     116          50555     1000           35             436            300
Class       Test         116          21271     500            10             183            125
Class       Total                     71826
Subclass    Training     424          84822     1000           20             200            129
Subclass    Test         424          26006     500            10             61             10
Subclass    Total                     110828

Table 9: WIPO-de document distribution statistics
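Per-category statistics of the kind reported in Table 9 can be recomputed from the main IPC symbols listed in a catalogue file. The following sketch uses only the Python standard library; the function name and the short input list are illustrative and do not come from the collection.

```python
from collections import Counter
from statistics import mean, median


def category_statistics(main_symbols):
    """Summarise how documents are distributed over categories.

    main_symbols: one main IPC symbol (class or subclass) per document.
    """
    per_category = list(Counter(main_symbols).values())
    return {
        "categories": len(per_category),
        "documents": sum(per_category),
        "maximum": max(per_category),
        "minimum": min(per_category),
        "average": mean(per_category),
        "median": median(per_category),
    }


# Illustrative input: six documents spread over three classes.
stats = category_statistics(["A01", "A01", "A01", "B25", "E01", "E01"])
print(stats["categories"], stats["documents"], stats["maximum"], stats["minimum"])  # 3 6 3 1
```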

 

5                        Categorization tasks

For training and testing an auto-categorization system, all textual and numerical information in XML attributes should be discarded or ignored. Only the information "outside" XML tags may be retained (inventors, applicants, title, abstract, claims, and description). The dependence of the categorization accuracy on the fields used is of interest.
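One way to keep only the text outside XML tags is to parse each record and collect its text nodes, which leaves attribute values out. The sketch below uses the standard xml.etree.ElementTree module; the element and attribute names in the sample record are made up for illustration and are not those of the collection's DTD.

```python
import xml.etree.ElementTree as ET


def text_outside_tags(xml_string: str) -> str:
    """Return the text content of a record, ignoring all attribute values."""
    root = ET.fromstring(xml_string)
    # itertext() yields element text and tails only; attributes are never included.
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


# Hypothetical record: the tag and attribute names are assumptions.
sample = (
    '<doc id="X1" date="20000101">'
    '<title lang="EN">Garden spade</title>'
    '<abstract>A spade with a curved blade.</abstract>'
    '</doc>'
)
print(text_outside_tags(sample))  # Garden spade A spade with a curved blade.
```

To study how accuracy depends on the fields used, the same approach can be applied to individual child elements instead of the whole record.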

As the WIPO-alpha collection is pre-split into training and test sub-collections, it is expected that researchers will use this partition in their investigations. It is possible to retain only part of the training collection, such as those documents that do not have any additional IPC symbols.

The following automated categorization tasks are suggested, although researchers are encouraged to pursue work in any direction:

1.      IPC class categorization: Following training, the test patent applications are attributed to an IPC class.

2.      IPC subclass categorization: Following training, the test patent applications are attributed to an IPC subclass.

Previous tests performed in-house at WIPO have made use of the three evaluation measures for categorization success shown in Figure 1.

·         Top prediction: The top category predicted by the classifier is compared with the main IPC class, shown as [mc] in Figure 1.

·         Three guesses: The top three categories predicted by the classifier are compared with the main IPC class. If a single match is found, the categorization is deemed successful. This measure is suited to evaluating categorization assistance, where a user ultimately makes the decision. In this case, it is tolerable that the correct guess appears second or third in the list of suggestions.

·         All classes: WIPO compares the top prediction of the classifier with all classes associated with the document, in the main IPC symbol and in additional IPC symbols, shown as (ic) in Figure 1. If a single match is found, the categorization is deemed successful.

Figure 1: Three evaluation measures for categorization success
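The three measures can be expressed as simple predicates over a ranked list of predicted categories. This is a minimal sketch following the descriptions above, not WIPO's in-house evaluation code; the function names and the example values are illustrative.

```python
def top_prediction(ranked, main_class):
    """Success if the first predicted category equals the main IPC class."""
    return bool(ranked) and ranked[0] == main_class


def three_guesses(ranked, main_class):
    """Success if the main IPC class is among the top three predictions."""
    return main_class in ranked[:3]


def all_classes(ranked, main_class, additional_classes):
    """Success if the first prediction matches the main or any additional class."""
    return bool(ranked) and ranked[0] in {main_class, *additional_classes}


# Illustrative document: main class A01, one additional class B25.
ranked = ["B25", "A01", "E01"]
print(top_prediction(ranked, "A01"))        # False
print(three_guesses(ranked, "A01"))         # True
print(all_classes(ranked, "A01", ["B25"]))  # True
```

Averaging each predicate over the test collection gives the corresponding success rate.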

Of high interest is the improvement in categorization accuracy that can be obtained by rejecting from auto-categorization those documents whose predictions have low confidence levels. Confusion matrices and other evaluations of categorization results are also of high interest.
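As a sketch of how such results might be reported, the code below computes accuracy after rejecting low-confidence predictions, together with the share of documents retained, and tallies a confusion matrix as counts of (actual, predicted) pairs. It assumes that the classifier outputs a numeric confidence score with each prediction; the function names, the threshold, and the sample results are illustrative.

```python
from collections import Counter


def accuracy_with_rejection(results, threshold):
    """Return (coverage, accuracy) when predictions below threshold are rejected.

    results: iterable of (confidence, predicted, actual) triples.
    Accuracy is None if every document is rejected.
    """
    results = list(results)
    kept = [(pred, actual) for conf, pred, actual in results if conf >= threshold]
    coverage = len(kept) / len(results) if results else 0.0
    if not kept:
        return coverage, None
    correct = sum(1 for pred, actual in kept if pred == actual)
    return coverage, correct / len(kept)


def confusion_counts(results):
    """Count how often each actual category is predicted as each category."""
    return Counter((actual, pred) for _conf, pred, actual in results)


# Illustrative results: (confidence, predicted class, actual main class).
results = [
    (0.9, "A01", "A01"),
    (0.8, "B25", "B25"),
    (0.4, "A01", "E01"),
    (0.3, "E01", "E01"),
]
print(accuracy_with_rejection(results, 0.0))  # (1.0, 0.75)
print(accuracy_with_rejection(results, 0.5))  # (0.5, 1.0)
print(confusion_counts(results)[("E01", "A01")])  # 1
```

Sweeping the threshold over a range of values shows the trade-off between accuracy and the fraction of documents left for manual categorization.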

It is equally instructive to report the necessary training and testing times for the categorization package, as well as the hardware employed.

6                        Future developments

In 2009, a second, larger multilingual collection of patent documents, known as WIPO-gamma, will be released. It is expected to contain documents in English and French.

The WIPO-gamma collection will thus contain documents having abstracts in more than one language, and a single-language full-text version. Further effort will be put into removing duplicate records.

The WIPO-gamma collection will be used to retrain the WIPO tool for categorization in the IPC (see IPCCAT).

Appendix: IPC classes of main IPC symbols in WIPO-alpha


A 01 agriculture; forestry; animal husbandry; hunting; trapping; fishing

A 21 baking; edible doughs

A 22 butchering; meat treatment; processing poultry or fish

A 23 foods or foodstuffs; their treatment, not covered by other classes

A 24 tobacco; cigars; cigarettes; smokers' requisites

A 41 wearing apparel

A 42 headwear

A 43 footwear

A 44 haberdashery; jewellery

A 45 hand or travelling articles

A 46 brushware

A 47 furniture; domestic articles or appliances; coffee mills; spice mills; suction cleaners in general

A 61 medical or veterinary science; hygiene

A 62 life-saving; fire-fighting

A 63 sports; games; amusements

B 01 physical or chemical processes or apparatus in general

B 02 crushing, pulverising, or disintegrating; preparatory treatment of grain for milling

B 03 separation of solid materials using liquids or using pneumatic tables or jigs; magnetic or electrostatic separation of solid materials from solid materials or fluids; separation by high-voltage electric fields

B 04 centrifugal apparatus or machines for carrying-out physical or chemical processes

B 05 spraying or atomising in general; applying liquids or other fluent materials to surfaces, in general

B 06 generating or transmitting mechanical vibrations in general

B 07 separating solids from solids; sorting

B 08 cleaning

B 09 disposal of solid waste; reclamation of contaminated soil

B 21 mechanical metal-working without essentially removing material; punching metal

B 22 casting; powder metallurgy

B 23 machine tools; metal-working not otherwise provided for

B 24 grinding; polishing

B 25 hand tools; portable power-driven tools; handles for hand implements; workshop equipment; manipulators

B 26 hand cutting tools; cutting; severing

B 27 working or preserving wood or similar material; nailing or stapling machines in general

B 28 working cement, clay, or stone

B 29 working of plastics; working of substances in a plastic state in general 

B 30 presses

B 31 making paper articles; working paper

B 32 layered products

B 41 printing; lining machines; typewriters; stamps

B 42 bookbinding; albums; files; special printed matter

B 43 writing or drawing implements; bureau accessories

B 44 decorative arts

B 60 vehicles in general

B 61 railways

B 62 land vehicles for travelling otherwise than on rails

B 63 ships or other waterborne vessels; related equipment

B 64 aircraft; aviation; cosmonautics

B 65 conveying; packing; storing; handling thin or filamentary material

B 66 hoisting; lifting; hauling

B 67 opening or closing bottles, jars or similar containers

B 81 micro-structural technology

C 01 inorganic chemistry

C 02 treatment of water, waste water, sewage, or sludge

C 03 glass; mineral or slag wool

C 04 cements; concrete; artificial stone; ceramics; refractories

C 05 fertilisers; manufacture thereof

C 06 explosives; matches

C 07 organic chemistry

C 08 organic macromolecular compounds; their preparation or chemical working-up; compositions based thereon

C 09 dyes; paints; polishes; natural resins; adhesives; miscellaneous compositions; miscellaneous applications of materials

C 10 petroleum, gas or coke industries; technical gases containing carbon monoxide; fuels; lubricants; peat

C 11 animal or vegetable oils, fats, fatty substances or waxes; fatty acids therefrom; detergents; candles

C 12 biochemistry; beer; spirits; wine; vinegar; microbiology; enzymology; mutation or genetic engineering

C 21 metallurgy of iron

C 22 metallurgy; ferrous or non-ferrous alloys; treatment of alloys or non-ferrous metals

C 23 coating metallic material; coating material with metallic material; chemical surface treatment; diffusion treatment of metallic material; coating by vacuum evaporation, by sputtering, by ion implantation or by chemical vapour deposition, in general; inhibiting corrosion of metallic material or incrustation in general

C 25 electrolytic or electrophoretic processes; apparatus therefor

C 30 crystal growth

D 01 natural or artificial threads or fibres; spinning

D 02 yarns; mechanical finishing of yarns or ropes; warping or beaming

D 03 weaving

D 04 braiding; lace-making; knitting; trimmings; non-woven fabrics

D 05 sewing; embroidering; tufting

D 06 treatment of textiles or the like; laundering; flexible materials not otherwise provided for

D 21 paper-making; production of cellulose

E 01 construction of roads, railways, or bridges

E 02 hydraulic engineering; foundations; soil-shifting

E 03 water supply; sewerage

E 04 building

E 05 locks; keys; window or door fittings; safes

E 06 doors, windows, shutters, or roller blinds, in general; ladders

E 21 earth or rock drilling; mining

F 01 machines or engines in general; engine plants in general; steam engines

F 02 combustion engines; hot-gas or combustion-product engine plants

F 03 machines or engines for liquids; wind, spring, weight, or miscellaneous motors; producing mechanical power or a reactive propulsive thrust, not otherwise provided for

F 04 positive-displacement machines for liquids; pumps for liquids or elastic fluids

F 15 fluid-pressure actuators; hydraulics or pneumatics in general

F 16 engineering elements or units; general measures for producing and maintaining effective functioning of machines or installations; thermal insulation in general

F 17 storing or distributing gases or liquids

F 21 lighting

F 22 steam generation

F 23 combustion apparatus; combustion processes

F 24 heating; ranges; ventilating

F 25 refrigeration or cooling; combined heating and refrigeration systems; heat pump systems; manufacture or storage of ice; liquefaction or solidification of gases

F 26 drying

F 27 furnaces; kilns; ovens; retorts

F 28 heat exchange in general

F 41 weapons

F 42 ammunition; blasting

G 01 measuring; testing

G 02 optics

G 03 photography; cinematography; analogous techniques using waves other than optical waves; electrography; holography

G 04 horology

G 05 controlling; regulating

G 06 computing; calculating; counting

G 07 checking-devices

G 08 signalling

G 09 educating; cryptography; display; advertising; seals

G 10 musical instruments; acoustics

G 11 information storage

G 21 nuclear physics; nuclear engineering

H 01 basic electric elements

H 02 generation, conversion, or distribution of electric power

H 03 basic electronic circuitry

H 04 electric communication technique

H 05 electric techniques not otherwise provided for