Helping Improve Machine Translation for Patent Documents

August 4, 2011

WIPO is pleased to release to the scientific and R&D community a new linguistic data product, which will contribute to improving the quality of machine translation systems for patent documents. 
The PATENTSCOPE Corpus of Parallel Patent Applications (Coppa) uses data from WIPO’s international PATENTSCOPE database of patent documents to provide a bilingual “corpus” consisting of more than 8 million parallel segments of text in English and French, totaling over 170 million words. Technical details can be found here. Other language pairs will be added in the future if the associated source data become available to WIPO in sufficient volume and with the required redistribution rights.
Making this vast corpus available in a user-friendly format will contribute significantly to efforts aimed at building more accurate machine translation systems for patent texts. Better machine translation systems will, in turn, lower the linguistic barriers for inventors and for patent offices. Ultimately, more accurate machine translation will improve the efficiency of the international patent system, as well as access to the global repository of technological information contained within it.
The parallel segments were obtained by breaking down the abstracts and titles of twenty years’ worth of PCT international patent applications (1990 to 2010) into sentences, and mapping these sentences onto their translated versions, which were produced by specialist patent translation professionals. The resulting product is a treasure trove for linguistic research, in particular for terminology extraction, translation memory building and machine translation research.
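For readers unfamiliar with how parallel segments are built, the idea of splitting texts into sentences and pairing each with its translation can be illustrated with a minimal sketch. This is not WIPO's actual method, which this announcement does not describe; the function names, the punctuation-based splitting rule, the pair-by-position strategy and the sample sentences are all assumptions made purely for illustration.

```python
import re

# Assumed splitting rule: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Break a text into sentences using the naive rule above."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def align_segments(source_text, target_text):
    """Pair sentences by position; return no pairs if the counts differ.

    Real alignment tools handle merged or split sentences; this sketch
    simply discards texts whose sentence counts do not match.
    """
    src = split_sentences(source_text)
    tgt = split_sentences(target_text)
    if len(src) != len(tgt):
        return []
    return list(zip(src, tgt))


# Made-up example sentences, not taken from the corpus.
english = "A valve assembly is disclosed. The valve includes a spring."
french = "Un ensemble de soupape est décrit. La soupape comprend un ressort."
pairs = align_segments(english, french)
```

Each element of `pairs` is one English-French segment of the kind the corpus contains. Dropping mismatched texts keeps the sketch simple at the cost of losing data.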
WIPO is making the Corpus available free of charge to academic and private research institutions wishing to use it for research purposes only. In return, these institutions commit to sharing the published results with WIPO. For other parties wishing to use the product for non-academic research purposes, it is available for CHF 2,000, and is subject to a no-redistribution policy.