
Helping Improve Machine Translation for Patent Documents

August 4, 2011

WIPO is pleased to release to the scientific and R&D community a new linguistic data product, which will contribute to improving the quality of machine translation systems for patent documents. 
 
The PATENTSCOPE Corpus of Parallel Patent Applications (Coppa) uses data from WIPO’s international PATENTSCOPE database of patent documents to provide a bilingual “corpus” consisting of more than 8 million parallel segments of text in English and French, covering over 170 million words. Technical details can be found here. Other language pairs will be added in the future if the associated source data become available to WIPO in sufficient volume and with the required redistribution rights.
 
Making this vast corpus available in a user-friendly format will contribute significantly to efforts to build more accurate machine translation systems for patent texts. Better machine translation systems will, in turn, lower the linguistic barriers for inventors and for patent offices. Ultimately, more accurate machine translation will improve the efficiency of the international patent system, as well as access to the global repository of technological information contained within it.
 
The parallel segments were obtained by breaking down the abstracts and titles of twenty years’ worth of PCT international patent applications (from 1990 to 2010) into sentences, and mapping these sentences onto their translated versions, which were produced by specialist patent translation professionals. The resulting product is a treasure trove for linguistic research, in particular for terminology extraction, translation memory building and machine translation research.
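
As a rough illustration of what splitting texts into sentences and mapping them onto their translations involves, the following minimal Python sketch pairs sentences by position. It is not WIPO's actual method, which is not described here: the function names `split_sentences` and `align_segments`, the punctuation-based splitting rule, and the sample abstract are all assumptions made for this example.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def align_segments(source_text, target_text):
    """Pair sentences by position; return no pairs if the sentence counts differ."""
    src = split_sentences(source_text)
    tgt = split_sentences(target_text)
    if len(src) != len(tgt):
        # A positional mapping is unreliable here, so the text is skipped.
        return []
    return list(zip(src, tgt))


# Hypothetical abstract and its translation, invented for illustration.
english = "A valve is disclosed. The valve has a spring."
french = "Une vanne est décrite. La vanne comporte un ressort."

pairs = align_segments(english, french)
for en, fr in pairs:
    print(en, "|", fr)
```

A production pipeline would need a more robust sentence splitter (abbreviations and decimal numbers defeat this one) and an alignment step that can handle sentences merged or split in translation.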
 
WIPO is making the Corpus available free of charge to academic and private research institutions wishing to use it for research purposes only. In return, these institutions commit to sharing the published results with WIPO. For other parties wishing to use the product for non-academic research purposes, it is available for CHF 2,000 and is subject to a no-redistribution policy.