Releasing 1.8 million open access publications from publisher systems for text and data mining

Published on 23 March 2018 by Thérèse Hameau

Text and data mining (TDM) offers an opportunity to improve the way we access and analyse the outputs of academic research. But the technical infrastructure of the current scholarly communication system is not yet ready to support TDM to its full potential, even for open access outputs. To address this problem, Petr Knoth, Nancy Pontika and Lucas Anastasiou have developed the CORE Publisher Connector, a toolkit service designed to assist text miners in accessing content through a single machine interface. The Connector aims to solve the heterogeneity among publisher APIs and assist text miners with data collection, provide a centralised point of access to all openly available scientific publications, and provide a high-performance, constantly updated access interface.

…

Open access and text mining of research papers have one thing in common: both aim to improve people's access to scientific knowledge. Text mining is, by its nature, performed on large corpora of text. In fact, many text mining tasks, such as semantic search, recommender systems, question answering, or content summarisation, can only realise their full potential when run on as large a corpus of publications as possible. This means that text miners must typically invest considerable time, effort, and resources in collecting their corpus of interest. Sometimes this task may prove impossible due to the technical restrictions and limitations of publisher platforms. According to a 2014 Jisc report, text miners can spend up to 90% of their total investigation time on data collection.

To eliminate these extra steps and save text miners time and money, we have developed the CORE Publisher Connector, a toolkit service designed to assist text miners in accessing content through a single machine interface.

…

The aim of the CORE Publisher Connector is to:

– Create a seamless layer for accessing content across publishers: the Connector attempts to solve the heterogeneity among publisher APIs and assist text miners with data collection
– Provide a generic, centralised point of access to all available resources: this is a large corpus of millions of open access scientific publications
– Provide a high-performance, up-to-date access interface: the corpus will be constantly updated to easily surface open access scientific literature.

…

This work is innovative in two ways. First, it constitutes the first systematic aggregation of gold and hybrid open access content from key publishers, content that aggregators like CORE and OpenAIRE have not been harvesting so far. It liberates nearly two million papers from key publishers for a range of activities, including TDM. Second, it is the first deployment of ResourceSync as an effective technology for distributing a large corpus of scholarly literature.

…

[…] CORE is a global aggregation service which has harvested over 83 million metadata records from more than 3,600 data sources and tens of thousands of journals. The majority of these records contain links to article full texts. In addition, as CORE has already ingested the content from the Publisher Connector, CORE is now directly hosting more than 10 million open access full texts, making CORE the world’s largest full text aggregator. The full texts are available directly from CORE via the CORE API (REST and ResourceSync) and downloadable as a dataset.
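To give a sense of what ResourceSync access can look like from a text miner's side, here is a minimal Python sketch. ResourceSync publishes resource lists in the Sitemap XML format, so a client can read the list, pick out the resources changed since its last visit, and fetch only those. The sample document, the example.org URLs, the timestamp format, and the helper name `list_resources` are all illustrative assumptions; this is not CORE's actual endpoint or client code.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace defined by the Sitemap protocol, which ResourceSync builds on.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical resource list: the URLs are placeholders, not CORE endpoints.
SAMPLE_RESOURCE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2018-03-01T00:00:00Z"/>
  <url>
    <loc>https://example.org/fulltext/paper-1.pdf</loc>
    <lastmod>2018-02-10T09:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.org/fulltext/paper-2.pdf</loc>
    <lastmod>2018-02-27T14:30:00Z</lastmod>
  </url>
</urlset>
""".strip()

# Assumption: timestamps use this exact UTC form; real lists may vary.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def list_resources(xml_text, modified_since=None):
    """Return (url, last_modified) pairs from a ResourceSync resource list.

    If modified_since is given, keep only resources changed after it, which
    is how a client could stay up to date without re-downloading everything.
    """
    root = ET.fromstring(xml_text)
    resources = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod_text = url.findtext("sm:lastmod", namespaces=NS)
        if loc is None or lastmod_text is None:
            continue  # skip entries missing the fields this sketch relies on
        lastmod = datetime.strptime(lastmod_text.strip(), TIMESTAMP_FORMAT)
        if modified_since is None or lastmod > modified_since:
            resources.append((loc.strip(), lastmod))
    return resources


if __name__ == "__main__":
    for url, lastmod in list_resources(SAMPLE_RESOURCE_LIST, datetime(2018, 2, 20)):
        print(url, lastmod.isoformat())
```

In practice the XML would be fetched over HTTP from whichever resource list URL the provider advertises, and each listed URL would then be downloaded into the local corpus.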

…
