A practical view on text data mining from ContentMine
We at ContentMine are a non-profit NGO from Cambridge, UK, and practitioners at the forefront of text data mining (TDM), done the free and open way. Here we summarize our insights and show how you can do TDM in practice.
Text data mining is a wide field, and we focus on scholarly literature. Most of our sources are Open Access, a precondition for unimpeded mining, and we develop open source software to mine them.
TDM in practice
But why do text data mining, or content mining as we call it, with scientific publications? A central purpose is to explore massive numbers of publications and learn about whole research fields: how they evolved, who is important, how the use of language changed, which instruments or methods are used, or which drugs, genes, species or spatial entities are mentioned. Often you also want to find patterns in where entities occur or compute statistical aggregates of them. To be more concrete: it is not difficult to count all genes named in one publication, but what if you want to count them in all medical studies on a specific virus after an outbreak to get a better understanding of it? There may be thousands of pages or more to read, which cannot be done quickly or without massive resources. This is where content mining and open access help immensely, especially when the question is urgent and of high importance.
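To make the gene-counting example tangible, here is a minimal sketch in Python. It is not ContentMine's software, only an illustration of dictionary-based counting across many texts. The gene list, the sample sentences and the function name `count_genes` are all assumptions made up for this example; a real run would use a curated dictionary and the full text of each paper.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary of gene symbols (illustration only).
GENE_SYMBOLS = {"ACE2", "TMPRSS2", "IL6"}


def count_genes(texts, symbols=GENE_SYMBOLS):
    """Count whole-word dictionary matches over an iterable of texts."""
    # Build one alternation pattern; escape symbols in case they contain
    # regex metacharacters, and require word boundaries on both sides.
    alternation = "|".join(sorted(re.escape(s) for s in symbols))
    pattern = re.compile(r"\b(" + alternation + r")\b")
    totals = Counter()
    for text in texts:
        totals.update(pattern.findall(text))
    return totals


# Made-up sentences standing in for the full text of two papers.
papers = [
    "ACE2 and TMPRSS2 mediate viral entry; ACE2 expression varies.",
    "Elevated IL6 was observed, while ACE2 levels were unchanged.",
]

counts = count_genes(papers)
for symbol, n in counts.most_common():
    print(symbol, n)
```

The same loop works unchanged for two papers or for many thousands, which is the point of mining: the cost of adding another publication is close to zero, provided you are allowed to download and process it.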
…