The dynamics and social organization of innovation in the field of oncology/Instruments
- 1 Questionnaires
- 2 Audiovisual
- 3 Data extract, transform, load
- 4 Data analysis
- 5 Infrastructure
Data extract, transform, load
Mining the ESMO conference abstracts
Tools in the esmominer repository allow us to obtain, process and structure the abstracts from ESMO conferences. They automate:
- downloading the conference data, which comes in different formats
- where needed, image pre-processing and OCR
- parsing an OCR'd file into individual abstracts (both "4 discrete quadrants" and "2 continuous columns" styles)
- scraping HTML data from the website, properly parsing metadata and content
Still missing are: (a) a means to separate header from content in the OCR'd data, and (b) a means to attribute abstracts to sections in the OCR'd data.
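To illustrate the step of parsing OCR'd text into individual abstracts, here is a minimal sketch that is not taken from esmominer. It assumes the page text has already been linearised into reading order and that each abstract starts on a line beginning with an identifier such as "123P", "45O" or "LBA7". The identifier pattern, the function name split_abstracts and the record layout are all hypothetical.

```python
import re

# Hypothetical identifier pattern (an assumption, not esmominer's rule):
# either "LBA" followed by digits, or digits followed by 1-3 capital letters,
# at the start of a line.
ABSTRACT_ID = re.compile(r"^(?:LBA\d+|\d+[A-Z]{1,3})\b", re.MULTILINE)


def split_abstracts(text):
    """Split linearised OCR text into a list of {"id", "body"} records."""
    matches = list(ABSTRACT_ID.finditer(text))
    records = []
    for i, match in enumerate(matches):
        # An abstract runs until the next identifier, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        records.append({"id": match.group(0), "body": text[match.end():end].strip()})
    return records


sample = (
    "123P Title one\nBackground: foo\n12 patients enrolled\n"
    "45O Title two\nResults: bar\n"
    "LBA7 Late breaking\nConclusions: baz\n"
)
abstracts = split_abstracts(sample)
```

A pattern this simple would likely need tuning against real OCR output, where identifiers may be misread or split across lines.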
Collecting data on the editorial board of oncology journals
- a methodology for manually extracting editorial board members from cancer journals
Tools in the abstractology repository implement several data analysis methods.
This is the basic class. It can import several different sources of data (CSV, PubMed XML, CorText SQLite, esmominer JSON, etc.), tokenize and prepare the corpus for use by the other classes, and apply some smart transformations to the text by extracting more relevant terms. It also contains a few generic statistical functions and plots, and can get citation data from Web of Science for documents that have a compatible identifier such as a PubMed ID. Finally, it facilitates sampling and filtering the data in order to carry out modelling and analysis with the other classes.
This class uses gensim to train Word2Vec (and could possibly do Doc2Vec) models on the data, including splitting the data to produce several models, for example one model per time period. It can also subsample the data to balance the amount of training per period, among other adjustments. It implements a similarity score between a text and the models; this is an improved version of the score provided by gensim, correcting some biases that the original score does not take into account.
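The splitting and balancing described above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: it takes the document as the unit of "amount of training" (the real implementation may balance on tokens or something else), and the names split_by_period and balance are hypothetical. Each balanced sample would then be passed to a separate model trainer.

```python
import random
from collections import defaultdict


def split_by_period(docs, period_of):
    """Group documents by the period returned by period_of(doc)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[period_of(doc)].append(doc)
    return dict(groups)


def balance(groups, seed=0):
    """Subsample every period down to the size of the smallest one.

    Assumption: balance is measured in number of documents.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    size = min(len(group) for group in groups.values())
    return {period: rng.sample(group, size) for period, group in sorted(groups.items())}


# Toy corpus: 5 documents in 2010, 2 in 2011, 3 in 2012.
docs = [{"year": year, "text": f"doc {i}"}
        for i, year in enumerate([2010] * 5 + [2011] * 2 + [2012] * 3)]
groups = split_by_period(docs, lambda doc: doc["year"])
balanced = balance(groups)
```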
This class takes texts and produces networks that can in turn be analysed using the Stochastic Block Models implemented in graph-tool. It is designed to produce several kinds of networks that serve different analytical purposes (term-term networks, document-term networks, context-term networks, etc.), and to split networks by period for layer analysis and model comparison. It handles these tasks and then launches the calculation of approximately optimal blockstates for these networks. It can also provide a kind of text-to-model score by calculating the probability of missing edges on the network.
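As a rough illustration of two of the network kinds mentioned above, the sketch below builds document-term and term-term edge lists from tokenized documents using only the standard library. It is not the class's implementation: the function names are hypothetical, and weighting term-term edges by the number of documents in which both terms occur is an assumption. The resulting edge lists could then be loaded into a graph library such as graph-tool for block model inference.

```python
from collections import Counter
from itertools import combinations


def document_term_edges(docs):
    """Bipartite edges (document index, term), one per distinct term per document."""
    return [(i, term) for i, tokens in enumerate(docs) for term in sorted(set(tokens))]


def term_term_edges(docs):
    """Co-occurrence edges weighted by the number of documents containing both terms."""
    weights = Counter()
    for tokens in docs:
        # Sorting makes each pair canonical, so (a, b) and (b, a) are counted together.
        weights.update(combinations(sorted(set(tokens)), 2))
    return weights


docs = [["tumor", "gene", "tumor"], ["gene", "trial"], ["tumor", "gene"]]
doc_term = document_term_edges(docs)
term_term = term_term_edges(docs)
```

Splitting by period would amount to calling these functions once per period's subset of documents.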
Here are implemented several plots and statistical measures that employ the similarity score provided by one of the previous classes to understand how the typicality of an abstract evolves across a set of models, one for each period of the corpus. It allows for creating annotated texts that enhance the reading of a document. It is also intended to locate documents that have a specific profile associated with greater innovation or influence.
Here are implemented several plots and statistical measures that employ the blockstate of the networks generated in Graphology to provide insights into the structure of the corpus and its evolution in terms of this structure. It provides ways to relate these measures to exogenous data such as citation counts, existing categorizations, and other attribute data.
Work done in this project will, as time permits, be ported and made available for use within the CorText Platform.
Below we list some tools employed in, but not directly developed by, this project:
- The Python programming language
- The lxml software library, to manipulate HTML and XML files
- The pdfminer.six software library, to extract text plus layout from PDF files
- The poppler software library, to render page images from PDF files
- The tesseract-ocr software project, to turn images into text plus layout by OCR
- The SciPy stack for scientific computing
- The Pandas data analysis software library
- The Bokeh data visualisation software library
- The gensim natural language processing software library
- The graph-tool network analysis software library