WikiJournal Preprints/Aggregation of scholarly publications and extracted knowledge on COVID19 and epidemics/Open Notebook
This article is an unpublished pre-print not yet undergoing peer review.
Abstract
- Availability of resources
- All code available at github.com/petermr/openVirus
- Open notebook as supplementary material 1
Open notebook
Survey of initial [R] package availability
- WikipediR - Wikipedia page metadata and contents retrieval
- WikidataR - Wikidata item and property retrieval
- WikidataQueryServiceR - Wrapper for query.wikidata.org SPARQL queries
- QueryWikidataR - Alternative package for above
T.Shafee(Evo﹠Evo)talk 09:10, 19 March 2020 (UTC)
Missing capabilities
Currently there is no ability to write to Wikidata from R. An initial solution is to produce tables in the quickstatements format and use that tool to batch-edit Wikidata (Issue-17a). An alternative is to make a function to deposit via the quickstatements API (I-17b). A final option is to make a function to deposit via the Wikidata API (I-15). T.Shafee(Evo﹠Evo)talk 09:10, 19 March 2020 (UTC)
Closing capability gaps
I've put together a simple as_quickstatement function in wikiPackageTesting.R to format quickstatements in the "V1" format for pasting into the quickstatements tool. T.Shafee(Evo﹠Evo)talk 09:51, 24 March 2020 (UTC)
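For reference, a minimal sketch of the quickstatements "V1" format that such a function emits (tab-separated item, property, value per line; the QIDs and label below are illustrative, not taken from the source):

```
Q42	P921	Q84263196
CREATE
LAST	P31	Q13442814
LAST	Len	"An illustrative English label"
```

CREATE makes a new item and LAST addresses the item just created; pasting such lines into the quickstatements tool runs them as Wikidata edits.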
I've also made some suggestions to the WikidataR repository (Pull-32 and P-33) since they seem of general enough use that integration with that package would be useful. 10:20, 24 March 2020 (UTC)
I've now got a simple function working for submitting quickstatements from [R] (as a tibble for reading in [R], as V1 format for pasting into the website, and as a direct API submission). There seems to be a current bug in qualifier submission via the API (the same submission behaves differently via the API versus via the website), and the CREATE and LAST features still need to be implemented. I'll continue working on those at I-15. This should allow us to e.g. write "main subject" (P921) to the Wikidata items of articles. Having got the hang of the quickstatements API, the general Wikidata API seems less intimidating, so I'll also look into whether it is useful for any cases. T.Shafee(Evo﹠Evo)talk 06:03, 5 April 2020 (UTC)
Creating Dictionaries
From Wikipedia Templates
Have created first version of a tool to create dictionaries from Wikipedia Templates (I-21). See github.com/petermr/ami3. Petermr (discuss • contribs) 09:56, 20 March 2020 (UTC)
Currently as a test. This code will convert the three Templates to structured dictionaries with Wikidata IDs. They will need editing to remove false positives and glitches.
@Test
public void testCreateFromMediawikiTemplateListURL() {
    String[] templates = {"Baltimore_(virus_classification)", "Antiretroviral_drug", "Virus_topics"};
    String fileTop = "/Users/pm286/projects/openVirus/dictionaries/";
    String wptype = "mwk";
    for (String template : templates) {
        createFromMediaWikiTemplate(template, fileTop, wptype);
    }
}
Results in https://github.com/petermr/openVirus/tree/master/dictionaries
Currently turning this into a commandline. Testing:
String args = "create"
        + " --directory " + "/Users/pm286/projects/openVirus/dictionaries/"
        + " --outformats html,xml"
        + " --template " + "Baltimore_(virus_classification) Antiretroviral_drug Virus_topics"
        + " --wptype " + "mwk"
        + " --wikilinks wikidata";
new AMIDictionaryTool().runCommands(args);
which translates to commandline:
ami-dictionary -p covid2020 create --directory /Users/pm286/projects/openVirus/dictionaries --outformats html,xml --template Baltimore_(virus_classification) Antiretroviral_drug Virus_topics --wptype mwk --wikilinks wikidata
You MAY have to escape the '(...)' on your machine
Recently ran the commandline (note the quotes '...' around parentheses (...)):
ami-dictionary -p . create --directory /Users/pm286/projects/openVirus/dictionaries/ --outformats html,xml --template Antiretroviral_drug --wptype mwk --wikilinks wikidata
ami-dictionary -p . create --directory /Users/pm286/projects/openVirus/dictionaries/ --outformats html,xml --template 'Baltimore_(virus_classification)' --wptype mwk --wikilinks wikidata
ami-dictionary -p . create --directory /Users/pm286/projects/openVirus/dictionaries/ --outformats html,xml --template Virus_topics --wptype mwk --wikilinks wikidata
(Am working on handling multiple Templates in one job, but this is not yet working.)
From Wikidata queries
Use of the Wikidata SPARQL query service would also be a logical way to build additional dictionaries (will prep a process via WikidataQueryServiceR).
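A minimal sketch of the kind of SPARQL query that could seed such a dictionary — here, scholarly articles with "main subject" (P921) COVID-19. The QIDs are illustrative (Q13442814 for scholarly article; the COVID-19 QID should be checked against Wikidata), and this is a sketch rather than a tested dictionary-builder:

```sparql
# Scholarly articles (P31 = Q13442814) whose main subject (P921)
# is COVID-19 (illustrative QID), with English labels.
SELECT ?article ?articleLabel WHERE {
  ?article wdt:P31 wd:Q13442814 ;
           wdt:P921 wd:Q84263196 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
```

In [R], the WikidataQueryServiceR package can run such a query string against query.wikidata.org and return the results as a data frame.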
CORD19 dataset
overview from Sebastian Kohlmeier
Sebastian Kohlmeier <sebastiank@allenai.org>, Fri Mar 20, 3:00 PM, to t.shafee, pm286
Hi Thomas and Peter,
Great to meet and fantastic to hear that you are planning to use the CORD-19 dataset! I would recommend taking a look at the licensing information that we've published in case you have any questions, but given that your use case is to disseminate the data for research purposes in an open dataset, I don't see any issues (I would recommend referring back to our license and retaining the original licensing information to ensure that users of your database are aware).
By the way, we are actively working to expand the dataset and expect to more than double the size of the full text dataset in today's release (we've received all coronavirus-related full text articles from Elsevier). Going forward we will continue to further invest in the resource to add more content and supplemental functionality (e.g. biomedical entities, citation links, etc.). Once you get openVirus up and running we should discuss how information could potentially flow back into CORD-19 or if there are other ways we can collaborate!
Thanks again and please let us know if you have any questions!
licensing and the right to mine
Formally there may be restrictions on mining copyrighted material, but in the UK we fought for and won the right in 2014 to text mine, without explicit permission, anything we have legal right to read. As a UK citizen resident in the UK I am exercising this right. I will therefore expose all of the text I mine and all of the results. This is the only consistent approach for Open Notebook Science.
analysis of CORD-19
Have looked at overview and downloaded some of the material from https://pages.semanticscholar.org/coronavirus-research
Latest release contains papers up until 2020-03-20 with over 29,000 full text articles. (Changelog from previous release.)
- Commercial use subset (includes PMC content) -- 9118 full text (new: 128), 183Mb
- Non-commercial use subset (includes PMC content) -- 2353 full text (new: 385), 41Mb
- Custom license subset -- 16959 full text (new: 15533), 345Mb
- bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 885 full text (new: 110), 14Mb
- Metadata file -- 60Mb
- Readme
Each paper is represented as a single JSON object. The schema is available here.
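Since each paper is a single JSON object, a minimal sketch of inspecting one from the shell. The JSON below is an invented stand-in (not a real CORD-19 record); the field names follow the published per-paper schema and should be checked against the linked schema document:

```shell
# Minimal stand-in for one CORD-19 per-paper JSON file
# (invented values; field names per the published schema).
cat > paper.json <<'EOF'
{"paper_id": "abc123", "metadata": {"title": "An example coronavirus paper", "authors": []}}
EOF

# Extract the title with Python's stdlib JSON parser.
python3 -c 'import json; print(json.load(open("paper.json"))["metadata"]["title"])'
# prints: An example coronavirus paper
```

A real pipeline would unpack a subset tarball (e.g. with tar -xzf) and loop this extraction over every JSON file in the resulting directory.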
Immediate reaction:
- have been mining EuropePMC for years so may continue with that.
- custom licence subset. This will be largely new and I'll try to look today.
- this is called PMC subset (pmc_custom_license.tar.gz) but I can't find a PMCID in the first article.
- the metadata section gives title and authors but no place of publication.
- biorxiv / medrxiv. Useful. I have written a scraper for these but it's slow. Will compare my subset with these. Interestingly they are in JSON, whereas I would expect JATS. Maybe this is to make it more easily mineable? I've developed my AMI system for JATS.
- Metadata. Not sure whether this duplicates what's in the paper or is additional. Separating metadata can be a technical problem.
earlier release
Latest release contains papers up until 2020-03-13 with over 13,000 full text articles.
- Commercial use subset (includes PMC content) -- 9000 papers, 186Mb
- Non-commercial use subset (includes PMC content) -- 1973 papers, 36Mb
- PMC custom license subset -- 1426 papers, 19Mb
- bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 803 papers, 13Mb
- Metadata file -- 47Mb
Petermr (discuss • contribs) 09:59, 20 March 2020 (UTC)
Dictionary creation
Am now going to look for other Wikipedia Templates, Categories and Lists which may be relevant to epidemics. Please add others.
Epidemics (C)
https://en.wikipedia.org/wiki/Category:Epidemiology
Anti-herpes (C)
https://en.wikipedia.org/wiki/Category:Anti-herpes_virus_drugs
RNA antivirals
https://en.wikipedia.org/wiki/Template:RNA_antivirals
DNA antivirals
https://en.wikipedia.org/wiki/Template:DNA_antivirals (Yes I know it's an RNA virus)