WikiJournal Preprints/Aggregation of scholarly publications and extracted knowledge on COVID19 and epidemics/Open Notebook

From Wikiversity

WikiJournal Preprints
Open access • Publication charge free • Public peer review

WikiJournal User Group is a publishing group of open-access, free-to-publish, Wikipedia-integrated academic journals.


Article information

Authors: Thomas Shafee[a] , Peter Murray-Rust[b][i] 

  1. La Trobe University
  2. University of Cambridge


This project aims to use modern tools, especially Wikidata (and Wikipedia), R, Java and text mining, together with semantic tools, to create a modern integrated resource of all current published information on viruses and their epidemics. It relies on collaboration and gifts of labour and knowledge.
Availability of resources

Open notebook

Survey of initial [R] package availability

T.Shafee(Evo﹠Evo)talk 09:10, 19 March 2020‎ (UTC)

Missing capabilities

There is currently no ability to write to Wikidata from R. An initial solution is to produce tables in the quickstatements format and use that tool to batch-edit Wikidata (Issue-17a). An alternative is to write a function that deposits via the quickstatements API (I-17b). A final option is to write a function that deposits via the Wikidata API (I-15). T.Shafee(Evo﹠Evo)talk 09:10, 19 March 2020 (UTC)

Closing capability gaps

I've put together a simple as_quickstatement function in wikiPackageTesting.R to format up quickstatements in the "V1" format to paste to the quickstatements tool. T.Shafee(Evo﹠Evo)talk 09:51, 24 March 2020 (UTC)
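For readers unfamiliar with the "V1" format, here is a minimal Java sketch of what such formatting involves. It is not the as_quickstatement function itself (that lives in R); the class and method names are hypothetical. It assumes the documented V1 conventions: one command per line, fields separated by TAB, and string values wrapped in double quotes. Q4115189 is the Wikidata Sandbox item and Q1 is only a placeholder value.

```java
import java.util.ArrayList;
import java.util.List;

public class QuickStatementsV1 {

    /** Wrap a plain string in double quotes, as V1 expects for string values. */
    public static String quote(String text) {
        return "\"" + text + "\"";
    }

    /** One V1 command: item, property and value separated by TAB characters. */
    public static String statement(String item, String property, String value) {
        return String.join("\t", item, property, value);
    }

    /** A batch is one command per line. */
    public static String batch(List<String> commands) {
        return String.join("\n", commands);
    }

    public static void main(String[] args) {
        List<String> commands = new ArrayList<>();
        // Q4115189 is the Wikidata Sandbox item; Q1 is only a placeholder value.
        commands.add(statement("Q4115189", "P921", "Q1"));
        // "Len" sets the English label; string values must be quoted.
        commands.add(statement("Q4115189", "Len", quote("Label set from a sketch")));
        // Printed text can be pasted into the quickstatements tool.
        System.out.println(batch(commands));
    }
}
```

The sketch does not escape double quotes inside labels, so real titles would need extra handling.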

I've also made some suggestions to the WikidataR repository (Pull-32 and P-33), since they seem of general enough use that integration with that package would be useful. 10:20, 24 March 2020 (UTC)

I now have a simple function working for submitting quickstatements from [R] (as a tibble for reading in [R], as V1 format for pasting into the website, and as a direct API submission). There seems to be a current bug in qualifier submission via the API (seen when comparing the same submission via the API and via the website), and the CREATE and LAST features still need to be implemented. I'll continue working on those at I-15. This should allow us to, e.g., write "main subject" (P921) to Wikidata items of articles. Having got the hang of the quickstatements API, the general Wikidata API seems less intimidating, so I'll also look into whether it is useful for any cases. T.Shafee(Evo﹠Evo)talk 06:03, 5 April 2020 (UTC)
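To make the CREATE, LAST and qualifier features concrete, the following Java sketch shows the V1 text they correspond to. It is an illustration under assumptions, not the [R] implementation: the class and method names are hypothetical, and the qualifier property (P1545) and the value Q1 are placeholders. In V1, CREATE makes a new item, LAST refers to the item just created, and qualifiers are appended to a command as further TAB-separated property and value pairs.

```java
import java.util.ArrayList;
import java.util.List;

public class QuickStatementsCreate {

    /** A claim is a property and a value separated by a TAB. */
    public static String claim(String property, String value) {
        return property + "\t" + value;
    }

    /** Append one qualifier (property and value) to an existing V1 command. */
    public static String withQualifier(String command, String property, String value) {
        return command + "\t" + property + "\t" + value;
    }

    /** Commands that create an item, label it in English and attach claims to it. */
    public static List<String> createItem(String englishLabel, String... claims) {
        List<String> commands = new ArrayList<>();
        commands.add("CREATE");
        // LAST refers to the item made by the preceding CREATE.
        commands.add("LAST\tLen\t\"" + englishLabel + "\"");
        for (String c : claims) {
            commands.add("LAST\t" + c);
        }
        return commands;
    }

    public static void main(String[] args) {
        // Placeholder label and value; P921 is "main subject" as in the note above.
        List<String> commands = createItem("Hypothetical article title", claim("P921", "Q1"));
        // Placeholder qualifier, added to the P921 command (index 2).
        commands.set(2, withQualifier(commands.get(2), "P1545", "\"1\""));
        System.out.println(String.join("\n", commands));
    }
}
```

Keeping qualifiers as a separate step makes it easy to submit the same batch with and without them when chasing the API discrepancy.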

Creating Dictionaries

From Wikipedia Templates

Have created a first version of a tool to create dictionaries from Wikipedia Templates (I-21). See Petermr (discuss • contribs) 09:56, 20 March 2020 (UTC)

Currently as a test. This code will convert the three Templates to structured dictionaries with Wikidata IDs. They will need editing to remove false positives and glitches.

	// Test: build one dictionary per Wikipedia Template.
	public void testCreateFromMediawikiTemplateListURL() {
		String[] templates = {"Baltimore_(virus_classification)", "Antiretroviral_drug", "Virus_topics"};
		String fileTop = "/Users/pm286/projects/openVirus/dictionaries/"; // output directory
		String wptype = "mwk"; // same value as --wptype on the command line
		for (String template : templates) {
			createFromMediaWikiTemplate(template, fileTop, wptype);
		}
	}

Results in

Currently turning this into a commandline. Testing:

		// Build the argument string, then hand it to the dictionary tool.
		String args =
				"create " +
		       " --directory " + "/Users/pm286/projects/openVirus/dictionaries/" +
		       " --outformats html,xml " +
		       " --template " + "Baltimore_(virus_classification) Antiretroviral_drug Virus_topics" +
		       " --wptype " + "mwk" +
		       " --wikilinks wikidata";
		new AMIDictionaryTool().runCommands(args);

which translates to commandline:

ami-dictionary -p covid2020 create --directory /Users/pm286/projects/openVirus/dictionaries \
   --outformats html,xml  --template Baltimore_(virus_classification) Antiretroviral_drug Virus_topics \
   --wptype mwk --wikilinks wikidata

You may have to quote or escape the parentheses '(...)' in your shell.

Recently ran these command lines (note the quotes `'...'` around the parentheses `(...)`):

ami-dictionary -p . create --directory /Users/pm286/projects/openVirus/dictionaries/ --outformats html,xml --template Antiretroviral_drug --wptype mwk --wikilinks wikidata
ami-dictionary -p . create --directory /Users/pm286/projects/openVirus/dictionaries/ --outformats html,xml --template 'Baltimore_(virus_classification)' --wptype mwk --wikilinks wikidata
ami-dictionary -p . create --directory /Users/pm286/projects/openVirus/dictionaries/ --outformats html,xml --template Virus_topics --wptype mwk --wikilinks wikidata

(Am working on running multiple Templates in one job, but this is not yet working.)
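Until a single job handles several Templates, one job per Template can be scripted. The Java sketch below builds the same argument lists as the three command lines above, using only flags that appear in this notebook; the class and method names are hypothetical. Because ProcessBuilder passes each argument to the program directly rather than through a shell, the parentheses in the Template name should not need quoting on Unix-like systems. The `run` method assumes `ami-dictionary` is on the PATH and is not called from `main`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AmiDictionaryJobs {

    /** Argument list for one Template, mirroring the command lines in this notebook. */
    public static List<String> commandFor(String template, String directory) {
        return new ArrayList<>(Arrays.asList(
                "ami-dictionary", "-p", ".", "create",
                "--directory", directory,
                "--outformats", "html,xml",
                "--template", template,
                "--wptype", "mwk",
                "--wikilinks", "wikidata"));
    }

    /** Run one job and return its exit code; assumes ami-dictionary is installed. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        String directory = "/Users/pm286/projects/openVirus/dictionaries/";
        String[] templates = {
            "Antiretroviral_drug", "Baltimore_(virus_classification)", "Virus_topics"};
        for (String template : templates) {
            // Print rather than run, so the sketch works without ami-dictionary.
            System.out.println(String.join(" ", commandFor(template, directory)));
        }
    }
}
```

To execute the jobs, replace the print statement with a call to `run(commandFor(template, directory))` and check the returned exit code.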

From Wikidata queries

Use of the Wikidata SPARQL query service would also be a logical way to build additional dictionaries (will prep a process via WikidataQueryServiceR).
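As a rough illustration of the query route (in Java rather than via WikidataQueryServiceR), the sketch below builds a SPARQL query for all items that are an instance of a given class, with their English labels, and turns it into a request for the public endpoint at query.wikidata.org. The class and method names are hypothetical, the User-Agent string is a placeholder, and the class ID used in `main` is only an example to be replaced by whichever class a dictionary should cover. It assumes Java 11 or later.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class WikidataDictionaryQuery {

    static final String ENDPOINT = "https://query.wikidata.org/sparql";

    /** SPARQL for items that are instances (P31) of a class, with English labels. */
    public static String instancesOf(String classQid, int limit) {
        return "SELECT ?item ?itemLabel WHERE { "
                + "?item wdt:P31 wd:" + classQid + " . "
                + "?item rdfs:label ?itemLabel . "
                + "FILTER(LANG(?itemLabel) = \"en\") } "
                + "LIMIT " + limit;
    }

    /** URL-encode the query and attach it to the endpoint. */
    public static URI requestUri(String sparql) {
        return URI.create(ENDPOINT + "?format=json&query="
                + URLEncoder.encode(sparql, StandardCharsets.UTF_8));
    }

    /** Send the query and return the raw JSON response (needs network access). */
    public static String fetch(String sparql) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(requestUri(sparql))
                .header("Accept", "application/sparql-results+json")
                .header("User-Agent", "openVirus-notebook-sketch/0.1") // placeholder
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        // Example class ID only; check it on Wikidata before relying on it.
        String sparql = instancesOf("Q12136", 10);
        System.out.println(sparql);
        System.out.println(requestUri(sparql));
    }
}
```

`main` only prints the query and the request URL; calling `fetch` would return JSON whose item IDs and labels could then be written out as dictionary entries.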

CORD19 dataset

overview from Sebastian Kohlmeier

Sebastian Kohlmeier, Fri, Mar 20, 3:00 PM, to t.shafee, pm286

Hi Thomas and Peter,

Great to meet and fantastic to hear that you are planning to use the CORD-19 dataset! I would recommend taking a look at the licensing information that we've published in case you have any questions, but given that your use case is to disseminate the data for research purposes in an open dataset I don't see any issues (I would recommend referring back to our license and retaining the original licensing information to ensure that users of your database are aware).

By the way, we are actively working to expand the dataset and expect to more than double the size of the full text dataset in today's release (we've received all coronavirus-related full text articles from Elsevier). Going forward we will continue to further invest in the resource to add more content and supplemental functionality (e.g. biomedical entities, citation links, etc.). Once you get openVirus up and running we should discuss how information could potentially flow back into CORD-19 or if there are other ways we can collaborate!

Thanks again and please let us know if you have any questions!

licensing and the right to mine

Formally there may be restrictions on mining copyrighted material, but in the UK we fought for and won the right in 2014 to text mine, without explicit permission, anything we have legal right to read. As a UK citizen resident in the UK I am exercising this right. I will therefore expose all of the text I mine and all of the results. This is the only consistent approach for Open Notebook Science.

analysis of CORD-19

Have looked at overview and downloaded some of the material from

Latest release contains papers up until 2020-03-20 with over 29,000 full text articles. (Changelog from previous release.)

  • Commercial use subset (includes PMC content) -- 9118 full text (new: 128), 183Mb
  • Non-commercial use subset (includes PMC content) -- 2353 full text (new: 385), 41Mb
  • Custom license subset -- 16959 full text (new: 15533), 345Mb
  • bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 885 full text (new: 110), 14Mb
  • Metadata file -- 60Mb
  • Readme

Each paper is represented as a single JSON object. The schema is available here.

Immediate reaction:

  • Commercial and non-commercial subsets (PMC content). I have been mining EuropePMC for years, so may continue with that.
  • Custom licence subset. This will be largely new and I'll try to look today.
      - this is called the PMC subset (pmc_custom_license.tar.gz) but I can't find a PMCID in the first article.
      - the metadata section gives title and authors but no place of publication.
  • bioRxiv / medRxiv. Useful. I have written a scraper for these but it's slow. Will compare my subset with these. Interestingly they are in JSON, whereas I would expect JATS. Maybe this is to make it more easily mineable? I've developed my AMI system for JATS.
  • Metadata. Not sure whether this duplicates what's in the paper or is additional. Separating metadata can be a technical problem.

earlier release

Latest release contains papers up until 2020-03-13 with over 13,000 full text articles.

  • Commercial use subset (includes PMC content) -- 9000 papers, 186Mb
  • Non-commercial use subset (includes PMC content) -- 1973 papers, 36Mb
  • PMC custom license subset -- 1426 papers, 19Mb
  • bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 803 papers, 13Mb
  • Metadata file -- 47Mb

Petermr (discuss • contribs) 09:59, 20 March 2020 (UTC)
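As a quick cross-check of the two release summaries above, the per-subset counts can be summed and compared with the headline figures ("over 13,000" and "over 29,000"). The Java sketch below does only that; the class and method names are hypothetical and the numbers are copied from the summaries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Cord19ReleaseTotals {

    /** Sum the per-subset full-text counts of one release. */
    public static int total(Map<String, Integer> subsets) {
        int sum = 0;
        for (int count : subsets.values()) {
            sum += count;
        }
        return sum;
    }

    /** Counts quoted for the release covering papers up to 2020-03-13. */
    public static Map<String, Integer> release20200313() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("commercial", 9000);
        m.put("non-commercial", 1973);
        m.put("custom license", 1426);
        m.put("bioRxiv/medRxiv", 803);
        return m;
    }

    /** Counts quoted for the release covering papers up to 2020-03-20. */
    public static Map<String, Integer> release20200320() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("commercial", 9118);
        m.put("non-commercial", 2353);
        m.put("custom license", 16959);
        m.put("bioRxiv/medRxiv", 885);
        return m;
    }

    public static void main(String[] args) {
        System.out.println("2020-03-13 total: " + total(release20200313()));
        System.out.println("2020-03-20 total: " + total(release20200320()));
    }
}
```

The totals come to 13202 and 29315, consistent with both headline figures; most of the growth is in the custom license subset.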

Dictionary creation

Am now going to look for other Wikipedia Templates, Categories and Lists which may be relevant to epidemics. Please add others.

Epidemics (C)

Anti-herpes (C)

RNA antivirals

DNA antivirals (Yes I know it's an RNA virus)