Open Global Health/ContentMine

From Wikiversity
Ya old Content Mine!

This page organizes the work regarding the ContentMine Fellowship awarded to Ale Abdo as part of the Open Global Health effort.

Tag along

Interested in following up or contributing to this project?

Ongoing

Roadmap

The first step is to organize a database of PubMed and conference abstracts relevant to global health issues from recent years (after 2000).

Then, some effort will involve recognizing what kinds of facts are actually interesting and extractable. This will likely involve creating dictionaries and improving the software.

Once we're able to extract facts, the next step is to create an interface that makes facts and their sources easily discoverable. The main idea for now is a tool to assist a person who is looking for an overview and references on a combination of conditions, locations and social aspects.

Following that, check if we can trace how early facts appear in conference abstracts relative to the published literature. If successful, figure out which features of conference abstracts could help us infer their reliability.

Tasks

See below and perhaps some issues at GitLab.

Data

Currently working on the dataset coming from an EPMC search for:

(KW:"global health") AND (PUB_TYPE:"Review" OR PUB_TYPE:"review-article" OR PUB_TYPE:"Meta-Analysis")

The specific command is

node node_modules/getpapers/bin/getpapers.js -q '(KW:"global health") AND (PUB_TYPE:"Review" OR PUB_TYPE:"review-article" OR PUB_TYPE:"Meta-Analysis")' -o data/globalhealth.all -xa

Currently it sometimes breaks getpapers, see this issue.
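If the search needs to be varied (other keywords or publication types), a small helper to assemble the query can save some quoting headaches. This is only a sketch: `build_query` and `getpapers_argv` are names I made up for illustration, and the paths and the `-xa` flags are simply copied from the command above.

```python
import shlex


def build_query(keyword, pub_types):
    """Assemble an EPMC query string in the same shape as the one above."""
    types = " OR ".join(f'PUB_TYPE:"{t}"' for t in pub_types)
    return f'(KW:"{keyword}") AND ({types})'


def getpapers_argv(query, outdir):
    """Argument list for the getpapers call; paths and flags copied from the command above."""
    return [
        "node", "node_modules/getpapers/bin/getpapers.js",
        "-q", query,
        "-o", outdir,
        "-xa",
    ]


query = build_query("global health", ["Review", "review-article", "Meta-Analysis"])
argv = getpapers_argv(query, "data/globalhealth.all")

# Print a shell-quoted version for copy and paste; to run it directly,
# pass argv to subprocess.run(argv, check=True) instead.
print(shlex.join(argv))
```

Passing the arguments as a list avoids having to nest shell quotes around the double-quoted EPMC fields.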

After getting the files, I run norma to produce scholarly HTML with

./contentmine/norma/bin/norma --project data/globalhealth.all --input fulltext.xml --output scholarly.html --transform nlm2html

Finally, I access the data from Python using pycproject, with one quirk, however.
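Independently of pycproject, a quick sanity check on what getpapers and norma left on disk can be done with the standard library alone. The sketch below does not use pycproject's API. It assumes the project folder holds one subfolder per paper, each containing the `fulltext.xml` and `scholarly.html` files named in the norma command above. `ctree_status` is a hypothetical helper name.

```python
from pathlib import Path


def ctree_status(project_dir, expected=("fulltext.xml", "scholarly.html")):
    """Map each paper subfolder to which of the expected files it contains.

    Assumes one subfolder per paper inside project_dir (an assumption about
    the layout, not something checked against pycproject).
    """
    status = {}
    for child in sorted(Path(project_dir).iterdir()):
        if not child.is_dir():
            continue  # skip stray files at the top level of the project
        status[child.name] = {name: (child / name).is_file() for name in expected}
    return status


def missing(status, filename):
    """List the paper folders that lack the given file."""
    return [paper for paper, files in status.items() if not files.get(filename, False)]
```

For example, `missing(ctree_status("data/globalhealth.all"), "scholarly.html")` would list the papers for which the norma step produced no output.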

Dictionary

Turning XML MeSH into a CM-formatted dictionary.

For this I put together mesh2cmdict.py. It automatically downloads and processes MeSH into a CM-compatible dictionary, optionally filtering specific branches of the MeSH hierarchy.
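To give an idea of the conversion, here is a minimal sketch. It is not the actual mesh2cmdict.py, and it skips the download step. It assumes MeSH descriptor records carry their label in `DescriptorName/String` and their hierarchy positions in `TreeNumber` elements, and that a CM dictionary is a `dictionary` element with `entry` children holding `term` and `name` attributes. Both formats should be checked against the real files. The sample records and their identifiers are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample in the assumed shape of a MeSH descriptor file.
SAMPLE_MESH = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>X000001</DescriptorUI>
    <DescriptorName><String>Brazil</String></DescriptorName>
    <TreeNumberList><TreeNumber>Z01.000.001</TreeNumber></TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorUI>X000002</DescriptorUI>
    <DescriptorName><String>Poverty</String></DescriptorName>
    <TreeNumberList><TreeNumber>I01.000.002</TreeNumber></TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def mesh_to_cmdict(mesh_xml, title, branch_prefix=None):
    """Convert MeSH descriptor XML into a CM-style dictionary XML string.

    If branch_prefix is given, keep only descriptors with at least one
    tree number starting with that prefix (e.g. "Z01").
    """
    root = ET.fromstring(mesh_xml)
    dictionary = ET.Element("dictionary", {"title": title})
    for record in root.iter("DescriptorRecord"):
        name = record.findtext("DescriptorName/String")
        if not name:
            continue  # skip records without a usable label
        tree_numbers = [t.text or "" for t in record.iter("TreeNumber")]
        if branch_prefix and not any(t.startswith(branch_prefix) for t in tree_numbers):
            continue  # outside the requested branch of the hierarchy
        ET.SubElement(dictionary, "entry", {"term": name.lower(), "name": name})
    return ET.tostring(dictionary, encoding="unicode")


print(mesh_to_cmdict(SAMPLE_MESH, "places", branch_prefix="Z01"))
```

Filtering on the tree number prefix is what selects a branch, so a places dictionary and a social conditions dictionary can come from the same MeSH file.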

Places dictionary: MeSH is actually kinda poor in terms of places, stopping at country level with only a few 'global' cities (and a very colonialist point of view).

Social condition dictionary: MeSH seems OK, matching for the social branch.

Does CM understand synonyms in a dictionary? If so, how? For this and other issues, see this Discourse thread.

Fact extraction

The syntax to use the dictionary is supposed to be like:

./contentmine/ami/bin/cmine data/gp_globalhealth word\(search\)w.search:contentmine/dicts/desc2017-cmdict.xml

However, it outputs empty results. This bug is being discussed on Discourse.

It turned out the syntax was not only scarcely documented, but also had an error. The right syntax is:

./contentmine/ami/bin/cmine data/gp_globalhealth 'search(file:///srv/lisis-lab/devroot/home/ale/contentmine/dicts/desc2017-cmdict.xml)'

This is still processing here! =)

Interface

So, I'd love for people to interact with the extracted facts in the most useful way. Some ideas...

  • I recently met a doctoral student working in computational linguistics who has an interesting system that infers agreement relationships between a text and a term and, through a search, can display relevant extracts exposing these relationships in a meaningful way for people to interpret. I thought it would be cool to use that in this project if, for example, it can highlight the agreement of papers on a disease, in a certain place, with some form of treatment. I've contacted him about this and am now waiting for a reply.
    • He responded: not directly usable, but he shared some other tips.
  • Another idea: train a word embedding model and enrich abstracts with annotations on words present in MeSH, so one can navigate to other abstracts that use the same word in a similar context. I believe this is feasible, and it's likely what I'm going for.
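The annotation half of that last idea could start from something as simple as the sketch below. It only locates dictionary terms in an abstract and leaves the embedding model aside. `annotate_terms` is a hypothetical name, and whole-word, case-insensitive matching is an assumption about what counts as a "word present in MeSH".

```python
import re


def annotate_terms(text, terms):
    """Return sorted (start, end, term) spans for whole-word, case-insensitive matches."""
    spans = []
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), term))
    return sorted(spans)


abstract = "Malaria control in Brazil: malaria cases fell."
for start, end, term in annotate_terms(abstract, ["malaria", "Brazil"]):
    print(start, end, term, repr(abstract[start:end]))
```

Each span could then be linked to other abstracts where the embedding model places the same term in a similar context.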

Suggestions for future fellowship programs

We were asked to give some directions for a continued fellowship program. Here are mine, in bullet form because I like guns.

  • encourage fellows to document their efforts - their journals, not their code - in a single collaborative environment such as right here, or keep them linked there;
    • could be GitHub as well, each fellow a folder in a single project tree, but my feeling is GitHub is good for code, not so much for organizing collaboration.
  • challenge fellows to standardize how they document stuff; this will make them pay attention to how each other is setting up their environment, but also to what they've been up to;
  • make sure the team responds to issues where they were raised; I've often received an answer through chat to something I had posted on Discourse, which gets in the way of immediate and future reference.
  • better documentation to start with is very much needed; not only better content but also better organization. For each package: installation, usage, behaviour, features.
  • use something non-proprietary for conference calls, like Jitsi or Riot, which are both web-based. Riot could also replace Slack.

Resources

Fellowship application

Fellowship blog posts

GitLab repository