User:OpenScientist/Open grant writing/Wissenswert 2011/Documentation/Crawler
From Wikiversity
Purpose
This tool shall regularly crawl open access repositories for articles containing supplementary materials that have not yet been uploaded to Wikimedia Commons.
Example workflow
The workflow below takes the Open Access Subset of PubMed Central as an example (see its technical background).
Repeat at regular intervals
- Fetch the file list (CSV; also available as a plain-text version)
- Find articles that have changed since the last update
- Process each new article
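The update step above can be sketched roughly as follows. This is only an illustration, not the crawler's actual code: the column names `Accession ID` and `Last Updated`, the timestamp format, and the function name `changed_since` are assumptions and would need to be adapted to the header of the real file list.

```python
import csv
import io
from datetime import datetime

# Assumed column names and timestamp format; adapt to the real file list.
ID_COLUMN = "Accession ID"
UPDATED_COLUMN = "Last Updated"
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def changed_since(file_list_csv, last_run):
    """Return the file list rows whose timestamp is later than last_run."""
    changed = []
    for row in csv.DictReader(io.StringIO(file_list_csv)):
        try:
            updated = datetime.strptime(row[UPDATED_COLUMN], STAMP_FORMAT)
        except (KeyError, ValueError, TypeError):
            # No parsable timestamp: treat the article as changed to be safe.
            changed.append(row)
            continue
        if updated > last_run:
            changed.append(row)
    return changed
```

Storing the time of the previous run and passing it as `last_run` keeps each update limited to articles that were added or modified in the meantime.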
XML files
Each archive is about 1 GB in size. Download them only for the initial run, not at every update.
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz
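Since the archives are large, it may be preferable to read the article XML directly from a downloaded archive rather than unpacking everything to disk. A minimal sketch using Python's standard `tarfile` module is shown here; the function name `iter_article_xml` and the file suffixes `.nxml` and `.xml` are assumptions about how the articles are stored inside the archives.

```python
import tarfile


def iter_article_xml(archive_path, suffixes=(".nxml", ".xml")):
    """Yield (member name, raw XML bytes) for each article file in the archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not an article XML file.
            if not member.isfile() or not member.name.endswith(suffixes):
                continue
            handle = archive.extractfile(member)
            if handle is not None:
                yield member.name, handle.read()
```

Each yielded byte string can then be handed to an XML parser one article at a time.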
ID list
The ID list is not strictly necessary, but it could be used for cross-checking in case of problems or during updates.
Find articles with supplementary materials
- Check XML files for articles with supplementary files (e.g. <xref ref-type="supplementary-material" rid="pone.0000133.s002">Movie S1</xref> and <xref ref-type="supplementary-material" rid="pone.0000133.s001">Audio S1</xref>)
- If such exist,
- Check supplementary materials for
- video (e.g. <supplementary-material id="pone.0000133.s002" mimetype="video/quicktime" xlink:href="info:doi/10.1371/journal.pone.0000133.s002" position="float" xlink:type="simple"> <label>Movie S1</label> <caption> <p>Movie showing the black smoker acoustic recording system deployed at the Sully vent in September 2004 with audio from the same deployment. The audio and video are not contemporaneous because the remotely-operated vehicle carrying the video camera generated excessive noise. The video is included to provide context for the audio. The audio has been upsampled to 8 kHz, and high-pass filtered at 10 Hz using a 4-pole Butterworth filter. It is played in real-time (i.e. without time stretching or pitch shifting). Because much of the acoustic energy falls below ∼100 Hz, speakers with good bass response are required to properly reproduce the sound. Most laptop speakers will not produce sound.</p> <p>(9.58 MB MOV)</p> </caption> </supplementary-material>)
- or audio (e.g. <supplementary-material id="pone.0000133.s001" mimetype="audio/x-wav" xlink:href="info:doi/10.1371/journal.pone.0000133.s001" position="float" xlink:type="simple"> <label>Audio S1</label> <caption> <p>Audio file containing a short section of sound collected with the black smoker acoustic recording system at Puffer in September 2005. The audio has been upsampled to 8 kHz, and high-pass filtered at 10 Hz using a 4-pole Butterworth filter. It is played in real-time (i.e. without time stretching or pitch shifting). Because much of the acoustic energy falls below ∼100 Hz, speakers with good bass response are required to properly reproduce the sound. Most laptop speakers will not produce sound.</p> <p>(0.96 MB WAV)</p> </caption> </supplementary-material>) files
- If such exist,
- grep PMCID (e.g. <article-id pub-id-type="pmc">1762412</article-id>)
- prefix the relative URL given in xlink:href with "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC<PMCID>/bin/" (where <PMCID> is the number found in the previous step) to download the files under their "supplementary-material id", e.g.
- http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1762412/bin/pone.0000133.s002.mov (suffixes added according to MIME type or the final line in the caption)
- and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1762412/bin/pone.0000133.s001.wav.
- store metadata for further processing.
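The steps above could be sketched along the following lines, using only Python's standard library. This is an illustration under stated assumptions rather than the crawler's actual implementation: the function name `find_media` and the `SUFFIXES` table are hypothetical, the file name is built from the "supplementary-material id" as in the example URLs, and the XML is assumed to declare the xlink namespace.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
MEDIA_PREFIXES = ("video/", "audio/")
BASE_URL = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmcid}/bin/{name}"
# Assumed MIME type to suffix mapping; extend as further formats turn up.
SUFFIXES = {"video/quicktime": ".mov", "audio/x-wav": ".wav"}


def find_media(article_xml):
    """Return one metadata dict per audio or video supplement in an article."""
    root = ET.fromstring(article_xml)
    pmcid = None
    for node in root.iter("article-id"):
        if node.get("pub-id-type") == "pmc" and node.text:
            pmcid = node.text.strip()
            break
    if pmcid is None:
        return []
    results = []
    for node in root.iter("supplementary-material"):
        mimetype = node.get("mimetype", "")
        if not mimetype.startswith(MEDIA_PREFIXES):
            continue
        # File name = supplementary-material id plus a suffix for the MIME type.
        name = node.get("id", "") + SUFFIXES.get(mimetype, "")
        results.append({
            "pmcid": pmcid,
            "mimetype": mimetype,
            "label": node.findtext("label"),
            "href": node.get(XLINK_HREF),
            "url": BASE_URL.format(pmcid=pmcid, name=name),
        })
    return results
```

Filtering on the declared MIME type alone will miss media files that are declared incorrectly in the XML, so the caption may need to be consulted as well.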
Code
The code for the crawler is being developed within the GitHub project https://github.com/erlehmann/open-access-media-importer.
Example articles
Many files
- doi:10.1371/journal.pone.0000133 - audio and video in supplement
- doi:10.1371/journal.pone.0005929 - audio and video in supplement
- doi:10.1371/journal.pone.0010346 - audio and video in supplement
- doi:10.1371/journal.pone.0000794 - multiple videos in supplement
- doi:10.1371/journal.pone.0008793 - multiple videos in supplement
- doi:10.1371/journal.pone.0010848 - multiple videos in supplement
- doi:10.1371/journal.pone.0018243 - multiple videos in supplement
- doi:10.1371/journal.pone.0020395 - multiple videos in supplement
- doi:10.1371/journal.pone.0025109 - multiple videos in supplement
- doi:10.1371/journal.pone.0027227 - multiple videos in supplement
- doi:10.1186/1472-6785-10-9 - multiple audio files in supplement
Videos with sound
- doi:10.1371/journal.pone.0016128#s5, Movie S1; declared as "text" in XML
- doi:10.1371/journal.pone.0032931#s5, Video S1
- doi:10.1371/journal.pone.0011385#s5, Video S1
File formats
Audio
- WAV, e.g. doi:10.1371/journal.pone.0001580, doi:10.1371/journal.pone.0005915, doi:10.1371/journal.pone.0007808
- MP3, e.g. doi:10.1371/journal.pone.0004065
- OGG: none in PLoS ONE or PLoS Biology so far
Video
- MP4, e.g. doi:10.1371/journal.pone.0011385
- M4V, e.g. doi:10.1371/journal.pone.0013812
- AVI, e.g. doi:10.1371/journal.pone.0003826
- MOV, e.g. doi:10.1371/journal.pone.0004497
- OGV: none in PLoS ONE or PLoS Biology so far
Sources of errors
- doi:10.1371/journal.pone.0005977 has multiple movies labeled as being in MP3 format
- doi:10.1371/journal.pone.0002804 had Figs. 8 and 9 mixed up; the same could happen to supplementary files or their legends. Sometimes (as in this case), a formal correction is published.
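Some labeling errors of this kind could be flagged automatically by comparing the declared MIME type with the format named in the final caption line (e.g. "(9.58 MB MOV)"). The sketch below is one possible heuristic, not part of the existing crawler: the function names, the regular expression, and the two format sets (taken from the formats listed on this page) are assumptions.

```python
import re

# Format names taken from the "File formats" lists; assumed to be exhaustive.
AUDIO_FORMATS = {"wav", "mp3", "ogg"}
VIDEO_FORMATS = {"mp4", "m4v", "avi", "mov", "ogv"}
# Matches a size line such as "(9.58 MB MOV)" and captures the format name.
SIZE_LINE = re.compile(r"\(\s*[\d.]+\s*[KMG]B\s+(\w+)\s*\)")


def caption_format(size_line):
    """Return the lowercased format named in a caption size line, or None."""
    match = SIZE_LINE.search(size_line)
    return match.group(1).lower() if match else None


def looks_mislabeled(mimetype, size_line):
    """Flag a video MIME type with an audio caption format, or vice versa."""
    fmt = caption_format(size_line)
    if fmt is None:
        return False
    major = mimetype.split("/", 1)[0]
    if major == "video":
        return fmt in AUDIO_FORMATS
    if major == "audio":
        return fmt in VIDEO_FORMATS
    return False
```

Flagged files would still need manual review, since the check cannot tell which of the two declarations is the wrong one.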
Potential further targets for crawling
- The World Bank's Open Knowledge Repository (CC BY, with XML version)
- Hindawi: Hindawi XML Corpus Download (mostly CC BY, some CC0)