User:OpenScientist/Open grant writing/Wissenswert 2011/Documentation/Crawler

From Wikiversity

Purpose

This tool shall regularly crawl open access repositories for articles whose supplementary materials have not yet been uploaded to Wikimedia Commons.

Example workflow

This workflow takes the Open Access Subset of PubMed Central as an example (see its technical background).

Repeat at regular intervals

XML files

Each XML file is about 1 GB in size. Download these files only for the initial run, not at every update.
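The "download only once" rule could be sketched as follows. This is a minimal illustration, not the crawler's actual code: the function name `ensure_archive`, the `fetch` callback and the local path are all hypothetical, and the real download step (FTP, HTTP, or otherwise) is deliberately left to the caller.

```python
from pathlib import Path
from typing import Callable


def ensure_archive(path: Path, fetch: Callable[[Path], None]) -> bool:
    """Fetch a large XML archive only if no local copy exists yet.

    `fetch` is a caller-supplied (hypothetical) function that downloads
    the archive and writes it to `path`.
    Returns True if a download was triggered, False if the copy was reused.
    """
    # A non-empty file on disk counts as "already downloaded".
    if path.exists() and path.stat().st_size > 0:
        return False
    # Create the target directory on the initial run.
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(path)
    return True
```

On the initial run the callback is invoked; on later update runs the existing file is reused and nothing is downloaded.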

ID list

The ID list is not strictly necessary, but it could be used for cross-checking in case of problems or for updates.

Find articles with supplementary materials

  • Check XML files for articles with supplementary files (e.g. <xref ref-type="supplementary-material" rid="pone.0000133.s002">Movie S1</xref> and <xref ref-type="supplementary-material" rid="pone.0000133.s001">Audio S1</xref>)
  • If such exist,
    • Check supplementary materials for
      • video (e.g. <supplementary-material id="pone.0000133.s002" mimetype="video/quicktime" xlink:href="info:doi/10.1371/journal.pone.0000133.s002" position="float" xlink:type="simple"> <label>Movie S1</label> <caption> <p>Movie showing the black smoker acoustic recording system deployed at the Sully vent in September 2004 with audio from the same deployment. The audio and video are not contemporaneous because the remotely-operated vehicle carrying the video camera generated excessive noise. The video is included to provide context for the audio. The audio has been upsampled to 8 kHz, and high-pass filtered at 10 Hz using a 4-pole Butterworth filter. It is played in real-time (i.e. without time stretching or pitch shifting). Because much of the acoustic energy falls below ∼100 Hz, speakers with good bass response are required to properly reproduce the sound. Most laptop speakers will not produce sound.</p> <p>(9.58 MB MOV)</p> </caption> </supplementary-material>)
      • or audio files (e.g. <supplementary-material id="pone.0000133.s001" mimetype="audio/x-wav" xlink:href="info:doi/10.1371/journal.pone.0000133.s001" position="float" xlink:type="simple"> <label>Audio S1</label> <caption> <p>Audio file containing a short section of sound collected with the black smoker acoustic recording system at Puffer in September 2005. The audio has been upsampled to 8 kHz, and high-pass filtered at 10 Hz using a 4-pole Butterworth filter. It is played in real-time (i.e. without time stretching or pitch shifting). Because much of the acoustic energy falls below ∼100 Hz, speakers with good bass response are required to properly reproduce the sound. Most laptop speakers will not produce sound.</p> <p>(0.96 MB WAV)</p> </caption> </supplementary-material>)
    • If such exist,
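The checks listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the crawler's actual code: the function name `find_media_supplements` is hypothetical, the article XML is assumed to declare the `xlink` namespace, and the `mimetype` attribute is assumed to look like the examples above (e.g. `video/quicktime`, `audio/x-wav`).

```python
import xml.etree.ElementTree as ET

# Major MIME types of interest, following the examples above.
MEDIA_TYPES = {"video", "audio"}
# ElementTree exposes namespaced attributes as "{namespace-uri}name".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def find_media_supplements(article_xml: str) -> list[dict]:
    """Return the video/audio supplementary materials of one article."""
    root = ET.fromstring(article_xml)
    found = []
    for node in root.iter("supplementary-material"):
        mimetype = node.get("mimetype", "")
        # "video/quicktime" -> "video"; a bare "video" also matches.
        if mimetype.split("/")[0] in MEDIA_TYPES:
            found.append({
                "id": node.get("id"),
                "mimetype": mimetype,
                "href": node.get(XLINK_HREF),
                "label": node.findtext("label"),
            })
    return found
```

An empty result means the article has no audio or video supplements and can be skipped. For the roughly 1 GB XML files, a streaming parser such as `ET.iterparse` would probably be preferable to loading a whole file into memory at once.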


Code

The code for the crawler is being developed in the GitHub project https://github.com/erlehmann/open-access-media-importer.

Example articles

Many files

Videos with sound

File formats

Audio

Video

Sources of errors

Potential further targets for crawling

Blog posts