Getting stats out of Wikiversity XML dumps

Someone was wondering how many users edit talk pages but not main namespace articles, so I wrote a couple of scripts. Nothing serious, but here they are in case someone finds this sort of thing fun. Linux only, no Windows, sorry... They're not efficient or pretty either.

Go get the dump first, at http://download.wikimedia.org/backup-index.html, and look for the first (most recent) entry with enwikiversity in it.

When you get to the subpage, you want the full archive, bz2 or 7z depending on what utilities you have lying around. These are the ones that say "All pages with complete edit history". Uncompress it and you're ready to get crunching.
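
For the bz2 flavour, grabbing and uncompressing it looks something like this (the exact URL and filename come from the dump page; the 20070903 ones below are just the ones used in the rest of this page):

wget http://download.wikimedia.org/enwikiversity/20070903/enwikiversity-20070903-pages-meta-history.xml.bz2
bunzip2 enwikiversity-20070903-pages-meta-history.xml.bz2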


First I wanted to get the entire list of users out.

grep '<username' enwikiversity-20070903-pages-meta-history.xml > users.txt
# uniq first, because the file is otherwise slow and big in the sort
cat users.txt | uniq > users-1.txt
cat users-1.txt | sort | uniq > users-uniq.txt
cat users-uniq.txt | sed -e 's/^\s*<username>//; s/<\/username>//' > userlist.txt
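
If you want a quick sanity check, count the distinct users and eyeball the first few:

wc -l userlist.txt
head userlist.txt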

Next I wanted information on each revision in the xml dump: username, title, namespace.

cat enwikiversity-20070903-pages-meta-history.xml  | ./versity-xml.pl    > out
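
(The cat isn't strictly needed; ./versity-xml.pl < enwikiversity-20070903-pages-meta-history.xml > out does the same thing, since the script just reads standard input.)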

Here's the script:

#!/usr/bin/perl
# write out user, namespace, title for each revision

use strict;
use warnings;

binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");

# titles we care about later on:
#   Wikiversity:Help desk
#   Wikiversity talk:Help desk
#   User talk:...

my ($user, $ns, $title) = ("", "", "");

while (<STDIN>) {
    my $line = $_;
    if ($line =~ /<page>/) {
        # new page: forget everything from the previous one
        $user  = "";
        $ns    = "";
        $title = "";
    }
    elsif ($line =~ /<title>([^:]+):(.+)<\/title>/) {
        # a colon means the bit before it is the namespace
        $ns    = $1;
        $title = $2;
    }
    elsif ($line =~ /<title>(.*)<\/title>/) {
        # no colon: main namespace
        $ns    = "main";
        $title = $1;
    }
    elsif ($line =~ /<username>(.*)<\/username>/) {
        $user = $1;
    }
    elsif ($line =~ /<revision>/) {
        # new revision: forget the previous contributor
        $user = "";
    }
    elsif ($line =~ /<\/revision>/) {
        # one output line per revision, unless we found nothing at all
        my $out = "$user\t$ns\t$title\n";
        print $out if $out !~ /^\t\t$/;
    }
}

Hey, didn't I say it wouldn't be pretty?

What you need to know to make the script make sense is the structure of the XML dump. If you look at one it has this sort of stuff in it:

  <page>
    <title>User:Cormaggio</title>
    <id>2</id>
    <revision>
      <id>4</id>
      <timestamp>2006-08-15T08:19:38Z</timestamp>
      <contributor>
        <username>Cormaggio</username>
        <id>8</id>
      </contributor>
      <comment>greetings :-)</comment>
      <text xml:space="preserve">Hello all, it's great to finally have Wikiversity up and running! I'm so looking forward to working on this project - but am pretty busy over the next month with my dissertation (about Wikiversity ;-)). I'll be happy to answer any questions about the project - I've been pretty active in getting this project started. Looking forward to working with you! [[User:Cormaggio|Cormaggio]] 08:19, 15 August 2006 (UTC)</text>
    </revision>
    <revision>
      ...
  </page>

IP address contributors are tagged with <ip>blot.blot.blot.blot</ip> instead of <username>.
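
If you're curious, a rough count of registered versus anonymous revisions is one grep away. These count matching lines, so they assume one tag per line, which is how the dump above looks:

grep -c '<username' enwikiversity-20070903-pages-meta-history.xml
grep -c '<ip' enwikiversity-20070903-pages-meta-history.xml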

You can see that the namespace is separated from the title of the page by a colon ':', and if there is no colon then the article is in the main namespace. (The script's colon check is naive: a main-namespace title that happens to contain a colon will get filed under a bogus namespace, but that's rare enough here not to matter.)
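
So, for the sample page above, the script emits one tab-separated line per revision, like:

Cormaggio	User	Cormaggio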

I only want one entry per user per page for this next bit...

# again, uniq first, because the file is otherwise slow and big in the sort
cat out | uniq > out-1.txt
cat out-1.txt | sort | uniq > out-uniq.txt

Now collect all the users that edited either the Help desk or User talk pages:

grep Wikiversity out-uniq.txt | grep Help | grep desk > these.txt
grep 'Wikiversity talk' out-uniq.txt | grep Help | grep desk >> these.txt
grep 'User talk' out-uniq.txt >> these.txt
cat these.txt | awk -F'\t' '{ print $1 }' | sort | uniq > possible

Get some numbers...

./check-these.sh

Here's the script for that:

#!/bin/bash
# per user: edits outside Help desk / User pages, then Help desk edits, then user talk edits
check_user () {
    echo "$1" >> check2
    echo "non-user_talk, helpdesk edits" >> check2
    grep "$1" out | grep -v 'Help desk' | grep -v 'User talk' | grep -v User | wc -l >> check2
    echo "helpdesk edits" >> check2
    grep "$1" out | grep 'Help desk' | wc -l >> check2
    echo "user talk edits" >> check2
    grep "$1" out | grep 'User talk' | wc -l >> check2
}

check_user 'some user name here'
check_user 'some other user name here'

and so on, one check_user line per name out of the "possible" list.
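
Or skip the copy-and-paste and loop over the file directly; a minimal sketch, assuming one name per line in "possible":

while IFS= read -r i; do
    check_user "$i"
done < possible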

The output, such as it is, looks like:

first_user_name
non-user_talk, helpdesk edits
0
helpdesk edits
1
user talk edits
0
second_user_name
non-user_talk, helpdesk edits
0
helpdesk edits
2
user talk edits
0

and that's it. There weren't so many folks who had edited the help desk or user talk pages in the past year, and we didn't really care about looking at earlier data, so once we had this file we could inspect it by hand to see any useful trends.
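
If you wanted to attack the original question head on, out-uniq.txt already has everything needed. Here's a rough awk sketch that lists users who edited only talk pages (it lumps all the talk namespaces together, and it's only as accurate as the colon heuristic above):

awk -F'\t' 'tolower($2) ~ /talk/ { t[$1]++; next } { o[$1]++ }
    END { for (u in t) if (!(u in o)) print u }' out-uniq.txt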