WikiXRay and Statistics on Wikipedia








A review by Juliana Brunello

Felipe Ortega’s thesis, Wikipedia: A quantitative analysis, is a well-organized, plainly structured piece of formal scientific work. It earned him a PhD in Computer Science from the Universidad Rey Juan Carlos (Spain) in 2009. He has been a researcher and project manager at Libresoft, at the same university, since 2007.

Ortega presents two clear objectives:

First, he wants to compare the top ten language versions of Wikipedia, ranked by the official number of articles in each one. He then tries to identify characteristic patterns that help explain how Wikipedia and its community function.

At the time of his research, the top ten versions were, in this order: English, German, French, Polish, Japanese, Dutch, Italian, Portuguese, Spanish and Swedish (the last has since been surpassed by the Russian version).

Second, he wants to contribute new software capable of such an analysis, which other researchers in the field could also use.

Both objectives are connected: Ortega developed a software tool called WikiXRay, which downloads the complete public database dumps of Wikipedia and automates their quantitative analysis. It works together with GNU R, a statistical package that is FLOSS, like WikiXRay itself. In short, the program downloads the dump, parses it, loads it into a MySQL database, runs a statistical analysis and then presents the resulting graphics and other statistical output. Among other things, WikiXRay analyzes the social structure, the level of inequality and the demography of the authors, as well as their contributions. The author stresses that the software still needs improvement and gives some ideas of what those improvements could be. It is remarkable to see how he applies the principles of collaboration found in FLOSS and Wikipedia to the software he developed, which he in turn uses to analyze one of them (Wikipedia).
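To make the parse, load and analyze steps more concrete, here is a minimal sketch in Python. It is not WikiXRay's code and makes several assumptions of its own: the dump is a made-up, heavily simplified XML snippet, an in-memory SQLite database stands in for MySQL, and the Gini coefficient is used as one common way to express a "level of inequality" (the review does not say which measure the thesis uses). All function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical toy dump; real Wikipedia dumps are far larger and follow
# the MediaWiki export schema, which this snippet does not reproduce.
SAMPLE_DUMP = """
<dump>
  <page title="A">
    <revision author="alice"/>
    <revision author="alice"/>
    <revision author="bob"/>
  </page>
  <page title="B">
    <revision author="alice"/>
    <revision author="carol"/>
    <revision author="alice"/>
  </page>
</dump>
"""


def load_dump(xml_text, conn):
    """Parse the toy dump and store one row per revision."""
    conn.execute("CREATE TABLE revision (page TEXT, author TEXT)")
    root = ET.fromstring(xml_text.strip())
    for page in root.iter("page"):
        for rev in page.iter("revision"):
            conn.execute(
                "INSERT INTO revision VALUES (?, ?)",
                (page.get("title"), rev.get("author")),
            )
    conn.commit()


def revisions_per_author(conn):
    """Return a dict mapping each author to their number of revisions."""
    rows = conn.execute(
        "SELECT author, COUNT(*) FROM revision GROUP BY author ORDER BY author"
    )
    return dict(rows.fetchall())


def gini(values):
    """Gini coefficient: 0 means perfectly equal, values near 1 very unequal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the values in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


conn = sqlite3.connect(":memory:")
load_dump(SAMPLE_DUMP, conn)
counts = revisions_per_author(conn)
print(counts)
print(round(gini(counts.values()), 3))
```

In this toy data one author makes four of the six revisions, so the coefficient is clearly above zero. A real analysis would have to stream the dump rather than hold it in memory.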

At this point I ask myself whether the software is a byproduct of the research or the research is a byproduct of the software.

As one can deduce after reading about WikiXRay, this study is extensive and heavily based on statistics, as most quantitative analyses are, and therefore difficult for the layman to understand. Nevertheless, the methodological explanation is fully consistent with the purpose of the thesis and shows a high degree of accuracy. He also discusses possible problems with some of the data and the solutions he came up with.

Yet Ortega’s dissertation is much more than formulas and statistical results. At the beginning he describes his motivation for analysing Wikipedia, pointing out what makes such a study interesting. He also gives the reader an overview of the Wikipedia project, covering historical, communal and technical aspects. He presents, for instance, the different user levels (for example administrators, stewards and rollbacks) along with some basic policies, such as NPOV (neutral point of view). He also shows some interesting facts, like the costs of hardware and software. The former exceeds 800.00$; the latter, on the other hand, is entirely FLOSS. This information not only helps the reader to put the results of his research in context, but also gives the non-insider something to begin with.

Furthermore, Ortega poses seven questions, which he later answers with the help of WikiXRay:

  • “How does the community of authors in the top ten Wikipedias evolve over time?”
  • “What is the distribution of content and pages in the top ten Wikipedias?”
  • “How does the coordination among authors in the top ten Wikipedias evolve over time?”
  • “Which are the key parameters defining the social structure and stratification of Wikipedia authors?”
  • “What is the average lifetime of Wikipedia volunteer authors in the project?”
  • “Can we identify basic quantitative metrics to describe the reputation of Wikipedia authors and the quality of Wikipedia articles?”
  • “Is it possible to infer, based on previous history data, any sustainability conditions affecting the top ten Wikipedias in due course?”

Just to give you a ‘taste’, here are some of his findings:

  • The number of logged authors has been in a steady state in all language versions since 2006/2007, as has the number of revisions they make.
  • “In some language versions … the share of revisions attributed to bots are by no means negligible.”
  • Bots influence “the share of pages in each namespace and thus the composition of the whole set…”
  • There is, most of the time, a positive correlation between the length of an article and the number of contributors that worked on it.
  • More than 50% of authors abandon the project after 200 days, and over 50% of core authors leave the core after less than 100 days (less than 30 in the English and the Portuguese versions).
  • There might be a threat to quality coming up, as there are fewer contributors reviewing articles.
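The lifetime figures in the list above can be illustrated with a naive sketch in Python. The dates are invented, the function names are hypothetical, and the calculation simply measures the span between an author's first and last edit. It ignores authors who are still active when the data ends, so it should not be read as the method used in the thesis.

```python
from datetime import date
from statistics import median

# Hypothetical (first edit, last edit) dates per author.
EDIT_SPANS = {
    "alice": (date(2006, 1, 1), date(2007, 6, 1)),
    "bob": (date(2006, 3, 1), date(2006, 3, 15)),
    "carol": (date(2006, 5, 1), date(2006, 9, 8)),
}


def lifetimes_in_days(spans):
    """Days between each author's first and last edit."""
    return {author: (last - first).days for author, (first, last) in spans.items()}


def share_gone_within(lifetimes, days):
    """Fraction of authors whose activity span is at most `days` days."""
    gone = sum(1 for d in lifetimes.values() if d <= days)
    return gone / len(lifetimes)


lifetimes = lifetimes_in_days(EDIT_SPANS)
print(lifetimes)
print(median(lifetimes.values()))
print(share_gone_within(lifetimes, 200))
```

A statement such as "more than 50% of authors are gone after 200 days" corresponds to the last value being above 0.5, or equivalently to a median lifetime below 200 days.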

Summing up, I believe that Ortega’s main achievement was to prove that WikiXRay is an effective tool. Some of his conclusions are easy to understand and suitable for readers without a background in statistics. Others require more advanced knowledge of statistics and its terminology. The reader might have some trouble understanding some of the answers, as several of them are not interpreted further and merely describe the graphics supplied by the software. I myself found it difficult to link some of the statistical explanations to their meaning, which may be due to my layman status. Yet even if you are not very fond of math, it is an excellent piece of research that provides a wide range of information about Wikipedia.

Though his thesis is written entirely in English, he appended a summary in Spanish at the end. The thesis hasn’t been published (yet), but it can be downloaded at http://libresoft.es/Members/jfelipe/thesis-wkp-quantanalysis/view. I recommend that the lay reader take a look at Ortega’s ‘5.2 Relevant Conclusions’, which offers a very good and simple summary of the results of his research.