Interview with Felipe Ortega

Posted: July 21, 2010 at 9:46 am  |  By: julianabrunello

As a computer scientist and engineer, you could have chosen among many different objects of study. Why Wikipedia?
For 3 reasons:
  • Firstly, at that time (2005) it was very clear to me that Wikipedia was a new Internet phenomenon, a flagship initiative, one that would play a key role in the Open Movements arena, just like Linux did back in the 90s.

  • Secondly, Wikipedia was creating a vast compilation of activity records from millions of editors. It had the potential to become one of the largest online communities in the world, which made it unique and, at the same time, a great challenge to analyze.

  • Finally, because of its goal. We all know that compiling all human knowledge is a very ambitious goal. But even in its current state (still a long way from completing its mission) Wikipedia has proven to be valuable for hundreds of millions of people. Its content adheres to the definition of free cultural works [ http://freedomdefined.org/Definition ]. It's a perfect example of the advantages of this totally open model for knowledge production. So, I thought it deserved my attention to explain why this approach works in practice, defying our classical preconceived models for collaborative production.

What gave you the idea to develop the WikiXRay tool?
From Eric S. Raymond's "The Cathedral and the Bazaar": "Every good work of software starts by scratching a developer's personal itch".

I don't think that my tool (still in "beta" state after 3 years) can really be considered a "good work of software" yet. But it did start from a personal itch: the lack of a reliable tool to parse the data dumps published by WMF. I think I tried 3 or 4 tools, and mwdumper was the best one, but I constantly got errors parsing the huge English Wikipedia dump.

Thus, I thought carefully about my options and it was pretty clear that, at least, I should attempt to build my own tool if I ever wanted to do serious work with those data.

In addition to this, reproducibility was (and still is) an obsession for me. Too frequently, we find yet another quantitative analysis of a certain set of FLOSS projects, online communities, etc., and you cannot reproduce it, validate it or learn from its approach for the simple reason that the source code is not available anywhere! So, the best option for me was to code libre software, under the GPL, and let others freely inspect, adapt, use and distribute it.

What was the most difficult thing in developing this tool?
There were many difficult problems to solve, but I think the most complicated one was to build a parser with decent performance. Parsing huge XML files was a little bit tricky, since you cannot store the whole thing in memory (more than 2 TB is out of the question).
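As a rough illustration of the constraint described above (this is not WikiXRay's actual code), the sketch below streams a dump-like XML document with Python's standard `xml.etree.ElementTree.iterparse` and discards each element once it has been processed, so memory use does not grow with the full text of the file. The function name `count_revisions` and the tiny `SAMPLE` document are assumptions made for the example; real MediaWiki dumps carry many more fields.

```python
import io
import xml.etree.ElementTree as ET


def count_revisions(stream):
    """Yield (title, revision_count) for each <page>, one page at a time."""
    title, revisions = None, 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop an XML namespace, if any
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            revisions += 1
            elem.clear()  # release the revision text once it is counted
        elif tag == "page":
            yield title, revisions
            title, revisions = None, 0
            # Empties the page; the root still keeps an empty shell per page,
            # which a production parser would also detach.
            elem.clear()


# Hypothetical, heavily simplified stand-in for a dump file.
SAMPLE = b"""<mediawiki>
  <page><title>Alpha</title>
    <revision><id>1</id><text>first</text></revision>
    <revision><id>2</id><text>second</text></revision>
  </page>
  <page><title>Beta</title>
    <revision><id>3</id><text>only</text></revision>
  </page>
</mediawiki>"""

result = list(count_revisions(io.BytesIO(SAMPLE)))
print(result)
```

In practice the stream would be an open (possibly decompressed) dump file rather than an in-memory buffer, and each page's records would be written to a database instead of being collected in a list.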

Interestingly, last month I found a new parser (also libre software) that apparently outperforms mine. But that's also the good side of libre software: now I can try to adapt it for my own code :). In any case, I'm very happy that WikiXRay is still one of the best options out there to analyze Wikipedia dumps.

How do you see WikiXRay being used in future research? Can it be used on other platforms as well?
Well, right now it's a cross-platform tool, and I've heard of people using it on Windows, Linux and Mac OS without problems. All dependencies (MySQL, GNU-R and Python) are available for these platforms. This is great when you're trying to build something useful for a broad audience. It's a good starting point.

However, right now it only works for MediaWiki dumps. In the future, I'd love to have alternative parsers for other wikis (Tiki Wiki, DokuWiki, MoinMoin... well, I cannot mention them all!). The advantage is that the analysis modules are independent of the type of platform analyzed, as long as you store the info using the same data model.

Other ideas could be:
  • Feeding a web interface to visualize the current state of your wiki (community, activity, trends). This could be great as a service for large wiki communities, like Wikia, or even for enterprise wikis.

  • Adding support for seasonality analysis, trends and forecasting (something I'd love to work on as soon as we find time and funding :-) ).

  • Integrating additional perspectives like: Social Network Analysis, effort analysis and patterns, co-authorship, forecasting/identifying prospective top-quality articles...

You have stated in your WikiSym 2010 summary that “the need to find solutions for social scientists and engineers to work together in interdisciplinary groups, is probably one of the top-priority issues in [your] research agenda.” How do you envision it? Any concrete plans already?
This is absolutely right, and it's a must for research on virtual communities today. If we just focus on numbers, trends, activity patterns, etc. and we overlook the social side of the story, we're missing, in practice, half of the picture. We will never understand virtual communities completely.

I'd like to explore why there seem to be so many difficulties in creating interdisciplinary working teams (tech sciences + social sciences). Admittedly, we may "speak" and "interpret" things in slightly different ways. But we must overcome these differences, since they are not the problem but the *asset* when we build this kind of team.

We don't have concrete plans for WikiSym 2011, yet. But I'd love to have a panel where researchers from both "worlds" can sit around a table with the audience and debate on best practices for interdisciplinary teams to become a reality.

What are your predictions concerning the future of Wikipedia and its influence?
[Smiling] You know, last time I answered this question, there was a heated controversy, so I tend to be cautious (even though subsequent research and reports have surfaced some of the key problems we had already identified).

My impression is that Wikipedia's influence will keep on growing, especially in developing countries, as Wikipedias with fewer articles attract more contributions and expand their coverage. At the same time, we need to find new ways of weaving edits from academics and scholars together with the contributions from the large existing community, to address the problem of creating content in very specific niches of knowledge. This also involves spreading the word on how to use Wikipedia effectively among students and scholars alike, and eliminating widespread FUD among many faculty members who still think that Wikipedia is just "a perfect source for students to avoid doing the hard work".

Finally, I think we still have to find "new ways of using Wikipedia". Many people use it as an encyclopedia, right. But we can also see it as a source for information contextualization and categorization, for creating thesauri, for translation... The longer the list, the better we will exploit the many possibilities of this "everyday partner".

Anything else you would like to add? Comments, ideas, thoughts?
If I can make a call, I would really like to draw the attention of funding entities (private and public foundations, EU government, etc.) to the urgent need to invest in research on virtual communities. In our own research group, we have spent several years in this research line with very little support, but with great results and outreach so far. NSF is funding 7 or 8 projects on virtual communities and open collaboration in the USA, while the EU is somewhat lagging behind, in my opinion.

Not only Wikipedia, but virtual communities in general are a core piece of the Information Society and the so-called "Future Internet". If we just focus on technology but forget about "people using technology", we may lose an important perspective: that this Information Society should be user-centered, not technology-centered. There must be a serious effort to fund research lines to understand this reality, creating interdisciplinary teams to face up to the challenge.

WikiXRay and Statistics on Wikipedia

Posted: March 22, 2010 at 12:13 pm  |  By: julianabrunello

A review of Felipe Ortega's PhD thesis 'Wikipedia: a quantitative analysis'. Felipe Ortega's research - Wikipedia: A quantitative analysis - is a well-organized, plainly structured piece of formal scientific work. It was with this thesis that he earned, in 2009, a PhD degree in Computer Sciences from the Universidad Rey Juan Carlos (Spain). He has been a Researcher and Project Manager at Libresoft at the same university since 2007. Ortega presents two clear objectives: First, he wants to comparatively analyze the top ten versions of Wikipedia, based on the official number of articles in each one. He will then try to identify some characteristic patterns which should make it possible to understand the ways Wikipedia and its community function. Second, he wants to contribute new software capable of such an analysis, which could also be used by other researchers in the field...