Interview with Felipe Ortega

As a computer scientist and engineer you could have chosen among many different objects of study, why Wikipedia?

For 3 reasons:

  • Firstly, at that time (2005) it was very clear to me that Wikipedia was a new Internet phenomenon, a flagship initiative, one that would play a key role in the Open Movements arena, just like Linux did back in the 90s.
  • Secondly, Wikipedia was creating a vast compilation of activity records from millions of editors. It had the potential to become one of the largest online communities in the world, which made it unique and, at the same time, a great challenge to analyze.
  • Finally, because of its goal. We all know that compiling all human knowledge is a very ambitious goal. But even in its current state (still a long way from completing its mission) Wikipedia has proven valuable to hundreds of millions of people. Its content adheres to the definition of free cultural works. It’s a perfect example of the advantages of this totally open model for knowledge production. So I thought it deserved my attention, to explain why this approach works in practice, defying our classical preconceived models of collaborative production.

What gave you the idea to develop the WikiXRay tool?

From Eric S. Raymond’s “The Cathedral and the Bazaar”: “Every good work of software starts by scratching a developer’s personal itch”.

I don’t think that my tool (still in “beta” state after 3 years) can be
really considered a “good work of software”, yet. But it did start from
a personal itch: the lack of a reliable tool to parse available data
dumps published by WMF. I think I tried 3 or 4 tools, and mwdumper was
the best one, but I constantly got errors parsing the huge English
Wikipedia dump.

Thus, I thought carefully about my options and it was pretty clear that,
at least, I should attempt to build my own tool if I ever wanted to do
serious work with those data.

In addition to this, reproducibility was (and still is) an obsession
for me. Too frequently, we find yet another quantitative analysis of a
certain set of FLOSS projects, online communities, etc., and we cannot
reproduce it, validate it or learn from its approach for the simple
reason that the source code is not available anywhere! So, the best
option for me was to code libre software, under GPL, and let others
freely inspect, adapt, use and distribute it.

What was the most difficult thing in developing this tool?

There were many difficult problems to solve, but I think the most
complicated one was to build a parser with decent performance. Parsing
huge XML files was a little bit tricky, since you cannot store the whole
thing in memory (more than 2 TB is out of the question).
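
To illustrate the general idea of parsing XML without loading it all into memory, here is a minimal sketch in Python. It is not WikiXRay’s actual parser; the function name `count_revisions` and the tiny sample dump are assumptions made up for this example, and only the `page`, `title` and `revision` element names follow the MediaWiki export format.

```python
import io
import xml.etree.ElementTree as ET


def count_revisions(stream):
    """Stream a MediaWiki-style XML dump and count revisions per page title.

    Hypothetical helper for illustration only, not part of WikiXRay.
    """
    counts = {}
    title = None
    revisions = 0
    # iterparse reports each element when its closing tag is read, so we can
    # process and discard pages one by one instead of building the full tree.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            revisions += 1
            elem.clear()  # free the revision text, usually the bulk of a page
        elif tag == "page":
            counts[title] = revisions
            title, revisions = None, 0
            # Discard the finished page subtree. An empty <page> shell stays
            # attached to the root; a parser for real dumps would detach it too.
            elem.clear()
    return counts


# Made-up sample standing in for a real dump file opened in binary mode.
SAMPLE = b"""<mediawiki>
  <page><title>A</title>
    <revision><text>x</text></revision>
    <revision><text>y</text></revision>
  </page>
  <page><title>B</title>
    <revision><text>z</text></revision>
  </page>
</mediawiki>"""

result = count_revisions(io.BytesIO(SAMPLE))
```

With a real dump, the same function could be given a file object opened in binary mode instead of the in-memory sample. Memory use then depends mostly on the largest single page rather than on the size of the whole file.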

Interestingly, last month I found a new parser (also libre software)
that apparently outperforms mine. But that’s also the good side of libre
software. Now I can try to adapt it for my own code :). In any case, I’m
very happy that WikiXRay is still one of the best options out there to
analyze Wikipedia dumps.

How do you see WikiXRay being used in future research? Can it be used on other platforms as well?

Well, right now it’s a cross-platform tool, and I’ve heard of people
using it on Windows, Linux and Mac OS without problems. All dependencies
(MySQL, GNU-R and Python) are available for these platforms. This is
great when you’re trying to build something useful for a broad audience.
It’s a good starting point.

However, right now it only works for MediaWiki dumps. In the future, I’d
love to have alternative parsers for other wikis (Tiki Wiki, DokuWiki,
MoinMoin… well, I cannot mention them all!). The advantage is that the
analysis modules are independent from the type of platform analyzed, as
long as you store the info using the same data model.

Other ideas could be:

  • Feeding a web interface to visualize the current state of your wiki (community, activity, trends). This could be great as a service for large wiki communities, like Wikia, or even for enterprise wikis.
  • Adding support for seasonality analysis, trends and forecasting (something I’d love to work on as soon as we find time and funding 🙂 ).
  • Integrating additional perspectives like: Social Network Analysis, effort analysis and patterns, co-authorship, forecasting/identifying prospective top-quality articles…

You stated in your WikiSym 2010 summary that “the need to find
solutions for social scientists and engineers to work together in
interdisciplinary groups, is probably one of the top-priority issues in
[your] research agenda.” How do you envision it? Any concrete plans?

This is absolutely right, and it’s a must for research on virtual
communities today. If we just focus on numbers, trends, activity
patterns etc. and we ignore the social side of the story, we’re missing,
in practice, half of the whole picture. We will never understand virtual
communities completely.

I’d like to explore why it seems so difficult to create
interdisciplinary working teams (tech sciences + social sciences).
Admittedly, we may “speak” and “interpret” things in slightly different
ways. But we must overcome these differences, since they are not the
problem but the *asset* when we build these kinds of teams.

We don’t have concrete plans for WikiSym 2011, yet. But I’d love to have
a panel where researchers from both “worlds” can sit around a table with
the audience and debate on best practices for interdisciplinary teams to
become a reality.

What are your predictions concerning the future of Wikipedia and its influence?

[Smiling] You know, last time I answered this question, there was a
strong controversy, so I tend to be cautious (even though subsequent
research and reports have surfaced some of the key problems we had
already identified).

My impression is that Wikipedia’s influence will keep on growing,
especially in developing countries, as Wikipedias with fewer articles
attract more contributions and expand their coverage. At the same time,
we need to find new ways of weaving edits from academics and scholars
together with contributions from the large existing community, to
address the problem of creating content in very specific niches of
knowledge. This also involves spreading the word on how to use Wikipedia
effectively among students and scholars alike, and eliminating
widespread FUD among many faculty members who still think that Wikipedia
is just “a perfect source for students to avoid doing the hard work”.

Finally, I think we still have to find “new ways of using Wikipedia”.
Many people use it as an encyclopedia, right. But we can also see it as
a source for information contextualization and categorization, for
creating thesauri, for translation… The longer the list, the better we
will exploit the many possibilities of this “everyday partner”.

Anything else you would like to add? Comments, ideas, thoughts?

If I can make a call, I would really like to draw the attention of
funding entities (private and public foundations, EU government, etc.)
to the urgent need to invest in research on virtual communities. In our
own research group, we have spent several years on this research line
with very little support, but with great results and outreach so far.
NSF is funding 7 or 8 projects on virtual communities and open
collaboration in the USA, while the EU is somewhat lagging behind, in my
opinion.

Not only Wikipedia, but virtual communities in general are a core piece
of the Information Society, and the so-called “Future Internet”. If we
just focus on technology but forget about “people using technology” we
may lose an important perspective: that this Information Society should
be user-centered, and not technology-centered. There must be a serious
effort to fund research lines to understand this reality, creating
interdisciplinary teams to face up to the challenge.