Dataset and scripts released by SoNet @ FBK group
Reference paper: Massa, Paolo (2011). Social Networks of Wikipedia. In Proceedings of ACM Hypertext 2011: 22nd ACM Conference on Hypertext and Hypermedia.
If you appreciate the fact I released the scripts and the datasets, please cite this paper. Thanks! --Paolo
Network extracted from User Talk pages of Venetian Wikipedia visualized with Gephi.
Abstract:
Wikipedia, the free online encyclopedia anyone can edit, is a live social experiment: millions of individuals volunteer their knowledge and time to collectively create it. It is hence interesting to try to understand how they do it. While most of the attention has concentrated on article pages, a less known share of activity happens on user talk pages, Wikipedia pages where messages can be left for a specific user. These public conversations can be studied from a Social Network Analysis perspective in order to highlight the structure of the "talk" network. In this paper we focus on this preliminary extraction step by proposing different algorithms. We then empirically validate the differences in the networks they generate on the Venetian Wikipedia against the real network of conversations extracted manually by coding every message left on all user talk pages. The comparisons show that both the algorithms and the manual process contain inaccuracies that are intrinsic to the freedom and unpredictability of Wikipedia growth. Nevertheless, a precise description of the issues involved allows one to make informed decisions and to base empirical findings on reproducible evidence. Our goal is to lay the foundation for a solid computational sociology of wikis. For this reason we release the scripts encoding our algorithms as open source, together with some datasets extracted from Wikipedia conversations, so that other researchers can replicate and improve our initial effort.
Python scripts, released as open source under the GPL license, available on github.com:
- signature2graph.py: (signature algorithm) generates a social network by parsing signatures on user talk pages. Input: pages-meta-current
- utpedit2graph.py: (history algorithm) generates a social network by parsing the edit history of user talk pages. Input: stub-meta-history (or pages-meta-history). A minimal sketch of this idea is shown after this list.
- graph_enrich.py: downloads information about the users (such as their roles) from the Wikipedia API and adds it to the nodes of the graph; a sketch of this step is also shown after this list.
- graph_analysis.py: reports various Social Network Analysis indexes about a network, such as number of nodes, number of edges, indegrees and outdegrees (mean, sd, 5 max values), density, reciprocity, transitivity, mean distance, efficiency, and centrality (computed with PageRank, betweenness, and degree centrality). It can also report these indexes for subgroups of nodes (for example admins, bots or anonymous users in Wikipedia), and it can consider only edges inserted before and/or after a certain date, in order to conduct longitudinal analyses.
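To give an idea of the extraction step, here is a minimal sketch of the core idea behind the history algorithm: every edit made by user Y on the User Talk page of user X becomes a directed edge from Y to X. The networkx dependency, the function name and the weighting scheme are illustrative assumptions; the released utpedit2graph.py parses the XML dumps directly and handles many more cases.

# Minimal sketch of the "history" algorithm idea: each edit by user Y on the
# User Talk page of user X becomes a directed edge Y -> X. The function name,
# weighting scheme and networkx dependency are illustrative assumptions.
from datetime import datetime

import networkx as nx


def build_talk_graph(revisions, last_date=None):
    """revisions: iterable of (talk_page_owner, editor, timestamp) tuples."""
    graph = nx.DiGraph()
    for owner, editor, timestamp in revisions:
        if last_date is not None and timestamp > last_date:
            continue  # optional cut-off date for longitudinal analysis
        if editor == owner:
            continue  # skip edits users make on their own talk page
        if graph.has_edge(editor, owner):
            graph[editor][owner]["weight"] += 1
        else:
            graph.add_edge(editor, owner, weight=1)
    return graph


if __name__ == "__main__":
    revisions = [
        ("Alice", "Bob", datetime(2009, 11, 2)),
        ("Alice", "Bob", datetime(2009, 12, 5)),
        ("Bob", "Alice", datetime(2010, 1, 10)),  # discarded by the cut-off below
    ]
    g = build_talk_graph(revisions, last_date=datetime(2009, 12, 30))
    nx.write_graphml(g, "talk-network.graphml")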
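The enrichment step can be approximated with a query to the MediaWiki API. The following is only a hedged sketch (the requests dependency, the function name and the chosen properties are assumptions, not the actual code of graph_enrich.py).

# Hedged sketch of the enrichment idea behind graph_enrich.py: ask the
# MediaWiki API for user groups and edit counts, then copy them onto the nodes.
import networkx as nx
import requests

API_URL = "https://vec.wikipedia.org/w/api.php"  # Venetian Wikipedia endpoint


def enrich_with_user_info(graph):
    for username in graph.nodes:
        response = requests.get(API_URL, params={
            "action": "query",
            "list": "users",
            "ususers": username,
            "usprop": "groups|editcount",
            "format": "json",
        })
        info = response.json()["query"]["users"][0]
        # graphml node attributes must be scalars, so join the group list
        graph.nodes[username]["groups"] = "|".join(info.get("groups", []))
        graph.nodes[username]["editcount"] = info.get("editcount", 0)
    return graph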
Venetian Wikipedia (2 networks extracted automatically with the 2 algorithms, plus 1 network resulting from manual coding of User Talk pages)
Networks:
Networks are in graphml format. Right-click to download the desired file, then open it with your preferred network analysis program. We like Gephi.
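If you prefer a scriptable route to a GUI, a graphml file can also be loaded programmatically, for instance assuming Python and the networkx library (neither is required to use the data):

import networkx as nx

# Load one of the released networks and print its size.
g = nx.read_graphml("vecwiki-20091230-manual-coding.graphml")
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")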
- vecwiki-20091230-manual-coding.graphml: network of conversations on User Talk pages in Venetian Wikipedia, as extracted from the manual coding of each page. Pages were analyzed as they were on 2009-12-30.
- vecwiki-20091230-signature-algorithm.graphml: network of conversations on User Talk pages in Venetian Wikipedia as extracted from pages-meta-current XML dump (see below) by looking for signatures (algorithm "signature" above). The processed dump contained the situation of User Talk pages on 2009-12-30.
- vecwiki-20091230-history-algorithm.graphml: network of conversations on User Talk pages in Venetian Wikipedia, as extracted from stub-meta-history by looking at the editors of User Talk pages (algorithm "history" above). The processed dump contained all edits made up to 2010-06-29 but, in processing it, only edits up to 2009-12-30 were considered (the "history" algorithm has a parameter by which you can specify the last date to be considered for edits).
Venetian Wikipedia XML dumps
- vecwiki-20091230-pages-meta-current.xml.bz2: Venetian Wikipedia XML dump used as input for signature algorithm to extract the network vecwiki-20091230-signature-algorithm.graphml
- vecwiki-20100629-pages-meta-history.xml.bz2: Venetian Wikipedia XML dump used as input for history algorithm to extract the network vecwiki-20091230-history-algorithm.graphml
Large Wikipedias (2 networks extracted automatically with the 2 algorithms)
- German Wikipedia
dewiki-20110131-pages-meta-current.graphml
dewiki-20110131-stub-meta-history.graphml
- Spanish Wikipedia
eswiki-20110203-pages-meta-current.graphml
eswiki-20110203-stub-meta-history.graphml
- Italian Wikipedia
itwiki-20110130-pages-meta-current.graphml
itwiki-20110130-stub-meta-history.graphml
- Chinese Wikipedia
zhwiki-20110127-pages-meta-current.graphml
zhwiki-20110127-stub-meta-history.graphml
Stats about the previous 4 networks, as output by running the graph_analysis.py script
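If you want to reproduce a few of these figures without running the full script, some of the indexes can be computed directly on a downloaded graphml file. The snippet below is only a sketch assuming networkx; graph_analysis.py itself may rely on a different library and computes many more measures.

import networkx as nx

# Compute a handful of the indexes mentioned above on one of the released
# networks; this is a networkx sketch, not the actual graph_analysis.py script.
g = nx.read_graphml("dewiki-20110131-pages-meta-current.graphml")

print("nodes:", g.number_of_nodes())
print("edges:", g.number_of_edges())
print("density:", nx.density(g))
print("reciprocity:", nx.reciprocity(g))
print("transitivity:", nx.transitivity(g.to_undirected()))

# Five most central users according to PageRank.
pagerank = nx.pagerank(g)
top5 = sorted(pagerank, key=pagerank.get, reverse=True)[:5]
print("top 5 by pagerank:", top5)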