Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research

February 2012 - February 2014

This project aims to enhance JISC's UK Web Domain archive, a 30 TB archive of the .uk country-code top level domain collected from 1996 to 2010. It will extract link graphs from the data and disseminate social science research using the collection.

Contact:

Professor Helen Margetts

Tel: +44 (0)1865 287207

Email: helen.margetts@oii.ox.ac.uk

Overview

The potential of web archives for link analysis research has been well documented, but it has yet to be realised and demonstrated in good research. There is therefore a need to start the hands-on work of processing and analysing domain-scale web collections. This project aims to increase the visibility, accessibility, and ease of use of the JISC UK Web Domain Dataset, a 30 terabyte web archive of the .uk country-code top level domain (ccTLD) collected from 1996 to 2010. The project will extract link graphs from the data, assess the feasibility and impact of using the .uk ccTLD as a boundary for UK web presence, and conduct and disseminate high-quality social science research examples using the collection. It will also trial tools and procedures to make the data more easily accessible, including tools for remote access. Finally, it will assess the feasibility of developing code to import link data from the collection into NodeXL or other network analysis software, so that subsets of the corpus can be easily accessed, visualised, and analysed.

Background

The current and transient nature of the Web means that new information replaces older information constantly without any record of the previous state (or versions) of the same information. While new information is being added, existing information also disappears from the Web, leaving a significant gap in our knowledge of the historical web, and potentially in social history and our understanding of change over time.

Serious web archiving effort started in the UK in 2004 through the UK Web Archiving Consortium, a collaborative project among a number of organisations including the National Libraries, the National Archives, JISC, and the Wellcome Library. Comprehensive collection of websites has not been possible in the UK due to the lack of a national legislative framework, which in a number of other countries has allowed the national libraries to archive publicly available web publications through periodic crawls of the national domain. The UK Web Archive and the UK Government Web Archive are currently the two main archives in the UK, and they contain archival copies of only a highly selective fraction of UK websites.

JISC recently funded the procurement of the UK Web Domain Dataset, a research dataset of UK websites from the Internet Archive spanning the period 1996-2010. This 'national copy' of the UK web collection is currently managed by the British Library on behalf of JISC. This project explores the social science potential of the dataset. It will develop tools to conduct social science analyses of key subsets of the data, using webometric analyses to look at online structures and the spread of policy issues across websites. The dataset was procured only recently (late 2011), so this project is one of the first to explore and understand it.

Link Analysis and Political Science Questions

Underneath all interaction and content on the web are links. Hyperlinks connect web pages and web sites, links connect users of Twitter and other social media sites, and links connect users with corporations, organisations, and government entities. Link analysis can therefore reveal which entities are central in content discussions and which users play important bridging roles connecting otherwise separate groups. Analysis can reveal fractures between groups online and areas of cohesion, the size of entities, and the relationships between entities. Internet archives allow all of these dimensions to be analysed over time.
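As a minimal sketch of the simplest centrality measure mentioned above, the example below counts inbound links per host from a list of (source, target) pairs. The hostnames are hypothetical and the function name in_degree is chosen for illustration only; the project's own measures are not described here.

```python
from collections import Counter


def in_degree(edges):
    """Count distinct inbound links per target host.

    Duplicate edges and self-links are ignored, so each linking
    host contributes at most one inbound link to a given target.
    """
    unique_edges = set(edges)
    return Counter(
        target for source, target in unique_edges if source != target
    )


# Hypothetical host-level edges (source host, target host).
edges = [
    ("news.example.co.uk", "dept.example.gov.uk"),
    ("charity.example.org.uk", "dept.example.gov.uk"),
    ("news.example.co.uk", "charity.example.org.uk"),
    ("news.example.co.uk", "dept.example.gov.uk"),   # duplicate edge
    ("dept.example.gov.uk", "dept.example.gov.uk"),  # self-link
]

# Hosts ordered from most to least linked-to.
ranking = in_degree(edges).most_common()
```

In-degree is only a rough proxy for centrality. Identifying bridging roles would need a path-based measure such as betweenness, which is not shown here.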

For given domains, link analysis can therefore be used to understand the structures of online institutions, their relationships with each other, and their interactions with the outside world, as well as their navigability for users. For the UK government second-level domain (encompassing all sites ending in .gov.uk), link analysis can be used to analyse the changing structure of, and relationships between, government departments and agencies; their place in social and informational networks, including the extent to which they are 'nodal'; levels and locations of citizen-government interaction; and government's position as a watchtower with a privileged view of UK society. In addition, "government on the web" as experienced by citizens online can be documented. Such a full picture of the UK government domain has never been drawn before, although a snapshot of central government at one point in time was created by members of the same research team (Dunleavy, Margetts et al., 2007). Using an archival dataset provides a unique opportunity to assess how the structure of e-government (the only part of government that most people interact with) has changed from its conception to the current day.
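One simple way to express structural change over time is to compare the link sets of two snapshots. The sketch below assumes each snapshot has already been reduced to a set of (source host, target host) pairs. The years, hostnames, and the function name compare_snapshots are illustrative assumptions, not data or code from the project.

```python
def compare_snapshots(earlier, later):
    """Split the links of two snapshots into added, removed, and kept."""
    earlier, later = set(earlier), set(later)
    return {
        "added": later - earlier,    # links present only in the later snapshot
        "removed": earlier - later,  # links present only in the earlier snapshot
        "kept": earlier & later,     # links present in both snapshots
    }


# Hypothetical snapshots of links between government hosts.
snapshot_2000 = {
    ("dept-a.example.gov.uk", "dept-b.example.gov.uk"),
    ("dept-a.example.gov.uk", "agency.example.gov.uk"),
}
snapshot_2010 = {
    ("dept-a.example.gov.uk", "dept-b.example.gov.uk"),
    ("dept-b.example.gov.uk", "portal.example.gov.uk"),
}

change = compare_snapshots(snapshot_2000, snapshot_2010)
```

Here one link persists, one disappears, and one is new. Applied to successive years of an archive, the same comparison would give a coarse picture of how departments and agencies reorganise their links.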

Datasets

This project will primarily use and enhance the JISC UK Web Domain Dataset (1996-2010). This dataset is superior to alternatives because of its comprehensive content and extended temporal dimension. Alternative datasets such as the UK Government Web Archive at the National Archives include only selected parts of government presence (i.e. only central government). Because they capture only government pages, they also cannot shed light on how sites in other domains (.co.uk, .org.uk, etc.) link to government websites. The JISC UK Web Domain Dataset allows for studies of the entire government presence, including local government, and for analysis of how other websites link to (and from) government pages. This gives a picture of government's place in social and informational networks across the UK and of how these patterns change over an extended period of time.

Additional datasets offer great opportunities for comparison, over time and across countries. The principal investigator (Helen Margetts) currently directs an ESRC-funded research programme which involves crawling the web presence of government in seven countries (the US, Canada, Germany, Japan, Australia, the Netherlands and the UK) but includes no analysis over time. Data from an independent crawl of sites in the .gov.uk domain, carried out by the Oxford Internet Institute in late 2011 for the ESRC programme, offers the opportunity to document the change in government presence from 2010 (when the JISC UK Web Domain Dataset ends) to the present time, as well as introducing a cross-national comparative element.

Methodology

This project will primarily use and enhance the JISC .uk domain web dataset, a 30 terabyte crawl of the .uk country-code top level domain (ccTLD) maintained by the British Library. The data was collected from 1996 to 2010 and holds the raw HTML of all webpages harvested. This project will enhance the sustainability of the collection by:

  • Extracting all hypertext links from the currently unstructured text of the pages and making it easier to directly query this link data.

  • Investigating and recommending access methods and tools for querying, analysing, and visualising the dataset to enable wider use. This will include assessing the feasibility of making the data accessible in NodeXL or another popular open-source software package for analysing graph/link data.

  • Assessing the impact of using the .uk ccTLD as a boundary for the UK web presence.

  • Developing recommendations for future web crawls that will be used for link analysis.

  • Demonstrating the value of this collection and a 'big data' approach through novel and innovative political science research using the collection.
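The first three steps in the list above can be sketched with the Python standard library: pull hyperlinks out of raw HTML, reduce them to host-level edges, check whether each target falls inside the .uk boundary, and write a plain two-column edge list. This is a minimal illustration under stated assumptions, not the project's pipeline. The class and function names, the sample page, and the "Source"/"Target" column headers are all hypothetical, and how NodeXL or any other package would ingest the file is not specified in the source.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the page.
            self.links.append(urljoin(self.base_url, href))


def host_edges(page_url, html):
    """Return the set of (source host, target host) pairs for one page."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    source = urlsplit(page_url).hostname
    edges = set()
    for link in parser.links:
        parts = urlsplit(link)
        # Keep web links only and skip links within the same host.
        if parts.scheme in ("http", "https") and parts.hostname:
            if parts.hostname != source:
                edges.add((source, parts.hostname))
    return edges


def in_uk_cctld(host):
    """True if the host name falls under the .uk ccTLD."""
    return host == "uk" or host.endswith(".uk")


def edges_to_csv(edges):
    """Serialise edges as a two-column CSV edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # assumed column names
    writer.writerows(sorted(edges))
    return buffer.getvalue()


# Hypothetical page content for illustration.
html = (
    '<a href="/about">About</a> '
    '<a href="http://www.example.org.uk/x">Org</a> '
    '<a href="http://example.com/">Com</a> '
    '<a href="mailto:a@example.uk">Mail</a>'
)
edges = host_edges("http://www.example.gov.uk/index.html", html)
outside_boundary = {target for _, target in edges if not in_uk_cctld(target)}
edge_list = edges_to_csv(edges)
```

Counting how many targets land in outside_boundary gives a first indication of what a .uk-only boundary leaves out, such as UK organisations hosted under .com.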

Support

This project is supported by JISC.

People

Project Lead

  • Professor Helen Margetts, Oxford Internet Institute (Principal Investigator)

Researchers

  • Dr Jonathan Bright, Oxford Internet Institute

  • Dr Sandra Gonzalez-Bailon, Oxford Internet Institute (Co-Investigator)

  • Scott A. Hale, Oxford Internet Institute (Researcher)

  • Professor Eric T. Meyer, Oxford Internet Institute (Co-Investigator)

  • Tom Nicholls, Oxford Internet Institute

  • Dr Taha Yasseri, Oxford Internet Institute

Blog

  • Leadership without Leaders? Starters and Followers in Collective Action on the Internet

    Government on the Web on 1 Aug 2013 14:21PM

    Type:  Article Experiment?:  Yes Collective Action Date:  [...]

  • Modeling the Rise in Internet-based Petitions

    Government on the Web on 1 Aug 2013 14:02PM

    Type:  Article Experiment?:  No Collective Action [...]

  • Petition Growth and Success Rates on the UK No. 10 Downing Street Website

    Government on the Web on 1 Aug 2013 13:57PM

    Type:  Article Experiment?:  No Big Data [...]

  • Interactive Map of Central Government Online

    Government on the Web on 23 Oct 2012 08:15AM

    Big Data Citizen-Government Interactions Digital Era Governance We have collected and visualized a pilot crawl of UK Central Government websites in [...]

  • Draft Paper: Understanding the Mechanics of Online Collective Action Using 'Big Data'

    Government on the Web on 29 Apr 2012 16:45PM

    Type:  Paper Experiment?:  No Collective Action [...]

  • Join our team: Big Data Research Officer needed

    Government on the Web on 18 Feb 2012 20:21PM

    We are excited to announce an open position for a Big Data Research Officer, who will contribute to three exciting Big Data projects at the OII (Leaders and Followers in Online Activism, Big Data: Demonstrating the Value of the UK Web Domain Dataset for [...]

  • Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research

    Government on the Web on 7 Feb 2012 14:56PM

    Project Date:  Feb 2012 - Aug 2013 Categories:  Digital Era Governance Citizen-Government [...]

  • Leadership without Leaders? Starters and Followers in Online Collective Action

    Government on the Web on 3 Oct 2011 09:31AM

    Type:  Article Experiment?:  Yes Collective Action Date:  [...]

  • Draft Paper: Applying Social Influence to Collective Action: Heterogeneous Personality Effects

    Government on the Web on 28 Sep 2011 08:46AM

    Type:  Article Experiment?:  Yes Collective Action Date:  [...]

  • Social Information and Political Participation on the Internet: an Experiment

    Government on the Web on 11 Jul 2011 08:43AM

    Type:  Article Experiment?:  Yes Collective Action Date:  [...]

  • New research project: The Internet, Public Policy and Political Science

    Government on the Web on 1 Apr 2011 17:14PM

    Digital Era Governance Collective Action Citizen-Government Interactions We will begin a new three-year research programme on The Internet, Public [...]

  • The net effect: Is the internet really vital to contemporary protest?

    Government on the Web on 10 Mar 2011 17:31PM

    Professor Helen Margetts was recently featured on the Economic and Social Research Council (ESRC) website writing about the importance of the Internet to contemporary protest. The full article is available at their website. 2011 is bringing dramatic [...]
