Louise Corti, Director of Collections Development and Producer Relations at the UK Data Service reports back from the rerun of our successful summer school, Encounters with Big Data: An Introduction to using Big Data in the Social Sciences, held at the University of Essex, Colchester from 31 July to 4 August 2017.
The course met one of the objectives of the Smarter Household Energy Data project, a joint International Centre Partnership Grant between the UK Data Service and DataFirst, funded by the Economic and Social Research Council in the UK and the National Research Foundation in South Africa, to create a collaborative research infrastructure for large-scale household energy data.
Louise, UK Principal Investigator on the project said: “We are delighted to be able to run this course for a second time, after it was so well-received in Cape Town earlier this year. Our joint project finally finishes in August, and our two fantastic trainers, Peter and Chris, are leaving the UK Data Service, so it was wonderful to be able to squeeze this last capacity building event in”.
The course, aimed at experienced researchers, statisticians and data analysts, covered aspects of data extraction, exploration, basic analysis and visualisation of big data using a Hadoop system. The platform uses solutions that can deliver data at scale and with speed, including Hive, Spark and Zeppelin, which integrate seamlessly with popular data analysis environments like Python and R.
This course was run as part of the University of Essex’s Institute for Analytics and Data Science (IADS) summer school, which participants paid to attend. We received applications for the course from 60 people, with places available for only 20. Once again, prerequisites were set to ensure that participants could fully participate, requiring experience using quantitative research data in the social sciences, a good understanding of statistical methodology, and competence in writing commands in a statistical computing environment like Stata, R or SPSS.
Participants came from home institutions and from a number of other countries, including Denmark, Spain, Italy, Malta, Canada, Korea, Mexico, Mongolia and the USA. Roles ranged from professors, research assistants and postgraduates to statisticians and data analysts, spanning fields from economics, sociology, criminology and geography to marine ecology and genomics.
The summer school participants, teachers and organisers
The first day offered participants an overview of how big data methods can be applied to social science, using new tools and enhanced modelling techniques. Nathan Cunningham, Director of Research IT and Innovation at the UK Data Service introduced the concept of big data and provided an overview of the work being done in-house to build the UK Data Service’s new Data Services as a Platform (DSaaP) initiative currently being implemented in partnership with Hortonworks, a leader in Hadoop-based data technologies.
Louise Corti then presented on how national statistics offices around the world have been exploring alternative sources of collecting data on citizens (to replace or complement surveys and censuses), and showed an example looking at tourism data from World Heritage Sites on Wikipedia. Participants worked in groups to devise their own ‘national statistic’ based on free data from the internet and quickly came up with some innovative ideas for data and analyses including:
- Crime Rates: What kind of people are more likely to be defrauded online, i.e. those who use the internet more or less? Maybe hard to get data on those known to be defrauded.
- Tourism: Does watching films with particular locations inspire people to travel to those places? Identify locations in the most popular films over the last year and look at travel destinations from flight statistics.
- Pollution: Does pollution affect sports performance or fitness? Link data from air pollution sensors with fitness tracking devices such as Fitbits. Would be hard to control for people’s level of fitness.
- Deprivation: Measure level of deprivation by patterns of mobile phone usage in areas using cell tower records.
Libby Bishop, Manager for Producer Relations and Research Ethics at the UK Data Service, followed with a talk on the role of legal and ethical concerns when using big data. She cited examples of conditions under which research using Twitter data can be shared, and the increasing risk of re-identification given the proliferation of public data sources available about people. Libby outlined the 5 Safes Framework actively used and promoted by the Service as a robust approach for controlling access to disclosive data, so that rich policy-relevant analysis can be safely undertaken. Two guest speakers, James Allen-Roberson and Christian Kemp, from the Department of Sociology at Essex showcased their work on using the dark web in research, discussing the pros and cons of using such hidden sources.
Over the following two days, Peter Smyth and Chris Park delivered presentations, demonstrations and led exercises focused on manipulating data in Hadoop using a freely available Hadoop Sandbox.
Peter Smyth demonstrated how to use the Hive Query Language (HiveQL) to examine the contents of datasets and to ‘slice’ and ‘dice’ a dataset into smaller datasets which can be used by desktop applications. He also introduced Zeppelin notebooks as a means of carrying out interactive data analysis within a big data environment. Chris Park introduced Apache Spark, a high-performance, distributed computation engine designed for handling and analysing big data. He demonstrated how to scale out small-scale analyses by harnessing the power of Spark from R using the SparkR package, and Peter concluded by giving an overview of Leaflet, one of R’s libraries for producing spatial visualisations such as choropleth maps.
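HiveQL reads very much like standard SQL, so the ‘slice and dice’ pattern can be illustrated outside a Hadoop cluster. The sketch below uses Python’s standard-library sqlite3 as a stand-in for a Hive table (the `energy` table and its columns are invented for illustration, not the course data); the query itself would look almost identical in HiveQL.

```python
import sqlite3

# Hypothetical 'energy' table standing in for a Hive table;
# in HiveQL the SELECT below would read almost identically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE energy (household TEXT, region TEXT, kwh REAL)")
conn.executemany(
    "INSERT INTO energy VALUES (?, ?, ?)",
    [("h1", "north", 12.0), ("h2", "north", 8.0), ("h3", "south", 15.0)],
)

# 'Slice' one region and aggregate -- the result is a small dataset
# that could be exported to a desktop analysis tool such as R or Stata.
rows = conn.execute(
    "SELECT region, AVG(kwh) FROM energy WHERE region = 'north' GROUP BY region"
).fetchall()
print(rows)  # [('north', 10.0)]
```

On a real cluster the same statement would run against a Hive table holding millions of rows, with Hadoop distributing the work.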
Day four covered tools and techniques for getting and converting external data called from APIs, and how to interpret the results in JSON format. Peter concluded with an overview of end-to-end process tools available via Hadoop.
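Data returned from a web API typically arrives as JSON, which parses directly into native data structures. A minimal standard-library sketch (the payload shape and field names here are illustrative, not from the course materials or any real API):

```python
import json

# A small JSON payload of the shape a REST API might return;
# in practice this string would come from an HTTP request,
# e.g. urllib.request.urlopen(url).read().
payload = '{"results": [{"site": "Stonehenge", "visits": 1500000}]}'

data = json.loads(payload)  # JSON -> nested dicts and lists
for item in data["results"]:
    print(item["site"], item["visits"])
```

From here the nested structure can be flattened into rows for a dataframe in R or pandas.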
Participants then moved on to formulating their own group projects to consolidate what they had learned over the past four days, using open data from the internet. For the remaining day, the groups worked on their projects, accessing structured data from the web, importing and linking it where necessary, and conducting exploratory data analysis and visualisations. The group projects turned out to be really interesting, with all teams exploring unfamiliar open data sources and software packages. The five teams came up with some great ideas, ranging from the use of Twitter data, Google Trends data, weather data and accident databases to look at racism, use of statins, pedestrian and car safety, and weather and mood. A range of R tools were used, including ggplot2, rtweet, gtrendsR and wordcloud.
It was hard to choose an outright winner, so in the final afternoon we awarded prizes to two teams: ‘Racists be damned’ (team of two) and ‘Auto Choice Model’ (team of three).
Team ‘Racists be damned’
- This group of two used online data available from the UK Government and Parliament Petitions website. A recent example was a petition for a second Brexit referendum, which 4.2 million people had signed. The team extracted JSON data from the site’s API for petitions with the term ‘immigration/immigrants’ in the title (around 100 out of a possible 31,732 petitions). The intention was to conduct sentiment analysis on the data and then map this to the UK Index of Multiple Deprivation, but this proved tricky in the time available. Instead, they hand-classified the petitions as positive or negative regarding immigration/immigrants. They plotted by constituency the proportion of people who signed the anti-immigration petitions (dark shading = higher votes for negative petitions). They converted the JSON data into a regular dataframe and used R and Leaflet, with a little help from Chris and Peter. One of the team declared his pride in creating a map for the first time.
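The JSON-to-dataframe step the team performed can be sketched in a few lines. The field names below are invented for illustration and do not match the actual Petitions API schema:

```python
import json

# Illustrative records shaped loosely like petition API output;
# the real Petitions API returns a different, richer schema.
raw = json.loads("""[
  {"attributes": {"action": "Restrict immigration", "signature_count": 5000}},
  {"attributes": {"action": "Welcome immigrants", "signature_count": 12000}}
]""")

# Flatten the nested JSON into regular rows, as one would before
# constructing a dataframe in R or pandas.
rows = [
    (p["attributes"]["action"], p["attributes"]["signature_count"])
    for p in raw
]
print(rows)
```

Once flattened, each row can carry hand-coded labels (positive/negative) and be aggregated by constituency for mapping.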
Some initial results from the ‘Racists be damned’ project team
Team ‘Auto Choice Model’
- This group of three wanted to build an app offering insights to help buyers choose a suitable car. They looked for data on fuel economy and emissions and attempted to correlate these with accidents. They found data on car attributes, such as family size. They also found a database of car accident locations in the UK by car make, model and engine size, ranked from 1 to 3 by severity of accident. The team split up data preparation tasks and used R Markdown to keep track of work. They converted UK accident point data to Lower Layer Super Output Areas (LSOAs), used inner joins on the datasets, and mapped accidents by model of car to LSOAs using Leaflet. The team attempted an ambitious 4D plot showing car engine size by brand by severity of accident by accident location.
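The inner-join step the team used (in R) can be illustrated in plain Python. The tables and column values below are invented for illustration; an inner join keeps only the rows whose key appears in both datasets:

```python
# Two small illustrative tables: car attributes keyed by model,
# and accident records of (model, LSOA code, severity 1-3).
cars = {"Fiesta": {"engine_cc": 998}, "Golf": {"engine_cc": 1498}}
accidents = [("Fiesta", "E01000001", 2), ("Astra", "E01000002", 1)]

# Inner join: keep only accident rows whose model also appears in `cars`,
# attaching the matching car attribute to each surviving row.
joined = [
    (model, lsoa, severity, cars[model]["engine_cc"])
    for model, lsoa, severity in accidents
    if model in cars
]
print(joined)  # [('Fiesta', 'E01000001', 2, 998)]
```

The unmatched ‘Astra’ row is dropped, which is exactly the behaviour of an inner join in SQL or R’s merge/join functions.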
We believe that participants really do benefit from group work to exchange ideas and troubleshoot issues as they arise. Given the short time frame, most of the groups divided tasks up between themselves.
Prize giving for the winning project teams
Feedback suggested that the course went at roughly the right pace, with only a couple of attendees struggling to keep up as the week progressed. I find that ensuring that all the prerequisites for a course like this are fully met is absolutely essential for any teaching I organise, but for this course run by an external provider, we were unable to vet participants’ applications. Participants for big data courses of this nature benefit from a reasonably high level of competence and experience in data handling and statistical analysis.
I would like to thank Emma McLelland from the IADS summer school and Sarah King-Hele for coordinating the logistics for the week. And huge thanks and fond farewell to our two fantastic trainers, Chris Park and Peter Smyth.
You can access the training materials here on our course GitHub page: https://github.com/ukdataservice/bdas2017
“Very good course indeed. Good structure, material, pace, content + teaching”
“Wonderful prep and professional presentation”
“I really liked it overall. Perhaps one suggestion. I would have liked to be given sample code for analysis actually run in the cloud, not just in our local VM”