Using nightlights data to map rural electrification in South Africa: observations from a quantitative methods graduate

In this blog Camille Corti-Georgiou, summer intern at the UK Data Service at the University of Essex, offers her own reflections on the workshop, ‘Household energy use in the Agincourt area’, held in July 2017 at the Wits Rural Facility in north-eastern South Africa. Camille has just completed her first degree in Political Science and International Relations and is about to embark on a Masters of Enterprise at the University of Manchester Alliance Business School.


I attended the ‘Household energy use in the Agincourt area’ workshop, organised by Professor Martin Wittenberg (DataFirst at Cape Town University) and Dr. Mark Collinson (Senior Researcher: MRC/Wits Rural Public Health and Health Transitions Research Unit), as a student of quantitative methods with a fairly limited understanding of longitudinal data, ‘big data’ and the methods employed by the team at Wits Rural. Yet what stood out to me, as a recurring theme throughout the two days, was the potential scope of the work being carried out in Agincourt. The workshop consisted of multiple presentations and talks on various aspects of the study; in this post I want to pay specific attention to the presentation given by Professor Martin Wittenberg on the use of nightlights data in the Agincourt area.

The Agincourt HDSS

The Agincourt Health and Socio-Demographic Surveillance System (HDSS) is, by all accounts, a significant research undertaking, yielding a powerful database for studying many aspects of rural life in a developing region. The principal goal of the HDSS is to offer an enhanced understanding of the dynamics of health, population and social transitions in northeast South Africa. With twenty years of data collected since the initial baseline survey, the research unit in Bushbuckridge now holds a rich and substantial set of data with policy-influencing significance. Moreover, I was fortunate to see the exemplary framework they have developed, within which future research may be undertaken.


Wits Rural Research Facility, Bushbuckridge, South Africa

Throwing Light on Rural Development

For developed regions, the use of satellite data in mapping urbanisation has been widely tested and validated. Measuring the electrification of rural areas, using the same method, however, is a much newer phenomenon. At Wits, the team were interested in whether the satellites would pick up the temporal patterns of rural electrification in the Agincourt area, to allow for analyses that could be corroborated by the data collected on the ground.

Takwanisa Machemedze from DataFirst advanced the actual technique of linking satellite nightlights data (as shown in Figure 1) with local data, finding a definitive correlation between the two.


Figure 1. NASA satellite view of Earth at night, compiled from 400 satellite images

In explaining this technique, Martin repeatedly referred to the sum of lights (SOL), the sum of all pixel values for a particular region. While in developed regions a SOL measure will generally suffice, in rural regions it can be problematic. Instead of using the average brightness across the pixel for an area, Takwanisa broke up the pixels and matched them to the shape of the boundary, before adding them up. While this approach resulted in some loss of brightness, it gave the most accurate indication of electrification in the area.
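One way to picture the boundary-matched sum of lights is as an area-weighted sum, in which each pixel contributes only the share of its brightness that falls inside the boundary. The sketch below is an illustration under that assumption; the talk did not spell out the exact procedure, and the pixel values and coverage shares are invented.

```python
# Hypothetical 3x3 grid of nightlight pixel values (arbitrary brightness units).
pixels = [
    [0.0, 4.0, 6.0],
    [2.0, 8.0, 5.0],
    [0.0, 3.0, 1.0],
]

# Hypothetical share of each pixel's area lying inside the village boundary
# (1.0 = fully inside, 0.0 = fully outside). In practice these shares would
# come from overlaying the pixel grid on the boundary shape.
coverage = [
    [0.0, 0.5, 0.0],
    [0.25, 1.0, 0.5],
    [0.0, 0.5, 0.0],
]


def sum_of_lights(values, weights):
    """Sum pixel values, each scaled by the share of the pixel inside the boundary."""
    return sum(
        v * w
        for row_v, row_w in zip(values, weights)
        for v, w in zip(row_v, row_w)
    )


def whole_pixel_sum(values, weights):
    """Naive alternative: count in full every pixel that touches the boundary."""
    return sum(
        v
        for row_v, row_w in zip(values, weights)
        for v, w in zip(row_v, row_w)
        if w > 0
    )


weighted_sol = sum_of_lights(pixels, coverage)
naive_sol = whole_pixel_sum(pixels, coverage)
print(weighted_sol, naive_sol)
```

With these invented numbers the weighted sum is lower than the whole-pixel sum, which is consistent with the "loss of brightness" mentioned above.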

Previous analyses conducted on the data by DataFirst showed stable progress in electrification, with a steady upward trend. What was most interesting, however, were the periods of decline and deviation from this trend, most notably the huge electricity dip across the whole of South Africa in 2008. As a result of the Eskom collapse, which stemmed from electricity capacity failing to meet the demands of the growing economy, a state of electrical emergency was declared. The massive load shedding and electricity rationing that followed were visible in the satellite data. Even in the wake of the collapse, light levels failed to recover, owing to the massive tariff increases that followed. At this point it was noted that one of the drawbacks of using satellite data to measure electrification is that it only detects external light. As one of the workshop attendees pointed out, electrification could very well have still been occurring in the area, with people opting to use only internal lights, so that their households would not have been detected.

Data from the ground

Moving on from this, the second half of Martin’s presentation shifted focus onto electrification in the villages and homes of Agincourt. Expanding on the work of Hargreaves, villages in the area were sorted into four distinct categories as shown in Figure 2:

Village type               Electricity in 2000 (Y/N)
Central communities        Yes
Established communities    Yes
Undeveloped villages       No
Refugee settlements        No

Figure 2. Village typology used by HDSS

Non-domestic lighting, such as light from train stations, police stations and supermarkets, was also picked up by the satellite, so it was crucial to conduct analyses on the ground to distinguish between non-domestic and domestic sources of lighting. The data collected for Agincourt was compared against data for Kruger and Nelspruit using a difference-in-differences (DID) approach to look at the take-up of lighting. At the beginning of the analysis in 1992, rates of electrification in Kruger and Agincourt were considerably lower than in Nelspruit. However, the data showed that Agincourt had a significant tendency to get brighter compared to both. In 1992, Agincourt displayed almost total darkness, lighting up incrementally to 2007 before dulling in the midst of the Eskom collapse of 2008.
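The arithmetic behind a difference-in-differences comparison can be sketched in a few lines. The brightness figures below are invented for illustration and are not results from the workshop; the idea is simply that the change at the comparison site is subtracted from the change at the site of interest.

```python
# Hypothetical mean brightness per site and period (not workshop figures).
brightness = {
    "Agincourt": {"before": 1.0, "after": 6.0},  # site of interest
    "Kruger": {"before": 1.5, "after": 2.0},     # comparison site
}


def diff_in_diff(data, treated, control):
    """Change at the treated site minus the change at the comparison site."""
    treated_change = data[treated]["after"] - data[treated]["before"]
    control_change = data[control]["after"] - data[control]["before"]
    return treated_change - control_change


did = diff_in_diff(brightness, "Agincourt", "Kruger")
print(did)
```

Subtracting the comparison site's change removes any brightening that affected both sites alike, so that what remains can more plausibly be attributed to local electrification.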

The nightlights data additionally showed major bright spots in the Agincourt site. On the surface, it appeared the spots could be attributed to three villages. Yet, when looking closer, one could identify two developed villages, one undeveloped village, a taxi rank, a supermarket and a train station. The nightlights data was thus misleading, giving the impression that the third village had gone through the process of electrification when in reality it remained relatively undeveloped. Martin informed us of the caveats of using satellite data and once again reiterated the importance of having researchers on site. Moreover, coupling the nightlights data with the village typologies produced more specific findings, such as 200 new village connections increasing the brightness of the area by 1.7 units.


For me, the workshop delivered a fantastic insight into the use and application of nightlights data. The analyses conducted on satellite imagery for the last two decades have not only truthfully captured the electrification of the Agincourt area, but also display the differences between the developed and undeveloped areas of the site. Even from a non-technical perspective, the gravity of the research being done is evident and it is clear the potential exists to do so much more.

The full published paper is entitled Throwing light on rural development: using nightlight data to map rural electrification in South Africa, by Takwanisa Machemedze, Taryn Dinkelman, Mark Collinson, Wayne Twine and Martin Wittenberg.


Knowledge exchange symposium: 21st Century Data Infrastructure for Research

In this blog Susan Cadogan and Camille Corti-Georgiou from the UK Data Service report back on the joint symposium held on 11 July 2017 at the University of Cape Town (UCT), organised by eResearch at UCT, DataFirst and the UK Data Service. The meeting aimed to encourage discussion and debate around the requirements for scaling data services and data infrastructure for research.

Hosted by Dr Dale Peters, Director of UCT eResearch, the symposium comprised presentations from Louise Corti and Nathan Cunningham of the UK Data Service; Martin Wittenberg, Director of DataFirst; Anwar Vahed of the Data Intensive Research Initiative of South Africa (DIRISA); Russ Taylor of the Department of Astronomy at UCT (IDIA); and Rob Simmonds of the Department of Computer Science, UCT. Attendees of the symposium hailed from a range of disciplines including data science, bioinformatics and economics.


Corti and Wittenberg set the scene, highlighting the origins of the UK-UCT collaboration, which stemmed from a successful application to a call for International Centre Partnerships funded by the UK’s Economic and Social Research Council (ESRC) and South Africa’s National Research Foundation (NRF). The 18-month Smarter Household Energy Data project focused on scaling up household energy research using ‘big data’ infrastructure. Corti outlined the role of the UK Data Service in a joint project with the University College London Energy Institute, looking at household energy data collected from smart meter readings. In South Africa, DataFirst collaborated closely with both the Energy Research Centre at UCT and the University of Witwatersrand and Medical Research Council (MRC)’s Rural Public Health and Health Transitions Research Unit to investigate fuel poverty and the impact of electrification in rural areas using previously untapped sources of data. The data came from the Agincourt Health and Socio-Demographic Surveillance System (HDSS), NASA Nightlights data and Cape Town Municipality electricity billing data. Contrasting access to and use of household energy in the UK and South Africa underlines the vast differences in consumption and access to resources between countries, and has helped highlight the impact of decisions, policy making and intervention in this research sphere.


Louise Corti introducing the UK Data Service – DataFirst partnership project

Both data streams inevitably raised challenges of data quality, sensitivity and anonymity. Corti underlined how the practices used in both institutions around the ‘5 Safes’ of data access could be rolled out to accommodate research practices using these new forms of data. The project has succeeded in meeting its initial aims of providing data expertise across institutions and using the partnership to work through some of the challenges with the size and quality of data.

Infrastructure in South Africa

Dr Anwar Vahed outlined his role in DIRISA, helping to establish a national research data infrastructure in South Africa and coordinating the development of expertise and the implementation of research data management strategy and policy. Together with the Centre for High Performance Computing (CHPC) and the South African Research Network (SANReN), DIRISA forms the data infrastructure component of the National Integrated Cyberinfrastructure System (NICIS) of South Africa. The roadmap for cyber-infrastructure for both data and computing resources runs across six thematic areas: physical science and engineering; energy; health, bio and food; earth and environment; humans and society; and materials and manufacturing. DIRISA is aiming to provide federated access to data, and while it does not have a preservation role, it promotes the use of trusted repository services, rather than informal cloud-based solutions such as Google Drive/DropBox, for research.


Anwar Vahed speaking about DIRISA

Big data: the case of astronomy

Professor Russ Taylor from the Department of Astronomy at UCT observed that astronomers have long managed the large data flows generated in their field, citing the Square Kilometre Array (SKA) project, which aims to be the world’s largest radio telescope. The pre-construction phase of the project started in 2012, with the next phase of development commencing in 2018-2020 and being used to test proof of concept as scientific observations begin. With South Africa and Australia having won the bid to co-host the SKA, this large-scale project comprises ten cornerstone countries and over 121 institutions. Some 360 researchers, scientists and engineers are helping develop the supercomputers for processing data.


Russ Taylor from the Department of Astronomy at UCT speaking on the roll out of the SKA

As with research utilising genetic sequencing, data from telescopes has grown rapidly and significantly over the last 5-10 years. Major challenges are: extracting information from such large amounts of data; managing large files (e.g. 10 million records); processing data using multiple pathways and detecting change; and merging and linking data (a relatively new area outside of the social sciences). Additional current cultural challenges include research reproducibility and moving to cloud-based processing.

Finally, the aspiration for global collaboration is hindered by a lack of funding at this level, although funding is being sought through a proposal under Horizon 2020. In terms of cyber infrastructure, South African astronomers are looking to DIRISA for support as a Tier 2 (national) facility to serve their research community together with bioinformatics.

Professor Rob Simmonds provided a technical view on the Inter-University Institute for Data Intensive Astronomy (IDIA) system being built at UCT, which will form part of the Tier 2 facility. Investment in the facility over three years will total some R17 million. An OpenStack-based Infrastructure as a Service (IaaS) management system has been installed, with a security layer to be added that will be required when the bioinformatics facility becomes operational. So far, hardware is in place and storage has been provisioned; other processing services will be provided by the African Research Cloud (ARC) from the University of Cape Town.

The Tier 2 storage will use the Ceph free-software storage platform for cloud storage, with 3x replication as back-up. Security, authentication and user front ends are still to be developed; these will need to support the management of personally-identifiable data and reproducible science.

Joining up big data infrastructure

As the final speaker, Nathan Cunningham, Director of Research IT at the UK Data Service, outlined the research landscape in the UK, with a move to join up infrastructure where possible to maximise value and efficiency, enable collaboration and ensure cross-cutting work streams that help to eliminate duplication.


Nathan Cunningham introducing the UK’s big data infrastructure planning

Cunningham drew on the 2013 OECD report on New Data for Understanding the Human Condition: International Perspectives, which encourages us to review how we scale up computing power, skills and management systems to cope with large data resources, including those that cover information about human subjects. In response to these challenges, Cunningham outlined the UK Data Service’s ongoing implementation of a Data Service as a Platform (DSaaP). He stressed that while many of the ideas are not original, using open source tools and creating hybrid services can offer a flexible system that meets the needs of the researcher in addition to ensuring safe and continuous access.

In meeting the challenges of the new infrastructure, staff and researchers will need to upskill and be introduced to new ways of thinking, somewhat removed from the traditional data dissemination and archiving practices we find in social science today.

Looking ahead

A number of common threads and challenges arose from the symposium. There is great excitement about enabling access to new and novel forms of data, but building systems to manage data across disciplines with varying access requirements poses serious challenges, especially where no additional funding to adapt or renew infrastructures is available.

However, the use of open source tools and introduction of hybrid infrastructures and systems could offer positive solutions and has the potential to reduce costs.  Participants agreed to continue sharing ideas and solutions, especially at a time of reduced infrastructure funding and where capitalising on collaboration may help strengthen proposals.

Training and upskilling staff and researchers poses another challenge. The UK Data Service-DataFirst project has moved one step closer to this aim through hands-on engagement, developing and running a week-long Summer School entitled Encounters with Big Data: An Introduction to using Big Data in the Social Sciences, held in February 2017 at UCT and to be run again in August 2017 in the UK at the Institute for Analytics and Data Science (IADS) Summer School.


‘Encounters with Big Data workshop’ participants, Cape Town, February 2017

This intensive course has provided participants with the opportunity to extract, explore and analyse big data using some core elements of DSaaP. Using a locally-installed Hadoop Sandbox, participants were taught basic skills in managing and manipulating big data with data science oriented tools and software, such as R and Spark.

The feedback from the course was very encouraging but the challenge now is to ensure that the researchers are able to benefit from their new skills by putting them to use, and that their skills are updated and refreshed over time. We hope that new researchers are further encouraged to participate and that future capacity building can be carried out via a fully-fledged big data environment that can support future waves of researchers.

Louise Corti, UK Principal Investigator on the UK Data Service – DataFirst project, hopes that “we can continue to collaborate with UCT beyond the life of the soon-to-finish project, ensuring that the great networks we have brokered together across disciplines, and the policy-relevant impact we have emerged in the energy research can live on”.



Cape Town Data Quality Workshop: Measurement of Development Indicators


In this blog Andrew Kerr, economist and a Senior Research Officer at DataFirst, University of Cape Town, reports back on a 2-day workshop held at the River Club in Cape Town on 6 and 7 July 2017, organised by Professor Martin Wittenberg, Director of DataFirst, with support from the UK Data Service.

The workshop sessions were organised around several themes: measurement of individual and household well-being; access to household energy and household services; new forms of data and innovative approaches to measurement; and labour market data.

Attendees included academics from a variety of Universities (Cape Town, Pretoria, Stellenbosch) and a range of disciplines (Economists, Astronomers, Health and Energy researchers), as well as representatives from Statistics South Africa (StatsSA), the Department of Basic Education and the Office of Astronomy for Development (OAD) of the International Astronomical Union.

Focus on energy and new forms of data

The sessions on energy and new forms of data showcased some of the results from our ESRC-NRF funded international partnership between DataFirst and the UK Data Service, Smarter Household Energy Data.

Using sources of data outside the domain of traditional survey and administrative data is often seen as a challenge for social scientists, due to unfamiliarity with, and lack of trust in, the data. A lot of preparation and manipulation has to go into data sources derived from real-time measurements. Takwanisa Machemedze, formerly of DataFirst, presented a paper on using nightlights data to measure rural electrification. He showed that the nightlights data tracked the roll-out of electricity connections in a health and demographic surveillance site in the east of the country.

Also using satellite data, Tawanda Chingozha, a PhD student from the University of Stellenbosch, presented a paper estimating changes in land under cultivation due to land reform policies in Zimbabwe. The conclusion was that the acreage under cultivation decreased after the fast track land reform programme.

Martin Wittenberg presented a paper with Tom Harris on electricity connections and household formation, as featured in an earlier blogpost. Using the National Income Dynamics Study (NIDS), the paper suggests that aggregate electricity statistics, such as access rates, conceal a considerable degree of the complexity and volatility inherent in the development of electricity access. Instead, he suggests that policy makers involved in electricity roll-out need to consider that household electricity access is a complex outcome of time-variant processes: net connections, and household formation and dissolution.

Wiebke Toussaint from the UCT Energy Research Centre documented the Domestic Load Research Project, a yearly survey of several hundred Eskom customers, the household data that has been produced by the project and the ERC’s aims to make some of the data available to the public. Grant Smith and Kathryn McDermott from the UCT School of Economics and JPAL presented a paper that showed the effects of changing to prepaid electricity meters as well as giving insights into the difficulties of using administrative data, in this case from the City of Cape Town.

Kathryn McDermott from the UCT School of Economics and JPAL speaking

Chris Park from the UK Data Service submitted a presentation on behalf of Simon Elam from UCL, on “Data Quality: The elephant in the (big data) room”, which looked at the opportunities for using and issues with the quality of data from smart meters.

Measurement of individual and household well-being

Nilmini Herath from JPAL Africa at UCT presented a paper sharing JPAL’s insights gained in running surveys in South Africa. The presentation led to a helpful cross-pollination of ideas with the Stats SA participants who were interested in understanding and improving fieldwork quality in Stats SA data. Emmanuel Bakirdjian of JPAL Africa at UCT presented a related paper giving insights into how different forms of measurement can complement or improve the self-reported data traditionally used in household surveys.

Two of the presenters were students on the first (2016) cohort of the new Post-Graduate Diploma in Survey Data Analysis for Development at UCT, which has been put together by DataFirst, together with SALDRU and the School of Economics. Karabo Sebolai from StatsSA presented a paper comparing imputation and reweighting adjustments for non-response in the Agricultural Survey run by Stats SA, whilst Phumudzo Madzivhandila, also from Stats SA, examined the robustness of multidimensional poverty estimates to different weights for the various components of the poverty estimates.

Professor Steve Koch from the University of Pretoria presented a paper estimating equivalence scales for South Africa, using the 2010 Income and Expenditure Survey. One implication of his work is that standard procedures used for adjusting incomes for household size and composition (normally done by calculating per capita figures) probably over-adjust. This would make big households look poorer than they probably are.
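To see why per capita figures can make large households look poorer, it helps to compare them with an equivalence scale that allows for economies of scale. The sketch below assumes a simple power scale, income divided by household size raised to an exponent `theta`; this functional form and the numbers are illustrative assumptions, not the scales estimated in the paper.

```python
def equivalised_income(household_income, household_size, theta):
    """Adjust income for household size using a power scale.

    theta = 1.0 reproduces the per capita figure; theta < 1.0 assumes that
    larger households benefit from economies of scale. The choice of theta
    here is purely illustrative.
    """
    return household_income / (household_size ** theta)


# Hypothetical household: monthly income of 12000 shared by 4 people.
income, size = 12000.0, 4

per_capita = equivalised_income(income, size, theta=1.0)
square_root_scale = equivalised_income(income, size, theta=0.5)
print(per_capita, square_root_scale)
```

Under the per capita adjustment this household has half the adjusted income it would have under the square-root scale, which illustrates how a heavier adjustment for size pushes large households down the distribution.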

Martin Wittenberg presented work which suggested that there were sampling issues with the Living Conditions Survey. It looked as though the survey was finding fewer rich people at the end of the survey period than at the beginning. This could be related to fieldwork fatigue. Similar questions have been raised about the diary method of collecting consumption data over a whole year.

Measuring labour market changes

In the final session, Andrew Kerr from DataFirst presented two papers using the PALMS data (Post-Apartheid Labour Market Survey series), a compilation of Stats SA labour market data from 1994-2015. The first was on how the standard errors in the Quarterly Labour Force Survey (QLFS) are estimated by Stats SA. It suggested that quarter-on-quarter changes are probably noisier than consumers of the data think. The second dealt with the quality of the earnings data in the QLFS and the changes over the period that make comparability over time more difficult.
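A rough sense of why changes between quarters are noisy comes from the textbook formula for the standard error of a difference between two estimates. The sketch below uses that generic formula with invented numbers; it is not the estimation method discussed in the paper or used by Stats SA, and the correlation `rho` between quarters is an assumed parameter.

```python
import math


def se_of_change(se_prev, se_curr, rho=0.0):
    """Standard error of (current - previous) for two estimates.

    rho is the assumed correlation between the two estimates, for example
    from overlapping samples; rho = 0.0 treats the quarters as independent.
    """
    variance = se_prev ** 2 + se_curr ** 2 - 2.0 * rho * se_prev * se_curr
    return math.sqrt(variance)


# Hypothetical figures: each quarterly estimate has a standard error of 3.0.
se_independent = se_of_change(3.0, 3.0)
se_overlapping = se_of_change(3.0, 3.0, rho=0.5)

# A hypothetical observed change of 5.0 set against an approximate 95% margin.
observed_change = 5.0
margin = 1.96 * se_overlapping
is_distinguishable = abs(observed_change) > margin
print(se_independent, se_overlapping, is_distinguishable)
```

Even with the assumed positive correlation, the standard error of the change is as large as that of a single quarter, so a change that looks sizeable can still sit inside the margin of error.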

Andrew Kerr from DataFirst speaking

There was lively discussion on all the papers which continued over lunch, tea and the conference dinner. Fieldwork quality, measurement error and how to share different types of data came up repeatedly. There was also very useful cross-disciplinary engagement, with the Office of Astronomy for Development particularly interested in the use of remote sensing for measurement.

Louise Corti, Principal Investigator on the UK side of the UK-South Africa project, notes that “The International Astronomical Union (IAU) is the largest body of professional astronomers in the world who have set up the Office of Astronomy for Development (OAD), in partnership with the South African National Research Foundation (NRF). This is a wonderful opportunity for collaboration across disciplines to harness powerful data resources in the investigation of critical development issues”.


The return of ‘Encounters with Big Data’: our summer school at the University of Essex

Louise Corti, Director of Collections Development and Producer Relations at the UK Data Service, reports back from the rerun of our successful summer school, Encounters with Big Data: An Introduction to using Big Data in the Social Sciences, held at the University of Essex, Colchester from 31 July to 4 August 2017.

The course met one of the objectives of the Smarter Household Energy Data project, a joint International Centre Partnership Grant between the UK Data Service and DataFirst and funded by the Economic and Social Research Council in the UK and the National Research Foundation in South Africa, to create a collaborative research infrastructure for large-scale household energy data.

Louise, UK Principal Investigator on the project, said: “We are delighted to be able to run this course for a second time, after it was so well-received in Cape Town earlier this year. Our joint project finally finishes in August, and our two fantastic trainers, Peter and Chris, are leaving the UK Data Service, so it was wonderful to be able to squeeze this last capacity building event in”.

The course, aimed at experienced researchers, statisticians, or data analysts, covered aspects of data extraction, exploration, basic analysis and visualisation of big data using a Hadoop system. The platform uses solutions that can deliver data at scale, with speed, and include Hive, Spark, and Zeppelin, which integrate seamlessly with popular data analysis environments like Python and R.

This course was run as part of the University of Essex’s Institute for Analytics and Data Science (IADS) summer school, which participants paid to attend. We received applications for the course from 60 people, with places available for only 20. Once again, prerequisites were set to ensure that participants could fully participate, requiring experience using quantitative research data in the social sciences, a good understanding of statistical methodology, and competence in writing commands in a statistical computing environment like Stata, R or SPSS.

Participants came from home institutions and from a number of countries including Denmark, Spain, Italy, Malta, Canada, Korea, Mexico, Mongolia and the USA. Positions ranged from professors and research assistants to postgraduates, statisticians and data analysts, spanning fields from economics, sociology, criminology and geography to marine ecology and genomics.


The summer school participants, teachers and organisers

Workshop content

The first day offered participants an overview of how big data methods can be applied to social science, using new tools and enhanced modelling techniques. Nathan Cunningham, Director of Research IT and Innovation at the UK Data Service, introduced the concept of big data and provided an overview of the work being done in-house to build the UK Data Service’s new Data Services as a Platform (DSaaP) initiative, currently being implemented in partnership with Hortonworks, a leader in Hadoop-based data technologies.

Louise Corti then presented on how national statistics offices around the world have been exploring alternative sources of collecting data on citizens (to replace or complement surveys and censuses), and showed an example looking at tourism data from World Heritage Sites on Wikipedia. Participants worked in groups to devise their own ‘national statistic’ based on free data from the internet and quickly came up with some innovative ideas for data and analyses including:

  • Crime Rates: What kind of people are more likely to be defrauded online, i.e. those who use the internet more or less? Maybe hard to get data on those known to be defrauded.
  • Tourism: Does watching films with particular locations inspire people to travel to those places? Identify locations in the most popular films over the last year and look at travel destinations from flight statistics.
  • Pollution: Does pollution affect sports performance or fitness? Link data from air pollution sensors with fitness tracking devices such as Fitbits. Would be hard to control for people’s level of fitness.
  • Deprivation: Measure level of deprivation by patterns of mobile phone usage in areas using cell tower records.

Libby Bishop, Manager for Producer Relations and Research Ethics at the UK Data Service, followed with a talk on the role of legal and ethical concerns when using big data. She cited examples of conditions under which research using Twitter data can be shared, and the increasing risk of re-identification given the proliferation of multiple public data sources available about people. Libby outlined the 5 Safes Framework actively used and promoted by the Service as a robust approach for controlling access to disclosive data, so that rich policy-relevant analysis could be safely undertaken. Two guest speakers, James Allen-Roberson and Christian Kemp, from the Department of Sociology at Essex, showcased their work on using the dark web in research, discussing the pros and cons of using such hidden sources.

Over the following two days, Peter Smyth and Chris Park delivered presentations, demonstrations and led exercises focused on manipulating data in Hadoop using a freely available Hadoop Sandbox.

Peter Smyth demonstrated how to use the Hive Query Language (HiveQL) to examine the contents of the datasets and to ‘slice’ and ‘dice’ a dataset into smaller datasets which can be used by desktop applications. He also introduced Zeppelin notebooks as a means of carrying out interactive data analysis within a big data environment. Chris Park introduced Apache Spark, a high performance, distributed computation engine designed for handling and analysing big data. He demonstrated how to scale out small-scale analyses by harnessing the power of Spark from R using the SparkR package, and Peter concluded by giving an overview of Leaflet, one of R’s libraries for producing spatial visualisations such as choropleth maps.
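The idea of cutting a large table down to a smaller extract can be sketched with plain SQL. The example below uses Python's built-in sqlite3 module as a stand-in for Hive, since a Hadoop sandbox is not assumed here; the table, column names and values are hypothetical, and only basic SELECT, WHERE and GROUP BY clauses are used.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (an assumption made
# so the example runs anywhere; the course itself used HiveQL on Hadoop).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (household_id INTEGER, region TEXT, year INTEGER, kwh REAL)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        (1, "north", 2015, 310.0),
        (2, "north", 2016, 295.5),
        (3, "south", 2016, 402.0),
        (4, "south", 2016, 120.25),
    ],
)

# Keep only the rows for one year and only the columns needed downstream.
subset = conn.execute(
    "SELECT household_id, kwh FROM readings WHERE year = 2016 ORDER BY household_id"
).fetchall()

# Reduce the data further by aggregating to one row per region.
totals = conn.execute(
    "SELECT region, SUM(kwh) FROM readings WHERE year = 2016 "
    "GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(subset)
print(totals)
```

Either result is small enough to export and open in a desktop tool, which is the point of extracting subsets from a big data store.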

Day four covered tools and techniques for getting and converting external data called from APIs, and how to interpret the results in JSON format. Peter concluded with an overview of end-to-end process tools available via Hadoop.
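Interpreting an API response usually means parsing the JSON text and flattening the nested records into rows. The following is a minimal sketch using Python's standard json module; the payload and its field names are invented for illustration and do not correspond to any particular API used on the course.

```python
import json

# Hypothetical API response body; the structure and field names are invented.
response_text = """
{
  "data": [
    {"id": 101, "attributes": {"title": "Example A", "count": 1200}},
    {"id": 102, "attributes": {"title": "Example B", "count": 45}}
  ]
}
"""

# Parse the JSON text into nested Python dictionaries and lists.
payload = json.loads(response_text)

# Flatten each nested record into one flat row, ready for a table or dataframe.
rows = [
    {
        "id": item["id"],
        "title": item["attributes"]["title"],
        "count": item["attributes"]["count"],
    }
    for item in payload["data"]
]

total_count = sum(row["count"] for row in rows)
print(rows)
print(total_count)
```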

Participants then moved onto formulating their own group projects to consolidate what they had learned over the past four days using open data from the internet. For the remaining day, the groups worked on their projects, accessing structured data from the web, importing, linking them where necessary and conducting exploratory data analysis and visualisations. The group projects turned out to be really interesting with all teams exploring unfamiliar open source data sources and software packages. The five teams came up with some great ideas ranging from use of Twitter data, Google trends data, weather data, accident databases and other sources to look at racism, use of statins, pedestrian and car safety, and weather and mood. A range of R tools were used including ggplot2, Rtweet, gtrendsR and wordcloud(R).

It was hard to choose an outright winner, so on the final afternoon we awarded prizes to two teams: ‘Racists be damned’ (team of 2) and ‘Auto Choice Model’ (team of 3).

Team ‘Racists be damned’

  • This group of two used online data available from the UK Government and Parliament Petitions site. A recent example was a petition for a second Brexit referendum, which 4.2 million people had signed. The team extracted JSON data from the site’s API for petitions with the term ‘immigration/immigrants’ in the title (N=100 or so out of a possible 31,732 petitions). The intention was to conduct sentiment analysis on the data and then map this to the UK Index of Multiple Deprivation – but this proved tricky in the time available. They hand-classified the petitions as positive or negative regarding immigration/immigrants. They plotted by constituency the proportion of people who signed the anti-immigration petitions (dark shading = higher votes for negative petitions). They converted the JSON data into a regular dataframe, and used R and leaflet, with a little help from Chris and Peter. One of the team declared his pride in creating a map for the first time.
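
The general shape of that workflow, nested JSON flattened into rows and then summarised by constituency, can be sketched as follows. This is a Python illustration, not the team’s R code. The sample records, the field names (stance, signatures_by_constituency, signature_count) and the constituency names are all hypothetical, and the choice of denominator (all signatures on the selected petitions within a constituency) is an assumption, since the post does not say how the proportion was defined.

```python
import json
from collections import defaultdict

# Hypothetical API responses: one record per petition, with a
# hand-assigned stance and signature counts per constituency.
raw = """
[
  {"title": "Petition A", "stance": "negative",
   "signatures_by_constituency": [
     {"name": "Northtown", "signature_count": 30},
     {"name": "Southport", "signature_count": 10}]},
  {"title": "Petition B", "stance": "positive",
   "signatures_by_constituency": [
     {"name": "Northtown", "signature_count": 10},
     {"name": "Southport", "signature_count": 30}]}
]
"""


def flatten(petitions):
    """Turn nested petition records into flat (constituency, stance, count) rows."""
    return [
        (area["name"], p["stance"], area["signature_count"])
        for p in petitions
        for area in p["signatures_by_constituency"]
    ]


def negative_share(rows):
    """Share of each constituency's signatures that went to negative petitions."""
    total = defaultdict(int)
    negative = defaultdict(int)
    for name, stance, count in rows:
        total[name] += count
        if stance == "negative":
            negative[name] += count
    # Constituencies with no signatures at all are skipped.
    return {name: negative[name] / total[name] for name in total if total[name]}


rows = flatten(json.loads(raw))
shares = negative_share(rows)
print(shares)
```

A dictionary like shares is the kind of per-area value that a mapping library can then use to shade each constituency.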


Some initial results from the ‘Racists be damned’ project team

Team ‘Auto Choice Model’

  • This group of three wanted to build an app that would give buyers the insight needed to purchase a suitable car. They looked for data on fuel economy and emissions, and attempted to correlate it with accidents. They found data on car attributes, such as family size. They also found a database of car accident locations in the UK by car make, model and engine size, and ranked from 1-3 by severity of accident. The team split up data preparation tasks and used Rmarkdown to keep track of work. They converted UK accident point data to Lower Layer Super Output Areas (LSOA), used inner joins on the datasets, and mapped accidents by model of car to LSOAs using leaflet. The team attempted a 4D plot, which turned out to be quite complex, showing car engine size by brand by severity of accident by accident location.

We believe that participants really do benefit from group work to exchange ideas and trouble-shoot issues arising. Given the short time frame, most of the groups divided up tasks between themselves.


Prize giving for the winning project teams

Feedback suggested that the course went at roughly the right pace, with only a couple of attendees struggling to keep up as the week progressed. I find that ensuring that all the prerequisites for a course like this are fully met is absolutely essential for any teaching I organise, but for this course run by an external provider, we were unable to vet participants’ applications. Participants for big data courses of this nature benefit from a reasonably high level of competence and experience in data handling and statistical analysis.

I would like to thank Emma McLelland from the IADS summer school and Sarah King-Hele for coordinating the logistics for the week. And huge thanks and fond farewell to our two fantastic trainers, Chris Park and Peter Smyth.

You can access the training materials here on our course GitHub page:

Some feedback

“Very good course indeed. Good structure, material, pace, content + teaching”

“Wonderful prep and professional presentation”

“I really liked it overall. Perhaps one suggestion. I would have liked to be given sample code for analysis actually run in the cloud, not just in our local VM”




Measuring dynamics of household electricity connections in a developing context: a longitudinal data approach

In this blog Louise Corti provides a short review of an illuminating paper just published by Tom Harris, Mark Collinson and Martin Wittenberg entitled, ‘Aiming for a Moving Target: The Dynamics of Household Electricity Connections in a Developing Context’ in Science Direct.

This piece of work has stemmed from a collaborative UK-South Africa ESRC-NRF International Centre Partnership award that focused on scaling up the analysis of household energy data, looking at the potential for better understanding the policy context. In the UK, policy work surrounding energy usage focuses on mitigating relative fuel poverty, in the first world sense, but for South Africa, the research focus is inequality in public service delivery; despite significant progress over the last 20 years, basic access to electricity is by no means stable or guaranteed, and remains one of the largest development issues faced by post-apartheid South Africa.

While you can read the findings yourself in the open access paper, here I want to draw out the innovative methodological approaches of the research, which has focussed on using data from a long-running health and demographic surveillance system (HDSS) site. From my point of view, the approach taken by the authors offers an immensely promising avenue for opening up access to other complex HDSS data by considering how data can be restructured using traditional panel study methodology and considering carefully designed units of analysis.

In their analysis, the authors investigated household electricity access in a poor rural setting in South Africa. Their approach critiques the existing literature in this field, which has tended to focus on investigating changes in an individual’s access to electricity, rather than taking into account the importance of the household unit and changes in household access. They acknowledge that while many existing studies portray the process of electricity roll-out as a simple, monotonic progression, it is often a far more complex picture than that. Recent literature typically uses binary indicators to measure progress in electricity access and to assess energy poverty, which suggest a strong association between electricity access and poverty. However, the authors note that the complexities of access transitions among the poor are not taken into account, and that the aggregate data sources used (for example, access rates at provincial level) do not offer rich enough information on changes over time.

A richer picture can be gained by investigating the short-term dynamics of electricity access using large-scale longitudinal data. The authors find short-lived deviations (in fact, periods of declines in access) from the long-term upward trend in electricity access, which they then go on to explore in more depth. The analyses were facilitated by the creation of new datasets derived from existing longitudinal studies: the National Income Dynamics Study (NIDS) and the Agincourt Health and Demographic Surveillance System (HDSS). Longitudinal data analysis techniques were based around a new categorisation of household type, defined by when a household forms and whether it continues to exist or not (persistence). Of note here is the novel approach used to re-present data from a typical HDSS.

The longitudinal data sources: sample and coverage

The National Income Dynamics Study (NIDS) is a panel study commissioned by the national Presidency in an effort to track long-run poverty and well-being and focuses on income, expenditure, labour market participation, education, health (including anthropometrics) and household well-being (e.g., access to services).  The baseline sample was designed to be nationally representative and consists of around 7,300 households and about 28,000 individuals, who became Continuing Sample Members (CSMs) for the subsequent waves. Babies born to CSM women become CSMs themselves and individuals who were co-resident with CSMs were also interviewed (Temporary Sample Members – TSMs).


The Agincourt Health and Demographic Surveillance System (HDSS) monitors key demographic events and socio-economic variables in the Agincourt sub-district in north-eastern Mpumalanga Province, South Africa. A baseline census was conducted in 1992, with annual census rounds being conducted since 1999. Key variables measured routinely by the HDSS include: births, deaths, in- and out-migrations, household relationships, resident status, refugee status, education, antenatal, and delivery health-seeking practices. Additional modules are also added, such as a household asset module, collected every second year since 2000, which includes information on household access to services, such as electricity. Temporary migrants are classified by including on the household grid non-resident members who retain significant contact and links with the rural home and ‘share a common pot’.

Novel data structures for measuring household outcomes

The authors’ longitudinal analysis of household outcomes rests on the ability to identify the same household in each wave of each study, in other words the surviving or continuing households. This approach was developed by the authors themselves in a previous publication, cited in the paper. Applying their new definition to NIDS and HDSS data allowed them to set up panels of individuals as new panels of households, through which longitudinal changes in household electricity access could be explored (using transition matrices). In order to account for bias in their chosen panel definition, for example from varying household formation and dissolution rates, they also created a comparison panel using the traditional “headship-based” definition. Patterns of electricity access statistics estimated on each panel were found to be similar.
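
For readers unfamiliar with transition matrices, the following minimal Python sketch shows the basic idea for a binary outcome such as electricity access. It is not the authors’ code: the household identifiers and statuses are invented, and the panel is assumed to contain only continuing households, that is, those observed in both waves.

```python
from collections import Counter

# Hypothetical panel: household id -> access status in each of two waves
# (True = connected). Only households observed in both waves are included.
panel = {
    "hh1": (True, True),
    "hh2": (True, False),
    "hh3": (False, True),
    "hh4": (False, False),
    "hh5": (True, True),
    "hh6": (True, True),
}


def transition_matrix(panel):
    """Row-normalised 2x2 matrix: P(status in wave 2 | status in wave 1)."""
    counts = Counter(panel.values())
    matrix = {}
    for start in (True, False):
        row_total = sum(counts[(start, end)] for end in (True, False))
        matrix[start] = {
            end: (counts[(start, end)] / row_total if row_total else 0.0)
            for end in (True, False)
        }
    return matrix


matrix = transition_matrix(panel)
for start, row in matrix.items():
    print("connected" if start else "unconnected", row)
```

Each row sums to one. The off-diagonal entry in the ‘connected’ row is the share of initially connected households that lost access, which is exactly the kind of movement that a cross-sectional access rate cannot reveal.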


From a methodological point of view, the authors conclude that aggregate statistics can conceal a considerable degree of the complexity and volatility inherent in the development of electricity access. Viewing access as a time-variant process rather than assuming linear roll out of services needs to be appreciated and studied. The authors conclude that further studies into service delivery in LMICs should consider using the longitudinal techniques applied in their paper, taking into account household dynamics.

While I am by no means an expert in the analysis of service delivery or utilities, my appreciation of the collection, preparation, structuring and curation of long-running social surveys leads me to believe that there is great hope for improving access to complex longitudinal data from multi-million pound development and evaluation studies conducted in low- and middle-income countries. If we consider the number of million-pound population surveillance and intervention investments across the world that generate immensely rich data over many years, such as HDSS, Millennium Villages Project rural development sites, and other large-scale intervention projects, like Girls’ Education Challenge, we can envisage the great scope for research opportunities through access beyond the research teams. The sometimes highly complex structure inherent in HDSS-type data collections could be reworked to create new datasets that can be better understood by social scientists and those sitting outside of the population, migration and epidemiology research domains.

The approach the authors have taken in repurposing data to create traditional ‘panel data’, would benefit from being widely showcased and exploited as an inspiration for data-hungry researchers and, of course, for maximising impact opportunities for funders.


Reflections on ‘Encounters with Big Data’, our course in Cape Town

Louise Corti, Director of Collections Development and Producer Relations at the UK Data Service, and Chris Park, Data Scientist, report back from the course, Encounters with Big Data: An Introduction to using Big Data in the Social Sciences, held in Cape Town, South Africa from 30 January to 3 February 2017. Also see our Twitter hashtag #ZABigData17.

The course met one of the objectives of the Smarter Household Energy Data project, a joint International Centre Partnership Grant between the UK Data Service and DataFirst and funded by the Economic and Social Research Council in the UK and the National Research Foundation in South Africa, to create a collaborative research infrastructure for large-scale household energy data. Louise Corti, one of the principal investigators told us: “This course introduced part of our work on ‘scaling up’ data curation and user access approaches for big data, predominantly where larger or more complex data sources – bigger than the 5GB maximum download bundle of survey data that our users typically access – or where computationally-intensive and iterative modelling is needed.”

The course covered aspects of extraction, exploration, and statistical analysis of big data behind the UK Data Service’s new Data Services as a Platform (DSaaP) initiative being developed in partnership with Hortonworks, a leader in Hadoop-based data technologies. The new platform will be using solutions that deliver data at scale, with speed, and security, including Hive, Spark, and Zeppelin, which integrate seamlessly with popular data analysis environments like R and Python. We received applications for the course from universities across South Africa, including the University of Cape Town, University of Witwatersrand, Stellenbosch University, and the University of Pretoria, as well as from government agencies including the South African National Space Agency and the Human Sciences Research Council of South Africa. The participants were highly experienced in using cross-sectional and longitudinal survey data, public health and medical data, transport data, financial data, and satellite data.


Nathan Cunningham, Director of the Big Data Network Support directorate at the UK Data Service, summarised the training: “Our aim was to support researchers in understanding and analysing large and complex datasets, focusing on leveraging the power of popular statistical software like R within a big data environment.”

Workshop content

The first day offered participants an overview of how big data methods can be applied to social science, using new tools and enhanced modelling techniques. The Data Services as a Platform initiative was introduced by Darren Bell, lead Repository Architect at the UK Data Service, and Peter Smyth, Big Data Training Expert at the UK Data Service. Martin Wittenberg, Professor of Economics and Director of DataFirst at the University of Cape Town, presented his work on restructuring a 20-year health and demographic surveillance study based in South Africa, and showed an example of big data analysis using night-time satellite data from NASA as a proxy to explore the impact of electrification on South Africa’s economic development over time, highlighting the exciting opportunities that lie ahead for using big data in the social sciences.

Louise Corti presented on how national statistics offices around the world have been exploring alternative sources of collecting data on citizens (instead of surveys and censuses), and showed an example looking at tourism data from World Heritage Sites on Wikipedia. Participants worked in groups to devise their own national statistics focused ideas for analysis including:

  • Public Health: Understanding how Google searches on medical terms reflect disease prevalence and how bias in the data might be related to rural/urban access to technology, and the use of specialist terminology;
  • Crime Rates: Using crime statistics to identify policy intervention by mapping crime rates over time using aggregated provincial crime stats and police station reports of crime and location;
  • Transport: Studying transport policies to assess different needs for private and public transport. The team used Google tracking location for traffic showing how different categories of commuters are likely to experience different commuting times and local weather station pollution data as a proxy for congestion and emissions.
  • Monetary policy: Following the recent currency changes in Zimbabwe, the team assessed preparedness for new currency using Google location data and sentiment analysis of social media data from Twitter and Facebook to see what people are talking about.


Libby Bishop, Manager for Producer Relations and Research Ethics at the UK Data Service, followed with a talk on the role of legal and ethical concerns when using big data. She cited examples of conditions under which research using Twitter data can be shared, and the likelihood of increasing the risk of re-identification given the proliferation of multiple public data sources available about people. Libby explained that the UK Data Service is keen to promote the idea that disclosure risk is best viewed as a continuum rather than a dichotomous concept. She also suggested that the 5 Safes Framework, which is being actively used and promoted by the Service, is an excellent approach for controlling access to disclosive data, and a means by which rich policy-relevant analysis could be safely undertaken.

Over the next two days, Peter Smyth and Chris Park delivered presentations and led group exercises focused on manipulating data using tools supporting the DSaaP. Peter Smyth demonstrated how to create and query tables in Hive, and introduced Zeppelin notebooks as a means of carrying out interactive data analysis within a big data environment. Chris Park described ways of scaling out small-scale analyses implemented in R with SparkR, and introduced students to aspects of distributed computing and how and when it could benefit research. Practical exercises, such as fitting linear models to large-scale open data were used to put the theory and concepts into practice. Aspects of provisioning and monitoring Hadoop clusters using Ambari and data visualization using the leaflet library were also discussed.

Before participants moved onto formulating group projects, a concluding talk summarised the UK Data Service data processing pipeline. Over the course of the remaining two days, the groups worked on projects focused on accessing structured data from the web, then importing, querying and linking them for exploratory data analysis, modelling, and mapping. The group projects were very interesting and explored unfamiliar open source data sources and software packages. We ended the course by awarding a prize to ‘Team Synergy’, a group of demographic and health data managers and economists, who undertook a comprehensive spatial analysis of mortality in South Africa using Zeppelin, Hive, and SparkR.


We received some great feedback:

“Great introduction to whole field, gentle ease in to more technical concepts”

“Knowing what the other participants are working on is very helpful – a round robin type introduction saves a great deal of time during networking breaks as you can gravitate quickly towards those you want to follow up with. Really enjoyed the exercises – learning happens in action!”

“The practicals on how to navigate Hive were interesting and mind opening”

“I could easily follow the hdfs command lines and the effectiveness of Hadoop and think this software is going to be useful to my studies”

“Really enjoyed the R and Spark sessions. Felt like this is what I’ll be most able to work on in my future research (economics). The exercises were very useful – because the content was accessible and also because of the way that the code was templated for us to follow along with. Introductory lectures appreciated – was easy enough to follow. Chris’s anecdotes very welcome (thanks!)”

We would like to thank our South African hosts, especially Alison Siljeur at DataFirst, for helping organise and coordinate the workshop, making it a wonderful experience for us as tutors and participants.




Electricity Access Dynamics in South Africa

In this blog post, Tom Harris and Professor Martin Wittenberg from Cape Town University (UCT), report on the survey analyses they undertook on electricity transitions. Tom is a researcher in SALDRU (Southern Africa Labour and Development Research Unit) and DataFirst, with a Master’s degree in Economics. Martin is in the School of Economics and Director of DataFirst, and a PI on the UK-ZA energy project.


Lack of access to efficient, reliable and modern energy is a prevalent issue across the developing world, with one in five people lacking access to electricity. The use of biomass fuels and other ‘dirty’ and inefficient fuels has severe negative implications, not only for public health,  but for the environment  and economic development too; transitions up the energy ladder  (to electricity particularly) are accordingly associated with improvements in well-being and economic progress [1-7]. Since the use of such fuels is most prevalent among those already in poverty, this exacerbates the plight of the poor, and widens the gap between the rich and the poor [2, 5, 6]. Electrification and improvements in electricity access are therefore of considerable interest to policy makers within countries where wide-spread poverty and inequality remain the dominant issues.

In this regard, South Africa is a success story. Access rates improved from well below 40% in the early 1990s to nearly 80% by 2002. Our own analysis of household electricity access, in Figure 1, confirms that this progress has continued in recent years, and shows national household access rates to have risen by a further 9% between 2002 and 2012.


Certainly, these improvements are encouraging, and there is much that can be learned from the successes of the South African electricity roll-out process. However, our results also reveal a striking point not yet given much attention in the literature: this long-run improvement in electricity access is not the result of a consistent, monotonic increase in access rates. Instead, there are short-term deviations from the long-term upward trend. The most salient of these deviations are periods of declines in access, which are evident in both national and small-area data. However, the cross-sectional statistics presented in Figure 1 cannot explain whether these declines in access were the result of households losing access at a location, or whether they were due to groups of people migrating from connected household units and setting up new households in locations that lack access.

Therefore, in an effort to understand what contributed to the observed declines in access, we introduce two novel approaches. Firstly, in order to explore the relationship between household formation and electricity access, we categorise households according to when they form and whether they continue to exist or not, and investigate changes in access among these different household categories. Secondly, to examine transitions in access among households that continue to exist, we apply longitudinal techniques to a unique form of NIDS (the National Income Dynamics Study, a nationally representative household panel) that allows us to track household units over time [8, 9]. These results are shown in Figure 2 and Table 1 below.



Our findings (in Figure 2) suggest that household formation contributed to the decline in aggregate access over the 2008-2010 period in two respects. Firstly, rapid growth of the household population meant that the electricity roll-out could not keep pace with net household formation: more than 400,000 household units were added to the population, and yet the total number of connections increased by less than 100,000. Secondly, newly formed households were less likely to have electricity access than those that dissolved (77.7% vs. 84.2%) – suggesting that households with electricity access dissolved and were replaced by new households without access.

Moreover, it was not only household formation and dissolution that led to the observed decline in electricity access rates between 2008 and 2010. We also see a decline in access rates among those households that continued to exist (in Table 1). More specifically, we find that even though many additional household electricity connections were added between 2008 and 2010, these positive transitions were outweighed by numerous connection losses – with more than 800,000 continuing households (8.7% of the connected household population in 2008 that continued to exist) having lost an electricity connection by 2010.

However, the processes described above are not particular to periods of declines in access – as is evident in Figure 3 and Table 2, presented below.



Pervasive connection losses are observed even over the 2010-2012 period: a period in which aggregate access actually improved. In addition, while we argued that household formation and dissolution dynamics contributed to a decline in aggregate access rates between 2008 and 2010, in investigating the 2010-2012 period we find that these processes are also able to contribute to improvements in aggregate electricity access – with households that formed over this period being more likely to have had access to electricity than both those that dissolved and those that continued to exist.

The policy and theoretical implications of our findings are applicable to those working in developing contexts well beyond the borders of South Africa. We have shown that aggregate electricity statistics, such as access rates, conceal a considerable degree of the complexity and volatility that is inherent in the development of electricity access. In this light, we suggest that policy makers involved in the area of electricity roll-out are likely to be aiming for a moving target in the following three respects:

  1. the number of households may be growing faster than the rate of growth in connections, as a result of rapid household formation;
  2. people may be moving out of connected households and setting up new households in locations that lack access; and
  3. certain connected households that survive from one period to the next may actually lose their electricity connections.

We therefore suggest that in developing countries like South Africa, electricity access rates are unlikely to show consistent improvements, even in periods of rapid electricity roll out. Access rates do not simply improve as new connections are added. Instead, we argue that household electricity access is a complex outcome of two key time-variant processes: (1) net connections (new connections less disconnections) and (2) household formation and dissolution processes.
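
The arithmetic behind this ‘moving target’ can be illustrated with a short Python sketch. The starting figures below (10 million households, of which 8 million are connected) are hypothetical round numbers chosen for illustration, not estimates from our data; the changes of 100,000 net connections and 400,000 net new households mirror the orders of magnitude reported above for 2008-2010.

```python
def access_rate(connections, households):
    """Share of households with an electricity connection."""
    return connections / households


def rate_after(connections, households, net_connections, net_households):
    """Access rate once net connections and net household formation are applied."""
    return access_rate(connections + net_connections, households + net_households)


# Hypothetical starting point: 8 million of 10 million households connected.
before = access_rate(8_000_000, 10_000_000)

# Net connections grow by 100,000 while the household count grows by 400,000.
after = rate_after(8_000_000, 10_000_000, 100_000, 400_000)

print(f"{before:.1%} -> {after:.1%}")
```

Under these assumed figures the access rate falls even though the number of connections rises. More generally, when the household population is growing, the rate only improves if the ratio of net new connections to net new households exceeds the current access rate.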


  1. Vermaak, C., M. Kohler, and B. Rhodes, Developing an energy-based poverty line for South Africa. Journal of Economic and Financial Sciences, 2014. 7(1): p. 127-144.
  2. Kimemia, D., et al., Burns, scalds and poisonings from household energy use in South Africa: Are the energy poor at greater risk? Energy for Sustainable Development, 2014. 18: p. 1-8.
  3. van der Kroon, B., R. Brouwer, and P.J. van Beukering, The energy ladder: Theoretical myth or empirical truth? Results from a meta-analysis. Renewable and Sustainable Energy Reviews, 2013. 20: p. 504-513.
  4. Smith, K.R., National burden of disease in India from indoor air pollution. Proceedings of the National Academy of Sciences, 2000. 97(24): p. 13286-13293.
  5. Heltberg, R., Factors determining household fuel choice in Guatemala. Environment and development economics, 2005. 10(03): p. 337-361.
  6. Rao, M.N. and B.S. Reddy, Variations in energy use by Indian households: an analysis of micro level data. Energy, 2007. 32(2): p. 143-153.
  7. Vermaak, C., M. Kohler, and B. Rhodes, Developing energy-based poverty indicators for South Africa. Journal of Interdisciplinary Economics, 2009. 21(2): p. 163-195.
  8. Harris, T., Household Electricity Access and Household Dynamics: Insights into the links between electricity access and household dynamics in South Africa between 2008 and 2012, in School of Economics. 2016, University of Cape Town: Cape Town.
  9. Wittenberg, M. and M. Collinson, Household formation and household size in post-apartheid South Africa: Evidence from the Agincourt sub-district 1992-2003. 2014: A DataFirst Technical Paper 27. DataFirst, University of Cape Town, Cape Town.

UK visit to South Africa research sites

In August 2017, two members of the project team visited colleagues at DataFirst at the University of Cape Town (UCT) and ran a programme of short workshops. It was also a good opportunity to nail down plans together with our hosts, Lynn Woolfrey and Martin Wittenberg, for our week-long Big Data Research Summer School taking place in February 2017 at UCT.

Louise Corti gave the first workshop on preparing qualitative data for sharing and re-use in DataFirst’s excellently-provisioned Training Centre, on campus. The training was attended by some 20 staff from UCT.  Over the 3 hours Louise covered existing best practices and tools, looking at the value of sharing data, assessment and preparation of data, and showing how data can be best documented to capture as much context as possible. She highlighted the role of the design of consent forms, methods of anonymisation, and controlling access to data; three pivotal approaches that seek to balance ethics with sharing research information for the public good.

There is little sharing of qualitative data in South Africa, so this was a great opportunity for researchers and research librarians to hear from a pioneering expert in the field. Louise noted that the UK was very fortunate in that its funder, the ESRC, has funded this area of work for almost 25 years, enabling the UK Data Service to amass nearly 1,000 datasets in its collection of available qualitative research data. Participants were happy to receive a complimentary copy of the UK Data Service’s Sage handbook, Managing and Sharing Research Data: A Guide to Good Practice by Corti et al.

The workshop was followed by a networking lunch, where participants had an opportunity to come together to talk about their own data and about data management in their own institution, and indeed, the future role of DataFirst in helping move the sharing of qualitative data forward. As we were speaking, DataFirst had just received their first self-standing qualitative dataset (not a component of a larger quantitative study), a small semi structured dataset.

Day 2 kicked off with Nathan Cunningham’s half-day seminar on How to understand Big Data: Strategies for understanding your big data and getting new knowledge from it, with UCT staff in attendance from across the research and IT areas. This event was very much pitched at an introductory level to big data thinking, analysis, and tools. Nathan addressed the 4 Vs, noting that while Volume, Velocity and Variety were all indicative of the shape of big data you might be dealing with, the most important V of big data is its Value. We need new strategies for achieving this value for scientific research, and Nathan finished by showcasing the Hadoop Open Data Platform currently being set up to handle Big Data at the UK Data Service.

Focus on Health and Demographic Surveillance System data

The final full-day workshop took place at the research site of the WITS Rural Facility (WRF) of the University of Witwatersrand in Bushbuckridge, in the rural north east of South Africa. Twelve staff from the local Agincourt Health and Demographic Surveillance site (HDSS) (420 sq km covering 87,000+ people living in 14,000+ households and 26 villages in South Africa’s semi-arid rural north-east) and from the Africa Centre for Population Health (also running a similar HDSS, in rural KwaZulu-Natal) attended. Both are part of the broader INDEPTH Network, set up to provide a more complete picture of the health status of communities, using data collected from whole communities over extended time periods, which more accurately reflect health and population problems in low- and middle-income countries (LMICs). Its 43 member centres observe, through 49 HDSS field sites, the life events of over three million people in 20 LMICs in Africa, Asia, and Oceania. These sites are funded through multiple stakeholders, including the Wellcome Trust, NIH, and the EC.

How data are managed and shared

The morning session heard from the UKDS and DataFirst speaking about data curation opportunities, especially utilising big data environments, for good data management of more complex data and large-scale analysis. There were questions on the process of getting the Data Seal of Approval and also some clarification provided on the relationship between DataFirst and Stats SA. Louise pointed to the UKDS-ESPA brochure on the Guide to sharing social data in a multidisciplinary, stakeholder research. Other questions arose on: ownership of data; best practices in version control of data and assigning DOIs; which metadata – DDI-Codebook or Lifecycle?; file formats; methods of gaining consent; and models of data licensing and access, such as Secure labs and Safe rooms. Louise drew attention to the UKDS’ new 5 Safes animation, which explains succinctly how the secure access model works in providing safe access to ‘unsafe’ data.

Following lunch, staff from the HDSS spoke of various challenges and solutions in providing access to the complexity of their data assets from their sites. Mark Collinson, PI on the Agincourt HDSS, generously hosted the day. He noted that the event was also a good opportunity for HDSS data groups from different sites to meet and exchange notes on current and future working practices, and to think about how to move forward in harmonising some of their data. An example cited by Mark was the coding of labour market activity, which is not consistent across the sites (some use the ILO classification and some do not).

There are variations in what data and field methods information is published across each HDSS. INDEPTH was set up to encourage best practice in data management by developing a reference data model and incorporating it into a ‘resource kit’ for HDSS. While INDEPTH’s iSHARE (INDEPTH Sharing and Access Repository) portal holds useful core data from the HDSS sites, this is limited to summary information (on demographics, births, deaths, and in- and out-migration), and there is, as yet, very little data available for social and economic researchers to exploit. INDEPTH have also published an INDEPTH Stats App, a data visualization app which allows users to explore these basic demographic indicators on their mobile devices, through interactive, animated charts and tables, generated by the 52 HDSS.

As with many biomedical longitudinal studies, data are available via a bespoke individual request. For Agincourt, theme leaders evaluate all incoming data requests, which are optimised, and bespoke subsets are prepared at cost. There is also a purpose-built “1 in 10” Agincourt database used for training purposes, which gets updated every year. The Africa Centre uses a helpdesk to manage queries, and all their ETL scripts are archived as XML on a source control system (SVN) and are documented, published and extracted.

The final session of the day discussed various open source tools – including the Open Data Kit (ODK), REDCap, Smart Paper and Formhub – to manage fieldwork and administration.

Expanding opportunities for using HDSS data

Martin Wittenberg’s complete restructuring of the Agincourt data in linked panel data format (long form) at 3 levels (individual, household and community) represents a huge step change and may be useful as an exemplary showcase of how to open up data. While the 3-level data will likely be restricted to Secure Lab access, a less disclosive public use file could also be useful.

More fieldwork documentation would be useful too; and Mark noted that he would be publishing the Agincourt fieldwork manual.

Finally, there are plans to expand the current iSHARE programme (iSHARE2), through Wellcome Trust funding to enable all INDEPTH member centres to contribute fully documented, high-quality, micro-level datasets on the Scientific Programme in a timely manner. We very much look forward to more news on this.

There were some really positive outcomes from this meeting, which all felt represented a first step in better linking up the work on data preparation and access being undertaken by the HDSS and the national social science data archives. Further, we look forward to working with these HDSS sites to see how big data solutions may facilitate elements of their work.


UK Energy Research exemplar projects, September 2016

In this post Louise Corti and Chris Park from the UK Data Service (UKDS) provide an overview of some of the exemplar projects using energy data that provide valuable use cases for the UKDS’s new Hadoop-based data infrastructure, designed for the storage of both existing and new and novel forms of data drawn from current and future UKDS holdings. They also report back from the small researcher meeting held at the UCL Energy Institute in June.


On June 24 2016, a workshop was held as part of our ESRC-NRF International Centre Partnership grant between UK Data Archive and DataFirst, with a view to creating a collaborative research infrastructure for household energy data. A productive workshop had been held in Cape Town in January 2016 to discuss active research projects that make use of South African energy data. The present workshop was a small roundtable meeting made up of three of our colleagues from DataFirst and UK researchers who have expressed an interest in being early adopters of the system.

Nathan Cunningham, one of our Functional Directors at the UKDS, gave an introductory talk on what constitutes big data and emphasized the need to embrace new technologies that can enable us to innovate and meet new and evolving business challenges. Our new Data Service Platform is being developed in partnership with Hortonworks, the only Hadoop vendor offering a truly open-source distribution of Hadoop. The infrastructure is based on the Hortonworks Data Platform (HDP) as a reference architecture for data management and analytics, and Hortonworks DataFlow for dataflow management and orchestration.

[Figure: Reference architecture for Hadoop]

These are hardened, enterprise-ready systems that can deliver stability, security and a truly non-proprietary roadmap. The aim will be to deliver impact for both the existing data assets of UKDS – open, safeguarded, and controlled – and to offer new and novel services that can deliver both exploratory and hypothesis-driven analysis along with integrated data services that interface with popular data analysis tools such as SPSS, Python, and R.


Aidan Condron, our Senior Officer for Collections Development and Producer Relations, gave a short live demo using data from the Energy Demand Research Project (EDRP), hosted at the UKDS. He showed how Apache Hive, a data warehousing component that employs a dialect of SQL, can be used to rapidly aggregate data and dynamically produce household energy consumption profiles. He also showed how Apache Zeppelin, an in-browser notebook tool supporting multiple analytical frameworks such as R, Python, and Scala, can be used to graph households by geographic region and types of energy installed (electricity only or electricity and gas).
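
As a rough illustration of the kind of aggregation described here, the sketch below builds a mean consumption profile by hour of day for each household, which is what a GROUP BY query would do in Hive. It is written in plain Python rather than HiveQL, and the record layout (household id, ISO timestamp, kWh) and the sample values are assumptions made for illustration; they are not the EDRP schema or the code used in the demo.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical readings: (household_id, ISO timestamp, kWh used in the interval).
readings = [
    ("hh1", "2016-06-01T00:00:00", 0.20),
    ("hh1", "2016-06-01T00:30:00", 0.40),
    ("hh1", "2016-06-02T00:00:00", 0.60),
    ("hh1", "2016-06-01T18:00:00", 1.50),
    ("hh2", "2016-06-01T18:00:00", 0.90),
    ("hh2", "2016-06-01T18:30:00", 1.10),
]


def hourly_profile(rows):
    """Return {(household, hour_of_day): mean kWh per reading}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for household, stamp, kwh in rows:
        # Group every reading by household and by the hour it falls in.
        key = (household, datetime.fromisoformat(stamp).hour)
        totals[key] += kwh
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


profile = hourly_profile(readings)
for (household, hour), mean_kwh in sorted(profile.items()):
    print(f"{household} {hour:02d}:00 {mean_kwh:.2f} kWh")
```

On a laptop this works for small extracts only; the appeal of a platform like Hive is that the same grouping logic can be expressed once and run over the full dataset.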


Research use cases

The meeting then heard from researchers about their data-related challenges. First, Jonathan Chambers, a PhD candidate at the UCL Energy Institute who is using household energy data to generate data-driven models for assessing energy demand in UK dwellings, discussed the difficulties involved in transforming low quality data into a form that can be analysed. Many of his data sets suffered from general issues such as gaps, duplicates, and inconsistent formatting, as well as more specific problems such as timestamp parsing and adjusting for daylight saving time. He was also forced to restrict his analysis to a small number of dwellings due to processing limits on his laptop while running his iterative algorithms on large data sets, some of which exceeded 20GB in size. Given the complexity of his data pipeline and the size of the data sets used, it was agreed that his project lent itself well to testing the UKDS platform. The project was chosen as the leading use case from the UCL Energy Institute, and Chris Park, one of our Data Scientists, will be working closely with Jonathan in the next few months to provide scale and efficiency to his data transformation and modelling pipeline with our new data platform, which could make a qualitative difference to Jonathan’s research.
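
To make the general issues mentioned above a little more concrete, here is a minimal sketch of two of the checks: dropping duplicate timestamps and listing gaps in a regular series. It is not Jonathan’s pipeline; the half-hourly interval, the record layout and the sample values are assumptions for illustration, and daylight saving adjustments are deliberately left out.

```python
from datetime import datetime, timedelta

# Hypothetical half-hourly readings for one dwelling: (ISO timestamp, kWh).
raw = [
    ("2016-03-01T00:00:00", 0.31),
    ("2016-03-01T00:30:00", 0.28),
    ("2016-03-01T00:30:00", 0.28),  # exact duplicate
    ("2016-03-01T02:00:00", 0.35),  # 01:00 and 01:30 are missing
    ("2016-03-01T02:30:00", 0.30),
]

INTERVAL = timedelta(minutes=30)  # assumed reading interval


def clean(rows, interval=INTERVAL):
    """Drop repeated timestamps, sort, and list the missing timestamps."""
    seen = {}
    for stamp, kwh in rows:
        # Keep the first value seen for each timestamp.
        seen.setdefault(datetime.fromisoformat(stamp), kwh)
    ordered = sorted(seen.items())
    gaps = []
    for (prev, _), (curr, _) in zip(ordered, ordered[1:]):
        # Walk forward from each reading until the next one is reached.
        expected = prev + interval
        while expected < curr:
            gaps.append(expected)
            expected += interval
    return ordered, gaps


series, missing = clean(raw)
print(len(series), "unique readings")
print("missing:", [t.isoformat() for t in missing])
```

Checks like these are simple on a few thousand rows; the difficulty Jonathan described comes from running them, and the modelling that follows, over data sets of 20GB or more.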

Next, Anastasia Ushakova, from the UCL Department of Geography and School of Public Policy, spoke on her work on developing methods for detecting vulnerable consumers using smart meter data. She uses machine learning methods to detect patterns in energy consumption, which are used to group consumers into clusters to infer household characteristics, but her project is still in the exploratory phase and she hasn’t yet finalized her data processing pipeline. Louise Corti, Functional Director of Collections Development and Producer Relations at the UKDS (and UK Principal Investigator on this UK-South Africa project), pointed out that in the meantime, baselining household attributes and consumption patterns against known data sources such as Understanding Society might be useful. Hugh Shanahan, Senior Lecturer in Computer Science at Royal Holloway, University of London, suggested that Anastasia’s research could benefit from exploring alternative methods for detecting patterns, such as conformal prediction. It was decided that the UKDS would remain in close contact with Anastasia as her research project matures and keep track of her progress.

Jacqueline Beckhelling from the School of Civil and Building Engineering at Loughborough University then discussed her work on the DEFACTO project, which investigates how the use of digital control and feedback technologies can encourage efficient energy use. The project uses gas readings collected every 30 minutes and electricity readings collected every 2 minutes, yielding some 100GB of data, which is currently emailed to the researchers – to the tune of some half a million rows of incoming data each day. Jacqueline pointed out that data storage, transfer, and processing were big issues that couldn’t be overcome using existing tools, and it was agreed that her project merits further investigation to identify which components of her data pipeline can be fruitfully ported over to the UKDS system.

Finally, Moira Nicholson, another PhD candidate from the UCL Energy Institute, spoke of her work on using randomized controlled trials to test whether behavioural interventions could be used to boost the number of British consumers switching to off-peak electricity tariffs. She uses three different versions of web pages to examine tariff switching rates, and uses Google Analytics to analyse trends and compute useful metrics. While her work is interesting, it was agreed that it is unlikely to benefit from the UKDS platform, since the data used is neither big nor complex in its current form.

An interesting discussion then ensued, with researchers exchanging ideas about measuring statistical uncertainty at scale. Nathan Cunningham pointed out that the Hadoop ecosystem was ideally placed to provide heuristic tools for handling systematic errors. David Shipworth, Reader in the Built Environment at the UCL Energy Institute, then raised questions about how one could detect and flag data to indicate error and bias in the presence of non-uniform errors, and emphasized the need to better quantify uncertainties for new and novel forms of data to answer key policy questions.

Next steps

To address some of the issues discussed, the UCL Energy Institute, with input from the UKDS, will be hosting a workshop on smart meter data in October 2016, aimed at identifying a taxonomy of errors to construct a set of syntactic and semantic checks for handling smart meter data. The workshop will bring together experts from the wider energy research community, including the Centre for Sustainable Energy, Durham Energy Institute, Imperial College London, and the UCL Energy Institute, as well as data curation and modelling experts from the UK Data Service and the Institute for Social and Economic Research.

Overall, the meeting was highly productive: it helped consolidate some high value use cases for energy research, and provided clarity and direction for the UKDS moving forward. The UKDS has initiated a Big Data Webinar Series, in which Peter Smyth, our Senior Officer for Training and User Support, introduces components of the Apache Hadoop ecosystem, including Hive and Spark, and invites guests to speak on issues such as: exploring diet and obesity in children using geo-demographic classifications; linking data from social media, traffic sensors, and other data sources to explore social aspects of the urban environment in the Glasgow area; and the impact of online shopping on the UK high street.

The UKDS will also contribute to cutting edge energy research projects to bring innovative big data technologies to the forefront of academic research: Bashar Al-Hnaity, one of our Data Scientists, will work closely with David Shipworth and Simon Elam from the UCL Energy Institute to develop innovative, data-driven models using our new data platform to encourage efficient policy making in the energy sector.


Household Energy Data Workshop, January 2016, University of Cape Town

In this post Louise Corti from the UK Data Service reports back from a researcher meeting held in Cape Town.


This workshop was part of an ESRC-NRF International Centre Partnerships grant between UK Data Archive and DataFirst, which is looking at a collaborative research infrastructure and data compilation project in the area of household energy data. A big focus of the project is anticipating and finding solutions for the challenges of ‘big’ and complex data in three main areas. The first is confidentiality, legal and ethical issues, particularly important in relation to administrative data and local level census and Demographic Surveillance Site data. The second relates to size and complexity: for example, municipal datasets are of a size that social scientists in South Africa are not used to analysing, and combining data from different sources raises analysis issues of a different type. The third concerns data quality: missing data, data errors and uncertain or unknown provenance need to be identified and dealt with.

Professor Martin Wittenberg, Economist and Acting Director of DataFirst at the University of Cape Town (UCT), introduced the focus of the workshop, explaining that it was very much to kick-start some knowledge engagement on these matters for social scientists in South Africa, and to brainstorm ideas that might help lead to action and tangible research projects. A similar workshop had already been held with researchers in the UK in December 2015. One key aim was to identify potential data sources and research projects that could feature as demonstrators for the project. Participants came from various sectors, spanning a good mix of energy researchers, economists, health researchers and national statisticians. These included: the UCT Energy Research Centre, the African Climate and Development Initiative, Statistics South Africa, the School of Economics, the Jameel Poverty Action Lab Africa (JPAL), the MRC/Wits University Rural Public Health and Health Transitions Research Unit (Agincourt) and Dartmouth College. Martin alluded to some recent high profile visitors to Cape Town, including the UK cricket team, Mumford and Sons, and now the UK Data Archive, surely the ‘rock stars of data’!

The morning session comprised three talks on active energy research projects.

Municipal data

Grant Smith spoke about the work he and his colleague Martine Visser are doing with municipal electricity data, derived from consumption data collected by municipalities based on billing information. Using prepaid vending data from some 600,000 customers (75% of Cape Town customers from 2014, and some data dating back to 1993) with some detail on tariffs over time, and billing records from 2005 for credit electricity, they are undertaking research on residential prepaid customers from 2004, and had already begun to analyse patterns of consumption among poor and rich households. The data came directly from commercial SAP business warehouse records, largely as text documents, plus records of transactions and debt. Across different departments there are various versions of the data, held in different structures. At a large (for social science) scale of 370 GB and some 300,000 households with daily data, a lot of extraction and cleaning work needed to be done to prepare the data for processing.

Added to this data were property values (in quantiles), property subdivision history, especially over the last 10 years, and GIS locations (as the primary key for linkage). Evidence arising so far was that the poorest households make far smaller transactions per month. Further research papers would be coming on the provision of free water and electricity, and household welfare.

Issues arising from this area of work fell into four main areas. First, data extraction challenges at scale; the statistical processing is quite RAM-constrained and the team felt they needed to learn new languages like Python to help manipulate data. Second is the messiness and structure of the data sources; some standardised data collection and management by cities would be beneficial and could also be useful for direct interventions, such as studying the influence of nudges on behaviour across different groups. Fortunately, South Africa’s POPI Act (like the UK DPA) does have caveats built in for research like this.

Third is the confidentiality of the data.  The current data sharing agreement with the City is limited at present to a single research project, but a broader licence would be hugely beneficial, such as ones operated under the controlled environment of DataFirst’s Research Data Centre. Finally, they are concerned with how to link this type of administrative data to other data sources (such as to survey data, via geocoded data/GPS coordinates) to gain a finer grained picture of local electricity access and use.  This matching capability is needed to be able to evaluate national programmes and to look at trying to measure the developmental impacts of improving electricity access. This early work on Cape Town municipal data would likely be extensible to other municipalities in South Africa.
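
The linkage via GPS coordinates mentioned above can be sketched as a nearest-neighbour match on great-circle distance. This is only an illustration of the idea, not the team’s method: the identifiers, coordinates and the 100 m threshold are all hypothetical, and in practice such matching would need to happen inside a controlled environment because of the disclosure risk it creates.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Hypothetical records: id -> (latitude, longitude).
meters = {"m1": (-33.925, 18.424), "m2": (-33.960, 18.470)}
survey = {"s1": (-33.9251, 18.4241), "s2": (-34.100, 18.800)}


def link(survey_points, admin_points, max_km=0.1):
    """Match each survey point to the closest admin point within max_km."""
    links = {}
    for sid, (slat, slon) in survey_points.items():
        best_id, best_km = None, max_km
        for aid, (alat, alon) in admin_points.items():
            d = haversine_km(slat, slon, alat, alon)
            if d <= best_km:
                best_id, best_km = aid, d
        links[sid] = best_id  # None when nothing is close enough
    return links


matches = link(survey, meters)
print(matches)
```

The brute-force loop compares every survey point with every meter, which is fine for a sketch but would need a spatial index at the scale of a whole city.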

Rural electrification

The second session looked at projects on rural electrification. The electrification of South Africa’s rural areas has significantly increased access since the end of Apartheid, and research is addressing what, if anything, the impact of this has been. Mark Collinson and Taryn Dinkelman (via Skype) spoke about the Wits/MRC Rural Public Health and Health Transitions Research Unit (Agincourt) project, which has been monitoring demographic and health changes in the Agincourt area since 1992. Since 2000 the team have also monitored changes in infrastructure and household assets as well as labour market outcomes. They had amassed a valuable set of panel data dating back to the 1990s, which would be turned into a shareable resource.

Questions for future work include whether we can see the impact of the electricity roll-outs on household outcomes in this area, and what additional data, such as the timing of the roll-outs and consumption information, we could bring to bear.

Big data platforms

Following a networking lunch, the UK team spent the afternoon outlining the work being done in the UK to help ‘scale up’ data activities for social scientists, in particular building a prototype service infrastructure for managing big data and making it accessible. Louise Corti, Director of Collections Development at the UK Data Archive at the University of Essex, spoke about the UK’s role in the UK-ZA Centre partnership grant. Providing use cases on managing and researching household energy data would help the UK develop and deliver an infrastructure that includes technical, governance and security models for ingest of and access to data. A blueprint of a working system could then be delivered to DataFirst. She was very pleased to be working with the Centre for Energy Epidemiology (CEE) at University College London (UCL) to provide the domain research expertise. So far the UK Data Service has attracted a few active researchers who were keen to pursue projects making use of the big data environment being developed; in essence, to see how it compared with more traditional methods, and what analytic and visualisation tools might be available. The work also paid attention to methods of ingesting and managing data from real-time data sources, such as regular data capture from household energy smart meters, and to the linking of data sources, which is likely to raise disclosure risk: how is this managed in a big data environment? A second aim was to undertake an audit of energy data sources in both countries. Finally, the project would be creating instructional materials for a week-long summer school to upskill social scientists in managing and analysing new and novel forms of data, to be held at UCT in February 2017.

Simon Elam of the CEE spoke about the role of the CEE in brokering access to and use of energy smart meter data for the UK, based on the UK government’s plan to install a meter in every household in the UK by 2019. The work of his centre aimed to: improve the collection of empirical energy demand and related data; help combine and curate relevant data sets; and help researchers, industry and policy makers use energy data and the results of data analysis by securing open access to data.

Simon also spoke of the ‘Vulnerable Customers and Energy Efficiency (VCEE)’ project currently in operation, which sought to engage with the ‘fuel poor’ so they could benefit from energy efficiency and demand side response. The project included an energy supply company, the supplier for all VCEE participants, which would install electricity and gas smart meters and temperature loggers in participant homes.

Dr. Nathan Cunningham spoke about the cutting-edge work being done under the UK Data Service’s current Big Data Network Support award to develop a demonstrator for a modern, fit-for-purpose research data infrastructure utilising current industry-standard Open Data Platform technology in an affordable and accessible way. This modern platform takes its standards from the open communities and aims to enable ready discovery, linkage and visualisation of datasets of interest to the social scientist. A key issue here is moving the processing to the data and not the other way around. The UK Data Service is designing a blueprint for a fully functional infrastructure that could, if realised, be transferred and taken up by UCT.

Combining local and national data

The final session of the day saw Martin Wittenberg and his PhD student, Tom Harris, talk about the opportunities that combining local and national data could bring. While the projects discussed in the morning involved micro-studies of two very different local contexts, much of the national policy discussion actually occurred on the basis of information supplied by national level datasets (the census, information from the General Household Surveys, Eskom data). The session questioned whether national level datasets were informative enough about local conditions, or could be used to improve our understanding of local conditions, and whether local information could be used to improve the quality of national data.

Tom spoke of his and Martin’s detailed work on the Post-Apartheid Labour Market Series (PALMS) data set. This included a large-scale rescue of older, smaller surveys from the national statistical institute, administrative and market research data sources, which involved harmonisation and cleaning of some 28 surveys. A tricky issue was constructing reliable weights over time, and some work had been done by Martin on smoothing weights and flagging outliers. Critical here was the release by DataFirst of the code for cleaning the data, demonstrating excellent research transparency.

The wrap up session invited participants to think about connecting up data, research questions and forward-looking  infrastructure. The cocktail reception invited further networking and discussion about the pragmatics of working together to test out the open data platform and explore new ways of working that utilised opportunities offered by the big data platform.

Our next meeting for researchers using household energy data will be held on 24 June 2016 at UCL in London.
