Using nightlights data to map rural electrification in South Africa: observations from a quantitative methods graduate

In this blog Camille Corti-Georgiou, summer intern at the UK Data Service at the University of Essex, offers her own reflections on the workshop, ‘Household energy use in the Agincourt area’, held in July 2017 at the Wits Rural Facility in north eastern South Africa. Camille has just completed her first degree in Political Science and International Relations and is about to embark on a Masters of Enterprise at the University of Manchester Alliance Business School.


I attended the ‘Household energy use in the Agincourt area’ workshop, organised by Professor Martin Wittenberg (DataFirst at Cape Town University) and Dr. Mark Collinson (Senior Researcher: MRC/Wits Rural Public Health and Health Transitions Research Unit), as a student of quantitative methods with a fairly limited understanding of longitudinal and ‘big data’ and of the methods employed by the team at Wits Rural. Yet what stood out to me, as a recurring theme throughout the two days, was the potential scope of the work being carried out in Agincourt. The workshop consisted of multiple presentations and talks looking at various aspects of the study; in this post I want to pay specific attention to the presentation given by Professor Martin Wittenberg on the use of nightlights data in the Agincourt area.

The Agincourt HDSS

The Agincourt Health and Socio-Demographic Surveillance System (HDSS) is, by all accounts, a significant research undertaking, yielding a powerful database for studying many aspects of rural life in a developing region. The principal goal of the HDSS is to offer a better understanding of the dynamics of health, population and social transitions in northeast South Africa. With twenty years of data collected since the initial baseline survey, the research unit in Bushbuckridge now holds a rich and substantial set of data with policy-influencing significance. Moreover, I was fortunate to see the exemplary framework they have developed within which future research may be undertaken.


Wits Rural Research Facility, Bushbuckridge, South Africa

Throwing Light on Rural Development

For developed regions, the use of satellite data in mapping urbanisation has been widely tested and validated. Measuring the electrification of rural areas, using the same method, however, is a much newer phenomenon. At Wits, the team were interested in whether the satellites would pick up the temporal patterns of rural electrification in the Agincourt area, to allow for analyses that could be corroborated by the data collected on the ground.

Takwanisa Machemedze from DataFirst advanced the actual technique of linking satellite nightlights (as shown in Figure 1) and local data, finding a definitive correlation between the two.


Figure 1. NASA satellite view of Earth at night, compiled from 400 satellite images

In explaining this technique, Martin repeatedly referred to the sum of lights (SOL), the sum of all pixel values for a particular region. While in developed regions a SOL measure will generally suffice, in rural regions it can be problematic. Instead of using the average brightness across the pixels for an area, Takwanisa broke up the pixels and matched them to the shape of the boundary before adding them up. While this approach resulted in some loss of brightness, it gave the most accurate indication of electrification in the area.
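
One way to read "breaking up the pixels and matching them to the boundary" is as an area-weighted sum: each pixel contributes its value multiplied by the share of the pixel that falls inside the boundary. The sketch below is only an illustration of that idea, not the method used by the DataFirst team. The grid values, the rectangular boundary and the function names are all hypothetical, and the share of each pixel is approximated by sampling a grid of points inside it.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: List[Point]) -> bool:
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def weighted_sum_of_lights(grid: List[List[float]],
                           polygon: List[Point],
                           samples: int = 10) -> float:
    """Sum pixel values, each weighted by the share of the pixel inside the polygon.

    grid[row][col] is assumed to cover the unit square
    [col, col + 1] x [row, row + 1]. The inside share is approximated by
    testing samples x samples evenly spaced points within each pixel.
    """
    total = 0.0
    for row, values in enumerate(grid):
        for col, value in enumerate(values):
            hits = 0
            for i in range(samples):
                for j in range(samples):
                    px = col + (i + 0.5) / samples
                    py = row + (j + 0.5) / samples
                    if point_in_polygon(px, py, polygon):
                        hits += 1
            total += value * hits / samples ** 2
    return total


# Hypothetical 2 x 2 brightness grid and a boundary covering
# half of the left-hand column and all of the right-hand column.
grid = [[10.0, 20.0],
        [30.0, 40.0]]
boundary = [(0.5, 0.0), (2.0, 0.0), (2.0, 2.0), (0.5, 2.0)]

simple_sol = sum(sum(row) for row in grid)
weighted_sol = weighted_sum_of_lights(grid, boundary)
```

In this made-up case the weighted sum is lower than the simple sum of all four pixels, which mirrors the "loss of brightness" mentioned above: light that falls outside the boundary is no longer counted.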

Previous analyses conducted on the data by DataFirst found steady progress in electrification, with a consistent upward trend. What was most interesting, however, were the periods of decline and deviation from this trend, most notably the huge electricity dip across the whole of South Africa in 2008. As a result of the Eskom collapse, which stemmed from generating capacity failing to meet the demands of the growing economy, a state of electrical emergency was declared. The massive load shedding and electricity rationing that followed showed up in the satellite data. Even in the wake of the collapse, light levels failed to recover, owing to the massive tariff increases that followed. At this point it was noted that one of the drawbacks of using satellite data to measure electrification is that it only detects external light. As one of the workshop attendees pointed out, electrification could very well have still been occurring in the area, with people opting to use only internal lights, so that their households would not have been detected.

Data from the ground

Moving on from this, the second half of Martin’s presentation shifted focus onto electrification in the villages and homes of Agincourt. Expanding on the work of Hargreaves, villages in the area were sorted into four distinct categories as shown in Figure 2:

Village type              Electricity in 2000 (Y/N)
Central communities       Yes
Established communities   Yes
Undeveloped villages      No
Refugee settlements       No

Figure 2. Village typology used by HDSS

Non-domestic lighting, such as light from train stations, police stations and supermarkets, was also picked up by the satellite, so it was crucial to conduct analyses on the ground to distinguish between non-domestic and domestic sources of lighting. The data collected for Agincourt was compared against data for Kruger and Nelspruit using a difference-in-differences (DID) approach to look at the take-up of lighting. At the beginning of the analysis period in 1992, levels of electrification in Kruger and Agincourt were considerably lower than in Nelspruit. However, the data showed that Agincourt had a significant tendency to get brighter compared with both. In 1992, Agincourt displayed almost total darkness, lighting up incrementally to 2007 before dulling in the midst of the Eskom collapse of 2008.
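
For readers new to the DID idea, here is a minimal sketch of the arithmetic. The brightness values are invented purely for illustration and are not taken from the presentation. The choice of Kruger as the comparison site is likewise only an assumption for the example.

```python
def difference_in_differences(treated_before: float, treated_after: float,
                              control_before: float, control_after: float) -> float:
    """Change in the treated site minus change in the comparison site."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Invented mean brightness values, for illustration only.
brightness = {
    "Agincourt": {1992: 0.5, 2007: 6.5},  # site being electrified
    "Kruger": {1992: 0.5, 2007: 1.5},     # hypothetical comparison site
}

did_estimate = difference_in_differences(
    brightness["Agincourt"][1992], brightness["Agincourt"][2007],
    brightness["Kruger"][1992], brightness["Kruger"][2007],
)
```

Subtracting the comparison site's change removes any brightening that would have happened anyway (for example, from changes in the satellite sensor), leaving the part that is specific to the site of interest.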

The nightlights data additionally showed major bright spots in the Agincourt site. On the surface, it appeared the spots could be attributed to three villages. Yet, on closer inspection, one could identify two developed villages, one undeveloped village, a taxi rank, a supermarket and a train station. Taken alone, the nightlights data was thus misleading, giving the impression that the third village had been electrified when in reality it remained relatively undeveloped. Martin reminded us of the caveats of using satellite data and reiterated the importance of having researchers on site. Moreover, coupling the nightlights data with the village typologies produced more specific findings, such as 200 new village connections increasing the brightness of the area by 1.7 units.


For me, the workshop delivered a fantastic insight into the use and application of nightlights data. The analyses conducted on satellite imagery for the last two decades have not only faithfully captured the electrification of the Agincourt area, but also display the differences between the developed and undeveloped areas of the site. Even from a non-technical perspective, the gravity of the research being done is evident and it is clear the potential exists to do so much more.

The full paper is published as Throwing light on rural development: using nightlight data to map rural electrification in South Africa, by Takwanisa Machemedze, Taryn Dinkelman, Mark Collinson, Wayne Twine and Martin Wittenberg.


Knowledge exchange symposium: 21st Century Data Infrastructure for Research

In this blog Susan Cadogan and Camille Corti-Georgiou from the UK Data Service report back on the joint symposium held on 11 July 2017 at the University of Cape Town (UCT), organised by eResearch at UCT, DataFirst and the UK Data Service. The meeting aimed to encourage discussion and debate around the requirements for scaling data services and data infrastructure for research.

Hosted by Dr Dale Peters, Director of UCT eResearch, the symposium comprised presentations from Louise Corti and Nathan Cunningham of the UK Data Service; Martin Wittenberg, Director of DataFirst; Anwar Vahed of the Data Intensive Research Initiative of South Africa (DIRISA); Russ Taylor of the Department of Astronomy at UCT (IDIA); and Rob Simmonds of the Department of Computer Science, UCT. Attendees hailed from a range of disciplines including data science, bioinformatics and economics.


Corti and Wittenberg set the scene, highlighting the origins of the UK-UCT collaboration, which stemmed from a successful application to a call for International Centre Partnerships funded by the UK’s Economic and Social Research Council (ESRC) and South Africa’s National Research Foundation (NRF). The 18-month Smarter Household Energy Data project focused on scaling up household energy research using ‘big data’ infrastructure. Corti outlined the role of the UK Data Service in a joint project with the University College London Energy Institute, looking at household energy data collected from smart meter readings. In South Africa, DataFirst were collaborating closely with both the Energy Research Centre at UCT and the University of the Witwatersrand and Medical Research Council (MRC) Rural Public Health and Health Transitions Research Unit to investigate fuel poverty and the impact of electrification in rural areas using previously untapped sources of data. The data came from the Agincourt Health and Socio-Demographic Surveillance System (HDSS), NASA nightlights data and Cape Town Municipality electricity billing data. Contrasting access to and use of household energy in the UK and South Africa underlines the vast differences in consumption and access to resources between countries, and has helped to highlight the impact of decisions, policy making and intervention in this research sphere.


Louise Corti introducing the UK Data Service – DataFirst partnership project

Both data streams inevitably raised challenges of data quality, sensitivity and anonymity. Corti underlined how the practices used in both institutions around the ‘5 Safes’ of data access could be rolled out to accommodate research practices using these new forms of data. The project has succeeded in meeting its initial aims of providing data expertise across institutions and using the partnership to work through some of the challenges posed by the size and quality of the data.

Infrastructure in South Africa

Dr Anwar Vahed outlined his role in DIRISA, helping to establish a national research data infrastructure in South Africa and coordinating the development of expertise and the implementation of research data management strategy and policy. Together with the Centre for High Performance Computing (CHPC) and the South African Research Network (SANReN), DIRISA forms the data infrastructure component of the National Integrated Cyberinfrastructure System (NICIS) of South Africa. The roadmap for cyber-infrastructure for both data and computing resources runs across six thematic areas: physical science and engineering, energy, health, bio and food, earth and environment, humans and society, materials and manufacturing. DIRISA is aiming to provide federated access to data and, while it does not have a preservation role, it promotes the use of trusted repository services for research rather than informal cloud-based solutions such as Google Drive or Dropbox.


Anwar Vahed speaking about DIRISA

Big data: the case of astronomy

Professor Russ Taylor from the Department of Astronomy at UCT observed that astronomers have long managed the large data flows generated in their field, citing the Square Kilometre Array (SKA) project, which aims to be the world’s largest radio telescope. The pre-construction phase of the project started in 2012, with the next phase of development commencing in 2018 – 2020 and being used to test proof of concept as scientific observations begin. With South Africa and Australia having won the SKA bid to co-host, this large-scale project comprises ten cornerstone countries, with over 121 institutions. Some 360 researchers, scientists and engineers are helping develop the supercomputers for processing data.


Russ Taylor from the Department of Astronomy at UCT speaking on the roll out of the SKA

As with research utilising genetic sequencing, data from telescopes has grown rapidly and significantly over the last 5-10 years. Major challenges are: extracting information from such large amounts of data; managing large files (e.g. 10 million records); processing data using multiple pathways and detecting change; and merging and linking data (a relatively new area outside of the social sciences). Additional current cultural challenges include research reproducibility and moving to cloud-based processing.

Finally, the aspiration for global collaboration is hindered by lack of funding at this level, although support is being sought through a Horizon 2020 proposal. In terms of cyber infrastructure, South African astronomers are looking to DIRISA for support as a Tier 2 (national) facility to serve their research community together with bioinformatics.

Professor Rob Simmonds provided a technical view on the Inter-University Institute for Data Intensive Astronomy (IDIA) system being built at UCT, which will form part of the Tier 2 facility. Investment in the facility over three years will total some R17 million. An OpenStack-based Infrastructure as a Service (IaaS) management system has been installed, with a security layer still to be added; this will be required when the bioinformatics facility becomes operational. So far, hardware is in place, storage has been provisioned and other processing services will be provided by the African Research Cloud (ARC) from the University of Cape Town.

The Tier 2 storage will use Ceph, a free-software storage platform, for cloud storage with 3x replication as back-up. Security, authentication and user front ends are still to be developed; these will need to support the management of personally identifiable data and reproducible science.

Joining up big data infrastructure

As the final speaker, Nathan Cunningham, Director of Research IT at the UK Data Service, outlined the research landscape in the UK, with a move to join up infrastructure where possible to maximise value and efficiency, enable collaboration and ensure cross-cutting work streams that help to eliminate duplication.


Nathan Cunningham introducing the UK’s big data infrastructure planning

Cunningham drew on the 2013 OECD report New Data for Understanding the Human Condition: International Perspectives, which encourages us to review how we scale up computing power, skills and management systems to cope with large data resources, including those that hold information about human subjects. In response to these challenges, Cunningham outlined the UK Data Service’s ongoing implementation of a Data Service as a Platform (DSaaP). He stressed that, while many of the ideas are not original, using open source tools and creating hybrid services can offer a flexible system that meets the needs of the researcher while ensuring safe and continuous access.

In meeting the challenges of the new infrastructure, staff and researchers will need to upskill and be introduced to new ways of thinking, somewhat removed from the traditional data dissemination and archiving practices we find in social science today.

Looking ahead

A number of common threads and challenges arose from the symposium. There is great excitement in enabling access to new and novel forms of data, but building systems to manage data across disciplines with varying access requirements is a serious undertaking, especially where no additional funding to adapt or renew infrastructures is available.

However, the use of open source tools and introduction of hybrid infrastructures and systems could offer positive solutions and has the potential to reduce costs.  Participants agreed to continue sharing ideas and solutions, especially at a time of reduced infrastructure funding and where capitalising on collaboration may help strengthen proposals.

Training and upskilling staff and researchers poses another challenge. The UK Data Service-DataFirst project has moved one step closer to this aim through hands-on engagement, developing and running a week-long Summer School entitled Encounters with Big Data: An Introduction to using Big Data in the Social Sciences, held in February 2017 at UCT and to be run again in August 2017 in the UK at the Institute for Analytics and Data Science (IADS) summer school.


‘Encounters with Big Data workshop’ participants, Cape Town, February 2017

This intensive course has provided participants with the opportunity to extract, explore and analyse big data using some core elements of DSaaP. Using a locally installed Hadoop Sandbox, participants were taught basic skills in managing and manipulating big data with data science oriented tools and software, such as R and Spark.

The feedback from the course was very encouraging but the challenge now is to ensure that the researchers are able to benefit from their new skills by putting them to use, and that their skills are updated and refreshed over time. We hope that new researchers are further encouraged to participate and that future capacity building can be carried out via a fully-fledged big data environment that can support future waves of researchers.

Louise Corti, UK Principal Investigator on the UK Data Service – DataFirst project, hopes that “we can continue to collaborate with UCT beyond the life of the soon-to-finish project, ensuring that the great networks we have brokered together across disciplines, and the policy-relevant impact we have emerged in the energy research can live on”.



Cape Town Data Quality Workshop: Measurement of Development Indicators


In this blog Andrew Kerr, economist and Senior Research Officer at DataFirst, University of Cape Town, reports back on a two-day workshop held at the River Club in Cape Town on 6th and 7th July 2017, organised by Professor Martin Wittenberg, Director of DataFirst, with support from the UK Data Service.

The workshop sessions were organised around several themes: measurement of individual and household well-being; access to household energy and household services; new forms of data and innovative approaches to measurement; and labour market data.

Attendees included academics from a variety of universities (Cape Town, Pretoria, Stellenbosch) and a range of disciplines (economists, astronomers, health and energy researchers), as well as representatives from Statistics South Africa (Stats SA), the Department of Basic Education and the Office of Astronomy for Development (OAD) of the International Astronomical Union.

Focus on energy and new forms of data

The sessions on energy and new forms of data showcased some of the results from our ESRC-NRF funded international partnership between DataFirst and the UK Data Service, Smarter Household Energy Data.

Using sources of data outside the domain of traditional survey and administrative data is often seen as a challenge for social scientists, owing to unfamiliarity with, and lack of trust in, the data. A lot of preparation and manipulation has to go into data sources derived from real-time measurements. Takwanisa Machemedze, formerly of DataFirst, presented a paper on using nightlights data to measure rural electrification. He showed that the nightlights data tracked the roll-out of electricity connections in a health and demographic surveillance site in the east of the country.

Tawanda Chingozha, a PhD student from the University of Stellenbosch, also used satellite data, presenting a paper that estimated changes in land under cultivation due to land reform policies in Zimbabwe. The conclusion was that the acreage under cultivation decreased after the fast track land reform programme.

Martin Wittenberg presented a paper with Tom Harris on electricity connections and household formation, as featured in an earlier blogpost. Using the National Income Dynamics Study (NIDS), the paper suggests that aggregate electricity statistics, such as access rates, conceal a considerable degree of the complexity and volatility inherent in the development of electricity access. Instead, he suggests that policy makers involved in electricity roll-out need to recognise that household electricity access is a complex outcome of time-variant processes: net connections, and household formation and dissolution.

Wiebke Toussaint from the UCT Energy Research Centre documented the Domestic Load Research Project, a yearly survey of several hundred Eskom customers, the household data that has been produced by the project and the ERC’s aims to make some of the data available to the public. Grant Smith and Kathryn McDermott from the UCT School of Economics and JPAL presented a paper that showed the effects of changing to prepaid electricity meters as well as giving insights into the difficulties of using administrative data, in this case from the City of Cape Town.

Kathryn McDermott from the UCT School of Economics and JPAL speaking

Chris Park from the UK Data Service submitted a presentation on behalf of Simon Elam from UCL, on “Data Quality: The elephant in the (big data) room”, which looked at the opportunities for using and issues with the quality of data from smart meters.

Measurement of individual and household well-being

Nilmini Herath from JPAL Africa at UCT presented a paper sharing JPAL’s insights gained in running surveys in South Africa. The presentation led to a helpful cross-pollination of ideas with the Stats SA participants who were interested in understanding and improving fieldwork quality in Stats SA data. Emmanuel Bakirdjian of JPAL Africa at UCT presented a related paper giving insights into how different forms of measurement can complement or improve the self-reported data traditionally used in household surveys.

Two of the presenters were students in the first (2016) cohort of the new Post-Graduate Diploma in Survey Data Analysis for Development at UCT, which has been put together by DataFirst, together with SALDRU and the School of Economics. Karabo Sebolai from Stats SA presented a paper comparing imputation and reweighting adjustments for non-response in the Agricultural Survey run by Stats SA, whilst Phumudzo Madzivhandila, also from Stats SA, examined the robustness of multidimensional poverty estimates to different weights for the various components of the poverty estimates.

Professor Steve Koch from the University of Pretoria presented a paper estimating equivalence scales for South Africa, using the 2010 Income and Expenditure Survey. One implication of his work is that standard procedures used for adjusting incomes for household size and composition (normally done by calculating per capita figures) probably over-adjust. This would make big households look poorer than they probably are.
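
To see why per capita adjustment can make large households look poorer, here is a small sketch comparing it with a gentler adjustment that divides income by household size raised to a power between 0 and 1. The power form and the value 0.5 (the so-called square-root scale) are common conventions used here as assumptions. They are not the scales estimated in the paper, and the incomes are invented.

```python
def per_capita(income: float, household_size: int) -> float:
    """Income divided equally across household members."""
    return income / household_size


def equivalised(income: float, household_size: int, theta: float = 0.5) -> float:
    """Income divided by household_size ** theta.

    theta = 1 reproduces the per capita figure; theta = 0 makes no
    adjustment for size; theta = 0.5 is the square-root scale.
    """
    return income / household_size ** theta


# Invented monthly incomes, for illustration only.
single_income, single_size = 3000.0, 1
family_income, family_size = 8000.0, 4

single_pc = per_capita(single_income, single_size)
family_pc = per_capita(family_income, family_size)
single_eq = equivalised(single_income, single_size)
family_eq = equivalised(family_income, family_size)
```

With these made-up numbers the four-person household ranks below the single person on a per capita basis, but above on the square-root scale, because the latter allows for costs that are shared within a household.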

Martin Wittenberg presented work which suggested that there were sampling issues with the Living Conditions Survey. It looked as though the survey was finding fewer rich people at the end of the survey period than at the beginning. This could be related to fieldwork fatigue. Similar questions have been raised about the diary method of collecting consumption data over a whole year.

Measuring labour market changes

In the final session, Andrew Kerr from DataFirst presented two papers using the PALMS data (Post-Apartheid Labour Market Survey series), a compilation of Stats SA labour market data from 1994-2015. The first was on how the standard errors in the Quarterly Labour Force Survey (QLFS) are estimated by Stats SA. It suggested that quarter-on-quarter changes are probably noisier than consumers of the data think. The second dealt with the quality of the earnings data in the QLFS and the changes over the period that make comparability over time more difficult.
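
As a rough illustration of why changes between quarters can be noisy, the sketch below applies the textbook formula for the standard error of a difference between two estimates. This is not the estimation method used by Stats SA or in the paper. The figures and the correlation value are hypothetical.

```python
import math


def se_of_change(se_prev: float, se_curr: float, correlation: float = 0.0) -> float:
    """Standard error of (current - previous) for two survey estimates.

    Var(b - a) = Var(a) + Var(b) - 2 * corr(a, b) * SE(a) * SE(b)
    """
    variance = se_prev ** 2 + se_curr ** 2 - 2 * correlation * se_prev * se_curr
    return math.sqrt(variance)


# Hypothetical figures in thousands of people, not taken from the QLFS.
se_q1, se_q2 = 90.0, 90.0
change = 120.0  # estimated quarter-on-quarter change in employment

# Treating the two quarterly samples as independent.
se_independent = se_of_change(se_q1, se_q2)
# Assuming overlapping samples that make the estimates positively correlated.
se_overlapping = se_of_change(se_q1, se_q2, correlation=0.5)

# Rough 95% margin of error for the change under independence.
margin = 1.96 * se_independent
```

In this made-up case the margin of error is larger than the estimated change itself, so the apparent movement could not be distinguished from sampling noise.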

Andrew Kerr from DataFirst speaking

There was lively discussion on all the papers which continued over lunch, tea and the conference dinner. Fieldwork quality, measurement error and how to share different types of data came up repeatedly. There was also very useful cross-disciplinary engagement, with the Office of Astronomy for Development particularly interested in the use of remote sensing for measurement.

Louise Corti, Principal Investigator on the UK side of the UK-South Africa project, notes that “The International Astronomical Union (IAU) is the largest body of professional astronomers in the world who have set up the Office of Astronomy for Development (OAD), in partnership with the South African National Research Foundation (NRF). This is a wonderful opportunity for collaboration across disciplines to harness powerful data resources in the investigation of critical development issues”.


The return of ‘Encounters with Big Data’: our summer school at the University of Essex

Louise Corti, Director of Collections Development and Producer Relations at the UK Data Service, reports back from the rerun of our successful summer school, Encounters with Big Data: An Introduction to using Big Data in the Social Sciences, held at the University of Essex, Colchester from 31 July to 4 August 2017.

The course met one of the objectives of the Smarter Household Energy Data project, a joint International Centre Partnership Grant between the UK Data Service and DataFirst and funded by the Economic and Social Research Council in the UK and the National Research Foundation in South Africa, to create a collaborative research infrastructure for large-scale household energy data.

Louise, UK Principal Investigator on the project, said: “We are delighted to be able to run this course for a second time, after it was so well-received in Cape Town earlier this year. Our joint project finally finishes in August, and our two fantastic trainers, Peter and Chris, are leaving the UK Data Service, so it was wonderful to be able to squeeze this last capacity building event in”.

The course, aimed at experienced researchers, statisticians, or data analysts, covered aspects of data extraction, exploration, basic analysis and visualisation of big data using a Hadoop system. The platform uses solutions that can deliver data at scale, with speed, and include Hive, Spark, and Zeppelin, which integrate seamlessly with popular data analysis environments like Python and R.

This course was run as part of the University of Essex’s Institute for Analytics and Data Science (IADS) summer school, which participants paid to attend. We received applications for the course from 60 people, with places available for only 20. Once again, prerequisites were set to ensure that participants could fully participate: experience using quantitative research data in the social sciences, a good understanding of statistical methodology, and competence in writing commands in a statistical computing environment like Stata, R or SPSS.

Participants came from home institutions and from a number of countries including Denmark, Spain, Italy, Malta, Canada, Korea, Mexico, Mongolia and the USA. Their positions included professors, research assistants, postgraduates, statisticians and data analysts, and their fields ranged from economics, sociology, criminology and geography to marine ecology and genomics.


The summer school participants, teachers and organisers

Workshop content

The first day offered participants an overview of how big data methods can be applied to social science, using new tools and enhanced modelling techniques. Nathan Cunningham, Director of Research IT and Innovation at the UK Data Service, introduced the concept of big data and provided an overview of the work being done in-house to build the UK Data Service’s new Data Services as a Platform (DSaaP) initiative, currently being implemented in partnership with Hortonworks, a leader in Hadoop-based data technologies.

Louise Corti then presented on how national statistics offices around the world have been exploring alternative sources of collecting data on citizens (to replace or complement surveys and censuses), and showed an example looking at tourism data from World Heritage Sites on Wikipedia. Participants worked in groups to devise their own ‘national statistic’ based on free data from the internet and quickly came up with some innovative ideas for data and analyses including:

  • Crime Rates: What kind of people are more likely to be defrauded online, i.e. those who use the internet more or less? Maybe hard to get data on those known to be defrauded.
  • Tourism: Does watching films with particular locations inspire people to travel to those places? Identify locations in the most popular films over the last year and look at travel destinations from flight statistics.
  • Pollution: Does pollution affect sports performance or fitness? Link data from air pollution sensors with fitness tracking devices such as Fitbits. Would be hard to control for people’s level of fitness.
  • Deprivation: Measure level of deprivation by patterns of mobile phone usage in areas using cell tower records.

Libby Bishop, Manager for Producer Relations and Research Ethics at the UK Data Service, followed with a talk on the role of legal and ethical concerns when using big data. She cited examples of conditions under which research using Twitter data can be shared, and the increasing risk of re-identification given the proliferation of multiple public data sources available about people. Libby outlined the 5 Safes Framework actively used and promoted by the Service as a robust approach for controlling access to disclosive data, so that rich policy-relevant analysis could be safely undertaken. Two guest speakers, James Allen-Roberson and Christian Kemp, from the Department of Sociology at Essex, showcased their work on using the dark web in research, discussing the pros and cons of using such hidden sources.

Over the following two days, Peter Smyth and Chris Park delivered presentations and demonstrations and led exercises focused on manipulating data in Hadoop using a freely available Hadoop Sandbox.

Peter Smyth demonstrated how to use the Hive Query Language (HiveQL) to examine the contents of the datasets and to ‘slice’ and ‘dice’ a dataset into smaller datasets which can be used by desktop applications. He also introduced Zeppelin notebooks as a means of carrying out interactive data analysis within a big data environment. Chris Park introduced Apache Spark, a high performance, distributed computation engine designed for handling and analysing big data. He demonstrated how to scale out small-scale analyses by harnessing the power of Spark from R using the SparkR package, and Peter concluded by giving an overview of one of R’s libraries for producing spatial visualisations such as choropleth maps, Leaflet.

Day four covered tools and techniques for getting and converting external data called from APIs, and how to interpret the results in JSON format. Peter concluded with an overview of end-to-end process tools available via Hadoop.

Participants then moved onto formulating their own group projects to consolidate what they had learned over the past four days using open data from the internet. For the remaining day, the groups worked on their projects, accessing structured data from the web, importing, linking them where necessary and conducting exploratory data analysis and visualisations. The group projects turned out to be really interesting with all teams exploring unfamiliar open source data sources and software packages. The five teams came up with some great ideas ranging from use of Twitter data, Google trends data, weather data, accident databases and other sources to look at racism, use of statins, pedestrian and car safety, and weather and mood. A range of R tools were used including ggplot2, Rtweet, gtrendsR and wordcloud(R).

It was hard to choose an outright winner, so in the final afternoon we awarded two prizes to two teams: ‘Racists be damned’ (team of 2) and ‘Auto Choice Model’ (team of 3).

Team ‘Racists be damned’

  • This group of two used online data available from the UK Government and Parliament Petitions site. A recent example was a petition for a second Brexit referendum, which 4.2 million had signed. The team extracted JSON data from the site’s API for petitions with the term ‘immigration/immigrants’ in the title (N=100 or so out of a possible 31,732 petitions). The intention was to conduct sentiment analysis on the data and then map this to the UK Index of Multiple Deprivation, but this proved tricky in the time available. Instead, they hand-classified the petitions as positive or negative regarding immigration/immigrants. They then plotted by constituency the proportion of people who signed the anti-immigration petitions (dark shading = higher votes for negative petitions). They converted the JSON data into a regular dataframe, and used R and leaflet, with a little help from Chris and Peter. One of the team declared his pride in creating a map for the first time.
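The flattening and per-constituency calculation described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the R the team used. The JSON layout, the field names and every figure are invented for the example and are not taken from the petitions site, and the share is computed over signatures on the selected petitions only, which is just one possible reading of "proportion".

```python
import json
from collections import defaultdict

# Hypothetical, simplified petition records: field names and counts are invented.
raw = json.loads("""
[
  {"title": "Petition A", "stance": "negative",
   "signatures_by_constituency": [
     {"name": "Northtown", "signature_count": 120},
     {"name": "Southville", "signature_count": 30}]},
  {"title": "Petition B", "stance": "positive",
   "signatures_by_constituency": [
     {"name": "Northtown", "signature_count": 80},
     {"name": "Southville", "signature_count": 90}]}
]
""")


def flatten(petitions):
    """Turn nested petition JSON into flat rows, one per petition and constituency."""
    rows = []
    for petition in petitions:
        for area in petition["signatures_by_constituency"]:
            rows.append({
                "title": petition["title"],
                "stance": petition["stance"],  # the hand-classified label
                "constituency": area["name"],
                "signatures": area["signature_count"],
            })
    return rows


def negative_share(rows):
    """Share of each constituency's signatures that went to negative petitions."""
    total = defaultdict(int)
    negative = defaultdict(int)
    for row in rows:
        total[row["constituency"]] += row["signatures"]
        if row["stance"] == "negative":
            negative[row["constituency"]] += row["signatures"]
    return {name: negative[name] / total[name] for name in total}


rows = flatten(raw)
shares = negative_share(rows)
```

The resulting `shares` dictionary is the kind of per-constituency value that could then be joined to boundary data and shaded on a map.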


Some initial results from the ‘Racists be damned’ project team

Team ‘Auto Choice Model’

  • This group of three wanted to build an app to help people choose a suitable car to purchase. They looked for data on fuel economy and emissions, and attempted to correlate it with accidents. They found data on car attributes, such as suitability for family size. They also found a database of car accident locations in the UK by car make, model and engine size, ranked from 1-3 by severity of accident. The team split up data preparation tasks and used Rmarkdown to keep track of work. They converted UK accident point data to Lower Layer Super Output Areas (LSOA), used inner joins on the datasets, and mapped accidents by model of car to LSOAs using leaflet. The team attempted a 4D plot, which turned out to be quite complex, showing car engine size by brand by severity of accident by accident location.
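For readers unfamiliar with inner joins, the linking step can be illustrated with a small Python sketch (the team worked in R). The two tables, their column names and all values below are hypothetical. The point is only that an inner join keeps rows whose key appears in both tables and drops everything else.

```python
# Hypothetical tables: makes, models, LSOA codes and values are invented.
car_attributes = [
    {"make": "Alpha", "model": "One", "engine_cc": 1200, "seats": 5},
    {"make": "Beta", "model": "Two", "engine_cc": 2000, "seats": 7},
]
accidents = [
    {"make": "Alpha", "model": "One", "lsoa": "LSOA-001", "severity": 3},
    {"make": "Alpha", "model": "One", "lsoa": "LSOA-002", "severity": 1},
    {"make": "Gamma", "model": "Three", "lsoa": "LSOA-001", "severity": 2},
]


def inner_join(left, right, keys):
    """Return merged rows for every left/right pair that agrees on all key columns."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined


# 'Gamma Three' has no attribute record and 'Beta Two' has no accident record,
# so neither appears in the joined result.
joined = inner_join(accidents, car_attributes, keys=("make", "model"))
```

Because unmatched rows disappear silently, it is worth comparing row counts before and after a join of this kind.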

We believe that participants really do benefit from group work to exchange ideas and trouble-shoot issues arising. Given the short time frame, most of the groups divided up tasks between themselves.


Prize giving for the winning project teams

Feedback suggested that the course went at roughly the right pace, with only a couple of attendees struggling to keep up as the week progressed. I find that ensuring that all the prerequisites for a course like this are fully met is absolutely essential for any teaching I organise, but for this course run by an external provider, we were unable to vet participants’ applications. Participants for big data courses of this nature benefit from a reasonably high level of competence and experience in data handling and statistical analysis.

I would like to thank Emma McLelland from the IADS summer school and Sarah King-Hele for coordinating the logistics for the week. And huge thanks and fond farewell to our two fantastic trainers, Chris Park and Peter Smyth.

You can access the training materials here on our course GitHub page:

Some feedback

“Very good course indeed. Good structure, material, pace, content + teaching”

“Wonderful prep and professional presentation”

“I really liked it overall. Perhaps one suggestion. I would have liked to be given sample code for analysis actually run in the cloud, not just in our local VM”




Measuring dynamics of household electricity connections in a developing context: a longitudinal data approach

In this blog Louise Corti provides a short review of an illuminating paper just published by Tom Harris, Mark Collinson and Martin Wittenberg entitled ‘Aiming for a Moving Target: The Dynamics of Household Electricity Connections in a Developing Context’, available on ScienceDirect.

This piece of work has stemmed from a collaborative UK-South Africa ESRC-NRF International Centre Partnership award that focused on scaling up the analysis of household energy data, looking at the potential for better understanding the policy context. In the UK, policy work surrounding energy usage focuses on mitigating relative fuel poverty, in the first world sense, but in South Africa the research focus is inequality in public service delivery; despite significant progress over the last 20 years, basic access to electricity is by no means stable or guaranteed, and remains one of the largest development issues faced by post-apartheid South Africa.

While you can read the findings yourself in the open access paper, here I want to draw out the innovative methodological approaches of the research, which has focussed on using data from a long-running health and demographic surveillance system (HDSS) site. From my point of view, the approach taken by the authors offers an immensely promising avenue for opening up access to other complex HDSS data by considering how data can be restructured using traditional panel study methodology and considering carefully designed units of analysis.

In their analysis, the authors investigated household electricity access in a poor rural setting in South Africa. Their approach critiques the existing literature in this field, which has tended to focus on investigating changes in an individual’s access to electricity, rather than taking into account the importance of the household unit and changes in household access. They acknowledge that while many existing studies portray the process of electricity roll-out as a simple, monotonic progression, it is often a far more complex picture than that. Recent literature typically uses binary indicators to measure progress in electricity access, to assess energy poverty, which suggest a strong association between electricity access and poverty. However, the authors note that the complexities of access transitions among the poor are not taken into account and that the aggregate data sources used, for example those looking at access rates at provincial level, do not offer rich enough information on changes over time.

A richer picture can be gained by investigating the short-term dynamics of electricity access using large-scale longitudinal data. The authors find short-lived deviations (in fact, periods of declines in access) from the long-term upward trend in electricity access, which they then go on to explore in more depth. The analyses were facilitated by the creation of new datasets derived from existing longitudinal studies: the National Income Dynamics Study (NIDS) and Agincourt Health and Demographic Surveillance System (HDSS). Longitudinal data analysis techniques were based around a new categorisation of household type, defined by when a household forms and whether it continues to exist or not (persistence). Of note here is the novel approach used to re-present data from a typical HDSS.
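To make the idea of categorising households by formation and persistence concrete, here is a minimal Python sketch. It assumes that each household already carries an identifier that is comparable across two waves, which is the hard part the authors' definition actually addresses, so this does not reproduce their method. The identifiers and access values are invented.

```python
def categorise_households(earlier_ids, later_ids):
    """Split households into continuing, dissolved and newly formed between two waves."""
    earlier, later = set(earlier_ids), set(later_ids)
    return {
        "continuing": earlier & later,    # present in both waves
        "dissolved": earlier - later,     # present only in the earlier wave
        "newly_formed": later - earlier,  # present only in the later wave
    }


def access_rate(household_ids, access):
    """Share of the given households with electricity; None if the group is empty."""
    ids = list(household_ids)
    if not ids:
        return None
    return sum(1 for h in ids if access[h]) / len(ids)


# Hypothetical waves: household id -> has electricity access.
wave_1 = {"h1": True, "h2": True, "h3": False}
wave_2 = {"h1": True, "h3": True, "h4": False}

groups = categorise_households(wave_1, wave_2)
dissolved_rate = access_rate(groups["dissolved"], wave_1)
new_rate = access_rate(groups["newly_formed"], wave_2)
```

Comparing the access rate of dissolved households (measured in the earlier wave) with that of newly formed households (measured in the later wave) is the kind of contrast the categorisation makes possible.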

The longitudinal data sources: sample and coverage

The National Income Dynamics Study (NIDS) is a panel study commissioned by the national Presidency in an effort to track long-run poverty and well-being and focuses on income, expenditure, labour market participation, education, health (including anthropometrics) and household well-being (e.g., access to services).  The baseline sample was designed to be nationally representative and consists of around 7,300 households and about 28,000 individuals, who became Continuing Sample Members (CSMs) for the subsequent waves. Babies born to CSM women become CSMs themselves and individuals who were co-resident with CSMs were also interviewed (Temporary Sample Members – TSMs).


The Agincourt Health and Demographic Surveillance System (HDSS) monitors key demographic events and socio-economic variables in the Agincourt sub-district in north-eastern Mpumalanga Province, South Africa. A baseline census was conducted in 1992, with annual census rounds conducted since 1999. Key variables measured routinely by the HDSS include: births, deaths, in- and out-migrations, household relationships, resident status, refugee status, education, and antenatal and delivery health-seeking practices. Additional modules are also added, such as a household asset module, collected every second year since 2000, which includes information on household access to services, such as electricity. Temporary migrants are classified by including on the household grid non-resident members who retain significant contact and links with the rural home and ‘share a common pot’.

Novel data structures for measuring household outcomes

The authors’ longitudinal analysis of household outcomes rests on the ability to identify the same household in each wave of each study, in other words the surviving or continuing households. This approach was developed by the authors themselves in a previous publication, cited in the paper. Applying their new definition to NIDS and HDSS data allowed them to set up panels of individuals as new panels of households, through which longitudinal changes in household electricity access could be explored (using transition matrices). In order to account for bias in their chosen panel definition, for example from varying household formation and dissolution rates, they also created a comparison panel using the traditional “headship-based” definition. Patterns of electricity access statistics estimated on each panel were found to be similar.
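As a rough illustration of what a transition matrix over continuing households looks like, the Python sketch below tabulates movements between access states across two waves. The household identifiers and states are invented, the matrix is unweighted, and the function is an illustrative assumption rather than the authors' code.

```python
from collections import Counter

STATES = ("connected", "unconnected")


def transition_matrix(before, after):
    """Row-normalised transition shares for households observed in both waves."""
    counts = Counter()
    for household, state in before.items():
        if household in after:  # restrict to continuing households
            counts[(state, after[household])] += 1
    matrix = {}
    for origin in STATES:
        row_total = sum(counts[(origin, dest)] for dest in STATES)
        matrix[origin] = {
            dest: (counts[(origin, dest)] / row_total if row_total else 0.0)
            for dest in STATES
        }
    return matrix


# Hypothetical waves: h5 dissolves and h6 is newly formed, so both are excluded.
before = {"h1": "connected", "h2": "connected", "h3": "connected",
          "h4": "unconnected", "h5": "unconnected"}
after = {"h1": "connected", "h2": "unconnected", "h3": "connected",
         "h4": "connected", "h6": "connected"}

matrix = transition_matrix(before, after)
```

Each row sums to one, so the off-diagonal entry in the "connected" row is the share of previously connected continuing households that lost their connection. A real analysis on survey data would normally also apply panel weights.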


From a methodological point of view, the authors conclude that aggregate statistics can conceal a considerable degree of the complexity and volatility inherent in the development of electricity access. Viewing access as a time-variant process rather than assuming linear roll out of services needs to be appreciated and studied. The authors conclude that further studies into service delivery in LMICs should consider using the longitudinal techniques applied in their paper, taking into account household dynamics.

While I am by no means an expert in the analysis of service delivery or utilities, my appreciation of the collection, preparation, structuring and curation of long-running social surveys leads me to believe that there is great hope for improving access to complex longitudinal data from multi-million pound development and evaluation studies conducted in low and middle income countries. If we consider the number of million-pound population surveillance and intervention investments across the world that generate immensely rich data over many years, such as HDSS, Millennium Villages Project rural development sites, and other large-scale intervention projects, like Girls’ Education Challenge, we can envisage the great scope for research opportunities through access beyond the research teams. The sometimes highly complex structure inherent in HDSS-type data collections could be reworked to create new datasets that can be better understood by social scientists and those sitting outside of the population, migration and epidemiology research domains.

The approach the authors have taken in repurposing data to create traditional ‘panel data’ would benefit from being widely showcased and exploited as an inspiration for data-hungry researchers and, of course, for maximising impact opportunities for funders.


Reflections on ‘Encounters with Big Data’, our course in Cape Town

Louise Corti, Director of Collections Development and Producer Relations at the UK Data Service, and Chris Park, Data Scientist, report back from the course, Encounters with Big Data: An Introduction to using Big Data in the Social Sciences, held in Cape Town, South Africa from 30 January to 3 February 2017. See also our Twitter hashtag #ZABigData17.

The course met one of the objectives of the Smarter Household Energy Data project, a joint International Centre Partnership Grant between the UK Data Service and DataFirst, funded by the Economic and Social Research Council in the UK and the National Research Foundation in South Africa, to create a collaborative research infrastructure for large-scale household energy data. Louise Corti, one of the principal investigators, told us: “This course introduced part of our work on ‘scaling up’ data curation and user access approaches for big data, predominantly where larger or more complex data sources are involved – bigger than the 5GB maximum download bundle of survey data that our users typically access – or where computationally-intensive and iterative modelling is needed.”

The course covered aspects of extraction, exploration, and statistical analysis of big data behind the UK Data Service’s new Data Services as a Platform (DSaaP) initiative being developed in partnership with Hortonworks, a leader in Hadoop-based data technologies. The new platform will be using solutions that deliver data at scale, with speed, and security, including Hive, Spark, and Zeppelin, which integrate seamlessly with popular data analysis environments like R and Python. We received applications for the course from universities across South Africa, including the University of Cape Town, University of Witwatersrand, Stellenbosch University, and the University of Pretoria, as well as from government agencies including the South African National Space Agency and the Human Sciences Research Council of South Africa. The participants were highly experienced in using cross-sectional and longitudinal survey data, public health and medical data, transport data, financial data, and satellite data.


Nathan Cunningham, Director of the Big Data Network Support directorate at the UK Data Service, summarised the training: “Our aim was to support researchers in understanding and analysing large and complex datasets, focusing on leveraging the power of popular statistical software like R within a big data environment.”

Workshop content

The first day offered participants an overview of how big data methods can be applied to social science, using new tools and enhanced modelling techniques. The Data Services as a Platform initiative was introduced by Darren Bell, lead Repository Architect at the UK Data Service, and Peter Smyth, Big Data Training Expert at the UK Data Service. Martin Wittenberg, Professor of Economics and Director of DataFirst at the University of Cape Town, presented his work on restructuring a 20-year health and demographic surveillance study based in South Africa. He showed an example of big data analysis using night-time satellite data from NASA as a proxy to explore the impact of electrification on South Africa’s economic development over time, highlighting the exciting opportunities that lie ahead for using big data in the social sciences.

Louise Corti presented on how national statistics offices around the world have been exploring alternative sources of collecting data on citizens (instead of surveys and censuses), and showed an example looking at tourism data from World Heritage Sites on Wikipedia. Participants worked in groups to devise their own national statistics focused ideas for analysis including:

  • Public Health: Understanding how Google searches on medical terms reflect disease prevalence and how bias in the data might be related to rural/urban access to technology, and the use of specialist terminology;
  • Crime Rates: Using crime statistics to identify policy intervention by mapping crime rates over time using aggregated provincial crime stats and police station reports of crime and location;
  • Transport: Studying transport policies to assess different needs for private and public transport. The team used Google tracking location for traffic showing how different categories of commuters are likely to experience different commuting times and local weather station pollution data as a proxy for congestion and emissions;
  • Monetary policy: Following the recent currency changes in Zimbabwe, the team assessed preparedness for new currency using Google location and sentiment analysis of social media data from Twitter and Facebook to see what people are talking about.


Libby Bishop, Manager for Producer Relations and Research Ethics at the UK Data Service, followed with a talk on the role of legal and ethical concerns when using big data. She cited examples of conditions under which research using Twitter data can be shared, and discussed how the proliferation of multiple public data sources about people increases the risk of re-identification. Libby explained that the UK Data Service is keen to promote the idea that disclosure risk is best viewed as a continuum rather than a dichotomous concept. She also suggested that the 5 Safes Framework, which is being actively used and promoted by the Service, is an excellent approach for controlling access to disclosive data, and a means by which rich policy-relevant analysis could be safely undertaken.

Over the next two days, Peter Smyth and Chris Park delivered presentations and led group exercises focused on manipulating data using tools supporting the DSaaP. Peter Smyth demonstrated how to create and query tables in Hive, and introduced Zeppelin notebooks as a means of carrying out interactive data analysis within a big data environment. Chris Park described ways of scaling out small-scale analyses implemented in R with SparkR, and introduced students to aspects of distributed computing and how and when it could benefit research. Practical exercises, such as fitting linear models to large-scale open data were used to put the theory and concepts into practice. Aspects of provisioning and monitoring Hadoop clusters using Ambari and data visualization using the leaflet library were also discussed.

Before participants moved onto formulating group projects, a concluding talk summarised the UK Data Service data processing pipeline. Over the course of the remaining two days, the groups worked on projects focused on accessing structured data from the web, then importing, querying and linking them for exploratory data analysis, modelling, and mapping. The group projects were very interesting and explored unfamiliar open source data sources and software packages. We ended the course by awarding a prize to ‘Team Synergy’, a group of demographic and health data managers and economists, who undertook a comprehensive spatial analysis of mortality in South Africa using Zeppelin, Hive, and SparkR.


We received some great feedback:

“Great introduction to whole field, gentle ease in to more technical concepts”

“Knowing what the other participants are working on is very helpful – a round robin type introduction saves a great deal of time during networking breaks as you can gravitate quickly towards those you want to follow up with. Really enjoyed the exercises – learning happens in action!”

“The practicals on how to navigate Hive were interesting and mind opening”

“I could easily follow the hdfs command lines and the effectiveness of Hadoop and think this software is going to be useful to my studies”

“Really enjoyed the R and Spark sessions. Felt like this is what I’ll be most able to work on in my future research (economics). The exercises were very useful – because the content was accessible and also because of the way that the code was templated for us to follow along with. Introductory lectures appreciated – was easy enough to follow. Chris’s anecdotes very welcome (thanks!)”

We would like to thank our South African hosts, especially Alison Siljeur at DataFirst, for helping organise and coordinate the workshop, making it a wonderful experience for us as tutors and participants.




Electricity Access Dynamics in South Africa

In this blog post, Tom Harris and Professor Martin Wittenberg from the University of Cape Town (UCT) report on the survey analyses they undertook on electricity transitions. Tom is a researcher in SALDRU (Southern Africa Labour and Development Research Unit) and DataFirst, with a Master’s degree in Economics. Martin is in the School of Economics and Director of DataFirst, and a PI on the UK-ZA energy project.


Lack of access to efficient, reliable and modern energy is a prevalent issue across the developing world, with one in five people lacking access to electricity. The use of biomass fuels and other ‘dirty’ and inefficient fuels has severe negative implications, not only for public health, but for the environment and economic development too; transitions up the energy ladder (to electricity particularly) are accordingly associated with improvements in well-being and economic progress [1-7]. Since the use of such fuels is most prevalent among those already in poverty, this exacerbates the plight of the poor, and widens the gap between the rich and the poor [2, 5, 6]. Electrification and improvements in electricity access are therefore of considerable interest to policy makers within countries where wide-spread poverty and inequality remain the dominant issues.

In this regard, South Africa is a success story. Access rates improved from well below 40% in the early 1990s to nearly 80% by 2002. Our own analysis of household electricity access, in Figure 1, confirms that this progress has continued in recent years, and shows national household access rates to have risen by a further 9% between 2002 and 2012.


Certainly, these improvements are encouraging, and there is much that can be learned from the successes of the South African electricity roll-out process. However, our results also reveal a striking point not yet given much attention in the literature: this long-run improvement in electricity access is not the result of a consistent, monotonic increase in access rates. Instead, there are short-term deviations from the long-term upward trend. The most salient of these deviations are periods of declines in access, which are evident in both national and small-area data. However, the cross-sectional statistics presented in Figure 1 cannot explain whether these declines in access were the result of households losing access at a location, or whether they were due to groups of people migrating from connected household units and setting up new households in locations that lack access.

Therefore, in an effort to understand what contributed to the observed declines in access, we introduce two novel approaches. Firstly, in order to explore the relationship between household formation and electricity access, we categorise households according to when they form and whether they continue to exist or not, and investigate changes in access among these different household categories. Secondly, to examine transitions in access among households that continue to exist, we apply longitudinal techniques to a unique form of NIDS (the National Income Dynamics Study, a nationally representative household panel) that allows us to track household units over time [8, 9]. These results are shown in Figure 2 and Table 1 below.



Our findings (in Figure 2) suggest that household formation contributed to the decline in aggregate access over the 2008-2010 period in two respects. Firstly, rapid growth of the household population meant that the electricity roll-out could not keep pace with net household formation: more than 400,000 household units were added to the population, and yet the total number of connections increased by less than 100,000. Secondly, newly formed households were less likely to have electricity access than those that dissolved (77.7% vs. 84.2%) – suggesting that households with electricity access dissolved and were replaced by new households without access.

Moreover, it was not only household formation and dissolution that led to the observed decline in electricity access rates between 2008 and 2010. We also see a decline in access rates among those households that continued to exist (in Table 1). More specifically, we find that even though many additional household electricity connections were added between 2008 and 2010, these positive transitions were outweighed by numerous connection losses – with more than 800,000 continuing households (8.7% of the connected household population in 2008 that continued to exist) having lost an electricity connection by 2010.

However, the processes described above are not particular to periods of declines in access – as is evident in Figure 3 and Table 2, presented below.



Pervasive connection losses are observed even over the 2010-2012 period: a period in which aggregate access actually improved. In addition, while we argued that household formation and dissolution dynamics contributed to a decline in aggregate access rates between 2008 and 2010, in investigating the 2010-2012 period we find that these processes are also able to contribute to improvements in aggregate electricity access – with households that formed over this period being more likely to have had access to electricity than both those that dissolved and those that continued to exist.

The policy and theoretical implications of our findings are applicable to those working in developing contexts well beyond the borders of South Africa. We have shown that aggregate electricity statistics, such as access rates, conceal a considerable degree of the complexity and volatility that is inherent in the development of electricity access. In this light, we suggest that policy makers involved in the area of electricity roll-out are likely to be aiming for a moving target in the following three respects:

  1. the number of households may be growing faster than the rate of growth in connections, as a result of rapid household formation;
  2. people may be moving out of connected households and setting up new households in locations that lack access; and
  3. certain connected households that survive from one period to the next may actually lose their electricity connections.

We therefore suggest that in developing countries like South Africa, electricity access rates are unlikely to show consistent improvements, even in periods of rapid electricity roll out. Access rates do not simply improve as new connections are added. Instead, we argue that household electricity access is a complex outcome of two key time-variant processes: (1) net connections (new connections less disconnections) and (2) household formation and dissolution processes.
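The interplay between the two processes can be illustrated with simple arithmetic. The Python sketch below uses round, hypothetical baseline totals (10 million households, 8 million connections), chosen only to make the calculation easy to follow. The increments mirror the orders of magnitude mentioned earlier in this post (about 400,000 additional households against about 100,000 additional connections), but none of these figures are estimates from NIDS.

```python
def access_rate(connections, households):
    """Aggregate access rate: connected households as a share of all households."""
    return connections / households


def rate_after_change(connections, households, net_connections, net_households):
    """Access rate after adding net new connections and net new households."""
    return access_rate(connections + net_connections, households + net_households)


# Hypothetical baseline, not taken from the data.
households = 10_000_000
connections = 8_000_000

start = access_rate(connections, households)
end = rate_after_change(connections, households,
                        net_connections=100_000, net_households=400_000)

# With a growing household population, the aggregate rate falls whenever the
# ratio of net new connections to net new households is below the starting rate.
marginal_ratio = 100_000 / 400_000
rate_fell = end < start
```

In this illustration the number of connections rises, yet the access rate still falls, because only one in four of the additional households is matched by an additional connection while the starting rate is four in five.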


  1. Vermaak, C., M. Kohler, and B. Rhodes, Developing an energy-based poverty line for South Africa. Journal of Economic and Financial Sciences, 2014. 7(1): p. 127-144.
  2. Kimemia, D., et al., Burns, scalds and poisonings from household energy use in South Africa: Are the energy poor at greater risk? Energy for Sustainable Development, 2014. 18: p. 1-8.
  3. van der Kroon, B., R. Brouwer, and P.J. van Beukering, The energy ladder: Theoretical myth or empirical truth? Results from a meta-analysis. Renewable and Sustainable Energy Reviews, 2013. 20: p. 504-513.
  4. Smith, K.R., National burden of disease in India from indoor air pollution. Proceedings of the National Academy of Sciences, 2000. 97(24): p. 13286-13293.
  5. Heltberg, R., Factors determining household fuel choice in Guatemala. Environment and development economics, 2005. 10(03): p. 337-361.
  6. Rao, M.N. and B.S. Reddy, Variations in energy use by Indian households: an analysis of micro level data. Energy, 2007. 32(2): p. 143-153.
  7. Vermaak, C., M. Kohler, and B. Rhodes, Developing energy-based poverty indicators for South Africa. Journal of Interdisciplinary Economics, 2009. 21(2): p. 163-195.
  8. Harris, T., Household Electricity Access and Household Dynamics: Insights into the links between electricity access and household dynamics in South Africa between 2008 and 2012, in School of Economics. 2016, University of Cape Town: Cape Town.
  9. Wittenberg, M. and M. Collinson, Household formation and household size in post-apartheid South Africa: Evidence from the Agincourt sub-district 1992-2003. 2014: A DataFirst Technical Paper 27. DataFirst, University of Cape Town, Cape Town.