In this blog post, Susan Cadogan and Camille Corti-Georgiou from the UK Data Service report back on the joint symposium held on 11 July 2017 at the University of Cape Town (UCT), organised by UCT eResearch, DataFirst and the UK Data Service. The meeting aimed to encourage discussion and debate around the requirements for scaling data services and data infrastructure for research.
Hosted by Dr Dale Peters, Director of UCT eResearch, the symposium comprised presentations from Louise Corti and Nathan Cunningham of the UK Data Service; Martin Wittenberg, Director of DataFirst; Anwar Vahed of the Data Intensive Research Initiative of South Africa (DIRISA); Russ Taylor of the Department of Astronomy at UCT (IDIA); and Rob Simmonds of the Department of Computer Science, UCT. Attendees came from a range of disciplines including data science, bioinformatics and economics.
Corti and Wittenberg set the scene, highlighting the origins of the UK-UCT collaboration, which stemmed from a successful application to a call for International Centre Partnerships funded by the UK’s Economic and Social Research Council (ESRC) and South Africa’s National Research Foundation (NRF). The 18-month Smarter Household Energy Data project focused on scaling up household energy research using ‘big data’ infrastructure. Corti outlined the UK Data Service’s role in a joint project with the University College London Energy Institute, looking at household energy data collected from smart meter readings. In South Africa, DataFirst collaborated closely with both the Energy Research Centre at UCT and the University of the Witwatersrand and Medical Research Council (MRC) Rural Public Health and Health Transitions Research Unit to investigate fuel poverty and the impact of electrification in rural areas using previously untapped sources of data. The data came from the Agincourt Health and Socio-Demographic Surveillance System (HDSS), NASA Nightlights data and Cape Town Municipality electricity billing data. Contrasting access to and use of household energy in the UK and South Africa underlines the vast differences in consumption and access to resources between countries, and has helped highlight the impact of decision making, policy making and intervention in this research sphere.
Both data streams inevitably raised challenges of data quality, sensitivity and anonymity. Corti underlined how the practices used in both institutions around the ‘5 Safes’ of data access could be rolled out to accommodate research practices using these new forms of data. The project has succeeded in meeting its initial aims of providing data expertise across institutions and using the partnership to work through some of the challenges posed by the size and quality of the data.
Infrastructure in South Africa
Dr Anwar Vahed outlined his role in DIRISA, helping to establish a national research data infrastructure in South Africa and coordinating the development of expertise and the implementation of research data management strategy and policy. Together with the Centre for High Performance Computing (CHPC) and the South African Research Network (SANReN), DIRISA forms the data infrastructure component of the National Integrated Cyberinfrastructure System (NICIS) of South Africa. The roadmap for cyber-infrastructure for both data and computing resources runs across six thematic areas: physical science and engineering; energy; health, bio and food; earth and environment; humans and society; and materials and manufacturing. DIRISA aims to provide federated access to data and, while it does not have a preservation role, it promotes the use of trusted repository services for research rather than informal cloud-based solutions such as Google Drive or Dropbox.
Big data: the case of astronomy
Professor Russ Taylor from the Department of Astronomy at UCT observed that astronomers have long managed the large data flows generated in their field, citing the Square Kilometre Array (SKA) project, which aims to be the world’s largest radio telescope. The pre-construction phase of the project started in 2012, with the next phase of development commencing in 2018-2020 and being used to test proof of concept as scientific observations begin. With South Africa and Australia having won the bid to co-host the SKA, this large-scale project comprises ten cornerstone countries and over 121 institutions. Some 360 researchers, scientists and engineers are helping to develop the supercomputers for processing the data.
As with research using genetic sequencing, data from telescopes has grown rapidly and significantly over the last 5-10 years. The major challenges are: extracting information from such large amounts of data; managing large files (e.g. 10 million records); processing data using multiple pathways and detecting change; and merging and linking data (a relatively new area outside of the social sciences). Further cultural challenges include research reproducibility and the move to cloud-based processing.
Finally, the aspiration for global collaboration is hindered by a lack of funding at this level, although funding is being sought through a proposal under Horizon 2020. In terms of cyber-infrastructure, South African astronomers are looking to DIRISA for support as a Tier 2 (national) facility to serve their research community together with bioinformatics.
Professor Rob Simmonds provided a technical view of the Inter-University Institute for Data Intensive Astronomy (IDIA) system being built at UCT, which will form part of the Tier 2 facility. Investment in the facility over three years will total some R17 million. An OpenStack-based Infrastructure as a Service (IaaS) management system has been installed; a security layer, required once the bioinformatics facility becomes operational, is still to be added. So far, hardware is in place and storage has been provisioned, and other processing services will be provided by the African Research Cloud (ARC) from the University of Cape Town.
The Tier 2 storage will use the free-software Ceph storage platform for cloud storage, with a 3x replication back-up system. Security, authentication and user front ends are still to be developed; these will need to support the management of personally identifiable data and reproducible science.
Joining up big data infrastructure
As the final speaker, Nathan Cunningham, Director of Research IT at the UK Data Service, outlined the research landscape in the UK, where there is a move to join up infrastructure wherever possible to maximise value and efficiency, enable collaboration and ensure cross-cutting work streams that help to eliminate duplication.
Cunningham drew on the 2013 OECD report on New Data for Understanding the Human Condition: International Perspectives, which encourages us to review how we scale up computing power, skills and management systems to cope with large data resources, including those that contain information about human subjects. In response to these challenges, Cunningham outlined the UK Data Service’s ongoing implementation of a Data Service as a Platform (DSaaP). He stressed that, while many of the ideas are not original, using open source tools and creating hybrid services can offer a flexible system that meets the needs of the researcher while also ensuring safe and continuous access.
In meeting the challenges of the new infrastructure, staff and researchers will need to upskill and be introduced to new ways of thinking, somewhat removed from the traditional data dissemination and archiving practices we find in social science today.
A number of common threads and challenges arose from the symposium. There is great excitement about enabling access to new and novel forms of data, but building systems to manage data across disciplines with varying access requirements is a serious undertaking, especially where no additional funding is available to adapt or renew infrastructures.
However, the use of open source tools and the introduction of hybrid infrastructures and systems could offer positive solutions and have the potential to reduce costs. Participants agreed to continue sharing ideas and solutions, especially at a time of reduced infrastructure funding, when capitalising on collaboration may help strengthen proposals.
Training and upskilling staff and researchers poses another challenge. The UK Data Service-DataFirst project has moved one step closer to this aim through hands-on engagement, developing and running a week-long Summer School entitled Encounters with Big Data: An Introduction to using Big Data in the Social Sciences. It was held in February 2017 at UCT and is to be run again in August 2017 in the UK at the Institute for Analytics and Data Science (IADS) Summer School.
This intensive course has given participants the opportunity to extract, explore and analyse big data using some core elements of DSaaP. Using a locally installed Hadoop Sandbox, participants were taught basic skills in managing and manipulating big data with data-science-oriented tools and software, such as R and Spark.
The feedback from the course was very encouraging, but the challenge now is to ensure that researchers can benefit from their new skills by putting them to use, and that those skills are updated and refreshed over time. We hope that new researchers will be encouraged to participate and that future capacity building can be carried out in a fully-fledged big data environment that can support future waves of researchers.
Louise Corti, UK Principal Investigator on the UK Data Service-DataFirst project, hopes that “we can continue to collaborate with UCT beyond the life of the soon-to-finish project, ensuring that the great networks we have brokered together across disciplines, and the policy-relevant impact we have emerged in the energy research can live on”.