Jupyter meets the Earth


By Lindsey Heagy and Fernando Pérez

We are thrilled to announce that the NSF is funding our EarthCube proposal “Jupyter meets the Earth: Enabling discovery in geoscience through interactive computing at scale”. The team working on this project consists of Fernando Pérez [1, 2, 3], Joe Hamman [4], Laurel Larsen [5], Kevin Paul [6], Lindsey Heagy [1], Chris Holdgraf [1, 2], and Yuvi Panda [7]. Our project team includes members from the Jupyter and Pangeo communities, with representation across the geosciences including climate modeling, water resource applications, and geophysics. Three active research projects, one in each domain, will motivate developments in the Jupyter and Pangeo ecosystems. Each of these research applications demonstrates aspects of a research workflow that requires scalable, interactive computational tools.

In this project we intend to follow the patterns that have made Jupyter an effective and successful platform: we will drive the development of computational machinery with concrete use cases drawn from our own experience and research needs, and then find the appropriate points for extension, abstraction, and generalization. We are motivated by advancing research of contemporary importance in geoscience, and are equally committed to producing broadly impactful, general-use infrastructure that benefits scientists, educators, industry, and the wider community.

The adoption of open languages such as Python, and the coalescence of communities of practice around open-source tools, are visible in nearly every domain of science. This is a fundamental shift in how science is conducted and shared. In recent years, there have been several high-profile examples in which open tools from the Python and Jupyter ecosystems played an integral role in the research, from data analysis to the dissemination of results. These include the first image of a black hole from the Event Horizon Telescope team and the detection of gravitational waves by the LIGO collaboration. The utility of open-source software in projects like these, and the success of open communities such as Pangeo, provide evidence of the force-multiplying impact of investing in an ecosystem of open, community-driven tools. Through this project, we will advance this open paradigm in geoscience research, while strengthening and improving the infrastructure that supports it. We made this argument when discussing the intended impacts of our proposal, and we are pleased that the NSF is investing in this vision.

Geoscience use cases

Given our project’s aims and approach, participating actively in domain research is crucial to our success. The following descriptions are meant to offer a flavour of the research questions we are tackling, each led by a geoscientist in the team.

CMIP6 climate data analysis (Hamman). The World Climate Research Programme’s Coupled Model Intercomparison Project is now in its sixth phase (CMIP6) and is expected to provide the most comprehensive and robust projections of future climate to date. When complete, the archive is expected to exceed 18 PB in size. In the coming years, this collection of climate model experiments will form the basis for fundamental research, climate adaptation studies, and policy initiatives. While the CMIP6 dataset is likely to hold new answers to many pressing climate questions, the sheer volume of data is likely to present significant challenges to researchers. Indeed, new tools for scalable data analysis, machine learning, and inference are required to make the most of these data.
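
To give a flavour of the workflows involved: with the Pangeo stack, a CMIP6-style analysis can be expressed lazily in Xarray and executed in parallel with Dask. The sketch below is a minimal illustration only; the bucket path and variable name are hypothetical placeholders, not real catalog entries.

```python
import fsspec
import xarray as xr

# Hypothetical path to a CMIP6 store in cloud object storage; a real
# analysis would locate datasets through a catalog.
store = fsspec.get_mapper("gs://example-cmip6-bucket/tas/historical.zarr")

# Lazily open the dataset; with Dask-backed chunks, nothing is loaded
# until a computation is actually requested.
ds = xr.open_zarr(store, consolidated=True)

# Global-mean surface air temperature time series. (A production
# analysis would weight this mean by grid-cell area.)
tas_mean = ds["tas"].mean(dim=("lat", "lon"))

# Trigger the distributed computation.
result = tas_mean.compute()
```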

Large-Scale Hydrologic Modeling (Larsen). Streamflow forecasts are a valuable tool for flood mitigation and water management. Creating these forecasts requires that a variety of data types be brought together including model-generated streamflow estimates, sensor-based observations of water discharge, and hydrometeorological forcing factors, such as precipitation, temperature, relative humidity, and snow-water equivalent. The integration of simulated and observed data over disparate spatial and temporal scales creates new avenues for exploring data science techniques for effective water management.
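
As a toy illustration of this integration step, the sketch below aligns hourly discharge observations with a daily precipitation record by resampling to a common temporal resolution; the file names and column names are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: hourly sensor-based discharge observations and
# daily precipitation extracted at the gauge location.
discharge = pd.read_csv("discharge_hourly.csv", index_col="time", parse_dates=True)
precip = pd.read_csv("precip_daily.csv", index_col="time", parse_dates=True)

# Resample the hourly observations to daily means so both series share
# a common temporal resolution.
discharge_daily = discharge["flow_m3s"].resample("D").mean()

# Join the aligned series into one frame for model forcing/evaluation.
forcing = pd.concat({"flow": discharge_daily, "precip": precip["precip_mm"]}, axis=1)
```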

Geophysical inversions (Heagy). Geophysical inversions construct models of the subsurface by combining simulations of the governing physics with optimization techniques. These models are critical tools for locating and managing natural resources, such as groundwater, or for assessing the risk from natural hazards, such as volcanoes. Today, we need models applicable to increasingly complex scenarios, such as the socially delicate task of developing groundwater management policies in water-limited regions. This will require the development of new techniques for combining multiple geophysical data sets in a joint inversion, as well as the use of statistical and data science methods for including geologic and hydrologic data in the construction of 3D models.
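
For readers unfamiliar with inversion, here is a minimal sketch of the underlying optimization, assuming a linear forward operator G; real geophysical inversions simulate the governing PDEs and use iterative solvers rather than this closed form.

```python
import numpy as np

def tikhonov_inversion(G, d, beta):
    """Solve min_m ||G m - d||^2 + beta ||m||^2 via the normal equations."""
    n_model = G.shape[1]
    # (G^T G + beta I) m = G^T d  -- beta trades data fit for stability.
    A = G.T @ G + beta * np.eye(n_model)
    return np.linalg.solve(A, G.T @ d)

# Tiny synthetic example: an under-determined problem (more model
# parameters than data) where regularization selects a stable solution.
rng = np.random.default_rng(0)
G = rng.standard_normal((20, 50))                # hypothetical linear "physics"
m_true = np.zeros(50)
m_true[10:15] = 1.0                              # a compact anomaly at depth
d = G @ m_true + 0.01 * rng.standard_normal(20)  # noisy observations
m_recovered = tikhonov_inversion(G, d, beta=1e-2)
```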

These scientific problems exhibit, each with its own flavour, similar technical challenges with respect to handling large volumes of data, performing expensive computations, and doing both of these as a part of the interactive, exploratory workflow that is necessary for scientific discovery.

Jupyter & Pangeo: empowering scientists

Jupyter and Pangeo are both open communities that share the goal of developing tools and practices for interactive scientific workflows; these tools aim to deliver practical value to scientists who, in the course of everyday research, face a combination of big data and large-scale computing needs.

Project Jupyter creates open-source tools and standards for interactive computing. These span the spectrum from low-level protocols for running code interactively up to the web-based JupyterLab interface that a researcher uses. Jupyter is agnostic to programming language: over 130 different Jupyter kernels exist, providing support for most programming languages in widespread use today. Jupyter can be run on a laptop, in an HPC center, or in the cloud. Shared-infrastructure deployments (e.g. HPC / cloud) are enabled by JupyterHub, a component in the Jupyter toolbox that supports the deployment and management of Jupyter sessions for multiple users. The development process of tools in the Jupyter ecosystem is community-oriented and includes a diverse set of stakeholders across research, education, and industry. The project has a strong tradition of building tools that are first designed to solve specific problems, and then generalized to other users and applications.

A Pangeo Platform is a modular composition of open, community-driven projects, tailored to the needs of a specific scientific domain. In its simplest form, it is based on the following generic components: a browser-based user interface (Jupyter), a data model and analytics toolkit (Xarray), a parallel job distribution system (Dask), a resource management system (either Kubernetes or a job queuing system such as PBS), and a storage system (either a cloud object store or a traditional HPC file system). These are complemented by problem- and domain-specific libraries.
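
To sketch how these components compose in practice: in the snippet below, a local Dask cluster stands in for the Kubernetes- or queue-managed cluster of a real deployment, and the dataset path is a placeholder.

```python
import xarray as xr
from dask.distributed import Client

# On a Pangeo deployment this client would connect to workers managed
# by Kubernetes or a job queue; run locally, it starts a small cluster.
client = Client()

# Open an analysis-ready, chunked dataset (the path is hypothetical).
ds = xr.open_zarr("example-ocean-temperature.zarr")

# Xarray expresses the analysis; Dask executes it across the workers.
monthly_climatology = (
    ds["temperature"].groupby("time.month").mean("time").compute()
)
```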

This modular design allows individual components to be readily exchanged and the system to be applied to new use cases. Pangeo was created by, and for, geoscientists faced with large-scale data and computation challenges, but such problems are now common across the sciences. Researchers in a variety of disciplines, including neuroscience and astrophysics, are working to adapt the Pangeo design pattern for their communities. Beyond the initial Pangeo deployments supported by the NSF EarthCube grant for Pangeo, the platform has been adopted internationally, including by the UK Met Office. It has also supported new research and education efforts from other federal agencies, such as NASA’s HackWeek focused on the analysis of ICESat-2 data.

User-Centered Development

Pushing the boundaries of any toolset unveils areas for improvement and opportunities for developments that streamline and upgrade the user experience. We aim to take a holistic view of the scientific discovery process, from initial data acquisition through computational analysis to the dissemination of findings. Our development efforts will make technological improvements within the Jupyter and Pangeo ecosystems in order to reduce pain points along the discovery lifecycle and to advance the infrastructure that serves scientists. Following established patterns in Jupyter’s development, we take a user-first, needs-driven approach and then generalize these ideas to work across related fields. Broadly, there are four areas along the research lifecycle where we will invest development effort; we discuss each in turn below.

Data discovery. An early step in the research process is locating and acquiring data of interest. Data catalogs provide a structured way to expose datasets to the community. Within the geosciences, there are a number of emerging community standards for data catalogs (e.g. THREDDS, STAC). To streamline access to such datasets, we plan to develop JupyterLab extensions that provide a user interface exposing these catalogs to researchers. This work will build upon the JupyterLab Data Registry, which will provide a consistent set of standards for how data can be consumed and displayed by extensions in the Jupyter ecosystem, as well as Intake, a lightweight library for finding, loading, and sharing data which is already serving the Pangeo community. Our communities are already discussing potential avenues for integration between these tools.
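
To illustrate what catalog-based access looks like with Intake today (the catalog URL and entry name below are hypothetical):

```python
import intake

# Open a (hypothetical) YAML catalog describing available datasets.
cat = intake.open_catalog("https://example.org/geoscience-catalog.yml")

# Enumerate the datasets the catalog exposes.
print(list(cat))

# Load one entry; Intake dispatches to the appropriate driver, so the
# user does not need to know the storage format or location.
sst = cat["sea_surface_temperature"].read()
```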

Scientific discovery through interactive computing. The Jupyter Notebook has been adopted by many scientists because it supports an iterative, exploratory workflow combining code and narrative. Beyond code, text, and images, Jupyter supports the creation of Graphical User Interfaces (GUIs) with minimal programming effort on the part of the scientist. The Jupyter widgets framework lets scientists create a “Research GUI” that combines scientific code with interactive elements such as sliders, buttons, and menus in as little as a single line of code, while still allowing extensive customization and more complex interfaces when required. In this project, we will develop custom widgets tailored to the specific scientific needs of each of our driving use cases.
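
For instance, with ipywidgets, wrapping an existing function in interact is enough to generate sliders for its parameters; the plotting function below is a stand-in for real scientific code.

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

def plot_wave(frequency=1.0, amplitude=1.0):
    """Stand-in for a scientific computation with tunable parameters."""
    x = np.linspace(0, 2 * np.pi, 500)
    plt.plot(x, amplitude * np.sin(frequency * x))
    plt.ylim(-3.5, 3.5)
    plt.show()

# One line: a slider per keyword argument, re-running on every change.
interact(plot_wave, frequency=(0.5, 10.0), amplitude=(0.5, 3.0))
```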

Beyond their utility in the exploratory phase of research, interactive interfaces, or “dashboards,” provide a mechanism for delivering custom scientific displays to collaborators, stakeholders, and students for whom the details of the code may not be pertinent. Voilà is a project, led by the QuantStack team, that enables dashboards to be generated from Jupyter notebooks. We plan to develop interactive dashboards using Voilà for our geoscience use cases, contribute generic improvements to the Voilà codebase, and provide a demonstration of how researchers can deploy dashboards to share their research.
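
Concretely, with Voilà installed, running voila dashboard.ipynb (a hypothetical notebook name) serves that notebook as a web application in which code cells are executed but hidden, so viewers interact only with the rendered widgets and outputs.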

Research GUIs to explore Maxwell’s equations in research and education. Photo credit: Seogi Kang.

Established tools and data visualization. Many widely used tools, particularly for visualization (e.g. Ncview, ParaView), are desktop-based applications and therefore cannot easily be used in cloud or HPC workflows. In some cases, modern, open-source alternatives are available, but for specialized tasks they often do not yet match the functionality of their desktop counterparts. JupyterHub can readily serve web-native, non-Jupyter software applications such as RStudio, Shiny applications, and Stencila to users; under this project we aim to extend JupyterHub to also be able to serve desktop-native applications.

Using and managing shared computational infrastructure. JupyterHub makes it possible to manage computing resources and user accounts, and to provide access to computational environments online. Currently, JupyterHubs in the Pangeo project are deployed and maintained using the Zero to JupyterHub guide alongside the HubPloy library. Together, these tools have simplified the initial setup of the Hubs and automated their upgrades. There are still many improvements that would make JupyterHub better suited to larger, more complex deployments, both in managing users and in efficiently allocating resources. Under this project, we plan to build tools that collect metrics such as CPU and memory usage and expose them to both users and administrators so they can make more efficient use of the Hub. For shared deployments, we will improve user management so that user-groups can be used to manage permissions and allocations, and so that usage can be tracked and appropriately billed to the relevant grants. Within the HubPloy library, we also plan to streamline continuous deployment so that installation and upgrade processes are repeatable and reliable.
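
For context, JupyterHub is configured through a Python file; a minimal sketch of the kind of shared-deployment settings involved might look like the following (the values are illustrative, not a recommended production configuration).

```python
# jupyterhub_config.py -- an illustrative sketch, not a production setup.
# The configuration object `c` is provided by JupyterHub at load time.

# Cap per-user resources so a single session cannot starve the shared
# deployment (enforcement depends on the spawner in use).
c.Spawner.mem_limit = "2G"
c.Spawner.cpu_limit = 2.0

# Users permitted to administer the Hub (e.g. to manage others' servers).
c.Authenticator.admin_users = {"hub-admin"}
```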

An opportunity for meaningful impact — join us!

The impacts of climate change and the need for data-driven management of resources are some of the most critical and complex challenges facing society today. In recent years, we have experienced severe droughts in California and are in the midst of water management crises in the Central Valley, while devastating wildfires have destroyed entire communities. Through this partnership between researchers and Jupyter developers, we hope to contribute to the advancement of science-based solutions to these challenges, both by contributing directly to the research and by improving the open ecosystem of tools available to researchers and the stakeholders impacted by these issues.

If developing open-source tech to advance research in geoscience and beyond excites you, then please get in touch! At UC Berkeley, we will be hiring for two positions: a dev-ops position focussed on JupyterHub and shared infrastructure deployments, and a JupyterLab-oriented role focussed on extensions, dashboards, and interactivity. There will also be a position opening up at NCAR for a software engineer targeting improvements to the user experience of Xarray and Dask workflows.

Even if you aren’t looking for a new job, there are other ways to get involved with both the Jupyter and Pangeo communities. We welcome new participants at the weekly Pangeo meetings (Wednesdays, alternating between 4pm and 8pm GMT), and there are monthly Jupyter community calls, which are open and meant to be accessible to a wide audience. Outside of calls, general Jupyter conversations happen on the Jupyter Discourse forum, and Pangeo conversations typically take place on the Pangeo GitHub.

In closing: a step toward sustainable open science

This project provides our team with $2 million in funding over three years as part of the NSF EarthCube program. It also represents the first time federal funding is being allocated for the development of core Jupyter infrastructure.

The open source ecosystem that Jupyter and Pangeo belong to has become part of the backbone that supports much of today’s computation in science, from astronomy and cosmology to microbiology and subatomic physics. Such broad usage represents a victory for this open and collaborative model of building scientific tools. Much of this success has come through the efforts of scientists and engineers who are committed to an open model of science, but who have had to work with little direct funding, minimal institutional support, and few viable career paths within science.

There is real strategic risk to continuing with the implicit assumption that scientific open-source tools can be developed and maintained “for free.” If open, community-driven tools are to sustainably grow into the computational backbone of science, we need to recognize this role and support those who create them as regular members of the scientific community (we recently talked about this in more detail in a talk at NSF headquarters). Projects like ours, where funding and resources are explicitly allocated toward this goal, are a step in the right direction. We hope that our experiences will contribute to ongoing conversations in the scientific community around these complex issues.

In the past, we have tried to maintain a close relationship between domain problems and software development in Jupyter. However, this has typically been done in an ad-hoc manner, either by “hiding” the software development under the cover of science or by securing funding for Jupyter alone. This is the first project where we explicitly partner with a team of domain scientists to simultaneously drive forward domain research and the development of Jupyter infrastructure.

We are excited about this opportunity and hope to be able to demonstrate that investing in open tools can be a force-multiplier of resources. As always, our work will be done openly, transparently, and with constant community engagement. We look forward to your critiques, ideas, and contributions to make this effort as successful as possible.

Acknowledgments

Thanks to Joe Hamman, Chris Holdgraf, and Doug Oldenburg for constructive feedback and edits on this blog post.

Many thanks to Ryan Abernathey (Columbia), Paul Bedrosian (USGS), Sylvain Corlay (QuantStack), Rich Signell (USGS), and Rollin Thomas (NERSC), who provided us with letters of support for this project; we look forward to working with you all! We are also grateful to Shree Mishra, our NSF Program Director on this project, and to Dave Stuart for their support as we move forward. This project is part of the EarthCube program, and we look forward to engaging with its working group.

Finally, the sustained growth of Jupyter to the large-scale project that it has become would not have happened without the generous support of the Alfred P. Sloan, the Gordon and Betty Moore, and the Helmsley Foundations, as well as the leadership of Josh Greenberg and Chris Mentzel respectively at Sloan and Moore.

This work is supported by the NSF EarthCube program under awards 1928406 and 1928374.

Affiliations

[1] UC Berkeley, Statistics Department

[2] UC Berkeley, Berkeley Institute for Data Science

[3] Lawrence Berkeley National Lab, Computational Research Division

[4] National Center for Atmospheric Research, Climate and Global Dynamics Laboratory

[5] UC Berkeley, Department of Geography

[6] National Center for Atmospheric Research, Computational Information Systems Laboratory

[7] UC Berkeley, Division of Data Sciences
