How Data Commons Can Support Open Science
By Robert L. Grossman, Apr 24

In February 2019, Sage Bionetworks hosted a workshop called Critical Assessment of Open Science (CAOS) in New Orleans.
A number of those attending the workshop have published blog posts that explore some of the themes that were discussed at the workshop.
You can find the posts at the Sage Bionetworks website.
Here is my contribution to this series.

What are data commons and why might a research community develop one? Below, I give a quick introduction to data commons and their role in open science.
Data commons are used by projects and communities to create open resources to accelerate the rate of discovery and increase the impact of the data they host.
It is important to note that a data commons is not designed as a place for individual researchers working on isolated projects to ignore FAIR principles and dump their data simply to satisfy data management and data sharing requirements.
More formally, data commons are software platforms that co-locate: 1) data, 2) cloud-based computing infrastructure, and 3) commonly used software applications, tools and services to create a resource for managing, analyzing and sharing data with a community.
The key ways that data commons support open science include the following.

Data commons make data available so that they are open and can be easily accessed and analyzed.
Data commons support FAIR principles for data.
Unlike a data lake, data commons curate data using one or more common data models and harmonize them by processing them with a common set of pipelines so that different datasets can be more easily integrated and analyzed together.
In this sense, data commons reduce the cost and effort required for the meaningful analysis of research data.
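As a toy illustration of what harmonizing datasets to a common data model involves, the sketch below maps records from two hypothetical studies, each with its own field names, onto a single shared schema. All field names, study names, and values here are invented for the example and do not correspond to any particular commons' data model.

```python
# A minimal sketch of harmonization against a hypothetical common data model.
# Each contributing study describes the same concepts with different field
# names; a shared mapping step produces records in one common schema.

COMMON_FIELDS = ["subject_id", "age_years", "diagnosis"]

def harmonize(record, mapping):
    """Map a source record onto the common data model using a field mapping."""
    return {common: record.get(source) for common, source in mapping.items()}

# Two hypothetical contributing studies with divergent field names.
study_a = [{"patient": "A-001", "age": 54, "dx": "BRCA"}]
study_b = [{"subj": "B-107", "age_at_enrollment": 61, "diagnosis_code": "LUAD"}]

map_a = {"subject_id": "patient", "age_years": "age", "diagnosis": "dx"}
map_b = {"subject_id": "subj", "age_years": "age_at_enrollment",
         "diagnosis": "diagnosis_code"}

harmonized = [harmonize(r, map_a) for r in study_a] + \
             [harmonize(r, map_b) for r in study_b]

# After harmonization, both studies can be queried and analyzed together.
for row in harmonized:
    print(row)
```

In a real commons, this mapping step is one stage of a larger curation pipeline, but the core idea is the same: once every dataset is expressed in the common model, cross-dataset analysis becomes a single query rather than a bespoke integration project.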
Data commons save time for researchers by integrating and supporting commonly used software tools, applications and services.
Data commons use different strategies for this.
The commons themselves can include workspaces that support data analysis, commons can interoperate with co-located cloud computing platforms and applications, or data analysis can be done via third party applications, such as Jupyter notebooks, that access data through APIs exposed by the data commons.
These days it is common to integrate machine learning, deep learning and other AI tools with data commons so that models can be built easily over the data they host.
Data commons also save money and resources for a research community, since each research group in the community does not have to create its own computing environment and host the same data.
Since operating a data commons can be expensive, a model that is becoming popular is not to charge for access to the data itself, but either to provide cloud-based credits or allotments to those interested in analyzing the data, or to pass the charges for data analysis on to the users.
A good example of how data commons can support open science is the Genomic Data Commons (GDC) that was launched in 2016 by the National Cancer Institute (NCI).
The GDC has over 2.7 PB of harmonized genomic and associated clinical data and is used by over 100,000 researchers each year.
In an average month, 1–2 PB or more of data are downloaded or accessed from it.
The GDC also interoperates with three cloud computing platforms: Broad’s FireCloud, the Seven Bridges Genomics Cancer Genomics Cloud, and ISB’s Cancer Genomics Cloud.
The GDC supports an open data ecosystem that includes Jupyter notebooks, RStudio notebooks, and more specialized applications that access GDC data via the GDC API.
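As a sketch of what programmatic access through the GDC API looks like, the snippet below builds a query against the public files endpoint at api.gdc.cancer.gov. The endpoint and JSON filter syntax follow the GDC API documentation, while the particular project and fields are chosen only for illustration.

```python
# Sketch: building a GDC API query that a third-party application (for
# example, a Jupyter notebook) could use to discover open-access files.
import json
from urllib.parse import urlencode

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# GDC filters are JSON trees of operators; this one selects open-access
# files from a single TCGA project (chosen only as an example).
filters = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id",
                                 "value": ["TCGA-BRCA"]}},
        {"op": "in", "content": {"field": "access", "value": ["open"]}},
    ],
}

params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,data_category,file_size",
    "format": "JSON",
    "size": "10",
}

url = GDC_FILES_ENDPOINT + "?" + urlencode(params)
print(url)  # fetch with any HTTP client, e.g. requests.get(url).json()
```

Because the data are exposed behind a stable API like this, the same harmonized files can back many independent notebooks and applications without each group re-hosting the data.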
The GDC saves the research community time and effort since research groups have access to harmonized data that have been curated with respect to a common data model and run with a set of common bioinformatics pipelines.
By using a centralized cloud-based infrastructure, the GDC also reduces the total cost for cancer researchers to work with large genomic datasets, since each research group does not need to set up and operate its own large-scale computing infrastructure.
Based upon this success, a number of other communities are building their own data commons or considering it.
For more information about data commons and the data ecosystems that can be built around them, see:

Robert L. Grossman, Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data, Trends in Genetics 35 (2019) pp. Also see: arXiv:1809.
Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The Cancer Journal: The Journal of Principles and Practice of Oncology, May/June 2018, Volume 24, Number 3, pages 122–126. doi: 10.1097/PPO.0000000000000318.

About: Robert L. Grossman is the Frederick H. Rawson Distinguished Service Professor in Medicine and Computer Science and the Jim and Karen Frank Director of the Center for Translational Data Science (CTDS) at the University of Chicago.
He is also the Director of the not-for-profit Open Commons Consortium (OCC), which manages and operates cloud computing and data commons infrastructure to support scientific, medical, healthcare and environmental research.
Originally published at http://sagebionetworks.