Whole Tale

A platform for publishing transparent and reproducible computational research

Announcement

Whole Tale 1.2 released! See what’s new.

Whole Tale is an open-source platform designed to simplify the process of creating, publishing, and verifying computational research artifacts. Use Whole Tale to:

  • Run your code on an external system to ensure reproducibility.

  • Configure a computational environment with the exact versions of operating system and software used to obtain results.

  • Capture a recorded run of your workflow to ensure transparency of results.

  • Publish computational research artifacts to popular research archives.

For more information, see:

Why Whole Tale?

An overview of the Whole Tale platform.

What’s new?

Description of key features and known limitations of the platform.

User’s Guide

Platform user’s guide

Tutorials

Step-by-step tutorials

The project is funded by the National Science Foundation. A public version of the platform is available for academic use at https://dashboard.wholetale.org/.


Terms of Use

Whole Tale is an open-source platform. The operational version hosted on Jetstream2 is for academic use only and limited to fair and reasonable use.

MATLAB licensing is on a “right to use” basis and not guaranteed in perpetuity. Absolutely no proprietary, closed, or for-profit use of MATLAB is permitted.

Violating any terms of service with respect to ACCESS, Jetstream2, or MathWorks licensing may result in loss of access to these services.

Tutorials

Quickstart

A simple example demonstrating how to run an existing tale and create a new tale that uses externally registered data.

STATA Tutorial

Demonstrates how to create and run a tale based on the STATA Desktop environment.

JupyterLab

Demonstrates how to create and run a tale based on the JupyterLab environment.

User’s Guide

This guide provides detailed information about the Whole Tale system for new users.

Why Whole Tale?

Whole Tale is developing open-source tools and infrastructure intended to simplify the process of creating, publishing, and verifying computational research artifacts in conformance with community standards (Chard et al. [1]).

Research communities across the sciences are increasingly concerned with the replication of computationally obtained results and the reuse of research software, to increase both trust in and the scientific impact of published research. Transparent, reproducible, and reusable research artifacts are seen as essential to sustaining and further accelerating scientific discovery. However, ensuring the transparency and reproducibility of results or the reusability of software presents many challenges, both technological and social. Many communities are turning to peer-review processes as a means to encourage and enforce new practices and standards (Willis and Stodden [2]).

For more information, see the report of the National Academies of Sciences, Engineering and Medicine on Reproducibility and Replicability in Science.

What is a “Tale”?

Whole Tale defines a model for computational reproducibility that captures the input, output, data, code, execution environment, provenance and other metadata about the results of computational research. We refer to this model as a tale – a composite research object that includes the environment, configuration, metadata, code, and data objects required to fully reproduce computational results (Chard et al. [3]). Technically speaking, a tale’s computational environment is defined by a Docker image specification that can be published to an external archive – along with your code and data – and re-run either in Whole Tale or even on your laptop.

What can Whole Tale do?

Having created a tale, a researcher can share it with others, publish it to a research repository (such as Zenodo or Dataverse), associate a persistent identifier with it, and link it to publications. Other researchers or reviewers can instantiate the published version of the tale and execute it in the same state as when it was published. Tales may also contain intellectual property metadata with licensing information, enabling reuse, reproducibility, attribution, and broad access.

Who uses Whole Tale?

Whole Tale is intended for researchers, editors, curators, and reviewers of published computational research.

What’s new?

Version 1.2 introduces a number of new features and enhancements.

For a complete list of features and bugfixes, see the v1.2 release notes.

Planned Features

For a complete list of current and planned features, see the release schedule.

  • Archival storage of container images

  • User interface to configure the environment

  • Support for user-contributed templated environments

  • Improved composability of environments

  • Improved accessibility including use of VS Code

  • Increased resources (CPU, memory) and GPU support

  • Metadata enhancements including citations, licenses

  • Computational provenance recorder using eBPF

Limitations

  • The Whole Tale dashboard works best in Chrome. There are known issues in Firefox.

Signing In

Users can access Whole Tale to browse public tales without signing in. However, many of the core operations require users to have an account. Whole Tale allows users to sign in using existing credentials from hundreds of research institutions and organizations, as well as ORCID or Google accounts.

Note

The Whole Tale system uses Globus Auth to allow users to login using their existing credentials. For more information about Globus, see their documentation.

  1. Go to https://dashboard.wholetale.org and select the “Sign In” link to start the login process.

_images/public_signin.png
  2. Search for and select your institution or organization from the search box. If your organization does not appear in the list, you can use your Google account, ORCID account, or register for a Globus account. After selecting your organization, select the “Continue” button.

_images/organization_selection.png
  3. You will be redirected to your organization’s login page. Enter your credentials.

_images/orcid_login.png

Note

Whole Tale user accounts are based on the email address obtained from the institutional login. Two different accounts (e.g., institution and ORCID) that have the same email address will have the same Whole Tale user.

Whole Tale uses authentication services provided by Globus. The first time you login, you may be prompted to grant Whole Tale access to information via Globus.

_images/consent.png

After logging in you will be redirected to the Whole Tale dashboard where you can explore existing and create new tales.

Important

You can revoke this consent at any time via https://app.globus.org/account/consents.

Signing in with Google

You will be asked to “Sign in to continue to globus.org”; this is expected, as Whole Tale uses Globus toolkits. If you are already signed into a Google account in your browser, you will be offered one or more accounts to choose from; if not, enter the preferred Google account to use.

_images/google_login.png

Signing in with ORCID

You can also use your ORCID account. In this case, you will be redirected to the ORCID authentication screen. You should authenticate as you usually would to ORCID.

_images/orcid_login.png

About Globus

Globus (https://www.globus.org/what-we-do) is a non-profit service of the University of Chicago for secure and reliable research data management. Globus software is commonly used to transfer data in research computing infrastructure. The use of Globus authentication services in Whole Tale enables us to access data on behalf of users via the Globus network.

Exploring Existing Tales

The Tale Dashboard page is the main landing page for Whole Tale. From this page you can create new tales, run existing tales, and browse tales shared with you.

_images/browse_overview.png

Whole Tale’s main landing page

The Tale Dashboard has four sections:

  • My Tales: Tales you have created

  • Shared with Me: Tales shared with you by other users

  • Public Tales: Tales shared publicly by users of the system

  • Currently Running: Displayed if you have any interactive sessions running

My Tales

The My Tales tab displays all tales that you have created or copied. You have edit permission on these tales and can delete them or share them with other users.

Shared with Me

The Shared with Me tab displays all tales that have been shared with you by other users. These may be read-only or editable. If a tale is shared with you read-only, when you attempt to run it a copy will be made and appear under My Tales with the “COPY” indicator.

Public Tales

The Public Tales tab displays all tales that have been shared publicly by users of the system. These are all read-only. If you attempt to run a public tale, a copy will be made and appear under My Tales with the “COPY” indicator.

Currently Running Tales

If you have clicked the Run Tale button for any tales, the Currently Running panel will display. You may have up to two interactive environments running at the same time.

Tale Operations

View Tale

Hover over the tale card and select View to access a tale. You can view or edit metadata, files, and run the interactive environment created by the author.

Run Tale

Clicking the Run Tale button on a tale that you own will start the associated interactive environment. On tales shared publicly or with read-only permissions, a copy will first be created.

Stop Tale

Clicking the Stop Tale button will stop the interactive environment.

Delete Tale

To delete a tale, click the “X” button on the tale card. You will be prompted to confirm before the tale is deleted.

Important

Stop will end your interactive session, shutting down the associated container image. Delete will completely remove your tale from the system.

Creating New Tales

Tales contain the inputs, outputs, data, code, execution environment, provenance and other related metadata about the results of computational research. A tale is typically associated with a single publication and contains all information necessary to reproduce reported results.

Tales can be created as follows:

  • New Tale: Create a new empty tale

  • Github repository: Create a tale based on an existing Github repository

  • From data repository: Create a tale based on an existing dataset stored in a research repository (such as Dataverse, Zenodo or DataONE)

Note

You can create as many tales on the system as you’d like, but you can only have two interactive environments running concurrently.

_images/create_overview.png

Creating a new Tale

Environments

When creating a new tale, you must select the default interactive environment that will be used. Supported environments include JupyterLab, RStudio, MATLAB and STATA. For more information including how to customize installed packages, see the Environments section.

Creating an Empty Tale

To create an empty Tale, click the Create New Tale button and select the Create New Tale option. The Create New Tale dialog will appear, allowing you to enter a title and select the interactive environment. Select the Create New Tale button and you will be taken to your new tale, where you can upload files, register external data, edit metadata, share with other users, or start an interactive environment. For more information see the Accessing and Modifying Tales section.

_images/create_menu.png

Create New Tale menu

_images/create_new_tale.png

Dialog for creating a new Tale

Creating Tales from Git Repositories

Tales can also be created from existing public Git repositories. To create a new tale that contains a Git repository, select the Create New Tale dropdown menu then Create Tale from Git Repository.

_images/create_new_git_tale.png

Dialog for creating a new Tale from a Git repository

Enter the URL of a public repository, title for your tale, and select the desired interactive environment. Select the Create New Tale button to create the tale and import contents from the specified Git repository. For more information about using Git in Whole Tale, see Working with Git.

Creating and Importing Tales from External Repositories

Tales can also be created from third-party research data repositories. Currently supported repositories include Zenodo, Dataverse, OpenICPSR, and DataONE.

_images/create_new_tale_doi.png

Dialog for creating a new Tale from a DOI

Choosing Between Read-Only and Read/Write

When a tale is created from an externally registered dataset (e.g., via DOI), you can choose either to mount the dataset read-only via external data or to copy its contents into the workspace, making them writable. Citations are automatically generated for read-only external datasets.

Accessing and Modifying Tales

The Run view allows you to interact with and modify your running tale.

Launching the Tale

After you have finalized your tale and clicked Run Tale, you’ll be brought to the Interact page where the tale starts up, as seen in the image below. From here you can access the tale, along with an assortment of other actions documented below.

_images/tale_launching.png

A tale that is being created and configured.

Interacting With Tales

RStudio

When starting a tale that is using an RStudio Environment, you’ll be presented with RStudio, shown below.

_images/rstudio.png

Each of the folders shown is analogous to a tab under the Files tab. You can access all of your home files under the home/ folder; data brought in from a third-party service can be found under data/; files added to your workspace are found under workspace/.

Jupyter Notebook

When starting a tale that has a Jupyter Notebook Environment, you’ll be presented with a typical Notebook interface.

_images/jupyter_browse.png

As with RStudio, data that came from external repositories can be found under data/, home directory files in home/, and workspace files in workspace/.

Adding Data

See File Management for details about how to manage files and data in your tale.

Modifying Tale Metadata

The Run page can also be used to access the tale metadata editor, shown below.

_images/metadata_editor.png

The editor can be used to change the environment, add authors to the tale, change the license, make the tale public, and provide an in-depth description of the tale.

Advanced Settings

The advanced settings section allows you to override default settings including the default command, environment variables, and memory limits. Note that memory limits are constrained by the underlying virtual machine. Any additional files required for building the container image can be specified using the extra_build_files setting.

{
    "environment": [
        "MY_ENV=value"
    ],
    "memLimit": "12gb",
    "extra_build_files": [
        "some_file.txt",
        "some_folder"
    ]
}

Tale Actions

Use the tale’s action menu, highlighted below, to access tale-specific operations.

_images/action_menu.png

The tale’s action menu

Tale actions

  • View Logs: Enabled when your tale instance is running; allows you to view the running container instance logs (i.e., docker logs).

  • Rebuild Tale: Rebuilds the container image. Requires a restart (below).

  • Restart Tale: Restarts the container instance.

  • Save Tale Version: Creates a new version of your tale. See Versioning Tales.

  • Recorded Run: Starts a recorded run. See Recorded Runs.

  • Duplicate Tale: Creates a copy of your tale.

  • Publish Tale: Publishes your tale to a supported repository. See Publishing Tales.

  • Export Tale: Exports your tale. See Exporting and Running Locally.

  • Connect to Git Repository…: Connects an existing workspace to a remote Git repository.

Computational Environments

Your tale’s computational environment is a Docker image that is automatically built based on the combination of the selected interactive environment and any specified software dependencies. The interactive environment is selected during tale creation and can be changed on the Metadata page. Whole Tale currently supports a number of popular interactive environments including JupyterLab, RStudio, MATLAB, and STATA.

You can customize these environments by adding your own packages based on repo2docker compatible configuration files. Read more about customizing the environment.

What is Docker?

Docker is a popular virtualization (“container”) platform that has been widely adopted for the packaging, distribution, and deployment of software – including research software. Whole Tale creates Docker images on your behalf that capture the operating system and software versions used to execute your computational workflow. These images are stored in our container image registry.

Extending repo2docker

Whole Tale uses a custom extension to the Binder project’s repo2docker component to build images. This means that any Binder-compatible repository can also be run in Whole Tale.

Note

repo2docker is a software package that can be used to build custom Docker images based on some simple text file conventions.

Differences from Binder

The Whole Tale repo2docker extension adds the following capabilities beyond Binder:

  • Support for MATLAB and STATA environments

  • Support for Rocker Project RStudio environments

Custom Buildpacks

  • The MATLAB buildpack introduces toolboxes.txt

Jupyter

When starting a Tale that has a Jupyter Notebook Environment, you’ll be presented with a typical Notebook interface.

_images/jupyter_browse.png

As with RStudio, data that came from external repositories can be found under data/, home directory files in home/, and workspace files in workspace/.

RStudio

When starting a tale that is using an RStudio Environment, you’ll be presented with RStudio, shown below.

_images/rstudio.png

Each of the folders shown is analogous to a tab under the Files tab. You can access all of your home files under the home/ folder; data brought in from a third-party service can be found under data/; files added to your workspace are found under workspace/.

MATLAB

Whole Tale supports the creation, publication, and execution of Tales that rely on MATLAB software. We provide three different interfaces to MATLAB: the Web Desktop (available with MATLAB R2020b and later), JupyterLab with the MATLAB kernel (all supported versions), and the Linux Desktop via XPRA (all supported versions).

Use of MATLAB is subject to our Terms of Use.

Web Desktop

Available with MATLAB R2020b and later, the web desktop provides access to the new MATLAB Online-style IDE that can be used from any standard web browser. The web desktop experience will be familiar to all MATLAB users.

_images/matlab-web-desktop.png

Jupyter with MATLAB Kernel

Available for any supported version of MATLAB, the JupyterLab IDE with MATLAB kernel can be used to create and run MATLAB code or Jupyter notebooks using the MATLAB kernel. The JupyterLab terminal provides access to a Linux shell environment to run MATLAB code.

_images/matlab-jupyter-kernel.png

Linux Desktop via XPRA

Available for any supported version of MATLAB, Xpra HTML5 client provides remote access to the native MATLAB Linux desktop.

_images/matlab-xpra.png

Customizing Your MATLAB Environment

In Whole Tale, users must declare their dependencies in a set of simple text files that are used to build a Docker image. For more information, see Customizing the Environment.

Technical Details

MATLAB Installation Media

Whole Tale requires access to the installation media for each supported release of MATLAB. Downloadable ISO images from MathWorks are converted to Docker images used for installation of the base MATLAB software and selected toolboxes. These images are private to the Whole Tale system, but anyone with an appropriate license should be able to access them from MathWorks. Instructions for creating the installation image are available in the matlab-install repository.

MATLAB BuildPack

To support the creation of custom MATLAB environments, Whole Tale has created a Binder-compatible buildpack. MATLAB environments begin with only the base MATLAB software installed, and users can customize them by listing selected toolboxes in a toolboxes.txt file. The purpose is to minimize image sizes at the time of publication by enabling the selection of only those packages required for reproduction. For example, see https://github.com/craig-willis/matlab-example.

License Information

Access to MATLAB on the Whole Tale platform is provided by institutional licenses from the University of Texas at Austin and Indiana University through the NSF Jetstream Cloud service. Tales that are published/exported and run outside of the Whole Tale system will require you to provide your own license information.

As noted in our Terms of Use, MATLAB on Whole Tale is for academic use only.

STATA

Whole Tale supports the creation of Tales that rely on STATA software.

Interfaces

Whole Tale supports two different interfaces for creating STATA-based Tales:

Jupyter with STATA Kernel

Available for any supported version of STATA, the JupyterLab IDE with STATA kernel can be used to create and run STATA code or Jupyter notebooks using the STATA kernel. The JupyterLab terminal provides access to a Linux shell environment to run STATA code.

Linux Desktop via XPRA

Available for any supported version of STATA, Xpra HTML5 client provides remote access to the native STATA Linux desktop.

License

Access to STATA on the Whole Tale platform is provided through support from StataCorp. Tales that are exported and run outside of the Whole Tale system will require you to provide your own license information.

As noted in our Terms of Use, STATA on Whole Tale is for academic use only.

Customizing the Environment

You can customize your tale environment by declaring any software dependencies in a set of simple text files. These are used to build a Docker image that is used to run your code.

The text file formats are based on formats supported by repo2docker and, where possible, follow the package installation conventions of each language (e.g., requirements.txt for Python, DESCRIPTION for R). In other cases, simple formats are defined (e.g., apt.txt, toolboxes.txt). For more information, see the repo2docker configuration files documentation.

Base environment

The base environment is always an Ubuntu “Long Term Support” (LTS) version.

Install Linux packages (via apt)

The apt.txt file contains a list of packages that can be installed via apt-get. Entries may have an optional version (e.g., apt-get install <package name>=<version>).

For example:

libblas-dev=3.7.1-4ubuntu1
liblapack-dev=3.7.1-4ubuntu1

For more information, see Install packages with apt-get (repo2docker)

MATLAB

Note

MATLAB support is part of the Whole Tale repo2docker extension.

MATLAB toolboxes must be declared in a toolboxes.txt file. Each line contains a valid MATLAB product for the selected version.

For a complete list of available packages for each supported version, see https://github.com/whole-tale/matlab-install/blob/main/products/.

For example, the following toolboxes.txt would install the Financial Toolbox and the Statistics and Machine Learning Toolbox.

product.Financial_Toolbox
product.Statistics_and_Machine_Learning_Toolbox

See our MATLAB example

STATA

Note

STATA support is part of the Whole Tale repo2docker extension.

STATA packages must be declared in an install.do file. Each line contains a valid installation command.

For example, the following install.do uses ssc to install packages:

ssc install estout
ssc install boottest
ssc install hnblogit

R/RStudio

R packages may be specified in a DESCRIPTION file or install.R.

For install.R, each line is an install.packages() statement for a given package:

install.packages("ggplot2")
install.packages("reshape2")
install.packages("lmtest")

To configure a specific version, we recommend configuring an MRAN date using the runtime.txt file:

r-2020-10-20

This file pins an MRAN snapshot date, which determines the versions of the packages specified in install.R.

Alternatively, you can use the install_version function in place of install.packages in your install.R file:

require(devtools)
install_version("ggplot2", version = "0.9.1")


Python

Python packages can be specified using requirements.txt, Pipfile/Pipfile.lock, or Conda environment.yml.

Example requirements.txt:

bokeh==1.4.0
pandas==1.2.4
xlrd==2.0.1
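An equivalent Conda specification can be provided in environment.yml. A minimal sketch (the environment name and package versions are illustrative):

name: tale
channels:
  - conda-forge
dependencies:
  - python=3.9
  - bokeh=1.4.0
  - pandas=1.2.4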


Environment Variables

In addition to using the start file below, you can specify custom environment variables using the advanced settings.

Other

Non-standard packages can be installed (or arbitrary commands run) using a postBuild script. The start script can be used to run arbitrary code before the user session starts.
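A minimal postBuild sketch (the command shown is illustrative; any shell commands will work):

#!/bin/bash
# postBuild: runs once while the container image is being built
set -e
# Example: enable a Jupyter extension not installable via requirements.txt
jupyter nbextension enable --py widgetsnbextension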

Important

The start file is currently not supported in RStudio environments.


File Management

The Files tab allows you to manage your tale workspace, home folder, and external data.

Tale Workspace

The Tale Workspace folder is the primary storage area (or folder) for all files associated with a tale. This folder is available in your running tale environment as workspace.

_images/tale_workspace.png

Tale workspace and menu

Workspace Operations

Common operations include:

  • Upload files or folders from your local machine

  • Create, rename, move, remove, copy, download files and folders

  • Copy or move files between tales

Selecting a folder or file will present a menu with the following options:

  • Move To: move a file or folder

  • Rename: rename a file or folder

  • Share: share a file or folder with a user or group

  • Copy: copy a file or folder

  • Download: download a file or folder

  • Remove: remove a file or folder

Home Folder

The Home folder is your personal workspace in the Whole Tale system. You can perform most common operations including uploading, creating, moving, renaming, and deleting files and folders.

Your Home folder is mounted into every running tale instance with full read-write permissions. This means that you can access and manage files from both the Whole Tale dashboard and within tale environments. This is in contrast to the Data folder described below, which is limited to read-only access.

External Data

The External Data folder contains references to data you’ve registered in the system for use with the tale. This data is meant to be read only and can only be added from external sources. With this folder you can:

  • Register data from external sources (e.g., via DOI or URL)

  • Select and add registered data to a Tale

Supported Data Repositories

The currently supported repositories from which you can register data are:

  1. Zenodo: A general purpose research data repository.

  2. Dataverse: An open source research data repository platform with 35 installations worldwide including the flagship Harvard Dataverse.

  3. Globus: A service geared towards researchers and computing managers that allows custom APIs, data publication, and file transfer.

  4. DataONE: A federation of data repositories with a focus on scientific data. A list of participating member nodes can be found on the member node information page.

  5. DERIVA: An asset management platform with a number of deployments. Support for DERIVA in Whole Tale is being tested using data from the Pancreatic β-cell Consortium.

Adding Data

Files and folders cannot be uploaded to the External Data folder directly. To encourage reproducibility, only data registered from external resources and associated with a tale will be available in the External Data folder.

To register data from an external resource, use the data registration dialog, shown below.

_images/data_registration_modal.png

The data registration dialog allows you to search by DOI and ingest data into Whole Tale.

To access this dialog, navigate to the External Data folder by clicking the link icon below the home directory folder.

_images/data_upload.png

A user’s External Data folder, populated with data that was registered from external sources.

The blue plus icon will open the registration dialog where you can find and register your data. You’ll need to have either the DOI or data package URL to find the data.

Adding Data From DataONE

Data packages from DataONE can be integrated into Whole Tale by searching for the DOI of the package or by pasting the URL into the search box in the registration modal.

By DOI

Consider the following package. Visiting the package landing page we can see that the DOI is “doi:10.18739/A29G5GD0V”. To register this data package using the DOI, open the registration dialog and paste the DOI into the search box. Click “search” and check that the correct package was found. Click “Register” to begin data registration.

_images/dataone_doi.png

A dataset that was found by searching for the DOI.

By URL

The URL of the data package can also be used to locate the package instead of the DOI. In the previous example, pasting “https://search.dataone.org/#view/doi:10.18739/A29G5GD0V” into the search box will give the same data package which can subsequently be registered.

_images/dataone_url.png

A dataset that was found by searching with the package’s DataONE URL.

Adding Data From Dataverse

Whole Tale allows you to register data from all 35 public Dataverse installations. Support for additional installations can be added per user request. As with DataONE, data can be registered by providing either a DOI or a direct URL in the search box of the registration modal.

By DOI

DOIs may be specified for either datasets or individual files.
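For example, the dataset-level DOI doi:10.7910/DVN/29911 (the same dataset used in the Analyze in Whole Tale example later in this guide) can be pasted directly into the search box; file-level DOIs are handled the same way.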

By URL

URLs may be specified for either datasets or individual files using the web or Data Access API formats.
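For example, a web-format URL for the dataset above would look something like https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29911 (illustrative; the host depends on the Dataverse installation).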

Adding Data From Globus

Data can also be retrieved from Globus by specifying the DOI of the package, as done in the DataONE case.

Note

Only the Materials Data Facility is currently supported.

By DOI

The DOI of the dataset can be found on the dataset landing page. For example, the Twin-mediated Crystal Growth: an Enigma Resolved package has DOI 10.18126/M2301J. This DOI should be used in the data registration dialog when searching for the dataset.

Adding Data From DERIVA

Data from a DERIVA deployment can be added by browsing to a dataset in the DERIVA user interface and clicking on the Export button in the upper right corner of the screen:

_images/deriva-export-button.png

Clicking the export button triggers a drop-down menu, where an option to export to Whole Tale can be selected:

_images/deriva-export-menu.png

Once the export is initiated, the DERIVA backend will package the dataset and redirect to Whole Tale, where the dataset can be imported.

Adding Data From The Filesystem

Files and folders cannot be uploaded to the Data folder directly. To encourage reproducibility, only data registered from external resources or associated with a tale will be available in the Data folder. Data can, however, be uploaded to the Home directory.

Sharing Tales

Tales can be shared with other Whole Tale users, enabling active collaboration between researchers as well as peer review. The Shared with Me tab on the main dashboard page lists all of the Tales that other users are actively sharing with you.

To share a Tale with other users, first open the Tale and navigate to the Share tab, shown below.

_images/share_page.png

The Tale sharing area.

To add a user as a collaborator, use the Collaborators section to search for their username. Once selected, the appropriate permissions can be set.

_images/collaborators.png

Sharing a Tale with a user.

Tales can also be unshared with users. To remove a user, navigate to the Collaborators area. Find the user that the Tale is being shared with and click the ‘x’ to remove them.

_images/collaborator_complete.png

Read Permissions

The default access level for a shared Tale is Can Read. This means that the user you’ve shared the Tale with won’t be able to modify the Tale’s metadata or files; they will instead be able to view the metadata fields and the included files.

When a user runs a Tale that was shared with them, a copy is created that the user can write to. Although unsharing the Tale will remove it from the user’s dashboard, it will not remove their personal copy if one was made.

Edit Permissions

When a Tale is shared for the purpose of active collaboration, the permissions should be set to Can Edit. This gives the user the ability to make changes to the Tale’s metadata and files. When a shared Tale with edit permissions is run, a copy isn’t created as in the read-only case. Instead, the Tale is started and the user is free to make modifications, which are reflected in the original Tale. Users should be cautious not to work on a shared Tale simultaneously, to avoid conflicts with open files. To see recent changes, re-open the file; if you save a file without re-opening it first, you will overwrite any changes made by others.

Versioning Tales

Whole Tale allows users to create versions of their tales. A version includes the contents of the tale workspace, externally registered data, and metadata. Versions can be renamed, deleted, exported and published. Recorded runs are created based on versions.

Versioning Model

Whole Tale uses a simple filesystem-based versioning model. Any versions you create are accessible via the Whole Tale dashboard as well as in the ../versions folder in your running interactive environment.

Creating Versions

You can create versions of your tale using the Save Tale Version option on the tale action menu or the Tale History panel. Select the Tale History icon (tale_history) to open or close the panel:

To create a new version, select Save Tale Version:

_images/tale_history.png

Version history

Select Files > Saved Versions to manage your versions. From this menu you can rename, remove, download, restore, or export your tale version.

_images/version_menu.png

Version menu

Note

The Download folder option simply downloads any folder as a zip file. Use the Export Version option to download a complete version of your tale including metadata, external data, recorded runs, etc.

Created versions are accessible from inside your running interactive environment in the ../versions directory:

_images/version_container.png

Versions in the running container
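For example, from a terminal inside the running environment you can list saved versions directly (the version name shown is illustrative):

ls ../versions
# one folder per saved version, e.g. "My First Version"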

Version Actions

Use the version action menu to perform operations on versions.

Version actions

  • Rename: Rename the selected version.

  • Remove: Remove the selected version.

  • Download Folder: Download a zip file of any folder.

  • View Info: View tale metadata for the selected version.

  • Restore Version: Restore the tale and workspace to the selected version.

  • Export Version: Export the selected version as a BagIt archive.

  • As New Tale: Create a new tale based on the selected version.

Deleting Versions

Deleted versions are removed permanently and cannot be recovered. A version cannot be deleted if it has an associated recorded run.

Renaming Versions

When versions are renamed, they are also renamed on the filesystem.

Restoring Versions

By selecting Restore you will copy the contents of the selected version to your active workspace. This includes tale metadata and registered datasets.

Exporting and Publishing Versions

Each time you export or publish a tale, if no version exists one is created for you. Selecting Export Tale or Publish Tale from the tale action menu will export or publish the most recent version. To export a specific version, including any associated recorded runs, select the desired version from the Publish Tale dialog.

_images/publish_version.png

Publishing versions

Recorded Runs

A recorded run is a way to ensure the transparency and reproducibility of your computational workflow by capturing an exact copy of the artifacts used to generate results. A recorded run:

  • Creates a version of your tale

  • Builds or uses an existing container image

  • Executes a specified workflow via an entrypoint or master script in an isolated container

  • Captures an immutable copy of all outputs

  • Captures system runtime information including memory and CPU usage

  • Can be published

Note

Recorded runs can be used to capture multiple independent workflows for a single version of a tale. A version can have more than one recorded run.

You can execute a recorded run via the Tale Action Menu or Tale History panel. Select the Tale History icon (tale_history) to open or close the panel:

To start a recorded run, select Perform Run:

_images/recorded_run.png

Perform run button

This will prompt you to specify an entrypoint or master script for your workflow:

_images/recorded_run_dialog.png

Recorded run dialog
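The entrypoint is typically a short script at the top level of your workspace that runs your entire workflow. A minimal sketch (the file name and commands are illustrative):

#!/bin/bash
# run.sh: master script executed by the recorded run
set -e                 # stop at the first failing step
python analysis.py     # compute results
Rscript figures.R      # render figures from the results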

Created runs are accessible from the Whole Tale dashboard or from inside your running interactive environment in the ../runs directory:

_images/recorded_run_dashboard.png

Recorded run in the dashboard

_images/recorded_run_container.png

Recorded run in the interactive environment

Deleting Runs

Runs can only be removed if the associated version is first removed. Deleted runs are removed permanently and cannot be recovered.

Renaming Runs

Runs can be renamed via the dashboard and are renamed on the filesystem.

Exporting and Publishing Runs

Recorded runs are exported and published based on the associated version. This means that an exported or published tale will have a single version, but may have more than one recorded run.

Exporting and Running Locally

Exporting

Tales can be exported as a BagIt archive. This is the same format used for publishing.

To export a tale, navigate to the Run page, select the Tale Action menu, and then select Export Tale:

_images/export_tale.png

Exporting a Tale

The tale will be exported as a BagIt archive. BagIt is a standard format defined for archival storage of digital objects.

BagIt Format

Tales exported under BagIt have additional metadata and a fetch.txt file that lists where external data resides. Tales exported in this format can also be run locally.

BagIt files

  • README.md: Whole Tale readme

  • bag-info.txt: Bag metadata

  • bagit.txt: Bag declaration

  • data: Bag “payload directory”

  • data/LICENSE: Tale license

  • data/workspace: Tale workspace for the exported version

  • data/runs: Recorded runs for the exported version

  • fetch.txt: BagIt fetch file for external data

  • manifest-<algorithm>.txt: Payload manifest files for integrity checking

  • metadata/metadata.json: Tale metadata for the exported/published tale version

  • metadata/environment.json: Environment metadata for the exported/published tale version

  • run-local.sh: Script to run the tale locally

  • tagmanifest-<algorithm>.txt: Tag manifest files for integrity checking

To validate an exported bag using the bdbag package:

pip install bdbag
bdbag --resolve-fetch all .
bdbag --validate full .

Running Tales Locally

Exported Tales under the BagIt format have a run-local.sh file that can be run to re-create the tale. Before running run-local.sh, ensure that you have Docker running in the background.

When you’re ready to run the Tale, open the terminal and navigate to the top level of the bag. Run sh run-local.sh and wait for the setup to complete. If this is your first time running a tale locally, it may take some time to download the container image.
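Putting it together, a typical local run looks like this (the directory name is illustrative):

cd exported-tale              # top level of the unpacked bag
bdbag --resolve-fetch all .   # fetch any external data
sh run-local.sh               # pull the image and start the tale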

Publishing Tales

When your Tale is ready to be preserved and shared, you can publish it to external repositories. When you publish your Tale, it receives a DOI that can be cited by others.

Publishers

Zenodo

Zenodo is a general purpose data repository run by CERN. Zenodo allows objects up to 50 GB and can mint DOIs. Zenodo also offers the ability to version data submissions, which Whole Tale supports.

Note

Before publishing to production servers, it’s recommended to first publish to the Zenodo sandbox repository.

DataONE

DataONE is a network of data centers and organizations that share their information across a network of nodes, where data is replicated and described with rich metadata. Publishing your tale into the DataONE network will allow you to archive your work, collect usage statistics, and make it easy to share with other scientists.

Publishing Your Tale

To publish your Tale, select “Publish Tale” from the dropdown menu on the Run page.

_images/start_publish.gif

Publishing is accessed through the Tale Run page

Once the publishing dialog is opened, select which data repository you want to store your Tale in.

_images/run_publish.gif

Publishing to the Development node

If you haven’t connected any third-party accounts, the list will be empty. For instructions on connecting your account, visit the settings page.

Viewing Publishing Information

After you’ve published your tale, you can always view the published location under the metadata tab on the Run page.

Updating Published Tales

If a published tale needs to be published again, with the intent of overwriting the previous publication, it can be updated by re-publishing to the same repository. This feature is only available for tales that you own.

If a tale was used for related, subsequent work and shouldn’t update the previous tale, create a new tale by first copying the original. When the copied tale is published, it will be a completely new record.

Whole Tale Generated Files

manifest.json

This is a metadata document that describes the Tale, inspired by the Research Object Lite specification. The important information contained in this file includes any external data that was used, locations of data files, and author attributes.

environment.json

The environment file contains information about the Tale’s compute environment, which includes memory limits and Docker information.

LICENSE

The LICENSE file describes the Tale’s license. To change the license, navigate to the metadata editor in the Run page.

README.md

The README file gives instructions for running your Tale locally without Whole Tale.

Account Settings

The Account Settings page allows you to manage your third party integrations. From here you can check which services Whole Tale has access to and manage your API keys.

Connecting to DataONE

Connecting to DataONE requires that you have an ORCID or university account. When connecting your Whole Tale account with DataONE, you’ll be asked to log in with one of them. Whole Tale will automatically request your API token from DataONE once connected.

Whole Tale allows you to connect to the following DataONE instances.

  1. DataONE

  2. DataONE Development

Connecting to Dataverse

You can connect your Whole Tale account with Dataverse by providing your Dataverse API key. A guide on obtaining your API key can be found in the Dataverse API Guide.

Whole Tale supports the following instances of Dataverse.

  1. Harvard Dataverse

  2. Development 2

  3. Dataverse Demo

_images/connect_dataverse.png

Connecting your account to Dataverse.

Connecting to Zenodo

Zenodo can be integrated with your Whole Tale account by using your Zenodo API key. You can retrieve your token at the Zenodo token page.

Whole Tale supports the following Zenodo servers.

  1. Zenodo

  2. Zenodo Sandbox

_images/connect_zenodo.png

Connecting your account to Zenodo.

Working with Git

Whole Tale supports three different ways of working with Git repositories.

  • Using Git command-line tools from the interactive environment

  • Creating a new tale from an existing Git repository

  • Connecting an existing tale to a Git repository

In all cases, you will likely need to work with either Git command line tools or plugins in your selected interactive environment.

Note

You cannot create a tale from a private (password-protected) Git repository.

Commandline

For experienced Git users, the simplest way to connect your tale to a Github repository is via the command line or client tools in your selected interactive environment. After selecting Run Tale and accessing your interactive session, open a console or terminal:

git init
git remote add origin https://github.com/<your_org>/<your_repo>.git
git pull origin master  # or main

This will initialize your tale workspace as a Git repository and associate it with the specified remote. From here, you can synchronize any code changes with your remote repository using your preferred tools.
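For example, a typical cycle for pushing local changes back looks like this (assuming you have push access and credentials configured):

git add -A
git commit -m "Describe your changes"
git push origin master  # or main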

Create or Connect to Git Repository

For users who are unfamiliar with Git commandline tools, you can achieve the above by simply creating a new tale or connecting an existing tale to a public Git repository. You will still need to use Git features in your selected interactive environment.

Image Building and Caching

Whole Tale uses an extension to Project Jupyter’s repo2docker to build container images based on a set of configuration files found in the tale workspace. Beginning with version 1.1, container images are built only using the repo2docker configuration files and any extra files specified using the extra_build_files advanced settings option.

The image tag is created as a combination of checksums from:

  1. repo2docker configuration files

  2. tale image object configuration

  3. the Dockerfile created by repo2docker using the files above

With this, any tales that have a common environment configuration and images built using the same version of repo2docker will share a container image. This improves build performance and caching efficiency on the platform.

Container Image Registry

Note

This is a non-archival registry. We are currently working on a way to support depositing images for archival storage.

Containerization is central to the Whole Tale strategy for improving the long-term runnability and reproducibility of published tales. Prior to v1.2, the container registry was used only as an internal cache. As of v1.2, we are making images associated with tales publicly available via images.wholetale.org.

To access the image associated with your tale, inspect the run-local.sh script after export.

Tale Structure

This section describes the structure of a tale. Tales consist of the following elements:

  • Environment

  • Workspace

  • External data

  • Versions and Runs

  • Metadata

Environment

A tale environment consists of:

  • The selected interactive environment (e.g., RStudio, JupyterLab, MATLAB, STATA). Each maps to a repo2docker buildpack and a default command used to start the interactive environment.

  • repo2docker configuration files specifying software dependencies

  • The container image built from the above, stored in the container registry.

Workspace

The tale workspace is the primary folder containing your code, data, documentation – anything required to reproduce your computational workflow.

External data

A read-only folder containing externally referenced data that has been registered with Whole Tale.

Versions and Runs

The Saved Versions (versions) and Recorded Runs (runs) folders provide read-only access to versions and recorded runs.

Metadata

Tale metadata contains additional information about your tale including title, authors, description, keywords, license, and citations/references.

Integrating with Whole Tale

The Whole Tale platform provides several integration options for remote repositories including:

  • Data registration: allows users to import and reference data from your repository

  • “Analyze in Whole Tale”: allows users to launch interactive analysis environments using data or Tales published in your repository

  • Tale publishing: allows users to publish Tale packages to your repository

Data registration

Whole Tale enables users to import and work with datasets from a variety of remote resources. After registering data, users can launch interactive analysis environments and package new Tales that reference the data stored on remote systems.

As of release v0.8, supported data providers include:

  • HTTP: any data accessible via HTTP protocol

  • DataONE: any dataset available through the DataONE network

  • Dataverse: any dataset available through the Dataverse network

  • Globus: data available through the Materials Data Facility (MDF)

  • DERIVA: currently in testing/development with data from the Pancreatic β-cell Consortium

New data providers can be added by extending the Girder Whole Tale plugin.

Analyze in Whole Tale

The “Analyze in Whole Tale” feature enables one-stop data registration and Tale creation from remote repositories. Remote systems simply construct a URL pointing to the Whole Tale integration/ endpoint, providing the URI of a dataset, an optional Tale name, and an environment. Note that this requires that the provided URI is supported by one of the data registration providers above.

For example, the following URL will open the Browse page with the Tale name, data, and environment pre-populated: https://girder.dev.wholetale.org/api/v1/integration/dataone?uri=doi:10.7910/DVN/29911&name=MyTale&environment=rstudio

_images/compose.png

Pre-populated New Tale Modal

After selecting Launch New Tale, the user will be taken to an RStudio environment with the selected dataset mounted under /data.

Bookmarklet for Analyze in Whole Tale

You can enable Analyze in Whole Tale for virtually any web resource by using a bookmarklet.

How it works

  1. Install the AinWT bookmarklet in your browser’s bookmark toolbar.

  2. When you come across a dataset from a provider that Whole Tale supports, click the AinWT bookmarklet in your bookmark toolbar.

  3. You will be redirected to the Whole Tale’s dashboard where a modal will prompt you to create a Tale which will include the selected dataset.

How to Install

Firefox

Right-click on the following link: AinWT for Firefox, then select the “Bookmark This Link” option.

Chrome

Drag this link to the bookmarks toolbar: AinWT for Chrome.

Safari

Drag this link to the bookmarks toolbar: AinWT for Safari.

iPhone and iPad

On an iPad, iPhone, or iPod Touch, copy this line of text:

javascript:void(window.location='https://dashboard.wholetale.org/mine?name=My%20Tale&asTale=true&uri='+encodeURIComponent(location.href))

Bookmark this page or any page, then tap the Bookmarks button to edit the new bookmark, paste the text you just copied, then tap “Bookmarks” and then “Done”.

Dataverse External Tools

Whole Tale provides specific integration with Dataverse via the External Tools feature.

The following External Tools manifest can be used to enable Whole Tale integration on your Dataverse installation:

{
  "displayName": "Whole Tale",
  "description": "Analyze in Whole Tale",
  "scope": "dataset",
  "type": "explore",
  "toolUrl": "https://data.wholetale.org/api/v1/integration/dataverse",
  "toolParameters": {
    "queryParameters": [
      {
        "datasetPid": "{datasetPid}"
      },
      {
        "siteUrl": "{siteUrl}"
      },
      {
        "key": "{apiToken}"
      }
    ]
  }
}

Download the manifest

To install, simply POST the manifest to your Dataverse instance. For example:

curl -X POST -H 'Content-type: application/json' --upload-file wholetale.json  \
   http://localhost:8080/api/admin/externalTools

Administrator’s Guide

Documentation about the installation, configuration, and ongoing operation of WT systems

Installation

The Whole Tale deployment and installation process is documented in the terraform_deployment repository.

Monitoring

Whole Tale system monitoring is implemented using the Open Monitoring Distribution (OMD). Monitoring of development and production systems is supported by NCSA’s ISDA group. The OMD instance can be accessed using NCSA credentials:

https://gonzo-nagios.ncsa.illinois.edu/WT

Adding Users

The WT OMD tenant is configured to use NCSA LDAP for authentication. Users in the grp_wtops LDAP group have access to OMD. Users outside of NCSA can be added via the “Users” menu.

Hosts and Host Groups

Hosts are organized into Host Groups based on deployment. We currently have host groups for the production, development, and staging deployments.

Notifications

Notifications are currently configured to be sent in bulk based on deployment. All notifications within a 5 minute period will be sent in a single bulk email.

Checkmk Agent

WT uses a custom Checkmk agent. Docker image and definition are available from:

The agent implements four custom checks:

  • check-celery: confirms celery_worker is running on all nodes

  • check-nodes: confirms nodes are in ready state

  • check-services: confirms that expected docker services are running

  • check-tale: confirms a tale can be launched and stopped on the system

Installation

The monitoring stack is installed as part of the Terraform deployment process. The Checkmk agent is deployed as a global docker service.

Slack Integration

Checkmk notifications are also sent to the Whole Tale #alerts channel. Configuring Slack integration required the following steps:

On Slack:

  • Create a new app (https://api.slack.com/apps) with the name “Checkmk”

  • Activate incoming webhooks

  • Copy the service ID from the webhook URL, which should have the form TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX

On OMD:

  • Go to the Notifications page

  • Select “New Rule”

  • Under “Notification Method” select “CMK-Slack Websocket integration”

  • Enter the above service ID as the first parameter and the channel (without #) as the second

  • Save and activate

To test (from https://checkmk.com/cms_notifications.html#Testing%20and%20Debugging):

  • Go to the “All Services” page

  • Select the “HTTP Dashboard” service

  • Select the “Commands” (hammer) button

  • Next to “Custom notification” select “Send”

  • Confirm

This should send a test via both email and Slack.

Development Plan

Release Notes

v1.2

New Features
Bugfixes

v1.1

New Features
Bugfixes

v0.9

Features:

  • Support for storing and using third party API keys from Zenodo, Dataverse, and DataONE

  • Support for registering data from Zenodo

  • Added support for publishing and importing Tales to and from Zenodo

v0.8

Features:

  • A re-designed main page for the dashboard

  • A new, unified, notification system

  • Support for Dataverse hierarchy

  • Added ability to change compute environments

v0.6

Features:

  • Restructured Dashboard “Run” view

  • Tale workspace support

  • Ability to add/remove data to a running Tale (note: removed Data panel from Run and Compose views)

  • Change to registered data model (note: now limits operations on external datasets)

  • Analyze in WT support for DataONE

Bugfixes:

  • Handle failures of Dataverse installation list

  • Fixed issue when registering data from Globus (MDF)

  • Detection/correction of internal-state desync (“blue screen”)

  • Fix for running git clone in home

v0.5

This release includes the following features. Note that with this release we’re adopting detailed release notes:

Refactor of data registration framework:

Dataverse integration:

Minor changes/bug fixes:

Deployment:

v0.4

This release includes the following features:

v0.3

This release includes the following features:

  • Automated deployment for development instances of WT

  • HTTPS for frontends/Wildcard certificate support

  • Migration process from GridFS to WebDav

v0.2

This release includes the following features:

  • Home directories (WebDav)

  • Backup of database and home directories

  • Container repository of frontends

  • Interface for creating new frontends

v0.1

This initial release includes the following features:

  • User dashboard

  • Ability to create and run tales

  • Globus and ORCID authentication

  • Globus, HTTP and DataONE ingestion

  • Jupyter and RStudio frontends

  • POSIX filesystem for remote data

  • Scalable infrastructure as code

Release schedule

Future releases

  • Recorded run/validation framework

  • CLI for WholeTale

  • Kubernetes support

  • Support for verification/review workflows

  • Preservation of images

v1.0 (5/2021)

  • Create, export, and publish tale versions

  • Share tales with other users

  • Secured routes to instances

  • UI refactor

  • Support for MATLAB and Stata environments

  • Import and connect to Git repositories

v0.9 (4/2020)

  • Addressing user feedback from previous releases

  • Brown Dog integration (1.6.2)

  • Native support for WT in Jupyter and RStudio (1.2.1)

  • Tracking and storing Jupyter provenance to DataONE (1.4.6)

  • Indexing, remixing of the frontends (1.2.3)

  • OAI-ORE filesystem (1.3.4)

  • Tale validation framework

v0.8 (8/2019)

  • Addressing user feedback from MVP

  • Bug fixes

v0.7 (5/19)

  • Document, store, and publish Tales in DataONE

  • Export Tales to ZIP and BagIt

  • Ability to customize the Tale license

  • Tales now keep a record of citations for external datasets that were used

  • Ability to add multiple authors to a Tale

  • Ability to run Tales locally

  • Misc UI improvements

  • Environment customization

v0.6 (3/19)

  • Refactored UI based on usability testing

  • Full Globus Integration (1.1.2)

  • Tale Workspace support

  • User namespacing system (1.4.3) (overview)

  • Ability to dynamically add/remove data from running Tales

  • Analyze in WT support for DataONE

v0.5 (12/18)

  • One-click data-to-Tale creation

  • Dataverse data registration

  • Dataverse external tool support

v0.4 (7/18, MVP)

  • Redesigned UI

  • User Documentation

  • Overall stability improvements

  • Automated testing

v0.3 (6/18)

  • Automated deployment for development instances of WT

v0.2 (5/18)

  • Home directories (WebDav)

  • Backup of database and home directories

  • Container repository of frontends

Architecture

Overview

Whole Tale provides a scalable, web-based, multi-user platform for the creation, publication, and execution of tales – executable research objects that capture the data, code, and complete software environment required for reproducibility. It is designed to enable researchers to publish their code and data, along with required software dependencies, to long-term research archives, simplifying the process of creating and verifying computational artifacts.

The Whole Tale platform includes the following primary components:

  • Identity and access management

  • Dashboard

  • Whole Tale API

  • Whole Tale Filesystem

  • Image registry

  • Provider API

  • User environments

The following diagram illustrates the logical relationship between key system components:

_images/logical_overview.png

The Whole Tale platform leverages and extends a variety of standard components and services, including the OpenStack cloud platform (via Jetstream2), the Docker Swarm container orchestration platform, Celery/Redis for distributed task management, MongoDB for data management, the Traefik reverse proxy, Open Monitoring Distribution for monitoring, and interactive analysis environments such as RStudio and Jupyter. Whole Tale also leverages and extends the Girder REST API framework.

_images/architecture_overview.png

Identity and Access Management

Identity and access management are implemented via OAuth 2.0/OpenID Connect. Via the Girder OAuth plugin, the platform can be configured to use common OAuth providers including Google, Github, Bitbucket, and Globus. The production WT service leverages Globus Auth for federated login because it provides support for:

  • InCommon IdPs via CILogon

  • XSEDE/Argonne, ORCID and other research-centric systems

  • Tokens that can be used to initiate Globus transfers

The publishing framework uses ORCID for authentication into the DataONE network.

Dashboard

The dashboard is the primary interface into the Whole Tale system for users to interactively launch, create, and share Tales. It is the reference interface for the Whole Tale API, built using the Angular open-source web framework.

_images/dashboard.png

Whole Tale API

The Whole Tale API extends the Girder framework, adding Whole Tale capabilities including:

  • Images, Tales, Instances, Versions, and Runs

  • Distributed home and Tale workspace folders

  • Importing data from remote repositories

  • Publishing Tales to remote repositories

  • Remote data access and caching

Via Celery/Redis, the Whole Tale API provides a scalable framework for:

  • Building and managing Tale images

  • Launching Tale instances (e.g., RStudio, Jupyter environments)

  • Ingesting data from external sources

  • Executing recorded runs

The following diagram provides an overview of key components of the Whole Tale API:

_images/tale_instance_model.png

Each user has a home folder that is accessible via the Whole Tale filesystem to every running Tale instance (and also exposed via WebDav to be optionally mounted on their local system). Every Tale is defined by its environment (e.g., RStudio/Jupyter); a workspace folder containing code, data, and narrative; and an optional set of externally-referenced datasets. When a user runs/launches a Tale, they get a Tale instance – a running Docker container based on the defined environment with the Tale workspace, external data, and home directory mounted and accessible.
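As a concrete illustration, the sketch below launches a Tale instance through the REST API using the girder_client Python library. The endpoint follows the Instance model described above, but the API key, Tale ID, and exact parameters should be treated as placeholder assumptions rather than a definitive recipe.

import girder_client

# Connect to the Whole Tale API (a Girder-based REST API).
gc = girder_client.GirderClient(apiUrl='https://girder.wholetale.org/api/v1')
gc.authenticate(apiKey='<your-api-key>')  # placeholder credential

# POST /instance asks the platform to start a container for the given Tale;
# the returned record includes the URL of the running environment.
instance = gc.post('instance', parameters={'taleId': '<tale-id>'})
print(instance.get('url'))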

Girder

Girder is an open source web-based data management platform intended for developing new web services with a focus on data organization, user management, authentication and authorization. It has been adopted by several related projects including yt.Hub, the NSF-funded Renaissance Simulations Laboratory, Crops in silico, and the Einstein Toolkit DataVault.

Whole Tale leverages Girder for the following features:

  • OAuth flow for user authentication

  • User and group management including advanced access control models

  • Metadata management including file, folder, and collection abstractions

  • Job management framework including notifications

  • API key and token management

  • Lightweight and high-performance interface to MongoDB

Environment Customization

As of release v0.6, environment customization is implemented via the Recipe model:

_images/tale_image_model.png

A Tale image is defined by a “recipe”, which refers to a Github repository and commit ID that conforms to the Whole Tale image definition requirements. Future releases will include integration with Project Jupyter’s repo2docker framework.
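For illustration, a Recipe object stored in the database might look like the following. The field names are assumptions based on the description above (a GitHub URL plus a commit ID), not the exact schema:

{
  "name": "jupyter-base",
  "url": "https://github.com/whole-tale/jupyter-base",
  "commitId": "<git-commit-sha>",
  "description": "Base Jupyter environment conforming to the WT image definition"
}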

Scalable task distribution (gwvolman)

The Whole Tale API implements a generic and scalable task distribution framework via the popular Celery system. The gwvolman package implements tasks including the following (see the sketch after this list):

  • Building and pushing images

  • Managing services (Swarm) including start/stop/update

  • Managing container volumes (mount/unmount)

  • Ingesting data from external providers

  • Publishing Tales to external providers (v0.7)
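The sketch below shows the general shape of such a task as a Celery declaration. The task name, arguments, and body are illustrative only and are not the actual gwvolman API:

from celery import Celery

# Broker/backend URLs are placeholders; Whole Tale uses Redis for both.
app = Celery('gwvolman_sketch',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task(bind=True)
def build_tale_image(self, tale_id, registry_url):
    """Hypothetical task: build a Tale's Docker image and push it."""
    # A real implementation would fetch the Tale's environment definition,
    # run a docker build, and push the result to registry_url.
    self.update_state(state='PROGRESS', meta={'step': 'building'})
    return {'taleId': tale_id, 'registry': registry_url}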

Whole Tale Filesystem

The Whole Tale filesystem provides distributed access to system data via a POSIX interface. This includes enabling access to home and Tale workspace data and managing access to and caching of externally registered data.

_images/filesystem_overview.png

Distributed folder access (wt_home_dir)

The Whole Tale platform includes an integrated WebDAV server (via WsgiDav) to enable distributed access to home and Tale workspace folders. The WebDAV server is integrated with Girder for authentication and to synchronize filesystem metadata. This means that changes made via WebDAV or Girder (e.g., the WT Dashboard) are always reflected in the exposed filesystem.

_images/webdav_overview.png
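For a feel of the underlying server, the following is a minimal, self-contained WsgiDAV setup serving a local folder over WebDAV (assuming WsgiDAV 3.x and the cheroot WSGI server). Whole Tale's actual deployment adds Girder-backed authentication and providers, which are omitted here:

from cheroot import wsgi
from wsgidav.fs_dav_provider import FilesystemProvider
from wsgidav.wsgidav_app import WsgiDAVApp

config = {
    "host": "0.0.0.0",
    "port": 8080,
    # Serve a local directory; WT instead maps Girder-backed providers here.
    "provider_mapping": {"/": FilesystemProvider("/tmp/home")},
    # Anonymous access for demonstration purposes only.
    "simple_dc": {"user_mapping": {"*": True}},
    "verbose": 1,
}

app = WsgiDAVApp(config)
server = wsgi.Server((config["host"], config["port"]), app)
server.start()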

Data Management Service (girder_wt_data_manager)

The Whole Tale Data Management system is responsible for managing the data used in Tales. The main components include:

  • Transfer subsystem that manages the movement of data from external data providers to local storage in Whole Tale. This is achieved through provider-specific transfer adapters.

  • Storage management system that acts as a local data cache that selectively caches or clears local copies of externally hosted data based on frequency of use.

  • Filesystem interface that allows tales to access cached data through a standard POSIX interface.

_images/data_manager_overview.png

Python client (girderfs)

Whole Tale provides girderfs, a Python client/library to mount Whole Tale filesystem volumes. This is an intermediate layer that represents data in Whole Tale as a POSIX filesystem and interfaces with the Data Management system. It is based on fusepy, a thin Python wrapper for FUSE development; a toy fusepy example follows the mount type list below.

This component supports the following mount types:

  • remote: mount Girder folders via REST API

  • direct: mount local Girder assetstore

  • wt_dms: mount via Whole Tale DMS

  • wt_work: mount Tale workspace via davfs

  • wt_home: mount user home directory via davfs
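To illustrate the FUSE layer that girderfs builds on, here is a toy read-only filesystem written against fusepy that serves a single in-memory file. girderfs implements the same Operations interface but backs each call with Girder or DMS requests; this example is purely illustrative:

import errno
import stat

from fuse import FUSE, FuseOSError, Operations

DATA = b"hello from a FUSE mount\n"

class HelloFS(Operations):
    """A single-file, read-only filesystem."""

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        if path == "/hello.txt":
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": len(DATA)}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello.txt"]

    def read(self, path, size, offset, fh):
        return DATA[offset:offset + size]

if __name__ == "__main__":
    # The mount point is a placeholder; run against an empty directory.
    FUSE(HelloFS(), "/tmp/mnt", foreground=True, ro=True)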

Provider Framework

The Whole Tale provider framework is designed to enable easy extension to support new providers for data registration, “Analyze in WT” capabilities, and publishing.

The framework consists of the following interfaces (a sketch follows the list):

  • ImportProvider: Search, register, and access data from external repositories

  • Integration: Translate requests for Analyze in Whole Tale

  • PublishProvider: Publish Tales to external repositories

  • TransferHandler: Protocol handlers for transferring data (e.g., HTTP, Globus)
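As a sketch of how a new repository might plug in, the class below outlines a hypothetical ImportProvider. The class name follows the interface list above, but the method names and signatures are illustrative; the actual base class lives in the girder_wholetale plugin:

class MyRepoImportProvider:
    """Hypothetical ImportProvider for a repository at myrepo.example.org."""

    name = 'MyRepo'

    def matches(self, url):
        # Claim URLs that point at this repository.
        return url.startswith('https://myrepo.example.org/')

    def lookup(self, url):
        # Resolve a landing page into dataset metadata (name, size, DOI).
        # Registration captures only metadata; bytes are fetched lazily.
        raise NotImplementedError

    def listFiles(self, url):
        # Enumerate files so the filesystem layer can expose them via POSIX.
        raise NotImplementedError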

Remote data registration and access

Combined with the Whole Tale filesystem and data management system, the provider model provides an abstraction over heterogeneous data sources (APIs), exposing a consistent interface to both the Whole Tale dashboard and running Tale instances. Datasets from DataONE, Dataverse, and Globus are exposed to running Jupyter and RStudio containers as elements of a POSIX filesystem. The registration process captures only the metadata of the remote dataset, and the data management service retrieves the actual bits only when used. This means that only those portions of the remote dataset that are actually used are transferred and cached in Whole Tale.

_images/registration_overview.png

User Environments

A fundamental design principle of the Whole Tale system is that users must be able to conduct and publish their analysis using their software environment of choice. Common environments such as RStudio and Jupyter should be provided by the system. Users must be able to customize these environments by selecting specific software versions. They must also be able to define and share new environments that may not be part of the base system.

In v0.6, the base environments are defined by the Recipe and Image models. Recipes refer to specific Github repositories and commit hashes. Images are the built Docker images stored in the Whole Tale image registry.

As of v0.7, we have adopted the Binder repo2docker model where users can easily customize software in the environment.


About

What is Whole Tale?

Whole Tale is an NSF-funded Data Infrastructure Building Block (DIBBS) initiative to build a scalable, open source, web-based, multi-user platform for reproducible research enabling the creation, publication, and execution of tales – executable research objects that capture data, code, and the complete software environment used to produce research findings.

A beta version of the system is available at https://dashboard.wholetale.org.

The Whole Tale platform has been designed based on community input primarily through working groups and collaborations with researchers.

The Whole Tale project is involved in several initiatives to train researchers for reproducibility as well as use of Whole Tale in the classroom.

A Platform for Reproducible Research

The goal of Whole Tale is to enable researchers to define and create computational environments to (easily) manage the complete conduct of computational experiments and expose them for analysis and reproducibility. The platform addresses two trends:

  • improved transparency so people can run much more ambitious computational experiments

  • better computational experiment infrastructure to allow researchers to be more transparent

Why Whole Tale?

The Whole Tale platform is being developed to simplify the adoption of practices that improve the understandability and reproducibility of computational research.

Technological Sources of Impact

Virtually all published discoveries today have data and computational components. The mismatch between traditional scientific dissemination practices and modern computational research practice leads to reproducibility concerns.

Computational Reproducibility

The Whole Tale platform supports computational reproducibility by enabling researchers to create and package the code, data, and information about the workflow and computational environment needed to review and reproduce the results of computational analyses reported in published research. Whole Tale implements this definition by supporting explicit citation of externally referenced data and by capturing the artifacts and provenance information needed to facilitate understanding, transparency, and execution of the computational processes and workflows used for review and reproducibility at the time of publication.

Who is Whole Tale for?

Researchers

Researchers are increasingly adopting practices to improve the transparency and reproducibility of their own computational research. Some are self-motivated to improve their own rigor and transparency while others are responding to the demands and requirements of academic communities and journals. Some are advanced tool users with sophisticated methods of packaging and distributing scientific software, often with automated testing and verification. Others are more concerned with the research product than learning new tools and infrastructure for sharing and transparency.

Societies/Communities

Academic societies, associations and communities are responding to challenges in the reproducibility of published research by adopting recommendations, guidelines, and policies that impact publishers, editors, and researchers. Communities are beginning to adopt practices encouraging or requiring sharing of code and data. Some are even implementing verification and evaluation processes to confirm the reproducibility of published work.

Editors

In response to the demand of academic communities to address problems of reproducibility and reuse, journal editors are increasingly adopting guidelines and enforcing policies for the sharing of data, code and information about the software environment used in published research based on computational analysis.

Curators and Reviewers

The scholarly publication process has built-in mechanisms for anonymous peer review. Some communities are adopting replication practices to ensure that published research can be replicated at various levels. Anonymous reviewers and curators of research artifacts play an important role in the quality of research artifacts.

Repository Developers and Operators

Developers and operators of research data repositories are faced with the challenge of addressing the needs of their communities through support for new types of scholarly objects, methods of access, and processes for review and verification.

What is a Tale?

A tale is an executable research object that combines data (references), code (computational methods), computational environment, and narrative (traditional science story). Tales are captured in a standards-based format complete with metadata.

_images/tale_diagram.png

Whole Tale is an ongoing NSF-funded Data Infrastructure Building Blocks (DIBBS) project, initiated in 2016 with expected completion in February 2023.

Our Approach

Open Source Platform

The Whole Tale project is developing an open source platform and welcomes both re-use and contribution from the research infrastructure community. As a building block, the platform is intended to be deployable in a variety of environments, including the primary service at https://dashboard.wholetale.org.

Open Source Curriculum

The Whole Tale project is dedicated to the improvement of education and training for reproducible research practices through the creation of open lessons both for classroom instruction and through contributions to programs such as the Carpentries.

Open Infrastructure

The Whole Tale project aspires to contribute to and increase the impact of existing open infrastructure projects. The success of Whole Tale depends on our ability to connect users to the tools and information they need to create reproducible research packages. The Whole Tale platform leverages components from projects including Girder, The Rocker Project, and Project Jupyter.

Lowering Barriers

The Whole Tale project draws on the experience and expertise of researchers and communities that are working in the forefront of reproducible research practices and applies these lessons to the development of a platform that is intended to simplify and broaden the adoption of such practices.

Community Engagement

Through our working groups and workshops, we engage with a broad community of researchers, educators, and infrastructure developers to inform Whole Tale project direction and platform design.

Usability

As we develop a web-based platform, we recognize the importance of usability and user experience and will continue to conduct regular usability tests to improve the system.

Domain Case Studies

This section documents detailed case studies of domains that inform the Whole Tale platform design. When looking at a particular discipline or sub-discipline, we explore the following questions:

  • What is the state of computational reproducibility?

  • What do associations/societies and top journals say about data/code sharing?

  • Are related initiatives driven by motivated researchers or editors?

  • Have they implemented or are they considering badging or verification?

  • Why is the field different from others we’ve studied?

  • Are there examples of existing “tales” (e.g., research compendia)?

  • Are there relationships with other open science infrastructure projects?

Archaeology

Introduction

According to Marwick (2017a), the field of archaeology has had a long-term commitment to empirical tests of reproducibility by returning to excavation and survey sites, but has only recently started to make progress in testing reproducibility of statistical and computational results. As in many other fields, data sharing has had increased attention over the past decade and sharing code and analytical materials only over the last few years. As noted by Marwick (2018), the Journal of Archaeological Science adopted a “data disclosure” policy in 2013 and author guidelines were updated only in 2018 to encourage sharing of “software, code, models, algorithms, protocols, methods and other useful materials related to the project.”

Open access to archaeological data is sometimes problematic due to cultural sensitivities or issues of ownership (copyright or international stakeholders) and impact of exposure (e.g., risks of looting). Data publishing is also limited due to costs (key discipline repositories are fee-based) and researcher motivations. Community norms do not encourage/reward data and code publishing, and no journals require archaeologists to make code and data available by default. Discipline-specific repositories include the Archaeological Data Service, the Digital Archaeological Record (tDAR), and Open Context.

Marwick (2017a) outlines a set of basic principles to improve computational reproducibility in archaeological research. These are similar to guidelines provided in other fields:

  • Make data and code openly available

  • Publish only the data used in the analysis

  • Use a programming language to write scripts for analysis and visualization

  • Use version control

  • Document and share the computational environment

  • Archive in online repositories that issue persistent identifiers

  • Use open licenses

Marwick and the archaeology community have adopted the concept of a “research compendium” to refer to data/code packages. This concept originated with Gentleman and Temple Lang (2004): “We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data,…), and as a means for distributing, managing and updating the collection.”

Marwick (2017a) describes a specific case study to illustrate his principles for reproducibility and demonstrate the research compendium concept:

Marwick (2017a) suggests that R is widely used by archaeologists “who share code with their publications” in part because it is free, widely used in academic research including statistics, and includes support for experimental packages. He selected Git because commits can be used to indicate the exact version of code used during submissions (note: he started with a private repository that was made public after publication). He selected Docker because of convenience, building his image on the existing rOpenSci image.

He found that the primary issue is the time required to learn the various tools and recommends incentivizing training in and practice of reproducible research. He also recommends changing editorial standards of journals by requiring submission of “research compendia”.

Education/training/outreach

The Archaeology community has organized “reproducible research” Carpentry workshops and contributed to the open source curriculum. For example:

Example research compendia

Climate change stimulated agricultural innovation and exchange across Asia

This research compendium has been published by d’Alpoim Guedes and Bocinsky via a combination of Github and Zenodo. The paper, compendium, and data are each published as separate citable artifacts. The data package includes all raw (downloaded) and derived data generated by the analysis (~3GB). The code is packaged as an R package. The environment is provided via a Dockerfile that adds packages on top of the rocker/geospatial:3.5.1 image. The image has been pushed to Dockerhub and is therefore immediately re-runnable.

The authors provide multiple methods of re-running the analysis: by cloning and running the Github repository locally, via the published Docker image, or by building and running the Docker image locally. The primary entrypoint is a single R-Markdown script.

Data is downloaded from multiple sources during execution.

  • The R FedData package is used to dynamically download data published from the NOAA Global Historical Climatology Network based on spatial constraints

  • Instrument data published via an NOAA FTP server (URL)

  • An Excel spreadsheet published as supplemental data via Science (URL)

  • They also use data from The Digital Archaeological Record (tDAR) (requires authentication)

  • Elevation data via the Google Elevation API

This compendium suggests the following use cases:

  • Support for rocker-project images

  • Ability for researchers to dynamically and programmatically register immutable published datasets

  • Support for authenticated data sources

  • Ability to register data from FTP services

  • Ability to store arbitrary credential information (e.g., in Home)

  • Support for projects where Github is the active working environment

  • Support for re-using the Github README for Tale description

  • Association and display of citation information for associated materials

  • Automatic citation of source data, where possible

  • Separate licenses for code and data

Additional Examples

The following are examples of “research compendia” from Archaeology:

Marwick (2018) reports on three pilot studies exploring data sharing in archaeology. He discusses the ethics of data sharing due to work with local and indigenous communities and other stakeholders and describes archaeology as a “restricted data-sharing and data-poor field.”

References

Archaeology Data Service/Digital Antiquity 2011 Guides to Good Practice. Electronic document, http://guides.archaeologydataservice.ac.uk/

Journal of Archaeological Science 2018 Guide for Authors. Journal of Archaeological Science. Electronic document; https://www.elsevier.com/journals/journal-of-archaeological-science/0305-4403/guide-for-authors (via wayback)

Kansa, Eric C., and Kansa, Sarah W. 2013 Open Archaeology: We All Know That a 14 Is a Sheep: Data Publication and Professionalism in Archaeological Communication. Journal of Eastern Mediterranean Archaeology and Heritage Studies 1 (1):88–97

Marwick, B. J. (2017a) Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation. Archaeol Method Theory (2017) 24: 424. https://doi.org/10.1007/s10816-015-9272-9

Marwick, B. et al. (2017b) Open science in archaeology. SAA Archaeological Record, 17(4), pp. 8-14.

Marwick (2017c) Using R and Related Tools for Reproducible Research in Archaeology. In Kitzes, J., Turek, D., & Deniz, F. (Eds.) The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press. https://www.practicereproducibleresearch.org/case-studies/benmarwick.html

Marwick, B., & Birch, S. 2018 A Standard for the Scholarly Citation of Archaeological Data as an Incentive to Data Sharing. Advances in Archaeological Practice 1-19. https://doi.org/10.1017/aap.2018.3 https://doi.org/10.17605/OSF.IO/KSRUZ (code/data)

Marwick, B., Boettiger, C., & Mullen, L. (2017d). Packaging data analytical work reproducibly using R (and friends). The American Statistician https://doi.org/10.1080/00031305.2017.1375986

Nüst, Daniel, Carl Boettiger, and Ben Marwick. 2018. “How to read a research compendium.” arXiv:1806.09525

Open Digital Archaeology Textbook . https://o-date.github.io/draft/book/

Economics

Introduction

According to Levenstein (2017, 2018), the field of economics has a long history of interest in reproducibility and replicability starting in the 1980s. Early studies (e.g., Dewald, 1986) found low replication rates in published research. The field also has a long history of data sharing, with policies starting as early as 2003. By 2015, 27 journals required data sharing. Ten journals encourage replication studies. (SOURCE)

In 2018, the American Economics Association (AEA) appointed a data editor, in part to improve access to and reproducibility of published research. Economics faces additional challenges due to the use of commercial data, requiring waivers because of both IP and confidentiality concerns. While macroeconomic research tends to use public data disseminated by government agencies and central banks, microeconomics research often relies on private/confidential administrative data. Data packages used to be published as supplemental information via the journal web platform. In 2019, all historical packages were migrated to the openICPSR platform, and future packages are preserved there as well (Vilhuber, 2020). The AEA currently has over 4000 replication packages, all of which contain software/code and many of which also contain released data.

Example: American Economics Association

Kingi et al. (2018) report the results of an effort to reproduce a subset of studies published by the AEA using only the information provided by the authors during submission. The AEA is interested in performing after-the-fact verification of published code and data and is exploring the adoption of workflows similar to those used by the American Journal of Political Science (AJPS). They have over 300 examples of validated data/code packages.

A major challenge for the AEA is the widespread use of commercial statistical software. Over 70% of submitted packages require Stata or Matlab. SAS is widely used by the Census Bureau.

User Stories

Based on the above cases, we see the following user stories.

  • Ex-post validation of AEA deposits: The AEA data editor (or graduate student) should be able to perform after-the-fact validation of published data/code packages by importing them into Whole Tale.

  • Stata support: An AEA researcher should be able to publish a Tale based on the Stata environment. A reviewer or user should be able to re-run the Tale in Stata.

  • Matlab support: An AEA researcher should be able to publish a Tale based on the Matlab environment. A reviewer or user should be able to re-run the Tale in Matlab.

  • SAS support: An AEA researcher should be able to publish a Tale based on the SAS environment. A reviewer or user should be able to re-run the Tale in SAS.

  • ICPSR integration: Whole Tale should support registering data from and publishing to ICPSR.

  • Private WT instance: WT platform can be deployed locally with more restrictive access.

  • ICPSR/Dataverse: Dataverse holds “replication datasets” created from ICPSR data that don’t link to the original data at ICPSR. The articles may not even cite the data at ICPSR, so the original authors of the data don’t get any credit. The authors of the article should get credit for their code, but not for the data.

  • Multiple applications: AEA data packages often contain a mixture of code – R and Stata or R and Matlab, etc. that require the ability to run not just R or Stata, but both in the same image.

  • Ability to choose base software version: Some Tales will require newer/older versions of R

  • Metadata/classification: Published packages should support domain and journal metadata formats (i.e., JEL https://www.aeaweb.org/econlit/jelCodes.php)

References

AEA. (2019). Usage Data. https://github.com/AEADataEditor/econ-program-usage-data

William G. Dewald, Jerry G. Thursby, Richard G. Anderson. Replication in Empirical Economics: The Journal of Money, Credit and Banking Project The American Economic Review, Vol. 76, No. 4 (Sep., 1986), pp. 587-603. http://www.jstor.org/stable/1806061

Kingi, Hautahi; Vilhuber, Lars; Herbert, Sylverie; Stanchi, Flavio. 2018. The Reproducibility of Economics Research: A Case Study. https://ecommons.cornell.edu/handle/1813/60838 Preprint - https://hautahi.com/static/docs/Replication_aejae.pdf

Levenstein, Margaret (2017). Presentation to the NAS Committee on Replicability and Reproducibility in Science. http://sites.nationalacademies.org/DBASSE/BBCSS/DBASSE_185106

Levenstein, Margaret (2018). Reproducibility and Replicability in Economic Science. https://deepblue.lib.umich.edu/bitstream/handle/2027.42/143813/Reproducibility+and+Replicability+in+Economic+Science+Levenstein+NAS+presentation+February+22,+2018.pdf?sequence=

Vilhuber, Lars (2020). Migrating historical AEA supplements. https://aeadataeditor.github.io/aea-supplement-migration/programs/aea201910-migration.html (accessed 2023-01-16)

Materials Science

Introduction

Political Science

Introduction

Over the past 20 years, the political science community has increasingly pursued transparency through encouraging or requiring authors to publish “replication files” intended to make each step of the research process as explicit as possible. Beginning with the recommendations of King (1995), the research community and publishers have adopted a series of guidelines (for example DA-RT, 2015), culminating in the implementation of in-house and third-party certification workflows by top journals (Christian et al, 2018).

The American Political Science Association’s (APSA) A Guide to Professional Ethics in Political Science states that:

Top journals, including the American Journal of Political Science (AJPS) and the American Political Science Review (APSR), require authors reporting empirical and quantitative results to deposit data, software and code, and other information needed to reproduce findings (APSR, 2019). As discussed in detail below, AJPS is the only journal to implement third-party certification of replication packages.

As highlighted by Dafoe (2014), the replication standard in political science is in part motivated by a number of high-visibility controversies in the social sciences. He cites the example of an influential paper in economics that was discovered three years later to have errors, arguing that the availability of the replication file for the study would have at least accelerated the identification of potential errors.

In January 2016, 27 political science journal editors signed the “Joint Statement on Data Access and Research Transparency” (DA-RT, 2015) that includes a number of requirements related to the APSA ethics guidelines for authors centered around data citation, transparency of analytic methods (e.g., code), and improving access to data and other research materials.

Example: American Journal of Political Science

Christian et al (2018) describe an operationalization of the replication standard implemented by the American Journal of Political Science (AJPS) in collaboration with the Odum Institute for Research in the Social Sciences and the Qualitative Data Repository (QDR). AJPS is one of the top-ranked political science journals with an ISI ranking of 1/169 in 2017. In 2012, AJPS adopted guidelines for authors to deposit replication packages in Harvard’s Dataverse. According to Jacoby (2017), due to concerns about the quality of deposited materials, AJPS implemented the third-party certification process starting in 2015.

Christian et al. describe the basic workflow as follows:

  • The author submits a manuscript to AJPS for peer review. If accepted, the author is required to submit the replication materials to the AJPS Dataverse.

  • Once the replication materials are available, the editor contacts Odum/QDR to begin the curation and verification process.

  • Data is reviewed per a data quality review framework. Statistical experts perform verification by executing the analysis code and comparing the output to tables and figures reported in the manuscript.

  • A “Verification Form” is returned to the editor including the results of the review process and any errors. The editor notifies authors to correct problems.

  • Once the data review and verification process is complete, the editor issues the acceptance notification and the materials are published in Dataverse (including DOI).

  • The paper and replication package are linked via DOI.

The authors further note that only 10% of submissions pass review without the need for revision and that, as of 2019, the process requires roughly 6 hours of effort for a single manuscript.

In his presentation to the NAS Committee on Replication and Reproducibility in the sciences, Jacoby (2017) notes that:

  • Odum archive staff handle both data curation and verification (statistical)

  • Errors are generally not serious (e.g., lack of documentation or tables that don’t reproduce exactly).

  • Mean number of resubmissions is 1.82

  • The verification process is paid for by the Midwest Political Science Association

  • AJPS requires only the data used in analysis (i.e., not all of the data collected)

  • Anecdotally, he has had feedback that the resource is invaluable for methodology courses (See also Janz 2016)

In 2018, the Odum Institute was awarded a $500,000 grant from the Sloan Foundation to improve and automate the verification process.

Jacoby (2017) notes that other political science journals have in-house verification processes, typically relying on graduate students. In these cases, it’s likely that the focus is on re-runnability of the code without necessarily comparing the reported results. In the ensuing discussion, an example was raised from the field of computer science where reproducibility reports are written by community reviewers, notably the journal Information Systems (Chirigati, 2016).

The AJPS provides a “Quantitative Data Verification” checklist for the preparation of replication files that includes:

  • README file containing the names of all files with a brief description and any other important information regarding how to replicate the findings (i.e., the order files need to be run, etc.)

  • Includes a Codebook (.pdf format) with variable definitions for all variables in the analysis dataset(s) and value labels for categorical variables

  • Includes clear information regarding the software version used to conduct analysis

  • Includes complete references for source datasets

  • Includes the analysis dataset(s) in a file format readily accessible to the social science research community (i.e., text files, delimited files, Stata files, R files, SAS files, SPSS files, etc.)

  • Includes a unique case identifier variable linking each observation in the analysis dataset to the original data source

  • Includes software command file(s) for reconstructing the analysis dataset from the original data source and/or extracting and merging multiple original source datasets, including information on source dataset(s) version and access date(s)

  • Includes commands needed to reproduce all tables, figures, and other analytical results presented in the article and supplementary materials

  • Includes commands/instructions for installing macros or packages

  • Includes comment statements used to explain the analysis steps and distinguish commands for tables, figures, and other outputs

  • Includes seed values for any commands that generate random numbers (e.g., Monte Carlo simulations, bootstrap resampling, jittering points in graphical displays, etc.)

  • Includes any additional software tools needed for replication (e.g., Stata .ado files and R packages)

Examples

Harvard’s Dataverse includes hundreds of Political Science replication packages, including those verified through the Odum/QDR workflow.

References

AJPS replication policy https://ajps.org/ajps-replication-policy/

AJPS Quantitative Data Verification Checklist. 2016. https://ajpsblogging.files.wordpress.com/2019/01/ajps-quant-data-checklist-ver-1-2.pdf

AJPS Guidelines for Preparing Replication Files, https://ajpsblogging.files.wordpress.com/2018/05/ajps_replication-guidelines-2-1.pdf

APSA Guide to Professional Ethics, Rights and Freedoms https://www.apsanet.org/portals/54/Files/Publications/APSAEthicsGuide2012.pdf

APSR. (2019). Submission Guidelines. https://www.apsanet.org/APSR-Submission-Guidelines. Accessed February 8, 2019.

Barba, Lorena A. (2018). Terminologies for Reproducible Research. https://arxiv.org/pdf/1802.03311.pdf

Christian et al. Operationalizing the Replication Standard: A Case Study of the Data Curation and Verification Workflow for Scholarly Journals https://osf.io/preprints/socarxiv/cfdba/

Core2 award https://odum.unc.edu/2018/07/alfred-p-sloan-foundation-grant/

Dafoe, 2014. Science Deserves Better.

DA-RT. (2015). Data Access and Research Transparency (DA-RT): A Joint Statement by Political Science Journal Editors. https://doi.org/10.1177/0010414015594717

Jacoby, William. 2017. Presentation to National Academy of Sciences Committee on Replication and Reproducibility in the sciences. https://vimeo.com/252434555

Janz, 2016. Bringing the Gold Standard into the Classroom: Replication in University Teaching. https://doi.org/10.1111/insp.12104

Fernando Chirigati, Rebecca Capone, Rémi Rampin, Juliana Freire, Dennis Shasha. (2016). A collaborative approach to computational reproducibility. Information Systems, Volume 59, 2016, https://doi.org/10.1016/j.is.2016.03.002.

TOP guidelines (https://cos.io/our-services/top-guidelines/)

Use Cases

The following high-level use cases are supported by Whole Tale (v0.6):

  • A user can register immutable public data from supported external resources including DataONE, Globus, Dataverse and some HTTP sources.

  • A user can create a Tale based on popular environments including RStudio and Jupyter.

  • A user can upload/create source code files in the Tale workspace that are used for analysis. Analysis code can optionally reference externally registered data.

  • A user can share their Tale (via Public setting) and run Tales shared by others.

  • A Dataverse or DataONE user can create a Tale based on a public dataset via the repository’s native user interface (Analyze in Whole Tale)

  • A user can discover public Tales in the system (via Browse) and run them

  • A user can provide metadata about their Tale including title, authors, description, and a graphic representation

The following use cases are planned for future releases:

  • A user can customize existing software environments using common package managers.

  • A user can publish a Tale to an external research repository including DataONE and Dataverse network members.

  • A curator or reviewer can use Whole Tale to verify or certify published artifacts.

  • A user can add a new base environment to Whole Tale

  • A user can share a Tale with another user for collaboration

  • A user can share a Tale with another user for anonymous review

  • A user can copy an existing Tale and change the code, environment, or externally registered data (remix).

  • A user can run licensed software including Stata and Matlab

  • A user can run a Tale on a remote resource based on available data (data locality) or specialized compute requirements.

  • A user can create a Tale based on embargoed or private/authenticated data.

  • A user can track Tale executions along with detailed provenance information.

  • A user can export a Tale and run it locally

Workshops

  • August 14-16, 2017: National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign

  • August 31-September 2, 2017: Tamaya Resort, New Mexico

  • March 7, 2019: National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Publications

Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher. (2018). Validation and Inference of Schema-Level Workflow Data-Dependency Annotations, 7th Intl. Provenance and Annotation Workshop (IPAW) https://arxiv.org/abs/1807.09899

Adam Brinckman, Kyle Chard, Niall Gaffney, Mihael Hategan, Matthew B. Jones, Kacper Kowalik, Sivakumar Kulasekaran, Bertram Ludäscher, Bryce D. Mecum, Jarek Nabrzyski, Victoria Stodden, Ian J. Taylor, Matthew J. Turk, Kandace Turner. (2017). Computing environments for reproducibility: Capturing the “Whole Tale”, Future Generation Computer Systems https://doi.org/10.1016/j.future.2017.12.029

Adam Brinckman, Kyle Chard, Niall Gaffney, Mihael Hategan, Matthew B. Jones, Kacper Kowalik, Sivakumar Kulasekaran, Bertram Ludäscher, Bryce Mecum, Jaroslaw Nabrzyski, Victoria Stodden, Ian Taylor, Matthew Turk, and Kandace Turner. (2017). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, Globus World/NDS Workshop

Kyle Chard. (2017). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, National Data Service Workshop

Kyle Chard. (2017). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, Nordic e-Infrastructure conference (NEIC)

Kyle Chard, Niall Gaffney, Matthew B. Jones, Kacper Kowalik, Bertram Ludäscher, Jarek Nabrzyski, Victoria Stodden, Ian Taylor, Thomas Thelen, Matthew J. Turk, Craig Willis (2019). Application of BagIt-Serialized Research Object Bundles for Packaging and Re-execution of Computational Analyses. In Proceedings of the IEEE 15th International Conference on e-Science (e-Science) https://doi.org/10.1109/eScience.2019.00068

Jarek Nabrzyski. (2017). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, PressQT Notre Dame

Matthew Jones. (2018). Provenance tracking and display in DataONE, Earth Data Provenance Workshop https://esipfed.github.io/Earth-Data-Provenance-Workshop/

Kacper Kowalik. (2018). Sneaking Data into Containers with the Whole Tale, SciPy’18 https://scipy2018.scipy.org/ehome/index.php?eventid=299527&tabid=712461&cid=2233540&sessionid=21615725&sessionchoice=1&

Bertram Ludäscher. (2017). Workflows, Provenance, & Reproducibility: Telling the Whole Tale behind a Paleoclimate Reconstruction, EarthCube All Hands Meeting https://www.earthcube.org/2017AHMResources

Bertram Ludäscher. (2017). From Provenance Standards and Tools to Queries and Actionable Provenance, American Geophysical Union, Fall Meeting 2017, abstract #IN42C-02 http://adsabs.harvard.edu/abs/2017AGUFMIN42C..02L

Bertram Ludäscher. (2018). Whole Tale: The Experience of Research through reproducible, computational narratives, Building a Community to Advance and Sustain Digitized Biocollections (BCoN) meeting https://www.slideshare.net/ludaesch/wholetale-the-experience-of-research-88002697

Niall Gaffney. (2018). Improving Research Outcomes, Leveraging Digital Libraries, Advanced Computing and Data, JCDL 2018

Bertram Ludäscher. (2018). From Workflows to Provenance and Reproducibility: Looking Back and Forth, 7th Intl. Provenance and Annotation Workshop (IPAW) http://provenanceweek2018.org/ipaw/

Bertram Ludäscher and Santiago Núñez-Corrales. (2018). Dissecting Reproducibility: A case study with ecological niche models in the Whole Tale environment, Hierarchy of Hypotheses Workshop (HoH3) https://www.slideshare.net/ludaesch/dissecting-reproducibility-a-case-study-with-ecological-niche-models-in-the-whole-tale-environment

Timothy M. McPhillips, Craig Willis, Michael R. Gryk, Santiago Núñez-Corrales, Bertram Ludäscher (2019): Reproducibility by Other Means: Transparent Research Objects. In Proceedings of the IEEE 15th International Conference on e-Science (e-Science) https://doi.org/10.1109/eScience.2019.00068

B Mecum, S Wyngaard, C Willis, M Turk, T Thelen, I Taylor, V Stodden, D Perez, J Nabrzyski, B Ludaescher, S Kulasekaran, K Kowalik, MB Jones, M Hategan, N Gaffney, K Chard, A Brinckman. (2018). Science, containerized: Integrating provenance and compute environments with the Whole Tale, American Geophysical Union, Fall Meeting 2018, Washington DC http://adsabs.harvard.edu/abs/2018AGUFMIN53A..02M

Bryce Mecum, Matthew Jones, Dave Vieglais and Craig Willis. (2018). Preserving Reproducibility: Provenance and Executable Containers in DataONE Data Packages, 2018 IEEE 14th International Conference on e-Science (e-Science) https://doi.org/10.1109/eScience.2018.00019

Victoria Stodden. (2017). The Role of Cyberinfrastructure in Reproducible Science, Chameleon User Meeting

Victoria Stodden. (2017). Toward a Reproducible Scholarly Record, Dataverse Community Meeting

Victoria Stodden. (2017). Implementing Reproducible Computationally-Enabled Science, Institute for Advanced Computational Science

Victoria Stodden. (2017). Reproducibility in Computationally-Enabled Research: Integrating Tools and Skills, METRICS Seminar

Victoria Stodden. (2017). Data-Sharing and Reproducibility, National Academy of Sciences

Victoria Stodden. (2017). Research Data Management Implementations: Towards the Reproducibility of Science, RDMI Workshop

Victoria Stodden. (2017). Reproducibility in Computationally-Enabled Research, The Judith Resnik Year of Women in ECE Seminar

Victoria Stodden. (2018). Infrastructure for Enabling Reproducibility in Computational and Data-enabled Science, Biostatistics Seminar Northwestern Preventive Medicine https://www.preventivemedicine.northwestern.edu/divisions/biostatistics/seminars.html

Victoria Stodden. (2018). Open Data, Code, and Computational Reproducibility, CMU Open Science Symposium https://events.mcs.cmu.edu/oss2018/

Victoria Stodden. (2018). Enabling Reproducibility in Computational and Data-enabled Science, Ecole Polytechnique Federale de Lausanne https://memento.epfl.ch/event/enabling-reproducibility-in-computational-and-da-3/

Victoria Stodden. (2018). Enabling Reproducibility in Computational and Data-enabled Science, Workshop II: HPC and Data Science for Scientific Discovery. Part of the Long Program: Science at Extreme Scales: Where Big Data Meets Large-Scale Computing https://www.ipam.ucla.edu/programs/workshops/workshop-ii-hpc-and-data-science-for-scientific-discovery/

Matt Turk, Kacper Kowalik. (2018). Sneaking Data into Containers with the Whole Tale, SciPy2018

Craig Willis. (2018). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, NDS/MBDH Data Science Tools & Methods Workshop http://www.nationaldataservice.org/get_involved/events/DataScienceTools/

Craig Willis, Kacper Kowalik. (2018). Container applications in research computing and research data access, PEARC’18 https://pearc18.conference-program.com/?page_id=10&id=pan109&sess=sess155

Craig Willis. (2018). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, PresQT 2018

Development Documents

Developer Guide

Whole Tale is an open-source software project and external contributions are encouraged. Please feel free to ask questions or suggest changes to this Developer Guide.

Issue management

The core team uses Github for issue management. General issues, or issues where the specific component is unknown, are filed in https://github.com/whole-tale/whole-tale/issues.

During weekly development calls, issues are prioritized, clarified, and assigned to release milestones.

Defining “done”

What does it mean for an issue or task to be “done”?

  • Code complete

  • Unit tests complete and passing

  • Manual tests defined and passing

  • Documentation updated

  • PR reviewed and merged

Code management

Best practices:

  • Never commit code to master. Always use a fork or feature branch and create a Pull Request for your work.

  • Name your branch for the purpose of the change. For example feat-add-foo.

  • Always include clear commit messages

  • Organize each commit to represent one logical set of changes. For example, separate out code formatting as one commit and functional changes as another.

  • Reference individual issues in commits

  • Prefer rebasing over merging from master

  • Learn to use rebase to squash commits – organize commits for ease of review.

  • Never merge your own PR if not approved by at least one person. If reviews aren’t happening in a timely manner, escalate them to the team.

  • Merging a PR means that the work has been tested, reviewed, and documented.

Testing

Every PR must include either a unit test or manual test scenario. PRs will not be merged unless tests run successfully.

Manual test cases will be added to the test plan template.

For the Whole Tale API, we leverage Girder’s automated testing framework.

Tests are run via CircleCI. Tests will fail with < 82% coverage.

Repositories and components

The project has the following repositories:

Core services:

Setting up for local development

The entire WT platform stack can be deployed locally or on a VM using the development deployment process.

The WT platform stack can be deployed on an Open-Stack cluster using the Terraform deployment process.

Integrating with the ‘Analyze in Whole Tale’ feature

To utilize Whole Tale’s ability to create a Tale based on data in your repository, follow the steps outlined below. The general idea behind this feature is that the backend endpoint will never change, but the user interface may. For this reason, third parties should send their users to the /integration endpoint, which then redirects them to the appropriate frontend URL. A condensed sketch of the resulting integration.py appears after the steps.

  1. Clone the girder_wholetale repository

  2. Create a folder in server/lib with the name of your service as the folder name

  3. Add an integration.py file in the folder

  4. Copy and paste the contents of the DataONE or Dataverse integration.py into yours

  5. Change the content in autoDescribeRoute to match your service, including any query parameters

  6. Change the name of the __DataImport method to match the name of your service

  7. Modify any of the query parameters in the method if you’ve changed them

  8. Navigate to server/rest/integration.py

  9. Import your method in server/rest/integration.py (see how it’s done for current integrators)

  10. Add self.route('GET', ('YOUR_SERVICE_NAME',), YOUR_METHOD_NAME) to the __init__ method
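For reference, here is a condensed sketch of such an integration.py for a hypothetical service “myrepo”, following the Girder autoDescribeRoute pattern used by the DataONE and Dataverse integrations. The query parameter and method names are illustrative, not the actual plugin code:

from girder.api import access
from girder.api.describe import Description, autoDescribeRoute

@access.public
@autoDescribeRoute(
    Description('Convert a myrepo dataset reference into a Tale import')
    .param('datasetId', 'Identifier of the dataset on myrepo', required=True)
)
def myrepoDataImport(datasetId):
    # Validate the parameters and redirect the user to the dashboard's
    # import flow; the /integration endpoint itself never changes.
    raise NotImplementedError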

GitHub Organization

Meta repositories:
  • whole-tale - User facing repository used mostly as a general bug tracker

  • wt-design-docs - This repository

  • girder_deploy - Collection of scripts used for deploying Girder (likely obsoleted by terraform_deployment)

  • terraform_deployment - Terraform deployment setup for WT production system.

  • deploy-dev - Scripts for developers that want to deploy Whole Tale

  • tale_serialization_formats - Contains documentation related to the format of the manifest.json file and zip&bag export formats

Core services:
  • dashboard - Frontend UI to Whole Tale.

  • girder_wholetale - Girder plugin providing basic Whole Tale functionality.

  • girder_wt_data_manager - Girder plugin for external data management.

  • wt_sils - Girder plugin providing Smart Image Lookup Service.

  • gwvolman - Girder Worker plugin responsible for spawning Instances and mounting GirderFS on compute nodes.

  • globus_handler - Girder plugin that supervises transfers between multiple GridFTP servers

Images:

Communicating

The Whole Tale development team strives for open channels of communication. The development team communicates through the following:

Design Notes

Comparison between ownCloud and WsgiDAV

This is a quick and dirty performance comparison between the ownCloud WebDav server and WsgiDAV.

A few tests are run:

  • a recursive copy of a directory that mostly includes source files

  • large file copies (256M, 1G)

  • deletion of the directory in the first step

  • recursive listing

  • evaluation of a jupyter notebook

Both ownCloud and WsgiDAV are run on plain HTTP and locally. Data is stored on an SSD. ownCloud is configured with Redis caching, with Redis running locally and the connection being through a UNIX socket. Redis is also used for ‘memcache.locking’. These are recommended performance settings for ownCloud (see here). WsgiDAV is run with its own HTTP server.

The “many files” used in the tests consist of a structure with 50 sub-directories. There are a total of 680 files adding up to 5.1MB.

Results

Recursive copy (cp -r ...)

_images/Recursive_Copy.png

Recursive listing (ls -lR)

_images/Recursive_Listing.png

Recursive delete (rm -rf ...)

_images/Recursive_Delete.png

File copy, 256M (local to WebDAV mount)

_images/256M_Copy.png

File copy, 1G

_images/1G_Copy.png

Notebook evaluation (jupyter nbconvert --to notebook --execute ...)

The notebook has some code that loads a number of data files stored in the same directory as the notebook (i.e., on the WebDAV mount). Jupyter’s configuration directory is also set to the WebDAV mount.

_images/Notebook_Eval.png

This document keeps track of what metadata is automatically generated by the Whole Tale system, and what the default values are. This is relevant to publishing Tales on DataONE.

System Metadata Generation

Any file that doesn’t already exist on DataONE needs to have a metadata document describing its properties. This is accomplished using dataone_package.generate_system_metadata, which ultimately calls the d1_python library. The object’s MIME type, MD5 checksum, and size are all put into the metadata document.

Additionally, system metadata is generated for both the EML document and the resource map.

Rights Holder

The rights holder typically corresponds to a user’s ORCID. Right now this is hard-coded and will be addressed with Globus-DataONE integration.

Access Policy

The metadata document holds information about who can access the data. By default, access is set to public with read permissions. A sketch of generating such a system metadata document appears below.
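The following sketch shows how a system metadata document with these defaults might be assembled using d1_python’s pyxb bindings, based on usage examples from the DataONE documentation. The format ID and rights holder are placeholders, and the exact binding calls should be treated as assumptions rather than the code used by dataone_package.generate_system_metadata:

import hashlib

from d1_common.types import dataoneTypes

def make_sysmeta(pid, data,
                 format_id='application/octet-stream',         # MIME type
                 rights_holder='http://orcid.org/<orcid-id>'):  # user's ORCID
    sysmeta = dataoneTypes.systemMetadata()
    sysmeta.identifier = pid
    sysmeta.formatId = format_id
    sysmeta.size = len(data)
    sysmeta.checksum = dataoneTypes.checksum(hashlib.md5(data).hexdigest())
    sysmeta.checksum.algorithm = 'MD5'
    sysmeta.rightsHolder = rights_holder

    # Default access policy: public read.
    policy = dataoneTypes.accessPolicy()
    rule = dataoneTypes.AccessRule()
    rule.subject.append('public')
    rule.permission.append('read')
    policy.append(rule)
    sysmeta.accessPolicy = policy
    return sysmeta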

Minimum EML Generation

When uploading the package, a minimum EML record is required. This can be edited by the user at a later step in the process. The minimum record documents the following:

  1. The title. This is set to the Tale title.

  2. The surName of the individualName in the creator field. This is currently set to the lastName of the user.

  3. An otherEntity block for each object. This includes a physical section outlining the size and name of the object.

  4. A section for the tale.yaml file.

Creating Frontends

This document describes basic requirements for creating a usable Frontend (also known as Whole Tale Image).

Description

The software environment used in research plays a vital role in Whole Tale and needs to be preserved in the same fashion as other research artifacts. Currently Whole Tale utilizes Docker containers for that purpose. Our system is designed to allow users to create and publish Docker images, which can subsequently be used as building blocks for Tales.

Due to constraints related to reproducibility, we do not permit uploading raw Docker images. Instead we require users to provide a GitHub repository that contains everything necessary to build a Docker image, which at a bare minimum is a Dockerfile. A reference to the GitHub URL and a commit ID is kept in our database as a Recipe object.

Image composition guidelines

Docker images in WT are built from Recipe objects, which contain the information necessary to create a local copy of an environment where the docker build command can be executed. There is a minimum set of requirements that the resulting Image has to fulfil:

  • define a single port that is going to be used to access a running container via http(s) (e.g. EXPOSE 8888)

  • define a command that is executed upon container initialization (CMD ["/bin/app.exe"])

  • define a user that will own container session (USER jovyan)

  • define a single volume that will be used as a mount point for Whole Tale FS (VOLUME ["/home/jovyan/work"])

Optionally:

  • FROM target should be referenced via a digest rather than a tag (currently not enforced).

The aforementioned properties can be defined by the Dockerfile, but may be overridden through WT’s Image object, e.g.:

"config": {
  "command": "/init",
  "port": 8787,
  "targetMount": "/home/rstudio/work",
  "urlPath": "",
  "user": "rstudio"
}
Notes
  • The config.command and config.urlPath properties support templating. Currently, config.port and a randomly generated token can be passed via {port} and {token}, respectively (see the example below).

  • config options passed to Docker during Tale initialization are currently evaluated in the following order (first on the list takes precedence; see the sketch after this list):

    • Tale object’s config property

    • Image object’s config property

    • Docker image defaults

    • gwvolman defaults
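A minimal sketch of this precedence rule (the names and values are illustrative):

# Later dicts override earlier ones, so the Tale's config wins.
gwvolman_defaults = {"port": 8888, "user": "jovyan"}
docker_image_defaults = {"command": "/init"}
image_config = {"port": 8787, "user": "rstudio"}
tale_config = {"memLimit": "4096m"}

effective_config = {
    **gwvolman_defaults,
    **docker_image_defaults,
    **image_config,
    **tale_config,
}
# -> {'port': 8787, 'user': 'rstudio', 'command': '/init', 'memLimit': '4096m'}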

Example Frontends

A basic Jupyter image that WT uses is defined here. The relevant part of the Image object:

"config": {
  "command": "jupyter notebook --no-browser --port {port} --ip=0.0.0.0 --NotebookApp.token={token} --NotebookApp.base_url=/{base_path} --NotebookApp.port_retries=0",
  "memLimit": "2048m",
  "port": 8888,
  "targetMount": "/home/jovyan/work",
  "urlPath": "?token={token}",
  "user": "jovyan"
}

For a full list of properties supported by the config parameter, please refer to the plugin code.

Backup design notes

Background:

  • The MongoDB is currently backed up manually via mongodump and copied to an external resource (e.g., ytHub).

  • The terraform_deployment process was modified to restore from backup if the URL was provided in variables.tf. The idea behind this was to quickly deploy a new instance of the infrastructure with the latest user data, for example when moving from Nebula to Jetstream.

  • If WT is intended to be an installable package, then the backup scenario should be flexible but also have a low bar. For example, we don’t want to require everyone to use Crash Plan or come up with their own custom solution. Ideally, we would provide a basic functional backup that is flexible enough for most scenarios.

  • Initial discussions revolved around use of Box, since it is already resilient infrastructure, instead of trying to set up our own remote backup server. It would certainly be possible to set up a VM at IU to handle backups to disk. DataONE is using Bacula (http://blog.bacula.org/); NCSA ITS recommends Crash Plan to projects.

  • Mongo is running in Docker Swarm and attached to the wt_mongo network. Any backup implementation will need to be able to handle mongodump.

Requirements:

  • Periodic (eventually nightly) backup of user and system data

  • Ability to restore from backup in the event of disaster recovery or when migrating/upgrading infrastructure

  • Need to support backup and recovery for multiple WT instances (i.e., dev/testing/production)

  • Integration with monitoring/alert system (e.g., notifications if backup fails)

  • Configurable retention policy (default: 1 week)

User/system data:

  • User data is currently stored in MongoDB and the home directory mount on the fileserver node

  • System data is primarily stored in MongoDB. There are a few configuration files that might be useful to capture, e.g., acme.json.

  • Open question: Do we need to back up any of the cached/registered user data?

Preliminary implementation:

  • Uses rclone (https://rclone.org/), a cloud-oriented rsync. The user would supply an rclone configuration file during WT system installation. Unfortunately this requires interactive setup to obtain an OAuth token, but the token is valid for 60 days.

  • The backup process is containerized and includes both rclone and mongodump support; see https://github.com/whole-tale/backup.

  • Data and the rclone config are mounted into the container via the -v flag. The backup.sh script runs mongodump, tars the mounted directory, and copies the result to Box under “WT/name/YYYYMMDD/”.

  • Nightly backups are handled via a systemd timer; a single timer runs the backup nightly.

  • The backup process runs on the WT fileserver for direct access to user directories.

  • Initially, home directories are tar’d, gzip’d, and sync’d to Box. Mongodump output is compressed.
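A minimal sketch of this flow (the local paths, instance name, and the rclone remote name box: are assumptions):

# Nightly backup sketch: dump MongoDB, tar home directories, copy to Box.
import datetime
import subprocess
import tarfile

stamp = datetime.date.today().strftime("%Y%m%d")
dest = "box:WT/production/" + stamp  # rclone remote and path are assumptions

# Gzip-compressed MongoDB dump.
subprocess.run(["mongodump", "--gzip", "--archive=/backup/mongodump.gz"], check=True)

# Tar and gzip the mounted home directories.
with tarfile.open("/backup/homes.tar.gz", "w:gz") as tar:
    tar.add("/mnt/homes", arcname="homes")

# Sync both artifacts to Box via rclone.
subprocess.run(["rclone", "copy", "/backup", dest], check=True)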

Meeting Notes

Meeting notes are available on GitHub: https://github.com/whole-tale/wt-design-docs/tree/master/development/meetings.

Testing Process

Usability Test Results

As part of the effort to gain insight into the usability of the Whole Tale dashboard, members of the team developed a series of common tasks that users may perform, along with questions for each task. Members of the development team then performed these tasks and answered the questions. This document is a synthesis of their responses.

The usability tests can be found at

https://hackmd.io/d-m0LsVXSh2EpzhcRJMCUA

Task 1: Launching the LIGO Tale

Broad Goal: Find a tale (a combination of environment and data), launch it, use it (briefly) to its desired end, and then terminate it.

Questions:

  1. What is a tale?

    There was common agreement that a tale is a container/package that has at least someone’s code and data inside.

  2. When you ran the LIGO tale, what did it do?

    There was a bug that prevented some users from launching the tale. Users who did not experience the error were able to launch the tale successfully. The notebook file had to be opened manually, and then it was possible to generate the output data plots.

  3. What was clear/unclear about this process?

    There was a lack of direction about what to do, where to go, and how to run the experiment once the user entered the Jupyter Notebook. In particular, the file structure was confusing.

Common Issues/Questions:
  1. When launching a tale, it was confusing to see both “View” and “Launch” on the tale widget.

  2. After clicking “Launch”, users expected the tale to launch and open; instead, if they missed the window to click the “Go to Tale” button, they had to visit the Status page and interact with the UI again to view it.

  3. There was a lack of understanding of what to do once the tale is launched (i.e., why am I not seeing the tale after clicking Launch? Where do I go to see the tale?).

Suggestions:
  1. Automatically open the notebook file when the user enters the tale.

  2. When the user clicks “Launch”, open the tale in a new tab when it is ready.

Misc Notes:
  1. The notebook was in a data directory; why?

  2. The ‘view’ link was expected to bring the user to a summary of the tale, not the experiment.

  3. ‘Data’ and ‘Home’ directories are not native to Jupyter environments, which may be misleading or confusing for a user who is familiar with the Jupyter environment.

Task 2: Registering data from Globus

Broad Goal: Register a dataset that is accessible by a URL, and then put that data into its own folder.

Questions:
  1. What is the difference between the home, data, and workspace folders?

There was agreement between two respondents that raw data would appear in the data folder.

Two respondents thought that personal data would be kept in home

The workspace folder had the greatest amount of uncertainty: each respondent failed to clearly identify its purpose.

  2. If you want to share a registered dataset, which folder would you put it in?

Two respondents felt that the data would go into the data folder, but the workspace folder was also considered.

Common Issues/Questions:
  1. There was a common theme of uncertainty about how the workspace folder should be utilized, possibly stemming from a lack of understanding of its purpose.

Task 3: Creating a Tale

Broad Goal: Given a registered dataset, create a new “Tale” that utilizes it and launch that Tale.

Questions:
  1. Were you easily able to find the registered data set?

One user experienced an error when attempting to access their data; two of the respondents were able to locate the data.

  2. Inside the tale, were you able to identify your files of interest?

Two of the respondents ran into a bug when launching the tale and were unable to complete the task. Another user was able to find the data in the Data folder.

Misc Notes:
  1. Two respondents mentioned that the login screen may be confusing to new users.

  2. One respondent was confused about the kinematic folder in the tale.

  3. The data folder was read-only even though the tale was unpublished, which prevented the user from organizing the tale.

Task 4: Importing Data from DataONE

Broad Goal: Register a dataset that is accessible from DataONE, and use it in a Tale.

Questions:
  1. Where did you find the data you registered in Whole Tale?

    The data was found in the Data folder (in the dashboard).

  2. What is the difference between Catalog and My Data?

One user thought that it was the difference between cached data and what had been registered. Two other respondents agreed that the Catalog has data that others had registered, while My Data has data specific to the user.

  3. Did you find the data in your running tale in RStudio?

Two of the users ran into a bug when launching the tale; the other ran into a bug when attempting to access /work/data.

  4. What was clear/unclear about the process?

It may be confusing for a user dealing with DOIs, reference URIs, data IDs, etc.

There was a lack of indication of where the data was being put in the dashboard.

Task 5: Importing Data from Globus

Broad Goal: Register a dataset that is accessible from Globus, and use it in a Tale.

Questions:
  1. Which folder were you expecting the data to be registered in?

There was a split between users thinking that the data would go into the workspace and data folders. During registration, a notification came up that stated the data was being copied to the workspace.

  2. Did the file names and extensions in the tale match the ones in Globus?

The filenames did not match those in Globus, and were extensionless.

  3. If there were any hurdles for plotting the data, what were they?

The filenames not matching those in Globus was an issue. In addition, only a single file was registered in Whole Tale during registration, despite there being more files in the Globus package.

Task 6: Import Recipe and Build Image

Broad Goal: Register a Git repo containing a recipe and build a Whole Tale Image.

Questions:
  1. Was the process self-explanatory? How could the UI design or hints/documentation be improved to help a user walk through it without seeking help?

In general, each user had issues determining what was asked at each step, and what each step represented (for example, what is a recipe?).

There was a consensus that error reporting can be improved when recipe/image creation fails.

  2. Are the current steps an efficient method for representing the breadth of functionality you might want to achieve from Whole Tale frontends?

There was agreement that the current system may be too complicated for normal users: “The notion of ‘recipes’ and ‘frontends’ as an abstraction over Docker images makes it even harder to understand what I’m doing.”

  3. Can you think of use cases for using the second step (Create Image) without the first, i.e., to provide multiple images for the same recipe?

None of the users could think of any use cases.

  4. Any other feedback about the existing process? Please provide input on streamlining the process, if relevant.

There was consensus that it wasn’t clear which fields were required, and that specifying the commit could be automated. It was also suggested that some fields, like port and volume, might be taken from the Dockerfile. It may be a good idea to separate the more advanced fields from the bare-minimum ones.

TODO: Describe the testing process.

  • Manual testing

  • Unit testing

  • Integration testing

  • Continuous integration

Release Process

The release process includes the following:

  • Create release milestones in GitHub for the affected repositories

  • Team identifies target features for release, creates issues, and assigns milestones to associated issues

  • Features are implemented, tested, and documentation updated either on master or a designated feature branch

  • Changes are merged to stable branch

  • Stable branch is updated with any version changes, if applicable

  • Release candidate is created by tagging the stable branch (v1.0-rc1)

  • Release candidate is tested

  • Release notes are created and added to documentation

  • Final release is created by tagging the stable branch (v1.0)

  • Install release to production instance

  • Announce release to community

Detailed release process

  • For all repos, merge or cherry-pick commits from master to stable, bump versions, and create release tag.

  • Wait for autobuilds of Docker images, then deploy to staging.

  • Publish releases via GitHub

  • Deploy tagged version to staging

  • Testing/smoke test

  • Deploy to production

  • globus_handler

    • Bump version in plugin.json (master/stable)

  • girder_wt_data_manager, wt_home_dirs, wt_versioning, virtual_resources

    • Bump version in plugin.yml (master/stable)

  • girder_wholetale

    • Bump version in plugin.yml (master/stable)

    • Pin version of gwvolman in requirements.txt

  • gwvolman

    • Bump version in setup.py (master/stable)

    • Pin version of girderfs in requirements.txt

  • girder

    • Clone repo with submodules

    • In stable branch, checkout version tag for each plugin, commit

  • wt-design-docs:

    • Add release notes

    • Update ISSUE_TEMPLATE/test_plan.md

Release steps used for v0.6

# Clone the repos
# Bump version on master
# Merge to stable
# Bump version on master (to next dev version)
# Release

# repodocker_wholetale
# girderfs
# gwvolman
# virtual_resources
# wt_versioning
# girder_wt_data_manager
# globus_handler
# wt_home_dirs
# girder_wholetale
# ngx-dashboard
# wt-design-docs
# terraform_deployment


version_stable="1.0.0"
tag_stable="v1.0"
version_master_python="1.1.0dev0"
version_master_plugin="1.1.0"

# virtual_resources
git clone https://github.com/whole-tale/virtual_resources
cd virtual_resources
sed -i.bak "s/^version: .*/version: ${version_stable}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git checkout stable
git merge master
git tag ${tag_stable}
git push origin stable
git push origin --tags
git checkout master
sed -i.bak "s/^version: .*/version: ${version_master_plugin}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git push origin master

# wt_versioning
git clone https://github.com/whole-tale/wt_versioning
cd wt_versioning
sed -i.bak "s/^version: .*/version: ${version_stable}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git checkout stable
git merge master
git tag ${tag_stable}
git push origin stable
git push origin --tags
git checkout master
sed -i.bak "s/^version: .*/version: ${version_master_plugin}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git push origin master

# girder_wt_data_manager
git clone https://github.com/whole-tale/girder_wt_data_manager
cd girder_wt_data_manager
sed -i.bak "s/^version: .*/version: ${version_stable}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git checkout stable
git merge master
git tag ${tag_stable}
git push origin stable
git push origin --tags
git checkout master
sed -i.bak "s/^version: .*/version: ${version_master_plugin}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git push origin master

git clone https://github.com/whole-tale/wt_home_dirs
cd wt_home_dirs
# Same steps as girder_wt_data_manager

git clone https://github.com/whole-tale/globus_handler
cd globus_handler
# Same steps as girder_wt_data_manager

git clone https://github.com/whole-tale/girderfs
cd girderfs
sed -i.bak "s/__version__ = '[^']*'/__version__ = '${version_stable}'/g" ./girderfs/__init__.py
git add ./girderfs/__init__.py
git commit -m "Bump version"
git checkout stable
git merge master
git push origin stable
git tag ${tag_stable}
git push origin --tags
git checkout master
sed -i.bak "s/__version__ = '[^']*'/__version__ = '${version_master_python}'/g" ./girderfs/__init__.py
git add girderfs/__init__.py
git commit -m "Bump version"
git push origin master

git clone https://github.com/whole-tale/gwvolman
cd gwvolman
# Pin girderfs version in requirements.txt to ${tag_stable}
sed -i.bak "s/version='[^']*',/version='${version_stable}',/g" setup.py
git add setup.py requirements.txt
git commit -m "Bump version"
git checkout stable
git merge master
git push origin stable
git tag ${tag_stable}
git push origin --tags
git checkout master
# Pin girderfs version in requirements.txt to master
sed -i.bak "s/version='[^']*',/version='${version_master_python}',/g" setup.py
git add setup.py requirements.txt
git commit -m "Bump version"
git push origin master

git clone https://github.com/whole-tale/girder_wholetale
cd girder_wholetale
# Pin gwvolman version in requirements.txt to ${tag_stable}
sed -i.bak "s/^version: .*/version: ${version_stable}/g" plugin.yml
git add plugin.yml requirements.txt
git commit -m "Bump version"
git checkout stable
git merge master
git push origin stable
git tag ${tag_stable}
git push origin --tags
git checkout master
# Pin gwvolman version in requirements.txt to master
sed -i.bak "s/^version: .*/version: ${version_master_plugin}/g" plugin.yml
git add plugin.yml requirements.txt
git commit -m "Bump version"
git push origin master

git clone --recurse-submodules https://github.com/whole-tale/girder
cd girder
git checkout stable
git merge master
cd plugins/globus_handler
git checkout ${tag_stable}
cd ../wholetale
git checkout ${tag_stable}
cd ../wt_data_manager/
git checkout ${tag_stable}
cd ../wt_home_dir/
git checkout ${tag_stable}
cd ../wt_sils/
git checkout ${tag_stable}
cd ..
git commit -a -m 'Bump version'
git push origin stable
git tag ${tag_stable}
git push origin --tags

git clone https://github.com/whole-tale/dashboard
cd dashboard
sed -i.bak "s/\"version\": \"[^\"]*\"/\"version\": \"${version_stable}\"/g" package.json
git add package.json
git commit -m "Bump version"
git checkout stable
git merge master
# Manually merge conflicts
git commit
git push origin stable
git tag ${tag_stable}
git push origin --tags

# Update release notes
git clone https://github.com/whole-tale/wt-design-docs
cd wt-design-docs
sed -i.bak "s/version = '[^']*'/version = '${version_stable}'/g" conf.py
git add conf.py
git commit -m "Bump version"
git push origin master
git tag ${tag_stable}
git push origin --tags

# Publish releases via GitHub

# Deploy tagged version to staging

# Testing/smoke test

# Deploy to production