Whole Tale¶
A platform for publishing transparent and reproducible computational research
Announcement
Whole Tale 1.2 released! See what’s new.
Whole Tale is an open-source platform designed to simplify the process of creating, publishing, and verifying computational research artifacts. Use Whole Tale to:
Run your code on an external system to ensure reproducibility.
Configure a computational environment with the exact versions of operating system and software used to obtain results.
Capture a recorded run of your workflow to ensure transparency of results.
Publish computational research artifacts to popular research archives.
For more information, see:
- Why Whole Tale?
An overview of the Whole Tale platform.
- What’s new?
Description of key features and known limitations of the platform.
- User’s Guide
Platform user’s guide
- Tutorials
Step-by-step tutorials
The project is funded by the National Science Foundation. A public version of the platform is available for academic use at https://dashboard.wholetale.org/.
Terms of Use¶
Whole Tale is an open-source platform. The operational version hosted on Jetstream2 is for academic use only and limited to fair and reasonable use.
MATLAB licensing is on a “right to use” basis and not guaranteed in perpetuity. Absolutely no proprietary, closed, or for-profit use of MATLAB is permitted.
Violating any terms of service with respect to ACCESS, Jetstream2, or Mathworks licensing may result in loss of access to these services.
Tutorials¶
- Quickstart
A simple example demonstrating how to run an existing tale and create a new tale that uses externally registered data.
- STATA Tutorial
Demonstrates how to create and run a tale based on the STATA Desktop environment.
- JupyterLab
Demonstrates how to create and run a tale based on the JupyterLab environment.
User’s Guide¶
This guide provides detailed information about the Whole Tale system for new users.
Why Whole Tale?¶
Whole Tale is developing open-source tools and infrastructure intended to simplify the process of creating, publishing, and verifying computational research artifacts in conformance with community standards (Chard et al. [1]).
Research communities across the sciences are increasingly concerned with the replication of computationally obtained results and the reuse of research software, both to increase trust in published research and to increase its scientific impact. Transparent, reproducible, and reusable research artifacts are seen as essential to sustaining and further accelerating scientific discovery. However, ensuring the transparency and reproducibility of results or the reusability of software presents many challenges, both technological and social. Many communities are turning to peer-review processes as a means to encourage and enforce new practices and standards (Willis and Stodden [2]).
For more information, see the report of the National Academies of Sciences, Engineering and Medicine on Reproducibility and Replicability in Science.
What is a “Tale”?¶
Whole Tale defines a model for computational reproducibility that captures the input, output, data, code, execution environment, provenance and other metadata about the results of computational research. We refer to this model as a tale – a composite research object that includes the environment, configuration, metadata, code, and data objects required to fully reproduce computational results (Chard et al. [3]). Technically speaking, a tale’s computational environment is defined by a Docker image specification that can be published to an external archive – along with your code and data – and re-run either in Whole Tale or even on your laptop.
What can Whole Tale do?¶
Having created a tale, a researcher can share it with others, publish it to a research repository (such as Zenodo or Dataverse), associate a persistent identifier with it, and link it to publications. Other researchers or reviewers can instantiate the published version of the tale and execute it in the same state as when it was published. Tales may also contain intellectual property metadata with licensing information, enabling reuse, reproducibility, attribution, and broad access.
Who uses Whole Tale?¶
Whole Tale is intended for researchers, editors, curators, and reviewers of published computational research.
What’s new?¶
Version 1.2 introduces the following features and enhancements:
The ability to access and view Whole Tale without signing in
Public container image registry
Automated recorded execution via recorded runs
The ability to override default configurations, upload folders, and view instance logs
Performance improvements for image building and caching
Updates repo2docker to version 2022.10, adding support for Julia
Integration with new third-party platforms including the Confirmable Reproducible Research Environment, OpenICPSR, and DERIVA
For a complete list of features and bugfixes, see the v1.2 release notes.
Planned Features¶
For a complete list of current and planned features, see the release schedule.
Archival storage of container images
User interface to configure the environment
Support for user-contributed templated environments
Improved composability of environments
Improved accessibility including use of VS Code
Increased resources (CPU, memory) and GPU support
Metadata enhancements including citations, licenses
Computational provenance recorder using eBPF
Limitations¶
The Whole Tale dashboard works best in Chrome. There are known issues in Firefox.
Signing In¶
Users can access Whole Tale to browse public tales without signing in. However, many of the core operations require users to have an account. Whole Tale allows users to sign in using existing credentials from hundreds of research institutions and organizations, as well as ORCID or Google accounts.
Note
The Whole Tale system uses Globus Auth to allow users to login using their existing credentials. For more information about Globus, see their documentation.
Go to https://dashboard.wholetale.org and select the “Sign In” link to start the login process.

Search for and select your institution or organization from the search box. If your organization does not appear in the list, you can use your Google account, ORCID account, or register for a Globus account. After selecting your organization, select the “Continue” button.

You will be redirected to your organization’s login page. Enter your credentials.

Note
Whole Tale user accounts are based on the email address obtained from the institutional login. Two different accounts (e.g., institution and ORCID) that have the same email address will have the same Whole Tale user.
Whole Tale uses authentication services provided by Globus. The first time you login, you may be prompted to grant Whole Tale access to information via Globus.

After logging in you will be redirected to the Whole Tale dashboard where you can explore existing and create new tales.
Important
You can revoke this consent at any time via https://app.globus.org/account/consents.
Signing in with Google¶
You will be asked to “Sign in to continue to globus.org”; this is expected, as Whole Tale uses Globus toolkits. If you are already signed into a Google account in your browser, you will be offered one or more accounts to choose from; if not, enter the preferred Google account to use.

Signing in with ORCID¶
You can also use your ORCID account. In this case, you will be redirected to the ORCID authentication screen. You should authenticate as you usually would to ORCID.

About Globus¶
Globus (https://www.globus.org/what-we-do) is a non-profit service of the University of Chicago for secure and reliable research data management. Globus software is commonly used to transfer data in research computing infrastructure. The use of Globus authentication services in Whole Tale enables us to access data on behalf of users via the Globus network.
Exploring Existing Tales¶
The Tale Dashboard page is the main landing page for Whole Tale. It allows you to create new tales or run existing tales.
From this page you can:
Browse and search for tales that you have created, that have been shared with you, or that are public.
Create a new tale
Delete tales that you own or have edit access to

Whole Tale’s main landing page¶
The Tale Dashboard has four sections:
My Tales: Tales you have created
Shared with Me: Tales shared with you by other users
Public Tales: Tales shared publicly by users of the system
Currently Running: Displayed if you have any interactive sessions running
My Tales¶
The My Tales tab displays all tales that you have created or copied. You have edit permission on these tales and can delete them or share them with other users.
Public Tales¶
The Public Tales tab displays all tales that have been shared publicly by users of the system. These are all read-only. If you attempt to run a public tale, a copy will be made and appear under My Tales with the “COPY” indicator.
Currently Running Tales¶
If you have clicked the Run Tale button for any tales, the Currently Running panel will display. You may have at most two interactive environments running at the same time.
Tale Operations¶
View Tale¶
Hover over the tale card and select View to access a tale. You can view or edit metadata, files, and run the interactive environment created by the author.
Run Tale¶
Clicking the Run Tale button on a tale that you own will start the associated interactive environment. On tales shared publicly or with read-only permissions, a copy will first be created.
Stop Tale¶
Clicking the Stop Tale button will stop the interactive environment.
Delete Tale¶
To delete a tale, click the “X” button on the tale card. You will be prompted to confirm before the tale is deleted.
Important
Stop will end your interactive session, shutting down the associated container. Delete will completely remove your tale from the system.
Creating New Tales¶
Tales contain the inputs, outputs, data, code, execution environment, provenance and other related metadata about the results of computational research. A tale is typically associated with a single publication and contains all information necessary to reproduce reported results.
Tales can be created as follows:
New Tale: Create a new empty tale
GitHub repository: Create a tale based on an existing GitHub repository
From data repository: Create a tale based on an existing dataset stored in a research repository (such as Dataverse, Zenodo or DataONE)
Note
You can create as many tales on the system as you’d like, but you can only have 2 interactive environments running concurrently.

Creating a new Tale¶
Environments¶
When creating a new tale, you must select the default interactive environment that will be used. Supported environments include JupyterLab, RStudio, MATLAB and STATA. For more information including how to customize installed packages, see the Environments section.
Creating an Empty Tale¶
To create an empty Tale, click the Create New Tale button and select the Create New Tale option. The Create New Tale dialog will appear, allowing you to enter a title and select the interactive environment. Select the Create New Tale button and you will be taken to your new tale, where you can upload files, register external data, edit metadata, share with other users, or start an interactive environment. For more information, see the Accessing and Modifying Tales section.

Create New Tale menu¶

Dialog for creating a new Tale¶
Creating Tales from Git Repositories¶
Tales can also be created from existing public Git repositories. To create a new tale that contains a Git repository, select the Create New Tale dropdown menu then Create Tale from Git Repository.

Dialog for creating a new Tale from a Git repository¶
Enter the URL of a public repository, title for your tale, and select the desired interactive environment. Select the Create New Tale button to create the tale and import contents from the specified Git repository. For more information about using Git in Whole Tale, see Working with Git.
Creating and Importing Tales from External Repositories¶
Tales can also be created from third-party research data repositories. Currently supported repositories include Zenodo, Dataverse, OpenICPSR, and DataONE.

Dialog for creating a new Tale from a DOI¶
Choosing Between Read-Only and Read/Write¶
When a tale is created from an externally registered dataset (e.g., via DOI), you can choose to mount the dataset read-only via external data, or to have the contents of the dataset copied into the workspace, enabling you to write. Citations are automatically generated for read-only external datasets.
Accessing and Modifying Tales¶
The Run view allows you to interact with and modify your running tale. From this page you can:
Start and stop the interactive environment
Edit tale metadata including advanced settings
Share the tale with other users
Initiate other tale actions, including creating versions and recorded runs, and exporting or publishing your tale.
Launching the Tale¶
After you have finalized your tale and click Run Tale, you’ll be brought to the Interact page, where the tale will start up, as seen in the image below. From here you can access the tale, along with an assortment of other actions that are documented below.

A tale that is being created and configured.¶
Interacting With Tales¶
RStudio¶
When starting a tale that is using an RStudio Environment, you’ll be presented with RStudio, shown below.

Each of the folders shown is analogous to the tabs under the Files tab. You can access all of your home files under the home/ folder; data that was brought in from a third-party service can be found under data/; files that were added to your workspace are found under workspace/.
Jupyter Notebook¶
When starting a tale that has a Jupyter Notebook Environment, you’ll be presented with a typical Notebook interface.

As with RStudio, data that came from external repositories can be found under data/, home directory files in home/, and workspace files in workspace/.
Adding Data¶
See File Management for details about how to manage files and data in your tale.
Modifying Tale Metadata¶
The Run page can also be used to access the tale metadata editor, shown below.

The editor can be used to change the environment, add authors to the tale, change the license, make the tale public, and provide an in-depth description of the tale.
Advanced Settings¶
The advanced settings section allows you to override default settings including the default command, environment variables, and memory limits. Note that memory limits are constrained by the underlying virtual machine. Any additional files required for building the container image can be specified using the extra_build_files setting.
{
  "environment": [
    "MY_ENV=value"
  ],
  "memLimit": "12gb",
  "extra_build_files": [
    "some_file.txt",
    "some_folder"
  ]
}
Tale Actions¶
Use the tale’s action menu, highlighted below, to access tale-specific operations.

The tale’s action menu¶
Action | Description
---|---
View Logs | Enabled when your tale instance is running; allows you to view the running container instance logs (i.e., docker logs).
Rebuild Tale | Rebuilds the container image. Requires a restart (below).
Restart Tale | Restarts the container instance.
Save Tale Version | Creates a new version of your tale. See Versioning Tales.
Recorded Run | Starts a recorded run. See Recorded Runs.
Duplicate Tale | Creates a copy of your tale.
Publish Tale | Publishes your tale to a supported repository. See Publishing Tales.
Export Tale | Exports your tale. See Exporting and Running Locally.
Connect to Git Repository… | Connects an existing workspace to a remote Git repository. See Working with Git.
Computational Environments¶
Your tale’s computational environment is a Docker image that is automatically built based on the combination of the selected interactive environment and any specified software dependencies. The interactive environment is selected during tale creation and can be changed on the Metadata page. Whole Tale currently supports a number of popular interactive environments including JupyterLab, RStudio, MATLAB, and STATA.
You can customize these environments by adding your own packages based on repo2docker compatible configuration files. Read more about customizing the environment.
What is Docker?¶
Docker is a popular virtualization (“container”) platform that has been widely adopted for the packaging, distribution, and deployment of software – including research software. Whole Tale creates Docker images on your behalf that capture the operating system and software versions used to execute your computational workflow. These images are stored in our container image registry.
Extending repo2docker¶
Whole Tale uses a custom extension to the Binder project’s repo2docker component to build images. This means that any Binder-compatible repository can also be run in Whole Tale.
Note
repo2docker is a software package that can be used to build custom Docker images based on some simple text file conventions.
Differences from Binder¶
The Whole Tale repo2docker extends Binder, adding the following capabilities:
Support for MATLAB and STATA environments
Support for Rocker Project RStudio environments
Custom Buildpacks¶
The MATLAB buildpack introduces the toolboxes.txt configuration file, described under Customizing the Environment.
Jupyter¶
When starting a Tale that has a Jupyter Notebook Environment, you’ll be presented with a typical Notebook interface.

As with RStudio, data that came from external repositories can be found under data/, home directory files in home/, and workspace files in workspace/.
RStudio¶
When starting a tale that is using an RStudio Environment, you’ll be presented with RStudio, shown below.

Each of the folders shown is analogous to the tabs under the Files tab. You can access all of your home files under the home/ folder; data that was brought in from a third-party service can be found under data/; files that were added to your workspace are found under workspace/.
MATLAB¶
Whole Tale supports the creation, publication, and execution of Tales that rely on MATLAB software. We provide three different interfaces to MATLAB: the Web Desktop (available with MATLAB R2020b and greater), JupyterLab with MATLAB kernel (all supported versions), and the Linux Desktop via XPRA (all supported versions).
Use of MATLAB is subject to our Terms of Use.
Web Desktop¶
Available with MATLAB R2020b and greater, the web desktop provides access to the new MATLAB Online-style IDE that can be used from any standard web browser. The web desktop experience will be familiar to all MATLAB users.

Jupyter with MATLAB Kernel¶
Available for any supported version of MATLAB, the JupyterLab IDE with MATLAB kernel can be used to create and run MATLAB code or Jupyter notebooks using the MATLAB kernel. The JupyterLab terminal provides access to a Linux shell environment to run MATLAB code.

Linux Desktop via XPRA¶
Available for any supported version of MATLAB, the Xpra HTML5 client provides remote access to the native MATLAB Linux desktop.

Customizing Your MATLAB Environment¶
In Whole Tale, users must declare their dependencies in a set of simple text files that are used to build a Docker image. For more information, see Customizing the Environment.
Technical Details¶
Whole Tale requires access to the installation media for each supported release of MATLAB. Downloadable ISO images from Mathworks are converted to Docker images used for installation of the base MATLAB software and selected toolboxes. These images are private to the Whole Tale system, but anyone with an appropriate license should be able to access them from Mathworks. Instructions for creating the installation image are available in the matlab-install repository.
To support the creation of custom MATLAB environments, Whole Tale has created a Binder-compatible buildpack. MATLAB environments begin with only the base MATLAB software installed, and users can add selected toolboxes by listing them in a toolboxes.txt file. The purpose of this is to minimize image sizes at the time of publication by enabling the selection of only those packages required for reproduction. For example, see https://github.com/craig-willis/matlab-example.
Access to MATLAB on the Whole Tale platform is provided by institutional licenses from the University of Texas at Austin and Indiana University through the NSF Jetstream Cloud service. Tales that are published/exported and run outside of the Whole Tale system will require you to provide your own license information.
As noted in our Terms of Use, MATLAB on Whole Tale is for academic use only.
STATA¶
Whole Tale supports the creation of Tales that rely on STATA software.
Interfaces¶
Whole Tale supports two different interfaces for creating STATA based Tales:
Available for any supported version of STATA, the JupyterLab IDE with STATA kernel can be used to create and run STATA code or Jupyter notebooks using the STATA kernel. The JupyterLab terminal provides access to a Linux shell environment to run STATA code.
Available for any supported version of STATA, the Xpra HTML5 client provides remote access to the native STATA Linux desktop.
License¶
Access to STATA on the Whole Tale platform is provided through support from Stata Corp. Tales that are exported and run outside of the Whole Tale system will require you to provide your own license information.
As noted in our Terms of Use, STATA on Whole Tale is for academic use only.
Customizing the Environment¶
You can customize your tale environment by declaring any software dependencies in a set of simple text files. These are used to build a Docker image that is used to run your code.
The text file formats are based on formats supported by repo2docker and, where possible, follow package installation conventions of each language (e.g., requirements.txt for Python, DESCRIPTION for R). In other cases, simple formats are defined (e.g., apt.txt, toolboxes.txt). For more information, see the repo2docker configuration files documentation.
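For illustration, a hypothetical tale workspace might combine several of these files. The layout below is only a sketch; the file names follow the repo2docker conventions described in this section, and the contents are placeholders:

workspace/
├── analysis.ipynb     (your code)
├── apt.txt            (Linux packages installed via apt-get)
├── requirements.txt   (Python packages installed via pip)
└── postBuild          (arbitrary setup commands run at image build time)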
Base environment¶
The base environment is always an Ubuntu “Long Term Support” (LTS) version.
Install Linux packages (via apt)¶
The apt.txt file contains a list of packages that can be installed via apt-get. Entries may have an optional version (e.g., for use with apt-get install <package name>=<version>).
For example:
libblas-dev=3.7.1-4ubuntu1
liblapack-dev=3.7.1-4ubuntu1
For more information, see Install packages with apt-get (repo2docker)
MATLAB¶
Note
MATLAB support is part of the Whole Tale repo2docker extension.
MATLAB toolboxes must be declared in a toolboxes.txt file. Each line contains a valid MATLAB product for the selected version.
For a complete list of available packages for each supported version, see https://github.com/whole-tale/matlab-install/blob/main/products/.
For example, the following toolboxes.txt would install the Financial and Statistics and Machine Learning toolboxes:
product.Financial_Toolbox
product.Statistics_and_Machine_Learning_Toolbox
See our MATLAB example.
STATA¶
Note
STATA support is part of the Whole Tale repo2docker extension.
STATA packages must be declared in an install.do file. Each line contains a valid installation command.
For example, the following install.do uses ssc to install packages:
ssc install estout
ssc install boottest
ssc install hnblogit
R/RStudio¶
R packages may be specified in a DESCRIPTION file or an install.R file.
For install.R, each line is an install.packages() statement for a given package:
install.packages("ggplot2")
install.packages("reshape2")
install.packages("lmtest")
To pin the versions of packages installed via install.R, we recommend configuring an MRAN date using the runtime.txt file:
r-2020-10-20
This file contains the MRAN date that determines the versions of the packages specified in install.R.
Alternatively, you can use the install_version function in place of install.packages in your install.R file:
require(devtools)  # provides install_version()
install_version("ggplot2", version = "0.9.1")
For more information see:
Install an R package (repo2docker)
Specifying runtimes (repo2docker)
Python¶
Python packages can be specified using requirements.txt, Pipfile/Pipfile.lock, or a Conda environment.yml.
Example requirements.txt:
bokeh==1.4.0
pandas==1.2.4
xlrd==2.0.1
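For comparison, a minimal Conda environment.yml covering the same packages might look like the following (a sketch; the channel and version pins are illustrative):

channels:
  - conda-forge
dependencies:
  - bokeh=1.4.0
  - pandas=1.2.4
  - xlrd=2.0.1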
See also:
requirements.txt: Install a Python environment (repo2docker)
Pipfile and/or Pipfile.lock: Install a Python environment (repo2docker)
environment.yml: Install a Conda environment (repo2docker)
Mapping Estimated Water Usage (Example tale)
Environment Variables¶
In addition to using the start file described below, you can specify custom environment variables using the advanced settings.
Other¶
Non-standard packages can be installed (or arbitrary commands run) using a postBuild script. The start script can be used to run arbitrary code before the user session starts; see the sketch following the links below.
Important
The start file is currently not supported in RStudio environments.
See also:
postBuild: Run code after installing the environment (repo2docker)
start: Run code before the user session starts (repo2docker)
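As a sketch of the two hooks (the commands shown are placeholders, not anything Whole Tale requires), a postBuild script runs once while the image is built:

#!/bin/bash
# postBuild: executed once at image build time
pip install --no-cache-dir some-extra-package

while a start script wraps each user session; the final exec "$@" line hands control back to the session command so the environment still launches normally:

#!/bin/bash
# start: executed before the user session starts
export MY_ENV=value
exec "$@"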
File Management¶
The Files tab allows you to manage files in your tale workspace or home folder.
Tale Workspace¶
The Tale Workspace folder is the primary storage area (or folder) for all files associated with a tale. This folder is available in your running tale environment as workspace.

Tale workspace and menu¶
Workspace Operations¶
Common operations include:
Upload files or folders from your local machine
Create, rename, move, remove, copy, download files and folders
Copy or move files between tales
Selecting a folder or file will present a menu with the following options:
Move To: move a file or folder
Rename: rename a file or folder
Share: share a file or folder with a user or group
Copy: copy a file or folder
Download: download a file or folder
Remove: remove a file or folder
Home Folder¶
The Home folder is your personal workspace in the Whole Tale system. You can perform most common operations including uploading, creating, moving, renaming, and deleting files and folders.
Your Home folder is mounted into every running tale instance with full read-write permissions. This means that you can access and manage files from both the Whole Tale dashboard and within tale environments. This is in contrast to the Data folder described below, which is limited to read-only access.
External Data¶
The External Data folder contains references to data you’ve registered in the system for use with the tale. This data is meant to be read only and can only be added from external sources. With this folder you can:
Register data from external sources (e.g., via DOI or URL)
Select and add registered data to a Tale
Supported Data Repositories¶
The currently supported repositories that you can register data from are:
Zenodo: A general purpose research data repository.
Dataverse: An open source research data repository platform with 35 installations worldwide including the flagship Harvard Dataverse.
Globus: A service geared towards researchers and computing managers that allows custom APIs, data publication, and file transfer.
DataONE: A federation of data repositories with a focus on scientific data. A list of participating member nodes can be found on the member node information page.
DERIVA: An asset management platform (http://isrd.isi.edu/deriva/) with a number of deployments. Support for DERIVA in Whole Tale is being tested using data from the Pancreatic β-cell Consortium.
Adding Data¶
Files and folders cannot be uploaded to the External Data folder directly. To encourage reproducibility, only data registered from external resources and associated with a tale will be available in the External Data folder.
To register data from an external resource, use the data registration dialog, shown below.

The data registration dialog allows you to search by DOI and ingest data into Whole Tale.¶
To access this dialog, navigate to the External Data folder by clicking the link icon below the home directory folder.

A user’s External Data folder, populated with data that was registered from external sources.¶
The blue plus icon will open the registration dialog where you can find and register your data. You’ll need to have either the DOI or data package URL to find the data.
Adding Data From DataONE¶
Data packages from DataONE can be integrated into Whole Tale by searching for the DOI of the package or by pasting the URL into the search box in the registration modal.
By DOI
Consider the following package. Visiting the package landing page we can see that the DOI is “doi:10.18739/A29G5GD0V”. To register this data package using the DOI, open the registration dialog and paste the DOI into the search box. Click “search” and check that the correct package was found. Click “Register” to begin data registration.

A dataset that was found by searching for the DOI.¶
By URL
The URL of the data package can also be used to locate the package instead of the DOI. In the previous example, pasting “https://search.dataone.org/#view/doi:10.18739/A29G5GD0V” into the search box will give the same data package which can subsequently be registered.

A dataset that was found by searching with the package’s DataONE url.¶
Adding Data From Dataverse¶
Whole Tale allows you to register data from all 35 public Dataverse installations. Support for additional installations can be added per user request. As with DataONE, data can be registered by providing either a DOI or a direct URL in the search box of the registration modal.
By DOI
DOIs may be specified for either datasets or individual files. For example:
Dataset: doi:10.7910/DVN/TJCLKP
By URL
URLs may be specified for either datasets or individual files using the web or Data Access API formats. For example:
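The original URL examples are not reproduced here; as an illustration only, and assuming the standard Dataverse web URL layout, a dataset page URL for the DOI above would look like:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP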
Adding Data From Globus¶
Data can also be retrieved from Globus by specifying the DOI of the package, as done in the DataONE case.
Note
Only the Materials Data Facility is currently supported.
By DOI
The DOI of the dataset can be found on the dataset landing page. For example, the “Twin-mediated Crystal Growth: an Enigma Resolved” package has DOI 10.18126/M2301J. This DOI should be used in the data registration dialog when searching for the dataset.
Adding Data From DERIVA¶
Data from a DERIVA deployment can be added by browsing to a dataset in the DERIVA user interface and clicking on the Export button in the upper right corner of the screen:

Clicking the export button triggers a drop-down menu, where an option to export to Whole Tale can be selected:

Once the export is initiated, the DERIVA backend will package the dataset and redirect to Whole Tale, where the dataset can be imported.
Adding Data From The Filesystem¶
Files and folders cannot be uploaded to the Data folder directly. To encourage reproducibility, only data registered from external resources or associated with a tale will be available in the Data folder. The data can, however, be uploaded to the Home directory.
Sharing Tales¶
Tales can be shared with other Whole Tale users enabling active collaboration between researchers and peer review. The Shared with Me tab on the main dashboard page lists all of the Tales that other users are actively sharing with you.
To share a Tale with other users, first open the Tale and navigate to the Share tab, shown below.

The Tale sharing area.¶
To add a user as a collaborator, use the Collaborators section to search for their username. Once selected, the appropriate permissions can be set.

Sharing a Tale with a user.¶
Tales can also be unshared with users. To remove a user, navigate to the Collaborators area. Find the user that the Tale is being shared with and click the ‘x’ to remove them.

Read Permissions¶
The default access level for a shared Tale is Can Read. This means that the user you’ve shared the Tale with won’t be able to modify the Tale’s metadata or its files; they will instead be able to view the metadata fields and the included files.
When a user runs a Tale that was shared with them, a copy is created that the user can write to. Although unsharing the Tale will remove it from the user’s dashboard, it will not remove their personal copy if one was made.
Edit Permissions¶
When a Tale is shared for the purpose of active collaboration, the permissions should be set to Can Edit. This gives the user the ability to make changes to the Tale’s metadata and files. When a shared Tale with edit permissions is run, a copy isn’t created as it is for Tales with read permissions. Instead, the Tale is started and the user is free to make modifications. These changes are reflected in the original Tale. Users should be careful not to work on a shared Tale simultaneously, to avoid conflicts with open files. To see recent changes, files need to be re-opened; if a file is saved before being re-opened, any remote changes will be overwritten.
Versioning Tales¶
Whole Tale allows users to create versions of their tales. A version includes the contents of the tale workspace, externally registered data, and metadata. Versions can be renamed, deleted, exported and published. Recorded runs are created based on versions.
Versioning Model¶
Whole Tale uses a simple filesystem-based versioning model. Any versions you create are accessible via the Whole Tale dashboard as well as in the ../versions folder in your running interactive environment.
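For example, from a terminal inside a running tale you can inspect saved versions directly (the version names shown are hypothetical):

ls ../versions/
# first-draft  revised-analysis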
Creating Versions¶
You can create versions of your tale using the Save Tale Version option on
the tale action menu or the Tale History panel.
Select the Tale History icon to open or close the panel:
To create a new version, select Save Tale Version:

Version history¶
Select Files > Saved Versions to manage your versions. From this menu you can rename, remove, download, restore, or export your tale version.

Version menu¶
Note
The Download folder option simply downloads any folder as a zip file. Use the Export Version option to download a complete version of your tale including metadata, external data, recorded runs, etc.
Created versions are accessible from inside your running interactive environment in the ../versions directory:

Versions in the running container¶
Version Actions¶
Use the version action menu to operate on versions.
Action | Description
---|---
Rename | Rename the selected version.
Remove | Remove the selected version.
Download Folder | Download a zip file of any folder.
View Info | View tale metadata for the selected version.
Restore Version | Restore the tale and workspace to the selected version.
Export Version | Export the selected version as a BagIt archive.
As New Tale | Create a new tale based on the selected version.
Deleting Versions¶
Deleted versions are removed permanently and cannot be recovered. A version cannot be deleted if it has an associated recorded run.
Renaming Versions¶
When versions are renamed, they are also renamed on the filesystem.
Restoring Versions¶
By selecting Restore you will copy the contents of the selected version to your active workspace. This includes tale metadata and registered datasets.
Exporting and Publishing Versions¶
Each time you export or publish a tale, if no version exists, one is created for you. Selecting Export Tale or Publish Tale from the tale action menu will export or publish the most recent version. To export a specific version, including any associated recorded runs, select the desired version from the Publish Tale dialog.

Publishing versions¶
Recorded Runs¶
A recorded run is a way to ensure the transparency and reproducibility of your computational workflow by capturing the exact copy of artifacts used to generate results. A recorded run:
Creates a version of your tale
Builds or uses an existing container image
Executes a specified workflow via an entrypoint or master script in an isolated container
Captures an immutable copy of all outputs
Captures system runtime information including memory and CPU usage
Can be published
Note
Recorded runs can be used to capture multiple independent workflows for a single version of a tale. A version can have more than one recorded run.
You can execute a recorded run via the Tale Action Menu or the Tale History panel. Select the Tale History icon to open or close the panel.
To start a recorded run, select Perform Run:

Perform run button¶
This will prompt you to specify an entrypoint or master script for your workflow:

Recorded run dialog¶
Created runs are accessible from the Whole Tale dashboard or from inside your running interactive environment in the ../runs directory:

Recorded run in the dashboard¶

Recorded run in the interactive environment¶
Deleting Runs¶
Runs can only be removed if the associated version is first removed. Deleted runs are removed permanently and cannot be recovered.
Renaming Runs¶
Runs can be renamed via the dashboard and are renamed on the filesystem.
Exporting and Publishing Runs¶
Recorded runs are exported and published based on the associated version. This means that an exported or published tale will have a single version, but may have more than one recorded run.
Exporting and Running Locally¶
Exporting¶
Tales can be exported as a BagIt archive. This is the same format used for publishing.
To export a tale, navigate to the Run page, select the Tale Action (…) menu and then select Export Tale:

Exporting a Tale¶
The tale will be exported as a BagIt archive. BagIt is a standard format defined for archival storage of digital objects.
BagIt Format¶
Tales exported under BagIt include additional metadata and a fetch.txt file that lists where external data resides. Tales exported in this format can also be run locally.
File or Folder | Description
---|---
README.md | Whole Tale readme
bag-info.txt | Bag metadata
bagit.txt | Bag declaration
data | Bag “payload directory”
data/LICENSE | Tale license
data/workspace | Tale workspace for the exported version
data/runs | Recorded runs for the exported version
fetch.txt | BagIt fetch file for external data
manifest-<algorithm>.txt | Payload manifest files for integrity checking
metadata/metadata.json | Tale metadata for the exported/published tale version
metadata/environment.json | Environment metadata for the exported/published tale version
run-local.sh | Script to run the tale locally
tagmanifest-<algorithm>.txt | Tag manifest files for integrity checking
To validate an exported bag using the bdbag package:
pip install bdbag               # install the bdbag tool
bdbag --resolve-fetch all .     # download external data listed in fetch.txt
bdbag --validate full .         # verify payload checksums
Running Tales Locally¶
Exported Tales in the BagIt format include a run-local.sh script that can be run to re-create the tale. Before running run-local.sh, ensure that you have Docker running in the background.
When you’re ready to run the Tale, open a terminal and navigate to the top level of the bag. Run sh run-local.sh and wait for the setup to complete. If this is your first time running a tale locally, it may take some time to download the container image.
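Putting this together, a typical local run looks like the following (the directory name is hypothetical; the bdbag step is only needed when the bag references external data):

cd mytale-bag
bdbag --resolve-fetch all .
sh run-local.sh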
Publishing Tales¶
When your Tale is ready to be preserved and shared, you can publish it to external repositories. When you publish your Tale, it receives a DOI that can be cited by others.
Publishers¶
Zenodo¶
Zenodo is a general-purpose data repository run by CERN. Zenodo allows objects up to 50 GB and can mint DOIs. Zenodo also offers the ability to version data submissions, which is supported in Whole Tale’s system.
Note
Before publishing to production servers, it’s recommended to first publish to the Zenodo sandbox repository.
DataONE¶
DataONE is a network of data centers and organizations that share their information across a network of nodes, where data is replicated and described with rich metadata. Publishing your tale into the DataONE network will allow you to archive your work, collect usage statistics, and make it easy to share with other scientists.
Publishing Your Tale¶
To publish your Tale, select “Publish Tale” from the dropdown menu on the Run page.
Publishing is accessed through the Tale Run page¶
Once the publishing dialog is opened, select which data repository you want to store your Tale in.
Publishing to the Development node¶
If you haven’t connected any third-party accounts, the list will be empty. For instructions on how to connect your account, visit the settings page.
Viewing Publishing Information¶
After you’ve published your tale, you can always view the published location under the metadata tab on the Run page.
Updating Published Tales¶
If a published tale needs to be published again, with the intent of overwriting the previous publication, it can be updated by re-publishing to the same repository. This feature is only available for tales that you own.
If a tale was used for related, subsequent work and shouldn’t overwrite the previous publication, a new tale should be created by first copying the tale. When the copied tale is published, it will be a completely new record.
Whole Tale Generated Files¶
manifest.json¶
This is a metadata document that describes the Tale, inspired by the Research Object Lite specification. The important information contained in this file includes any external data that was used, the locations of data files, and author attributes.
environment.json¶
The environment file contains information about the Tale’s compute environment, including memory limits and Docker information.
LICENSE¶
The LICENSE file describes the Tale’s license. To change the license, navigate to the metadata editor in the Run page.
README.md¶
The README file gives instructions for running your Tale locally without Whole Tale.
Account Settings¶
The Account Settings page allows you to manage your third party integrations. From here you can check which services Whole Tale has access to and manage your API keys.
Connecting to DataONE¶
Connecting to DataONE requires that you have an ORCID or university account. When connecting your Whole Tale account with DataONE, you’ll be asked to log into either one. Whole Tale will automatically request your API token from DataONE once connected.
Whole Tale allows you to connect to the following DataONE instances.
Connecting to Dataverse¶
You can connect your Whole Tale account with Dataverse by providing your Dataverse API key. A guide on obtaining your API key can be found in the Dataverse API Guide.
Whole Tale supports the following instances of Dataverse.

Connecting your account to Dataverse.¶
Connecting to Zenodo¶
Zenodo can be integrated with your Whole Tale account by using your Zenodo API key. You can retrieve your token at the Zenodo token page.
Whole Tale supports the following Zenodo servers.

Connecting your account to Zenodo.¶
Working with Git¶
Whole Tale supports three different ways of working with Git repositories.
Using Git command-line tools from the interactive environment
Creating a new tale from an existing Git repository
Connecting an existing tale to a Git repository
In all cases, you will likely need to work with either Git command line tools or plugins in your selected interactive environment.
Note
You cannot create a tale from a private (password-protected) Git repository.
Commandline¶
For experienced Git users, the simplest way to connect your tale to a GitHub repository is via the command line or client tools in your selected interactive environment. After selecting Run Tale and accessing your interactive session, open a console or terminal:
git init
git remote add origin https://github.com/<your_org>/<your_repo>.git
git pull origin master # or main
This will initialize your tale workspace as a Git repository and associate it with the specified remote. From here, you can synchronize any code changes with your remote repository using your preferred tools.
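For example, once the remote is configured, the standard Git workflow applies for sending workspace changes back to the repository (pushing requires write access to the remote):

git add -A
git commit -m "Update analysis"
git push origin master   # or main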
Create or Connect to Git Repository¶
For users who are unfamiliar with Git commandline tools, you can achieve the above by simply creating a new tale or connecting an existing tale to a public Git repository. You will still need to use Git features in your selected interactive environment.
Image Building and Caching¶
Whole Tale uses an extension to Project Jupyter’s repo2docker to build container images based on a set of configuration files found in the tale workspace. Beginning with version 1.1, container images are built using only the repo2docker configuration files and any extra files specified with the extra_build_files advanced settings option.
The image tag is created as a combination of checksums from:
repo2docker configuration files
tale image object configuration
Dockerfile created by repo2docker using the files above
With this, any tales that have a common environment configuration and images built using the same version of repo2docker will share a container image. This improves build performance and caching efficiency on the platform.
Container Image Registry¶
Note
This is a non-archival registry. We are currently working on a way to support depositing images for archival storage.
Containerization is central to the Whole Tale strategy for improving the long-term runnability and reproducibility of published tales. Prior to v1.2, the container registry was used only as an internal cache. As of v1.2, we are making the images associated with tales publicly available via images.wholetale.org.
To access the image associated with your tale, inspect the run-local.sh script after export.
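As a sketch (the exact image path varies; the docker run line in your exported run-local.sh is authoritative), you can locate and pull the image directly:

grep images.wholetale.org run-local.sh
docker pull images.wholetale.org/<image-id>:<tag>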
Tale Structure¶
This section describes the structure of a tale. Tales consist of the following elements:
Environment
Workspace
External data
Versions and Runs
Metadata
Environment¶
A tale environment consists of:
The selected interactive environment (e.g., RStudio, JupyterLab, MATLAB, STATA). Each corresponds to a repo2docker buildpack and a default command used to start the interactive environment.
repo2docker configuration files specifying software dependencies
The container image built from the above, stored in the container registry.
Workspace¶
The tale workspace is the primary folder containing your code, data, documentation – anything required to reproduce your computational workflow.
External data¶
A read-only folder containing externally referenced data that has been registered with Whole Tale.
Versions and Runs¶
The Saved Versions (versions) and Recorded Runs (runs) folders provide read-only access to versions and recorded runs.
Metadata¶
Tale metadata contains additional information about your tale including title, authors, description, keywords, license, and citations/references.
Integrating with Whole Tale¶
The Whole Tale platform provides several integration options for remote repositories including:
Data registration: allows users to import and reference data from your repository
“Analyze in Whole Tale”: allows users to launch interactive analysis environments using data or Tales published in your repository
Tale publishing: allows users to publish Tale packages to your repository
Data registration¶
Whole Tale enables users to import and work with datasets from a variety of remote resources. After registering data, users can launch interactive analysis environments and package new Tales that reference the data stored on remote systems.
As of release v0.8, supported data providers include:
HTTP: any data accessible via HTTP protocol
DataONE: any dataset available through the DataONE network
Dataverse: any dataset available through the Dataverse network
Globus: data available through the Materials Data Facility (MDF)
DERIVA: currently in testing/development with data from the Pancreatic β-cell Consortium
New data providers can be added by extending the Girder Whole Tale plugin.
Analyze in Whole Tale¶
The “Analyze in Whole Tale” feature enables one-stop data registration and Tale creation from remote repositories. Remote systems simply construct a URL pointing to the Whole Tale integration/ endpoint, providing the URI for a dataset and, optionally, a Tale name and environment. Note that the provided URI must be supported by one of the data registration providers listed above.
For example, the following URL will open the Browse page with the Tale name, data, and environment pre-populated: https://girder.dev.wholetale.org/api/v1/integration/dataone?uri=doi:10.7910/DVN/29911&name=MyTale&environment=rstudio

Pre-populated New Tale Modal¶
After selecting Launch New Tale, the user will be taken to an RStudio environment with the selected dataset mounted under /data.
Bookmarklet for Analyze in Whole Tale¶
You can enable Analyze in Whole Tale for virtually any web resource by using a bookmarklet.
How it works¶
Install the AinWT bookmarklet in your browser’s bookmark toolbar.
When you come across a dataset from a provider that Whole Tale supports, click the AinWT bookmarklet in your bookmark toolbar.
You will be redirected to the Whole Tale dashboard, where a modal will prompt you to create a Tale that includes the selected dataset.
How to Install¶
Firefox¶
Right-click on the following link: AinWT for Firefox, then select the “Bookmark This Link” option.
Chrome¶
Drag this link to the bookmarks toolbar: AinWT for Chrome.
Safari¶
Drag this link to the bookmarks toolbar: AinWT for Safari.
iPhone and iPad¶
On an iPad, iPhone, or iPod Touch, copy this line of text:
javascript:void(window.location='https://dashboard.wholetale.org/mine?name=My%20Tale&asTale=true&uri=%27+encodeURIComponent(location.href))
Bookmark this page or any page, then tap the Bookmarks button to edit the new bookmark, paste the text you just copied, then tap “Bookmarks” and then “Done”.
Dataverse External Tools¶
Whole Tale provides specific integration with Dataverse via the External Tools feature.
The following External Tools manifest can be used to enable Whole Tale integration on your Dataverse installation:
{
  "displayName": "Whole Tale",
  "description": "Analyze in Whole Tale",
  "scope": "dataset",
  "type": "explore",
  "toolUrl": "https://data.wholetale.org/api/v1/integration/dataverse",
  "toolParameters": {
    "queryParameters": [
      {
        "datasetPid": "{datasetPid}"
      },
      {
        "siteUrl": "{siteUrl}"
      },
      {
        "key": "{apiToken}"
      }
    ]
  }
}
Download the manifest
To install, simply POST the manifest to your instance. For example:
curl -X POST -H 'Content-type: application/json' --upload-file wholetale.json \
http://localhost:8080/api/admin/externalTools
Administrator’s Guide¶
Documentation about the installation, configuration, and ongoing operation of WT systems
Installation¶
The Whole Tale deployment and installation process is documented in the terraform_deployment repository.
Monitoring¶
Whole Tale system monitoring is implemented using the Open Monitoring Distribution (OMD). Monitoring of development and production systems is supported by NCSA’s ISDA group. The OMD instance can be accessed using NCSA credentials:
https://gonzo-nagios.ncsa.illinois.edu/WT
Adding Users¶
The WT OMD tenant is configured to use NCSA LDAP for authentication. Users in the grp_wtops LDAP group have access to OMD. Users outside of NCSA can be added via the “Users” menu.
Hosts and Host Groups¶
Hosts are organized into Host Groups based on deployment. We currently have host groups for the production, development, and staging deployments.
Notifications¶
Notifications are currently configured to be sent in bulk based on deployment. All notifications within a 5 minute period will be sent in a single bulk email.
Checkmk Agent¶
WT uses a custom Checkmk agent. The Docker image and definition are available from:
The agent implements four custom checks:
- check-celery: confirms celery_worker is running on all nodes
- check-nodes: confirms nodes are in a ready state
- check-services: confirms that expected Docker services are running
- check-tale: confirms a tale can be launched and stopped on the system
Installation¶
The monitoring stack is installed as part of the Terraform deployment process. The Checkmk agent is deployed as a global docker service.
Slack Integration¶
Checkmk notifications are also sent to the Whole Tale #alerts channel. Configuring Slack integration required the following steps:
On Slack:
- Create a new app (https://api.slack.com/apps) named “Checkmk”
- Activate incoming webhooks
- Copy the service ID from the webhook URL, which should have the form TXXXXXXXX/BXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX

On OMD:
- Go to the Notifications page
- Select “New Rule”
- Under “Notification Method” select “CMK-Slack Websocket integration”
- Enter the above service ID as the first parameter and the channel (without #) as the second
- Save and activate

To test (from https://checkmk.com/cms_notifications.html#Testing%20and%20Debugging):
- Go to the “All Services” page
- Select the “HTTP Dashboard” service
- Select the “Commands” (hammer) button
- Next to “Custom notification”, select “Send”
- Confirm
This should send a test via both email and Slack.
Development Plan¶
Release Notes¶
v1.2¶
New Features¶
Create a new tale populated with data by passing DOI using Dashboard’s “add new tale” modal – ngx-dashboard#281.
Allow to upload entire folders to Home / Workspace via Dashboard – ngx-dashboard#268.
Allow anonymous access to public Tales – ngx-dashboard#277, wt_versioning#45.
Allow to view logs from running Instances – ngx-dashboard#311, girder_wholetale#557, instance_logger.
Publicly available Docker Registry with Tale images – girder_wholetale#551.
Support for OpenICPSR as a data provider – girder_wholetale#543.
Create a new Tale from a specific version – ngx-dashboard#280, wt_versioning#43, girder_wholetale#536.
Add access to advanced Tale configuration via Metadata tab – ngx-dashboard#274, ngx-dashboard#275, girder_wholetale#538.
Improved Home/Workspace/Run performance by using NFS mounts – gwvolman#175, gwvolman#172.
Tale catalog search is now case insensitive and includes category field – ngx-dashboard#278.
Increase the amount of metadata stored with registered external resources – girder_wholetale#542.
Allow to cancel long running jobs (image build, recorded run) – girder_wholetale#565, ngx-dashboard#310, gwvolman#177.
Add metrics/actions tracking – girder_wholetale#564, gwvolman#186.
Update repo2docker to 2022.10 and add support for Julia based images in WholeTale – repo2docker#6, repo2docker_wholetale#47.
Bugfixes¶
Improved notifications handling – ngx-dashboard#291.
Fixed file manager showing content not matching the selected tab – ngx-dashboard#265.
FUSE DMS now properly handles empty files – girderfs#35.
Autocreate parent directories during item/file upload – virtual_resources#20.
Pull docker image before executing a recorded run – gwvolman#173.
Improve error handling in Analyze in WT – ngx-dashboard#279.
Respect repo2docker version on Tale import – girder_wholetale#549.
Improve Dataverse DOI resolution – girder_wholetale#546.
Fix for Dataverse DOI missing ORCID info – girder_wholetale#541.
Improve import by using direct file access rather than WebDAV – girder_wholetale#545.
Properly mount environment definition in the exported run-local.sh script – girder_wholetale#540.
Fix login via OAuth2 providers other than Globus – girder_wholetale#537.
Allow unauthenticated read access to docker registry from tasks – gwvolman#183.
Improve error logging in Recorded Runs – gwvolman#184.
Remove the use of deprecated method from packaging – gwvolman#187.
Minor UI fixes – ngx-dashboard#300, ngx-dashboard#302.
Fix the expiration period for Recorded Run related tokens – girder_wholetale#561, wt_versioning#49.
Fix server error while trying to access non-existing virtual folder – virtual_resources#21.
Improve detection of Instance readiness – girder_wholetale#563.
Prevent re-registration of external data files – girder_wholetale#556.
Fix issue with building image for Recorded Runs – gwvolman#188
Prevent a situation where manifest doesn’t have image tag – girder_wholetale#558.
Don’t try to update a notification that doesn’t exist – girder_wholetale#560.
Prevent an unauthenticated user from accidentally accessing the Interact tab – ngx-dashboard#301.
v1.1¶
New Features¶
Recorded runs (lite) using ReproZip and Docker stats – girderfs#29, girderfs#27, gwvolman#151, gwvolman#155, gwvolman#156, gwvolman#161, gwvolman#164, gwvolman#171, wt_versioning#27, wt_versioning#29, wt_versioning#30, wt_versioning#31, wt_versioning#34, wt_versioning#35, wt_versioning#38, wt_versioning#39, wt_versioning#41, ngx-dashboard#243.
Support for installing custom STATA packages via install.do – gwvolman#154, repo2docker_wholetale#26.
Improved image build performance/caching – gwvolman#158, gwvolman#160, repo2docker_wholetale#21.
Updated XPRA-based envs for Stata and Matlab – repo2docker_wholetale#22, repo2docker_wholetale#24, repo2docker_wholetale#26, repo2docker_wholetale#29, repo2docker_wholetale#33, repo2docker_wholetale#37, repo2docker_wholetale#40.
Remote iframe support for RStudio/Jupyter – repo2docker_wholetale#38.
DERIVA integration – girder_wholetale#510, girder_wholetale#519, girder_wt_data_manager#51, wt_home_dirs#33.
Add ability to register raw data from zip/bdbag – girder_wholetale#497, girder_wholetale#517.
- API changes for CORE2
Add ability to relinquish ownership – girder_wholetale#504, girder_wholetale#506, girder_wholetale#508.
Remote iframe support for RStudio/Jupyter (configuration change).
Better handling for auth originating from an external domain – girder_wholetale#511, girder_wholetale#512.
Add ability to import non-tale datasets from Zenodo – girder_wholetale#501, girder_wholetale#516.
DataONE publishing improvements – gwvolman#167, gwvolman#168, gwvolman#169.
Better support for storing SSH credentials in Home – girderfs#30.
Support for accessing private external data with user credentials – girder_wt_data_manager#47, girder_wholetale#465, girder_wholetale#531, girder_wholetale#528.
Automatic checksum validation of external data – girder_wt_data_manager#54, girder_wt_data_manager#53, girder_wholetale#524.
Ability to preview Tales for specific versions – wt_versioning#24, wt_versioning#37, ngx-dashboard#218.
Allow specifying a subset of a dataset during import via path – girder_wholetale#520.
New version of WT vocabulary has been published – girder_wholetale#533.
Bugfixes¶
UI fixes:
Properly space files in file browser for Chrome >= 91.x – ngx-dashboard#206
Interact tab autoupdates when container starts – ngx-dashboard#217
Display instances created from shared Tales in the running Tales panel – ngx-dashboard#228
Fix encoding in AinWT parameters – ngx-dashboard#252, ngx-dashboard#263
Minor improvements – ngx-dashboard#242, ngx-dashboard#257, ngx-dashboard#262, ngx-dashboard#264
Properly preserve the computational environment during import/export – girder_wholetale#515
Better error reporting for WT FUSE – girderfs#31
Refactor of WT FUSE – girderfs#26
DMSFS thread safety improvements – girderfs#33
Fix “exact name” search for virtual resource – virtual_resources#17
Raise exception during rename if folder/item with the same name exists – virtual_resources#19
Avoid hardcoding docker volumes mount point – gwvolman#163
Prevent publishing the same Tale twice – gwvolman#170
WT DMS now uses requests – girder_wt_data_manager#49
Handle gzipped transfers in DMS – girder_wt_data_manager#52
Correctly handle external data in exported bags – girder_wholetale#518, girder_wholetale#525
Fix cleaning Tale data upon removal – wt_versioning#28, wt_versioning#33, wt_versioning#36, wt_home_dirs#34, girder_wholetale#499
Provider-specific fixes:
Dataverse:
Port to requests and minor fixes – girder_wholetale#500
Utilize more metadata for creating Tales during import – girder_wholetale#464
DataONE:
Use proper headers for accessing data – girder_wholetale#522
Fix integration for AinWT – girder_wholetale#532
Globus:
Don’t assume the type of unique ID a dataset uses – girder_wholetale#526
Fix build issues in R/Rocker images – repo2docker_wholetale#27, repo2docker_wholetale#32, repo2docker_wholetale#39
v0.9¶
Features:
Support for storing and using third party API keys from Zenodo, Dataverse, and DataONE
Support for registering data from Zenodo
Added support for publishing and importing Tales to and from Zenodo
v0.8¶
Features:
A re-designed main page for the dashboard
A new, unified, notification system
Support for Dataverse hierarchy
Added ability to change compute environments
v0.6¶
Features:
Restructured Dashboard “Run” view
Tale workspace support
Ability to add/remove data to a running Tale (note: removed Data panel from Run and Compose views)
Change to registered data model (note: now limits operations on external datasets)
Analyze in WT support for DataONE
Bugfixes:
Handle failures of Dataverse installation list
Fixed issue when registering data from Globus (MDF)
Detection/correction of internal-state desync (“blue screen”)
Fix for running git clone in Home
v0.5¶
This release includes the following features. Note that with this release we’re adopting detailed release notes:
Refactor of data registration framework:
Globus registration (whole-tale/girder_wholetale/165)
Refactor DataONE lookup (whole-tale/girder_wholetale/177)
Change to use DMS (whole-tale/girder_wholetale/168, whole-tale/gwvolman/30)
Refactor task handling (whole-tale/girder_wholetale/170)
Added Tale import support (whole-tale/girder_wholetale/173, whole-tale/gwvolman/32, whole-tale/dashboard/287)
Dataverse integration:
Support ingest from Dataverse (whole-tale/girder_wholetale/175)
External tools integration (whole-tale/girder_wholetale/180)
Minor changes/bug fixes:
Optional DataMap parameters (whole-tale/girder_wholetale/178)
Removed obsolete plugin config options (whole-tale/girder_wholetale/186)
Lookup error handling (whole-tale/girder_wholetale/190)
Chained redirects in DOI (whole-tale/girder_wholetale/188)
Add OPTIONS to methods allowed by DAV read privilege (whole-tale/wt_home_dirs/17)
Propagate file size changes (whole-tale/wt_home_dirs/16)
Login route handling (whole-tale/dashboard/300)
Run Tale from view page (whole-tale/dashboard/pull/273)
Local storage problem (whole-tale/dashboard/326)
Allow manual configuration of Dataverse instances (whole-tale/girder_wholetale/182)
Updated registration modal (whole-tale/dashboard/324)
Re-enabled http check (whole-tale/girder_wholetale/181)
Upgraded to Girder 2.5.0, no longer running as root
Deployment:
Added DMS volume (whole-tale/terraform_deployment/38)
v0.4¶
This release includes the following features:
Redesigned user interface based on user experience testing, including ability to access running tales directly (via iframes)
Environment variables can be passed to a running Tale using containerConfig.environment (whole-tale/girder_wholetale#102, whole-tale/gwvolman@b4c068a0)
Tales accept multiple sources as input data (whole-tale/girder_wholetale#98)
WT Homes/Workspaces support moving data to other assetstores (whole-tale/wt_home_dirs#9)
Improved monitoring and backup
v0.3¶
This release includes the following features:
Automated deployment for development instances of WT
HTTPS for frontends/Wildcard certificate support
Migration process from GridFS to WebDav
v0.2¶
This release includes the following features:
Home directories (WebDav)
Backup of database and home directories
Container repository of frontends
Interface for creating new frontends
v0.1¶
This initial release includes the following features:
User dashboard
Ability to create and run tales
Globus and ORCID authentication
Globus, HTTP and DataONE ingestion
Jupyter and RStudio frontends
POSIX filesystem for remote data
Scalable infrastructure as code
Release schedule¶
Future releases¶
Recorded run/validation framework
CLI for WholeTale
Kubernetes support
Support for verification/review workflows
Preservation of images
v1.0 (5/2021)¶
Create, export, and publish tale versions
Share tales with other users
Secured routes to instances
UI refactor
Support for MATLAB and Stata environments
Import and connect to Git repositories
v0.9 (4/2020)¶
Addressing user feedback from previous releases
Brown Dog integration (1.6.2)
Native support for WT in Jupyter and RStudio (1.2.1)
Tracking and storing Jupyter provenance to DataONE (1.4.6)
Indexing, remixing of the frontends (1.2.3)
OAI-ORE filesystem (1.3.4)
Tale validation framework
v0.8 (8/2019)¶
Addressing user feedback from MVP
Bug fixes
v0.7 (5/19)¶
Document, store, and publish Tales in DataONE
Export Tales to ZIP and BagIt
Ability to customize the Tale license
Tales now keep record of citations for external datasets that were used
Ability to add multiple authors to a Tale
Ability to run Tales locally
Misc UI improvements
Environment customization
v0.6 (3/19)¶
Refactored UI based on usability testing
Full Globus Integration (1.1.2)
Tale Workspace support
User namespacing system (1.4.3) (overview)
Ability to dynamically add/remove data from running Tales
Analyze in WT support for DataONE
v0.5 (12/18)¶
One-click data-to-Tale creation
Dataverse data registration
Dataverse external tool support
v0.4 (7/18, MVP)¶
Redesigned UI
User Documentation
Overall stability improvements
Automated testing
v0.3 (6/18)¶
Automated deployment for development instances of WT
v0.2 (5/18)¶
Home directories (WebDav)
Backup of database and home directories
Container repository of frontends
Architecture¶
Overview¶
Whole Tale provides a scalable, web-based, multi-user platform for the creation, publication, and execution of tales – executable research objects that capture data, code, and the complete software environment required for reproducibility. It is designed to enable researchers to publish their code and data along with required software dependencies to long-term research archives, simplifying the process of creating and verifying computational artifacts.
The Whole Tale platform includes the following primary components:
Identity and access management
Dashboard
Whole Tale API
Whole Tale Filesystem
Image registry
Provider API
User environments
The following diagram illustrates the logical relationship between key system components:

The Whole Tale platform leverages and extends a variety of standard components and services including the OpenStack cloud platform (via Jetstream2), Docker Swarm container orchestration platform, Celery/Redis for distributed task management, MongoDB for data management, Traefik reverse proxy, Open Monitoring Distribution for monitoring, as well as interactive analysis environments such as RStudio and Jupyter. Whole Tale leverages and extends the Girder REST API framework.

Identity and Access Management¶
Identity and access management are implemented via OAuth 2.0/OpenID Connect. Via the Girder OAuth plugin, the platform can be configured to use common OAuth providers including Google, Github, Bitbucket, and Globus. The production WT service leverages Globus Auth for federated login because it provides support for:
InCommon IdPs via CILogon
XSEDE/Argonne, ORCID and other research-centric systems
Tokens that can be used to initiate Globus transfers
The publishing framework uses ORCID for authentication into the DataONE network.
Dashboard¶
The dashboard is the primary interface into the Whole Tale system for users to interactively launch, create, and share Tales. It is the reference interface for the Whole Tale API, built using the Angular open-source web framework.

Whole Tale API¶
The Whole Tale API extends the Girder framework, adding Whole Tale capabilities including the following (a usage example follows the list):
Images, Tales, Instances, Versions, and Runs
Distributed home and Tale workspace folders
Importing data from remote repositories
Publishing Tales to remote repositories
Remote data access and caching
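Because these extensions are exposed as ordinary Girder REST resources, they can be exercised with the standard girder_client Python library. The sketch below lists public Tales; the apiUrl points at the public deployment, and the query parameters should be treated as illustrative rather than authoritative:

    import girder_client

    # Connect to a Whole Tale Girder API (URL shown for the public
    # deployment; adjust for a local instance).
    gc = girder_client.GirderClient(apiUrl="https://girder.wholetale.org/api/v1")

    # Anonymous access is enough to browse public Tales.
    tales = gc.get("tale", parameters={"limit": 5})
    for tale in tales:
        print(tale["_id"], tale.get("title"))

    # Authenticated operations (e.g., launching instances) require a login
    # or API key, e.g.:
    # gc.authenticate(apiKey="...")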
Via Celery/Redis, the Whole Tale API provides a scalable framework for the following tasks (a minimal sketch appears after the list):
Building and managing Tale images
Launching Tale instances (e.g., RStudio, Jupyter environments)
Ingesting data from external sources
Executing recorded runs
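A minimal sketch of what such a task looks like under Celery with a Redis broker; the task name and body are hypothetical stand-ins (the real tasks live in gwvolman):

    from celery import Celery

    # Hypothetical app configuration; gwvolman defines the real task set.
    app = Celery("wholetale",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/0")

    @app.task(bind=True)
    def build_tale_image(self, tale_id):
        """Illustrative stand-in for an image-build task."""
        # A real task would invoke repo2docker against the Tale workspace
        # and push the result to the image registry.
        return {"taleId": tale_id, "status": "built"}

    # A caller queues the work asynchronously and can revoke it, which is
    # how cancellation of long-running jobs (cf. the v1.2 notes) works:
    # result = build_tale_image.delay("abc123")
    # result.revoke(terminate=True)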
The following diagram provides an overview of key components of the Whole Tale API:

Each user has a home folder that is accessible via the Whole Tale filesystem to every running Tale instance (and also exposed via WebDav to be optionally mounted on their local system). Every Tale is defined by its environment (e.g., RStudio/Jupyter); a workspace folder containing code, data, and narrative; and an optional set of externally-referenced datasets. When a user runs/launches a Tale, they get a Tale instance – a running Docker container based on the defined environment with the Tale workspace, external data, and home directory mounted and accessible.
Girder¶
Girder is an open source web-based data management platform intended for developing new web services with a focus on data organization, user management, authentication and authorization. It has been adopted by several related projects including yt.Hub, the NSF-funded Renaissance Simulations Laboratory, Crops in silico, and the Einstein Toolkit DataVault.
Whole Tale leverages Girder for the following features:
OAuth flow for user authentication
User and group management including advanced access control models
Metadata management including file, folder, and collection abstractions
Job management framework including notifications
API key and token management
Lightweight and high-performance interface to MongoDB
Environment Customization¶
As of release v0.6, environment customization is implemented via the Recipe model.

A Tale image is defined by a “recipe”, which refers to a Github repository and commit ID that conforms to the Whole Tale image definition requirements. Future releases will include integration with Project Jupyter’s repo2docker framework.
Scalable task distribution (gwvolman)¶
The Whole Tale API implements a generic and scalable task distribution framework via the popular Celery system. The gwvolman plugin implements tasks including:
Building and pushing images
Managing services (Swarm) including start/stop/update
Managing container volumes (mount/unmount)
Ingesting data from external providers
Publishing Tales to external providers (v0.7)
Whole Tale Filesystem¶
The Whole Tale filesystem provides distributed access to system data via a POSIX interface. This includes enabling access to home and Tale workspace data and managing access to and caching of externally registered data.

Distributed folder access (wt_home_dir)¶
The Whole Tale platform includes an integrated WebDAV server (via WsgiDav) to enable distributed access to home and Tale workspace folders. The WebDAV server is integrated with Girder for authentication and to synchronize filesystem metadata. This means that changes made via WebDAV or Girder (e.g., the WT Dashboard) are always reflected in the exposed filesystem.
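Because the server speaks standard WebDAV, home folders can be inspected with any WebDAV client. A minimal sketch using the requests library, with a placeholder URL and placeholder credentials:

    import requests

    # PROPFIND with Depth: 1 lists the immediate children of a collection.
    resp = requests.request(
        "PROPFIND",
        "https://example-wt-host/homes/alice/",      # placeholder URL
        auth=("alice", "girder-token-or-password"),  # placeholder credentials
        headers={"Depth": "1"},
    )
    print(resp.status_code)  # 207 Multi-Status on success
    print(resp.text[:500])   # XML listing of files and folders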

Data Management Service (girder_wt_data_manager)¶
The Whole Tale Data Management system is responsible for managing the data used in Tales. The main components include (a simplified sketch follows the list):
Transfer subsystem that manages movement of data from external data providers to local storage in Whole Tale. This is achieved through provider-specific transfer adapters.
Storage management system that acts as a local data cache, selectively retaining or clearing local copies of externally hosted data based on frequency of use.
Filesystem interface that allows tales to access cached data through a standard POSIX interface.
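The storage manager's behavior can be pictured as a usage-aware cache: fetch a file on first access, serve it locally afterwards, and evict least-recently-used copies when over quota. A toy sketch of the idea (not the actual DMS code):

    import os
    import time
    import urllib.request

    class SimpleDataCache:
        """Toy model of the DMS storage manager: fetch-on-demand plus LRU eviction."""

        def __init__(self, cache_dir, max_bytes):
            self.cache_dir = cache_dir
            self.max_bytes = max_bytes
            self.last_used = {}  # local path -> last access time
            os.makedirs(cache_dir, exist_ok=True)

        def open_file(self, url, name):
            path = os.path.join(self.cache_dir, name)
            if not os.path.exists(path):
                # Stand-in for a provider-specific transfer adapter.
                urllib.request.urlretrieve(url, path)
            self.last_used[path] = time.time()
            self._evict(keep=path)
            return open(path, "rb")

        def _evict(self, keep):
            total = sum(os.path.getsize(p) for p in self.last_used)
            candidates = {p: t for p, t in self.last_used.items() if p != keep}
            # Evict least-recently-used copies until under quota.
            while total > self.max_bytes and candidates:
                victim = min(candidates, key=candidates.get)
                total -= os.path.getsize(victim)
                os.remove(victim)
                self.last_used.pop(victim)
                candidates.pop(victim)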

Python client (girderfs)¶
Whole Tale provides girderfs, a Python client/library to mount the Whole Tale filesystem volumes. This is an intermediate layer representing data in Whole Tale as a POSIX filesystem that interfaces with the Data Management system. It is based on fusepy, a thin Python wrapper for FUSE development.
This component supports the following mount types (a simplified sketch follows the list):
remote: mount Girder folders via the REST API
direct: mount a local Girder assetstore
wt_dms: mount via the Whole Tale DMS
wt_work: mount a Tale workspace via davfs
wt_home: mount a user home directory via davfs
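In fusepy terms, such a library boils down to an Operations subclass whose callbacks translate POSIX calls into Girder/DMS requests. A heavily simplified, read-only sketch (illustrative only, not girderfs itself):

    import errno
    import stat
    from fuse import FUSE, FuseOSError, Operations  # provided by fusepy

    class ToyGirderFS(Operations):
        """Read-only toy: serves a dict of {path: bytes} as a filesystem."""

        def __init__(self, files):
            # A real client would query the Girder REST API or DMS here.
            self.files = files

        def getattr(self, path, fh=None):
            if path == "/":
                return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
            if path in self.files:
                return {"st_mode": stat.S_IFREG | 0o444,
                        "st_size": len(self.files[path]), "st_nlink": 1}
            raise FuseOSError(errno.ENOENT)

        def readdir(self, path, fh):
            return [".", ".."] + [p.lstrip("/") for p in self.files]

        def read(self, path, size, offset, fh):
            return self.files[path][offset:offset + size]

    # Mounting (foreground for demonstration):
    # FUSE(ToyGirderFS({"/hello.txt": b"hi\n"}), "/mnt/toy", foreground=True)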
Provider Framework¶
The Whole Tale provider framework is designed to enable easy extension to support new providers for data registration, “Analyze in WT” capabilities, and publishing.
The framework consists of the following interfaces (sketched after the list):
ImportProvider: Search, register, and access data from external repositories
Integration: Translate requests for Analyze in Whole Tale
PublishProvider: Publish Tales to external repositories
TransferHandler: Protocol handlers for transferring data (e.g., HTTP, Globus)
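Supporting a new repository then amounts to implementing these interfaces. A skeletal sketch of the first and third (the class and method names are illustrative, not the plugin's actual base classes):

    from abc import ABC, abstractmethod

    class ImportProvider(ABC):
        """Resolve and register datasets from one external repository."""

        @abstractmethod
        def matches(self, url: str) -> bool:
            """Return True if this provider can handle the given URL/DOI."""

        @abstractmethod
        def register(self, url: str) -> dict:
            """Return a metadata record ('data map') for the dataset."""

    class PublishProvider(ABC):
        """Publish a Tale to one external repository."""

        @abstractmethod
        def publish(self, tale: dict, token: str) -> str:
            """Deposit the Tale and return its persistent identifier."""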
Remote data registration and access¶
Combined with the Whole Tale filesystem and data management system, the provider model provides an abstraction over heterogeneous data sources (APIs), exposing a consistent interface to both the Whole Tale dashboard and running tale instances. Datasets from DataONE, Dataverse, and Globus are exposed to running Jupyter and RStudio containers as elements of a POSIX filesystem. The registration process captures only the metadata of the remote dataset, and the data management service retrieves the actual bits only when used. This means that only those portions of the remote dataset that are actually used are transferred and cached in Whole Tale.

User Environments¶
A fundamental design principle of the Whole Tale system is that users must be able to conduct and publish their analysis using their software environment of choice. Common environments such as RStudio and Jupyter should be provided by the system. Users must be able to customize these environments by selecting specific software versions. They must also be able to define and share new environments that may not be part of the base system.
In v0.6, the base environments are defined by the Recipe and Image models. Recipes refer to specific Github repositories and commit hashes. Images are the built Docker images stored in the Whole Tale image registry.
As of v0.7, we have adopted the Binder repo2docker model where users can easily customize software in the environment.
About¶
What is Whole Tale?¶
Whole Tale is an NSF-funded Data Infrastructure Building Block (DIBBS) initiative to build a scalable, open source, web-based, multi-user platform for reproducible research enabling the creation, publication, and execution of tales – executable research objects that capture data, code, and the complete software environment used to produce research findings.
A beta version of the system is available at https://dashboard.wholetale.org.
The Whole Tale platform has been designed based on community input primarily through working groups and collaborations with researchers.
The Whole Tale project is involved in several initiatives to train researchers for reproducibility as well as use of Whole Tale in the classroom.
A Platform for Reproducible Research¶
The goal of Whole Tale is to enable researchers to define and create computational environments to (easily) manage the complete conduct of computational experiments and expose them for analysis and reproducibility. The platform addresses two trends:
improved transparency so people can run much more ambitious computational experiments
better computational experiment infrastructure to allow researchers to be more transparent
Why Whole Tale?¶
The Whole Tale platform is being developed to simplify the adoption of practices that improve the understandability and reproducibility of computational research.
Technological Sources of Impact¶
Virtually all published discoveries today have data and computational components. The mismatch between traditional scientific dissemination practices and modern computational research practice leads to reproducibility concerns.
Computational Reproducibility¶
The Whole Tale platform supports computational reproducibility by enabling researchers to package the code, data, and information about the workflow and computational environment needed to review and reproduce the results of computational analyses reported in published research. It implements this definition by supporting explicit citation of externally referenced data and by capturing the artifacts and provenance information needed to facilitate understanding, transparency, and execution of the computational processes and workflows at the time of publication.
Who is Whole Tale for?¶
Researchers¶
Researchers are increasingly adopting practices to improve the transparency and reproducibility of their own computational research. Some are self-motivated to improve their own rigor and transparency while others are responding to the demands and requirements of academic communities and journals. Some are advanced tool users with sophisticated methods of packaging and distributing scientific software, often with automated testing and verification. Others are more concerned with the research product than learning new tools and infrastructure for sharing and transparency.
Societies/Communities¶
Academic societies, associations and communities are responding to challenges in the reproducibility of published research by adopting recommendations, guidelines, and policies that impact publishers, editors, and researchers. Communities are beginning to adopt practices encouraging or requiring sharing of code and data. Some are even implementing verification and evaluation processes to confirm the reproducibility of published work.
Editors¶
In response to the demand of academic communities to address problems of reproducibility and reuse, journal editors are increasingly adopting guidelines and enforcing policies for the sharing of data, code and information about the software environment used in published research based on computational analysis.
Curators and Reviewers¶
The scholarly publication process has built-in mechanisms for anonymous peer review. Some communities are adopting replication practices to ensure that published research can be replicated at various levels. Anonymous reviewers and curators of research artifacts play an important role in the quality of research artifacts.
Repository Developers and Operators¶
Developers and operators of research data repositories are faced with the challenge of addressing the needs of their communities through support for new types of scholarly objects, methods of access, and processes for review and verification.
What is a Tale?¶
A tale is an executable research object that combines data (references), code (computational methods), computational environment, and narrative (traditional science story). Tales are captured in a standards-based format complete with metadata.

Whole Tale is an ongoing NSF-funded Data Infrastructure Building Blocks (DIBBS) project initiated in 2016 with expected completion in February 2023.
Our Approach¶
Open Source Platform¶
The Whole Tale project is developing an open source platform and welcomes both re-use and contribution from the research infrastructure community. As a building block, the platform is intended to be deployable in a variety of environments, including the primary service at https://dashboard.wholetale.org.
Open Source Curriculum¶
The Whole Tale project is dedicated to improving education and training for reproducible research practices through the creation of open lessons for classroom instruction and through contributions to programs such as the Carpentries.
Open Infrastructure¶
The Whole Tale project aspires to contribute to and increase the impact of existing open infrastructure projects. The success of Whole Tale depends on our ability to connect users to the tools and information they need to create reproducible research packages. The Whole Tale platform leverages components from projects including Girder, The Rocker Project, and Project Jupyter.
Lowering Barriers¶
The Whole Tale project draws on the experience and expertise of researchers and communities working at the forefront of reproducible research practices and applies these lessons to the development of a platform intended to simplify and broaden the adoption of such practices.
Community Engagement¶
Through our working groups and workshops, we engage with a broad community of researchers, educators, and infrastructure developers to inform Whole Tale project direction and platform design.
Usability¶
As we develop a web-based platform, we recognize the importance of usability and user experience and will continue to conduct regular usability tests to improve the system.
Domain Case Studies¶
This section documents detailed case studies of domains that inform the Whole Tale platform design. When looking at a particular discipline or sub-discipline, we explore the following questions:
What is the state of computational reproducibility?
What do associations/societies and top journals say about data/code sharing?
Are related initiatives driven by motivated researchers or editors?
Have they implemented or are they considering badging or verification?
Why is the field different from others we’ve studied?
Are there examples of existing “tales” (e.g., research compendia)?
Are there relationships with other open science infrastructure projects?
Archaeology¶
Introduction¶
According to Marwick (2017a), the field of archaeology has had a long-term commitment to empirical tests of reproducibility by returning to excavation and survey sites, but has only recently started to make progress in testing reproducibility of statistical and computational results. As in many other fields, data sharing has had increased attention over the past decade and sharing code and analytical materials only over the last few years. As noted by Marwick (2018), the Journal of Archaeological Science adopted a “data disclosure” policy in 2013 and author guidelines were updated only in 2018 to encourage sharing of “software, code, models, algorithms, protocols, methods and other useful materials related to the project.”
Open access to archaeological data is sometimes problematic due to cultural sensitivities, issues of ownership (copyright or international stakeholders), and the impact of exposure (e.g., risks of looting). Data publishing is also limited due to costs (key discipline repositories are fee-based) and researcher motivations. Community norms do not encourage/reward data and code publishing, and no journals require archaeologists to make code and data available by default. Discipline-specific repositories include the Archaeological Data Service, the Digital Archaeological Record (tDAR), and Open Context.
Marwick (2017a) outlines a set of basic principles to improve computational reproducibility in archaeological research. These are similar to guidelines provided in other fields:
Make data and code openly available
Publish only the data used in the analysis
Use a programming language to write scripts for analysis and visualization
Use version control
Document and share the computational environment
Archive in online repositories that issue persistent identifiers
Use open licenses
Marwick and the archaeology community have adopted the concept of the “research compendium” to refer to data/code packages. This concept originated with Gentleman and Temple Lang (2004): “We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data,…), and as a means for distributing, managing and updating the collection.”
Marwick (2017a) describes a specific case study to illustrate his principles for reproducibility and demonstrate the research compendium concept:
https://github.com/benmarwick/1989-excavation-report-Madjedbebe
Using a combination of R + knitr/Rmarkdown + Github + Docker + Figshare
Licenses: CC0 for data, MIT for code
Figshare was selected because of either the fee to publish in discipline repositories or technical limitations integrating with Github.
Marwick (2017a) suggests that R is widely used by archaeologists “who share code with their publications” in part because it is free, widely used in academic research including statistics, and supports experimental packages. He selected Git because commits can be used to indicate the exact version of code used during submissions (note: he started with a private repository that was made public after publication). He selected Docker for convenience, building his image on the existing rOpenSci image.
He found that the primary issue is the time required to learn the various tools and recommends incentivizing training in and practice of reproducible research. He also recommends changing editorial standards of journals by requiring submission of “research compendia”.
Education/training/outreach¶
The Archaeology community has organized “reproducible research” Carpentry workshops and contributed to the open source curriculum.
Example research compendia¶
Climate change stimulated agricultural innovation and exchange across Asia
This research compendium has been published by d’Alpoim Guedes and Bocinsky via a combination of Github and Zenodo. The paper, compendium, and data are each published as separate citable artifacts. The data package includes all raw (downloaded) and derived data generated by the analysis (~3GB). The code is packaged as an R package. The environment is provided via a Dockerfile that adds packages on top of the rocker/geospatial:3.5.1 image. The image has been pushed to Dockerhub and is therefore immediately re-runnable.
The authors provide multiple methods of re-running the analysis: by cloning and running the Github repository locally, via the published Docker image, or by building and running the Docker image locally. The primary entrypoint is a single R-Markdown script.
Data is downloaded from multiple sources during execution.
The R FedData package is used to dynamically download data published from the NOAA Global Historical Climatology Network based on spatial constraints
Instrument data published via a NOAA FTP server (URL)
An Excel spreadsheet published as supplemental data via Science (URL)
They also use data from The Digital Archaeological Record (tDAR) (requires authentication)
Elevation data via the Google Elevation API
This compendium suggests the following use cases:
Support for rocker-project images
Ability for researchers to dynamically and programmatically register immutable published datasets
Support for authenticated data sources
Ability to register data from FTP services
Ability to store arbitrary credential information (e.g., in Home)
Support for projects where Github is the active working environment
Support for re-using the Github README for Tale description
Association and display of citation information for associated materials
Automatic citation of source data, where possible
Separate licenses for code and data
Additional Examples¶
The following are examples of “research compendia” from Archaeology:
Marwick (2018) reports on three pilot studies exploring data sharing in archaeology. He discusses the ethics of data sharing due to work with local and indigenous communities and other stakeholders and describes archaeology as a “restricted data-sharing and data-poor field.”
References¶
Archaeology Data Service/Digital Antiquity 2011 Guides to Good Practice. Electronic document, http://guides.archaeologydataservice.ac.uk/
Journal of Archaeological Science 2018 Guide for Authors. Journal of Archaeological Science. Electronic document; https://www.elsevier.com/journals/journal-of-archaeological-science/0305-4403/guide-for-authors (via wayback)
Kansa, Eric C., and Kansa, Sarah W. 2013 Open Archaeology: We All Know That a 14 Is a Sheep: Data Publication and Professionalism in Archaeological Communication. Journal of Eastern Mediterranean Archaeology and Heritage Studies 1 (1):88–97
Marwick, B. J. (2017a) Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation. Archaeol Method Theory (2017) 24: 424. https://doi.org/10.1007/s10816-015-9272-9
Marwick, B. et al. (2017b) Open science in archaeology. SAA Archaeological Record, 17(4), pp. 8-14.
Marwick (2017c) Using R and Related Tools for Reproducible Research in Archaeology. In Kitzes, J., Turek, D., & Deniz, F. (Eds.) The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press. https://www.practicereproducibleresearch.org/case-studies/benmarwick.html
Marwick, B., & Birch, S. 2018 A Standard for the Scholarly Citation of Archaeological Data as an Incentive to Data Sharing. Advances in Archaeological Practice 1-19. https://doi.org/10.1017/aap.2018.3 https://doi.org/10.17605/OSF.IO/KSRUZ (code/data)
Marwick, B., Boettiger, C., & Mullen, L. (2017d). Packaging data analytical work reproducibly using R (and friends). The American Statistician https://doi.org/10.1080/00031305.2017.1375986
Nüst, Daniel, Carl Boettiger, and Ben Marwick. 2018. “How to read a research compendium.” arXiv:1806.09525
Open Digital Archaeology Textbook. https://o-date.github.io/draft/book/
Economics¶
Introduction¶
According to Levenstein (2017, 2018), the field of economics has a long history of interest in reproducibility and replicability starting in the 1980s. Early studies (e.g., Dewald, 1986) found low replication rates in published research. The field also has a long history of data sharing, with policies starting as early as 2003. By 2015, 27 journals required data sharing. Ten journals encourage replication studies. (SOURCE)
In 2018, the American Economic Association (AEA) appointed a data editor in part to improve access to and reproducibility of published research. Economics faces additional challenges due to the use of commercial data, requiring waivers because of both IP and confidentiality concerns. While macroeconomic research tends to use public data disseminated by government agencies and central banks, microeconomics research often relies on private/confidential administrative data. Data packages used to be published as supplemental information via the journal web platform. In 2019, all historical packages were migrated to the openICPSR platform, and future packages are preserved there as well (Vilhuber, 2020). AEA currently has over 4,000 replication packages, all of which contain software/code, and many of which also contain released data.
Example: American Economic Association¶
Kingi et al (2018) report the results of an effort to reproduce a subset of studies published by the AEA using only the information provided by the authors during submission. The AEA is interested in performing after-the-fact verification of published code and data and is exploring the adoption of workflows similar to those used by the American Journal of Political Science (AJPS). They have over 300 examples of validated data/code packages.
A major challenge for the AEA is the widespread use of commercial statistical software. Over 70% of submitted packages require Stata or Matlab. SAS is widely used by the Census Bureau.
User Stories¶
Based on the above cases, we see the following user stories.
Ex-post validation of AEA deposits: The AEA data editor (or graduate student) should be able to perform after-the-fact validation of published data/code packages by importing them into Whole Tale.
Stata support: An AEA researcher should be able to publish a Tale based on the Stata environment. A reviewer or user should be able to re-run the Tale in Stata.
Matlab support: An AEA researcher should be able to publish a Tale based on the Matlab environment. A reviewer or user should be able to re-run the Tale in Matlab.
SAS support: An AEA researcher should be able to publish a Tale based on the SAS environment. A reviewer or user should be able to re-run the Tale in SAS.
ICPSR integration: Whole Tale should support registering data from and publishing to ICPSR.
Private WT instance: The WT platform can be deployed locally with more restrictive access.
ICPSR/Dataverse: Dataverse holds “replication datasets” created from ICPSR data that don’t link to the original data at ICPSR. The articles may not even cite the data at ICPSR, so the original authors of the data don’t get any credit. The authors of the article should get credit for their code, but not for the data.
Multiple applications: AEA data packages often contain a mixture of code (R and Stata, or R and Matlab, etc.), requiring the ability to run not just R or Stata but both in the same image.
Ability to choose base software version: Some Tales will require newer/older versions of R
Metadata/classification Published packages should support domain and journal metadata formats (i.e., JEL https://www.aeaweb.org/econlit/jelCodes.php)
References¶
AEA. (2019). Usage Data. https://github.com/AEADataEditor/econ-program-usage-data
William G. Dewald, Jerry G. Thursby, Richard G. Anderson. Replication in Empirical Economics: The Journal of Money, Credit and Banking Project The American Economic Review, Vol. 76, No. 4 (Sep., 1986), pp. 587-603. http://www.jstor.org/stable/1806061
Kingi, Hautahi; Vilhuber, Lars; Herbert, Sylverie; Stanchi, Flavio. 2018. The Reproducibility of Economics Research: A Case Study. https://ecommons.cornell.edu/handle/1813/60838 Preprint - https://hautahi.com/static/docs/Replication_aejae.pdf
Levenstein, Margaret (2017). Presentation to the NAS Committee on Replicability and Reproducibility in Science. http://sites.nationalacademies.org/DBASSE/BBCSS/DBASSE_185106
Levenstein, Margaret (2018). Reproducibility and Replicability in Economic Science. https://deepblue.lib.umich.edu/bitstream/handle/2027.42/143813/Reproducibility+and+Replicability+in+Economic+Science+Levenstein+NAS+presentation+February+22,+2018.pdf?sequence=
Vilhuber, Lars (2020). Migrating historical AEA supplements. https://aeadataeditor.github.io/aea-supplement-migration/programs/aea201910-migration.html (accessed 2023-01-16)
Political Science¶
Introduction¶
Over the past 20 years, the political science community has increasingly pursued transparency by encouraging or requiring authors to publish “replication files” intended to make each step of the research process as explicit as possible. Beginning with the recommendations of King (1995), the research community and publishers have adopted a series of guidelines (for example, DA-RT, 2015), culminating in the implementation of in-house and third-party certification workflows by top journals (Christian et al, 2018).
The American Political Science Association's (APSA) A Guide to Professional Ethics in Political Science sets out researchers' obligations regarding data access and research transparency.
Top journals, including the American Journal of Political Science (AJPS) and the American Political Science Review (APSR), require authors reporting empirical and quantitative results to deposit data, software and code, and other information needed to reproduce findings (APSR, 2019). As discussed in detail below, AJPS is the only journal to implement a third-party certification of replication packages.
As highlighted by Dafoe (2014), the replication standard in political science is in part motivated by a number of high-visibility controversies in the social sciences. He cites the example of an influential paper in economics that was discovered three years later to have errors, arguing that the availability of the replication file for the study would have at least accelerated the identification of potential errors.
In January 2016, 27 political science journal editors signed the “Joint Statement on Data Access and Research Transparency” (DA-RT, 2015) that includes a number of requirements related to the APSA ethics guidelines for authors centered around data citation, transparency of analytic methods (e.g., code), and improving access to data and other research materials.
Example: American Journal of Political Science¶
Christian et al (2018) describe an operationalization of the replication standard implemented by the American Journal of Political Science (AJPS) in collaboration with the Odum Institute for Research in the Social Sciences and the Qualitative Data Repository (QDR). AJPS is one of the top-ranked political science journals with an ISI ranking of 1/169 in 2017. In 2012, AJPS adopted guidelines for authors to deposit replication packages in Harvard’s Dataverse. According to Jacoby (2017), due to concerns about the quality of deposited materials, AJPS implemented the third-party certification process starting in 2015.
Christian et al describe the basic workflow as follows:
The author submits a manuscript to AJPS for peer review. If accepted, the author is required to submit the replication materials to the AJPS Dataverse.
Once the replication materials are available, the editor contacts Odum/QDR to begin the curation and verification process.
Data is reviewed per a data quality review framework. Statistical experts perform verification by executing the analysis code and comparing the output to tables and figures reported in the manuscript.
A “Verification Form” is returned to the editor including the results of the review process and any errors. The editor notifies authors to correct problems.
Once the data review and verification process is complete, the editor issues the acceptance notification and the materials are published in Dataverse (including DOI).
The paper and replication package are linked via DOI.
The authors further note that only 10% of submissions pass review without the need for revision and that, as of 2019, the process requires roughly 6 hours of effort for a single manuscript.
In his presentation to the NAS Committee on Replication and Reproducibility in the sciences, Jacoby (2017) notes that:
Odum archive staff handle both data curation and verification (statistical)
Errors are generally not serious (e.g., lack of documentation or tables that don’t reproduce exactly).
Mean number of resubmissions is 1.82
The verification process is paid for by the Midwest Political Science Association
AJPS requires only the data used in analysis (i.e., not all of the data collected)
Anecdotally, he has had feedback that the resource is invaluable for methodology courses (See also Janz 2016)
In 2018, the Odum Institute was awarded a $500,000 grant from the Sloan Foundation to improve and automate the verification process.
Jacoby (2017) notes that other political science journals have in-house verification processes, typically relying on graduate students. In these cases, it’s likely that the focus is on the re-runnability of the code without necessarily comparing the reported results. In response, an example was raised from the field of computer science where reproducibility reports are written by community reviewers, notably the Information Systems journal (Chirigati, 2016).
The AJPS provides a “Quantitative Data Verification” checklist for the preparation of replication files that includes:
README file containing the names of all files with a brief description and any other important information regarding how to replicate the findings (i.e., the order files need to be run, etc.)
Includes a Codebook (.pdf format) with variable definitions for all variables in the analysis dataset(s) and value labels for categorical variables
Includes clear information regarding the software version used to conduct analysis
Includes complete references for source datasets
Includes the analysis dataset(s) in a file format readily accessible to the social science research community (i.e., text files, delimited files, Stata files, R files, SAS files, SPSS files, etc.)
Includes a unique case identifier variable linking each observation in the analysis dataset to the original data source
Includes software command file(s) for reconstructing the analysis dataset from the original data source and/or extracting and merging multiple original source datasets, including information on source dataset(s) version and access date(s)
Includes commands needed to reproduce all tables, figures, and other analytical results presented in the article and supplementary materials
Includes commands/instructions for installing macros or packages
Includes comment statements used to explain the analysis steps and distinguish commands for tables, figures, and other outputs
Includes seed values for any commands that generate random numbers (e.g., Monte Carlo simulations, bootstrap resampling, jittering points in graphical displays, etc.)
Includes any additional software tools needed for replication (e.g., Stata .ado files and R packages)
Examples¶
Harvard’s Dataverse includes hundreds of Political Science replication packages, including those verified through the Odum/QDR workflow.
References¶
AJPS replication policy https://ajps.org/ajps-replication-policy/
AJPS Quantitative Data Verification Checklist. 2016. https://ajpsblogging.files.wordpress.com/2019/01/ajps-quant-data-checklist-ver-1-2.pdf
AJPS Guidelines for Preparing Replication Files, https://ajpsblogging.files.wordpress.com/2018/05/ajps_replication-guidelines-2-1.pdf
APSA Guide to Professional Ethics, Rights and Freedoms https://www.apsanet.org/portals/54/Files/Publications/APSAEthicsGuide2012.pdf
APSR. (2019). Submission Guidelines. https://www.apsanet.org/APSR-Submission-Guidelines. Accessed February 8, 2019.
Barba, Lorena A. (2018). Terminologies for Reproducible Research. https://arxiv.org/pdf/1802.03311.pdf
Christian et al. Operationalizing the Replication Standard: A Case Study of the Data Curation and Verification Workflow for Scholarly Journals https://osf.io/preprints/socarxiv/cfdba/
Core2 award https://odum.unc.edu/2018/07/alfred-p-sloan-foundation-grant/
Dafoe, 2014. Science Deserves Better.
DA-RT. (2015). Data Access and Research Transparency (DA-RT): A Joint Statement by Political Science Journal Editors. https://doi.org/10.1177/0010414015594717
Jacoby, William. 2017. Presentation to National Academy of Sciences Committee on Replication and Reproducibility in the sciences. https://vimeo.com/252434555
Janz, 2016. Bringing the Gold Standard into the Classroom: Replication in University Teaching. https://doi.org/10.1111/insp.12104
Fernando Chirigati, Rebecca Capone, Rémi Rampin, Juliana Freire, Dennis Shasha. (2016). A collaborative approach to computational reproducibility. Information Systems, Volume 59, 2016, https://doi.org/10.1016/j.is.2016.03.002.
TOP guidelines (https://cos.io/our-services/top-guidelines/)
Use Cases¶
The following high-level use cases are supported by Whole Tale (v0.6):
A user can register immutable public data from supported external resources including DataONE, Globus, Dataverse and some HTTP sources.
A user can create a Tale based on popular environments including RStudio and Jupyter.
A user can upload/create source code files in the Tale workspace that are used for analysis. Analysis code can optionally reference externally registered data.
A user can share their Tale (via Public setting) and run Tales shared by others.
A Dataverse or DataONE user can create a Tale based on a public dataset via the repository’s native user interface (Analyze in Whole Tale)
A user can discover public Tales in the system (via Browse) and run them
A user can provide metadata about their Tale including title, authors, description, and a graphic representation
The following use cases are planned for future releases:
A user can customize existing software environments using common package managers.
A user can publish a Tale to an external research repository including DataONE and Dataverse network members.
A curator or reviewer can use Whole Tale to verify or certify published artifacts.
A user can add a new base environment to Whole Tale
A user can share a Tale with another user for collaboration
A user can share a Tale with another user for anonymous review
A user can copy an existing Tale and change the code, environment, or externally registered data (remix).
A user can run licensed software including Stata and Matlab
A user can run a Tale on a remote resource based on available data (data locality) or specialized compute requirements.
A user can create a Tale based on embargoed or private/authenticated data.
A user can track Tale executions along with detailed provenance information.
A user can export a Tale and run it locally
Workshops¶
Publications¶
Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher. (2018). Validation and Inference of Schema-Level Workflow Data-Dependency Annotations, 7th Intl. Provenance and Annotation Workshop (IPAW) https://arxiv.org/abs/1807.09899
Adam Brinckman, Kyle Chard, Niall Gaffney, Mihael Hategan, Matthew B. Jones, Kacper Kowalik, Sivakumar Kulasekaran, Bertram Ludäscher, Bryce D. Mecum, Jarek Nabrzyski, Victoria Stodden, Ian J. Taylor, Matthew J. Turk, Kandace Turner. (2017). Computing environments for reproducibility: Capturing the ‘‘Whole Tale’’, Future Generation Computer Systems https://doi.org/10.1016/j.future.2017.12.029
Adam Brinckman, Kyle Chard, Niall Gaffney, Mihael Hategan, Matthew B. Jones, Kacper Kowalik, Sivakumar Kulasekaran, Bertram Ludäscher, Bryce Mecum, Jaroslaw Nabrzyski, Victoria Stodden, Ian Taylor, Matthew Turk, and Kandace Turner. (2017). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, Globus World/NDS Workshop
Kyle Chard. (2017). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, National Data Service Workshop
Kyle Chard. (2017). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, Nordic e-Infrastructure conference (NEIC)
Kyle Chard, Niall Gaffney, Matthew B. Jones, Kacper Kowalik, Bertram Ludäscher, Jarek Nabrzyski, Victoria Stodden, Ian Taylor, Thomas Thelen, Matthew J. Turk, Craig Willis (2019). Application of BagIt-Serialized Research Object Bundles for Packaging and Re-execution of Computational Analyses. In Proceedings of the IEEE 15th International Conference on e-Science (e-Science) https://doi.org/10.1109/eScience.2019.00068
Jarek Nabrzyski. (2017). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, PressQT Notre Dame
Matthew Jones. (2018). Provenance tracking and display in DataONE, Earth Data Provenance Workshop https://esipfed.github.io/Earth-Data-Provenance-Workshop/
Kacper Kowalik. (2018). Sneaking Data into Containers with the Whole Tale, SciPy’18 https://scipy2018.scipy.org/ehome/index.php?eventid=299527&tabid=712461&cid=2233540&sessionid=21615725&sessionchoice=1&
Bertram Ludäscher. (2017). Workflows, Provenance, & Reproducibility: Telling the Whole Tale behind a Paleoclimate Reconstruction, EarthCube All Hands Meeting https://www.earthcube.org/2017AHMResources
Bertram Ludäscher. (2017). From Provenance Standards and Tools to Queries and Actionable Provenance, American Geophysical Union, Fall Meeting 2017, abstract #IN42C-02 http://adsabs.harvard.edu/abs/2017AGUFMIN42C..02L
Bertram Ludäscher. (2018). Whole Tale: The Experience of Research through reproducible, computational narratives, Building a Community to Advance and Sustain Digitized Biocollections (BCoN) meeting https://www.slideshare.net/ludaesch/wholetale-the-experience-of-research-88002697
Niall Gaffney. (2018). Improving Research Outcomes, Leveraging Digital Libraries, Advanced Computing and Data, JCDL 2018
Bertram Ludäscher. (2018). From Workflows to Provenance and Reproducibility: Looking Back and Forth, 7th Intl. Provenance and Annotation Workshop (IPAW) http://provenanceweek2018.org/ipaw/
Bertram Ludäscher and Santiago Núñez-Corrales. (2018). Dissecting Reproducibility: A case study with ecological niche models in the Whole Tale environment, Hierarchy of Hypotheses Workshop (HoH3) https://www.slideshare.net/ludaesch/dissecting-reproducibility-a-case-study-with-ecological-niche-models-in-the-whole-tale-environment
Timothy M. McPhillips, Craig Willis, Michael R. Gryk, Santiago Núñez-Corrales, Bertram Ludäscher (2019): Reproducibility by Other Means: Transparent Research Objects. In Proceedings of the IEEE 15th International Conference on e-Science (e-Science) https://doi.org/10.1109/eScience.2019.00068
B Mecum, S Wyngaard, C Willis, M Turk, T Thelen, I Taylor, V Stodden, D Perez, J Nabrzyski, B Ludaescher, S Kulasekaran, K Kowalik, MB Jones, M Hategan, N Gaffney, K Chard, A Brinckman. (2018). Science, containerized: Integrating provenance and compute environments with the Whole Tale, American Geophysical Union, Fall Meeting 2018, Washington DC http://adsabs.harvard.edu/abs/2018AGUFMIN53A..02M
Bryce Mecum, Matthew Jones, Dave Vieglais and Craig Willis. (2018). Preserving Reproducibility: Provenance and Executable Containers in DataONE Data Packages, 2018 IEEE 14th International Conference on e-Science (e-Science) https://doi.org/10.1109/eScience.2018.00019
Victoria Stodden. (2017). The Role of Cyberinfrastructure in Reproducible Science, Chameleon User Meeting
Victoria Stodden. (2017). Toward a Reproducible Scholarly Record, Dataverse Community Meeting
Victoria Stodden. (2017). Implementing Reproducible Computationally-Enabled Science, Institute for Advanced Computational Science
Victoria Stodden. (2017). Reproducibility in Computationally-Enabled Research: Integrating Tools and Skills, METRICS Seminar
Victoria Stodden. (2017). Data-Sharing and Reproducibility, National Academy of Sciences
Victoria Stodden. (2017). Research Data Management Implementations: Towards the Reproducibility of Science, RDMI Workshop
Victoria Stodden. (2017). Reproducibility in Computationally-Enabled Research, The Judith Resnik Year of Women in ECE Seminar
Victoria Stodden. (2018). Infrastructure for Enabling Reproducibility in Computational and Data-enabled Science, Biostatistics Seminar Northwestern Preventive Medicine https://www.preventivemedicine.northwestern.edu/divisions/biostatistics/seminars.html
Victoria Stodden. (2018). Open Data, Code, and Computational Reproducibility, CMU Open Science Symposium https://events.mcs.cmu.edu/oss2018/
Victoria Stodden. (2018). Enabling Reproducibility in Computational and Data-enabled Science, Ecole Polytechnique Federale de Lausanne https://memento.epfl.ch/event/enabling-reproducibility-in-computational-and-da-3/
Victoria Stodden. (2018). Enabling Reproducibility in Computational and Data-enabled Science, Workshop II HPC and Data Science for Scientific Discovery. Part of the Long Program: Science at Extreme Scales: Where Big Data Meets Large-Scale Computing https://www.ipam.ucla.edu/programs/workshops/workshop-ii-hpc-and-data-science-for-scientific-discovery/
Matt Turk, Kacper Kowalik. (2018). Sneaking Data into Containers with the Whole Tale, SciPy2018
Craig Willis. (2018). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, NDS/MBDH Data Science Tools & Methods Workshop http://www.nationaldataservice.org/get_involved/events/DataScienceTools/
Craig Willis, Kacper Kowalik. (2018). Container applications in research computing and research data access, PEARC’18 https://pearc18.conference-program.com/?page_id=10&id=pan109&sess=sess155
Craig Willis. (2018). The Whole Tale: Merging Science and Cyberinfrastructure Pathways, PresQT 2018
Development Documents¶
Developer Guide¶
Whole Tale is an open-source software project. External contributions are encouraged. Please feel free to ask questions or suggest changes to this Developer Guide.
Issue management¶
The core team uses Github for issue management. General issues, or issues where the specific component is unknown, are filed in https://github.com/whole-tale/whole-tale/issues.
During weekly development calls, issues are prioritized, clarified, and assigned to release milestones.
Defining “done”¶
What does it mean for an issue or task to be “done”?
Code complete
Unit tests complete and passing
Manual tests defined and passing
Documentation updated
PR reviewed and merged
Code management¶
Best practices:
Never commit code to master. Always use a fork or feature branch and create a Pull Request for your work.
Name your branch for the purpose of the change. For example, feat-add-foo.
Always include clear commit messages
Organize each commit to represent one logical set of changes. For example, separate out code formatting as one commit and functional changes as another.
Reference individual issues in commits
Prefer rebasing over merging from master
Learn to use rebase to squash commits – organize commits for ease of review.
Never merge your own PR if not approved by at least one person. If reviews aren’t happening in a timely manner, escalate them to the team.
Merging a PR means that the work has been tested, reviewed, and documented.
Testing¶
Every PR must include either a unit test or manual test scenario. PRs will not be merged unless tests run successfully.
Manual test cases will be added to the test plan template.
For the Whole Tale API, we leverage Girder’s automated testing framework.
Tests are run via CircleCI. Tests will fail with < 82% coverage.
Repositories and components¶
The project has the following repositories:
whole-tale - Top-level repository for working groups and general issue tracker
wt-design-docs - Source for wholetale.readthedocs.io (this repository)
deploy-dev - Scripts to deploy a local development environment
terraform_deployment - Terraform process to deploy full-scale system
- Core services:
ngx-dashboard - Whole Tale dashboard
girder_wholetale - Girder plugin providing basic Whole Tale functionality.
gwvolman - Girder Worker plugin responsible for spawning Instances and mounting GirderFS on compute nodes
girderfs - FUSE filesystem for mounting Girder resources.
girder_wt_data_manager - Girder plugin for external data management.
virtual_resources - Girder plugin for file-system backed resources.
wt_home_dirs - Girder plugin for WebDav support
wt_versioning - Girder plugin for versions and recorded run
Setting up for local development¶
The entire WT platform stack can be deployed locally or on a VM using the development deployment process.
The WT platform stack can be deployed on an OpenStack cluster using the Terraform deployment process.
Integrating with the ‘Analyze in Whole Tale’ feature¶
To utilize Whole Tale’s ability to create a Tale based on data in your repository, follow the steps outlined below (a sketch of the resulting integration module follows the steps). The general idea behind this feature is that the backend endpoint will never change, but the user interface may. To accommodate this, third parties should send their users to the /integration endpoint, which then redirects them to the appropriate frontend URL.
Clone the girder_wholetale repository
Create a folder in server/lib with the name of your service as the folder name
Add an integration.py file in the folder
Copy and paste the contents of the DataONE or Dataverse integration.py into yours
Change the content in autoDescribeRoute to match your service, including any query parameters
Change the name of the __DataImport method to match the name of your service
Modify any of the query parameters in the method if you’ve changed them
Navigate to server/rest/integration.py
Import your method from your service's integration.py (see how it's done for current integrators)
Add self.route('GET', ('YOUR_SERVICE_NAME',), YOUR_METHOD_NAME) to the __init__ method
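For illustration only, a minimal integration.py might look like the sketch below. The service name myservice, the uri and title parameters, and the redirect target are hypothetical; the DataONE and Dataverse handlers in girder_wholetale are the authoritative references.

# server/lib/myservice/integration.py -- hypothetical service name
import cherrypy

from girder.api import access
from girder.api.describe import Description, autoDescribeRoute


@access.public
@autoDescribeRoute(
    Description('Accept a dataset reference from MyService.')
    .param('uri', 'The URI of the dataset on MyService.', required=True)
    .param('title', 'Title for the resulting Tale.', required=False)
)
def myServiceDataImport(uri, title):
    # The /integration endpoint is the stable entry point; only this
    # redirect target (the frontend URL) is expected to change over time.
    raise cherrypy.HTTPRedirect(
        'https://dashboard.wholetale.org/?uri=%s&name=%s' % (uri, title or ''))

The corresponding registration in server/rest/integration.py would then be:

# Inside the resource's __init__ (names are illustrative):
# from ..lib.myservice.integration import myServiceDataImport
# self.route('GET', ('myservice',), myServiceDataImport)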
GitHub Organization¶
- Meta repositories:
whole-tale - User facing repository used mostly as a general bug tracker
wt-design-docs - This repository
girder_deploy - Collection of scripts used for deploying Girder (obsoleted by terraform_deployment??)
terraform_deployment - Terraform deployment setup for WT production system.
deploy-dev - Scripts for developers that want to deploy Whole Tale
tale_serialization_formats - Contains documentation related to the format of the manifest.json file and zip&bag export formats
- Core services:
dashboard - Frontend UI to Whole Tale.
girder_wholetale - Girder plugin providing basic Whole Tale functionality.
girder_wt_data_manager - Girder plugin for external data management.
wt_sils - Girder plugin providing Smart Image Lookup Service.
gwvolman - Girder Worker plugin responsible for spawning Instances and mounting GirderFS on compute nodes.
globus_handler - Girder plugin that supervises transfers between multiple GridFTP servers
- Images:
xpra-base - WT Image for Xpra base
jupyter-yt - Base Jupyter image with yt preinstalled
openrefine - Base Openrefine image
all-spark-notebook - Jupyter Notebook with Spark
Communicating¶
The Whole Tale development team strives for open channels of communication. The development team communicates through the following:
Slack (https://wholetale.slack.com/)
Dev mailing list (wholetale-dev [at] googlegroups.com)
Github (https://github.com/whole-tale/)
Weekly development team meetings (See Meeting Notes)
Design Notes¶
Comparison between ownCloud and WsgiDAV¶
This is a quick and dirty performance comparison between the ownCloud WebDav server and WsgiDAV.
A few tests are run:
a recursive copy of a directory that mostly includes source files
large file copies (256M, 1G)
deletion of the directory in the first step
recursive listing
evaluation of a jupyter notebook
Both ownCloud and WsgiDAV are run on plain HTTP and locally. Data is stored on an SSD. ownCloud is configured with Redis caching, with Redis running locally and the connection being through a UNIX socket. Redis is also used for ‘memcache.locking’. These are recommended performance settings for ownCloud (see here). WsgiDAV is run with its own HTTP server.
The “many files” used in the tests consist of a structure with 50 sub-directories. There are a total of 680 files adding up to 5.1MB.
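For illustration only, the timings above can be gathered with a small harness along these lines; the mount point and source tree paths are hypothetical, and the actual tests were not necessarily scripted this way.

# Illustrative timing harness; assumes the WebDAV server is mounted at
# MOUNT so that plain file operations exercise it. Paths are made up.
import shutil
import subprocess
import time

MOUNT = '/mnt/webdav'          # hypothetical WebDAV mount point
SRC = '/home/user/src-tree'    # the "many files" tree: 50 dirs, 680 files, 5.1MB

def timed(label, fn):
    t0 = time.perf_counter()
    fn()
    print('%s: %.2fs' % (label, time.perf_counter() - t0))

timed('recursive copy', lambda: shutil.copytree(SRC, MOUNT + '/src-tree'))
timed('recursive listing', lambda: subprocess.run(['ls', '-lR', MOUNT], capture_output=True))
timed('delete', lambda: shutil.rmtree(MOUNT + '/src-tree'))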
This document keeps track of what metadata is automatically generated by the Whole Tale system and what the default values are. This is relevant to publishing tales on DataONE.
System Metadata Generation¶
Any file that doesn't already exist on DataONE needs to have a metadata document describing its properties. This is accomplished by using dataone_package.generate_system_metadata, which in turn calls the d1_python library. The object's MIME type, MD5 checksum, and size are all put into the metadata document (a sketch follows). Additionally, system metadata is generated for both the EML document and the resource map.
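As a rough sketch of what gets recorded (not the actual dataone_package code, which delegates to d1_python), the per-object properties can be derived like this:

# Rough sketch: compute the properties recorded in system metadata
# for a local file. Function name and dict layout are illustrative.
import hashlib
import mimetypes
import os

def object_properties(path):
    with open(path, 'rb') as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    return {
        'formatId': mimetypes.guess_type(path)[0] or 'application/octet-stream',
        'checksum': md5,                 # MD5 digest of the object
        'size': os.path.getsize(path),   # size in bytes
    }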
Rights Holder¶
The rights holder typically corresponds to a user’s ORCID. Right now this is hard-coded and will be addressed with Globus-DataONE integration.
Access Policy¶
The metadata document holds information about who can access the data. By default, access is set to public with read permissions.
Minimum EML Generation¶
When uploading the package, a minimum EML record is required; the user can edit it at a later step in the process. The minimum record documents the following (a hypothetical sketch follows the list):
The title. This is set to the tale title.
The surName of the individualName in the creator field. This is currently set to the lastName of the user.
An otherEntity block for each object. This includes a physical section outlining the size and name of the object.
A section for the tale.yaml file.
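For illustration, a record with that shape could be assembled as in the sketch below; the element names follow the EML schema, but the function itself is hypothetical and not the platform's generator.

# Hypothetical sketch of a minimal EML record.
import xml.etree.ElementTree as ET

def minimal_eml(tale_title, last_name, objects):
    # objects: iterable of (name, size) pairs, including one for tale.yaml
    eml = ET.Element('eml:eml', {'xmlns:eml': 'eml://ecoinformatics.org/eml-2.1.1'})
    dataset = ET.SubElement(eml, 'dataset')
    ET.SubElement(dataset, 'title').text = tale_title
    creator = ET.SubElement(dataset, 'creator')
    individual = ET.SubElement(creator, 'individualName')
    ET.SubElement(individual, 'surName').text = last_name
    for name, size in objects:
        entity = ET.SubElement(dataset, 'otherEntity')
        ET.SubElement(entity, 'entityName').text = name
        physical = ET.SubElement(entity, 'physical')
        ET.SubElement(physical, 'objectName').text = name
        ET.SubElement(physical, 'size').text = str(size)
    return ET.tostring(eml, encoding='unicode')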
Creating Frontends¶
This document describes basic requirements for creating a usable Frontend (also known as Whole Tale Image).
Description¶
Software environment used in research plays a vital role in Whole Tale and needs to be preserved in the same fashion as other research artifacts. Currently Whole Tale utilizes Docker containers for that purpose. Our system is designed to allow users to create and publish Docker images, which can be subsequently used as building blocks for Tales.
Due to constraints related to reproducibility, we do not permit uploading raw Docker images. Instead, we require users to provide a GitHub repository that contains everything necessary to build a Docker image, which at a bare minimum is a Dockerfile. A reference to the GitHub URL and a commit ID is kept in our database as a Recipe object.
Image composition guidelines¶
Docker images in WT are built from Recipe objects, which contain the information necessary to create a local copy of an environment in which the docker build command can be executed. There is a minimum set of requirements that the resulting Image has to fulfill:
define a single port that is going to be used to access a running container via http(s) (e.g., EXPOSE 8888)
define a command that is executed upon container initialization (e.g., CMD ["/bin/app.exe"])
define a user that will own the container session (e.g., USER jovyan)
define a single volume that will be used as a mount point for the Whole Tale FS (e.g., VOLUME ["/home/jovyan/work"])
Optionally:
the FROM target should be referenced via a digest rather than a tag (currently not enforced).
The aforementioned properties can be defined in the Dockerfile, but may be overridden through WT's Image object, e.g.:
"config": {
"command": "/init",
"port": 8787,
"targetMount": "/home/rstudio/work",
"urlPath": "",
"user": "rstudio"
}
Notes¶
The config.command and config.urlPath properties support templating. Currently, config.port and a randomly generated token can be passed via {port} and {token}, respectively (see the example below).
Config options passed to Docker during Tale initialization are currently evaluated in the following order (first on the list takes precedence; a toy illustration follows the list):
Tale object's config property
Image object's config property
Docker image defaults
gwvolman defaults
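As an illustration of that rule (variable names are made up; this is not the gwvolman implementation), the effective config behaves like a left-to-right merge where earlier sources win:

# Toy illustration of config precedence; earlier maps take priority.
from collections import ChainMap

tale_config = {'port': 9999}
image_config = {'port': 8888, 'user': 'jovyan'}
docker_defaults = {'user': 'root', 'command': '/init'}
gwvolman_defaults = {'command': '/start', 'urlPath': ''}

effective = dict(ChainMap(tale_config, image_config, docker_defaults, gwvolman_defaults))
# -> port 9999, user 'jovyan', command '/init', urlPath ''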
Example Frontends¶
A basic Jupyter image that WT uses is defined here. Relevant part of the Image object:
"config": {
"command": "jupyter notebook --no-browser --port {port} --ip=0.0.0.0 --NotebookApp.token={token} --NotebookApp.base_url=/{base_path} --NotebookApp.port_retries=0",
"memLimit": "2048m",
"port": 8888,
"targetMount": "/home/jovyan/work",
"urlPath": "?token={token}",
"user": "jovyan"
}
For a full list of properties supported by the config parameter, please refer to the plugin code.
Backup design notes¶
Background:
The MongoDB is currently backed up manually via mongodump and copied to an external resource (e.g., ytHub).
The terraform_deployment process was modified to restore from backup if the URL was provided in variables.tf. The idea behind this was to quickly deploy a new instance of the infrastructure with the latest user data (for example, moving from Nebula to Jetstream).
If WT is intended to be an installable package, then the backup scenario should be flexible but also have a low bar. For example, we don't want to require everyone to use CrashPlan or come up with their own custom solution. Ideally, we would provide a basic functional backup that is flexible enough for most scenarios.
Initial discussions revolved around use of Box, since it is already resilient infrastructure, instead of trying to set up our own remote backup server. It would certainly be possible to set up a VM at IU to handle backups to disk. DataONE is using Bacula (http://blog.bacula.org/); NCSA ITS recommends CrashPlan to projects.
Mongo is running in Docker Swarm and attached to the wt_mongo network. Any backup implementation will need to be able to handle mongodump.
Requirements:
Periodic (eventually nightly) backup of user and system data
Ability to restore from backup in the event of disaster recovery or when migrating/upgrading infrastructure
Support for backup and recovery of multiple WT instances (i.e., dev/testing/production)
Integration with monitoring/alert system (e.g., notifications if backup fails)
Configurable retention policy (default: 1 week)
User/system data:
User data is currently stored in MongoDB and the home directory mount on the fileserver node.
System data is primarily stored in MongoDB. There are a few configuration files that might be useful to capture, e.g., acme.json.
Open question: Do we need to back up any of the cached/registered user data?
Preliminary implementation (a sketch follows this list):
Uses rclone (https://rclone.org/), a cloud-oriented rsync. The user would supply an rclone configuration file during WT system installation. Unfortunately, this requires an interactive setup to get an OAuth token, but the token is valid for 60 days.
The backup process is containerized and includes both rclone and mongodump support; see https://github.com/whole-tale/backup.
Data and the rclone config are mounted into the container via -v. The backup.sh script runs mongodump, tars the mounted directory, and copies the result to Box under "WT/name/YYYYMMDD/".
Nightly backups are handled via a systemd timer; a single timer runs the backup nightly.
The backup process runs on the WT fileserver for direct access to user directories.
Initially, home directories are tarred, gzipped, and synced to Box. Mongodump output is compressed.
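A condensed sketch of that flow follows; the paths, the box: remote name, and the flag choices are assumptions, and the authoritative script is backup.sh in whole-tale/backup.

# Illustrative nightly backup flow; paths and remote name are made up.
import datetime
import subprocess

stamp = datetime.date.today().strftime('%Y%m%d')
dest = 'box:WT/prod/%s/' % stamp   # rclone remote path, per the WT/name/YYYYMMDD layout

# Dump MongoDB to a compressed archive.
subprocess.run(['mongodump', '--archive=/backup/mongo-%s.gz' % stamp, '--gzip'], check=True)
# Tar and gzip the user home directories.
subprocess.run(['tar', 'czf', '/backup/homes-%s.tar.gz' % stamp, '/mnt/homes'], check=True)
# Copy both artifacts to Box via rclone.
subprocess.run(['rclone', 'copy', '/backup', dest], check=True)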
Meeting Notes¶
Meeting notes are available in Github https://github.com/whole-tale/wt-design-docs/tree/master/development/meetings.
Testing Process¶
Usability Test Results¶
As part of the effort to gain insight into the usability of the Whole Tale dashboard, members of the team developed a series of common tasks that users may perform, along with questions for each task. Afterward, members of the development team performed the tasks and answered the questions. This document is a synthesis of the members' responses.
The usability tests can be found at
https://hackmd.io/d-m0LsVXSh2EpzhcRJMCUA
Task 1: Launching the LIGO Tale¶
Broad Goal: Find a tale (a combination of environment and data), launch it, use it (briefly) to its desired end, and then terminate it.
Questions:
What is a tale?
There was common agreement that a tale is a container/package that has at least someone’s code and data inside.
When you ran the LIGO tale, what did it do?
A bug prevented some users from launching the tale. Users who did not experience the error were able to launch the tale successfully. The notebook file had to be opened manually; then it was possible to generate the output data plots.
What was clear/unclear about this process?
There was a lack of direction of what to do/where to go/how to run the experiment when the user entered jupyter notebook. In particular, the file structure was confusing.
Common Issues/Questions:¶
When launching a tale, it was confusing to see both “View” and “Launch” on the tale widget.
After clicking "Launch", users expected the tale to launch and open; instead, if they missed the window to click the "Go to Tale" button, they had to visit the Status page and interact with the UI again to view it.
There was a lack of understanding of what to do once the tale is launched (i.e., why am I not seeing the tale after clicking Launch? Where do I go to see the tale?).
Suggestions:¶
Automatically open the notebook file when the user enters the tale.
When the user clicks "Launch", open the tale in a new tab when it is ready.
Misc Notes:¶
The notebooks were in a data directory. Why?
The ‘view’ link was expected to bring the user to a summary of the tale, not the experiment.
‘Data’ and ‘Home’ directories are not native to Jupyter environments, and it may be misleading/confusing for a user that is familiar with the Jupyter environment
Task 2: Registering data from Globus¶
Broad Goal: Register a dataset that is accessible by a URL, and then put that data into its own folder.
Questions:¶
What is the difference between the home, data, and workspace folders?
There was agreement between two respondents that raw data would appear in the data folder.
Two respondents thought that personal data would be kept in home.
The workspace folder had the greatest amount of uncertainty: each respondent failed to clearly identify its purpose.
If you want to share a registered dataset, which folder would you put it in?
Two respondents felt that the data would go into the data folder, but the workspace folder was also a considered option.
Common Issues/Questions:¶
There was a common theme of confusion about how the workspace folder should be utilized, possibly stemming from a lack of understanding of its purpose.
Task 3: Creating a Tale¶
Broad Goal: Given a registered dataset, create a new “Tale” that utilizes it and launch that Tale.
Questions:¶
Were you easily able to find the registered data set?
One user experienced an error when attempting to access their data; two of the respondents were able to locate the data.
Inside the tale, were you able to identify your files of interest?
Two of the respondents ran into a bug when launching the tale and were unable to complete the task. Another user was able to find the data in the Data folder.
Misc Notes:¶
Two respondents mentioned that the login screen may be confusing to new users.
One respondent was confused about the kinematic folder in the tale.
The data folder was read-only even though the tale was unpublished, which prevented the user from organizing the tale.
Task 4: Importing Data from DataONE¶
Broad Goal: Register a dataset that is accessible from DataONE, and use it in a Tale.
Questions:¶
Where did you find the data you registered in Whole Tale?
The data was found in the Data folder (in the dashboard).
What is the difference between Catalog and My Data?
One user thought that it was the difference between cached data and what had been registered. Two other respondents agreed that the Catalog has data that others had registered, while My Data has data specific to the user.
Did you find the data in your running tale in RStudio?
Two of the users ran into a bug when launching the tale; the other ran into a bug when attempting to access /work/data.
What was clear/unclear about the process?
It may be confusing for a user dealing with DOIs, reference URIs, data IDs, etc.
There was a lack of indication of where the data was being put in the dashboard.
Task 5: Importing Data from Globus¶
Broad Goal: Register a dataset that is accessible from Globus, and use it in a Tale.
Questions:¶
Which folder were you expecting the data to be registered in?
There was a split between users thinking that the data would go into the workspace and data folders. During registration, a notification came up that stated the data was being copied to the workspace.
Did the file names and extensions in the tale match the ones in Globus?
The filenames did not match those in Globus, and were extensionless.
If there were any hurdles for plotting the data, what were they?
The filenames not matching what was in Globus was an issue. There was also only one file that made it over from Globus.
During registration, only a single file was registered in Whole Tale, despite there being more in the Globus package.
Task 6: Import Recipe and Build Image¶
Broad Goal: Register a Git repo containing a recipe and build a Whole Tale Image.
Questions:¶
Was the process self-explanatory? How could the UI design or hints/documentation be improved to help a user walk through it without seeking help?
In general, each user had issues determining what was asked at each step, and what each step represented (for example, what is a recipe?).
There was a consensus that error reporting can be improved when recipe/image creation fails.
Are the current steps an efficient method for representing the breadth of functionality you might want to achieve from WholeTale frontends?
There was agreement that the current system may be too complicated for normal users: "The notion of 'recipes' and 'frontends' as an abstraction over Docker images makes it even harder to understand what I'm doing."
Can you think of use cases for using the second step (Create Image) without the first, i.e., to provide multiple images for the same recipe?
None of the users could think of any use cases.
Any other feedback about the existing process? Please provide input on streamlining the process, if relevant.
There was consensus that it wasn't clear which fields were required, and that needing to specify the commit could be automated. It was also suggested that some fields, like port and volume, might be taken from the Dockerfile. It may be a good idea to separate the more advanced fields from the bare-minimum ones.
TODO: Describe testing process.
Manual testing
Unit testing
Integration testing
Continuous integration
Release Process¶
The release process includes the following:
Create release milestones in Github for the following repositories
Team identifies target features for release, creates issues, and assigns milestones to associated issues
Features are implemented, tested, and documentation updated either on master or a designated feature branch
Changes are merged to the stable branch
Stable branch is updated with any version changes, if applicable
Release candidate is created by tagging the stable branch (v1.0-rc1)
Release candidate is tested
Release notes are created and added to documentation
Final release is created by tagging the stable branch (v1.0)
Install release to production instance
Announce release to community
Detailed release process¶
For all repos, merge or cherry-pick commits from master to stable, bump versions, and create a release tag.
Wait for autobuilds of Docker images, then deploy to staging.
Publish releases via github
Deploy tagged version to staging
Testing/smoke test
Deploy to production
globus_handler
Bump version in plugin.json (master/stable)
girder_wt_data_manager, wt_home_dirs, wt_versioning, virtual_resources
Bump version in plugin.json (master/stable)
girder_wholetale
Bump version in plugin.yml (master/stable)
Pin version of gwvolman in requirements.txt
gwvolman
Bump version in setup.py (master/stable)
Pin version of girderfs in requirements.txt
girder
Clone repo with submodules
In stable branch, checkout version tag for each plugin, commit
wt-design-docs:
Add release notes
Update ISSUE_TEMPLATE/test_plan.md
Release steps used for v1.0
# Clone the repos
# Bump version on master
# Merge to stable
# Bump version on master
# Release
# repo2docker_wholetale
# girderfs
# gwvolman
# virtual_resources
# wt_versioning
# girder_wt_data_manager
# globus_handler
# wt_home_dirs
# girder_wholetale
# ngx-dashboard
# wt-design-docs
# terraform_deployment
version_stable="1.0.0"
tag_stable="v1.0"
version_master_python="1.1.0dev0"
version_master_plugin="1.1.0"
# virtual_resources
git clone https://github.com/whole-tale/virtual_resources
cd virtual_resources
sed -i.bak "s/^version: .*/version: ${version_stable}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git checkout stable
git merge master
git tag ${tag_stable}
git push origin stable
git push origin --tags
git checkout master
sed -i.bak "s/^version: .*/version: ${version_master_plugin}/g" plugin.yml
git add plugin.yml
git add plugin.json
git commit -m "Bump version"
git push origin master
# wt_versioning
git clone https://github.com/whole-tale/wt_versioning
cd wt_versioning
sed -i.bak "s/^version: .*/version: ${version_stable}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git checkout stable
git merge master
git tag ${tag_stable}
git push origin stable
git push origin --tags
git checkout master
sed -i.bak "s/^version: .*/version: ${version_master_plugin}/g" plugin.yml
git add plugin.yml
git add plugin.json
git commit -m "Bump version"
git push origin master
# girder_wt_data_manager
git clone https://github.com/whole-tale/girder_wt_data_manager
cd girder_wt_data_manager
sed -i.bak "s/^version: .*/version: ${version_stable}/g" plugin.yml
git add plugin.yml
git commit -m "Bump version"
git checkout stable
git merge master
git tag ${tag_stable}
git push origin stable
git push origin --tags
git checkout master
sed -i.bak "s/^version: .*/version: ${version_master_plugin}/g" plugin.yml
git add plugin.yml
git add plugin.json
git commit -m "Bump version"
git push origin master
git clone https://github.com/whole-tale/wt_home_dirs
cd wt_home_dirs
# Same steps as girder_wt_data_manager
git clone https://github.com/whole-tale/globus_handler
cd globus_handler
# Same steps as girder_wt_data_manager
git clone https://github.com/whole-tale/girderfs
cd girderfs
sed -i.bak "s/__version__ = '[^']*'/__version__ = '${version_stable}'/g" ./girderfs/__init__.py
git add ./girderfs/__init__.py
git commit -m "Bump version"
git checkout stable
git merge master
git push origin stable
git tag ${tag_stable}
git push origin --tags
git checkout master
sed -i.bak "s/__version__ = '[^']*'/__version__ = '${version_master_python}'/g" ./girderfs/__init__.py
git add girderfs/__init__.py
git commit -m "Bump version"
git push origin master
git clone https://github.com/whole-tale/gwvolman
cd gwvolman
# Pin girderfs version in requirements.txt to ${tag_stable}
sed -i.bak "s/version='[^']*',/version='${version_stable}',/g" setup.py
git add setup.py requirements.txt
git commit -m "Bump version"
git checkout stable
git merge master
git push origin stable
git tag ${tag_stable}
git push origin --tags
git checkout master
# Pin girderfs version in requirements.txt to master
sed -i.bak "s/version='[^']*',/version='${version_master_python}',/g" setup.py
git add setup.py requirements.txt
git commit -m "Bump version"
git push origin master
git clone https://github.com/whole-tale/girder_wholetale
cd girder_wholetale
# Pin gwvolman version in requirements.txt to ${tag_stable}
sed -i.bak "s/^version: .*/version: ${version_stable}/g" plugin.yml
git add plugin.yml requirements.txt
git commit -m "Bump version"
git checkout stable
git merge master
git push origin stable
git tag ${tag_stable}
git push origin --tags
git checkout master
# Pin gwvolman version in requirements.txt to master
sed -i.bak "s/^version: .*/version: ${version_master_plugin}/g" plugin.yml
git add plugin.yml requirements.txt
git commit -m "Bump version"
git push origin master
git clone --recurse-submodules https://github.com/whole-tale/girder
cd girder
git checkout stable
git merge master
cd plugins/globus_handler
git checkout ${tag_stable}
cd ../wholetale
git checkout ${tag_stable}
cd ../wt_data_manager/
git checkout ${tag_stable}
cd ../wt_home_dir/
git checkout ${tag_stable}
cd ../wt_sils/
git checkout ${tag_stable}
cd ..
git commit -a -m 'Bump version'
git push origin stable
git tag ${tag_stable}
git push origin --tags
git clone https://github.com/whole-tale/dashboard
cd dashboard
sed -i.bak "s/\"version\": \"[^\"]*\"/\"version\": \"${version_stable}\"/g" package.json
git add package.json
git commit -m "Bump version"
git checkout stable
git merge master
# Manually merge conflicts
git commit
git push origin stable
git tag ${tag_stable}
git push origin --tags
# Update release notes
git clone https://github.com/whole-tale/wt-design-docs
cd wt-design-docs
sed -i.bak "s/version = '[^']*'/version = '${version_stable}'/g" conf.py
git add conf.py
git commit -m "Bump version"
git push origin master
git tag ${tag_stable}
git push origin --tags
# Publish releases via github
# Deploy tagged version to staging
# Testing/smoke test
# Deploy to production