dtool: Manage Scientific Data¶
Make your data more resilient, portable and easy to work with by packaging files & metadata into self contained datasets.
- Documentation: http://dtool.readthedocs.io
- Paper: https://doi.org/10.7717/peerj.6562
- Free software: MIT License
Overview¶
dtool is a suite of software for managing scientific data and making it
accessible programmatically. It consists of a command line interface dtool
and a Python API: dtoolcore.
The dtool command line interface allows one to organise files into datasets
and to move datasets between different storage solutions, for example from
local disk to remote object storage. Importantly, it also provides methods to
verify that the transfer has been successful.
The Python API gives complete access to the data and metadata in a dataset. It makes it easy to create scripts for processing the items, or a subset of items, in a dataset. The Python API also allows datasets to be constructed programmatically.
dtool is extensible, meaning that it is possible to create plugins both for adding functionality to the command line interface and for creating interfaces to custom storage backends.
The dtool Python package is a meta package that installs the packages:
- dtoolcore - core API
- dtool-cli - CLI plugin scaffold
- dtool-annotation - CLI commands for working with dataset annotations
- dtool-config - CLI commands for configuring dtool
- dtool-create - CLI commands for creating datasets
- dtool-info - CLI commands for getting information about datasets
- dtool-overlay - CLI commands for working with per item metadata stored as overlays
- dtool-symlink - storage broker interface allowing symlinking to data
- dtool-http - storage broker interface allowing read only access to datasets over HTTP
Installation:
$ pip install dtool
There are support packages for several object storage solutions:
- dtool-s3 - storage broker interface to S3 object storage
- dtool-azure - storage broker interface to Azure Storage
- dtool-ecs - storage broker interface to ECS S3 object storage
- dtool-irods - storage broker interface to iRODS
If you have access to Amazon S3, Microsoft Azure, ECS S3 or iRODS storage you may also want to install support for these:
$ pip install dtool-s3 dtool-azure dtool-ecs dtool-irods
Usage:
$ dtool create my-awesome-dataset
Created proto dataset file:///Users/olssont/my-awesome-dataset
Next steps:
1. Add raw data, eg:
dtool add item my_file.txt file:///Users/olssont/my-awesome-dataset
Or use your system commands, e.g:
mv my_data_directory /Users/olssont/my-awesome-dataset/data/
2. Add descriptive metadata, e.g:
dtool readme interactive file:///Users/olssont/my-awesome-dataset
3. Convert the proto dataset into a dataset:
dtool freeze file:///Users/olssont/my-awesome-dataset
Installation notes¶
dtool is a Python package that is pip installable.
Make sure that pip, setuptools and wheel are up to date.
This is a requirement of one of the dependencies (ruamel.yaml).
$ pip install -U pip setuptools wheel
dtool can then be installed using pip.
$ pip install dtool
Adding support for S3 object storage¶
Install the dtool-s3 package using pip.
$ pip install dtool-s3
To configure Amazon S3 credentials see the README file in the dtool-s3 GitHub repository.
Adding support for Azure storage¶
Install the dtool-azure package using pip.
$ pip install dtool-azure
To configure Microsoft Azure credentials see the README file in the dtool-azure GitHub repository.
Adding support for ECS S3 object storage¶
Install the dtool-ecs package using pip.
$ pip install dtool-ecs
To configure ECS S3 object storage credentials see the README file in the dtool-ecs GitHub repository.
Adding support for iRODS storage¶
Install the dtool-irods package using pip.
$ pip install dtool-irods
Warning
In order to be able to use the iRODS backend storage you will need to install the iCommands. Linux packages can be downloaded from irods.org/download. On Mac OSX these can be installed using the brew package manager:
$ brew install irods
For more details see the dtool-irods GitHub repository.
Philosophy - what is dtool?¶
What problem is dtool solving?¶
Managing data as a collection of individual files is hard. Analysing that data will require that certain sets of files are present, understanding it requires suitable metadata, and copying or moving it while keeping its integrity is difficult.
dtool solves this problem by packaging a collection of files and accompanying metadata into a self contained and unified whole: a dataset.
When metadata is kept separate from the data, for example in an Excel spreadsheet with links to the data files, it becomes difficult to reorganise the data without fear of breaking links between the data and the metadata. By encapsulating both the data files and associated metadata in a dataset one is free to move the dataset around at will. The high level organisation of datasets can therefore evolve over time as data management processes change.
dtool also solves an issue of trust. By including file hashes as metadata it is possible to verify the integrity of a dataset after it has been moved to a new location or when coming back to a dataset after a period of time.
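The idea of recording file hashes and checking them again later can be illustrated with a minimal Python sketch. This is not dtool's implementation: the function names `hash_file` and `build_manifest` are hypothetical, and the choice of MD5 is an assumption made only for illustration.

```python
import hashlib
import os
import tempfile


def hash_file(path, chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks."""
    hasher = hashlib.md5()  # assumption: any stable hash function would do
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def build_manifest(data_dir):
    """Map each relative path under data_dir to the hash of its content."""
    manifest = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            manifest[os.path.relpath(path, data_dir)] = hash_file(path)
    return manifest


# Record the hashes once, then compare against them later.
with tempfile.TemporaryDirectory() as data_dir:
    item = os.path.join(data_dir, "reads.txt")
    with open(item, "w") as fh:
        fh.write("ACGT\n")
    recorded = build_manifest(data_dir)
    unchanged = build_manifest(data_dir) == recorded

    # Simulate corruption of the file after the hashes were recorded.
    with open(item, "a") as fh:
        fh.write("corruption\n")
    changed = build_manifest(data_dir) != recorded

print(unchanged, changed)
```

Because the recorded hashes travel with the data, the same comparison can be repeated after a copy or after a long period of storage.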
It is possible to discover and access both metadata and data files in a dataset. It is therefore easy to create scripts and pipelines to process the items, or a subset of items, in a dataset.
What is a “dtool dataset”?¶
Briefly, a dtool dataset consists of:
- The files added to the dataset, known as the dataset “items”
- Metadata used to describe the dataset as a whole
- Metadata describing the items in the dataset
The exact details of how this data and metadata are stored depend on the
“backend” (the type of storage used). In other words, a dataset is stored
differently on local disk from how it is stored in Amazon S3 object
storage. However, the dtool commands and the Python API for interacting with
datasets are the same for all backends.
What does a dtool dataset look like on local disk?¶
Below is the structure of a fictional dataset containing three items from an RNA sequencing experiment.
$ tree ~/my_dataset
/Users/olssont/my_dataset
├── README.yml
└── data
    ├── rna_seq_reads_1.fq.gz
    ├── rna_seq_reads_2.fq.gz
    └── rna_seq_reads_3.fq.gz
The README.yml file is where metadata used to describe the whole dataset is
stored. The items of the dataset are stored in the directory named data.
There is also hidden metadata, stored as plain text files, in a directory named
.dtool. This should not be edited directly by the user.
How does one create a dtool dataset?¶
This happens in stages:
- One creates a so called “proto dataset”
- One adds data and metadata to this proto dataset
- One converts the proto dataset into a dataset by “freezing” it
Once a proto dataset is “frozen” it is simply referred to as a dataset and it is no longer possible to modify the data in it. In other words it is not possible to add or remove items from a dataset or to alter any of the items in a dataset.
The process can be likened to creating an open box (the proto dataset), putting items (data) into it, sticking a label (metadata) on it, and closing the box (freezing the dataset).
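The open box analogy can be written down as a toy Python model. The class `ToyDataset` and the exception `FrozenError` are hypothetical names used only for this illustration; they are not part of dtool or dtoolcore.

```python
class FrozenError(Exception):
    """Raised when trying to modify a dataset after it has been frozen."""


class ToyDataset:
    """Toy model of the proto dataset to dataset lifecycle (not dtool code)."""

    def __init__(self, name):
        self.name = name
        self.items = []
        self.frozen = False

    def add_item(self, relpath):
        # Items can only be added while the box is still open.
        if self.frozen:
            raise FrozenError("cannot add items to a frozen dataset")
        self.items.append(relpath)

    def freeze(self):
        # Closing the box: from now on the content is fixed.
        self.frozen = True


box = ToyDataset("my-awesome-dataset")
box.add_item("my_file.txt")
box.freeze()

try:
    box.add_item("late_file.txt")
    rejected = False
except FrozenError:
    rejected = True

print(box.items, rejected)
```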
Give me more details!¶
An in depth discussion of dtool can be found in the paper Lightweight data management with dtool.
Quick start guide¶
This quick start guide shows how the dtool command line tool can be used to
accomplish some common data management tasks.
Organising files into a dataset on local disk¶
In this scenario one simply wants to organise one or more files into a dataset in the file system on the local computer.
When working on local disk a dataset is simply a standardised directory layout combined with some hidden files used to annotate the dataset and its items.
The first step is to create a “proto” dataset. The command below creates a
dataset named fishers-iris-data in the current working directory.
$ dtool create fishers-iris-data
One can now add files to the dataset by moving/copying them to the
fishers-iris-data/data directory, or by using the built in dtool add item
command. In the example below the file iris.csv is added to the
proto dataset.
$ touch iris.csv
$ dtool add item iris.csv fishers-iris-data
Metadata describing the data is as important as the data itself. Metadata
describing the dataset is stored in the file fishers-iris-data/README.yml.
An easy way to add content to this file is to use the dtool readme interactive
command, which will prompt for input regarding the dataset.
$ dtool readme interactive fishers-iris-data
description [Dataset description]: Fisher's classic iris data, but with an empty file :(
project [Project name]: dtool demo
confidential [False]:
personally_identifiable_information [False]:
name [Your Name]: Tjelvar Olsson
email [olssont@nbi.ac.uk]:
username [olssont]:
creation_date [2017-10-06]:
Updated readme
To edit the readme using your default editor:
dtool readme edit fishers-iris-data
Finally, to convert the proto dataset into a dataset one uses the dtool freeze
command.
$ dtool freeze fishers-iris-data
Generating manifest [####################################] 100% iris.csv
Dataset frozen fishers-iris-data
Copying data from an external hard drive to remote storage as a dataset¶
Genome sequencing generates large volumes of data, which are often sent from the sequencing company to the user by posting an external hard drive. When backing up such data on a remote storage system one does not want to have to reorganise the data before copying it to the remote storage system.
In this case one can create a “symlink” dataset and copy that to the remote storage. A symlink dataset is a dataset where the data directory is a symlink to another location, for example the data directory on the external hard drive.
$ dtool create bgi-sequencing-12345 --symlink-path /mnt/external-hard-drive
Again, adding metadata to the dataset is vital.
$ dtool readme interactive bgi-sequencing-12345
One can then convert the proto dataset into a dataset by “freezing” it.
$ dtool freeze bgi-sequencing-12345
It is now time to copy the dataset to the remote storage. The command below
assumes that one has credentials set up to write to the Amazon S3 bucket
dtool-demo. The command copies the local dataset to the S3 dtool-demo
bucket.
$ dtool cp bgi-sequencing-12345 s3://dtool-demo/
The command above returns feedback on the URI used to identify the dataset in
the remote storage. In this case
s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16.
The URI used to identify the dataset uses the UUID of the dataset rather than the dataset’s name. This is to avoid name clashes in the object storage.
Finally, one may want to confirm that the data transfer was successful. This
can be achieved using the dtool diff command, which should show no
differences if the transfer was successful.
$ dtool diff bgi-sequencing-12345 s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16
By default only identifiers and file sizes are compared. To check file hashes
make use of the --full option.
Warning
When comparing datasets, identifiers, sizes and hashes are compared. When checking that the hashes are identical, the hashes for the first dataset are recalculated using the hashing algorithm of the reference dataset (the second). If the dataset in S3 had been specified as the first argument then all the files would have had to be downloaded to the local disk before calculating their hashes, which would have made the command slower.
Copying a dataset from remote storage to local disk¶
After having copied a dataset to a remote storage system one may have deleted the copy on the local disk. In this case one may want to be able to get the dataset back onto the local disk.
This can be achieved using the dtool cp command. The command below copies
the dataset in the S3 bucket to the current working directory.
$ dtool cp s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16 ./
Note that on the local disk the dataset will use the name of the dataset rather
than the UUID, in this example bgi-sequencing-12345.
Again one can verify the data transfer using the dtool diff command.
$ dtool diff bgi-sequencing-12345 s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16
Working with datasets¶
Listing datasets¶
It is possible to list all datasets in a directory or in an S3 bucket
using the dtool ls command.
$ dtool ls ~/my_datasets
bgi-sequencing-12345
file:///Users/olssont/my_datasets/bgi-sequencing-12345
drone-images
file:///Users/olssont/my_datasets/drone-images
fishers-iris-data
file:///Users/olssont/my_datasets/fishers-iris-data
my_rnaseq_data
file:///Users/olssont/my_datasets/my_rnaseq_data
Tip
When using this command proto datasets are highlighted in red.
Tip
The dtool ls command takes a URI. As such it can be used to list
the datasets in remote storage locations. The example below lists all
the datasets in the S3 bucket named dtool-demo:
$ dtool ls s3://dtool-demo/
Generating an inventory of datasets¶
It is possible to generate CSV/TSV/HTML inventories of datasets in a directory or in another base URI such as an Amazon S3 bucket. For example, the command below is used to generate an HTML report of all the datasets in the s3://dtool-demo/ bucket.
$ dtool inventory --format html s3://dtool-demo/ > inventory.html
Verifying a dataset has not been modified since freezing it¶
A dtool dataset has metadata listing its items and their hashes. This information can be used to verify that a dataset is in the same state as it was when it was frozen.
In the example below the dataset has been corrupted in three ways.
- The file rna_seq_reads_4.fq.gz has been added to it
- The file rna_seq_reads_3.fq.gz has been deleted from it
- The content of the file rna_seq_reads_1.fq.gz has been modified
$ dtool verify ~/my_datasets/my_rnaseq_data
Unknown item: 49919bdae83011b96bf54d984735e24c4419feb5 rna_seq_reads_4.fq.gz
Missing item: 72b24007759c0086a316d13838021c2571853a16 rna_seq_reads_3.fq.gz
By default only identifiers and file sizes are compared. To check file hashes
make use of the --full option.
$ dtool verify --full ~/my_datasets/my_rnaseq_data
Unknown item: 49919bdae83011b96bf54d984735e24c4419feb5 rna_seq_reads_4.fq.gz
Missing item: 72b24007759c0086a316d13838021c2571853a16 rna_seq_reads_3.fq.gz
Altered item: d4e065787eab480e9cbd2bac6988bc7717464c83 rna_seq_reads_1.fq.gz
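The three kinds of report shown above (unknown, missing and altered items) can be reproduced with a small Python sketch. It is an illustration of the comparison logic only, not the code behind dtool verify. The function name `verify`, the short identifiers and the hash strings are all made up for the example; the property names `size_in_bytes` and `hash` follow the item properties shown elsewhere in this document.

```python
def verify(frozen, current, full=False):
    """Compare two {identifier: properties} mappings and list the differences."""
    problems = []
    # Items present now that were not recorded when the dataset was frozen.
    for identifier in sorted(current.keys() - frozen.keys()):
        problems.append(("Unknown item", identifier))
    # Items recorded at freeze time that can no longer be found.
    for identifier in sorted(frozen.keys() - current.keys()):
        problems.append(("Missing item", identifier))
    # Items present in both: compare sizes, and hashes only when asked to.
    for identifier in sorted(frozen.keys() & current.keys()):
        before, after = frozen[identifier], current[identifier]
        if before["size_in_bytes"] != after["size_in_bytes"]:
            problems.append(("Altered item", identifier))
        elif full and before["hash"] != after["hash"]:
            problems.append(("Altered item", identifier))
    return problems


frozen = {
    "id-1": {"size_in_bytes": 10, "hash": "aaa"},
    "id-3": {"size_in_bytes": 30, "hash": "ccc"},
}
current = {
    "id-1": {"size_in_bytes": 10, "hash": "zzz"},  # same size, new content
    "id-4": {"size_in_bytes": 40, "hash": "ddd"},
}

quick = verify(frozen, current)
thorough = verify(frozen, current, full=True)
print(quick)
print(thorough)
```

In this sketch the modified item keeps its size, so it is only detected when hashes are compared, which mirrors the difference between the two command outputs above.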
Displaying the README descriptive metadata¶
To display the README metadata used to describe the dataset one can make use of
the dtool readme show command.
$ dtool readme show ~/my_datasets/chrX-rna-seq
---
description: RNA-seq sample data
creation_date: 2017-11-20
ftp: "ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/"
doi: "10.1038/nprot.2016.095"
Reporting summary information about a dataset¶
One often wants to find out how many items are in a dataset and what their
total size is. This can be achieved using the dtool summary command.
$ dtool summary ~/my_datasets/drone-images
name: drone-images
uuid: c2542c2b-d149-4f73-84bc-741bf9af918f
creator_username: hartleym
number_of_items: 59
size: 152.5MiB
frozen_at: 2017-09-19
Listing the item identifiers in a dataset¶
To list all the item identifiers in a dataset one can use the dtool identifiers
command.
$ dtool identifiers ~/my_datasets/my_rnaseq_data
b0f92a668d24a3015692b0869e2b7590a62a380c
72b24007759c0086a316d13838021c2571853a16
d4e065787eab480e9cbd2bac6988bc7717464c83
Tip
Using dtool ls on a dataset URI results in a list of item
identifiers and relpaths:
$ dtool ls ~/my_datasets/my_rnaseq_data
b0f92a668d24a3015692b0869e2b7590a62a380c - rna_seq_reads_2.fq.gz
72b24007759c0086a316d13838021c2571853a16 - rna_seq_reads_3.fq.gz
d4e065787eab480e9cbd2bac6988bc7717464c83 - rna_seq_reads_1.fq.gz
Finding out the size of an item in a dataset¶
To find the size of a specific item in a dataset one can use the dtool item properties
command. The command below accesses the properties of the item
with the identifier 58f50508c42a56919376132e36b693e9815dbd0c.
$ dtool item properties ~/my_datasets/drone-images 58f50508c42a56919376132e36b693e9815dbd0c
{
"relpath": "IMG_8585.JPG",
"size_in_bytes": 2716446,
"utc_timestamp": 1505818439.0,
"hash": "dbcb0d6f22ec660fa4ac33b3d74556f3"
}
Accessing the content of an item in a dataset¶
When all files are on local disk getting access to them is trivial. However, when files are located in some object storage system in the cloud, access may be less trivial.
dtool solves this problem by providing a call to a method that returns an absolute path on local disk with a promise that the file requested will be available from there when the call returns the path.
The dtool command line interface makes this call available as the command
dtool item fetch.
Below is an example of this command being used on a local disk file storage.
$ dtool item fetch ~/my_datasets/drone-images 58f50508c42a56919376132e36b693e9815dbd0c
/Users/olssont/my_datasets/drone-images/data/IMG_8585.JPG
Below is an example of this command being used on a dataset in the S3 bucket
dtool-demo.
$ dtool item fetch s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16 3dce23b901709a24cfbb974b70c1ef132af10a67
/Users/olssont/.cache/dtool/s3/1e47c076-2eb0-43b2-b219-fc7d419f1f16/3dce23b901709a24cfbb974b70c1ef132af10a67.txt
Processing all the items in a dataset¶
By combining the use of dtool identifiers and dtool item fetch it is
possible to create basic Bash scripts to process all the items in a dataset.
$ DS_URI=~/my_datasets/my_rnaseq_data
$ for ITEM_ID in `dtool identifiers $DS_URI`;
> do ITEM_FPATH=`dtool item fetch $DS_URI $ITEM_ID`;
> echo $ITEM_FPATH;
> done
/Users/olssont/my_datasets/my_rnaseq_data/data/rna_seq_reads_2.fq.gz
/Users/olssont/my_datasets/my_rnaseq_data/data/rna_seq_reads_3.fq.gz
/Users/olssont/my_datasets/my_rnaseq_data/data/rna_seq_reads_1.fq.gz
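The same loop can be sketched in Python. The sketch below assumes that a dtoolcore dataset exposes an `identifiers` attribute (as in the Python API example in this document) and a method named `item_content_abspath` that returns a local path for an item; the latter is an assumption that should be checked against the dtoolcore documentation. So that the example runs without dtoolcore installed, a hypothetical stand-in class `FakeDataSet` is used in place of a real dataset.

```python
def item_paths(dataset):
    """Yield (identifier, local path) for every item in a dataset-like object."""
    for identifier in dataset.identifiers:
        # Assumed method name; plays the role of "dtool item fetch".
        yield identifier, dataset.item_content_abspath(identifier)


class FakeDataSet:
    """Stand-in used so the sketch runs without dtoolcore installed."""

    def __init__(self, relpaths_by_id):
        self._relpaths = relpaths_by_id

    @property
    def identifiers(self):
        return list(self._relpaths)

    def item_content_abspath(self, identifier):
        return "/tmp/fake_dataset/data/" + self._relpaths[identifier]


dataset = FakeDataSet({
    "id-1": "rna_seq_reads_1.fq.gz",
    "id-2": "rna_seq_reads_2.fq.gz",
})
paths = dict(item_paths(dataset))
for identifier, fpath in paths.items():
    print(identifier, fpath)
```

With dtoolcore available, the stand-in would presumably be replaced by a dataset loaded with `DataSet.from_uri`.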
Tagging datasets¶
It is possible to tag datasets with labels.
To tag a dataset with the label “rnaseq” one would use the command below:
$ dtool tag set <DS_URI> rnaseq
It is possible to add more than one tag to a dataset. The command below adds the tag “A.thaliana”:
$ dtool tag set <DS_URI> A.thaliana
To list tags one would use the command below:
$ dtool tag ls <DS_URI>
This would produce the output:
A.thaliana
rnaseq
It is possible to delete a tag that has been added to a dataset:
$ dtool tag delete <DS_URI> A.thaliana
Annotating datasets¶
It is possible to annotate a dataset with so called key/value pairs. Such key/value annotations are intended to make it easy to add and access specific metadata at a per dataset level.
The difference between annotations and the descriptive metadata is that the former is easier to work with in a programmatic fashion. The descriptive metadata, stored in the dataset’s README content, is more free form. It is non-trivial to access specific pieces of information from the descriptive metadata in the dataset’s README content, whereas a dtool annotation can be easily accessed by its name (key).
To create an annotation using the dtool CLI one would use the dtool annotation set
command. For example to annotate a dataset with a “project” one would use
the command:
$ dtool annotation set <DS_URI> project world-peace
To access the “project” annotation one would use the dtool annotation get command:
$ dtool annotation get <DS_URI> project
world-peace
Annotations set using dtool annotation set are strings by default. It is possible
to set the type to int, float, and bool using the --type option. For
example to annotate a dataset with a “stars” rating one could use the command:
$ dtool annotation set --type int <DS_URI> stars 3
For more complex data structures one can set the type to json. For example:
$ dtool annotation set --type json <DS_URI> params '{"x": 3.4, "y": 5.6}'
It is possible to list all the annotations of a dataset:
$ dtool annotation ls <DS_URI>
params {"x": 3.4, "y": 5.6}
project world-peace
stars 3
To update an annotation one can use the dtool annotation set command again.
For example to show that a dataset is really fantastic one could increase its
star rating to 5:
$ dtool annotation set <DS_URI> stars 5 --type int
$ dtool annotation get <DS_URI> stars
5
Warning
There are restrictions on the characters and the length of the keys. A key has to
match the regular expression ^[a-zA-Z.-_]*$ and it must be 80 characters or fewer.
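The restriction can be tried out with a short Python check. The helper name `is_valid_key` is hypothetical and the pattern is copied verbatim from the warning; whether dtool applies it in exactly this way is an assumption. One detail worth knowing: in Python's `re` module the `.-_` inside the brackets is read as a character range from `.` to `_`, which includes digits but does not include a literal hyphen.

```python
import re

KEY_PATTERN = re.compile(r"^[a-zA-Z.-_]*$")  # copied from the warning above
MAX_KEY_LENGTH = 80


def is_valid_key(key):
    """Return True if key matches the pattern and is at most 80 characters."""
    return len(key) <= MAX_KEY_LENGTH and KEY_PATTERN.match(key) is not None


for key in ["project", "stars", "key1", "my key", "my-key", "x" * 81]:
    print(repr(key[:12]), is_valid_key(key))
```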
Working with overlays¶
Overlays provide a means to store and access per item metadata.
Display table with all per item metadata¶
It is possible to display all the per item metadata as a CSV table using the
command dtool overlays show.
$ dtool overlays show http://bit.ly/Ecoli-reads-minified
identifiers,pair_id,is_read1,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,False,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,True,ERR022075,ERR022075_1.fastq.gz
The dataset above has three overlays named: pair_id, is_read1, and
useful_name. The columns named identifiers and relpaths are
reported for bookkeeping purposes.
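The table above is plain CSV, so it can be split into one mapping per overlay with the Python standard library. The sketch below is only an illustration of the table layout; the function name `overlays_from_csv` is hypothetical and the values are kept as the strings found in the CSV.

```python
import csv
import io

TABLE = """identifiers,pair_id,is_read1,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,False,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,True,ERR022075,ERR022075_1.fastq.gz
"""

# Columns that identify the item rather than hold overlay values.
BOOKKEEPING = {"identifiers", "relpaths"}


def overlays_from_csv(text):
    """Return {overlay_name: {identifier: value}} from an overlays table."""
    overlays = {}
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column not in BOOKKEEPING:
                overlays.setdefault(column, {})[row["identifiers"]] = value
    return overlays


overlays = overlays_from_csv(TABLE)
print(sorted(overlays))
print(overlays["is_read1"]["9760280dc6313d3bb598fa03c5931a7f037d7ffc"])
```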
Accessing an overlay value of a specific dataset item¶
It is possible to get access to the value stored in an overlay for a specific
item using the command dtool item overlay.
$ dtool item overlay \
is_read1 \
http://bit.ly/Ecoli-reads-minified \
9760280dc6313d3bb598fa03c5931a7f037d7ffc
True
Creating overlays¶
Overlay creation happens in two steps.
- Create a template overlay CSV file using the format above
- Use the template to write all overlays in the template to the dataset
Creating overlay templates¶
A starting template can be created using the dtool overlays show command.
For a dataset with no overlays this will result in a table with the columns
identifiers and relpaths. The table will have one row for each item in
the dataset. One can then add columns for the overlays one would wish to
create.
However, in many cases one would want to use metadata in the items’ relpaths to generate a starting CSV template. This can be achieved using the commands:
dtool overlays template parse
dtool overlays template glob
dtool overlays template pairs
Consider for example the dataset below:
$ dtool ls http://bit.ly/Ecoli-reads-minified
8bda245a8cd526673aab775f90206c8b67d196af ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc ERR022075_1.fastq.gz
The command below could be used to generate a template for the overlays “useful_name” and “read”:
$ dtool overlays template parse \
http://bit.ly/Ecoli-reads-minified \
'{useful_name}_{read:d}.fastq.gz'
Results in the CSV output below:
identifiers,read,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,2,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,1,ERR022075,ERR022075_1.fastq.gz
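To make the parse rule concrete, the sketch below extracts the same two values from a relpath using a hand-written regular expression. This is not how dtool implements the command; the regular expression is simply an equivalent, written for this one pattern, in which `{read:d}` is taken to mean "an integer". The function name `parse_relpath` is hypothetical.

```python
import re

# Hand-written regex equivalent of '{useful_name}_{read:d}.fastq.gz'.
PATTERN = re.compile(r"^(?P<useful_name>.+?)_(?P<read>\d+)\.fastq\.gz$")


def parse_relpath(relpath):
    """Return the overlay values found in relpath, or None if it does not match."""
    match = PATTERN.match(relpath)
    if match is None:
        return None
    return {
        "useful_name": match.group("useful_name"),
        "read": int(match.group("read")),  # the ':d' part: convert to int
    }


parsed = parse_relpath("ERR022075_2.fastq.gz")
print(parsed)
```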
To ignore a variable element when parsing one can use unnamed curly braces. The command below for example only generates the overlay “useful_name”:
$ dtool overlays template parse \
http://bit.ly/Ecoli-reads-minified \
'{useful_name}_{:d}.fastq.gz'
identifiers,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075,ERR022075_1.fastq.gz
Sometimes one simply wants to create a boolean overlay based on whether or not
a particular file matches a glob pattern. The command below can be used to
create a CSV template for an overlay named is_read1:
$ dtool overlays template glob \
http://bit.ly/Ecoli-reads-minified \
is_read1 \
'*1.fastq.gz'
identifiers,is_read1,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,False,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,True,ERR022075_1.fastq.gz
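The effect of the glob pattern can be reproduced with Python's `fnmatch` module. Whether dtool uses `fnmatch` internally is an assumption; the sketch only shows which relpaths the pattern selects.

```python
from fnmatch import fnmatch

relpaths = ["ERR022075_2.fastq.gz", "ERR022075_1.fastq.gz"]

# True for relpaths matching the glob pattern, False otherwise.
is_read1 = {relpath: fnmatch(relpath, "*1.fastq.gz") for relpath in relpaths}
print(is_read1)
```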
Sometimes it is useful to be able to find pairs of items, for example when dealing with genomic sequencing data that has forward and reverse reads.
One can create a “pair_id” overlay CSV template for this dataset using the command below:
$ dtool overlays template pairs http://bit.ly/Ecoli-reads-minified .fastq.gz
identifiers,pair_id,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,ERR022075_1.fastq.gz
In the above the suffix “.fastq.gz” is used to extract the prefix ERR022075_
that is used to find matching pairs.
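One way to picture the pairing is sketched below. The rule used here (strip the suffix, then drop the final character to obtain the prefix) is an assumption inferred from the single example above, not a description of dtool's actual algorithm, and the function name `pair_ids` is hypothetical.

```python
from collections import defaultdict


def pair_ids(relpaths_by_id, suffix):
    """Map each identifier to its partner's identifier, or None if unpaired."""
    groups = defaultdict(list)
    for identifier, relpath in relpaths_by_id.items():
        if relpath.endswith(suffix):
            stem = relpath[: -len(suffix)]        # e.g. "ERR022075_1"
            groups[stem[:-1]].append(identifier)  # prefix, e.g. "ERR022075_"
    pairs = {identifier: None for identifier in relpaths_by_id}
    for members in groups.values():
        if len(members) == 2:
            first, second = members
            pairs[first], pairs[second] = second, first
    return pairs


items = {
    "8bda245a8cd526673aab775f90206c8b67d196af": "ERR022075_2.fastq.gz",
    "9760280dc6313d3bb598fa03c5931a7f037d7ffc": "ERR022075_1.fastq.gz",
}
pairs = pair_ids(items, ".fastq.gz")
print(pairs)
```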
Writing an overlay template to a dataset¶
Once one has an overlay template CSV file one can write this to a dataset:
$ dtool overlays write <DS_URI> overlays.csv
Further reading¶
For more information see https://github.com/jic-dtool/dtool-overlay
Configuring user name and email¶
When running the dtool readme interactive command the default name and email
address are Your Name and you@example.com.
$ dtool readme interactive my_dataset
description [Dataset description]:
project [Project name]:
confidential [False]:
personally_identifiable_information [False]:
name [Your Name]:
email [you@example.com]:
username [olssont]:
creation_date [2017-12-14]:
These defaults can be changed by configuring the user name and email address.
$ dtool config user name "Care A. Bout-Data"
Care A. Bout-Data
$ dtool config user email researcher@famous.uni.ac.uk
researcher@famous.uni.ac.uk
Rerunning the previous dtool readme interactive command now gives updated
defaults when prompting for input.
$ dtool readme interactive my_dataset
description [Dataset description]:
project [Project name]:
confidential [False]:
personally_identifiable_information [False]:
name [Care A. Bout-Data]:
email [researcher@famous.uni.ac.uk]:
username [olssont]:
creation_date [2017-12-14]:
Configuring the dtool cache directory¶
When fetching a dataset item from a dataset stored in object storage the file gets stored in a cache directory. The default cache directory is:
~/.cache/dtool
You may want to configure this cache to be in a different location. This can be achieved using the dtool config cache command:
$ mkdir /tmp/dtool
$ dtool config cache /tmp/dtool
It is also possible to override both the default and the configured cache
directory by exporting the environment variable DTOOL_CACHE_DIRECTORY.
This can be useful when using local SSD on a compute cluster:
$ mkdir /local/ssd/dtool
$ export DTOOL_CACHE_DIRECTORY=/local/ssd/dtool
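The order of precedence described above (environment variable, then configured value, then default) can be summarised in a few lines of Python. This is a sketch of the lookup order only: the function name `resolve_cache_directory` is hypothetical, and it is an assumption that the configured value is stored under the same key name as the environment variable.

```python
import os

DEFAULT_CACHE = os.path.join("~", ".cache", "dtool")


def resolve_cache_directory(environ, config):
    """Environment variable wins over the config value, which wins over the default."""
    key = "DTOOL_CACHE_DIRECTORY"
    value = environ.get(key) or config.get(key) or DEFAULT_CACHE
    return os.path.expanduser(value)


configured = {"DTOOL_CACHE_DIRECTORY": "/tmp/dtool"}
exported = {"DTOOL_CACHE_DIRECTORY": "/local/ssd/dtool"}

print(resolve_cache_directory({}, {}))
print(resolve_cache_directory({}, configured))
print(resolve_cache_directory(exported, configured))
```

In real use the first argument would be `os.environ` and the second the content of the dtool configuration file.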
Warning
There is no automatic mechanism built into dtool to clear up the cache. It can therefore grow very large if you are working with lots of datasets in object storage.
Configuring a custom README template¶
When running the dtool readme interactive command one is prompted to enter
the default descriptive metadata shown below.
$ dtool readme interactive my_dataset
description [Dataset description]:
project [Project name]:
confidential [False]:
personally_identifiable_information [False]:
name [Your Name]:
email [you@example.com]:
username [olssont]:
creation_date [2017-12-14]:
It is possible to configure the required metadata prompted for by the
dtool readme interactive command. This requires the creation of a
README file making use of the YAML file format.
The default template is shown below.
---
description: Dataset description
project: Project name
confidential: False
personally_identifiable_information: False
owners:
  - name: {DTOOL_USER_FULL_NAME}
    email: {DTOOL_USER_EMAIL}
    username: {username}
creation_date: {date}
# links:
# - http://doi.dx.org/your_doi
# - http://github.com/your_code_repository
# budget_codes:
# - E.g. CCBS1H10S
To create a custom template that also prompted for a species definition one
could create the file ~/custom_dtool_readme.yml with the content below.
---
description: Dataset description
project: Project name
species: A. thaliana
confidential: False
personally_identifiable_information: False
owners:
  - name: {DTOOL_USER_FULL_NAME}
    email: {DTOOL_USER_EMAIL}
    username: {username}
creation_date: {date}
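The curly-brace placeholders in the template look like Python format fields. The sketch below fills them in with `str.format` to show what a rendered README could look like. That dtool renders its template this way is an assumption, and the function name `render_readme` is hypothetical. Note that `str.format` would fail on a template containing other literal braces.

```python
import datetime
import getpass

TEMPLATE = """---
description: Dataset description
project: Project name
species: A. thaliana
confidential: False
personally_identifiable_information: False
owners:
  - name: {DTOOL_USER_FULL_NAME}
    email: {DTOOL_USER_EMAIL}
    username: {username}
creation_date: {date}
"""


def render_readme(template, full_name, email, username=None, date=None):
    """Fill in the placeholders, falling back to the login name and today's date."""
    return template.format(
        DTOOL_USER_FULL_NAME=full_name,
        DTOOL_USER_EMAIL=email,
        username=username or getpass.getuser(),
        date=date or datetime.date.today().isoformat(),
    )


readme = render_readme(
    TEMPLATE,
    "Care A. Bout-Data",
    "researcher@famous.uni.ac.uk",
    username="olssont",
    date="2017-12-14",
)
print(readme)
```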
To configure dtool to make use of this template one can use the dtool config readme-template command:
$ dtool config readme-template ~/custom_dtool_readme.yml
The dtool config readme-template command sets the
DTOOL_README_TEMPLATE_FPATH key in the ~/.config/dtool/dtool.json file.
Alternatively one can make use of the DTOOL_README_TEMPLATE_FPATH
environment variable:
$ export DTOOL_README_TEMPLATE_FPATH=~/custom_dtool_readme.yml
Re-running the previous dtool readme interactive
command now includes a prompt for the species and the default value A. thaliana:
$ dtool readme interactive my_dataset
description [Dataset description]:
project [Project name]:
species [A. thaliana]:
confidential [False]:
personally_identifiable_information [False]:
name [Your Name]:
email [you@example.com]:
username [olssont]:
creation_date [2017-12-14]:
Configuring storage brokers¶
Some remote storage brokers require extra configuration to enable authentication.
The command below configures access to an Azure storage container named
jicinformatics:
$ dtool config azure set jicinformatics the-secret-token
the-secret-token
For information on other storage brokers have a look at their documentation
and/or use dtool config --help to get more information.
Publishing a dataset¶
It is possible to publish datasets hosted in AWS S3 and Microsoft Azure Storage. A dataset is published by making it accessible via the HTTP(S) protocol.
Warning
A published dataset is accessible by anyone in the world with an internet connection!
$ dtool publish s3://dtool-demo/ba92a5fa-d3b4-4f10-bcb9-947f62e652db
Dataset accessible at https://dtool-demo.s3.amazonaws.com/ba92a5fa-d3b4-4f10-bcb9-947f62e652db
The URL returned by the dtool publish
command can be used to interact with the dataset.
$ dtool summary https://dtool-demo.s3.amazonaws.com/ba92a5fa-d3b4-4f10-bcb9-947f62e652db
name: hypocotyl3
uuid: ba92a5fa-d3b4-4f10-bcb9-947f62e652db
creator_username: olssont
number_of_items: 339
size: 86.7MiB
frozen_at: 2018-09-12
Python API¶
The dtool command line tool is built using the Python API in dtoolcore. This API can also be used to create
and interact with datasets directly.
Below is an example showing how to load a dataset from a URI and use it to print out a list of all the data item identifiers in the dataset.
>>> from dtoolcore import DataSet
>>> dataset = DataSet.from_uri("bgi-sequencing-12345")
>>> for i in dataset.identifiers:
...     print(i)
...
1c10766c4a29536bc648260f456202091e2f57b4
fbcc24bed36128535a263b74b2e138d7cc43e90c
9ca330a84f3dbbdd457a860b5e3c21c917743dd6
3dce23b901709a24cfbb974b70c1ef132af10a67
78e7f1507da598e9f6a02810c1f846cfc24fb8ad
42f43f49b74ef7f901010965aae71170c9fd3ef6
ab069337b0f86cdad899d57e8de63d5b2b680c85
b55ae3fbe6081eb2ed4ed2c4ea316dbeb943ea2c
More information on how to make use of the Python API can be found in the dtoolcore documentation.
Creating plugins¶
It is possible to create plugins to the dtool command line tool. There are
two different types of plugins: command line tools and backend storage brokers.
The former allows a developer to add custom extensions to the dtool
command. The latter allows a developer to create an interface for talking to a
new type of storage. One could for example create a storage broker to interface
with Amazon S3 object storage.
Extending the dtool command line tool¶
Information on how to extend the dtool command line tool is available in
the README file of dtool-cli.
Concrete examples making use of this plugin system are:
Creating an interface to a new type of storage¶
Below are the steps required to create a storage broker for allowing dtool
to interact with a new backend. A concrete example making use of this plugin
system is dtool-irods.
- Examine the code in dtoolcore.storagebroker.DiskStorageBroker.
- Create a Python class for your storage, e.g. MyStorageBroker.
- Add a MyStorageBroker.key attribute to the class; this key is used to look up an appropriate storage broker when interacting with a dataset.
- Add a dtoolcore.FileHasher instance that matches the hashing algorithm used by your storage to your MyStorageBroker.hasher attribute.
- Add implementations for all the public functions in the dtoolcore.storagebroker.DiskStorageBroker class to MyStorageBroker.
- Expose the MyStorageBroker class as a dtool.storage_brokers entrypoint, e.g. add a section along the lines of the below to the setup.py file:

    entry_points={
        "dtool.storage_brokers": [
            "MyStorageBroker=my_dtool_storage_plugin:MyStorageBroker",
        ],
    },
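The role of the key and hasher attributes in the steps above can be pictured with a small standalone sketch. It does not import dtoolcore: the REGISTRY dictionary is a simplified stand-in for entry point discovery, the md5-based function stands in for a dtoolcore.FileHasher instance, and the key "mystore" and the function broker_for_uri are hypothetical names chosen for illustration. A real storage broker would also need to implement the full public interface of DiskStorageBroker.

```python
import hashlib
from urllib.parse import urlparse


def md5sum_hexdigest(path):
    """Hash a file in chunks so large items need not fit in memory."""
    hasher = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


class MyStorageBroker(object):
    """Skeleton only; the methods of the real interface are omitted."""

    # Hypothetical key, matched against the scheme of a dataset URI.
    key = "mystore"

    # Stand-in for a dtoolcore.FileHasher instance (assumes md5 hashing).
    hasher = staticmethod(md5sum_hexdigest)

    def __init__(self, uri):
        self.uri = uri


# Simplified stand-in for the brokers discovered via entry points.
REGISTRY = {MyStorageBroker.key: MyStorageBroker}


def broker_for_uri(uri):
    """Pick a broker class by URI scheme and instantiate it."""
    scheme = urlparse(uri).scheme
    try:
        broker_cls = REGISTRY[scheme]
    except KeyError:
        raise ValueError("No storage broker for scheme: {!r}".format(scheme))
    return broker_cls(uri)


broker = broker_for_uri("mystore://bucket/my-dataset")
print(type(broker).__name__, broker.key)
```

Keeping the key on the class means the lookup needs nothing more than the URI scheme, which is why each broker must choose a key that no other installed broker uses.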
Citing dtool¶
Olsson TSG, Hartley M. 2019. Lightweight data management with dtool. PeerJ 7:e6562 https://doi.org/10.7717/peerj.6562
CHANGELOG¶
This project uses semantic versioning. This change log uses principles from keep a changelog.
[3.25.0] - 2020-03-25¶
Added support for tags from the dtool CLI.
Added¶
- The CLI command ‘dtool tag set’
- The CLI command ‘dtool tag ls’
- The CLI command ‘dtool tag delete’
[3.24.0] - 2020-03-23¶
Added Python API support for tags.
Added¶
- Added dtoolcore._BaseDataSet.put_tag() method
- Added dtoolcore._BaseDataSet.delete_tag() method
- Added dtoolcore._BaseDataSet.list_tags() method
- Added dtoolcore.storagebroker.BaseStorageBroker.delete_key() method
- Added dtoolcore.storagebroker.BaseStorageBroker.get_tag_key() method
- Added dtoolcore.storagebroker.BaseStorageBroker.list_tags() method
- Added dtoolcore.storagebroker.BaseStorageBroker.put_tag() method
- Added dtoolcore.storagebroker.BaseStorageBroker.delete_tag() method
- Added dtoolcore.storagebroker.DiskStorageBroker.delete_key() method
- Added dtoolcore.storagebroker.DiskStorageBroker.get_tag_key() method
- Added dtoolcore.storagebroker.DiskStorageBroker.list_tags() method
- Default cache directory changed from ~/.cache/dtool/http to ~/.cache/dtool
Fixed¶
- Cache environment variable changed from DTOOL_HTTP_CACHE_DIRECTORY to DTOOL_CACHE_DIRECTORY
[3.23.0] - 2020-02-28¶
Added¶
- Add dtool readme validate command
- Ability to update descriptive metadata in README of frozen datasets when using dtool readme write
Fixed¶
- Fixed several defects in how URIs were parsed and generated on Windows.
[3.22.0] - 2020-02-06¶
Improved Python API for creating datasets.
Added¶
- dtoolcore.create_proto_dataset() helper function
- dtoolcore.create_derived_proto_dataset() helper function
- dtoolcore.DataSetCreator helper context manager class
- dtoolcore.DerivedDataSetCreator helper context manager class
Fixed¶
- Fixed defect where using DTOOL_NUM_PROCESSES > 1 resulted in a cPickle.PicklingError on some storage brokers. Multiprocessing is now only used if the storage broker supports it.
[3.21.1] - 2020-01-23¶
- Fixed defect where ‘dtool verify’ calculated hashes even when the ‘-f/--full’ option was not specified. The ‘dtool verify’ command now runs more quickly.
[3.21.0] - 2020-01-21¶
Added¶
- Ability to use multiple processes (cores) to generate item properties for manifest files in parallel. Set the environment variable DTOOL_NUM_PROCESSES to specify the number of processes to use.
Fixed¶
- Included .dtool/annotations directory in DiskStorageBroker self description file
[3.20.0] - 2019-10-31¶
New feature: Dataset annotation
Dataset annotations are intended to make it easy to add and access specific metadata at a per dataset level.
The difference between annotations and the descriptive metadata is that the former is easier to work with in a programmatic fashion. The descriptive metadata, stored in the dataset’s README content, is more free form. It is non-trivial to access specific pieces of information from the descriptive metadata in the dataset’s README content, whereas a dtool annotation can be easily accessed by its name.
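The idea of looking up a piece of metadata by name can be sketched without dtool itself. The example below assumes one JSON file per annotation inside an annotations directory; that layout, the helper names write_annotation and read_annotation, and the example values are all assumptions made for illustration, not a description of how dtool stores annotations.

```python
import json
import os
import tempfile


def write_annotation(annotations_dir, name, value):
    """Store a JSON-serialisable value under the given annotation name."""
    os.makedirs(annotations_dir, exist_ok=True)
    with open(os.path.join(annotations_dir, name + ".json"), "w") as fh:
        json.dump(value, fh)


def read_annotation(annotations_dir, name):
    """Return the value previously stored under the annotation name."""
    with open(os.path.join(annotations_dir, name + ".json")) as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location of the annotations inside a dataset on disk.
    annotations_dir = os.path.join(tmp, ".dtool", "annotations")
    write_annotation(annotations_dir, "project", "wheat-genome")
    write_annotation(annotations_dir, "read_length", 150)

    # Values keep their type, so they can be used directly in a script.
    print(read_annotation(annotations_dir, "project"))
    print(read_annotation(annotations_dir, "read_length") * 2)
```

Compared with parsing free form README text, a named value like this needs no knowledge of how the surrounding document is structured.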
Added¶
- Added dtool annotation set command
- Added dtool annotation get command
- Added dtool annotation ls command
[3.19.0] - 2019-09-12¶
Added¶
- Added sorting of items by relpath to ‘dtool ls <DS_URI>’
Fixed¶
- Fixed formatting of ‘dtool ls <DS_URI>’ from using two whitespaces to using one tab to make it easier to work with command line tools such as cut
- Fixed ordering of lines in overlay CSV template from being sorted by the identifier to being ordered by the relpath
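A single tab as the separator also makes the listing easy to split in a script. The sketch below assumes each output line holds an identifier, one tab, and then the relpath; that column order is an assumption, and the sample text and the function name parse_ls_output are made up for illustration.

```python
def parse_ls_output(text):
    """Split tab-separated listing lines into (identifier, relpath) pairs."""
    items = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so a relpath containing tabs survives.
        identifier, relpath = line.split("\t", 1)
        items.append((identifier, relpath))
    return items


# Made-up sample standing in for captured command output.
sample = "id-one\tdata/a.csv\nid-two\tdata/b.csv\n"

for identifier, relpath in parse_ls_output(sample):
    print(relpath, "->", identifier)
```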
[3.18.0] - 2019-09-06¶
Added¶
- Added ‘dtool overlays show’ command
- Added ‘dtool overlays write’ command
- Added ‘dtool overlays template parse’ command
- Added ‘dtool overlays template glob’ command
- Added ‘dtool overlays template pairs’ command
Deprecated¶
- Deprecated ‘dtool overlay ls’
- Deprecated ‘dtool overlay show’
[3.17.0] - 2019-08-06¶
Added¶
- Added support for host name in file URI.
- Added dtool status command for working out if a dataset is frozen or not
- Added dtool uri command for expanding absolute and relative paths into proper URIs
[3.16.0] - 2019-07-12¶
Added¶
- Added more debug logging
- Added dtool config ecs ls command to list ECS base URIs that have been
- Added support for configuring access to ECS buckets in multiple namespaces
Fixed¶
- The dtool config azure ls command now returns base URIs rather than container names
[3.15.0] - 2019-04-26¶
Added¶
- dtool config readme-template CLI command for configuring the path to a custom readme template
- dtoolcore._BaseDataSet.base_uri property
- dtoolcore.storagebroker.BaseStorageBroker.generate_base_uri method
- dtoolcore.utils.DEFAULT_CACHE_PATH global helper variable
- dtoolcore.utils.get_config_value_from_file helper function
- dtoolcore.utils.write_config_value_to_file helper function
Changed¶
- dtool config cache now works with one unified cache directory for all storage brokers
- Started using unified environment variable to specify the cache directory DTOOL_CACHE_DIRECTORY
- Default cache directory set to ~/.cache/dtool
Fixed¶
- Fixed defect when username was supplied as two separate strings to dtool config user name in CLI
[3.14.0] - 2018-11-21¶
Added¶
- Added dtool publish command
- Added -f/--format option to dtool summary command to enable output in JSON format
- Added sorting of CSV/TSV/HTML inventories by dataset name
Changed¶
- Changed default output of dtool summary to be human readable YAML
[3.11.0] - 2018-09-20¶
Added¶
- dtool cp to replace dtool copy
- dtool readme write to write readme from file or stdin
- dtool item overlay command
Deprecated¶
- dtool copy in favour of dtool cp
Removed¶
- Removed created_at field from default README template
Fixed¶
- Defect in dtool create when providing a relative path to the --symlink-path option
- Python 2 defect in dealing with unicode in README.yml file when using dtool readme edit
[3.10.0] - 2018-09-11¶
Added¶
- dtoolcore.filehasher.hashsum_digest helper function
- dtoolcore.filehasher.md5sum_digest helper function
Changed¶
- Improved name from dtoolcore.filehasher.hashsum to dtoolcore.filehasher.hashsum_hexdigest
Fixed¶
- Deal with issue in how ruamel.yaml deals with float values
[3.9.0] - 2018-08-03¶
Added¶
- Added ability to update the name of a frozen dataset from the dtool CLI
- Added update_name method to DataSet class (previously only available on ProtoDataSet class)
[3.8.0] - 2018-07-31¶
Dataset name validation.
Added¶
- dtoolcore.generate_admin_metadata function raises dtoolcore.DtoolCoreInvalidNameError if invalid name is provided
- dtoolcore.utils.name_is_valid utility function for checking sanity of dataset names
- Validation of dataset name upon creation using dtool CLI
- Validation of dataset name when updating it using dtool CLI
Fixed¶
- Fixed defect where dtool ls -q was listing dataset names rather than URIs, making it impossible to process datasets in a BASE_URI programmatically
- Make SymlinkStorageBroker compatible with dtoolcore 3.4.0
[3.7.0] - 2018-07-26¶
Storage broker base class redesign and refactoring.
Added¶
- Ability to update descriptive metadata in README of frozen datasets
- Validation that the descriptive metadata provided by the dtool readme edit command is valid YAML
- Added dtoolcore.storagebroker.BaseStorageBroker
- Added logging to the reusable BaseStorageBroker methods
- get_text new method on BaseStorageBroker class
- put_text new method on BaseStorageBroker class
- get_admin_metadata_key new method on BaseStorageBroker class
- get_readme_key new method on BaseStorageBroker class
- get_manifest_key new method on BaseStorageBroker class
- get_overlay_key new method on BaseStorageBroker class
- get_structure_key new method on BaseStorageBroker class
- get_dtool_readme_key new method on BaseStorageBroker class
- get_size_in_bytes new method on BaseStorageBroker class
- get_utc_timestamp new method on BaseStorageBroker class
- get_hash new method on BaseStorageBroker class
- get_relpath new method on BaseStorageBroker class
- update_readme new method on BaseStorageBroker class
- DataSet.put_readme method that can be used to update descriptive metadata in (frozen) dataset README whilst keeping a copy of the historical README content
- Add storage_broker_version key to structure parameters
Fixed¶
- Stop copy_resume function calculating hashes unnecessarily
- Fixed the documentation of the dtool verify command
[3.6.2] - 2018-07-10¶
Fixed¶
- Default config file now set in dtoolcore.utils.get_config_value if not provided in caller
[3.6.1] - 2018-07-09¶
Fixed¶
- Made download to DTOOL_HTTP_CACHE_DIRECTORY more robust
- Added ability to deal with redirects to enable working with shortened URLs
[3.6.0] - 2018-07-05¶
Added¶
- Bundling of dtool-http package
Removed¶
- Bundling of dtool-irods package
- Bundling of dtool-s3 package
[3.5.0] - 2018-06-06¶
Added¶
- Pre-checks to ‘dtool freeze’ command to ensure that there is no rogue content in the base of disk datasets
- Added rogue content validation check to DiskStorageBroker.pre_freeze hook
[3.4.0] - 2018-05-24¶
Added¶
- Pre-checks to ‘dtool freeze’ command to ensure that the item handles are sane, i.e. that they do not contain newline characters
- Pre-checks to ‘dtool freeze’ command to ensure that there are not too many items in the proto dataset, default to less than 10000
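The two pre-checks described for 3.4.0 can be sketched as a small validation function. This is an illustration of the checks as worded above, not the dtool implementation: the function name pre_freeze_problems is hypothetical, and reading "less than 10000" as an exclusive upper bound is an assumption.

```python
MAX_ITEMS = 10000  # default quoted above; treated here as an exclusive bound


def pre_freeze_problems(handles, max_items=MAX_ITEMS):
    """Return a list of human readable problems; empty if the checks pass."""
    problems = []

    # Check 1: the proto dataset must not contain too many items.
    if len(handles) >= max_items:
        problems.append(
            "Too many items: {} (limit {})".format(len(handles), max_items)
        )

    # Check 2: item handles must not contain newline characters.
    for handle in handles:
        if "\n" in handle:
            problems.append("Handle contains newline: {!r}".format(handle))

    return problems


for problem in pre_freeze_problems(["data/a.txt", "data/b\n.txt"]):
    print(problem)
```

Collecting all problems before reporting lets a user fix everything in one pass rather than rerunning the check after each fix.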
[3.3.1] - 2018-05-18¶
Fixed¶
- Defect where inventory html template is not included in Python package on PyPI
[3.3.0] - 2018-05-18¶
Added¶
- Add “created_at” key to the administrative metadata
- dtool inventory command for generating csv/tsv/html inventories of collections of datasets
- Added support for -h flag as well as --help
- Added timestamp to logging output
Fixed¶
- Improved handling of URIs in validation code
- Fixed defect where running dtool item properties with an invalid identifier resulted in a KeyError exception being propagated to the user
- Fixed defect where dtool verify did not compare file sizes
- Fixed timestamp defect in DiskStorageBroker
[3.2.1] - 2018-05-01¶
Fixed¶
- Fixed issue arising from a file being put into iRODS and the connection breaking before the appropriate metadata could be set on the file in iRODS. See also: https://github.com/jic-dtool/dtool-irods/issues/7
[3.2.0] - 2018-02-09¶
Release to make it easier to create symlink datasets in an automated fashion.
Changed¶
- Simplified the way to specify the symbolic link path in the SymLinkStorageBroker
- The path to the data when creating a symlink dataset is now specified using the -s/--symlink-path option rather than being something that is prompted for. This makes it easier to create symlink datasets in an automated fashion.
[3.1.0] - 2018-02-05¶
Added¶
- --resume option to dtool copy command
- --quiet and --verbose options to dtool ls and improved formatting
- Add dtoolcore.copy_resume function
[3.0.0] - 2018-01-18¶
This release makes use of the dtoolcore version 3.0.0 API, which improves the handling of URIs and adds more metadata describing the structure of datasets.
Another major feature of this release is the addition of an S3 storage broker that can be used to interact with Amazon’s S3 object storage.
Added¶
- AWS S3 object storage broker
- Writing of .dtool/structure.json file to the DiskStorageBroker; a file for describing the structure of the dtool dataset in a computer readable format
- Writing of .dtool/README.txt file to the DiskStorageBroker; a file for describing the structure of the dtool dataset in a human readable format
- Writing of .dtool/structure.json file to the IrodsStorageBroker; a file for describing the structure of the dtool dataset in a computer readable format
- Writing of .dtool/README.txt file to the IrodsStorageBroker; a file for describing the structure of the dtool dataset in a human readable format
Changed¶
- Make use of dtoolcore version 3 API
Fixed¶
- Removed the historical dtool_readme key/value pair from the administrative metadata (in the .dtool/dtool file)
[2.4.0] - 2017-12-14¶
Added¶
- Ability to specify a custom README.yml template file path.
- Ability to configure the full user name for the README.yml template using DTOOL_USER_FULL_NAME
Fixed¶
- Made .dtool/manifest.json content created by DiskStorageBroker human readable by adding new lines and indentation to the JSON formatting.
- Made the DiskStorageBroker.list_overlay_names method more robust. It no longer falls over if the .dtool/overlays directory has been lost, i.e. by cloning a dataset with no overlays from a Git repository.
- Fixed defect where an incorrect URI would get set on the dataset when using DataSet.from_path class method on a relative path
- Made the YAML output more pretty by adding more indentation.
- Replaced hardcoded nbi.ac.uk email with configurable DTOOL_USER_EMAIL in the default README.yml template.
- Fixed IrodsStorageBroker.generate_uri class method
- Made .dtool/manifest.json content created by IrodsStorageBroker human readable by adding new lines and indentation to the JSON formatting.
- Added rule to catch CAT_INVALID_USER string for giving a more informative error message when iRODS authentication times out
[2.3.2] - 2017-10-25¶
Fixed¶
- Fixed issue where the symbolic link was not fully resolved when creating a symlink dataset that used the terminal to prompt for the data directory
[2.3.1] - 2017-10-25¶
Fixed¶
- More graceful exit if one presses Cancel in file browser when creating a symlink dataset
- Data directory now falls back on click command line prompt if TkInter has issues when creating a symlink dataset
[2.3.0] - 2017-10-23¶
Added¶
- pre_freeze_hook to the storage broker interface called at the beginning of ProtoDataSet.freeze method.
- --quiet flag to dtool create command
- dtool overlay ls command to list the overlays in dataset
- dtool overlay show command to show the content of a specific overlay
Changed¶
- Improved speed of freezing a dataset in iRODS by making use of caches to reduce the number of calls made to iRODS during this process
- dtool copy now specifies target location using URI rather than using the --prefix and --storage arguments
Fixed¶
- Made the DiskStorageBroker.create_structure method more robust
- More informative error message when iRODS has not been configured
- More informative error message when iRODS authentication times out
- Stopped client hanging when iRODS authentication has timed out
- storagebroker’s put_item method now returns relpath
- Made the IrodsStorageBroker.create_structure method more robust by checking if the parent collection exists
- Made error handling in dtool create more specific
- Added propagation of original error message when StorageBrokerOSError captures in dtool create
[2.2.0] - 2017-10-09¶
Added¶
- dtool ls can now be used to list the relpaths of the items in a dataset
- -f/--full flag to dtool diff command to include checking of file hashes
- -f/--full flag to dtool verify command to include checking of file hashes
Changed¶
- dtool ls now works with URIs rather than with prefix and storage arguments
- dtool diff now only compares identifiers and file sizes by default
- dtool verify now only compares identifiers and file sizes by default
Fixed¶
- Made DiskStorageBroker.list_dataset_uris class method more robust
[2.1.1] - 2017-10-05¶
Fixed¶
- Fixed defect in iRODS storage broker where files with white space resulted in broken identifiers
[2.1.0] - 2017-10-04¶
Added¶
- dtool readme show command that returns the readme content
- --quiet flag to dtool copy command
Changed¶
- Improved the dtool readme --help output
Fixed¶
- Progress bar now shows information on individual items being processed
- dtool ls now works with relative paths
- Fix defect where IrodsStorageBroker.put_item raised SystemError when trying to overwrite an existing file
[2.0.2] - 2017-09-25¶
Fixed¶
- Better validation of input in terms of base vs proto vs frozen dataset URIs
- Fixed bug where copy creates an intermediate proto dataset that self identifies as a frozen dataset.
- Fixed potential bug where a copy could convert a proto dataset to a dataset before all its overlays had been copied over
- Fixed type of “frozen_at” time stamp in admin metadata: from string to float
[2.0.0] - 2017-09-14¶
Initial release of dtool as a meta package.
MIT License¶
Copyright (c) 2017 Tjelvar Olsson
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.