Welcome to the sample-sheet Documentation!

A permissively licensed library designed to replace Illumina’s Experiment Manager.

❯ pip install sample-sheet

Or install with the Conda package manager after setting up your Bioconda channels:

❯ conda install sample-sheet

Which should be equivalent to:

❯ conda install -c bioconda -c conda-forge -c defaults sample-sheet

Features

  • Roundtrip reading, editing, and writing of Sample Sheets
  • de novo creation of Sample Sheets
  • Exporting Sample Sheets to JSON

Documentation

Quick Start

To demonstrate the features of this library we will use a test file at an HTTPS endpoint. To follow along, ensure you have the smart_open library installed!

>>> from sample_sheet import SampleSheet
>>> url = 'https://raw.githubusercontent.com/clintval/sample-sheet/master/tests/resources/paired-end-single-index.csv'
>>> sample_sheet = SampleSheet(url)

The metadata of the sample sheet can be accessed with the Header, Reads, and Settings attributes:

>>> sample_sheet.Header.Assay
'SureSelectXT'
>>> sample_sheet.Reads
[151, 151]
>>> sample_sheet.is_paired_end
True
>>> sample_sheet.Settings.BarcodeMismatches
'2'

The samples can be accessed directly or via iteration:

>>> sample_sheet.samples
[Sample({'Sample_ID': '1823A', 'Sample_Name': '1823A-tissue', 'index': 'GAATCTGA'}),
 Sample({'Sample_ID': '1823B', 'Sample_Name': '1823B-tissue', 'index': 'AGCAGGAA'}),
 Sample({'Sample_ID': '1824A', 'Sample_Name': '1824A-tissue', 'index': 'GAGCTGAA'}),
 Sample({'Sample_ID': '1825A', 'Sample_Name': '1825A-tissue', 'index': 'AAACATCG'}),
 Sample({'Sample_ID': '1826A', 'Sample_Name': '1826A-tissue', 'index': 'GAGTTAGC'}),
 Sample({'Sample_ID': '1826B', 'Sample_Name': '1823A-tissue', 'index': 'CGAACTTA'}),
 Sample({'Sample_ID': '1829A', 'Sample_Name': '1823B-tissue', 'index': 'GATAGACA'})]
>>> first_sample, *other_samples = list(sample_sheet)
>>> first_sample
Sample({'Sample_ID': '1823A', 'Sample_Name': '1823A-tissue', 'index': 'GAATCTGA'})

Defining Sample Read Structures

If a column labeled Read_Structure is provided per sample, then additional functionality is enabled.

>>> first_sample, *_ = sample_sheet.samples
>>> first_sample.Read_Structure
ReadStructure(structure='151T8B151T')
>>> first_sample.Read_Structure.total_cycles
310
>>> first_sample.Read_Structure.tokens
['151T', '8B', '151T']

Sample Sheet Creation

Sample sheets can be created de novo and written to a file-like object. The following snippet shows how to add attributes to mandatory sections, add optional user-defined sections, and add samples before writing the result to the null device (os.devnull).

>>> import os
>>> from sample_sheet import SampleSheet, Sample
>>> sample_sheet = SampleSheet()

# [Header] section
# Attributes, including those whose names contain spaces, are set with item assignment
>>> sample_sheet.Header['IEM4FileVersion'] = 4
>>> sample_sheet.Header['Investigator Name'] = 'jdoe'

# [Settings] section
>>> sample_sheet.Settings['CreateFastqForIndexReads'] = 1
>>> sample_sheet.Settings['BarcodeMismatches'] = 2

# Optional sample sheet sections can be added and then accessed
>>> sample_sheet.add_section('Manifests')
>>> sample_sheet.Manifests['PoolDNA'] = "DNAMatrix.txt"

# Specify a paired-end kit with 151 template bases per read
>>> sample_sheet.Reads = [151, 151]

# Add a single-indexed sample with a name, ID, and index
>>> sample = Sample(dict(Sample_ID='1823A', Sample_Name='1823A-tissue', index='ACGT'))
>>> sample_sheet.add_sample(sample)

# Write the Sample Sheet!
>>> with open(os.devnull, 'w') as handle:
...     sample_sheet.write(handle)

API Reference

sample_sheet module

class sample_sheet.ReadStructure(structure: str)[source]

Bases: object

An object describing the order, number, and type of bases in a read.

A read structure is a sequence of tokens in the form <number><operator> where <operator> can describe template, skip, index, or UMI bases.

Operator  Description
T         Template base (e.g. experimental DNA, RNA)
S         Bases to be skipped or ignored
B         Bases to be used as an index to identify the sample
M         Bases to be used as an index to identify the molecule

Parameters: structure – Read structure string representation.

Examples

>>> rs = ReadStructure("10M141T8B")
>>> rs.is_paired_end
False
>>> rs.has_umi
True
>>> rs.tokens
['10M', '141T', '8B']
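The <number><operator> grammar described above can be illustrated without the library. The following is a minimal sketch, assuming a regular expression is an acceptable way to split a structure into tokens; the names TOKEN_PATTERN, tokenize, and sum_cycles are hypothetical and are not part of sample_sheet.

```python
import re
from typing import List

# Hypothetical helpers for illustration; not the library's implementation.
TOKEN_PATTERN = re.compile(r"(\d+)([TSBM])")


def tokenize(structure: str) -> List[str]:
    """Split a read structure such as '151T8B151T' into its tokens."""
    tokens = [number + operator for number, operator in TOKEN_PATTERN.findall(structure)]
    # Reject strings that contain anything other than <number><operator> pairs.
    if not tokens or "".join(tokens) != structure:
        raise ValueError(f"Not a valid read structure: {structure!r}")
    return tokens


def sum_cycles(tokens: List[str], operators: str = "TSBM") -> int:
    """Sum the cycle counts of all tokens whose operator is in `operators`."""
    return sum(int(token[:-1]) for token in tokens if token[-1] in operators)


tokens = tokenize("151T8B151T")
print(tokens)                   # ['151T', '8B', '151T']
print(sum_cycles(tokens))       # 310
print(sum_cycles(tokens, "B"))  # 8
```

Because the sketch requires the joined tokens to reproduce the input exactly, a structure ending in an open-ended <+> length is rejected rather than silently truncated.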

Note

This class does not currently support read structures where the last operator has ambiguous length by using <+> preceding the <operator>.

Definitions of common read structure uses can be found at the following location:

Discussion on the topic of read structure use in hts-specs:

_sum_cycles_from_tokens(tokens: List[str]) → int[source]

Sum the total number of cycles over a list of tokens.

copy() → sample_sheet.ReadStructure[source]

Return a deep copy of this read structure.

has_indexes

Return if this read structure has any index operators.

has_skips

Return if this read structure has any skip operators.

has_umi

Return if this read structure has any UMI operators.

index_cycles

The number of cycles dedicated to indexes.

index_tokens

Return a list of all index tokens in the read structure.

is_dual_indexed

Return if this read structure is dual indexed.

is_indexed

Return if this read structure has sample indexes.

is_paired_end

Return if this read structure is paired-end.

is_single_end

Return if this read structure is single-end.

is_single_indexed

Return if this read structure is single indexed.

skip_cycles

The number of cycles dedicated to skips.

skip_tokens

Return a list of all skip tokens in the read structure.

template_cycles

The number of cycles dedicated to template.

template_tokens

Return a list of all template tokens in the read structure.

tokens

Return a list of all tokens in the read structure.

total_cycles

The total number of cycles in the structure.

umi_cycles

The number of cycles dedicated to UMI.

umi_tokens

Return a list of all UMI tokens in the read structure.
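One way to read the boolean properties above is as counts over the tokens of each operator. The sketch below assumes that "paired-end" means exactly two template tokens and "dual indexed" means exactly two index tokens; that interpretation matches the documented example ("10M141T8B" is not paired-end and has a UMI) but is an assumption about the library, and the helper names are hypothetical.

```python
from typing import Dict, List


def tokens_with(tokens: List[str], operator: str) -> List[str]:
    """Return the tokens that end with the given operator letter."""
    return [token for token in tokens if token.endswith(operator)]


def describe(tokens: List[str]) -> Dict[str, bool]:
    """Derive boolean flags from token counts (assumed semantics, see lead-in)."""
    template = tokens_with(tokens, "T")
    index = tokens_with(tokens, "B")
    return {
        "is_single_end": len(template) == 1,
        "is_paired_end": len(template) == 2,
        "is_indexed": len(index) > 0,
        "is_single_indexed": len(index) == 1,
        "is_dual_indexed": len(index) == 2,
        "has_skips": len(tokens_with(tokens, "S")) > 0,
        "has_umi": len(tokens_with(tokens, "M")) > 0,
    }


flags = describe(["10M", "141T", "8B"])
print(flags["is_paired_end"])  # False
print(flags["has_umi"])        # True
```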

class sample_sheet.Sample(data: Optional[Mapping] = None, **kwargs)[source]

Bases: requests.structures.CaseInsensitiveDict

A single sample for a sample sheet.

This class is built with the keys and values in the "[Data]" section of the sample sheet. As specified by Illumina, the only required key is:

  • "Sample_ID"

Although this library recommends you define the following column names:

  • "Sample_ID"
  • "Sample_Name"
  • "index"

If the key "Read_Structure" is provided then its value is promoted to the class ReadStructure and additional functionality is enabled.

Parameters:
  • data – Mapping of key-value pairs describing this sample.
  • kwargs – Key-value pairs describing this sample.

Examples

>>> mapping = {"Sample_ID": "87", "Sample_Name": "3T", "index": "A"}
>>> sample = Sample(mapping)
>>> sample
Sample({'Sample_ID': '87', 'Sample_Name': '3T', 'index': 'A'})
>>> sample = Sample({'Read_Structure': '151T'})
>>> sample.Read_Structure
ReadStructure(structure='151T')

to_json() → Mapping[source]

Return the properties of this Sample as a JSON-serializable mapping.

class sample_sheet.SampleSheet(path: Union[pathlib.Path, str, TextIO, None] = None)[source]

Bases: object

A representation of an Illumina sample sheet.

A sample sheet document almost conforms to the .ini standard, but not quite, so a custom parser is needed. Sample sheets are stored in plain text with comma-separated values and string quoting around any field that contains a comma. The sample sheet is composed of four standard sections, each marked by a header, plus any number of optional user-defined sections.

Title name   Description
[Header]     .ini convention
[<Other>]    .ini convention (optional, multiple, user-defined)
[Settings]   .ini convention
[Reads]      .ini convention as a vertical array of items
[Data]       table with header
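To make the section layout concrete, here is a minimal sketch of a parser that groups rows under their section title using only the standard csv module. It is not the library's parser; the function name split_sections and the inline TEXT sample are hypothetical, and real sample sheets may need more careful handling (for example of comma-padded rows with meaningful trailing cells).

```python
import csv
import io
from typing import Dict, List

# A small, made-up sample sheet used only for this illustration.
TEXT = """\
[Header]
Assay,SureSelectXT
Investigator Name,jdoe

[Reads]
151
151

[Settings]
BarcodeMismatches,2

[Data]
Sample_ID,Sample_Name,index
1823A,1823A-tissue,GAATCTGA
1823B,"1823B, tissue",AGCAGGAA
"""


def split_sections(handle) -> Dict[str, List[List[str]]]:
    """Group the rows of a sample sheet under their '[Section]' title."""
    sections: Dict[str, List[List[str]]] = {}
    current = None
    for row in csv.reader(handle):
        # Skip blank lines and rows made only of empty, comma-padded cells.
        if not any(cell.strip() for cell in row):
            continue
        first = row[0].strip()
        if first.startswith("[") and first.endswith("]"):
            current = first[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(row)
    return sections


sections = split_sections(io.StringIO(TEXT))
header = dict(row[:2] for row in sections["Header"])  # key-value pairs
reads = [int(row[0]) for row in sections["Reads"]]    # vertical array
columns, *records = sections["Data"]                  # table with header
samples = [dict(zip(columns, record)) for record in records]

print(header["Assay"])            # SureSelectXT
print(reads)                      # [151, 151]
print(samples[1]["Sample_Name"])  # 1823B, tissue
```

Using csv.reader rather than splitting on commas is what lets the quoted field "1823B, tissue" survive as a single value.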
Parameters: path – Any path supported by pathlib.Path and/or smart_open.smart_open when smart_open is installed.

_repr_tty_() → str[source]

Return a summary of this sample sheet in a TTY compatible codec.

add_sample(sample: sample_sheet.Sample) → None[source]

Add a Sample to this SampleSheet.

All samples are validated against the first sample added to the sample sheet to ensure there are no ID collisions or incompatible read structures (if supplied). All samples are also validated against the "[Reads]" section of the sample sheet if it has been defined.

The following validation is performed when adding a sample:

  • Read_Structure is identical in all samples, if supplied
  • Read_Structure is compatible with "[Reads]", if supplied
  • Samples on the same "Lane" cannot have the same "Sample_ID" and "Library_ID".
  • Samples cannot have the same "Sample_ID" if no "Lane" has been defined.
  • The same "index" or "index2" combination cannot exist per flowcell or per lane if lanes have been defined.
  • All samples have the same index design ("index", "index2") per flowcell or per lane if lanes have been defined.
Parameters: sample – Sample to add to this SampleSheet.
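The ID and index rules listed above can be sketched as a standalone check. This is an illustration of the rules as documented, not the library's code: samples are modelled as plain dictionaries, the function find_conflicts is hypothetical, and it returns messages instead of raising errors or emitting warnings. Read structure validation is left out.

```python
from typing import Dict, List, Optional, Tuple

SampleDict = Dict[str, str]  # hypothetical stand-in for the library's Sample


def index_design(sample: SampleDict) -> Tuple[bool, bool]:
    """Which of 'index' and 'index2' the sample defines."""
    return (sample.get("index") is not None, sample.get("index2") is not None)


def index_pair(sample: SampleDict) -> Tuple[Optional[str], Optional[str]]:
    """The ('index', 'index2') combination of the sample."""
    return (sample.get("index"), sample.get("index2"))


def find_conflicts(existing: List[SampleDict], new: SampleDict) -> List[str]:
    """Return messages describing conflicts between `new` and `existing`."""
    conflicts = []
    lane = new.get("Lane")
    for other in existing:
        # Compare within a lane, or across the flowcell when no lane is set.
        if other.get("Lane") != lane:
            continue
        same_id = other.get("Sample_ID") == new.get("Sample_ID")
        same_library = other.get("Library_ID") == new.get("Library_ID")
        if same_id and (lane is None or same_library):
            conflicts.append(f"duplicate sample: {new.get('Sample_ID')}")
        if index_design(other) != index_design(new):
            conflicts.append("index design differs from existing samples")
        elif any(index_design(new)) and index_pair(other) == index_pair(new):
            conflicts.append(f"duplicate index combination: {index_pair(new)}")
    return conflicts


existing = [{"Sample_ID": "1823A", "index": "GAATCTGA"}]
print(find_conflicts(existing, {"Sample_ID": "1823B", "index": "AGCAGGAA"}))  # []
print(find_conflicts(existing, {"Sample_ID": "1823A", "index": "AGCAGGAA"}))
```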

Note

It is unclear if the Illumina specification truly allows for equivalent samples to exist on the same sample sheet. To mitigate the warnings in this library when you encounter such a case, use a code pattern like the following:

>>> import warnings
>>> warnings.simplefilter("ignore")
>>> from sample_sheet import SampleSheet
>>> SampleSheet('tests/resources/single-end-colliding-sample-ids.csv');
SampleSheet('tests/resources/single-end-colliding-sample-ids.csv')

add_samples(samples: Iterable[sample_sheet.Sample]) → None[source]

Add samples in an iterable to this SampleSheet.

add_section(section_name: str) → None[source]

Add a section to the SampleSheet.

all_sample_keys

Return the unique keys of all samples in this SampleSheet.

The keys are discovered first by the order of samples and second by the order of keys upon those samples.
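The ordering rule above (samples first, then keys within each sample) can be illustrated with plain dictionaries. This is a sketch of the ordering only; the helper name ordered_unique_keys is hypothetical and the library's Sample class is replaced by dict.

```python
from typing import Dict, Iterable, List


def ordered_unique_keys(samples: Iterable[Dict[str, str]]) -> List[str]:
    """Collect keys in first-seen order across samples, without duplicates."""
    seen: Dict[str, None] = {}  # dicts preserve insertion order
    for sample in samples:
        for key in sample:
            seen.setdefault(key, None)
    return list(seen)


samples = [
    {"Sample_ID": "1823A", "index": "GAATCTGA"},
    {"Sample_ID": "1823B", "Sample_Name": "1823B-tissue", "index": "AGCAGGAA"},
]
print(ordered_unique_keys(samples))  # ['Sample_ID', 'index', 'Sample_Name']
```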

experimental_design

Return a markdown summary of the samples on this sample sheet.

This property supports displaying rendered markdown only when running within an IPython interpreter. If we are not running in an IPython interpreter, then print out a nicely formatted ASCII table.

Returns: A visual table of IDs and names for all samples.
Return type: Markdown, str

is_paired_end

Return if the samples are paired-end.

is_single_end

Return if the samples are single-end.

samples

Return the samples present in this SampleSheet.

to_json(**kwargs) → str[source]

Write this SampleSheet to JSON.

Returns: The JSON dump of all entries in this sample sheet.
Return type: str

to_picard_basecalling_params(directory: Union[str, pathlib.Path], bam_prefix: Union[str, pathlib.Path], lanes: Union[int, List[int]]) → None[source]

Writes sample and library information to a set of files for a given set of lanes.

BARCODE PARAMETERS FILES: Store information regarding the sample index sequences, sample index names, and, optionally, the library name. These files are used by Picard’s CollectIlluminaBasecallingMetrics and Picard’s ExtractIlluminaBarcodes. The output tab-separated files are formatted as:

<directory>/barcode_params.<lane>.txt

LIBRARY PARAMETERS FILES: Store information regarding the sample index sequences, sample index names, and, optionally, sample library and descriptions. A path to the resulting demultiplexed BAM file is also stored, which is used by Picard’s IlluminaBasecallsToSam. The output tab-separated files are formatted as:

<directory>/library_params.<lane>.txt

The BAM file output paths in the library parameter files are formatted as:

<bam_prefix>/<Sample_Name>.<Sample_Library>/<Sample_Name>.<index><index2>.<lane>.bam

Two files will be written to directory for all lanes specified. If the path to directory does not exist, it will be created.
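The naming patterns above can be sketched with pathlib. The code below only builds paths and does not write any files; the function names parameter_file_paths and bam_path are hypothetical, and the sample is a plain dictionary whose "Sample_Library" key is taken from the pattern shown above.

```python
from pathlib import Path
from typing import Dict, Iterable


def parameter_file_paths(directory: str, lanes: Iterable[int]) -> Dict[int, Dict[str, Path]]:
    """Return the two parameter file paths expected for each lane."""
    root = Path(directory)
    return {
        lane: {
            "barcode": root / f"barcode_params.{lane}.txt",
            "library": root / f"library_params.{lane}.txt",
        }
        for lane in lanes
    }


def bam_path(bam_prefix: str, sample: Dict[str, str], lane: int) -> Path:
    """Build the demultiplexed BAM path for one sample on one lane."""
    name = sample["Sample_Name"]
    library = sample["Sample_Library"]
    indexes = sample["index"] + sample.get("index2", "")
    return Path(bam_prefix) / f"{name}.{library}" / f"{name}.{indexes}.{lane}.bam"


sample = {"Sample_Name": "1823A-tissue", "Sample_Library": "lib1", "index": "GAATCTGA"}
print(parameter_file_paths("params", [1, 2])[2]["barcode"].as_posix())
# params/barcode_params.2.txt
print(bam_path("bams", sample, 1).as_posix())
# bams/1823A-tissue.lib1/1823A-tissue.GAATCTGA.1.bam
```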

Parameters:
  • directory – File path to the directory to write the parameter files.
  • bam_prefix – Where the demultiplexed BAMs should be written.
  • lanes – The lanes to write basecalling parameters for.

write(handle: TextIO, blank_lines: int = 1) → None[source]

Write this SampleSheet to a file-like object.

Parameters:
  • handle – Object to wrap by csv.writer.
  • blank_lines – Number of blank lines to write between sections.
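As a rough illustration of how a handle wrapped by csv.writer and a blank_lines setting might interact, the sketch below writes titled sections to an in-memory buffer. It is not the library's write method: the function write_sections and its sections argument are hypothetical, and the exact placement of blank lines in real output is an assumption.

```python
import csv
import io
from typing import Dict, List, TextIO


def write_sections(handle: TextIO, sections: Dict[str, List[List[str]]], blank_lines: int = 1) -> None:
    """Write each '[Section]' and its rows, separated by blank lines."""
    writer = csv.writer(handle, lineterminator="\n")
    for position, (title, rows) in enumerate(sections.items()):
        if position > 0:
            for _ in range(blank_lines):
                writer.writerow([])  # an empty row produces an empty line
        writer.writerow([f"[{title}]"])
        writer.writerows(rows)


sections = {
    "Header": [["Assay", "SureSelectXT"]],
    "Data": [["Sample_ID", "Sample_Name"], ["1823B", "1823B, tissue"]],
}
buffer = io.StringIO()
write_sections(buffer, sections, blank_lines=1)
print(buffer.getvalue())
```

Any file-like object with a write method works as the handle, which is why an in-memory io.StringIO can stand in for an open file here.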

sample_sheet.util module

sample_sheet.util.is_ipython_interpreter() → bool[source]

Return if we are in an IPython interpreter or not.

sample_sheet.util.maybe_render_markdown(string: str) → Any[source]

Render a string as Markdown only if in an IPython interpreter.
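A common way to detect IPython is to look for the __IPYTHON__ name that IPython injects into builtins. The sketch below uses that approach, which is not necessarily how sample_sheet.util does it; the names in_ipython and render are hypothetical.

```python
import builtins
from typing import Any


def in_ipython() -> bool:
    """Return True when running under IPython, which sets builtins.__IPYTHON__."""
    return hasattr(builtins, "__IPYTHON__")


def render(string: str) -> Any:
    """Return a Markdown display object under IPython, else the plain string."""
    if in_ipython():
        try:
            from IPython.display import Markdown
        except ImportError:
            return string
        return Markdown(string)
    return string


print(in_ipython())
print(render("| Sample_ID | Sample_Name |"))
```

Returning the unchanged string outside IPython keeps the function usable in scripts and plain terminals.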

How to Contribute

Pull requests, feature requests, and issues welcome! The complete test suite is configured through Tox:

❯ cd sample-sheet
❯ pip install tox
❯ tox  # Run entire dynamic / static analysis test suite

List all environments with:

❯ tox -av
using tox.ini: .../sample-sheet/tox.ini
using tox-3.1.2 from ../tox/__init__.py
default environments:
py36 -> run the test suite with (basepython)
py37 -> run the test suite with (basepython)
lint -> check the code style
type -> type check the library
docs -> test building of HTML docs

additional environments:
dev  -> the official sample_sheet development environment

To run just one environment:

❯ tox -e lint

To pass in positional arguments to a specified environment:

❯ tox -e py36 -- -x tests/test_sample_sheet.py