Welcome to the sample-sheet Documentation!¶
A permissively licensed library designed to replace Illumina’s Experiment Manager.
❯ pip install sample-sheet
Or install with the Conda package manager after setting up your Bioconda channels:
❯ conda install sample-sheet
Which should be equivalent to:
❯ conda install -c bioconda -c conda-forge -c defaults sample-sheet
Features¶
- Roundtrip reading, editing, and writing of Sample Sheets
- De novo creation of Sample Sheets
- Exporting Sample Sheets to JSON
Documentation¶
Quick Start¶
To demonstrate the features of this library we will use a test file at an HTTPS endpoint. To follow along, ensure you have the smart_open library installed!
>>> from sample_sheet import SampleSheet
>>> url = 'https://raw.githubusercontent.com/clintval/sample-sheet/master/tests/resources/paired-end-single-index.csv'
>>> sample_sheet = SampleSheet(url)
The metadata of the sample sheet can be accessed with the Header, Reads, and Settings attributes:
>>> sample_sheet.Header.Assay
'SureSelectXT'
>>> sample_sheet.Reads
[151, 151]
>>> sample_sheet.is_paired_end
True
>>> sample_sheet.Settings.BarcodeMismatches
'2'
The samples can be accessed directly or via iteration:
>>> sample_sheet.samples
[Sample({'Sample_ID': '1823A', 'Sample_Name': '1823A-tissue', 'index': 'GAATCTGA'}),
Sample({'Sample_ID': '1823B', 'Sample_Name': '1823B-tissue', 'index': 'AGCAGGAA'}),
Sample({'Sample_ID': '1824A', 'Sample_Name': '1824A-tissue', 'index': 'GAGCTGAA'}),
Sample({'Sample_ID': '1825A', 'Sample_Name': '1825A-tissue', 'index': 'AAACATCG'}),
Sample({'Sample_ID': '1826A', 'Sample_Name': '1826A-tissue', 'index': 'GAGTTAGC'}),
Sample({'Sample_ID': '1826B', 'Sample_Name': '1823A-tissue', 'index': 'CGAACTTA'}),
Sample({'Sample_ID': '1829A', 'Sample_Name': '1823B-tissue', 'index': 'GATAGACA'})]
>>> first_sample, *other_samples = list(sample_sheet)
>>> first_sample
Sample({'Sample_ID': '1823A', 'Sample_Name': '1823A-tissue', 'index': 'GAATCTGA'})
Defining Sample Read Structures¶
If a column labeled Read_Structure is provided per sample, then additional functionality is enabled.
>>> first_sample, *_ = sample_sheet.samples
>>> first_sample.Read_Structure
ReadStructure(structure='151T8B151T')
>>> first_sample.Read_Structure.total_cycles
310
>>> first_sample.Read_Structure.tokens
['151T', '8B', '151T']
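To make the token notation concrete, here is a minimal standalone sketch of how a read structure string could be split into tokens and summed into cycles. It uses only the standard library. The helper names `split_tokens` and `sum_cycles` are hypothetical and are not part of sample_sheet, whose actual implementation may differ.

```python
import re
from typing import List

# One token is a cycle count followed by an operator: T, S, B, or M.
TOKEN_PATTERN = re.compile(r"(\d+)([TSBM])")


def split_tokens(structure: str) -> List[str]:
    """Split a read structure string into its <number><operator> tokens."""
    tokens = [number + operator for number, operator in TOKEN_PATTERN.findall(structure)]
    # Reject strings with characters that are not part of any token.
    if "".join(tokens) != structure:
        raise ValueError(f"Not a valid read structure: {structure!r}")
    return tokens


def sum_cycles(tokens: List[str]) -> int:
    """Sum the cycle counts over a list of tokens."""
    return sum(int(token[:-1]) for token in tokens)


tokens = split_tokens("151T8B151T")
print(tokens)              # ['151T', '8B', '151T']
print(sum_cycles(tokens))  # 310
```

The result matches the `tokens` and `total_cycles` values shown above for the same structure string.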
Sample Sheet Creation¶
Sample sheets can be created de novo and written to a file-like object. The following snippet shows how to add attributes to mandatory sections, add optional user-defined sections, and add samples before writing the result to os.devnull, which discards it.
>>> import os
>>> from sample_sheet import SampleSheet, Sample
>>> sample_sheet = SampleSheet()
# [Header] section
# Attributes, including those with spaces in their names, are set by item assignment
>>> sample_sheet.Header['IEM4FileVersion'] = 4
>>> sample_sheet.Header['Investigator Name'] = 'jdoe'
# [Settings] section
>>> sample_sheet.Settings['CreateFastqForIndexReads'] = 1
>>> sample_sheet.Settings['BarcodeMismatches'] = 2
# Optional sample sheet sections can be added and then accessed
>>> sample_sheet.add_section('Manifests')
>>> sample_sheet.Manifests['PoolDNA'] = "DNAMatrix.txt"
# Specify a paired-end kit with 151 template bases per read
>>> sample_sheet.Reads = [151, 151]
# Add a single-indexed sample with a name, ID, and index
>>> sample = Sample(dict(Sample_ID='1823A', Sample_Name='1823A-tissue', index='ACGT'))
>>> sample_sheet.add_sample(sample)
# Write the Sample Sheet!
>>> sample_sheet.write(open(os.devnull, 'w'))
API Reference¶
sample_sheet module¶
class sample_sheet.ReadStructure(structure: str)
Bases: object
An object describing the order, number, and type of bases in a read.
A read structure is a sequence of tokens in the form <number><operator>, where <operator> can describe template, skip, index, or UMI bases.

Operator  Description
T         Template base (e.g. experimental DNA, RNA)
S         Bases to be skipped or ignored
B         Bases to be used as an index to identify the sample
M         Bases to be used as an index to identify the molecule

Parameters: structure – Read structure string representation.
Examples
>>> rs = ReadStructure("10M141T8B")
>>> rs.is_paired_end
False
>>> rs.has_umi
True
>>> rs.tokens
['10M', '141T', '8B']
Note
This class does not currently support read structures where the last operator has ambiguous length by using <+> preceding the <operator>.
Definitions of common read structure uses can be found at the following location:
Discussion on the topic of read structure use in hts-specs:
- _sum_cycles_from_tokens(tokens: List[str]) → int
  Sum the total number of cycles over a list of tokens.
- has_indexes
  Return whether this read structure has any index operators.
- has_skips
  Return whether this read structure has any skip operators.
- has_umi
  Return whether this read structure has any UMI operators.
- index_cycles
  The number of cycles dedicated to indexes.
- index_tokens
  Return a list of all index tokens in the read structure.
- is_dual_indexed
  Return whether this read structure is dual indexed.
- is_indexed
  Return whether this read structure has sample indexes.
- is_paired_end
  Return whether this read structure is paired-end.
- is_single_end
  Return whether this read structure is single-end.
- is_single_indexed
  Return whether this read structure is single indexed.
- skip_cycles
  The number of cycles dedicated to skips.
- skip_tokens
  Return a list of all skip tokens in the read structure.
- template_cycles
  The number of cycles dedicated to template.
- template_tokens
  Return a list of all template tokens in the read structure.
- tokens
  Return a list of all tokens in the read structure.
- total_cycles
  The total number of cycles in the structure.
- umi_cycles
  The number of cycles dedicated to UMI.
- umi_tokens
  Return a list of all UMI tokens in the read structure.
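As a rough illustration of how the properties listed for ReadStructure relate to the tokens, the sketch below derives a few of them from a structure string with the standard library alone. The helper `tokens_of` is hypothetical, and the rules used here (paired-end means exactly two template segments, dual indexed means exactly two index segments) are assumptions for illustration, not a statement of how the library computes them.

```python
import re
from typing import List


def tokens_of(structure: str, operator: str) -> List[str]:
    """Return the tokens in `structure` that use the given operator."""
    pairs = re.findall(r"(\d+)([TSBM])", structure)
    return [number + op for number, op in pairs if op == operator]


structure = "10M141T8B"

template_tokens = tokens_of(structure, "T")
index_tokens = tokens_of(structure, "B")
umi_tokens = tokens_of(structure, "M")

# Assumption: paired-end means exactly two template segments,
# and dual indexed means exactly two index segments.
is_paired_end = len(template_tokens) == 2
is_dual_indexed = len(index_tokens) == 2
has_umi = len(umi_tokens) > 0
umi_cycles = sum(int(token[:-1]) for token in umi_tokens)

print(template_tokens, is_paired_end)  # ['141T'] False
print(index_tokens, is_dual_indexed)   # ['8B'] False
print(has_umi, umi_cycles)             # True 10
```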
class sample_sheet.Sample(data: Optional[Mapping] = None, **kwargs)
Bases: requests.structures.CaseInsensitiveDict
A single sample for a sample sheet.
This class is built with the keys and values in the "[Data]" section of the sample sheet. As specified by Illumina, the only required key is:
- "Sample_ID"
This library, however, recommends defining the following column names:
- "Sample_ID"
- "Sample_Name"
- "index"
If the key "Read_Structure" is provided, then its value is promoted to the class ReadStructure and additional functionality is enabled.
Parameters:
- data – Mapping of key-value pairs describing this sample.
- kwargs – Key-value pairs describing this sample.
Examples
>>> mapping = {"Sample_ID": "87", "Sample_Name": "3T", "index": "A"}
>>> sample = Sample(mapping)
>>> sample
Sample({'Sample_ID': '87', 'Sample_Name': '3T', 'index': 'A'})
>>> sample = Sample({'Read_Structure': '151T'})
>>> sample.Read_Structure
ReadStructure(structure='151T')
class sample_sheet.SampleSheet(path: Union[pathlib.Path, str, TextIO, None] = None)
Bases: object
A representation of an Illumina sample sheet.
A sample sheet document almost conforms to the .ini standards, but does not, so a custom parser is needed. Sample sheets are stored in plain text with comma-separated values and string quoting around any field which contains a comma. The sample sheet is composed of four required sections and any number of optional, user-defined sections, each marked by a header.

Title name   Description
[Header]     .ini convention
[<Other>]    .ini convention (optional, multiple, user-defined)
[Settings]   .ini convention
[Reads]      .ini convention as a vertical array of items
[Data]       table with header

Parameters: path – Any path supported by pathlib.Path and/or smart_open.smart_open when smart_open is installed.
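The section layout described for SampleSheet can be sketched with the standard csv module. This is only an illustration under stated assumptions: the sample sheet text is made up, `split_sections` is a hypothetical helper, and the library's own parser may handle more cases (for example padding commas or validation).

```python
import csv
import io
from typing import Dict, List

# A tiny, made-up sample sheet used only for this illustration.
TEXT = """\
[Header]
IEM4FileVersion,4
Investigator Name,jdoe

[Reads]
151
151

[Data]
Sample_ID,Sample_Name,index
1823A,1823A-tissue,GAATCTGA
"""


def split_sections(handle) -> Dict[str, List[List[str]]]:
    """Group the non-blank CSV rows of a sample sheet under their section title."""
    sections: Dict[str, List[List[str]]] = {}
    current = None
    for row in csv.reader(handle):
        # Skip blank lines and rows made only of empty cells.
        if not any(cell.strip() for cell in row):
            continue
        first = row[0].strip()
        if first.startswith("[") and first.endswith("]"):
            current = first[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(row)
    return sections


sections = split_sections(io.StringIO(TEXT))
header = dict(sections["Header"])                    # key-value pairs
reads = [int(row[0]) for row in sections["Reads"]]   # vertical array of items
columns, *rows = sections["Data"]                    # table with a header row
samples = [dict(zip(columns, row)) for row in rows]

print(header["Investigator Name"])  # jdoe
print(reads)                        # [151, 151]
print(samples[0]["index"])          # GAATCTGA
```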
- add_sample(sample: sample_sheet.Sample) → None
  Add a Sample to this SampleSheet.
  All samples are validated against the first sample added to the sample sheet to ensure there are no ID collisions or incompatible read structures (if supplied). All samples are also validated against the "[Reads]" section of the sample sheet if it has been defined.
  The following validation is performed when adding a sample:
  - Read_Structure is identical in all samples, if supplied.
  - Read_Structure is compatible with "[Reads]", if supplied.
  - Samples on the same "Lane" cannot have the same "Sample_ID" and "Library_ID".
  - Samples cannot have the same "Sample_ID" if no "Lane" has been defined.
  - The same "index" or "index2" combination cannot exist per flowcell or per lane if lanes have been defined.
  - All samples have the same index design ("index", "index2") per flowcell or per lane if lanes have been defined.
  Parameters: sample – Sample to add to this SampleSheet.
  Note
  It is unclear if the Illumina specification truly allows for equivalent samples to exist on the same sample sheet. To mitigate the warnings in this library when you encounter such a case, use a code pattern like the following:
  >>> import warnings
  >>> warnings.simplefilter("ignore")
  >>> from sample_sheet import SampleSheet
  >>> SampleSheet('tests/resources/single-end-colliding-sample-ids.csv');
  SampleSheet('tests/resources/single-end-colliding-sample-ids.csv')
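The ID collision rules for add_sample, and a narrower way to silence the resulting warnings, can be sketched as follows. The function `warn_on_id_collisions` is a hypothetical stand-in written for this example; it covers only the two "Sample_ID" rules and is not the library's validation code. The part that carries over to real use is `warnings.catch_warnings()`, a standard-library context manager that limits the "ignore" filter to one block instead of the whole process.

```python
import warnings
from typing import Dict, List


def warn_on_id_collisions(samples: List[Dict[str, str]]) -> None:
    """Warn when two samples collide on ID, following the two ID rules above."""
    seen = set()
    for sample in samples:
        lane = sample.get("Lane")
        if lane is None:
            # No lane defined: the Sample_ID alone must be unique.
            key = (sample.get("Sample_ID"),)
        else:
            # Lane defined: Sample_ID plus Library_ID must be unique per lane.
            key = (lane, sample.get("Sample_ID"), sample.get("Library_ID"))
        if key in seen:
            warnings.warn(f"Two equivalent samples exist: {key}", UserWarning)
        seen.add(key)


samples = [
    {"Sample_ID": "1823A", "index": "GAATCTGA"},
    {"Sample_ID": "1823A", "index": "AGCAGGAA"},
]

# Scope the suppression to one block instead of silencing warnings globally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    warn_on_id_collisions(samples)  # nothing is reported here

# Outside that block the same call warns again; record it to inspect it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_id_collisions(samples)

print(len(caught))  # 1
```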
- add_samples(samples: Iterable[sample_sheet.Sample]) → None
  Add samples in an iterable to this SampleSheet.
- add_section(section_name: str) → None
  Add a section to the SampleSheet.
- all_sample_keys
  Return the unique keys of all samples in this SampleSheet.
  The keys are discovered first by the order of samples and second by the order of keys upon those samples.
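The ordering rule for all_sample_keys (first by sample order, then by key order within each sample) can be illustrated with plain dictionaries. `ordered_unique_keys` is a hypothetical helper for this sketch and does not claim to be the library's implementation.

```python
from typing import Dict, Iterable, List


def ordered_unique_keys(samples: Iterable[Dict[str, str]]) -> List[str]:
    """Collect keys in first-seen order: by sample, then by key within a sample."""
    seen: Dict[str, None] = {}  # dicts preserve insertion order
    for sample in samples:
        for key in sample:
            seen.setdefault(key, None)
    return list(seen)


samples = [
    {"Sample_ID": "1823A", "index": "GAATCTGA"},
    {"Sample_ID": "1823B", "Sample_Name": "1823B-tissue", "index": "AGCAGGAA"},
]
print(ordered_unique_keys(samples))  # ['Sample_ID', 'index', 'Sample_Name']
```

Note that "Sample_Name" comes last because it first appears on the second sample, even though it precedes "index" there.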
- experimental_design
  Return a markdown summary of the samples on this sample sheet.
  Rendered markdown is displayed only when running within an IPython interpreter. Outside an IPython interpreter, a nicely formatted ASCII table is produced instead.
  Returns: A visual table of IDs and names for all samples.
  Return type: Markdown, str
- is_paired_end
  Return whether the samples are paired-end.
- is_single_end
  Return whether the samples are single-end.
- samples
  Return the samples present in this SampleSheet.
- to_json(**kwargs) → str
  Write this SampleSheet to JSON.
  Returns: The JSON dump of all entries in this sample sheet.
  Return type: str
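For a sense of what a JSON dump of a sample sheet might look like, here is a small sketch built on the standard json module. Both the document layout (one key per section, samples under "Data") and the idea that keyword arguments are forwarded to `json.dumps` are assumptions made for this example; check the output of the real to_json before relying on either.

```python
import json

# Hypothetical layout: one key per section, with samples under "Data".
sheet = {
    "Header": {"IEM4FileVersion": "4", "Investigator Name": "jdoe"},
    "Reads": [151, 151],
    "Settings": {"BarcodeMismatches": "2"},
    "Data": [
        {"Sample_ID": "1823A", "Sample_Name": "1823A-tissue", "index": "GAATCTGA"},
    ],
}


def to_json(document: dict, **kwargs) -> str:
    """Serialize the document, forwarding keyword arguments to json.dumps."""
    return json.dumps(document, **kwargs)


text = to_json(sheet, indent=2, sort_keys=True)
print(text)

# The dump parses back to the same structure.
restored = json.loads(text)
print(restored == sheet)  # True
```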
- to_picard_basecalling_params(directory: Union[str, pathlib.Path], bam_prefix: Union[str, pathlib.Path], lanes: Union[int, List[int]]) → None
  Writes sample and library information to a set of files for a given set of lanes.
  BARCODE PARAMETERS FILES: Store information regarding the sample index sequences, sample index names, and, optionally, the library name. These files are used by Picard’s CollectIlluminaBasecallingMetrics and Picard’s ExtractIlluminaBarcodes. The output tab-separated files are written to:
  <directory>/barcode_params.<lane>.txt
  LIBRARY PARAMETERS FILES: Store information regarding the sample index sequences, sample index names, and, optionally, sample library and descriptions. A path to the resulting demultiplexed BAM file is also stored, which is used by Picard’s IlluminaBasecallsToSam. The output tab-separated files are written to:
  <directory>/library_params.<lane>.txt
  The BAM file output paths in the library parameter files are formatted as:
  <bam_prefix>/<Sample_Name>.<Sample_Library>/<Sample_Name>.<index><index2>.<lane>.bam
  Two files will be written to directory for each of the lanes specified. If the path to directory does not exist, it will be created.
  Parameters:
  - directory – File path to the directory to write the parameter files.
  - bam_prefix – Where the demultiplexed BAMs should be written.
  - lanes – The lanes to write basecalling parameters for.
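The BAM output path pattern documented for to_picard_basecalling_params can be spelled out with pathlib. This is a sketch under assumptions: `bam_path` is a hypothetical helper, the sample is a plain dict whose keys mirror the placeholders in the pattern, and a sample without "index2" is assumed to contribute an empty string for that placeholder.

```python
from pathlib import Path
from typing import Dict, Union


def bam_path(bam_prefix: Union[str, Path], sample: Dict[str, str], lane: int) -> Path:
    """Build the demultiplexed BAM path for one sample on one lane."""
    name = sample["Sample_Name"]
    library = sample["Sample_Library"]
    # Assumption: a single-indexed sample has no index2, so nothing is appended.
    barcode = sample["index"] + sample.get("index2", "")
    return Path(bam_prefix) / f"{name}.{library}" / f"{name}.{barcode}.{lane}.bam"


sample = {"Sample_Name": "1823A-tissue", "Sample_Library": "lib1", "index": "GAATCTGA"}
print(bam_path("/data/bams", sample, lane=1).as_posix())
# /data/bams/1823A-tissue.lib1/1823A-tissue.GAATCTGA.1.bam
```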
- write(handle: TextIO, blank_lines: int = 1) → None
  Write this SampleSheet to a file-like object.
  Parameters:
  - handle – Object to wrap by csv.writer.
  - blank_lines – Number of blank lines to write between sections.
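To show what "a file-like object wrapped by csv.writer" and "blank lines between sections" mean in practice, the sketch below writes a few sections to an in-memory buffer. `write_sections` is a hypothetical helper and the section contents are made up; the exact output of the library's write method (for example row padding) may differ.

```python
import csv
import io
from typing import Dict, List, TextIO


def write_sections(handle: TextIO, sections: Dict[str, List[List[str]]], blank_lines: int = 1) -> None:
    """Write each section title and its rows, separated by blank lines."""
    writer = csv.writer(handle, lineterminator="\n")
    names = list(sections)
    for position, name in enumerate(names):
        writer.writerow([f"[{name}]"])
        writer.writerows(sections[name])
        # Separate sections, but do not pad after the last one.
        if position < len(names) - 1:
            handle.write("\n" * blank_lines)


sections = {
    "Header": [["Investigator Name", "jdoe"]],
    "Reads": [["151"], ["151"]],
    "Data": [["Sample_ID", "index"], ["1823A", "ACGT"]],
}

buffer = io.StringIO()  # any file-like object works, including an open file
write_sections(buffer, sections, blank_lines=1)
print(buffer.getvalue())
```

Writing to an `io.StringIO` buffer like this is also a convenient way to inspect the output in tests before writing to a real file.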
How to Contribute¶
Pull requests, feature requests, and issues welcome!
The complete test suite is configured through Tox:
❯ cd sample-sheet
❯ pip install tox
❯ tox # Run entire dynamic / static analysis test suite
List all environments with:
❯ tox -av
using tox.ini: .../sample-sheet/tox.ini
using tox-3.1.2 from ../tox/__init__.py
default environments:
py36 -> run the test suite with (basepython)
py37 -> run the test suite with (basepython)
lint -> check the code style
type -> type check the library
docs -> test building of HTML docs
additional environments:
dev -> the official sample_sheet development environment
To run just one environment:
❯ tox -e lint
To pass in positional arguments to a specified environment:
❯ tox -e py36 -- -x tests/test_sample_sheet.py