poretools: a toolkit for working with nanopore sequencing data from Oxford Nanopore.¶
The MinION (TM) from Oxford Nanopore Technologies (ONT) is the first nanopore sequencer to be commercialised and is now available to early-access users. The MinION (TM) is a USB-connected, portable nanopore sequencer which permits real-time analysis of streaming event data. Currently, the research community lacks a standardized toolkit for the analysis of nanopore datasets.
We have therefore developed poretools, a flexible toolkit for exploring datasets generated by nanopore sequencing devices such as the MinION, for the purposes of quality control and downstream analysis.
Poretools operates directly on the native FAST5 file format (a variant of the HDF5 standard) produced by ONT and provides a wealth of format conversion utilities and data exploration and visualization tools.
A preprint of the poretools manuscript is available on bioRxiv: http://biorxiv.org/content/early/2014/07/23/007401
Below are a few examples of common usage.
- Extract sequences in FASTQ format from a set of FAST5 files.
poretools fastq fast5/
- Make a collector’s curve of the yield from a sequencing run.
poretools yield_plot --plot-type reads fast5/
- Plot a histogram of read sizes from a set of FAST5 files.
poretools hist fast5/
Installation¶
Basic Installation¶
git clone https://github.com/arq5x/poretools
cd poretools
Install as root:
python setup.py install
Install as a plain old user who has root access:
sudo python setup.py install
Install as a plain old user who lacks sudo privileges:
# details: https://docs.python.org/2/install/index.html#alternate-installation-the-user-scheme
python setup.py install --user
# now update your PATH such that it includes the directory to which poretools was just copied.
# look for a line in the installation log like: Installing poretools script to /home/arq5x/.local/bin
# in this case, I would either add that path to the PATH environment variable for the current session:
export PATH=$PATH:/home/arq5x/.local/bin
# or, better yet add it to your .bashrc file.
# at this point you should be able to run the poretools executable from anywhere on your system.
poretools --help
Installing on Windows with MinKNOW installed¶
MinKNOW installs the Anaconda distribution of Python, which means that h5py, matplotlib, and pandas are already installed.
However, currently MinKNOW does not update the Windows registry to specify that Anaconda is the default version of Python, which makes installing packages tricky. To address this, some changes need to be made to the registry. This can be fixed by downloading the following file:
Ensure it is named ‘poretools.reg’ and then run it (by double-clicking). Windows will prompt you about making changes to the registry, which you should agree to.
Now, you need to install seaborn, which is the plotting package that poretools uses as a replacement for R and rpy2 as of version 0.5.1.
conda install seaborn
If conda cannot install seaborn, you could consider installing pip and running:
pip install seaborn
Then, to install poretools, simply download and run the Windows installer:
Installing on OS X¶
First, you should install a proper package manager for OS X. In our experience, HomeBrew works extremely well.
To install HomeBrew, you run the following command (lifted from the HomeBrew site):
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
Using HomeBrew, install HDF5 from the HomeBrew Science “tap”:
brew tap homebrew/science
brew install hdf5
You will also need Cython and numpy packages (if they are not already installed):
pip install cython
pip install numpy
Now, you will need to install the R statistical analysis software (you may already have this...). The CRAN website houses automatic installation packages for different versions of OS X. Here are links to such packages for Snow Leopard and higher as well as Mavericks.
At this point, you can install poretools.
git clone https://github.com/arq5x/poretools
cd poretools
Install as an administrator of your machine:
sudo python setup.py install
Install as a plain old user who lacks sudo privileges:
# details: https://docs.python.org/2/install/index.html#alternate-installation-the-user-scheme
python setup.py install --user
Installing dependencies on Ubuntu¶
Package dependencies
sudo apt-get install git python-setuptools python-dev cython libhdf5-serial-dev
Then install R 3.0, which requires a bit of hacking. Replace ‘precise’ with the appropriate release name if you are on a different Ubuntu version; see <http://cran.r-project.org/bin/linux/ubuntu/README> for more details.
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
Open the file /etc/apt/sources.list in a text editor (as sudo) and add the following line to the bottom, for Ubuntu 12.04:
deb http://www.stats.bris.ac.uk/R/bin/linux/ubuntu precise/
Or, for Ubuntu 14.04:
deb http://www.stats.bris.ac.uk/R/bin/linux/ubuntu trusty/
Then install poretools, finally:
git clone https://github.com/arq5x/poretools
cd poretools
sudo python setup.py install
poretools
In the cloud¶
Amazon Web Services machine image ID: ami-4c0ec424
Via docker¶
Build the docker container yourself (preferred):
git clone https://github.com/arq5x/poretools
cd poretools
docker build -t poretools .
docker run poretools --help
Or use the pre-built image from Docker Hub:
docker pull stephenturner/poretools
docker run stephenturner/poretools --help
To run the poretools container on data residing on the host machine, run docker run --help and look at the help for the -v option.
Options¶
The following demonstrates the options available in poretools.
poretools --help
usage: poretools [-h] [-v]
{combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,wiggle,times}
...
optional arguments:
-h, --help show this help message and exit
-v, --version Installed poretools version
[sub-commands]:
{combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,wiggle,times}
combine Combine a set of FAST5 files in a TAR achive
fastq Extract FASTQ sequences from a set of FAST5 files
fasta Extract FASTA sequences from a set of FAST5 files
stats Get read size stats for a set of FAST5 files
hist Plot read size histogram for a set of FAST5 files
events Extract each nanopore event for each read
readstats Extract signal information for each read over time.
tabular Extract the lengths and name/seq/quals from a set of
FAST5 files in TAB delimited format
nucdist Get the nucl. composition of a set of FAST5 files
qualdist Get the qual score composition of a set of FAST5 files
winner Get the longest read from a set of FAST5 files
squiggle Plot the observed signals for FAST5 reads
times Return the start times from a set of FAST5 files in
tabular format
yield_plot Plot the yield over time for a set of FAST5 files
IPython Notebook¶
An IPython notebook demonstrating the functionality and output of poretools is available in the repository.
Use this link to view it via the nbviewer service: <http://nbviewer.ipython.org/github/arq5x/poretools/blob/master/poretools/ipynb/test_run_report.ipynb>
Usage examples¶
Note
In the following examples, test_data can be replaced with the directory containing the FAST5 files from your own runs. If you are new to ONT sequencing, the test_data directory is shipped with poretools for experimentation.
poretools fastq¶
Extract sequences in FASTQ format from a set of FAST5 files.
poretools fastq test_data/*.fast5
Or, if there are too many files for your OS to do the wildcard expansion, just provide a directory. poretools will automatically find all of the FAST5 files in the directory.
poretools fastq test_data/
Additional options restrict the output by read length or by read type:
poretools fastq test_data/
poretools fastq --min-length 5000 test_data/
poretools fastq --max-length 5000 test_data/
poretools fastq --type all test_data/
poretools fastq --type fwd test_data/
poretools fastq --type rev test_data/
poretools fastq --type 2D test_data/
poretools fastq --type fwd,rev test_data/
A type of “best” will extract the 2D read, if it exists. If not, it will extract either the template or complement read, whichever is available and has a better average Phred score.
poretools fastq --type best test_data/
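The “best” rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the poretools implementation: the function names, the dictionary layout, and the Phred+33 quality encoding are assumptions made for the example.

```python
# Hypothetical sketch of the "best" selection rule; not the poretools source.

def mean_phred(quals):
    """Average Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quals:
        return 0.0
    return sum(ord(c) - 33 for c in quals) / len(quals)


def pick_best(reads):
    """Choose a read type from a dict that maps a type name ('2D', 'template',
    'complement') to a (sequence, quality_string) tuple.
    Returns the chosen type name, or None if no read is available."""
    # The 2D read wins whenever it exists.
    if "2D" in reads:
        return "2D"
    # Otherwise keep whichever 1D read is present and has the higher mean quality.
    candidates = [t for t in ("template", "complement") if t in reads]
    if not candidates:
        return None
    return max(candidates, key=lambda t: mean_phred(reads[t][1]))


if __name__ == "__main__":
    example = {
        "template": ("ACGT", "%%%%"),    # Phred 4 at every base
        "complement": ("ACGT", "++++"),  # Phred 10 at every base
    }
    print(pick_best(example))
```

With no 2D read in the example, the complement read is selected because its average Phred score is higher.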
Only extract sequence with more complement events than template. These are the so-called “high quality 2D reads” and are the most accurate sequences from a given run.
poretools fastq --type 2D --high-quality test_data/
The data in FASTQ format are written to standard output.
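Because the records arrive on standard output, they can be piped into any downstream script. The following is a minimal Python 3 sketch of such a consumer. It assumes four lines per record (no wrapped sequences); the function name read_fastq and the sample records are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ stream.
    Assumes exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual  # drop the leading '@'


if __name__ == "__main__":
    # Stand-in for real poretools output; replace with sys.stdin in a pipeline.
    sample = io.StringIO("@read_1\nACGTACGT\n+\n!!!!++++\n@read_2\nACG\n+\n%%%\n")
    for name, seq, qual in read_fastq(sample):
        print(name, len(seq))
```

In a real pipeline the StringIO object would be swapped for sys.stdin, so that the script can sit on the right-hand side of a pipe from poretools fastq.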
poretools fasta¶
Extract sequences in FASTA format from a set of FAST5 files.
poretools fasta test_data/
poretools fasta --min-length 5000 test_data/
poretools fasta --max-length 5000 test_data/
poretools fasta --type all test_data/
poretools fasta --type fwd test_data/
poretools fasta --type rev test_data/
poretools fasta --type 2D test_data/
poretools fasta --type fwd,rev test_data/
poretools fasta --type best test_data/
The data in FASTA format are written to standard output.
poretools combine¶
Create a tarball from a set of FAST5 (HDF5) files.
# plain tar (recommended for speed)
poretools combine -o foo.fast5.tar test_data/*.fast5
# gzip
poretools combine -o foo.fast5.tar.gz test_data/*.fast5
# bzip2
poretools combine -o foo.fast5.tar.bz2 test_data/*.fast5
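To check what ended up in such an archive, the Python standard library can read all three variants. The sketch below is an assumption-laden illustration rather than part of poretools: the helper name list_fast5_members is invented, and the demo builds a tiny stand-in archive with empty files so that it runs without real data.

```python
import os
import tarfile
import tempfile


def list_fast5_members(tar_path):
    """Return the names of the .fast5 files stored in a tar archive."""
    # Mode "r:*" lets tarfile detect plain, gzip, or bzip2 archives by itself.
    with tarfile.open(tar_path, "r:*") as archive:
        return [m.name for m in archive.getmembers()
                if m.isfile() and m.name.endswith(".fast5")]


if __name__ == "__main__":
    names = ("a.fast5", "b.fast5", "notes.txt")
    with tempfile.TemporaryDirectory() as tmp:
        # Create empty placeholder files and pack them into a gzip tarball.
        for name in names:
            open(os.path.join(tmp, name), "w").close()
        tar_path = os.path.join(tmp, "foo.fast5.tar.gz")
        with tarfile.open(tar_path, "w:gz") as archive:
            for name in names:
                archive.add(os.path.join(tmp, name), arcname=name)
        print(list_fast5_members(tar_path))
```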
poretools yield_plot¶
Create a collector’s curve reflecting the sequencing yield over time for a set of reads. There are two types of plots. The first is the yield of reads over time:
poretools yield_plot --plot-type reads test_data/
The result should look something like:
The second is the yield of base pairs over time:
poretools yield_plot --plot-type basepairs test_data/
The result should look something like:
Of course, you can save to PDF or PNG with --saveas:
poretools yield_plot \
--plot-type basepairs \
--saveas foo.pdf \
test_data/
poretools yield_plot \
--plot-type basepairs \
--saveas foo.png \
test_data/
If you don’t like the default aesthetics, try --theme-bw:
poretools yield_plot --theme-bw test_data/
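The numbers behind a collector’s curve are just running totals over reads ordered by time. The following Python 3 sketch shows that idea under stated assumptions: it is not the plotting code used by poretools, the function name collectors_curve is invented, and ordering by read start time is a choice made for the example.

```python
def collectors_curve(reads):
    """Given (start_time, read_length) pairs, return a list of
    (start_time, cumulative_reads, cumulative_bases) ordered by start time."""
    curve = []
    total_reads = 0
    total_bases = 0
    for start, length in sorted(reads):
        total_reads += 1
        total_bases += length
        curve.append((start, total_reads, total_bases))
    return curve


if __name__ == "__main__":
    # (unix_timestamp, read_length) pairs, deliberately out of order.
    reads = [(1457129863, 3399), (1457128868, 5826), (1457130808, 2640)]
    for point in collectors_curve(reads):
        print(point)
```

The second element of each tuple corresponds to a reads-style curve and the third to a basepairs-style curve.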
poretools squiggle¶
Make a “squiggle” plot of the signal over time for a given read or set of reads
poretools squiggle test_data/foo.fast5
The result should look something like:
If you don’t like the default aesthetics, try --theme-bw:
poretools squiggle --theme-bw test_data/
Other options:
# save as PNG
poretools squiggle --saveas png test_data/foo.fast5
# save as PDF
poretools squiggle --saveas pdf test_data/foo.fast5
# make a PNG for each FAST5 file in a directory
poretools squiggle --saveas png test_data/
poretools winner¶
Report the longest read among a set of FAST5 files.
poretools winner test_data/
poretools winner --type all test_data/
poretools winner --type fwd test_data/
poretools winner --type rev test_data/
poretools winner --type 2D test_data/
poretools winner --type fwd,rev test_data/
poretools winner --type best test_data/
poretools stats¶
Collect read size statistics from a set of FAST5 files.
poretools stats test_data/
total reads 2286.000000
total base pairs 8983574.000000
mean 3929.822397
median 4011.500000
min 13.000000
max 6864.000000
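Output of this shape is easy to load into a script. The sketch below (Python 3) assumes that the value is always the last whitespace-separated field on each line, as in the sample above; the function name parse_stats is invented for the example.

```python
def parse_stats(text):
    """Map each statistic name to a float. The value is taken to be the
    last whitespace-separated field on each non-empty line."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.rsplit(None, 1)
        stats[name.strip()] = float(value)
    return stats


SAMPLE = """\
total reads 2286.000000
total base pairs 8983574.000000
mean 3929.822397
median 4011.500000
min 13.000000
max 6864.000000
"""

if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    # Sanity check: total base pairs / total reads should match the reported mean.
    print(stats["total base pairs"] / stats["total reads"], stats["mean"])
```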
poretools hist¶
Plot a histogram of read sizes from a set of FAST5 files.
poretools hist test_data/
poretools hist --min-length 1000 --max-length 10000 test_data/
poretools hist --num-bins 20 --max-length 10000 test_data/
If you don’t like the default aesthetics, try --theme-bw:
poretools hist --theme-bw test_data/
The result should look something like:
poretools nucdist¶
Look at the nucleotide composition of a set of FAST5 files.
poretools nucdist test_data/
A 78287 335291 0.233489714904
C 75270 335291 0.224491561062
T 92575 335291 0.276103444471
G 84754 335291 0.252777438106
N 4405 335291 0.0131378414571
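The columns in the sample above read as base, count, total, and fraction. Under that reading, a short Python 3 sketch can derive the GC content. The function names are invented, and excluding N from the denominator is a choice made for the example, not something poretools reports.

```python
def parse_nucdist(text):
    """Return {base: count} from lines of the form: base count total fraction."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[0]] = int(fields[1])
    return counts


def gc_fraction(counts):
    """GC fraction among unambiguous bases (N is left out of the denominator)."""
    acgt = sum(counts.get(base, 0) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (counts.get("G", 0) + counts.get("C", 0)) / acgt


SAMPLE = """\
A 78287 335291 0.233489714904
C 75270 335291 0.224491561062
T 92575 335291 0.276103444471
G 84754 335291 0.252777438106
N 4405 335291 0.0131378414571
"""

if __name__ == "__main__":
    counts = parse_nucdist(SAMPLE)
    print(round(gc_fraction(counts), 4))
```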
poretools qualdist¶
Look at the quality score composition of a set of FAST5 files.
poretools qualdist test_data/
! 0 83403 335291 0.248748102395
" 1 46151 335291 0.137644613187
# 2 47463 335291 0.141557632027
$ 3 34471 335291 0.102809201559
% 4 24879 335291 0.0742012162569
& 5 20454 335291 0.0610037251224
' 6 16783 335291 0.0500550268274
( 7 13699 335291 0.0408570465655
) 8 11356 335291 0.0338690868529
* 9 9077 335291 0.0270720061081
+ 10 6492 335291 0.0193622852984
, 11 4891 335291 0.014587328619
- 12 3643 335291 0.0108651887465
. 13 2585 335291 0.00770972080968
/ 14 1969 335291 0.0058725107444
0 15 1475 335291 0.00439916371152
1 16 1146 335291 0.00341792651756
2 17 902 335291 0.00269020045274
3 18 790 335291 0.00235616225905
4 19 619 335291 0.0018461575169
5 20 532 335291 0.00158668142002
6 21 440 335291 0.00131229290378
7 22 397 335291 0.00118404609727
8 23 379 335291 0.00113036138757
9 24 313 335291 0.000933517452004
: 25 327 335291 0.000975272226215
; 26 138 335291 0.000411582774366
< 27 121 335291 0.000360880548538
= 28 96 335291 0.000286318451733
> 29 76 335291 0.000226668774289
? 30 69 335291 0.000205791387183
@ 31 61 335291 0.000181931516205
A 32 48 335291 0.000143159225866
B 33 23 335291 6.8597129061e-05
C 34 14 335291 4.17547742111e-05
D 35 6 335291 1.78949032333e-05
F 37 3 335291 8.94745161666e-06
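In the table above, the second column equals the ASCII code of the symbol in the first column minus 33, which is the usual Phred+33 convention. The Python 3 sketch below relies on that observation and on the standard relation between a Phred score Q and the error probability, p = 10^(-Q/10). The function names are invented, and the column meanings (symbol, score, count, total, fraction) are inferred from the sample.

```python
def phred_from_char(ch):
    """Phred score encoded by one FASTQ quality character (Phred+33 assumed)."""
    return ord(ch) - 33


def error_probability(q):
    """Base-call error probability implied by Phred score q."""
    return 10 ** (-q / 10)


def mean_quality(lines):
    """Count-weighted mean Phred score from qualdist-style lines
    (symbol, score, count, total, fraction)."""
    weighted = 0
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        score, count = int(fields[1]), int(fields[2])
        weighted += score * count
        total += count
    return weighted / total if total else 0.0


if __name__ == "__main__":
    # First three rows of the table above, used as a small sample.
    sample = [
        '! 0 83403 335291 0.248748102395',
        '" 1 46151 335291 0.137644613187',
        '# 2 47463 335291 0.141557632027',
    ]
    print(phred_from_char("5"), error_probability(20), mean_quality(sample))
```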
poretools qualpos¶
Produce a box-whisker plot of the quality score distribution over positions in reads.
poretools qualpos test_data/
The result should look something like:
poretools tabular¶
Dump the length, name, seq, and qual of the sequence in one or a set of FAST5 files.
poretools tabular foo.fast5
length name sequence quals
10 @channel_100_read_14_complement GTCCCCAACAACAC $%%'"$"%!)
poretools events¶
Extract the raw nanopore events from each FAST5 file.
poretools events test_data/ | head -5
file strand mean start stdv length model_state model_level move p_model_state mp_model_state p_mp_model_state p_A p_C p_G p_T raw_index
test_data/2016_3_4_3507_1_ch120_read240_strand.fast5 template 58.3245290305 1559.89409031 1.34165996292 0.0146082337317 CGACTT 58.1304809188 0 0.0226559 CATCTT 0.0229866 0.284469 0.130683 0.137386 0.447461
test_data/2016_3_4_3507_1_ch120_read240_strand.fast5 template 50.1420877511 1559.90869854 0.921372775302 0.0348605577689 GACTTT 49.3934875964 1 0.0849836 GACTTT 0.0849836 0.257314 0.350541 0.101351 0.290794
test_data/2016_3_4_3507_1_ch120_read240_strand.fast5 template 47.5841029424 1559.9435591 0.771398562801 0.00763612217795 ACTTTG 48.2080162623 1 0.108899 TCTTTG 0.13079 0.000477931 0.00853333 0.306356 0.684632
test_data/2016_3_4_3507_1_ch120_read240_strand.fast5 template 51.5879264562 1559.95119522 0.684238307171 0.0112881806109 CTTTGA 52.7784154546 1 0.110625 CTTTGG 0.121103 4.69995e-06 0.00382846 0.0169048 0.979262
Extract the pre-basecalled events from each FAST5 file.
poretools events --pre-basecalled test_data/ | head -5
file strand mean start stdv length model_state model_level move p_model_state mp_model_state p_mp_model_state p_A p_C p_G p_T raw_index
burn-in-run-2/ch100_file15_strand.fast5 pre_basecalled 51.4652695313 5352344 0.655003995591 35
burn-in-run-2/ch100_file15_strand.fast5 pre_basecalled 60.1776123047 5352379 1.05143911309 18
burn-in-run-2/ch100_file15_strand.fast5 pre_basecalled 48.9152374359 5352397 0.864834628834 67
burn-in-run-2/ch100_file15_strand.fast5 pre_basecalled 55.4002178596 5352464 1.75915620083 17
poretools times¶
Return the start times from a set of FAST5 files in tabular format.
poretools times test_data/ | head -5
channel filename read_length exp_starttime unix_timestamp duration unix_timestamp_end iso_timestamp day hour minute
120 test_data/2016_3_4_3507_1_ch120_read240_strand.fast5 5826 1457127309 1457128868 47 1457128915 2016-03-04T15:01:08-0700 04 15 01
120 test_data/2016_3_4_3507_1_ch120_read353_strand.fast5 3399 1457127309 1457129863 28 1457129891 2016-03-04T15:17:43-0700 04 15 17
120 test_data/2016_3_4_3507_1_ch120_read415_strand.fast5 2640 1457127309 1457130808 24 1457130832 2016-03-04T15:33:28-0700 04 15 33
120 test_data/2016_3_4_3507_1_ch120_read418_strand.fast5 3487 1457127309 1457130851 31 1457130882 2016-03-04T15:34:11-0700 04 15 34
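The timestamp columns can be cross-checked with a few lines of Python 3. The sketch below assumes whitespace-delimited fields with no spaces inside filenames, and a UTC offset of -7 hours taken from the iso_timestamp column of the sample; the function names are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def parse_times(text):
    """Turn header-plus-rows text into a list of dicts keyed by column name.
    Assumes whitespace-delimited fields without embedded spaces."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def to_iso(unix_ts, utc_offset_hours):
    """Format a Unix timestamp as ISO 8601 at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(int(unix_ts), tz=tz).isoformat()


SAMPLE = (
    "channel filename read_length exp_starttime unix_timestamp duration "
    "unix_timestamp_end iso_timestamp day hour minute\n"
    "120 test_data/2016_3_4_3507_1_ch120_read240_strand.fast5 5826 1457127309 "
    "1457128868 47 1457128915 2016-03-04T15:01:08-0700 04 15 01\n"
)

if __name__ == "__main__":
    for row in parse_times(SAMPLE):
        start = int(row["unix_timestamp"])
        # End time should equal start time plus duration (both in seconds).
        print(to_iso(start, -7), start + int(row["duration"]))
```

Note that isoformat() writes the offset as -07:00, whereas the poretools column shows -0700; the instant in time is the same.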
poretools occupancy¶
Plot the throughput performance of each pore on the flowcell during a given sequencing run.
poretools occupancy test_data/
The result should look something like:
poretools index¶
Tabulate all file location info and metadata such as ASIC ID and temperature from a set of FAST5 files.
poretools index test_data | head -5 | column -t
source_filename template_fwd_length complement_rev_length 2d_length asic_id asic_temp heatsink_temp channel exp_start_time exp_start_time_string_date exp_start_time_string_time start_time start_time_string_date start_time_string_time duration fast5_version
test_data/2016_3_4_3507_1_ch120_read240_strand.fast5 5826 5011 5079 3571011476 30.37 36.99 120 1457127309 2016-Mar-04 (Fri) 14:35:09 1457128868 2016-Mar-04 (Fri) 15:01:08 47 metrichor1.16
test_data/2016_3_4_3507_1_ch120_read353_strand.fast5 3399 2962 2940 3571011476 30.37 36.99 120 1457127309 2016-Mar-04 (Fri) 14:35:09 1457129863 2016-Mar-04 (Fri) 15:17:43 28 metrichor1.16
test_data/2016_3_4_3507_1_ch120_read415_strand.fast5 2640 2244 2428 3571011476 30.37 36.99 120 1457127309 2016-Mar-04 (Fri) 14:35:09 1457130808 2016-Mar-04 (Fri) 15:33:28 24 metrichor1.16
test_data/2016_3_4_3507_1_ch120_read418_strand.fast5 3487 2950 3384 3571011476 30.37 36.99 120 1457127309 2016-Mar-04 (Fri) 14:35:09 1457130851 2016-Mar-04 (Fri) 15:34:11 31 metrichor1.16
poretools metadata¶
Extract the metadata from a FAST5 file.
poretools metadata 013731_11rx_v2_3135_1_ch20_file19_strand.fast5
asic_id asic_temp heatsink_temp
31037 28.11 37.88
poretools metadata --read 013731_11rx_v2_3135_1_ch20_file19_strand.fast5
filename scaling_used abasic_peak_height hairpin_polyt_level median_before start_time read_id read_number hairpin_peak_height abasic_found abasic_event_index duration start_mux hairpin_found hairpin_event_index
013731_11rx_v2_3135_1_ch20_file19_strand.fast5 1 124.31769966 0.413218809334 226.393825112 4648221 3b4e45bf-6d42-45bc-9314-1d8a630971c2 19 125.783167256 1 2 195322 4 1 1478
Release History¶
Version 0.6.0 (29-Aug-2016)¶
- Added new organise command to place FAST5 files into a useful folder hierarchy.
- Updated the logic for event timing to handle both R9 and earlier FAST5 files.
- Added a “best” option to the fasta and fastq tools to identify the best sequence for a read (of 2D, template, complement).
- Added R9 RNN support.
- Various updates to API to accommodate the R9 changes made to the HDF5 structure.
Requirements¶
- HDF5 >= 1.8.7 (http://www.hdfgroup.org/HDF5/)
- Python >= 2.7
- h5py >= 2.0.0
- matplotlib
- seaborn
- pandas
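A quick way to see which of the Python packages listed above are importable is sketched below. It is a convenience check only: the helper name missing_modules is invented, it does not verify version numbers, and it says nothing about the HDF5 C library itself. It should be run with the same interpreter that will run poretools.

```python
REQUIRED = ("h5py", "matplotlib", "seaborn", "pandas")


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be imported by this interpreter."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all listed Python packages can be imported")
```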
Note
Please note that Anaconda and Python(x,y) already have all these dependencies installed:
- Anaconda (Linux, Windows, OS X): https://store.continuum.io/cshop/anaconda/
- Python(x,y) (Windows): https://code.google.com/p/pythonxy/