PyPairs Documentation

PyPairs - A Python scRNA-Seq classifier

This is a Python reimplementation of the Pairs algorithm described by A. Scialdone et al. (2015). The original paper is available at: <https://doi.org/10.1016/j.ymeth.2015.06.021>

A supervised machine learning algorithm that classifies single cells based on their transcriptomic signal. Initially created to predict cell cycle phase from scRNA-Seq data, the algorithm can also be applied to other biological groupings.

Built to be fully compatible with Scanpy [Wolf18].

Code available on GitHub.

Core Dependencies

Authors

  • Antonio Scialdone - original algorithm
  • Ron Fechtner - implementation and extension in Python

Release notes

Announcement

Note

Please only use pypairs >= 3.1.0

Versions

Version 3.1.0, Apr 4, 2019
  • New feature:
    • Multithreading now available for pairs.cyclone()
  • Minor changes and fixes:
    • pairs.sandbag() is now significantly faster
    • pairs.sandbag() is more stable in terms of memory access
Version 3.0.1 - 3.0.13, Mar 13, 2019
  • Various bug fixes, including:
    • Bioconda compatibility
    • Dataset loading
    • Cache file required
    • Cell Cycle specific scoring
Version 3.0.0, Jan 18, 2019 - Jan 31, 2019
  • Complete restructuring of the package. Now fully compatible with Scanpy.
  • Added:
    • This documentation
    • Default (oscope) dataset & marker pairs [Leng15]
  • Changed:
Version 2.0.1 - 2.0.6, Nov 22, 2018
  • Minor bug fixes and improvements.
Version 2.0.0, Aug 14, 2018
  • Major restructuring of the package
  • Improved parallel processing
  • New features:
Version 1.0.1 - 1.0.3, Jul 29, 2018
  • Bug fixes and improvements. (Mostly bugs though)
  • Added multi-core processing
Version 1.0.0, Mar 4, 2018
  • Speed and performance improvements.
Version 0.1, Feb 22, 2018
  • Simple Python reimplementation of the Pairs algorithm.
  • Included sandbag() and cyclone() algorithms

Getting Started

Installation

This package is hosted on PyPI (https://pypi.org/project/pypairs/) and can be installed on any system running Python 3 via pip:

pip install pypairs

Alternatively, pypairs can be installed using Conda (most easily obtained via the Miniconda Python distribution):

conda install -c bioconda pypairs

Minimal Example

The datasets module provides an example scRNA-Seq dataset and default marker pairs for cell cycle prediction:

from pypairs import pairs, datasets

# Load samples from the oscope scRNA-Seq dataset with known cell cycle
training_data = datasets.leng15(mode='sorted')

# Run sandbag() to identify marker pairs
marker_pairs = pairs.sandbag(training_data, fraction=0.6)

# Load samples from the oscope scRNA-Seq dataset without known cell cycle
testing_data = datasets.leng15(mode='unsorted')

# Run cyclone() to score and predict cell cycle classes
result = pairs.cyclone(testing_data, marker_pairs)

# Further downstream analysis
print(result)

Documentation

To use PyPairs, import the package, for example as follows:

from pypairs import pairs, datasets, settings, utils

Sandbag

This function implements the training step of the pair-based prediction method described by Scialdone et al. (2015) [Scialdone15]: it identifies the marker pairs that cyclone() later uses for classification.

To illustrate, consider classification of cells into G1 phase. Pairs of marker genes are identified with sandbag(), where the expression of the first gene in the training data is greater than the second in G1 phase but less than the second in all other phases.

pairs.sandbag(data[, annotation, …]) Calculate ‘marker pairs’ from a gene count matrix.

Cyclone

For each cell and each category (for example G1 phase), cyclone() calculates the proportion of that category's marker pairs in which the expression of the first gene is greater than the second in the new data (pairs with the same expression are ignored). A high proportion suggests that the cell is likely to belong to this category, as the expression ranking in the new data is consistent with that in the training data. Proportions are not directly comparable between phases because a different set of gene pairs is used for each phase. Instead, proportions are converted into scores that account for the size and precision of the proportion estimate. The same process is repeated for all phases, using the corresponding set of marker pairs in marker_pairs.

pairs.cyclone(data[, marker_pairs, …]) Score samples for each category based on marker pairs.
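As a rough sketch of the computation described above, the following plain-Python code calculates the proportion for one cell and one category, then turns it into a score by comparing it with proportions from shuffled expression values. The helper names and the toy data are hypothetical. The permutation-based score is an assumption about how a score "accounting for size and precision" could be obtained, and it is not confirmed to match what pairs.cyclone() does internally.

```python
import random


def pair_proportion(cell, marker_pairs):
    """Share of marker pairs where the first gene is higher than the second.

    cell: dict mapping gene name -> expression value in one cell.
    Pairs with identical expression are ignored. Returns None if every pair is tied.
    """
    hits = 0
    total = 0
    for first, second in marker_pairs:
        a, b = cell[first], cell[second]
        if a == b:
            continue  # ties carry no ranking information
        total += 1
        hits += a > b
    return hits / total if total else None


def permutation_score(cell, marker_pairs, iterations=1000, seed=0):
    """Fraction of shuffled profiles whose proportion is below the observed one.

    ASSUMPTION: illustrative scoring scheme, not taken from the PyPairs source.
    """
    observed = pair_proportion(cell, marker_pairs)
    if observed is None:
        return None
    rng = random.Random(seed)
    genes = list(cell)
    values = [cell[g] for g in genes]
    below = 0
    valid = 0
    for _ in range(iterations):
        rng.shuffle(values)  # break the link between genes and their values
        shuffled = dict(zip(genes, values))
        random_proportion = pair_proportion(shuffled, marker_pairs)
        if random_proportion is None:
            continue
        valid += 1
        below += random_proportion < observed
    return below / valid if valid else None


# Hypothetical G1 marker pairs and two toy cells
g1_markers = [("geneA", "geneB"), ("geneA", "geneC"), ("geneC", "geneB"), ("geneC", "geneD")]
g1_like_cell = {"geneA": 9, "geneB": 1, "geneC": 5, "geneD": 5}
other_cell = {"geneA": 1, "geneB": 9, "geneC": 5, "geneD": 5}

print(pair_proportion(g1_like_cell, g1_markers))
print(permutation_score(g1_like_cell, g1_markers, iterations=200))
print(permutation_score(other_cell, g1_markers, iterations=200))
```

In both toy cells the pair (geneC, geneD) is tied and therefore dropped, so each proportion is computed over the three remaining pairs.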

While this method is described for cell cycle phase classification, any biological groupings can be used here. However, for non-cell cycle phase groupings users should manually apply their own score thresholds for assigning cells into specific groups.
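For groupings other than cell cycle phase, a user-defined threshold could be applied along the following lines. This is a minimal sketch: the function assign_category, the threshold of 0.5, the "unassigned" label and the per-cell score dictionaries are all hypothetical and should be adapted to the data at hand.

```python
def assign_category(scores, threshold=0.5, fallback="unassigned"):
    """Return the best-scoring category if its score reaches the threshold.

    scores: dict mapping category name -> score for a single cell.
    NOTE: hypothetical helper; threshold and fallback label are user choices.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else fallback


# Toy per-cell scores for two made-up categories
cell_scores = [
    {"typeA": 0.82, "typeB": 0.31},
    {"typeA": 0.40, "typeB": 0.35},
]
assignments = [assign_category(s) for s in cell_scores]
print(assignments)
```

The first cell clears the threshold and is assigned to its best category. The second does not, so it stays unassigned rather than being forced into a group.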

Datasets

datasets.leng15([mode, gene_sub, sample_sub]) Single-cell RNA-seq data of hESCs used to evaluate Oscope [Leng15]
datasets.default_cc_marker([dataset]) Cell cycle marker pairs derived from [Leng15] with the default sandbag() settings.

Quality Assessment

utils.evaluate_prediction(prediction, reference) Calculates F1 Score, Recall and Precision of a cyclone() prediction.
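To make the three metrics concrete, here is a small plain-Python sketch that computes precision, recall and F1 score for one category from predicted and reference labels. The function name and the toy labels are hypothetical, and the sketch does not claim to reproduce the exact output format of utils.evaluate_prediction().

```python
def precision_recall_f1(prediction, reference, category):
    """Precision, recall and F1 score of `category`, treated as the positive class."""
    pairs = list(zip(prediction, reference))
    tp = sum(p == category and r == category for p, r in pairs)  # true positives
    fp = sum(p == category and r != category for p, r in pairs)  # false positives
    fn = sum(p != category and r == category for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for six cells
reference = ["G1", "G1", "S", "G2M", "S", "G1"]
prediction = ["G1", "S", "S", "G2M", "G1", "G1"]

precision, recall, f1 = precision_recall_f1(prediction, reference, "G1")
print(precision, recall, f1)
```

For G1 the toy data contain two true positives, one false positive and one false negative, so all three metrics equal 2/3.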

Utils

utils.export_marker(marker, fname[, defaultpath]) Export marker pairs to a JSON file.
utils.load_marker(fname[, defaultpath]) Load marker pairs from a JSON file.
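The idea of a JSON round trip can be sketched with the standard library alone. The code below assumes that marker pairs are stored as a dictionary mapping each category to a list of (first gene, second gene) pairs. That layout, the helper names save_pairs and read_pairs, and the file name are assumptions for illustration, and the on-disk format used by utils.export_marker() may differ.

```python
import json
import os
import tempfile

# ASSUMED layout: category -> list of (first gene, second gene) pairs
marker_pairs = {
    "G1": [("geneA", "geneB"), ("geneA", "geneC")],
    "S": [("geneB", "geneA")],
}


def save_pairs(pairs, path):
    """Write marker pairs to a JSON file (tuples become JSON arrays)."""
    with open(path, "w") as handle:
        json.dump(pairs, handle, indent=2)


def read_pairs(path):
    """Read marker pairs back and restore each pair as a tuple."""
    with open(path) as handle:
        raw = json.load(handle)
    return {category: [tuple(pair) for pair in plist] for category, plist in raw.items()}


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "marker_pairs.json")
    save_pairs(marker_pairs, path)
    restored = read_pairs(path)

print(restored == marker_pairs)
```

JSON has no tuple type, so the pairs come back as lists unless they are converted explicitly, as read_pairs does here.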

Settings

The default directories for saving figures and caching files.

settings.figdir Directory for saving figures (default: './figures/').
settings.cachedir Directory for cache files (default: './cache/').

The verbosity of logging output, where verbosity levels have the following meaning: 0=’error’, 1=’warning’, 2=’info’, 3=’hint’

settings.verbosity Verbosity level (default: 1).

Print versions of packages that might influence numerical results.

log.print_versions() Versions that might influence the numerical results.

References

[Leng15] Leng et al. (2015), Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments, Nat Methods.
[Scialdone15] Scialdone et al. (2015), Computational assignment of cell-cycle stage from single-cell transcriptome data, Methods.
[Wolf18] Wolf et al. (2018), SCANPY: large-scale single-cell gene expression data analysis, Genome Biology.