ImmuneDB

ImmuneDB is a system to facilitate efficient storage and analysis of high-throughput B- and T-cell sequence data. It provides V- and J-gene identification, clonal assignment, lineage construction, selection pressure calculation, and thorough exporting functionality.

It also provides an intuitive web interface, a demo of which is available online.

Quick Start

To get started immediately, please see the Docker installation instructions.

More Information

ImmuneDB consists of two GitHub repositories: the Python analysis tools (arosenfeld/immunedb) and a web interface (arosenfeld/immunedb-frontend).

The system aims to:

  • Reduce ad-hoc scripting: Data analysis performed on an ad-hoc basis with custom scripts and data formats is error-prone and leads to inconsistencies. ImmuneDB provides a standardized analysis platform, performing many common tasks automatically.

  • Minimize flat-files: Flat files are currently the standard method of data exchange in the biological sciences, but they have myriad drawbacks, including a lack of referential integrity, unclear provenance, and non-standardized formats.

    ImmuneDB reduces the need for flat files by storing data in an industry-leading database, MySQL. When data must be exchanged as flat files, many export options, including FASTA and tab-delimited formats, are available.

  • Interoperate with existing tools: ImmuneDB integrates tools from other researchers to provide features such as lineage construction, genotyping, and selection pressure calculations. Further, ImmuneDB can import and export in a variety of common formats, making it compatible with the larger AIRR ecosystem of tools.

Installing Locally (advanced)

This section details how to set up ImmuneDB locally on a machine. This is more involved than the Docker method but may be useful if you plan to run ImmuneDB on a remote server rather than locally.

Dependency Installation

MySQL

ImmuneDB uses MySQL as its underlying data store. We recommend its drop-in replacement, MariaDB. Please consult the MariaDB website and your operating system's package manager for installation instructions.

R (optional)

Baseline can optionally be used to calculate selection pressure on clones. This requires R to be installed along with the ade4 package. Installation is platform dependent.

The newest version of Baseline can be downloaded from its website. The path to its main script will be needed for selection pressure calculations as described in Selection Pressure (Optional).

For genotyping, TIgGER must also be installed.

Bowtie2 (optional)

Bowtie2 can be used to locally align sequences which cannot be aligned using the built-in anchor method.

Clearcut (optional)

Clearcut can be used to generate lineage trees for clones. After downloading and compiling per the instructions, note the path to the clearcut executable which will be required for generating trees in Clone Trees (Optional).

ImmuneDB Installation

It is recommended that ImmuneDB be installed within a virtual environment (venv), isolating it from the rest of the system.

To create a virtual environment and activate it, run:

$ python3 -m venv immunedb
$ source immunedb/bin/activate

Then install ImmuneDB:

$ pip install immunedb

Web Interface Installation

Please refer to the ImmuneDB Frontend installation instructions.

Running the Example Pipeline

This page serves to familiarize new users with the basic flow of running the ImmuneDB pipeline. Example input FASTQ files are provided which contain human B-cell heavy chain sequences.

Commands are labeled as being run either in the Docker container or on the host.

To begin, run the Docker container as documented:

Run on Host
 $ docker run -v $HOME/immunedb_share:/share \
      -p 5000:5000 -p 8080:8080 -it arosenfeld/immunedb:v0.27.0

Metadata Specification

Before ImmuneDB can be run, metadata must be specified for each input file. For this example, one has already been created for you. To learn how to create a metadata file for your own data, see Creating a Metadata Sheet.

ImmuneDB Instance Creation

Next, we create a database for the data with:

Run in Docker
 $ immunedb_admin create example_db /share/configs

This creates a new database named example_db and stores its configuration in /share/configs/example_db.json.
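The configuration is a small JSON file. As a rough sketch (the exact key names are an assumption, based on the dictionary form accepted by config.init_db in the Python API section later in this document), it can be inspected from Python like so:

import json

# Load the configuration written by immunedb_admin create.
# Key names such as 'host' and 'database' are assumptions; check your file.
with open('/share/configs/example_db.json') as fh:
    db_config = json.load(fh)

print(db_config.get('host'), db_config.get('database'))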

Identifying or Importing Sequences

The first step of the pipeline is to annotate sequences and store the resulting data in the newly created database. To do so, the immunedb_identify command is used. It requires that V and J germline sequences be specified in two separate FASTA files. The Docker image provides human and mouse IGH, TRA, and TRB germlines in $HOME/germlines.

For this example, there are two provided input files in /example along with the requisite metadata.tsv file which you can view with:

Run in Docker
$ ls /example

Given this, run the immunedb_identify command:

Run in Docker
 $ immunedb_identify /share/configs/example_db.json \
      /root/germlines/imgt_human_ighv.fasta \
      /root/germlines/imgt_human_ighj.fasta \
      /example

Sequence Collapsing

ImmuneDB determines the uniqueness of a sequence both at the sample and subject level. For the latter, immunedb_collapse is used to find sequences that are the same except at positions that have an N. Thus, the sequences ATNN and ANCN would be collapsed.
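As a rough illustration of this idea (a sketch, not ImmuneDB's actual implementation), two equal-length sequences collapse together when they match at every position where neither has an N:

def n_aware_equal(seq_a, seq_b):
    # Sequences collapse if all differences fall on positions containing an N.
    if len(seq_a) != len(seq_b):
        return False
    return all(a == b or 'N' in (a, b) for a, b in zip(seq_a, seq_b))

print(n_aware_equal('ATNN', 'ANCN'))  # True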

To collapse sequences, run:

Run in Docker
 $ immunedb_collapse /share/configs/example_db.json

Clonal Assignment

After sequences are assigned V and J genes, they can be clustered into clones based on CDR3 amino acid (AA) similarity with the immunedb_clones command. This command takes a number of arguments, which should be reviewed before use.

There are three ways to create clones: CDR3 AA similarity, exact CDR3 nucleotide (NT) identity (intended for T cells), and a lineage-based method. For this example we’ll use the similarity-based method with default parameters:

Run in Docker
 $ immunedb_clones /share/configs/example_db.json similarity

This will create clones in which all sequences share the same V gene, J gene, and (by default) at least 85% CDR3 AA identity.
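As an illustrative sketch only (not the exact implementation used by immunedb_clones), the default similarity criterion roughly corresponds to the following check between two sequences:

def cdr3_similarity(cdr3_a, cdr3_b):
    # Fraction of matching positions between two equal-length CDR3 AA strings.
    matches = sum(a == b for a, b in zip(cdr3_a, cdr3_b))
    return matches / len(cdr3_a)

def same_clone(seq_a, seq_b, min_similarity=0.85):
    # Require matching V gene, J gene, CDR3 length, and sufficient AA identity.
    return (seq_a['v_gene'] == seq_b['v_gene'] and
            seq_a['j_gene'] == seq_b['j_gene'] and
            len(seq_a['cdr3_aa']) == len(seq_b['cdr3_aa']) and
            cdr3_similarity(seq_a['cdr3_aa'], seq_b['cdr3_aa']) >= min_similarity)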

Statistics Generation

Two sets of statistics can be calculated in ImmuneDB:

  • Clone Statistics: For each clone and sample combination, the number of unique and total sequences as well as the mutations from the germline.
  • Sample Statistics: Distribution of sequence and clone features on a per-sample basis, including V and J usage, nucleotides matching the germline, copy number, V length, and CDR3 length. It calculates all of these with and without outliers, and including and excluding partial reads.

These are calculated with the immunedb_clone_stats and immunedb_sample_stats commands and must be run in that order.

Run in Docker
 $ immunedb_clone_stats /share/configs/example_db.json
 $ immunedb_sample_stats /share/configs/example_db.json

Selection Pressure (Optional)

Warning

Selection pressure calculations are time-consuming, so you can skip this step if time is limited.

Selection pressure of clones can be calculated with Baseline. To do so, run:

Run in Docker
 $ immunedb_clone_pressure /share/configs/example_db.json \
      /apps/baseline/Baseline_Main.r

Note, this process is relatively slow and may take some time to complete.

Clone Trees (Optional)

Lineage trees for clones are generated with the immunedb_clone_trees command. The only method currently supported is neighbor-joining, as provided by Clearcut.

Among its parameters, --min-mut-copies omits mutations that do not occur at least a specified number of times, which can help correct for sequencing error.

Run in Docker
 $ immunedb_clone_trees /share/configs/example_db.json --min-mut-copies 2

Web Interface

ImmuneDB has a web interface to interact with a database instance. Running this can be slightly complicated, but the Docker image contains a helper script to simplify the process:

Run in Docker
 $ serve_immunedb.sh /share/configs/example_db.json

You can then navigate to http://localhost:8080.

Running the Pipeline on Your Data

This page describes how to run the ImmuneDB pipeline on raw FASTA/FASTQ data. It is assumed that you’ve previously tried the example pipeline and understand the basics of running commands in the Docker container.

As in the example, each code block has a header indicating whether the command should be run on the host or in the Docker container.

Copying Your Sequence Data Into Docker

Unlike in the example pipeline where sequencing data was provided, you’ll need to copy your own FASTA/FASTQ sequencing data into the Docker container.

To do so, on the host, we create a new directory in the shared directory into which we’ll copy your sequencing data. Here we’re calling it sequences but you’ll probably want to choose a more descriptive name:

Run on Host
$ mkdir $HOME/immunedb_share/sequences
$ cp PATH_TO_SEQUENCES $HOME/immunedb_share/sequences

Creating a Metadata Sheet

Next, we’ll use the immunedb_metadata command to create a template metadata file for your sequencing data. In the Docker container run:

Run in Docker
$ cd /share/sequences
$ immunedb_metadata --use-filenames

This creates a metadata.tsv file in /share/sequences in Docker or $HOME/immunedb_share/sequences on the host.

The --use-filenames flag is optional, and simply populates the sample_name field with the file names stripped of their .fasta or .fastq extension.
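As a point of reference, the filename handling is roughly equivalent to the sketch below (illustrative only, not the actual immunedb_metadata code):

import os

def sample_name_from_filename(filename):
    # Strip a trailing .fasta or .fastq extension to form the sample name.
    base, ext = os.path.splitext(os.path.basename(filename))
    return base if ext in ('.fasta', '.fastq') else filename

print(sample_name_from_filename('subject1_spleen.fastq'))  # subject1_spleen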

Editing the Metadata Sheet

On the host, open $HOME/immunedb_share/sequences/metadata.tsv in Excel or your favorite spreadsheet editor. The headers included in the file are required. You may add additional headers as necessary for your dataset (e.g. tissue, cell_subset, timepoint) so long as they follow these rules (a small validation sketch follows the note below):

  • The headers must all be unique
  • Each header may only contain lowercase letters, numbers, and underscores
  • Each header must begin with a lowercase letter
  • Each header must not exceed 32 characters in length
  • The values within each column cannot exceed 64 characters in length

Note

When data is missing or not necessary in a field, leave it blank or set to NA, N/A, NULL, or None (case-insensitive).
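The header rules above can be checked with a short script; the following is an illustrative sketch, not part of ImmuneDB itself:

import re

# Headers: begin with a lowercase letter; contain only lowercase letters,
# numbers, and underscores; and are at most 32 characters long.
HEADER_RE = re.compile(r'^[a-z][a-z0-9_]{0,31}$')

def validate_headers(headers):
    if len(headers) != len(set(headers)):
        raise ValueError('headers must be unique')
    for header in headers:
        if not HEADER_RE.match(header):
            raise ValueError('invalid header: {}'.format(header))

validate_headers(['sample_name', 'tissue', 'cell_subset', 'timepoint'])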

Running the Pipeline

Much of the rest of the pipeline follows from the example pipeline’s instance creation step. To start, create a database. Here we’ll call it my_db but you’ll probably want to give it a more descriptive name:

Run in Docker
$ immunedb_admin create my_db /share/configs

Then we’ll identify the sequences. For this process the germline genes must be specified. The germlines provided as FASTA files in the Docker image are:

  • imgt_human_ighv & imgt_human_ighj: Human B-cell heavy chains
  • imgt_human_trav & imgt_human_traj: Human T-cell α chains
  • imgt_human_trbv & imgt_human_trbj: Human T-cell β chains
  • imgt_mouse_ighv & imgt_mouse_ighj: Mouse B-cell heavy chains

For this segment we’ll assume human B-cell heavy chains, but the process is the same for any dataset:

Run in Docker
$ immunedb_identify /share/configs/my_db.json \
      /root/germlines/imgt_human_ighv.fasta \
      /root/germlines/imgt_human_ighj.fasta \
      /share/sequences
$ immunedb_collapse /share/configs/my_db.json

Then we assign clones. For B-cells we recommend:

Run in Docker
$ immunedb_clones /share/configs/my_db.json similarity

For T-cells we recommend:

Run in Docker
$ immunedb_clones /share/configs/my_db.json similarity --level nt \
      --min-similarity 1

If you have a mixed dataset, you can assign clones in different ways, filtering on V-gene type. For example:

Run in Docker
$ immunedb_clones /share/configs/my_db.json similarity --gene IGHV
$ immunedb_clones /share/configs/my_db.json similarity --gene TCRB \
      --level nt --min-similarity 1

The last required step is to generate aggregate statistics:

Run in Docker
 $ immunedb_clone_stats /share/configs/my_db.json
 $ immunedb_sample_stats /share/configs/my_db.json

For B-cells, you might also want to generate lineages. The following excludes mutations that occur only once; immunedb_clone_trees has many other filtering parameters, which you can view with the --help flag:

Run in Docker
 $  immunedb_clone_trees /share/configs/my_db.json --min-mut-copies 2

Selection pressure can be run with the following. This process is quite time-consuming, even for small datasets:

Run in Docker
 $ immunedb_clone_pressure /share/configs/my_db.json \
      /apps/baseline/Baseline_Main.r

Finally, start the web interface:

Run in Docker
 $ serve_immunedb.sh /share/configs/my_db.json

Wait a few moments until you see the message webpack: Compiled successfully. The data should then be available at http://localhost:8080.

Python API

Note

This section is currently incomplete. We’re working to fill out the details of the Python API as soon as possible.

Configuration

The immunedb.common.config module provides methods to initialize a connection to a new or existing database.

Most programs using ImmuneDB will start with code similar to:

import immunedb.common.config as config


parser = config.get_base_arg_parser('Some description of the program')
# ... add any additional arguments to the parser ...
args = parser.parse_args()

session = config.init_db(args.db_config)

When this script is run, it will require at least one argument: the path to a database configuration (as generated with immunedb_admin). Using that configuration, a Session object connected to the associated database is created.

One can also specify the path to a configuration directly:

import immunedb.common.config as config


session = config.init_db('path/to/config')

Alternatively, a dictionary with the same information can be passed:

import immunedb.common.config as config


session = config.init_db({
    'host': '...',
    'database': '...',
    'username': '...',
    'password': '...',
})

In each case, a Session object is returned which can be used to interact with the database.

Using the Session

ImmuneDB is built using SQLAlchemy as a MySQL abstraction layer. Simply put, instead of writing SQL, the database is queried using Python constructs. Full documentation on using the session can be found in SQLAlchemy’s documentation.

Once a session is created, the models listed below can be queried.

Example Queries

Below are some example queries that demonstrate how to use the ImmuneDB API.

Clone CDR3s

Get all clones with a given V-gene and print their CDR3 AA sequences.

Input

import immunedb.common.config as config
from immunedb.common.models import Clone

session = config.init_db(...)

for clone in session.query(Clone).filter(Clone.v_gene == 'IGHV3-30'):
    print('clone {} has AAs {}'.format(clone.id, clone.cdr3_aa))

Output

clone 37884 has AAs CARGYSSSYFDYW
clone 37886 has AAs CARSRTSLSIYGVVPTGDFDSW
clone 37885 has AAs CARNGLNTVSGVVISPKYWLDPW
clone 37887 has AAs CARDLFRGVDFYYYGMDVW

Clone Frequency

Determine how many sequences appear in each sample belonging to clone 1234.

Note that the CloneStats model has one entry for each clone/sample combination, plus one entry where the sample_id field is null, which represents the clone overall.

Input

import immunedb.common.config as config
from immunedb.common.models import CloneStats

session = config.init_db(...)
for stat in session.query(CloneStats).filter(
        CloneStats.clone_id == 1234).order_by(CloneStats.sample_id):
    print('clone {} has {} unique sequences and {} copies {}'.format(
        stat.clone_id,
        stat.unique_cnt,
        stat.total_cnt,
        ('in sample ' + stat.sample.name) if stat.sample else 'overall'))

Output

clone 1234 has 53 unique sequences and 1331 copies overall
clone 1234 has 27 unique sequences and 379 copies in sample sample1
clone 1234 has 27 unique sequences and 339 copies in sample sample3
clone 1234 has 24 unique sequences and 311 copies in sample sample4
clone 1234 has 28 unique sequences and 302 copies in sample sample10

V-gene Usage

This is a more complex query which gathers the V-gene usage of all sequences that are (a) from the subject with ID 1, (b) associated with a clone, and (c) unique within the subject, printing the counts from least to most frequent.

Input

import immunedb.common.config as config
from immunedb.common.models import Sequence, SequenceCollapse
from sqlalchemy import func

session = config.init_db(...)

subject_unique_seqs = session.query(
    func.count(Sequence.seq_id).label('count'),
    Sequence.v_gene
).join(
    SequenceCollapse
).filter(
    Sequence.subject_id == 1,
    ~Sequence.clone_id.is_(None),
    SequenceCollapse.copy_number_in_subject > 0
).group_by(
    Sequence.v_gene
).order_by(
    'count'
)

for seq in subject_unique_seqs:
    print(seq.v_gene, seq.count)

Output

# ... output truncated ...
IGHV4-34 1128
IGHV1-2 1160
IGHV3-48 1169
IGHV4-39 1310
IGHV3-7 1345
IGHV3-30|3-30-5|3-33 1607
IGHV3-23|3-23D 1626
IGHV3-21 1878

Data Models

Directly Querying the Database

ImmuneDB is backed by a MySQL database that can be queried directly to gather information, bypassing the Python API.

Accessing the Database

There are many ways to access the database directly. The two introduced here are connecting directly with the mysql client or using immunedb_sql, which simply wraps a call to MySQL.

Directly with MySQL

From the command line, you may access an ImmuneDB database DATABASE from user USERNAME with:

$ mysql -u USERNAME -p DATABASE

This will prompt for a password and then connect to the database. This method of access is useful for quickly querying the database. To save the results of a query QUERY to a file, run:

$ mysql -u USERNAME -p DATABASE -e "QUERY" > output

Referencing ImmuneDB

If you use ImmuneDB, please cite the tool as:

Rosenfeld, A. M., Meng, W., Luning Prak, E. T., Hershberg, U., ImmuneDB, a Novel Tool for the Analysis, Storage, and Dissemination of Immune Repertoire Sequencing Data. Frontiers in Immunology 9 (2018).

ImmuneDB was originally announced in:

Rosenfeld, A. M., Meng, W., Luning Prak, E. T., Hershberg, U., ImmuneDB: a system for the analysis and exploration of high-throughput adaptive immune receptor sequencing data, Bioinformatics 33 (2016), no. 2, 292–293.