Limmbo’s documentation¶
Install¶
LiMMBo is currently available via pip and will in the future be available through conda-forge . The latter platform provides the recommended installation of LIMIX, which LiMMBo heavily relies on.
The recommended way of installing both packages is the following: Install LIMIX via conda:
conda install -c conda-forge limix
After successful installation of LIMIX, simply install LiMMBo via pip:
pip install limmbo
Data input¶
Data input¶
Parse command-line arguments¶
A detailed list of the command-line options for getGWASargs can be found in runGWAS.
A detailed list of the command-line options for getVarianceEstimationArgs can be found in Variance decomposition.
Read data¶
-
class
limmbo.io.reader.
ReadData
(verbose=True)[source]¶ Generate object containing all datasets relevant for the analysis. For variance decomposition, at least phenotypes and relatedness estimates need to be specified. For association testing with LMM, at least phenotype, relatedness estimates and genotypes need to be read.
-
getCovariates
(file_covariates=None, delim=', ')[source]¶ Reads [N x K] covariate matrix with [N] samples and [K] covariates.
Parameters: - file_covariates (string) – [N x (K +1)] covariates file with [N] sample IDs in the first column
- delim (string) – delimiter of covariates file, one of ” “, “,”, “t”
Returns: updated the following attributes of the ReadData instance:
- self.covariates (np.array): [N x K] covariates matrix
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io.reader import ReadData >>> from limmbo.io.utils import file_type >>> data = ReadData(verbose=False) >>> file_covs = resource_filename('limmbo', ... 'io/test/data/covs.csv') >>> data.getCovariates(file_covariates=file_covs) >>> data.covariates.index[:3] Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object') >>> data.covariates.values[:3,:3] array([[ 0.92734699, 1.59767659, -0.67263682], [ 0.57061985, -0.84679736, -1.11037123], [ 0.44201204, -1.61499228, 0.23302345]])
-
getGenotypes
(file_genotypes=None, delim=', ')[source]¶ Reads genotype file in the following formats: plink (.bed, .bim, .fam), gen (.gen, .sample) or comma-separated values (.csv) file.
Parameters: - file_geno (string) –
path to phenotype file in .plink or .csv format - plink format:
as specified in the plink user manual, binary plink format with .bed, .fam and .bim file- .csv format:
- [(NrSNP + 1) x (N`+1)] .csv file with: [`N] sample IDs in the first row and [NrSNP] genotype IDs in the first column
- sample IDs should be of type: chrom-pos-rsID for instance 22-50714616-rs6010226
- .csv format:
- delim (string) – delimiter of genotype file (when text format), one of ” “, “,”, “t”
Returns: updated the following attributes of the ReadData instance:
- self.genotypes (np.array): [N x NrSNPs] genotype matrix
- self.genotypes_info (pd.dataframe): [NrSNPs x 2] dataframe with columns ‘chrom’ and ‘pos’, and rsIDs as index
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io import reader >>> from limmbo.io.utils import file_type >>> data = reader.ReadData(verbose=False) >>> # Read genotypes in delim-format >>> file_geno = resource_filename('limmbo', ... 'io/test/data/genotypes.csv') >>> data.getGenotypes(file_genotypes=file_geno) >>> data.genotypes.index[:4] Index([u'ID_1', u'ID_2', u'ID_3', u'ID_4'], dtype='object') >>> data.genotypes.shape (1000, 20) >>> data.genotypes.values[:5,:5] array([[0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [2., 1., 0., 0., 0.], [1., 0., 0., 0., 0.]]) >>> data.genotypes_info[:5] chrom pos rs1601111 3 88905003 rs13270638 8 20286021 rs75132935 8 76564608 rs72668606 8 79733124 rs55770986 7 2087823 >>> ### read genotypes in plink format >>> file_geno = resource_filename('limmbo', ... 'io/test/data/genotypes') >>> data.getGenotypes(file_genotypes=file_geno) >>> data.genotypes.values[:5,:5] array([[0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [2., 1., 0., 0., 0.], [1., 0., 0., 0., 0.]]) >>> data.genotypes_info[:5] chrom pos rs1601111 3 88905003 rs13270638 8 20286021 rs75132935 8 76564608 rs72668606 8 79733124 rs55770986 7 2087823
- file_geno (string) –
-
getPCs
(file_pcs=None, nrpcs=None, delim=', ')[source]¶ Reads file with [N x PC] matrix of [PC] principal components from the genotypes of [N] samples.
Parameters: - file_pcs (string) – [N x (PC +1)] PCA file with [N] sample IDs in the first column
- delim (string) – delimiter of PCA file, one of ” “, “,”, “t”
- nrpcs (integer) – Number of PCs to use (uses first nrpcs principal components)
Returns: updated the following attributes of the ReadData instance:
- self.pcs (np.array): [N x PC] principal component matrix
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io.reader import ReadData >>> from limmbo.io.utils import file_type >>> data = ReadData(verbose=False) >>> file_pcs = resource_filename('limmbo', ... 'io/test/data/pcs.csv') >>> data.getPCs(file_pcs=file_pcs, nrpcs=10, delim=" ") >>> data.pcs.index[:3] Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object', name=0) >>> data.pcs.values[:3,:3] array([[-0.02632738, -0.05791269, -0.03619099], [ 0.00761785, 0.00538956, 0.00624196], [ 0.01069307, -0.0205066 , 0.02299996]])
-
getPhenotypes
(file_pheno=None, delim=', ')[source]¶ Reads [N x P] phenotype file; file ending must be either .txt or .csv
Parameters: - file_pheno (string) – path to [(N`+1) x (`P`+1)] phenotype file with: [`N] sample IDs in the first column and [P] phenotype IDs in the first row
- delim (string) – delimiter of phenotype file, one of ” “, “,”, “t”
Returns: updated the following attributes of the ReadData instance:
- self.phenotypes (np.array): [N x P] phenotype matrix
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io.reader import ReadData >>> from limmbo.io.utils import file_type >>> data = ReadData(verbose=False) >>> file_pheno = resource_filename('limmbo', ... 'io/test/data/pheno.csv') >>> data.getPhenotypes(file_pheno=file_pheno) >>> data.phenotypes.index[:3] Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object') >>> data.phenotypes.columns[:3] Index([u'trait_1', u'trait_2', u'trait_3'], dtype='object') >>> data.phenotypes.values[:3,:3] array([[-1.56760036, -1.5324513 , 1.17789321], [-0.85655034, 0.48358151, 1.35664966], [ 0.10772832, -0.02262884, -0.27963328]])
-
getRelatedness
(file_relatedness, delim=', ')[source]¶ Read file of [N x N] pairwise relatedness estimates of [N] samples.
Parameters: - file_relatedness (string) – [(N + 1) x N] .csv file with: [N] sample IDs in the first row
- delim (string) – delimiter of covariate file, one of ” “, “,”, ” “
Returns: updated the following attributes of the ReadData instance:
- self.relatedness (np.array): [N x N] relatedness matrix
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io.reader import ReadData >>> from limmbo.io.utils import file_type >>> data = ReadData(verbose=False) >>> file_relatedness = resource_filename('limmbo', ... 'io/test/data/relatedness.csv') >>> data.getRelatedness(file_relatedness=file_relatedness) >>> data.relatedness.index[:3] Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object') >>> data.relatedness.columns[:3] Index([u'ID_1', u'ID_2', u'ID_3'], dtype='object') >>> data.relatedness.values[:3,:3] array([[1.00892922e+00, 2.00758504e-04, 4.30499103e-03], [2.00758504e-04, 9.98944885e-01, 4.86487318e-03], [4.30499103e-03, 4.86487318e-03, 9.85787665e-01]])
-
getSampleSubset
(file_samplelist=None, samplelist=None)[source]¶ Read file or string with subset of sample IDs to analyse.
Parameters: - file_samplelist (string) – “path/to/file_samplelist”: file contains subset sample IDs with one ID per line, no header.
- samplestring (string) – comma-separated list of sample IDs e.g. “ID1,ID2,ID5,ID10”.
Returns: - (numpy array)
array containing list of sample IDs
-
getTraitSubset
(traitstring=None)[source]¶ Limit analysis to specific subset of traits
Parameters: traitstring (string) – comma-separated trait numbers (for single traits) or hyphen- separated trait numbers (for trait ranges) or combination of both for trait selection (1-based) Returns: - (numpy array)
- array containing list of trait IDs
Examples
>>> from limmbo.io import reader >>> data = reader.ReadData(verbose=False) >>> traitlist = data.getTraitSubset("1,3,5,7-10") >>> print traitlist [0 2 4 6 7 8 9]
-
getVarianceComponents
(file_Cg=None, file_Cn=None, delim_cg=', ', delim_cn=', ')[source]¶ Reads a comma-separated files with [P x P] matrices of [P] trait covariance estimates.
Parameters: - file_Cg (string) – [P x P] .csv file with [P] trait covariance estimates of the genetic component
- file_Cn (string) – [P x P] .csv file with [P] trait covariance estimates of the non-genetic (noise) component
Returns: updated the following attributes of the ReadData instance:
- self.Cg (np.array): [P x P] matrix with trait covariance of the genetic component
- self.Cn (np.array): [P x P] matrix with trait covariance of the non-genetic (noise) component
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io.reader import ReadData >>> data = ReadData() >>> file_Cg = resource_filename('limmbo', ... 'io/test/data/Cg.csv') >>> file_Cn = resource_filename('limmbo', ... 'io/test/data/Cn.csv') >>> data.getVarianceComponents(file_Cg=file_Cg, file_Cn=file_Cn) >>> data.Cg.shape (10, 10) >>> data.Cn.shape (10, 10) >>> data.Cg[:3,:3] array([[ 0.45446454, -0.21084613, 0.01440468], [-0.21084613, 0.11443656, 0.01250233], [ 0.01440468, 0.01250233, 0.02347906]]) >>> data.Cn[:3,:3] array([[ 0.53654803, -0.14392748, -0.45483001], [-0.14392748, 0.88793093, 0.30539822], [-0.45483001, 0.30539822, 0.97785614]])
-
Check data¶
-
class
limmbo.io.input.
InputData
(verbose=True)[source]¶ Generate object containing all datasets relevant for variance decomposition (phenotypes, relatedness estimates) and pre-processing steps (check for common samples and sample order, covariates regression and phenotype transformation)
Parameters: verbose (bool) – initialise verbose: should progress messages be printed to stdout -
addCovariates
(covariates, covs_samples=None)[source]¶ Add [N x K] covariate data with [N] samples and [K] covariates to InputData instance.
Parameters: - covariates (array-like) – [N x `K] covariate matrix of N individuals and K covariates; if pandas.DataFrame with covs_samples as index, covs_samples do not have to specified separately.
- covs_samples (array-like) – [N] sample ID
Returns: updated the following attributes of the InputData instance:
- self.covariates (pd.DataFrame): [N x K] covariates matrix
- self.covs_samples (np.array): [N] sample IDs
Return type: None
Examples
>>> from limmbo.io import input >>> import numpy as np >>> import pandas as pd >>> covariates = [(1,2,4),(1,1,6),(0,4,8)] >>> covs_samples = ['S1','S2', 'S3'] >>> covariates = pd.DataFrame(covariates, index=covs_samples) >>> indata = input.InputData(verbose=False) >>> indata.addCovariates(covariates = covariates, ... covs_samples = covs_samples) >>> print indata.covariates.shape (3, 3) >>> print indata.covs_samples.shape (3,)
-
addGenotypes
(genotypes, geno_samples=None, genotypes_info=None)[source]¶ Add [N x NrSNP] genotype array of [N] samples and [NrSNP] genotypes, [N] array of sample IDs and [NrSNP x 2] dataframe of genotype description to InputData instance.
Parameters: - genotypes (array-like) – [N x NrSNP] genotype array of [N] samples and [NrSNP] genotypes; if pandas.DataFrame with geno_samples as index, geno_samples do not have to specified separately.
- geno_samples (array-like) – [N] vector of N sample IDs
- genotypes_info (dataframe) – [NrSNPs x 2] dataframe with columns ‘chrom’ and ‘pos’, and rsIDs as index
Returns: updated the following attributes of the InputData instance:
- self.genotypes (pd.DataFrame): [N x NrSNPs] genotype matrix
- self.geno_samples (np.array): [N] sample IDs
- self.genotypes_info (pd.DataFrame): [NrSNPs x 2] dataframe with columns ‘chrom’ and ‘pos’, and rsIDs as index
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io import reader >>> from limmbo.io import input >>> data = reader.ReadData(verbose=False) >>> file_geno = resource_filename('limmbo', ... 'io/test/data/genotypes.csv') >>> data.getGenotypes(file_genotypes=file_geno) >>> indata = input.InputData(verbose=False) >>> indata.addGenotypes(genotypes=data.genotypes, ... genotypes_info=data.genotypes_info) >>> indata.geno_samples[:5] array(['ID_1', 'ID_2', 'ID_3', 'ID_4', 'ID_5'], dtype=object) >>> indata.genotypes.shape (1000, 20) >>> indata.genotypes.values[:5,:5] array([[0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [2., 1., 0., 0., 0.], [1., 0., 0., 0., 0.]]) >>> indata.genotypes_info[:5] chrom pos rs1601111 3 88905003 rs13270638 8 20286021 rs75132935 8 76564608 rs72668606 8 79733124 rs55770986 7 2087823
-
addPCs
(pcs, pc_samples=None)[source]¶ Add [N x PC] matrix of [PC] principal components from the genotypes of [N] samples to InputData instance.
Parameters: - pcs (array-like) – [N x `PCs] principal component matrix of N individuals and PCs principal components; if pandas.DataFrame with pc_samples as index, covs_samples do not have to specified separately.
- pc_samples (array-like) – [N] sample IDs
Returns: updated the following attributes of the InputData instance:
- self.pcs (pd.DataFrame): [N x PCs] principal component matrix
- self.pc_samples (np.array): [N] sample IDs
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io import reader >>> from limmbo.io import input >>> data = reader.ReadData(verbose=False) >>> file_pcs = resource_filename('limmbo', ... 'io/test/data/pcs.csv') >>> data.getPCs(file_pcs=file_pcs, nrpcs=10, delim=" ") >>> indata = input.InputData(verbose=False) >>> indata.addPCs(pcs = data.pcs) >>> print indata.pcs.shape (1000, 10) >>> print indata.pc_samples.shape (1000,)
-
addPhenotypes
(phenotypes, pheno_samples=None, phenotype_ID=None)[source]¶ Add phenotypes, their phenotype ID and their sample IDs to InputData instance
Parameters: - phenotypes (array-like) – [N x `P] phenotype matrix of N individuals and P phenotypes; if pandas.DataFrame with pheno_samples as index and phenotypes_ID as columns, pheno_samples and phenotype_ID do not have to specified separately.
- pheno_samples (array-like) – [N] sample ID
- phenotype_ID (array-like) – [P] phenotype IDs
Returns: updated the following attributes of the InputData instance:
- self.phenotypes (pd.DataFrame): [N x P] phenotype array
- self.pheno_samples (np.array): [N] sample IDs
- self.phenotype_ID (np.array): [P] phenotype IDs
Return type: None
Examples
>>> from limmbo.io import input >>> import numpy as np >>> import pandas as pd >>> pheno = np.array(((1,2),(7,1),(3,4))) >>> pheno_samples = ['S1','S2', 'S3'] >>> phenotype_ID = ['ID1','ID2'] >>> phenotypes = pd.DataFrame(pheno, index=pheno_samples, ... columns = phenotype_ID) >>> indata = input.InputData(verbose=False) >>> indata.addPhenotypes(phenotypes = phenotypes) >>> print indata.phenotypes.shape (3, 2) >>> print indata.pheno_samples.shape (3,) >>> print indata.phenotype_ID.shape (2,)
-
addRelatedness
(relatedness, relatedness_samples=None)[source]¶ Add [N x N] pairwise relatedness estimates of [N] samples to the InputData instance
Parameters: - relatedness (array-like) – [N x `N] relatedness matrix of N individuals; if pandas.DataFrame with relatedness_samples as index, relatedness_samples do not have to specified separately.
- relatedness_samples (array-like) – [N] sample IDs
Returns: updated the following attributes of the InputData instance:
- self.relatedness (pd.DataFrame): [N x N] relatedness matrix
- self.relatedness_samples (np.array): [N] sample IDs
Return type: None
Examples
>>> from limmbo.io import input >>> import numpy >>> import pandas as pd >>> from numpy.random import RandomState >>> from numpy.linalg import cholesky as chol >>> random = RandomState(5) >>> N = 100 >>> SNP = 1000 >>> X = (random.rand(N, SNP) < 0.3).astype(float) >>> relatedness = numpy.dot(X, X.T)/float(SNP) >>> relatedness_samples = numpy.array( ... ['S{}'.format(x+1) for x in range(N)]) >>> relatedness = pd.DataFrame(relatedness, ... index=relatedness_samples) >>> indata = input.InputData(verbose=False) >>> indata.addRelatedness(relatedness = relatedness) >>> print indata.relatedness.shape (100, 100) >>> print indata.relatedness_samples.shape (100,)
-
addVarianceComponents
(Cg, Cn)[source]¶ Add [P x P] matrices of [P] trait covariance estimates of the genetic trait variance component (Cg) and the non-genetic (noise) variance component (Cn) to InputData instance
Parameters: - Cg (array-like) – [P x `P] matrix of P trait covariance estimates of the genetic trait covaraince component
- Cn (array-like) – [P x `P] matrix of P trait covariance estimates of the non-genetic (noise) trait covaraince component
Returns: updated the following attributes of the InputData instance:
- self.Cg (np.array): [P x `P] matrix of P trait covariance estimates of the genetic trait covariance component
- self.Cn (np.array): [P x `P] matrix of P trait covariance estimates of the non-genetic trait covaraince component
Return type: None
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io import reader >>> from limmbo.io import input >>> import numpy as np >>> from numpy.random import RandomState >>> from numpy.linalg import cholesky as chol >>> data = reader.ReadData(verbose=False) >>> file_pheno = resource_filename('limmbo', ... 'io/test/data/pheno.csv') >>> data.getPhenotypes(file_pheno=file_pheno) >>> file_Cg = resource_filename('limmbo', ... 'io/test/data/Cg.csv') >>> file_Cn = resource_filename('limmbo', ... 'io/test/data/Cn.csv') >>> data.getVarianceComponents(file_Cg=file_Cg, ... file_Cn=file_Cn) >>> indata = input.InputData(verbose=False) >>> indata.addPhenotypes(phenotypes = data.phenotypes) >>> indata.addVarianceComponents(Cg = data.Cg, Cn=data.Cn) >>> print indata.Cg.shape (10, 10) >>> print indata.Cg.shape (10, 10)
-
commonSamples
(samplelist=None)[source]¶ Get [M] common samples out of phenotype, relatedness and optional covariates with [N] samples (if all samples present in all datasets [M] = [N]) and ensure that samples are in same order.
Parameters: samplelist (array-like) – array of sample IDs to select from data Returns: updated the following attributes of the InputData instance: - self.phenotypes (pd.DataFrame): [M x P] phenotype matrix
- self.pheno_samples (np.array): [M] sample IDs
- self.relatedness (pd.DataFrame): [M x M] relatedness matrix
- self.relatedness_samples (np.array): [M] sample IDs of relatedness matrix
- self.covariates (pd.DataFrame): [M x K] covariates matrix
- self.covs_samples (np.array): [M] sample IDs
- self.genotypes (pd.DataFrame): [M x NrSNPs] genotypes matrix
- self.geno_samples (np.array): [M] sample IDs
- self.pcs (pd.DataFrame): [M x PCs] principal component matrix
- self.pc_samples (np.array): [M] sample IDs
Return type: None Examples
>>> from limmbo.io import input >>> import numpy as np >>> import pandas as pd >>> from numpy.random import RandomState >>> from numpy.linalg import cholesky as chol >>> random = RandomState(5) >>> P = 2 >>> K = 4 >>> N = 10 >>> SNP = 1000 >>> pheno = random.normal(0,1, (N, P)) >>> pheno_samples = np.array(['S{}'.format(x+4) ... for x in range(N)]) >>> phenotype_ID = np.array(['ID{}'.format(x+1) ... for x in range(P)]) >>> phenotypes = pd.DataFrame(pheno, index=pheno_samples, ... columns=phenotype_ID) >>> X = (random.rand(N, SNP) < 0.3).astype(float) >>> relatedness = np.dot(X, X.T)/float(SNP) >>> relatedness_samples = np.array(['S{}'.format(x+1) ... for x in range(N)]) >>> covariates = random.normal(0,1, (N-2, K)) >>> covs_samples = np.array(['S{}'.format(x+1) ... for x in range(N-2)]) >>> indata = input.InputData(verbose=False) >>> indata.addPhenotypes(phenotypes = pheno, ... pheno_samples = pheno_samples, ... phenotype_ID = phenotype_ID) >>> indata.addRelatedness(relatedness = relatedness, ... relatedness_samples = relatedness_samples) >>> indata.addCovariates(covariates = covariates, ... covs_samples = covs_samples) >>> indata.covariates.shape (8, 4) >>> indata.phenotypes.shape (10, 2) >>> indata.relatedness.shape (10, 10) >>> indata.commonSamples(samplelist=["S4", "S6", "S5"]) >>> indata.covariates.shape (3, 4) >>> indata.phenotypes.shape (3, 2) >>> indata.relatedness.shape (3, 3)
-
getAlleleFrequencies
()[source]¶ Compute allele frequencies of genotypes.
Returns: updated the following attributes of the InputData instance: - self.freqs (pandas DataFrame): [NrSNP x 2] matrix of alt and ref allele frequencies; index: snp IDs
Return type: None Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io import reader >>> from limmbo.io import input >>> from limmbo.utils.utils import makeHardCalledGenotypes >>> from limmbo.utils.utils import AlleleFrequencies >>> data = reader.ReadData(verbose=False) >>> file_geno = resource_filename('limmbo', ... 'io/test/data/genotypes.csv') >>> data.getGenotypes(file_genotypes=file_geno) >>> indata = input.InputData(verbose=False) >>> indata.addGenotypes(genotypes=data.genotypes, ... genotypes_info=data.genotypes_info, ... geno_samples=data.geno_samples) >>> freqs = indata.getAlleleFrequencies() >>> freqs.iloc[:5,:] p q rs1601111 0.292186 0.707814 rs13270638 0.303581 0.696419 rs75132935 0.024295 0.975705 rs72668606 0.119091 0.880909 rs55770986 0.169338 0.830662
-
regress
()[source]¶ Regress out covariates (optional).
Returns: updated the following attributes of the InputData instance: - self.phenotypes (np.array): [M x P] phenotype matrix of residuals of linear model
- self.covariates: None
Return type: None Examples
>>> from limmbo.io import input >>> import numpy as np >>> from numpy.random import RandomState >>> from numpy.linalg import cholesky as chol >>> random = RandomState(5) >>> P = 5 >>> K = 4 >>> N = 100 >>> pheno = random.normal(0,1, (N, P)) >>> pheno_samples = np.array(['S{}'.format(x+1) ... for x in range(N)]) >>> phenotype_ID = np.array(['ID{}'.format(x+1) ... for x in range(P)]) >>> covariates = random.normal(0,1, (N, K)) >>> covs_samples = np.array(['S{}'.format(x+1) ... for x in range(N)]) >>> indata = input.InputData(verbose=False) >>> indata.addPhenotypes(phenotypes = pheno, ... pheno_samples = pheno_samples, ... phenotype_ID = phenotype_ID) >>> indata.addCovariates(covariates = covariates, ... covs_samples = covs_samples) >>> indata.phenotypes.values[:3, :3] array([[ 0.44122749, -0.33087015, 2.43077119], [ 1.58248112, -0.9092324 , -0.59163666], [-1.19276461, -0.20487651, -0.35882895]]) >>> indata.regress() >>> indata.phenotypes.values[:3, :3] array([[ 0.34421705, -0.01470998, 2.25710966], [ 1.69886647, -1.41756814, -0.55614649], [-1.10700674, -0.66017713, -0.22201814]])
-
standardiseGenotypes
()[source]¶ Standardise genotypes:
\[w_{ij} = \frac{x_{ij} -2p_i}{\sqrt{2p_i (1-p_i)}}\]where \(x_{ij}\) is the number of copies of the reference allele for the \(i\) th SNP of the \(j\) th individual and \(p_i\) is the frequency of the reference allele (as described in (Yang et al 2011)).
Returns: updated the following attributes of the InputData instance: - self.genotypes_sd (numpy array): [N x NrSNP] matrix of NrSNP standardised genotypes for N samples.
Return type: None Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io import reader >>> from limmbo.io import input >>> from limmbo.utils.utils import makeHardCalledGenotypes >>> from limmbo.utils.utils import AlleleFrequencies >>> data = reader.ReadData(verbose=False) >>> file_geno = resource_filename('limmbo', ... 'io/test/data/genotypes.csv') >>> data.getGenotypes(file_genotypes=file_geno) >>> indata = input.InputData(verbose=False) >>> indata.addGenotypes(genotypes=data.genotypes, ... genotypes_info=data.genotypes_info) >>> geno_sd = indata.standardiseGenotypes() >>> geno_sd.iloc[:5,:3] 0 1 2 ID_1 -2.201123 -2.141970 -8.9622 ID_2 -2.201123 -2.141970 -8.9622 ID_3 -2.201123 -2.141970 -8.9622 ID_4 0.908627 -0.604125 -8.9622 ID_5 -0.646248 -2.141970 -8.9622
-
subsetTraits
(traitlist=None)[source]¶ Limit analysis to specific subset of traits
Parameters: traitlist (array-like) – array of trait numbers to select from phenotypes Returns: updated the following attributes of the InputData instance: - self.traitlist (list): of [t] trait numbers (int) to choose for analysis
- self.phenotypes (pd.DataFrame): reduced set of [N x t] phenotypes
- self.phenotype.ID (np.array): reduced set of [t] phenotype IDs
Return type: None Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io.reader import ReadData >>> from limmbo.io.input import InputData >>> from limmbo.io.utils import file_type >>> data = ReadData(verbose=False) >>> file_pheno = resource_filename('limmbo', ... 'io/test/data/pheno.csv') >>> data.getPhenotypes(file_pheno=file_pheno) >>> traitlist = data.getTraitSubset(traitstring="1-3,5") >>> indata = InputData(verbose=False) >>> indata.addPhenotypes(phenotypes = data.phenotypes) >>> print indata.phenotypes.shape (1000, 10) >>> print indata.phenotype_ID.shape (10,) >>> indata.subsetTraits(traitlist=traitlist) >>> print indata.phenotypes.shape (1000, 4) >>> print indata.phenotype_ID.shape (4,)
-
transform
(transform)[source]¶ Transform phenotypes
Parameters: transform (string) – transformation method for phenotype data:
- scale: mean center, divide by sd
- gaussian: inverse normalisation
Returns: updated the following attributes of the InputData instance: - self.phenotypes (np.array): [N x P] (transformed) phenotype matrix
Return type: None Examples
>>> from limmbo.io import input >>> import numpy as np >>> from numpy.random import RandomState >>> from numpy.linalg import cholesky as chol >>> random = RandomState(5) >>> P = 5 >>> K = 4 >>> N = 100 >>> pheno = random.normal(0,1, (N, P)) >>> pheno_samples = np.array(['S{}'.format(x+1) ... for x in range(N)]) >>> phenotype_ID = np.array(['ID{}'.format(x+1) ... for x in range(P)]) >>> SNP = 1000 >>> X = (random.rand(N, SNP) < 0.3).astype(float) >>> relatedness = np.dot(X, X.T)/float(SNP) >>> relatedness_samples = np.array(['S{}'.format(x+1) ... for x in range(N)]) >>> indata = input.InputData(verbose=False) >>> indata.addPhenotypes(phenotypes = pheno, ... pheno_samples = pheno_samples, ... phenotype_ID = phenotype_ID) >>> indata.addRelatedness(relatedness = relatedness, ... relatedness_samples = relatedness_samples) >>> indata.phenotypes.values[:3, :3] array([[ 0.44122749, -0.33087015, 2.43077119], [ 1.58248112, -0.9092324 , -0.59163666], [-1.19276461, -0.20487651, -0.35882895]]) >>> indata.transform(transform='gaussian') >>> indata.phenotypes.values[:3, :3] array([[ 0.23799988, -0.11191464, 2.05785598], [ 1.41041953, -0.81365681, -0.92217818], [-1.55977999, 0.01240937, -0.62091817]])
-
Variance decomposition¶
LiMMBo¶
-
class
limmbo.core.vdbootstrap.
LiMMBo
(datainput, S, timing=False, iterations=10, verbose=False)[source]¶ -
combineBootstrap
(results)[source]¶ Combine the [S x S] subset covariance matrices to find the overall [P x P] covariance matrices Cg and Cn.
Parameters: results (list) – results of runBootstrapVarianceDecomposition() Returns: dictionary containing: - Cg_fit (numpy.array): [P x P] genetic covariance matrix via fitting
- Cn_fit (numpy.array): [P x P] noise covariance matrix via fitting
- Cg_average (numpy.array): [P x P] genetic covariance matrix via simple average
- Cn_average (numpy.array): [P x P] noise covariance matrix via simple average
- Cg_all_bs (numpy.array): [runs x S x S] genetic covariance matrices of runs phenotype subsets of size S
- Cn_all_bs (numpy.array): [runs x S x S] noise covariance matrices of runs phenotype subsets of size S
- proc_time_ind_bs (list): individual run times for all variance decomposition runs of [S x S] Cg and Cn
- proc_time_sum_ind_bs (list): sum of individual run times for all variance decomposition runs of [S x S] Cg and Cn
- proc_time_combine_bs (list): run time for finding [P x P] trait covariance estimates from fitting [S x S] bootstrap covariance estimates
- nr_of_bs (int): number of bootstrap runs
- nr_of_successful_bs (int): total number of successful bootstrapping runs i.e. variance decomposition converged
- results_fit_Cg (): results parameters of the bfgs-fit of the genetic covariance matrices (via scipy.optimize.fmin_l_bfgs_g)
- results_fit_Cn (): results parameters of the bfgs-fit of the noise covariance matrices (via scipy.optimize.fmin_l_bfgs_g)
Return type: (dictionary)
-
runBootstrapCovarianceEstimation
(seed, cpus, minCooccurrence=3, n=None)[source]¶ Distribute variance decomposition of subset matrices via pp
Parameters: - seed (int) – seed to initialise random number generator for bootstrapping
- minCooccurrence (int) – minimum number a trait pair should be sampled; once reached for all trait pairs, sampling is stopped if n is None; default=3
- n (int) – if not None, sets the total number of permutations, otherwise n determined by minCooccurrence; default: None
- cpus (int) – number of cpus available for covariance estimation
Returns: list containing variance components for all bootstrap runs
Return type: (list)
-
saveVarianceComponents
(resultsCombineBootstrap, output, intermediate=True)[source]¶ Save variance components as comma-separated files or python objects (via Cpickle).
Parameters: - resultsCombineBootstrap (dictionary) –
- Cg_fit (numpy.array): [P x P] genetic covariance matrix via fitting
- Cn_fit (numpy.array): [P x P] noise covariance matrix via fitting
- Cg_average (numpy.array): [P x P] genetic covariance matrix via simple average
- Cn_average (numpy.array): [P x P] noise covariance matrix via simple average
- Cg_all_bs (numpy.array): [runs x S x S] genetic covariance matrices of runs phenotype subsets of size S
- Cn_all_bs (numpy.array): [runs x S x S] noise covariance matrices of runs phenotype subsets of size S
- proc_time_ind_bs (list): individual run times for all variance decomposition runs of [S x S] Cg and Cn
- proc_time_sum_ind_bs (list): sum of individual run times for all variance decomposition runs of [S x S] Cg and Cn
- proc_time_combine_bs (list): run time for finding [P x P] trait covariance estimates from fitting [S x S] bootstrap covariance estimates
- nr_of_bs (int): number of bootstrap runs
- nr_of_successful_bs (int): total number of successful bootstrapping runs i.e. variance decomposition converged
- results_fit_Cg (): results parameters of the bfgs-fit of the genetic covariance matrices (via scipy.optimize.fmin_l_bfgs_g)
- results_fit_Cn (): results parameters of the bfgs-fit of the noise covariance matrices (via scipy.optimize.fmin_l_bfgs_g)
- output (string) – path/to/directory where variance components will be saved; needs writing permission
- intermediate (bool) – if set to True, intermediate variance components (average covariance matrices, bootstrap matrices and results parameter of BFGS fit) are saved
Returns: None
- resultsCombineBootstrap (dictionary) –
-
Standard REML¶
-
limmbo.core.vdsimple.
vd_reml
(datainput, iterations=10, verbose=True)[source]¶ Compute variance decomposition of phenotypes into genetic and noise covariance via standard REML: approach implemented in LIMIX
Parameters: - datainput (
InputData
) – object with ID-matched [N x P] phenotypes with [N] individuals and [P] phenotypes and [N x N] relatedness estimates - output (string, optional) – output directory with user-writing permissions; needed if caching is True
- cache (bool, optional) – should results be cached
- verbose (bool, optional) – should messages be printed to stdout
Returns: tuple containing:
- Cg (numpy.array): [P x P] genetic variance component
- Cn (numpy.array): [P x P] noise variance component
- process_time (double): cpu time of variance decomposition
Return type: (tuple)
Examples:
>>> import numpy >>> from numpy.random import RandomState >>> from numpy.linalg import cholesky as chol >>> from limmbo.core.vdsimple import vd_reml >>> from limmbo.io.input import InputData >>> random = RandomState(10) >>> N = 100 >>> S = 1000 >>> P = 5 >>> snps = (random.rand(N, S) < 0.2).astype(float) >>> kinship = numpy.dot(snps, snps.T) / float(10) >>> y = random.randn(N, P) >>> pheno = numpy.dot(chol(kinship), y) >>> pheno_ID = [ 'PID{}'.format(x+1) for x in range(P)] >>> samples = [ 'SID{}'.format(x+1) for x in range(N)] >>> datainput = InputData() >>> datainput.addPhenotypes(phenotypes = pheno, ... phenotype_ID = pheno_ID, ... pheno_samples = samples) >>> datainput.addRelatedness(relatedness = kinship, ... relatedness_samples = samples) >>> Cg, Cn, ptime = vd_reml(datainput, verbose=False) >>> Cg.shape (5, 5)
- datainput (
Association analysis¶
-
class
limmbo.core.gwas.
GWAS
(datainput, seed=10, verbose=True, searchDelta=False)[source]¶ -
computeEmpiricalP
(pvalues, nrpermutations=1000)[source]¶ Compute empirical p-values: permute the genotypes, do the association test, record if permuted p-value of SNP is smaller than original p-value. Sum these occurrences and divide by total number of permutation.
Parameters: - pvalues (array-like) – [P x NrSNP] (single-trait) or [1 x NrSNP] (multi-trait) array of p-values
- nrpermutations (int) – number of permutations; 1/nrpermutations is the maximum level of significance (alpha)to test for, e.g. nrpermuations=100 -> alpha=0.01
Returns: [P x NrSNP] (single-trait) or [1 x NrSNP] (multi-trait) array of empirical p-values
Return type: (numpy array)
-
computeFDR
(fdr)[source]¶ Create an empirical p-values distribution by permuting the genotypes and running the association test on these (10 random permuations). Recored the observed pvalues, order by rank and find the rank of the desired FDR to determine the empirical FDR.
Parameters: fdr (float) – desired fdr threshold Returns: tuple containing: - empirical fdr (float):
- empirical_pvalue_distribution (numpy array): array of empirical p-values for 10 permutations tests
Return type: (tuple)
-
manhattanQQ
(results, colourS='DarkBLue', colourNS='Orange', alphaS=0.5, alphaNS=0.1, thr_plotting=None, saveTo=None)[source]¶ Plot manhattan and quantile-quantile plot of association results.
Parameters: - results (dictionary) – dictionary generated via runAssociation analysis
- colourS (string, optional) – colour of significant points
- colourNS (string, optional) – colour of non-significant points
- alphaS (float, optional) – plotting transparency of significant points
- alphaNS (float, optional) – plotting transparency of non-significant points
- thr_plotting (float, optional) – y-intercept for horizontal line as a marker for significance
- saveTo (string, optional) – /path/to/output/directory to automatically save plot as pdf; needs user writing permission.
Returns: (None)
-
runAssociationAnalysis
(mode, setup='lmm', adjustSingleTrait=None)[source]¶ Analysing the association between phenotypes, genotypes, optional covariates and random genetic effects.
Parameters: - mode (string) – pecifies the type of linear model: either ‘multitrait’ for multivariate analysis or ‘singletrait’ for univariate analysis.
- setup (string) – specifies the linear model setup: either ‘lmm’ for linear mixed model or ‘lm’ for a simple linear model.
- adjustSingleTrait (string) – Method to adjust single-trait association p-values for testing multiple traits; If None (default) no adjusting. Options are ‘bonferroni’ (for bonferroni correction’) or ‘effective’ (for correcting for the effective number of tests as described in (Galwey,2009).
Returns: dictionary containing:
- lm (
limix.qtl.LMM
): LIMIX LMM object - pvalues (numpy array): [NrSNP x P] (when mode is singletrait) or [1 x`NrSNP`] array of p-values.
- betas (numpy array): [NrSNP x P] array of effect size estimates per SNP across all traits.
- pvalues_adjust (numpy array): only returned if mode is ‘singletrait’ and ‘adjustSingleTrait’ is not None; contains single-trait p-values adjusted for the number of single-trait analyses conducted.
Return type: (dictionary)
Examples
>>> from pkg_resources import resource_filename >>> from limmbo.io.reader import ReadData >>> from limmbo.io.input import InputData >>> from limmbo.core.gwas import GWAS >>> data = ReadData(verbose=False) >>> file_pheno = resource_filename('limmbo', ... 'io/test/data/pheno.csv') >>> file_geno = resource_filename('limmbo', ... 'io/test/data/genotypes.csv') >>> file_relatedness = resource_filename('limmbo', ... 'io/test/data/relatedness.csv') >>> file_covs = resource_filename('limmbo', ... 'io/test/data/covs.csv') >>> file_Cg = resource_filename('limmbo', ... 'io/test/data/Cg.csv') >>> file_Cn = resource_filename('limmbo', ... 'io/test/data/Cn.csv') >>> data.getPhenotypes(file_pheno=file_pheno) >>> data.getCovariates(file_covariates=file_covs) >>> data.getRelatedness(file_relatedness=file_relatedness) >>> data.getGenotypes(file_genotypes=file_geno) >>> data.getVarianceComponents(file_Cg=file_Cg, ... file_Cn=file_Cn) >>> indata = InputData(verbose=False) >>> indata.addPhenotypes(phenotypes = data.phenotypes) >>> indata.addRelatedness(relatedness = data.relatedness) >>> indata.addCovariates(covariates = data.covariates) >>> indata.addGenotypes(genotypes=data.genotypes, ... genotypes_info=data.genotypes_info) >>> indata.addVarianceComponents(Cg = data.Cg, Cn=data.Cn) >>> indata.commonSamples() >>> indata.regress() >>> indata.transform(transform="scale") >>> gwas = GWAS(datainput=indata, seed=10, verbose=False) >>> >>> >>> # Example of multi-trait single-variant association testing >>> # using a linear mixed model. >>> resultsAssociation = gwas.runAssociationAnalysis( ... setup="lmm", mode="multitrait") >>> resultsAssociation.keys() ['lm', 'betas', 'pvalues'] >>> resultsAssociation['pvalues'].shape (1, 20) >>> '{:0.3e}'.format(resultsAssociation['pvalues'].min()) '4.584e-09' >>> resultsAssociation['betas'].shape (10, 20) >>> >>> >>> # Example of single-trait single-variant association testing >>> # using a linear mixed model. >>> resultsAssociation = gwas.runAssociationAnalysis( ... setup="lmm", mode="singletrait", ... adjustSingleTrait = "effective") >>> resultsAssociation.keys() ['pvalues_adjust', 'lm', 'betas', 'pvalues'] >>> resultsAssociation['pvalues'].shape (10, 20) >>> resultsAssociation['betas'].shape (10, 20) >>> '{:0.3e}'.format(resultsAssociation['pvalues_adjust'].min()) '2.262e-03'
-
saveAssociationResults
(results, outdir, name='', pvalues_empirical=None)[source]¶ Saves results of association analyses.
Parameters: - results (dictionary) –
dictionary generated via runAssociation analysis, containing:
- lm (
limix.qtl.LMM
): LIMIX LMM object - pvalues (numpy array): [P x NrSNP] array of p-values
- betas (numpy array): [P x NrSNP] array of effect size estimates per SNP across all traits
- pvalues_adjust (numpy array): only returned mode == singletrait and if ‘adjustSingleTrait’ is not None; contains single-trait p-values adjusted for the number of single-trait analyses conducted
- lm (
- outdir (string) – ‘/path/to/output/directory’; needs user writing permission
- name (string) – input name specific name, such as chromosome used in analysis
- pvalues_empirical (pandas dataframe, optional) – empirical pvalues via adjustSingleTrait
Returns: (None)
- results (dictionary) –
-
Command-line interface¶
Association analysis¶
Models the association between phenotypes and genotypes, accepting additional covariates and parameters to account for population structure and relatedness between samples. Users can choose between single-trait and multi-trait models, simple linear or linear mixed model set-ups.
usage: runAssociation [-h] [-p FILE_PHENO] [--pheno_delim PHENO_DELIM]
[-g FILE_GENOTYPES] [--geno_delim GENOTYPES_DELIM] -o
OUTDIR (-st | -mt) (-lm | -lmm) [-n NAME] [-v]
[-k FILE_RELATEDNESS]
[--kinship_delim RELATEDNESS_DELIM] [-cg FILE_CG]
[--cg_delim CG_DELIM] [-cn FILE_CN]
[--cn_delim CN_DELIM] [-pcs FILE_PCS]
[--pcs_delim PCS_DELIM] [-c FILE_COVARIATES]
[--covariate_delim COVARIATE_DELIM]
[-adjustP {bonferroni,effective,None}]
[-nrpermutations NRPERMUTATIONS] [-fdr FDR] [-seed SEED]
[-tr {scale,gaussian}] [-reg] [-traitset TRAITSTRING]
[-nrpcs NRPCS]
[--file_samplelist FILE_SAMPLELIST | --samplelist SAMPLELIST]
[--plot] [-colourS COLOURS] [-colourNS COLOURNS]
[-alphaS ALPHAS] [-alphaNS ALPHANS] [-thr THR]
[--version]
Basic required arguments¶
-p, --file_pheno | |
Path [string] to [(N+1) x (P+1)] .csv file of [P] phenotypes with [N] samples (first column: sample IDs, first row: phenotype IDs). Default: None | |
--pheno_delim | Delimiter of phenotype file. g Default: “,” |
-g, --file_geno | |
Genotype file: either [S x N].csv file (first column: SNP id, first row: sample IDs) or plink formated genotypes (.bim/.fam/.bed). Default: None | |
--geno_delim | Delimiter of genotype file (if not in plink format). Default: “,” |
-o, --outdir | Path [string] of output directory; user needs writing permission. Default: None |
-st, --singletrait | |
Set flag to conduct a single-trait association analysesDefault: False | |
-mt, --multitrait | |
Set flag to conduct a multi-trait association analysesDefault: False | |
-lm, --lm | Set flag to use a simple linear model for the associationanalysis |
-lmm, --lmm | Set flag to use a linear mixed model for the associationanalysis |
Output arguments¶
-n, --name | Name (used for output file naming). Default: |
-v, --verbose | [bool]: should analysis progress be displayed. Default: False |
Optional files¶
-k, --file_kinship | |
Path [string] to [N x (N+1)] file of kinship/relatedness matrix with [N] samples (first row: sample IDs). Required when –lmm/-lm. Default: None | |
--kinship_delim | |
Delimiter of kinship file. g Default: “,” | |
-cg, --file_cg | Required for large phenotype sizes when –lmm/-lm; computed via runLiMMBo; specifies file name for genetic trait covariance matrix (rows: traits, columns: traits). Default: None |
--cg_delim | Delimiter of Cg file. g Default: “,” |
-cn, --file_cn | Required for large phenotype sizeswhen –lmm/-lm; computed via runLiMMBo; specifies file name for non-genetic trait covariance matrix (rows: traits, columns: traits). Default: None |
--cn_delim | Delimiter of Cn file. g Default: “,” |
-pcs, --file_pcs | |
Path to [N x PCs] file of principal components from genotypes to be included as covariates (first column: sample IDs, first row: PC IDs); Default: None | |
--pcs_delim | Delimiter of PCs file. g Default: “,” |
-c, --file_cov | Path [string] to [(N+1) x C] file of covariates matrix with [N] samples and [K] covariates (first column: sample IDs, first row: phenotype IDs). Default: None |
--covariate_delim | |
Delimiter of covariates file. g Default: “,” |
Optional association parameters¶
-adjustP, --adjustP | |
Possible choices: bonferroni, effective, None Method to adjust single-trait p-values formultiple hypothesis testing when runningmultiple single-trait GWAS: bonferroni/effective number of tests `(Galwey,2009) <http://onlinelibrary.wiley.com/doi/10.1002/gepi.20408/abstract>`_Default: None | |
-nrpermutations, --nrpermutations | |
Number of permutations for computing empirical p-values; 1/nrpermutations is maximum level of testing for significance. Default: None | |
-fdr, --fdr | FDR threshold for computing empirical FDR. Default: None |
-seed, --seed | Seed [int] to inittiate random number generation for permutations. Default: 256 |
Optional data processing parameters¶
-tr, --transform_method | |
Possible choices: scale, gaussian Choose type [string] of data preprocessing: scale (mean center, divide by sd) or gaussian (inverse normalise). Default: “scale” | |
-reg, --reg_covariates | |
[bool]: should covariates be regressed out? Default: False |
Optional subsetting options¶
-traitset, --traitset | |
Comma- (for list of traits) or hyphen- (for trait range) or comma and hyphen-separated list [string] of traits (trait columns) to choose; default: None (=all traits). Default: None | |
-nrpcs, --nrpcs | |
First PCs to chose. Default: 10 | |
--file_samplelist | |
Path [string] to file with samplelist for sample selection, with one sample ID per line. Default: None | |
--samplelist | Comma-separated list [string] of samples IDs to restrict analysis to, e.g. ID1,ID2,ID5,ID9,ID10. Default: None |
Plot arguments¶
Arguments for depicting GWAS results as manhattan plot
--plot | Set flag if results of association analysis should be depicted as manhattan and quantile-quantile plot |
-colourS, --colourS | |
Colour of significant points in manhattan plot | |
-colourNS, --colourNS | |
Colour of non-significant points in manhattan plot | |
-alphaS, --alphaS | |
Transparency of significant points in manhattan plot | |
-alphaNS, --alphaNS | |
Transparency of non-significant points in manhattan plot | |
-thr, --threshold | |
Significance threshold; when –fdr specified, empirical fdr used as threshold |
Version¶
--version | show program’s version number and exit |
Variance decomposition¶
Estimates the genetic and non-genetic traitcovariance matrix parameters of a linear mixed model with random genetic and non-genetic effect via a bootstrapping-based approach.
usage: runVarianceEstimation [-h] [-p FILE_PHENO] [--pheno_delim PHENO_DELIM]
[-k FILE_RELATEDNESS]
[--kinship_delim RELATEDNESS_DELIM] -o OUTDIR
[-c FILE_COVARIATES]
[--covariate_delim COVARIATE_DELIM] [-seed SEED]
-sp S [-r RUNS] [-t]
[--minCooccurrence MINCOOCCURRENCE]
[-i ITERATIONS] [-cpus CPUS]
[-tr {scale,gaussian}] [-reg]
[-traitset TRAITSTRING]
[--file_samplelist FILE_SAMPLELIST | --samplelist SAMPLELIST]
[-dontSaveIntermediate] [-v] [--version]
Basic required arguments¶
-p, --file_pheno | |
Path [string] to [(N+1) x (P+1)] .csv file of [P] phenotypes with [N] samples (first column: sample IDs, first row: phenotype IDs). Default: None | |
--pheno_delim | Delimiter of phenotype file. g Default: “,” |
-k, --file_kinship | |
Path [string] to [N x (N+1)] file of kinship/relatedness matrix with [N] samples (first row: sample IDs). Required when –lmm/-lm. Default: None | |
--kinship_delim | |
Delimiter of kinship file. g Default: “,” | |
-o, --outdir | Path [string] of output directory; user needs writing permission. Default: None |
Optional files¶
-c, --file_cov | Path [string] to [(N+1) x C] file of covariates matrix with [N] samples and [K] covariates (first column: sample IDs, first row: phenotype IDs). Default: None |
--covariate_delim | |
Delimiter of covariates file. g Default: “,” |
Bootstrapping parameters¶
-seed, --seed | seed [int] used to generate bootstrap matrix. Default: 234 |
-sp, --smallp | Size [int] of phenotype subsamples used for variance decomposition. Default: None |
-r, --runs | Total number [int] of bootstrap runs. Default: None |
-t, --timing | [bool]: should variance decomposition be timed. Default: False |
--minCooccurrence | |
Minimum count [int] of the pairwise sampling of any given trait pair. Default: 3 | |
-i, --iterations | |
Number [int] of iterations for variance decomposition attempts. Default: 10 | |
-cpus, --cpus | Number [int] of available CPUs for parallelisation of variance decomposition steps. Default: None |
Optional data processing parameters¶
-tr, --transform_method | |
Possible choices: scale, gaussian Choose type [string] of data preprocessing: scale (mean center, divide by sd) or gaussian (inverse normalise). Default: “scale” | |
-reg, --reg_covariates | |
[bool]: should covariates be regressed out? Default: False |
Optional subsetting options¶
-traitset, --traitset | |
Comma- (for list of traits) or hyphen- (for trait range) or comma and hyphen-separated list [string] of traits (trait columns) to choose; default: None (=all traits). Default: None | |
--file_samplelist | |
Path [string] to file with samplelist for sample selection, with one sample ID per line.Default: None | |
--samplelist | Comma-separated list [string] of samples IDs to restrict analysis to, e.g. ID1,ID2,ID5,ID9,ID10. Default: None |
Output arguments¶
-dontSaveIntermediate, --dontSaveIntermediate | |
Set to suppress saving intermediate variance components. Default: True | |
-v, --verbose | [bool]: should analysis step description be printed. Default: False |
Version¶
--version | show program’s version number and exit |
Additional functions¶
-
limmbo.utils.utils.
boolanize
(string)[source]¶ Convert command line parameter “True”/”False” into boolean
Parameters: - string (string) –
- or "True" ("False") –
Returns: False/True
Return type: (bool)
-
limmbo.utils.utils.
generate_permutation
(P, S, n, seed=12321, exclude_zero=False)[source]¶ Generate permutation.
Parameters: Returns: Returns list of length n containing [np.arrays] of length [S] with subsets/permutations of numbers range(P)
Return type: (list)
-
limmbo.utils.utils.
getEigen
(covMatrix, reverse=True)[source]¶ Get eigenvectors and values of hermitian matrix:
Parameters: - covMatrix (array-like) – hermitian matrix
- reverse (bool) – if True (default): order eigenvalues (and vectors) in decreasing order
Returns: tuple containing:
- eigenvectors
- eigenvalues
Return type: (tuple)
-
limmbo.utils.utils.
getVariance
(eigenvalue)[source]¶ Based on input eigenvalue computes cumulative sum and normalizes to overall sum to obtain variance explained
Parameters: eigenvalue (array-like) – eigenvalues Returns: variance explained Return type: (float)
-
limmbo.utils.utils.
inflate_matrix
(bootstrap_traits, bootstrap, P, zeros=True)[source]¶ Project small matrix into large matrix using indeces provided:
Parameters: - bootstrap_traits (array-like) – [S x S] covariance matrix estimates
- bootstrap (array-like) – [S x 1] array with indices to project [S x S] matrix values into [P x P] matrix
- P (int) – total number of dimensions
- zeros (bool) – fill void spaces in large matrix with zeros (True, default) or nans (False)
Returns: Returns [P x P] matrix containing [S x S] matrix values at bootstrap indeces and zeros/nans elswhere
Return type: (numpy array)
-
limmbo.utils.utils.
match
(samples_ref, data_compare, samples_compare, squarematrix=False)[source]¶ Match the order of data and ID matrices to a reference sample order,
Parameters: - samples_ref (array-like) – [M] sammple Ids used as reference
- data_compare (array-like) – [N x L] data matrix with [N] samples and [L] columns
- samples_compare (array-like) – [N] sample IDs to be matched to samples_ref
- squarematrix (bool) – is data_compare a square matrix i.e. samples in cols and rows
Returns: tuple containing:
- data_compare (numpy array): [M x L] data matrix of input data_compare
- samples_compare (numpy array): [M] sample IDs of input samples_compare
- samples_before (int): number of samples in data_compare/samples_compare before matching to samples_ref
- samples_after (int): number of samples in data_compare/samples_compare after matching to samples_ref
Return type: (tuple)
-
limmbo.utils.utils.
nans
(shape)[source]¶ Create numpy array of NaNs
Parameters: shape (tuple) – shape of the empty array Returns: numpy array of NaNs Return type: (numpy array)
-
limmbo.utils.utils.
regularize
(m, verbose=True)[source]¶ Make matrix positive-semi definite by ensuring minimum eigenvalue >= 0: add absolute value of minimum eigenvalue and 1e-4 (for numerical stability of abs(min(eigenvalue) < 1e-4 to diagonal of matrix
Parameters: m (array-like) – symmetric matrix Returns: Returns tuple containing: - positive, semi-definite matrix from input m (numpy array)
- minimum eigenvalue of input m
Return type: (tuple)