Dask-glm

Dask-glm is a library for fitting Generalized Linear Models on large datasets

Dask-glm builds on the Dask project to fit GLMs on datasets in parallel. It offers a scikit-learn compatible API for specifying your model.

Estimators

The estimators module offers a scikit-learn compatible API for specifying your model and hyper-parameters, and fitting your model to data.

>>> from dask_glm.estimators import LogisticRegression
>>> from dask_glm.datasets import make_classification
>>> X, y = make_classification()
>>> lr = LogisticRegression()
>>> lr.fit(X, y)
>>> lr
LogisticRegression(abstol=0.0001, fit_intercept=True, lamduh=1.0,
          max_iter=100, over_relax=1, regularizer='l2', reltol=0.01, rho=1,
          solver='admm', tol=0.0001)

All of the estimators follow a similar API. They can be instantiated with a set of parameters that control the fit, including whether to add an intercept, which solver to use, how to regularize the inputs, and various optimization parameters.

Given an instantiated estimator, you pass the data to the .fit method. It takes an X, the feature matrix or exogenous data, and a y, the target or endogenous data. Each of these can be a NumPy or dask array.
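For example, here is a minimal sketch of fitting on dask arrays built directly with dask.array (the random data below is purely illustrative):

>>> import dask.array as da
>>> Xd = da.random.normal(size=(1000, 10), chunks=(100, 10))  # feature matrix as a dask array
>>> yd = da.random.random(size=1000, chunks=100) > 0.5        # boolean target as a dask array
>>> lr2 = LogisticRegression()
>>> lr2.fit(Xd, yd)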

With a fit model, you can make new predictions using the .predict method, and can score known observations with the .score method.

>>> lr.predict(X).compute()
array([False, False, False, True, ... True, False, True, True], dtype=bool)
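Like .predict, the result of .score is lazy when the inputs are dask arrays, so chain a .compute() to get the value (no output shown here):

>>> lr.score(X, y).compute()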

See the API Reference for more.

Examples

A collection of notebooks demonstrating dask_glm.

Scikit-Learn-style API

This example demonstrates compatibility with scikit-learn’s basic fit API. For demonstration, we’ll use the perennial NYC taxi cab dataset.

In [1]:
import os
import s3fs
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from distributed import Client

from dask import persist
from dask_glm.estimators import LogisticRegression
In [2]:
if not os.path.exists('trip.csv'):
    s3 = s3fs.S3FileSystem(anon=True)
    s3.get("dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", "trip.csv")
In [3]:
client = Client()
In [4]:
ddf = dd.read_csv("trip.csv")

We can use the dask.dataframe API to explore the dataset, and notice that some of the values look suspicious:

In [5]:
ddf[['trip_distance', 'fare_amount']].describe().compute()
Out[5]:
       trip_distance   fare_amount
count   1.274899e+07  1.274899e+07
mean    1.345913e+01  1.190566e+01
std     9.844094e+03  1.030254e+01
min     0.000000e+00 -4.500000e+02
25%     1.000000e+00  6.500000e+00
50%     1.700000e+00  9.000000e+00
75%     3.100000e+00  1.350000e+01
max     1.542000e+07  4.008000e+03

Scikit-learn doesn’t yet support filtering observations inside a pipeline, so we’ll do this before anything else.

In [6]:
# these filter out less than 1% of the observations
ddf = ddf[(ddf.trip_distance < 20) &
          (ddf.fare_amount < 150)]

Now, we’ll split our DataFrame into a train and test set, and select our feature matrix and target column (whether the passenger tipped).

In [7]:
df_train, df_test = ddf.random_split([0.80, 0.20], random_state=2)

columns = ['VendorID', 'passenger_count', 'trip_distance', 'payment_type', 'fare_amount']

X_train, y_train = df_train[columns], df_train['tip_amount'] > 0
X_test, y_test = df_test[columns], df_test['tip_amount'] > 0

# persist keeps the filtered train/test data in (distributed) memory,
# so the iterative solver below doesn't re-read and re-parse the CSV on every pass
X_train, y_train, X_test, y_test = persist(
    X_train, y_train, X_test, y_test
)

With our training data in hand, we fit our logistic regression. Nothing here should be surprising to those familiar with scikit-learn.

In [8]:
%%time
# this is a *dask-glm* LogisticRegression, not scikit-learn
lm = LogisticRegression(fit_intercept=False)
lm.fit(X_train.values, y_train.values)
CPU times: user 35.9 s, sys: 8.69 s, total: 44.6 s
Wall time: 9min 2s

Again, following the lead of scikit-learn we can measure the performance of the estimator on the training dataset:

In [9]:
lm.score(X_train.values, y_train.values).compute()
Out[9]:
0.90022477759757635

and on the test dataset:

In [10]:
lm.score(X_test.values, y_test.values).compute()
Out[10]:
0.90030262922441306

API Reference

Estimators

Models following scikit-learn’s estimator API.

class dask_glm.estimators.LinearRegression(fit_intercept=True, solver='admm', regularizer='l2', max_iter=100, tol=0.0001, lamduh=1.0, rho=1, over_relax=1, abstol=0.0001, reltol=0.01)[source]

Estimator for a linear model using Ordinary Least Squares.

Parameters:

fit_intercept : bool, default True

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

solver : {‘admm’, ‘gradient_descent’, ‘newton’, ‘lbfgs’, ‘proximal_grad’}

Solver to use. See Algorithms for details

regularizer : {‘l1’, ‘l2’}

Regularizer to use. See Regularizers for details. Only used with admm and proximal_grad solvers.

max_iter : int, default 100

Maximum number of iterations taken for the solvers to converge

tol : float, default 1e-4

Tolerance for stopping criteria. Ignored for admm solver

lamduh : float, default 1.0

Only used with admm and proximal_grad solvers

rho, over_relax, abstol, reltol : float

Only used with the admm solver.

Examples

>>> from dask_glm.datasets import make_regression
>>> X, y = make_regression()
>>> est = LinearRegression()
>>> est.fit(X, y)
>>> est.predict(X)
>>> est.score(X, y)

Attributes

coef_ (array, shape (n_classes, n_features)) The learned value for the model’s coefficients
intercept_ (float or None) The learned value for the intercept, if one was added to the model
class dask_glm.estimators.LogisticRegression(fit_intercept=True, solver='admm', regularizer='l2', max_iter=100, tol=0.0001, lamduh=1.0, rho=1, over_relax=1, abstol=0.0001, reltol=0.01)[source]

Estimator for logistic regression.

Parameters:

fit_intercept : bool, default True

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

solver : {‘admm’, ‘gradient_descent’, ‘newton’, ‘lbfgs’, ‘proximal_grad’}

Solver to use. See Algorithms for details

regularizer : {‘l1’, ‘l2’}

Regularizer to use. See Regularizers for details. Only used with admm, lbfgs, and proximal_grad solvers.

max_iter : int, default 100

Maximum number of iterations taken for the solvers to converge

tol : float, default 1e-4

Tolerance for stopping criteria. Ignored for admm solver

lamduh : float, default 1.0

Only used with admm, lbfgs and proximal_grad solvers.

rho, over_relax, abstol, reltol : float

Only used with the admm solver.

Examples

>>> from dask_glm.datasets import make_classification
>>> X, y = make_classification()
>>> lr = LogisticRegression()
>>> lr.fit(X, y)
>>> lr.predict(X)
>>> lr.predict_proba(X)
>>> lr.score(X, y)

Attributes

coef_ (array, shape (n_classes, n_features)) The learned value for the model’s coefficients
intercept_ (float or None) The learned value for the intercept, if one was added to the model
class dask_glm.estimators.PoissonRegression(fit_intercept=True, solver='admm', regularizer='l2', max_iter=100, tol=0.0001, lamduh=1.0, rho=1, over_relax=1, abstol=0.0001, reltol=0.01)[source]

Estimator for Poisson regression.

Parameters:

fit_intercept : bool, default True

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

solver : {‘admm’, ‘gradient_descent’, ‘newton’, ‘lbfgs’, ‘proximal_grad’}

Solver to use. See Algorithms for details

regularizer : {‘l1’, ‘l2’}

Regularizer to use. See Regularizers for details. Only used with admm, lbfgs, and proximal_grad solvers.

max_iter : int, default 100

Maximum number of iterations taken for the solvers to converge

tol : float, default 1e-4

Tolerance for stopping criteria. Ignored for admm solver

lamduh : float, default 1.0

Only used with admm, lbfgs and proximal_grad solvers.

rho, over_relax, abstol, reltol : float

Only used with the admm solver.

Examples

>>> from dask_glm.datasets import make_poisson
>>> X, y = make_poisson()
>>> pr = PoissonRegression()
>>> pr.fit(X, y)
>>> pr.predict(X)
>>> pr.get_deviance(X, y)

Attributes

coef_ (array, shape (n_classes, n_features)) The learned value for the model’s coefficients
intercept_ (float or None) The learned value for the intercept, if one was added to the model

Families

class dask_glm.families.Logistic[source]

Implements methods for Logistic regression, useful for classifying binary outcomes.

static gradient(Xbeta, X, y)[source]

Logistic gradient

static hessian(Xbeta, X)[source]

Logistic hessian

static loglike(Xbeta, y)[source]

Evaluate the logistic log-likelihood

Parameters:

Xbeta : array, shape (n_samples,)

y : array, shape (n_samples)

static pointwise_gradient(beta, X, y)[source]

Logistic gradient, evaluated point-wise.

static pointwise_loss(beta, X, y)[source]

Logistic Loss, evaluated point-wise.
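As a hedged sketch of the conventions above, the gradient, hessian, and loglike methods take the linear predictor Xbeta (that is, X.dot(beta)) rather than beta itself:

>>> import numpy as np
>>> from dask_glm.datasets import make_classification
>>> from dask_glm.families import Logistic
>>> X, y = make_classification()
>>> beta = np.zeros(X.shape[1])  # illustrative coefficients
>>> Logistic.loglike(X.dot(beta), y)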

class dask_glm.families.Normal[source]

Implements methods for Linear regression, useful for modeling continuous outcomes.

class dask_glm.families.Poisson[source]

Implements methods for Poisson regression, useful for modeling count data.
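Families are typically used by passing the class itself as the family= argument of the solvers in dask_glm.algorithms (documented below); a minimal sketch with the Normal family and a continuous target:

>>> from dask_glm.algorithms import gradient_descent
>>> from dask_glm.datasets import make_regression
>>> from dask_glm.families import Normal
>>> X, y = make_regression()
>>> beta = gradient_descent(X, y, family=Normal)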

Algorithms

Optimization algorithms for solving minimization problems.

dask_glm.algorithms.admm(X, y, regularizer='l1', lamduh=0.1, rho=1, over_relax=1, max_iter=250, abstol=0.0001, reltol=0.01, family=<class 'dask_glm.families.Logistic'>, **kwargs)[source]

Alternating Direction Method of Multipliers

Parameters:

X : array-like, shape (n_samples, n_features)

y : array-like, shape (n_samples,)

regularizer : str or Regularizer

lamduh : float

rho : float

over_relax : float

max_iter : int

maximum number of iterations to attempt before declaring failure to converge

abstol, reltol : float

family : Family

Returns:

beta : array-like, shape (n_features,)
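The solvers can also be called directly rather than through the estimator classes. A minimal sketch using the signature above (remaining arguments left at their documented defaults, so the Logistic family is assumed):

>>> from dask_glm.algorithms import admm
>>> from dask_glm.datasets import make_classification
>>> X, y = make_classification()
>>> beta = admm(X, y, regularizer='l1', lamduh=0.1)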

dask_glm.algorithms.compute_stepsize_dask(beta, step, Xbeta, Xstep, y, curr_val, family=<class 'dask_glm.families.Logistic'>, stepSize=1.0, armijoMult=0.1, backtrackMult=0.1)[source]

Compute the optimal stepsize

Parameters:

beta : array-like

step : float

Xbeta : array-like

Xstep : array-like

y : array-like

curr_val : float

family : Family, optional

stepSize : float, optional

armijoMult : float, optional

backtrackMult : float, optional

Returns:

stepSize : float

beta : array-like

xBeta : array-like

func : callable

dask_glm.algorithms.gradient_descent(X, y, max_iter=100, tol=1e-14, family=<class 'dask_glm.families.Logistic'>, **kwargs)[source]

Michael Grant’s implementation of Gradient Descent.

Parameters:

X : array-like, shape (n_samples, n_features)

y : array-like, shape (n_samples,)

max_iter : int

maximum number of iterations to attempt before declaring failure to converge

tol : float

Maximum allowed change from prior iteration required to declare convergence

family : Family

Returns:

beta : array-like, shape (n_features,)

dask_glm.algorithms.lbfgs(X, y, regularizer=None, lamduh=1.0, max_iter=100, tol=0.0001, family=<class 'dask_glm.families.Logistic'>, verbose=False, **kwargs)[source]

L-BFGS solver using scipy.optimize implementation

Parameters:

X : array-like, shape (n_samples, n_features)

y : array-like, shape (n_samples,)

max_iter : int

maximum number of iterations to attempt before declaring failure to converge

tol : float

Maximum allowed change from prior iteration required to declare convergence

family : Family

Returns:

beta : array-like, shape (n_features,)

dask_glm.algorithms.newton(X, y, max_iter=50, tol=1e-08, family=<class 'dask_glm.families.Logistic'>, **kwargs)[source]

Newton's method for logistic regression.

Parameters:

X : array-like, shape (n_samples, n_features)

y : array-like, shape (n_samples,)

max_iter : int

maximum number of iterations to attempt before declaring failure to converge

tol : float

Maximum allowed change from prior iteration required to declare convergence

family : Family

Returns:

beta : array-like, shape (n_features,)

dask_glm.algorithms.proximal_grad(X, y, regularizer='l1', lamduh=0.1, family=<class 'dask_glm.families.Logistic'>, max_iter=100, tol=1e-08, **kwargs)[source]

Proximal gradient solver.

Parameters:

X : array-like, shape (n_samples, n_features)

y : array-like, shape (n_samples,)

max_iter : int

maximum number of iterations to attempt before declaring failure to converge

tol : float

Maximum allowed change from prior iteration required to declare convergence

family : Family

verbose : bool, default False

whether to print diagnostic information during convergence

Returns:

beta : array-like, shape (n_features,)

Regularizers

Available Regularizers

These regularizers are included with dask-glm.

class dask_glm.regularizers.ElasticNet(weight=0.5)[source]

Elastic net regularization.

proximal_operator(beta, t)[source]

See notebooks/ElasticNetProximalOperatorDerivation.ipynb for derivation.

class dask_glm.regularizers.L1[source]

L1 regularization.

class dask_glm.regularizers.L2[source]

L2 regularization.
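Since the admm solver documented above accepts either a string name or a Regularizer instance ("regularizer : str or Regularizer"), a configured instance such as ElasticNet can be passed in directly; a hedged sketch:

>>> from dask_glm.algorithms import admm
>>> from dask_glm.datasets import make_classification
>>> from dask_glm.regularizers import ElasticNet
>>> X, y = make_classification()
>>> beta = admm(X, y, regularizer=ElasticNet(weight=0.5), lamduh=0.1)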

Regularizer Interface

Users wishing to implement their own regularizer should satisfy this interface.

class dask_glm.regularizers.Regularizer[source]

Abstract base class for regularization object.

Defines the set of methods required to create a new regularization object. This includes the regularization function itself and its gradient, hessian, and proximal operator.

add_reg_f(f, lam)[source]

Add regularization function to other function.

Parameters:

f : callable

Function taking beta and *args

lam : float

regularization constant

Returns:

wrapped : callable

function taking beta and *args

add_reg_grad(grad, lam)[source]

Add regularization gradient to other gradient function.

Parameters:

grad : callable

Function taking beta and *args

lam : float

regularization constant

Returns:

wrapped : callable

function taking beta and *args

add_reg_hessian(hess, lam)[source]

Add regularization hessian to other hessian function.

Parameters:

hess : callable

Function taking beta and *args

lam : float

regularization constant

Returns:

wrapped : callable

function taking beta and *args

f(beta)[source]

Regularization function.

Parameters:

beta : array, shape (n_features,)

Returns:

result : float

classmethod get(obj)[source]

Get the concrete instance for the name obj.

Parameters:

obj : Regularizer or str

Valid instances of Regularizer are passed through. Strings are looked up according to obj.name and a new instance is created

Returns:

obj : Regularizer

gradient(beta)[source]

Gradient of regularization function.

Parameters:

beta : array, shape (n_features,)

Returns:

gradient : array, shape (n_features,)

hessian(beta)[source]

Hessian of regularization function.

Parameters:

beta : array, shape (n_features,)

Returns:

hessian : array, shape (n_features, n_features)

proximal_operator(beta, t)[source]

Proximal operator for regularization function.

Parameters:

beta : array, shape (n_features,)

t : float # TODO: is that right?

Returns:

proximal_operator : array, shape (n_features,)
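Putting the interface together, here is a hedged sketch of a custom regularizer. Only the core methods (f, gradient, hessian, proximal_operator) are defined; the add_reg_* helpers are assumed to be provided by the base class in terms of these, and the name attribute is assumed to be the string that Regularizer.get matches on.

import numpy as np

from dask_glm.regularizers import Regularizer


class SquaredL2(Regularizer):
    """Illustrative squared-L2 penalty: f(beta) = 0.5 * sum(beta ** 2)."""

    name = 'squared_l2'  # assumption: the string Regularizer.get looks up

    def f(self, beta):
        # the regularization function itself
        return 0.5 * (beta ** 2).sum()

    def gradient(self, beta):
        # gradient of 0.5 * ||beta||^2 is beta itself
        return beta

    def hessian(self, beta):
        # constant Hessian: the identity matrix
        return np.eye(len(beta))

    def proximal_operator(self, beta, t):
        # argmin_x t * f(x) + 0.5 * ||x - beta||^2  ==>  beta / (1 + t)
        return beta / (1 + t)

An instance of such a class could then be passed anywhere a Regularizer is accepted, for example as the regularizer argument of the admm solver above.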
