Coclust: a Python package for co-clustering

Coclust provides both a Python package that implements several diagonal and non-diagonal co-clustering algorithms and a ready-to-use script to perform co-clustering.

Co-clustering (also known as biclustering) is an important extension of cluster analysis: it simultaneously groups the objects and the features of a matrix, resulting in both row and column clusters.
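
To make the idea of row and column clusters concrete, here is a minimal sketch in plain Python. It does not use coclust: the matrix, the labels and the reorganize helper are all made up for illustration. Sorting rows and columns by their cluster labels brings the co-clusters together as contiguous blocks.

```python
# Toy illustration (not coclust code): reorder a small matrix by row and
# column cluster labels so that the co-clusters appear as contiguous blocks.

def reorganize(matrix, row_labels, column_labels):
    """Return the matrix with rows and columns sorted by cluster label."""
    row_order = sorted(range(len(row_labels)), key=lambda i: row_labels[i])
    col_order = sorted(range(len(column_labels)), key=lambda j: column_labels[j])
    return [[matrix[i][j] for j in col_order] for i in row_order]


# Rows are objects, columns are features; the labels are written by hand.
X = [
    [5, 0, 4, 0],
    [0, 3, 0, 6],
    [2, 0, 7, 0],
    [0, 1, 0, 2],
]
row_labels = [0, 1, 0, 1]
column_labels = [0, 1, 0, 1]

for row in reorganize(X, row_labels, column_labels):
    print(row)
```

After reordering, the non-zero values of this toy matrix sit in two blocks along the diagonal, one per co-cluster.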

The script enables the user to process a dataset with co-clustering algorithms without writing Python code.

The Python package provides an API for Python developers. This API makes it possible, for example, to use the algorithms in a pipeline with the scikit-learn library.

coclust is distributed under the 3-Clause BSD license. It works with both Python 2.7 and Python 3.5.

Installation

You can install coclust with all the dependencies with:

pip install "coclust[alldeps]"

It will install the following libraries:

  • numpy
  • scipy
  • scikit-learn
  • matplotlib

If you only want to use co-clustering algorithms and don’t want to install visualization or evaluation dependencies, you can install it with:

pip install coclust

It will install the following required libraries:

  • numpy
  • scipy
  • scikit-learn

Windows users

It is recommended to use a third-party distribution to install the dependencies before installing coclust. For example, when using the Continuum distribution, download the graphical installer from the download site and double-click it. Then, enter pip install coclust at the command line.

Linux users

It is recommended to install the dependencies with your package manager. For example, on Ubuntu or Debian:

sudo apt-get install python-numpy python-scipy python-sklearn python-matplotlib
sudo pip install coclust

Performance note

OpenBLAS provides a fast multi-threaded implementation. You can install it with:

sudo apt-get install libopenblas-base

If other implementations are installed on your system, you can select OpenBLAS with:

sudo update-alternatives --config libblas.so.3

Running the tests

In order to run the tests, you have to install nose, for example with:

pip install nose

You also have to get the datasets used for the tests:

git clone https://github.com/franrole/cclust_package.git

And then, run the tests:

cd cclust_package
nosetests --with-coverage --cover-inclusive --cover-package=coclust

Examples

The datasets used here are available at:

https://github.com/franrole/cclust_package/tree/master/datasets

Basic usage

In the following example, the CSTR dataset is loaded from a Matlab matrix using the SciPy library. The data is stored in X and a co-clustering model using direct maximisation of the modularity is then fitted with 4 clusters. The modularity is printed and the predicted row labels and column labels are retrieved for further exploration or evaluation.

from scipy.io import loadmat
from coclust.coclustering import CoclustMod

file_name = "../datasets/cstr.mat"
matlab_dict = loadmat(file_name)
X = matlab_dict['fea']

model = CoclustMod(n_clusters=4)
model.fit(X)

print(model.modularity)
predicted_row_labels = model.row_labels_
predicted_column_labels = model.column_labels_

For example, the normalized mutual information score is computed using the scikit-learn library:

from sklearn.metrics.cluster import normalized_mutual_info_score as nmi

true_row_labels = matlab_dict['gnd'].flatten()

print(nmi(true_row_labels, predicted_row_labels))

Advanced usage overview

from coclust.io.data_loading import load_doc_term_data
from coclust.visualization import (plot_reorganized_matrix,
                                  plot_cluster_top_terms,
                                  plot_max_modularities)
from coclust.evaluation.internal import best_modularity_partition
from coclust.coclustering import CoclustMod

# read data
path = '../datasets/classic3_coclustFormat.mat'
doc_term_data = load_doc_term_data(path)
X = doc_term_data['doc_term_matrix']
labels = doc_term_data['term_labels']

# get the best co-clustering over a range of cluster numbers
clusters_range = range(2, 6)
model, modularities = best_modularity_partition(X, clusters_range, n_rand_init=1)

# plot the reorganized matrix
plot_reorganized_matrix(X, model)

# plot the top terms
n_terms = 10
plot_cluster_top_terms(X, labels, n_terms, model)

# plot the modularities over the range of cluster numbers
plot_max_modularities(modularities, range(2, 6))

scikit-learn pipeline

from coclust.coclustering import CoclustInfo

from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.cluster import normalized_mutual_info_score

categories = [
    'rec.motorcycles',
    'rec.sport.baseball',
    'comp.graphics',
    'sci.space',
    'talk.politics.mideast'
]

ng5 = fetch_20newsgroups(categories=categories, shuffle=True)

true_labels = ng5.target

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('coclust', CoclustInfo()),
])

pipeline.set_params(coclust__n_clusters=5)
pipeline.fit(ng5.data)

predicted_labels = pipeline.named_steps['coclust'].row_labels_

nmi = normalized_mutual_info_score(true_labels, predicted_labels)

print(nmi)

More examples

More examples are available as notebooks:

https://github.com/franrole/cclust_package/tree/master/demo

Python API

Coclustering

The coclust.coclustering module gathers implementations of co-clustering algorithms.

Classes

Each of the following classes implements a co-clustering algorithm:

coclust.coclustering.CoclustMod([…]) Co-clustering by direct maximization of graph modularity.
coclust.coclustering.CoclustSpecMod([…]) Co-clustering by spectral approximation of the modularity matrix.
coclust.coclustering.CoclustInfo([…]) Information-Theoretic Co-clustering.

User guide

coclust.coclustering.CoclustMod and coclust.coclustering.CoclustSpecMod are diagonal co-clustering algorithms whereas coclust.coclustering.CoclustInfo is a non-diagonal co-clustering algorithm.
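
The difference can be illustrated with a small sketch in plain Python, independent of coclust. The block_sums helper, the matrix and the labels are hypothetical. The sketch assumes the usual reading of the terms: a diagonal co-clustering uses the same number of row and column clusters and pairs them one-to-one, whereas a non-diagonal co-clustering may use different numbers of row and column clusters.

```python
# Toy illustration (not coclust code): summarise a matrix by the total of
# each (row cluster, column cluster) block.

def block_sums(matrix, row_labels, column_labels, n_row_clusters, n_col_clusters):
    """Return an n_row_clusters x n_col_clusters table of block totals."""
    sums = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            sums[row_labels[i]][column_labels[j]] += value
    return sums


X = [
    [5, 4, 0, 0, 1],
    [2, 7, 0, 0, 0],
    [0, 0, 3, 6, 2],
]

# Diagonal setting: 2 row clusters and 2 column clusters, paired one-to-one.
diagonal = block_sums(X, [0, 0, 1], [0, 0, 1, 1, 1], 2, 2)

# Non-diagonal setting: 2 row clusters but 3 column clusters.
non_diagonal = block_sums(X, [0, 0, 1], [0, 0, 1, 1, 2], 2, 3)

print(diagonal)
print(non_diagonal)
```

In the diagonal case most of the mass falls in the blocks where the row cluster and the column cluster share the same index. In the non-diagonal case every row cluster can be related to any column cluster.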

Clustering

The coclust.clustering module provides clustering algorithms.

class coclust.clustering.SphericalKmeans(n_clusters=2, init=None, max_iter=20, n_init=1, tol=1e-09, random_state=None, weighting=True)[source]

Spherical k-means clustering.

Parameters:
  • n_clusters (int, optional, default: 2) – Number of clusters to form
  • init (numpy array or scipy sparse matrix, shape (n_features, n_clusters), optional, default: None) – Initial column labels
  • max_iter (int, optional, default: 20) – Maximum number of iterations
  • n_init (int, optional, default: 1) – Number of times the algorithm will be run with different initializations. The final result is the best output of the n_init consecutive runs.
  • random_state (integer or numpy.RandomState, optional) – The generator used to initialize the centers. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.
  • tol (float, default: 1e-9) – Relative tolerance with regard to the criterion to declare convergence
  • weighting (boolean, default: True) – Flag to activate or deactivate TF-IDF weighting
labels_

array-like, shape (n_rows,) – cluster label of each row

criterion

float – criterion obtained from the best run

criterions

list of floats – sequence of criterion values during the best run

fit(X, y=None)[source]

Perform clustering.

Parameters:X (numpy array or scipy sparse matrix, shape=(n_samples, n_features)) – Matrix to be analyzed

Spherical k-means

coclust.clustering.spherical_kmeans provides an implementation of the spherical k-means algorithm.

class coclust.clustering.spherical_kmeans.SphericalKmeans(n_clusters=2, init=None, max_iter=20, n_init=1, tol=1e-09, random_state=None, weighting=True)[source]

Spherical k-means clustering.

Parameters:
  • n_clusters (int, optional, default: 2) – Number of clusters to form
  • init (numpy array or scipy sparse matrix, shape (n_features, n_clusters), optional, default: None) – Initial column labels
  • max_iter (int, optional, default: 20) – Maximum number of iterations
  • n_init (int, optional, default: 1) – Number of times the algorithm will be run with different initializations. The final result is the best output of the n_init consecutive runs.
  • random_state (integer or numpy.RandomState, optional) – The generator used to initialize the centers. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.
  • tol (float, default: 1e-9) – Relative tolerance with regard to the criterion to declare convergence
  • weighting (boolean, default: True) – Flag to activate or deactivate TF-IDF weighting
labels_

array-like, shape (n_rows,) – cluster label of each row

criterion

float – criterion obtained from the best run

criterions

list of floats – sequence of criterion values during the best run

fit(X, y=None)[source]

Perform clustering.

Parameters:X (numpy array or scipy sparse matrix, shape=(n_samples, n_features)) – Matrix to be analyzed
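
For readers unfamiliar with the technique, the following is a minimal sketch of the general spherical k-means idea in plain Python. It is not the coclust implementation: the function names are hypothetical, the data is a list of lists, and the TF-IDF weighting, the n_init restarts and the tol-based stopping rule listed above are left out. Rows are scaled to unit length, each row is assigned to the centroid with the highest cosine similarity, and each centroid is recomputed as the normalized sum of its members.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are kept as is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def spherical_kmeans_sketch(rows, n_clusters=2, max_iter=20, seed=0):
    """Cluster rows by cosine similarity; returns one label per row."""
    rng = random.Random(seed)
    unit_rows = [normalize(r) for r in rows]
    # Start from randomly chosen distinct rows as centroids.
    chosen = rng.sample(range(len(unit_rows)), n_clusters)
    centroids = [list(unit_rows[i]) for i in chosen]
    labels = None
    for _ in range(max_iter):
        # Assignment step: pick the centroid with the largest dot product,
        # which equals the cosine similarity for unit-length vectors.
        new_labels = [
            max(range(n_clusters),
                key=lambda k: sum(a * b for a, b in zip(row, centroids[k])))
            for row in unit_rows
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: normalized sum of the members of each cluster.
        for k in range(n_clusters):
            members = [r for r, l in zip(unit_rows, labels) if l == k]
            if members:
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return labels


rows = [[1, 0], [2, 0.1], [0, 1], [0.1, 3]]
labels = spherical_kmeans_sketch(rows, n_clusters=2)
print(labels)
```

Because only the direction of each row matters, the rows [1, 0] and [2, 0.1] end up in the same cluster even though their lengths differ. Which cluster receives label 0 depends on the random initialization.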

Input and output

The coclust.io module provides functions to load data and to check that it is valid input for a clustering or co-clustering algorithm.

Data loading

The coclust.io.data_loading module provides functions to load data from files of different types.

coclust.io.data_loading.load_doc_term_data(data_filepath, term_labels_filepath=None, doc_labels_filepath=None)[source]

Load co-occurrence data from a .[…]sv or a .mat file.

The expected formats are:

  • (data_filepath).[...]sv: three […] separated columns:

    1st line:
    • 1st column: number of documents
    • 2nd column: number of words
    Other lines:
    • 1st column: document index
    • 2nd column: word index
    • 3rd column: word counts
  • (data_filepath).mat: matlab file with fields:

    • 'doc_term_matrix': scipy.sparse.csr_matrix of shape (#docs, #terms)
    • 'doc_labels': list of int (len = #docs)
    • 'term_labels': list of string (len = #terms)

    If the key 'doc_term_matrix' is not found, data loading fails. If the key 'doc_labels' or the key 'term_labels' is missing, a warning message is displayed.

Term and doc labels can be loaded separately from a one-column .[x]sv|.txt file:

  • (term_labels_filepath).[x]sv|.txt:
    one column, one term label per row. The row index is assumed to correspond to the term index in the (columns of the) co-occurrence data matrix.
  • (doc_labels_filepath).[x]sv|.txt:
    one column, one document label per row. The row index is assumed to correspond to the non zero value number read by row from the co-occurrence data matrix.
Parameters:data_filepath (string) – Path to the file that contains the co-occurrence data
Returns:
  • 'doc_term_matrix': scipy.sparse.csr_matrix of shape (#docs, #terms)
  • 'doc_labels': list of int (#docs)
  • 'term_labels': list of string (#terms)
Return type:a dictionary
Raises:ValueError – If the input file is not found or if its content is not correct.

Example

>>> from coclust.io.data_loading import load_doc_term_data
>>> data = load_doc_term_data('../datasets/classic3.csv')
>>> data['doc_term_matrix'].shape
(3891, 4303)
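
To clarify the separated-values layout described above, here is a minimal sketch of a parser written in plain Python. It is not the coclust loader: parse_doc_term_triplets is a hypothetical name, the comma separator and the 0-based indices are assumptions, and the counts are returned in a dictionary rather than in a scipy sparse matrix.

```python
import csv
import io


def parse_doc_term_triplets(text, delimiter=","):
    """Parse 'n_docs,n_terms' followed by 'doc,term,count' lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # First line: number of documents, number of words.
    n_docs, n_terms = (int(v) for v in next(reader)[:2])
    counts = {}
    for line in reader:
        if not line:
            continue  # skip blank lines
        doc, term, count = int(line[0]), int(line[1]), int(line[2])
        # Assumption: indices are 0-based.
        if not (0 <= doc < n_docs and 0 <= term < n_terms):
            raise ValueError("index out of range: %r" % (line,))
        counts[(doc, term)] = count
    return n_docs, n_terms, counts


sample = "3,4\n0,0,2\n0,3,1\n2,1,5\n"
n_docs, n_terms, counts = parse_doc_term_triplets(sample)
print(counts)
```

Only the non-zero entries are listed in such a file, which is why a sparse representation is a natural target for this format.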

Input checking

The coclust.io.input_checking module provides functions to check input matrices.

coclust.io.input_checking.check_array(a, pos=True)[source]

Check that an array contains numeric values and has no empty rows or columns.

Parameters:
  • a – The input array
  • pos (bool) – If True, check that the values are positive
Raises:
  • TypeError – If the array is not a Numpy/SciPy array or matrix or if the values are not numeric.
  • ValueError – If the array contains empty rows or columns or contains NaN values, or negative values (if pos is True).
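
The checks listed above can be sketched in plain Python as follows. This is an illustration only, not the coclust code: check_matrix_sketch is a hypothetical name, it works on a list of lists rather than on a NumPy/SciPy array, and it assumes that an "empty" row or column is one that contains only zeros.

```python
import math
import numbers


def check_matrix_sketch(a, pos=True):
    """Raise TypeError or ValueError if the list-of-lists matrix is unusable."""
    for row in a:
        for value in row:
            # bool is a subclass of int, so it is rejected explicitly.
            if isinstance(value, bool) or not isinstance(value, numbers.Real):
                raise TypeError("non-numeric value: %r" % (value,))
            if math.isnan(value):
                raise ValueError("NaN value found")
            if pos and value < 0:
                raise ValueError("negative value found: %r" % (value,))
    # Assumption: an empty row or column contains only zeros.
    if any(all(v == 0 for v in row) for row in a):
        raise ValueError("empty row found")
    if any(all(v == 0 for v in col) for col in zip(*a)):
        raise ValueError("empty column found")
    return a


check_matrix_sketch([[1, 0], [0, 2]])  # passes silently
```
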
coclust.io.input_checking.check_numbers(matrix, n_clusters)[source]

Check if the given matrix has enough rows and columns for the given number of co-clusters.

Parameters:
  • matrix – The input matrix
  • n_clusters (int) – Number of co-clusters
Raises:

ValueError – If the data matrix does not have enough rows or columns.

coclust.io.input_checking.check_numbers_clustering(matrix, n_clusters)[source]

Check if the given matrix has enough rows and columns for the given number of clusters.

Parameters:
  • matrix – The input matrix
  • n_clusters (int) – Number of clusters
Raises:

ValueError – If the data matrix does not have enough rows or columns.

coclust.io.input_checking.check_numbers_non_diago(matrix, n_row_clusters, n_col_clusters)[source]

Check if the given matrix has enough rows and columns for the given number of row and column clusters.

Parameters:
  • matrix – The input matrix
  • n_row_clusters (int) – Number of row clusters
  • n_col_clusters (int) – Number of column clusters
Raises:

ValueError – If the data matrix does not have enough rows or columns.

coclust.io.input_checking.check_positive(X)[source]

Check if all values are positive.

Parameters:X (numpy array or scipy sparse matrix) – Matrix to be analyzed
Raises:ValueError – If the matrix contains negative values.
Returns:X
Return type:numpy array or scipy sparse matrix

Jupyter and IPython Notebook utilities

The coclust.io.notebook module provides functions to manage input and output in the evaluation notebook.

coclust.io.notebook.input_with_default_int(prompt, prefill)[source]

Prompt an int.

Parameters:
  • prompt (string) – The message printed before the field.
  • prefill (int) – The default value.
Returns:

The value entered by the user or the default value.

Return type:

int

coclust.io.notebook.input_with_default_str(prompt, prefill)[source]

Prompt a string.

Parameters:
  • prompt (string) – The message printed before the field.
  • prefill (string) – The default value.
Returns:

The value entered by the user or the default value.

Return type:

string

Evaluation

The coclust.evaluation module provides functions to evaluate the results of clustering or co-clustering algorithms.

Internal measures

The coclust.evaluation.internal module provides functions to evaluate clustering or co-clustering given internal criteria.

coclust.evaluation.internal.best_modularity_partition(in_data, nbr_clusters_range, n_rand_init=1)[source]

Evaluate the best partition over a range of numbers of clusters using co-clustering by direct maximization of graph modularity.

Parameters:
  • in_data (numpy array or scipy sparse matrix, shape=(n_samples, n_features)) – Matrix to be analyzed
  • nbr_clusters_range – Range of the numbers of clusters to be evaluated
  • n_rand_init – Number of times the algorithm will be run with different initializations
Returns:

  • tmp_best_model (coclust.coclustering.CoclustMod) – model with highest final modularity
  • tmp_max_modularities (list) – final modularities for all evaluated partitions
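
The selection principle can be sketched in plain Python. This is an illustration under stated assumptions, not the coclust code. The modularity function assumes the usual bipartite definition, in which each cell that belongs to a co-cluster contributes its value minus the product of its row and column totals divided by the grand total. The best_partition helper is hypothetical and takes candidate partitions written by hand, whereas the library function fits a CoclustMod model for each number of clusters.

```python
def modularity(matrix, row_labels, column_labels):
    """Bipartite modularity of a diagonal co-clustering (assumed definition)."""
    total = float(sum(sum(row) for row in matrix))
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    q = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            # Only cells whose row and column share a cluster contribute.
            if row_labels[i] == column_labels[j]:
                q += value - row_sums[i] * col_sums[j] / total
    return q / total


def best_partition(matrix, candidates):
    """candidates maps a number of clusters to (row_labels, column_labels)."""
    scores = {k: modularity(matrix, r, c) for k, (r, c) in candidates.items()}
    best_k = max(scores, key=scores.get)
    return best_k, scores


X = [
    [5, 4, 0, 0],
    [2, 7, 0, 0],
    [0, 0, 3, 6],
    [0, 0, 1, 2],
]
candidates = {
    1: ([0, 0, 0, 0], [0, 0, 0, 0]),
    2: ([0, 0, 1, 1], [0, 0, 1, 1]),
    3: ([0, 1, 2, 2], [0, 1, 2, 2]),
}
best_k, scores = best_partition(X, candidates)
print(best_k)
```

On this toy matrix the two-cluster partition scores highest, which matches the two visible blocks.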

External measures

The coclust.evaluation.external module provides functions to evaluate clustering or co-clustering results with external information such as the true labeling of the clusters.

coclust.evaluation.external.accuracy(true_row_labels, predicted_row_labels)[source]

Get the best accuracy.

Parameters:
  • true_row_labels (array-like) – The true row labels, given as external information
  • predicted_row_labels (array-like) – The row labels predicted by the model
Returns:

Best value of accuracy

Return type:

float
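
One way to read "best accuracy" is the accuracy under the most favourable one-to-one mapping between predicted clusters and true classes, since cluster numbers are arbitrary. The sketch below illustrates that reading with a brute-force search in plain Python. It is an assumption about the intent, not the coclust implementation, the name best_accuracy_sketch is hypothetical, and the search is only practical for a handful of clusters.

```python
from itertools import permutations


def best_accuracy_sketch(true_labels, predicted_labels):
    """Accuracy under the best one-to-one relabelling of predicted clusters."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))
    # Pad with placeholders so each predicted cluster gets a distinct target.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best_hits = 0
    for assignment in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, assignment))
        hits = sum(1 for t, p in zip(true_labels, predicted_labels)
                   if mapping[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / float(len(true_labels))


true_row_labels = [0, 0, 1, 1, 2, 2]
predicted_row_labels = [1, 1, 0, 0, 2, 0]
print(best_accuracy_sketch(true_row_labels, predicted_row_labels))
```

Here the predicted clusters 1 and 0 correspond to the true classes 0 and 1, so five of the six rows are counted as correct once the labels are matched.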

Visualization

The coclust.visualization module provides functions to visualize different measures or data.

General functions

coclust.visualization.plot_cluster_top_terms(…) Plot the top terms for each cluster.
coclust.visualization.get_term_graph(X, …) Get a graph of terms.
coclust.visualization.plot_cluster_sizes(model) Plot the sizes of the clusters.
coclust.visualization.plot_reorganized_matrix(X, …) Plot the reorganized matrix.
coclust.visualization.plot_confusion_matrix(cm) Plot a confusion matrix.
coclust.visualization.plot_convergence(…) Plot the convergence of a given criterion.

Model specific functions

coclust.visualization.plot_max_modularities(…) Plot all max modularities obtained after a series of evaluations.
coclust.visualization.plot_intermediate_modularities(model) Plot all intermediate modularities for a model.
coclust.visualization.plot_delta_kl(model[, …]) Plot the delta values of the Information-Theoretic Co-clustering.

Initialization

The coclust.initialization module provides functions to initialize clustering or co-clustering algorithms.

coclust.initialization.random_init(n_clusters, n_cols, random_state=None)[source]

Create a random column cluster assignment matrix.

Each row contains 1 in the column corresponding to the cluster where the processed data matrix column belongs, 0 elsewhere.

Parameters:
  • n_clusters (int) – Number of clusters
  • n_cols (int) – Number of columns of the data matrix (i.e. number of rows of the matrix returned by this function)
  • random_state (int or numpy.RandomState, optional) – The generator used to initialize the cluster labels. Defaults to the global numpy random number generator.
Returns:

Matrix of shape (n_cols, n_clusters)

Return type:

matrix

coclust.initialization.random_init_clustering(n_clusters, n_rows, random_state=None)[source]

Create a random row cluster assignment matrix.

Each row contains 1 in the column corresponding to the cluster where the processed data matrix row belongs, 0 elsewhere.

Parameters:
  • n_clusters (int) – Number of clusters
  • n_rows (int) – Number of rows of the data matrix (i.e. also the number of rows of the matrix returned by this function)
  • random_state (int or numpy.RandomState, optional) – The generator used to initialize the cluster labels. Defaults to the global numpy random number generator.
Returns:

Matrix of shape (n_rows, n_clusters)

Return type:

matrix
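
The structure of the matrices returned by these two functions can be sketched in plain Python. This is an illustration only: random_assignment_sketch is a hypothetical name, it returns a list of lists, and it uses the standard random module instead of a numpy random state.

```python
import random


def random_assignment_sketch(n_clusters, n_items, seed=None):
    """Return an n_items x n_clusters 0/1 matrix with a single 1 per row."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_items):
        row = [0] * n_clusters
        row[rng.randrange(n_clusters)] = 1  # pick one cluster at random
        matrix.append(row)
    return matrix


W = random_assignment_sketch(n_clusters=3, n_items=5, seed=0)
for row in W:
    print(row)
```

Each item (a column of the data matrix for random_init, a row for random_init_clustering) therefore belongs to exactly one cluster, and fixing the seed makes the assignment repeatable.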

Scripts

The input matrix can be a Matlab file or a text file. For the Matlab file, the key corresponding to the matrix must be given. For the text file, each line should describe an entry of a matrix with three columns: the row index, the column index and the value. The separator is given by a script parameter.

Perform co-clustering: the coclust script

The coclust script can be used to run a particular co-clustering algorithm on a data matrix. The user has to select an algorithm which is given as a first argument to coclust. The choices are:

  • modularity
  • specmodularity
  • info

The following command line shows how to run the CoclustMod algorithm three times on a matrix contained in a Matlab file whose matrix key is the string ‘fea’. The computed row labels are to be stored in a file called cstr-rows.txt:

coclust modularity -k fea --n_coclusters 4 --output_row_labels cstr-rows.txt --n_runs 3 cstr.mat

To have a list of all possible parameters for a given algorithm use the -h option as in the following example:

coclust modularity -h
usage: coclust [-h] {modularity,specmodularity,info} ...

Positional Arguments

subparser_name

Possible choices: modularity, specmodularity, info

choose the algorithm to use

Sub-commands:

modularity

use the modularity based algorithm

coclust modularity [-h] [-k MATLAB_MATRIX_KEY | -sep CSV_SEP]
                   [--output_row_labels OUTPUT_ROW_LABELS]
                   [--output_column_labels OUTPUT_COLUMN_LABELS]
                   [--output_fuzzy_row_labels OUTPUT_FUZZY_ROW_LABELS]
                   [--output_fuzzy_column_labels OUTPUT_FUZZY_COLUMN_LABELS]
                   [--convergence_plot CONVERGENCE_PLOT]
                   [--reorganized_matrix REORGANIZED_MATRIX] [-n N_COCLUSTERS]
                   [-m MAX_ITER] [-e EPSILON]
                   [-i INIT_ROW_LABELS | --n_runs N_RUNS] [--seed SEED]
                   [-l TRUE_ROW_LABELS] [--visu]
                   INPUT_MATRIX

input

INPUT_MATRIX matrix file path
-k, --matlab_matrix_key
 if not set, csv input is considered
-sep, --csv_sep
 if not set, “,” is considered; use “t” for tab-separated values
 Default: “,”

output

--output_row_labels
 file path for the predicted row labels
--output_column_labels
 file path for the predicted column labels
--output_fuzzy_row_labels
 file path for the predicted fuzzy row labels
 Default: 2
--output_fuzzy_column_labels
 file path for the predicted fuzzy column labels
 Default: 2
--convergence_plot
 file path for the convergence plot
--reorganized_matrix
 file path for the reorganized matrix

algorithm parameters

-n, --n_coclusters
 number of co-clusters
 Default: 2
-m, --max_iter
 maximum number of iterations
 Default: 15
-e, --epsilon
 stop if the criterion (modularity) variation in an iteration is less than EPSILON
 Default: 1e-09
-i, --init_row_labels
 file containing the initial row labels, if not set random initialization is performed
--n_runs
 number of runs
 Default: 1
--seed
 set the random state, useful for reproducible results

evaluation parameters

-l, --true_row_labels
 file containing the true row labels
--visu
 Plot modularity values and reorganized matrix (requires NumPy, SciPy and matplotlib).
 Default: False

specmodularity

use the spectral modularity based algorithm

coclust specmodularity [-h] [-k MATLAB_MATRIX_KEY | -sep CSV_SEP]
                       [--output_row_labels OUTPUT_ROW_LABELS]
                       [--output_column_labels OUTPUT_COLUMN_LABELS]
                       [--reorganized_matrix REORGANIZED_MATRIX]
                       [-n N_COCLUSTERS] [-m MAX_ITER] [-e EPSILON]
                       [--n_runs N_RUNS] [--seed SEED] [-l TRUE_ROW_LABELS]
                       [--visu]
                       INPUT_MATRIX

input

INPUT_MATRIX matrix file path
-k, --matlab_matrix_key
 if not set, csv input is considered
-sep, --csv_sep
 if not set, “,” is considered; use “t” for tab-separated values
 Default: “,”

output

--output_row_labels
 file path for the predicted row labels
--output_column_labels
 file path for the predicted column labels
--reorganized_matrix
 file path for the reorganized matrix

algorithm parameters

-n, --n_coclusters
 number of co-clusters
 Default: 2
-m, --max_iter
 maximum number of iterations
 Default: 15
-e, --epsilon
 stop if the criterion (modularity) variation in an iteration is less than EPSILON
 Default: 1e-09
--n_runs
 number of runs
 Default: 1
--seed
 set the random state, useful for reproducible results

evaluation parameters

-l, --true_row_labels
 file containing the true row labels
--visu
 Plot modularity values and reorganized matrix (requires NumPy, SciPy and matplotlib).
 Default: False

info

Undocumented

coclust info [-h] [-k MATLAB_MATRIX_KEY | -sep CSV_SEP]
             [--output_row_labels OUTPUT_ROW_LABELS]
             [--output_column_labels OUTPUT_COLUMN_LABELS]
             [--reorganized_matrix REORGANIZED_MATRIX] [-K N_ROW_CLUSTERS]
             [-L N_COL_CLUSTERS] [-m MAX_ITER] [-e EPSILON]
             [-i INIT_ROW_LABELS | --n_runs N_RUNS] [--seed SEED]
             [-l TRUE_ROW_LABELS] [--visu]
             INPUT_MATRIX

input

INPUT_MATRIX matrix file path
-k, --matlab_matrix_key
 if not set, csv input is considered
-sep, --csv_sep
 if not set, “,” is considered; use “t” for tab-separated values
 Default: “,”

output

--output_row_labels
 file path for the predicted row labels
--output_column_labels
 file path for the predicted column labels
--reorganized_matrix
 file path for the reorganized matrix

algorithm parameters

-K, --n_row_clusters
 number of row clusters
 Default: 2
-L, --n_col_clusters
 number of column clusters
 Default: 2
-m, --max_iter
 maximum number of iterations
 Default: 15
-e, --epsilon
 stop if the criterion (modularity) variation in an iteration is less than EPSILON
 Default: 1e-09
-i, --init_row_labels
 file containing the initial row labels, if not set random initialization is performed
--n_runs
 number of runs
 Default: 1
--seed
 set the random state, useful for reproducible results

evaluation parameters

-l, --true_row_labels
 file containing the true row labels
--visu
 Plot modularity values and reorganized matrix (requires NumPy, SciPy and matplotlib).
 Default: False

Detect the best number of co-clusters: the coclust-nb script

coclust-nb detects the number of co-clusters giving the best modularity score. It therefore relies on the CoclustMod algorithm. This is a simple yet often effective way to determine the appropriate number of co-clusters. A sample usage is given below:

coclust-nb cstr.csv --seed=1 --n_runs=20 --max_iter=60 --from 2 --to 6
usage: coclust-nb [-h] [-k MATLAB_MATRIX_KEY | -sep CSV_SEP]
                  [--output_row_labels OUTPUT_ROW_LABELS]
                  [--output_column_labels OUTPUT_COLUMN_LABELS]
                  [--reorganized_matrix REORGANIZED_MATRIX] [--from FROM]
                  [--to TO] [-m MAX_ITER] [-e EPSILON] [--n_runs N_RUNS]
                  [--seed SEED] [--visu]
                  INPUT_MATRIX

input

INPUT_MATRIX matrix file path
-k, --matlab_matrix_key
 if not set, csv input is considered
-sep, --csv_sep
 if not set, “,” is considered; use “t” for tab-separated values
 Default: “,”

output

--output_row_labels
 file path for the predicted row labels
--output_column_labels
 file path for the predicted column labels
--reorganized_matrix
 file path for the reorganized matrix

algorithm parameters

--from
 minimum number of co-clusters
 Default: 2
--to
 maximum number of co-clusters
 Default: 10
-m, --max_iter
 maximum number of iterations
 Default: 15
-e, --epsilon
 stop if the criterion (modularity) variation in an iteration is less than EPSILON
 Default: 1e-09
--n_runs
 number of runs
 Default: 1
--seed
 set the random state, useful for reproducible results

evaluation parameters

--visu
 Plot modularity values and reorganized matrix (requires NumPy, SciPy and matplotlib).
 Default: False
