Welcome to pyradigm

Pyradigm is a PYthon based data structure to ease and improve Dataset’s InteGrity in Machine learning workflows.


About

Background

A common problem for machine learning developers is keeping track of the source of the extracted features, and ensuring the integrity of the dataset (e.g. not mixing up data from different subjects and/or classes). This becomes increasingly hard as the number of projects grows, or when personnel changes are frequent, which can break the chain of hyper-local information about the datasets: where the original data came from, how it was processed or quality controlled, how it was put together, by whom, what some columns in the table mean, etc. This package aims to provide a Python data structure to encapsulate a machine learning dataset with key info, greatly suited for neuroimaging applications (or similar domains) where each sample needs to be uniquely identified with a subject ID (or something similar). Key-level correspondence across data, labels (e.g. 1 or 2), classnames (e.g. ‘healthy’, ‘disease’) and the related attributes helps maintain data integrity, and attributes like a free-text description help annotate all the important information. The class methods offer the ability to arbitrarily combine and subset datasets, while automatically updating their descriptions, reducing the burden of keeping track of the original source of the features.

Check the Usage examples and API Reference pages, and let me know your comments.

Context

For users of Pandas, some elements of pyradigm‘s API/interface may look familiar. However, the aim of this data structure is not to offer an alternative to pandas, but to ease the machine learning workflow for neuroscientists by

  1. offering several well-knit methods and useful attributes specifically geared towards neuroscience research,
  2. offering utilities that combine multiple or advanced patterns of routine dataset handling, and
  3. using more accessible language (compared to the pandas docs, which are aimed at an econometric audience) to better cater to neuroscience developers (especially novices).

Thanks for checking it out. Your feedback will be appreciated.

Installation

Pyradigm can easily be installed with a single command:

pip install pyradigm

If you lack sudo access, try

pip install pyradigm --user

Requirements

  • Packages: numpy
  • Supported Python versions: 2.7, 3.5 and 3.6

Usage examples

This class is greatly suited for neuroimaging applications (or any other domain) where each sample needs to be uniquely identified with a subject ID (or something similar).

Key-level correspondence across data, labels (1 or 2), classnames (‘healthy’, ‘disease’) and related attributes helps maintain data integrity and improves provenance, in addition to enabling traceback to the original sources from which the features were derived.

Just to give you a concrete example, let’s look at how an ML dataset is traditionally handled.

You have a matrix X of size n x p, with n samples and p features, and a vector y containing the target values (or class labels, or class identifiers). This X and y serve as the training (and test) set for a classifier like an SVM, which fits the data X to match y as accurately as possible.

Let’s get a little more concrete:

import sys, os
import numpy as np
import matplotlib
%matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

n = 10 # number of samples
p = 3  # number of features

X = np.random.random([n, p]) # random data for illustration
y = [1]*5 + [2]*5            # class labels: 5 samples in class 1, 5 in class 2

np.set_printoptions(precision=2) # save some screen space
print('X : \n{}'.format(X))
print('y : \n{}'.format(y))
Using matplotlib backend: TkAgg
X :
[[ 0.73  0.85  0.3 ]
 [ 0.63  0.09  0.87]
 [ 0.14  0.71  0.19]
 [ 0.25  0.33  0.08]
 [ 0.8   0.85  0.99]
 [ 0.78  0.76  0.47]
 [ 0.25  0.54  0.18]
 [ 0.57  0.98  0.36]
 [ 0.1   0.1   0.74]
 [ 0.16  0.76  0.53]]
y :
[1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

Almost all the machine learning toolboxes take their input in this form: X and y, regardless of the original source that produced these features in the first place.

This is all fine if all you ever wanted to do was to extract some features, do some machine learning, and discard the features afterwards!

**But this is almost never the case!**

Because it doesn’t simply end there.

At a minimum, I often need to know:

  • which samples are misclassified, meaning you need their identifiers and not simply their row indices in X
  • what the characteristics of those samples are
  • which classes they belong to

And all this info needs to be obtained:

  • without having to write lots of code connecting a few non-obvious links across disparate sources of data (numerical features in X, and sample identifiers in a CSV file) to find the relevant info
  • without having to track down who or which method originally produced these features
  • without having to figure out how the previous personnel or grad student organized the whole dataset, if you haven’t generated the features yourself from scratch

And if you are like me, you would be thinking about how to organize your workflow such that the aforementioned tasks can be accomplished with ease.

This data structure attempts to accomplish exactly that, by always organizing the extracted features into a dictionary keyed by their sample id, together with other important info such as target values and class identifiers. This, by design, preserves the integrity of the data (making it hard to incorrectly label samples, for example).
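
Conceptually, the organization looks something like the following rough sketch (toy values and hypothetical variable names, not pyradigm’s actual internals), where every piece of information is keyed by the same sample id:

from collections import OrderedDict

# every piece of info is keyed by the same sample id (conceptual sketch only)
data    = OrderedDict([('C0562', [0.64, 0.59]), ('A0034', [0.35, 0.96])])
labels  = OrderedDict([('C0562', 1),            ('A0034', 2)])
classes = OrderedDict([('C0562', 'Cntrl'),      ('A0034', 'Alzmr')])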

No, this data structure doesn’t offer full provenance tracking, which is quite a challenging problem. But it tries to make your life a little easier in your ML workflows.

An example application is shown below, touching upon the following topics:

  • Motivation
  • Constructing a dataset
  • Attributes
  • Accessing samples
  • Iteration over samples
  • Subset selection
  • Saving/reloading a dataset (Serialization)
  • Combining datasets and dividing them into useful subsets
  • Portability (e.g. with sklearn)

Importing the necessary modules and our fancy class definition:

from pyradigm import MLDataset

We can now instantiate it and give it a description:

dataset = MLDataset()
dataset.description = 'ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.'
dataset
ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
Empty dataset.

You can see the dataset has some description attached to it; however, we know it is empty. This can be verified in a boolean context as shown below:

bool(dataset)
False

Let’s add samples to this dataset, which is when this dataset implementation becomes really handy. Before we do that, we will define some convenience routines to illustrate a simple yet common use of this dataset.

def read_thickness(path):
    """Dummy function to mimic a data reader."""

    # in your actual routine, this might be:
    #   pysurfer.read_thickness(path).values()
    return np.random.random(2)


def get_features(work_dir, subj_id):
    """Returns the whole brain cortical thickness for a given subject ID."""

    # extension to identify the data file; this could be .curv, anything else you choose
    ext_thickness = '.thickness'

    thickness = dict()
    for hemi in ['lh', 'rh']:
        path_thickness = os.path.join(work_dir, subj_id, hemi + ext_thickness)
        thickness[hemi] = read_thickness(path_thickness)

    # concatenating them to build a whole brain feature set
    thickness_wb = np.concatenate([thickness['lh'], thickness['rh']])

    return thickness_wb

So now we have IO routines to read the data for us. Let’s define where the data will come from:

work_dir = '/project/ADNI/FreesurferThickness_v4p3'
class_set = ['Cntrl', 'Alzmr', 'MCI']
class_sizes = [15, 12, 18]

This would obviously change for your applications, but this has sufficient properties to illustrate the point.

Let’s look at what methods this dataset offers us:

dir(dataset)
['add_classes',
 'add_sample',
 'class_set',
 'class_sizes',
 'classes',
 'data',
 'data_and_labels',
 'del_sample',
 'description',
 'extend',
 'feature_names',
 'get_class',
 'get_feature_subset',
 'get_subset',
 'glance',
 'keys',
 'num_classes',
 'num_features',
 'num_samples',
 'random_subset',
 'random_subset_ids',
 'random_subset_ids_by_count',
 'sample_ids',
 'sample_ids_in_class',
 'save',
 'summarize_classes',
 'train_test_split_ids',
 'transform']

That’s a lot of convenience methods to organize and retrieve the dataset.

So let’s go through them by their usage sections.

Constructor

You can see there are a few methods such as add_sample, get_subset etc., the most important being add_sample, which is key to constructing this dataset. Let’s go ahead and add some samples.

To construct a dataset, one typically starts with a list of subject IDs to be added - we create a few random lists, each to be considered as a separate class:

import random
from datetime import datetime
random.seed(datetime.now())

def read_target_list(class_name, class_size):
    """Generates a random list of subject IDs. In reality, you would read these IDs from your study records."""
    target_list = list()
    for idx in range(class_size):
        target_list.append('{}{:04d}'.format(class_name[0],np.random.randint(1000)))

    return target_list

Now we go through each of the above classes, and add each sample in that class to the dataset.

for class_index, class_id in enumerate(class_set):
    print('Working on class {:>5}'.format(class_id))

    target_list = read_target_list(class_id,class_sizes[class_index])
    for subj_id in target_list:
        print('\t reading subject {:>15}'.format(subj_id))
        thickness_wb = get_features(work_dir, subj_id)

        # adding the sample to the dataset
        dataset.add_sample(subj_id, thickness_wb, class_index, class_id)
Working on class Cntrl
     reading subject           C0562
     reading subject           C0408
     reading subject           C0760
     reading subject           C0170
     reading subject           C0241
     reading subject           C0980
     reading subject           C0822
     reading subject           C0565
     reading subject           C0949
     reading subject           C0041
     reading subject           C0372
     reading subject           C0141
     reading subject           C0492
     reading subject           C0064
     reading subject           C0557
Working on class Alzmr
     reading subject           A0034
     reading subject           A0768
     reading subject           A0240
     reading subject           A0042
     reading subject           A0141
     reading subject           A0888
     reading subject           A0032
     reading subject           A0596
     reading subject           A0969
     reading subject           A0215
     reading subject           A0074
     reading subject           A0229
Working on class   MCI
     reading subject           M0760
     reading subject           M0434
     reading subject           M0033
     reading subject           M0942
     reading subject           M0034
     reading subject           M0868
     reading subject           M0595
     reading subject           M0476
     reading subject           M0770
     reading subject           M0577
     reading subject           M0638
     reading subject           M0421
     reading subject           M0006
     reading subject           M0552
     reading subject           M0040
     reading subject           M0165
     reading subject           M0256
     reading subject           M0127

Nice, isn’t it?

So what’s so nice about this, you ask? The simple fact that you are constructing a dataset as you read the data in its most elemental form (in the units of the dataset, such as the subject ID in our neuroimaging application). You’re done as soon as you’re done reading the features from disk.

What’s more - you can inspect the dataset in an intuitive manner, as shown below:

dataset
ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
45 samples, 3 classes, 4 features.
Class Cntrl : 15 samples.
Class Alzmr : 12 samples.
Class   MCI : 18 samples.

Even better, right? No more coding of several commands to get the complete and concise sense of the dataset.

Convenient attributes

If you would like, you can always get more specific information, such as:

dataset.num_samples
45
dataset.num_features
4
dataset.class_set
['MCI', 'Cntrl', 'Alzmr']
dataset.class_sizes
Counter({'Alzmr': 12, 'Cntrl': 15, 'MCI': 18})
dataset.class_sizes['Cntrl']
15
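
A few more such attributes are documented in the API Reference; for instance (a quick sketch, assuming the same session as above):

dataset.num_classes      # 3, as we added three classes
dataset.feature_names    # names of the features, if they were supplied to add_sample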

If you’d like to take a look at the data inside for a few subjects - shall we call it a glance?

dataset.glance()
{'C0170': array([ 0.37,  0.78,  0.5 ,  0.79]),
 'C0241': array([ 0.11,  0.18,  0.58,  0.36]),
 'C0408': array([ 0.49,  0.38,  0.05,  0.82]),
 'C0562': array([ 0.64,  0.59,  0.01,  0.8 ]),
 'C0760': array([ 0.12,  0.51,  0.95,  0.23])}

We can control the number of items to glance at by passing a number to the dataset.glance() method:

dataset.glance(2)
{'C0408': array([ 0.49,  0.38,  0.05,  0.82]),
 'C0562': array([ 0.64,  0.59,  0.01,  0.8 ])}

Or you may be wondering what the subject IDs in the dataset are... here they are:

dataset.sample_ids
['C0562',
 'C0408',
 'C0760',
 'C0170',
 'C0241',
 'C0980',
 'C0822',
 'C0565',
 'C0949',
 'C0041',
 'C0372',
 'C0141',
 'C0492',
 'C0064',
 'C0557',
 'A0034',
 'A0768',
 'A0240',
 'A0042',
 'A0141',
 'A0888',
 'A0032',
 'A0596',
 'A0969',
 'A0215',
 'A0074',
 'A0229',
 'M0760',
 'M0434',
 'M0033',
 'M0942',
 'M0034',
 'M0868',
 'M0595',
 'M0476',
 'M0770',
 'M0577',
 'M0638',
 'M0421',
 'M0006',
 'M0552',
 'M0040',
 'M0165',
 'M0256',
 'M0127']

Accessing samples

Thanks to the elegant implementation, data for a given sample (e.g. ‘M0040’) can simply be obtained by:

dataset['M0040']
array([ 0.27,  0.52,  0.61,  0.49])

Like a Python dict, it raises an error if the key is not in the dataset:

dataset['dlfjdjf']
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-22-4b19d52bac71> in <module>()
----> 1 dataset['dlfjdjf']


~/dev/pyradigm/pyradigm/pyradigm.py in __getitem__(self, item)
    839             return self.__data[item]
    840         else:
--> 841             raise KeyError('{} not found in dataset.'.format(item))
    842
    843     def __iter__(self):


KeyError: 'dlfjdjf not found in dataset.'

A more graceful way to handle this is to use dataset.get, which lets you control what value is returned in case the requested id is not found in the dataset.

dataset.get('dkfjd', np.nan)
nan
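
The companion del_sample method (see the API Reference) removes a sample by its id; a minimal sketch, working on a copy so that the original dataset stays intact:

# remove a sample by id, on a copy so the original dataset is untouched
copy_ds = MLDataset(in_dataset=dataset)
copy_ds.del_sample('M0040')
copy_ds.num_samples   # one fewer than dataset.num_samples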

Iteration

Thanks to the built-in iteration, we can easily iterate over all the samples:

for sample, features in dataset:
    print("{} : {:>10} : {}".format(sample, dataset.classes[sample], features))
C0562 :      Cntrl : [ 0.64  0.59  0.01  0.8 ]
C0408 :      Cntrl : [ 0.49  0.38  0.05  0.82]
C0760 :      Cntrl : [ 0.12  0.51  0.95  0.23]
C0170 :      Cntrl : [ 0.37  0.78  0.5   0.79]
C0241 :      Cntrl : [ 0.11  0.18  0.58  0.36]
C0980 :      Cntrl : [ 0.1   0.52  0.79  0.68]
C0822 :      Cntrl : [ 0.44  0.97  0.06  0.99]
C0565 :      Cntrl : [ 0.89  0.5   0.89  0.48]
C0949 :      Cntrl : [ 0.84  0.84  0.51  0.12]
C0041 :      Cntrl : [ 0.07  0.19  0.68  0.81]
C0372 :      Cntrl : [ 0.7   0.05  0.67  0.39]
C0141 :      Cntrl : [ 0.46  0.18  0.69  0.17]
C0492 :      Cntrl : [ 0.82  0.77  0.07  0.69]
C0064 :      Cntrl : [ 0.24  0.54  0.36  0.37]
C0557 :      Cntrl : [ 0.59  0.86  0.1   0.42]
A0034 :      Alzmr : [ 0.35  0.96  0.41  0.93]
A0768 :      Alzmr : [ 0.65  0.37  0.7   0.24]
A0240 :      Alzmr : [ 0.87  0.78  0.1   0.28]
A0042 :      Alzmr : [ 0.12  0.3   0.35  0.7 ]
A0141 :      Alzmr : [ 0.85  0.28  0.06  0.74]
A0888 :      Alzmr : [ 0.85  0.78  0.93  0.7 ]
A0032 :      Alzmr : [ 0.28  0.41  0.61  0.09]
A0596 :      Alzmr : [ 0.28  0.15  0.88  0.23]
A0969 :      Alzmr : [ 0.47  0.37  0.52  0.58]
A0215 :      Alzmr : [ 0.49  0.7   0.31  0.96]
A0074 :      Alzmr : [ 0.87  0.7   0.37  0.7 ]
A0229 :      Alzmr : [ 0.96  0.34  0.59  0.96]
M0760 :        MCI : [ 0.27  0.22  0.37  0.14]
M0434 :        MCI : [ 0.26  0.04  0.49  0.92]
M0033 :        MCI : [ 0.14  0.39  0.71  0.5 ]
M0942 :        MCI : [ 0.19  0.29  0.42  0.46]
M0034 :        MCI : [ 0.36  0.54  0.67  0.71]
M0868 :        MCI : [ 0.29  0.46  0.47  0.83]
M0595 :        MCI : [ 0.62  0.07  0.66  0.75]
M0476 :        MCI : [ 0.73  0.97  0.59  0.24]
M0770 :        MCI : [ 0.81  0.78  0.28  0.61]
M0577 :        MCI : [ 0.84  0.86  0.94  0.5 ]
M0638 :        MCI : [ 0.61  0.64  0.94  0.94]
M0421 :        MCI : [ 0.73  0.16  0.97  0.69]
M0006 :        MCI : [ 0.76  0.62  0.49  0.03]
M0552 :        MCI : [ 0.26  0.85  0.13  0.31]
M0040 :        MCI : [ 0.27  0.52  0.61  0.49]
M0165 :        MCI : [ 0.03  0.79  0.92  0.79]
M0256 :        MCI : [ 0.06  0.06  0.69  0.97]
M0127 :        MCI : [ 0.42  0.11  0.93  0.5 ]

Did you see that? It’s so intuitive and natural - such a clean traversal of the dataset.

Thanks to the choice of OrderedDict to represent the data, classes and labels underneath, the order of sample addition is retained. Hence the correspondence across samples in the dataset holds not only key-wise (by the sample id), but also index-wise.
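
A quick sanity check of this property (a sketch, assuming the same session as above): the ids returned by data_and_labels come back in the same order as sample_ids.

# the order of addition is preserved across accessors
data_matrix, labels, ids_in_matrix = dataset.data_and_labels()
assert list(ids_in_matrix) == list(dataset.sample_ids)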

Subject-wise transform

Quite often, we are interested in computing some statistics on the data for a given subject (such as the mean, or an ROI-wise median). Typically this requires a loop, with some computation, and organizing the results in a new dataset - a simple, routine pattern of usage, but one that can’t be avoided if you are still representing your dataset in medieval matrices! :)

If you organized your dataset in a pyradigm, such a computation is trivial, thanks to the built-in transform method. The mean value for each subject can be computed and organized in a new dataset with a single, intuitive line:

mean_data = dataset.transform(np.mean)
mean_data.description = 'mean values per subject'
mean_data
mean values per subject
45 samples, 3 classes, 1 features.
Class Cntrl : 15 samples.
Class Alzmr : 12 samples.
Class   MCI : 18 samples.

As transform accepts an arbitrary callable, we could do many more sophisticated things, such as accessing a subset of features, e.g. cortical thickness for a particular region of interest (say, the posterior cingulate gyrus).

# let's make a toy function to return the indices for the ROI
def get_ROI_indices(x): return x[:3]

Using this “mask” function, we can easily obtain the features for the ROI:

pcg = dataset.transform(get_ROI_indices)

We can verify that the new dataset does indeed have only 3 features, for the same subjects/classes:

pcg
None
ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
45 samples, 3 classes, 3 features.
Class Cntrl : 15 samples.
Class Alzmr : 12 samples.
Class   MCI : 18 samples.
pcg.num_features
3

Let’s make a quick histogram of the just-computed values:

data, lbl, keys = pcg.data_and_labels()
n, bins, patches = plt.hist(data)
(figure: histogram of the ROI feature values)

Remember, as the original source of the data was random, this plot has no units, properties or meaning!

Subset selection

In addition to the structured way of obtaining the various properties of this dataset, this implementation really comes in handy when you have to slice and dice the dataset (with a large number of classes and features) into smaller subsets (e.g. for binary classification). Let’s see how we can retrieve the data for a single class:

ctrl = dataset.get_class('Cntrl')

That’s it, obtaining the data for a given class is a simple call away.

Now let’s see what it looks like:

ctrl
 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
15 samples, 1 classes, 4 features.
Class Cntrl : 15 samples.

Note that the description was automatically updated to indicate its history. Let’s see some data from the controls:

ctrl.glance(2)
{'C0408': array([ 0.49,  0.38,  0.05,  0.82]),
 'C0562': array([ 0.64,  0.59,  0.01,  0.8 ])}
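
If you only need the ids rather than a full sub-dataset, the sample_ids_in_class method documented in the API Reference returns just the list of ids; a minimal sketch:

# ids belonging to a single class, without building a sub-dataset
ctrl_ids = dataset.sample_ids_in_class('Cntrl')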

We can also query a random subset of samples for manual inspection or cross-validation purposes. For example:

random_subset = dataset.random_subset(perc_in_class=0.3)
random_subset
 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
12 samples, 3 classes, 4 features.
Class Cntrl : 4 samples.
Class Alzmr : 3 samples.
Class   MCI : 5 samples.

You can see which samples were selected:

random_subset.sample_ids
['C0562',
 'C0565',
 'C0372',
 'C0492',
 'A0240',
 'A0032',
 'A0229',
 'M0034',
 'M0770',
 'M0552',
 'M0165',
 'M0127']

You can verify that it is indeed random by issuing another call:

# each call makes a fresh random selection
dataset.random_subset(perc_in_class=0.3).sample_ids
['C0562',
 'C0822',
 'C0949',
 'C0141',
 'A0034',
 'A0141',
 'A0032',
 'M0434',
 'M0942',
 'M0868',
 'M0421',
 'M0552']
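
If you only need the ids of a random subset (e.g. to pass on to get_subset later), the random_subset_ids and random_subset_ids_by_count methods from the API Reference return just the id lists; a short sketch:

# random ids only: by fraction per class, or by an exact count per class
ids_by_fraction = dataset.random_subset_ids(perc_per_class=0.3)
ids_by_count    = dataset.random_subset_ids_by_count(count_per_class=4)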

Let’s see how we can retrieve specific samples by their IDs (for which there are many use cases):

data = dataset.get_subset(dataset.sample_ids[1:20])
data
 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
19 samples, 2 classes, 4 features.
Class Cntrl : 14 samples.
Class Alzmr : 5 samples.

So as simple as that.

Cross-validation

If you would like to develop a variant of cross-validation and need a random split of the dataset into training and test sets, it is as simple as:

train_set, test_set = dataset.train_test_split_ids( train_perc = 0.5)

This method returns two sets of sample ids: train_set (containing 50% of the samples from each class in the dataset) and test_set (containing the rest). Let’s see what they hold:

train_set, test_set
(['C0760',
  'C0822',
  'C0565',
  'C0170',
  'C0562',
  'C0141',
  'C0041',
  'A0768',
  'A0888',
  'A0032',
  'A0969',
  'A0141',
  'A0034',
  'M0434',
  'M0421',
  'M0577',
  'M0256',
  'M0127',
  'M0033',
  'M0760',
  'M0476',
  'M0165'],
 ['M0040',
  'A0240',
  'C0241',
  'C0492',
  'A0074',
  'A0042',
  'M0942',
  'M0595',
  'M0006',
  'C0372',
  'C0064',
  'C0557',
  'M0552',
  'M0034',
  'C0408',
  'C0980',
  'A0229',
  'C0949',
  'A0596',
  'M0770',
  'A0215',
  'M0868',
  'M0638'])

We can also get a train/test split by specifying an exact number of subjects we would like from each class (e.g. when you would like to avoid class imbalance in the training set):

train_set, test_set = dataset.train_test_split_ids( count_per_class = 3)

Let’s see what the training set contains - we expect 3 x 3 = 9 subjects:

train_set
['C0557',
 'C0041',
 'C0949',
 'A0768',
 'A0888',
 'A0229',
 'M0165',
 'M0476',
 'M0040']

We can indeed verify that this is the case, by creating a new, smaller dataset from that list of ids and getting a summary:

training_dataset = dataset.get_subset(train_set)
training_dataset
 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
9 samples, 3 classes, 4 features.
Class Cntrl : 3 samples.
Class Alzmr : 3 samples.
Class   MCI : 3 samples.

Another programmatic way to look into different classes is this:

class_set, label_set, class_sizes = training_dataset.summarize_classes()
class_set, label_set, class_sizes
(['MCI', 'Cntrl', 'Alzmr'], [2, 0, 1], array([ 3.,  3.,  3.]))

which returns all the classes that you could iterate over.

Using these lists of classes, we can easily obtain subsets of the dataset, as illustrated below.

dataset
ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
45 samples, 3 classes, 4 features.
Class Cntrl : 15 samples.
Class Alzmr : 12 samples.
Class   MCI : 18 samples.
binary_dataset = dataset.get_class(['Cntrl','Alzmr'])
binary_dataset
 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
27 samples, 2 classes, 4 features.
Class Cntrl : 15 samples.
Class Alzmr : 12 samples.
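
More generally, you could loop over class_set to build every pairwise (binary) subset; a small sketch using the same session as above:

from itertools import combinations

# build each pairwise (binary) classification subset
for class_a, class_b in combinations(dataset.class_set, 2):
    pair_ds = dataset.get_class([class_a, class_b])
    print('{} vs {} : {} samples'.format(class_a, class_b, pair_ds.num_samples))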

How about selecting a subset of features from all samples?

binary_dataset.get_feature_subset(range(2))
Subset features derived from:

 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
27 samples, 2 classes, 2 features.
Class Cntrl : 15 samples.
Class Alzmr : 12 samples.

Great, isn’t it? You can also see the two-stage history (the initial subset by class, followed by a subset of features).

Serialization

Once you have this dataset, you can save and reload it trivially using your favourite serialization module. Let’s do some pickling:

out_file = os.path.join(work_dir,'binary_dataset_Ctrl_Alzr_Freesurfer_thickness_v4p3.MLDataset.pkl')
binary_dataset.save(out_file)

That’s it - it is saved.

Let’s reload it from disk and make sure we can indeed retrieve it:

reloaded = MLDataset(filepath=out_file) # another form of the constructor!
reloaded
 Subset derived from: ADNI1 baseline: cortical thickness features from Freesurfer v4.3, QCed.
27 samples, 2 classes, 4 features.
Class Cntrl : 15 samples.
Class Alzmr : 12 samples.

We can check to see they are indeed one and the same:

binary_dataset == reloaded
True

Dataset Arithmetic

You might wonder how you can combine two different types of features (e.g. thickness and shape) for the same dataset. Piece of cake - see below.

To concatenate two datasets, we first make a second dataset:

dataset_two = MLDataset(in_dataset=dataset) # yet another constructor: in its copy form!

How can you check whether they are “functionally identical”? As in, the same keys, and the same data and classes for each key... Easy:

dataset_two == dataset
True

Now let’s try the arithmetic:

combined = dataset + dataset_two
Identical keys found. Trying to horizontally concatenate features for each sample.

Great. The addition operator recognized the identical set of keys and performed a horizontal concatenation, as can be seen from the doubled number of features in the combined dataset:

combined
45 samples, 3 classes, 8 features.
Class Cntrl : 15 samples.
Class Alzmr : 12 samples.
Class   MCI : 18 samples.
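
If instead you want to stack samples vertically (new subjects with the same feature dimensionality), the extend method from the API Reference adds the samples of another dataset; a minimal sketch with a hypothetical extra subject, assuming extend modifies the dataset in place:

# vertical combination: add a new subject with the same dimensionality (4 features here)
more_subjects = MLDataset()
more_subjects.add_sample('X0001', np.random.random(4), 0, 'Cntrl')

grown = MLDataset(in_dataset=dataset)   # work on a copy
grown.extend(more_subjects)
grown.num_samples                       # one more than the original 45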

We can also do some removal in similar fashion:

smaller = combined - dataset
C0562 removed.
C0408 removed.
C0760 removed.
C0170 removed.
C0241 removed.
C0980 removed.
C0822 removed.
C0565 removed.
C0949 removed.
C0041 removed.
C0372 removed.
C0141 removed.
C0492 removed.
C0064 removed.
C0557 removed.
A0034 removed.
A0768 removed.
A0240 removed.
A0042 removed.
A0141 removed.
A0888 removed.
A0032 removed.
A0596 removed.
A0969 removed.
A0215 removed.
A0074 removed.
A0229 removed.
M0760 removed.
M0434 removed.
M0033 removed.
M0942 removed.
M0034 removed.
M0868 removed.
M0595 removed.
M0476 removed.
M0770 removed.
M0577 removed.
M0638 removed.
M0421 removed.
M0006 removed.
M0552 removed.
M0040 removed.
M0165 removed.
M0256 removed.
M0127 removed.
/Users/Reddy/dev/pyradigm/pyradigm/pyradigm.py:1169: UserWarning: Requested removal of all the samples - output dataset would be empty.
  warnings.warn('Requested removal of all the samples - output dataset would be empty.')

The data structure even produces a warning to let you know that the resulting output would be empty! We can verify that:

bool(smaller)
False

Portability

This is all well and good, you might say, but how does it interact with other packages out there? It is as simple as you can imagine:

from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
data_matrix, target, sample_ids = binary_dataset.data_and_labels()
clf.fit(data_matrix, target)
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
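
Putting the earlier pieces together, a minimal hold-out evaluation could look like the following sketch (using only the methods shown above; the exact accuracy will differ for your data):

# split ids per class, build the two sub-datasets, and evaluate with sklearn
train_ids, test_ids = binary_dataset.train_test_split_ids(train_perc=0.75)
train_ds = binary_dataset.get_subset(train_ids)
test_ds  = binary_dataset.get_subset(test_ids)

X_train, y_train, _ = train_ds.data_and_labels()
X_test,  y_test,  _ = test_ds.data_and_labels()

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
print('hold-out accuracy: {:.2f}'.format(clf.score(X_test, y_test)))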

There you have it, a simple example to show you the utility and convenience of this dataset.

Thanks for checking it out.

I would appreciate it if you could give me feedback on improving or sharpening it further.

API Reference

A tutorial-like presentation is available at Usage examples, using the following API.

class pyradigm.MLDataset(filepath=None, in_dataset=None, arff_path=None, data=None, labels=None, classes=None, description='', feature_names=None, encode_nonnumeric=False)[source]

Bases: object

An ML dataset to ease workflow and maintain integrity.

add_classes(classes)[source]

Helper to rename the classes, if provided as a dict keyed by the original keys (sample IDs).

classes : dict
Dict of class names keyed by sample IDs.
TypeError
If classes is not a dict.
ValueError
If not all samples in the dataset are present in the input dict, or if one of the samples in the input is not recognized.
add_sample(sample_id, features, label, class_id=None, overwrite=False, feature_names=None)[source]

Adds a new sample to the dataset with its features, label and class ID.

This is the preferred way to construct the dataset.

sample_id : str, int
The identifier that uniquely identifies this sample.
features : list, ndarray
The features for this sample
label : int, str
The label for this sample
class_id : int, str
The class for this sample. If not provided, the label converted to a string becomes its ID.
overwrite : bool
If True, allows overwriting the features of an existing subject ID. Default : False.
feature_names : list
The names for each feature. Assumed to be in the same order as features.
ValueError
If sample_id is already in the MLDataset (and overwrite=False), or if the dimensionality of the current sample does not match the existing samples, or if feature_names do not match the existing names.
TypeError
If sample to be added is of different data type compared to existing samples.
classmethod check_features(features)[source]

Method to ensure data to be added is not empty and vectorized.

features : iterable
Any data that can be converted to a numpy array.
features : numpy array
Flattened non-empty numpy array.
ValueError
If input data is empty.
class_set

Set of unique classes in the dataset.

class_sizes

Returns the sizes of the different classes in a Counter object.

classes

Identifiers (sample IDs, or sample names etc) forming the basis of dict-type MLDataset.

data

data in its original dict form.

data_and_labels()[source]

Dataset features and labels in a matrix form for learning.

Also returns sample_ids in the same order.

data_matrix : ndarray
2D array of shape [num_samples, num_features] with features corresponding row-wise to sample_ids
labels : ndarray
Array of numeric labels for each sample corresponding row-wise to sample_ids
sample_ids : list
List of sample ids
del_sample(sample_id)[source]

Method to remove a sample from the dataset.

sample_id : str
sample id to be removed.
UserWarning
If sample id to delete was not found in the dataset.
description

Text description (header) that can be set by user.

dtype

Data type of the features in the dataset.

extend(other)[source]

Method to extend the dataset vertically (add samples from another dataset).

other : MLDataset
second dataset to be combined with the current (different samples, but same dimensionality)
TypeError
if input is not an MLDataset.
feature_names

Returns the feature names as a numpy array of strings.

get(item, not_found_value=None)[source]

Method like dict.get(), which can return a specified value if the key is not found.

get_class(class_id)[source]

Returns a smaller dataset belonging to the requested classes.

class_id : str
identifier of the class to be returned.
MLDataset
With subset of samples belonging to the given class.
ValueError
If one or more of the requested classes do not exist in this dataset. If the specified id is empty or None
get_feature_subset(subset_idx)[source]

Returns the subset of features indexed numerically.

subset_idx : list, ndarray
List of indices to features to be returned
MLDataset : MLDataset
with subset of features requested.
UnboundLocalError
If input indices are out of bounds for the dataset.
get_subset(subset_ids)[source]

Returns a smaller dataset identified by their keys/sample IDs.

subset_ids : list
List of sample IDs to be extracted from the dataset.
sub-dataset : MLDataset
sub-dataset containing only requested sample IDs.
glance(nitems=5)[source]

Quick and partial glance of the data matrix.

nitems : int
Number of items to glance from the dataset. Default : 5

dict
A small subset of the data, keyed by sample id.

keys

Sample identifiers (strings) forming the basis of MLDataset (same as sample_ids)

static keys_with_value(dictionary, value)[source]

Returns a subset of keys from the dict with the value supplied.
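
A minimal illustration of the intent (the exact return type is an assumption based on the docstring):

# a hypothetical dict mapping sample ids to class names
membership = {'C0001': 'Cntrl', 'A0001': 'Alzmr', 'C0002': 'Cntrl'}
MLDataset.keys_with_value(membership, 'Cntrl')   # expected to yield the keys 'C0001' and 'C0002'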

label_set

Set of labels in the dataset corresponding to class_set.

labels

Returns the array of labels for all the samples.

num_classes

Total number of classes in the dataset.

num_features

number of features in each sample.

num_samples

number of samples in the entire dataset.

random_subset(perc_in_class=0.5)[source]

Returns a random sub-dataset (of specified size by percentage) within each class.

perc_in_class : float
Fraction of samples to be taken from each class.
subdataset : MLDataset
random sub-dataset of specified size.
random_subset_ids(perc_per_class=0.5)[source]

Returns a random subset of sample ids (of specified size by percentage) within each class.

perc_per_class : float
Fraction of samples per class
subset : list
Combined list of sample ids from all classes.
ValueError
If no subjects from one or more classes were selected.
UserWarning
If an empty or full dataset is requested.
random_subset_ids_by_count(count_per_class=1)[source]
Returns a random subset of sample ids of specified size by count,
within each class.
count_per_class : int
Exact number of samples per each class.
subset : list
Combined list of sample ids from all classes.
sample_ids

Sample identifiers (strings) forming the basis of MLDataset (same as keys).

sample_ids_in_class(class_id)[source]

Returns a list of sample ids belonging to a given class.

class_id : str
class id to query.
subset_ids : list
List of sample ids belonging to a given class.
save(file_path)[source]

Method to save the dataset to disk.

file_path : str
File path to save the current dataset to
IOError
If saving to disk is not successful.
summarize_classes()[source]

Summary of classes: names, numeric labels and sizes

tuple : class_set, label_set, class_sizes

class_set : list
List of names of all the classes
label_set : list
Label for each class in class_set
class_sizes : list
Size of each class (number of samples)
train_test_split_ids(train_perc=None, count_per_class=None)[source]

Returns two disjoint sets of sample ids for use in cross-validation.

Offers two ways to specify the sizes: fraction or count. Only one access method can be used at a time.

train_perc : float
fraction of samples from each class to build the training subset.
count_per_class : int
exact count of samples from each class to build the training subset.
train_set : list
List of ids in the training set.
test_set : list
List of ids in the test set.
ValueError
If the fraction is outside the open interval (0, 1), or if the requested count is larger than the size of the smallest class, or if an unrecognized format is provided for the input args, or if the selection results in empty subsets for either the train or test set.
transform(func, func_description=None)[source]
Applies a given function to the features of each subject
and returns a new dataset with other info unchanged.
func : callable

A valid callable that takes in a single ndarray and returns a single ndarray. Ensure the transformed dimensionality is the same for all subjects.

If your function requires more than one argument, use functools.partial to freeze all the arguments except the features for the subject.

func_description : str, optional
Human readable description of the given function.
xfm_ds : MLDataset
with features obtained from subject-wise transform
TypeError
If given func is not a callable
ValueError
If transformation of any of the subjects features raises an exception.

Simple:

from pyradigm import MLDataset

thickness = MLDataset(filepath='ADNI_thickness.csv')
pcg_thickness = thickness.transform(func=get_pcg, func_description='applying ROI mask for PCG')
pcg_median = pcg_thickness.transform(func=np.median, func_description='median per subject')

Complex example with function taking more than one argument:

from pyradigm import MLDataset
from functools import partial
import hiwenet

thickness = MLDataset(filepath='ADNI_thickness.csv')
roi_membership = read_roi_membership()
hw = partial(hiwenet, groups = roi_membership)

thickness_hiwenet = thickness.transform(func=hw, func_description='histogram weighted networks')
median_thk_hiwenet = thickness_hiwenet.transform(func=np.median, func_description='median per subject')
pyradigm.cli_run()[source]

Command line interface

This is the command line interface

  • to display basic info about datasets without having to code
  • to perform basic arithmetic (add multiple classes or feature sets)
