Measuring Feature Selection Performance

This module can be used to evaluate feature selection methods via K-fold cross validation.

class picturedrocks.performance.FoldTester(adata)

Performs K-fold Cross Validation for Marker Selection

FoldTester can be used to evaluate various marker selection algorithms. It can split the data into K folds, run marker selection algorithms on these folds, and classify the cells in each fold using the remaining folds as training data.

Parameters: adata (anndata.AnnData) – data to slice into folds

Classify each cell using training data from other folds

For each fold, we project the cells in that fold onto the markers selected for the fold and treat the result as test data. We project the complement of the fold (all cells outside it) onto the same markers and treat that as training data.

Parameters: classifier – a classifier that trains on a training data set and predicts the labels of test data. See NearestCentroidClassifier for an example.
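
The train/test split described above can be sketched with plain NumPy. This is an illustration only, not the library's implementation: the helper name split_and_project and the toy matrix are hypothetical, and "projecting onto the markers" is assumed to mean keeping only the marker columns.

```python
import numpy as np

def split_and_project(X, fold, markers):
    """Return (train, test) matrices restricted to the marker columns.

    X       -- (cells, genes) expression matrix
    fold    -- indices of the cells in the held-out fold
    markers -- column indices of the markers selected for this fold
    """
    in_fold = np.zeros(X.shape[0], dtype=bool)
    in_fold[fold] = True
    test = X[in_fold][:, markers]     # cells in the fold
    train = X[~in_fold][:, markers]   # complement of the fold
    return train, test

# Toy data: 6 cells, 4 genes; fold holds cells 0 and 1.
X = np.arange(24).reshape(6, 4)
train, test = split_and_project(X, fold=np.array([0, 1]), markers=[1, 3])
```

Both matrices have one column per marker, so a classifier trained on train can be applied directly to test.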

Load folds from a file

The file can be one saved either by FoldTester.savefolds() or FoldTester.savefoldsandmarkers(). In the latter case, it will not load any markers.


Load folds and markers

Loads a file of folds and markers saved by FoldTester.savefoldsandmarkers()

Parameters: file (str) – filename to load from (typically with a .npz extension)

makefolds(k=5, random=False)

Makes folds

Parameters:
  • k (int) – the value of K (the number of folds)
  • random (bool) – If true, makefolds assigns cells to folds randomly. Otherwise, the folds are made in order (i.e., the first ceil(N / k) cells go in the first fold, etc., where N is the number of cells)

Save folds to a file

Parameters: file (str) – filename to save to (typically with a .npz extension)
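
For readers unfamiliar with .npz files, the following is a minimal sketch of storing a list of folds with NumPy and reading it back. The functions save_folds and load_folds and the key names fold_0, fold_1, ... are hypothetical; the layout that FoldTester actually writes is not documented here and may differ.

```python
import os
import tempfile
import numpy as np

def save_folds(file, folds):
    # One named array per fold; the key names are illustrative only.
    np.savez(file, **{f"fold_{i}": np.asarray(f) for i, f in enumerate(folds)})

def load_folds(file):
    # Read the arrays back in fold order.
    with np.load(file) as data:
        return [data[f"fold_{i}"] for i in range(len(data.files))]

folds = [np.array([0, 1, 2]), np.array([3, 4])]
path = os.path.join(tempfile.mkdtemp(), "folds.npz")
save_folds(path, folds)
restored = load_folds(path)
```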

Save folds and markers for each fold

Saves the folds and, for each fold, the markers previously found by FoldTester.selectmarkers().

Parameters: file (str) – filename to save to (typically with a .npz extension)

Perform a marker selection algorithm on each fold

Parameters: select_function (function) – a function that takes an AnnData object and returns a list of marker genes, given by their indices
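
A function of the shape described above might look like the following sketch, which picks the genes with the highest variance. It is an illustration, not a recommended selection method: the name top_variance_markers and the n_markers parameter are hypothetical, and adata.X is assumed to be a dense (cells, genes) array. A SimpleNamespace stands in for an AnnData object so the example runs without anndata installed.

```python
from types import SimpleNamespace
import numpy as np

def top_variance_markers(adata, n_markers=2):
    """Return the column indices of the n_markers highest-variance genes."""
    X = np.asarray(adata.X)               # assumes a dense matrix
    variances = X.var(axis=0)
    order = np.argsort(variances)[::-1]   # highest variance first
    return [int(i) for i in order[:n_markers]]

# Stand-in for an AnnData object: only the .X attribute is used.
fake_adata = SimpleNamespace(X=np.array([[0.0, 1.0, 5.0],
                                         [0.0, 3.0, -5.0],
                                         [0.0, 2.0, 0.0]]))
markers = top_variance_markers(fake_adata)
```

Because n_markers has a default value, the function can be called with a single AnnData argument, which is the signature the parameter description asks for.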

Ensure that all observations are in exactly one fold

Return type: bool
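
The condition being checked can be sketched as follows: count how many folds contain each observation and require every count to be exactly one. The function name folds_partition is hypothetical and this is not the library's code.

```python
import numpy as np

def folds_partition(folds, n_obs):
    """Return True if every observation 0..n_obs-1 is in exactly one fold."""
    counts = np.zeros(n_obs, dtype=int)
    for fold in folds:
        # np.add.at counts repeated indices correctly.
        np.add.at(counts, np.asarray(fold, dtype=int), 1)
    return bool(np.all(counts == 1))

ok = folds_partition([[0, 1], [2, 3]], n_obs=4)
overlapping = folds_partition([[0, 1], [1, 2, 3]], n_obs=4)
missing = folds_partition([[0, 1], [3]], n_obs=4)
```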

class picturedrocks.performance.NearestCentroidClassifier

Nearest Centroid Classifier for Cross Validation

Computes the centroid of each cluster label in the training data, then predicts the label of each test data point by finding the nearest centroid.
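
The idea can be sketched in a few lines of NumPy. This is not the class's actual code: the class name NearestCentroidSketch and the method names fit and predict are hypothetical, and Euclidean distance is assumed as the notion of "nearest".

```python
import numpy as np

class NearestCentroidSketch:
    def fit(self, X, y):
        y = np.asarray(y).ravel()
        self.labels_ = np.unique(y)
        # One centroid (mean row) per cluster label.
        self.centroids_ = np.stack(
            [X[y == label].mean(axis=0) for label in self.labels_]
        )
        return self

    def predict(self, X):
        # Squared Euclidean distance from each test point to each centroid.
        diffs = X[:, None, :] - self.centroids_[None, :, :]
        dists = (diffs ** 2).sum(axis=2)
        return self.labels_[dists.argmin(axis=1)]

X_train = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
y_train = np.array([0, 0, 1, 1])
clf = NearestCentroidSketch().fit(X_train, y_train)
predicted = clf.predict(np.array([[1.0, 1.0], [9.0, 9.0]]))
```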

class picturedrocks.performance.PerformanceReport(y, yhat)

Report actual vs predicted statistics

Parameters:
  • y (numpy.ndarray) – actual cluster labels, (N, 1)-shaped numpy array
  • yhat (numpy.ndarray) – predicted cluster labels, (N, 1)-shaped numpy array

Compute and make a confusion matrix figure

Returns: confusion matrix
Return type: plotly figure

Get the confusion matrix for the latest run

Returns: array of shape (K, K), where K is the number of clusters, with the [i, j] entry being the fraction of cells in cluster i that were predicted to be in cluster j
Return type: numpy.ndarray
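
A row-normalized confusion matrix of this kind can be computed as in the sketch below. The function name confusion_fractions is hypothetical, and the labels are assumed to be integers in the range 0 to K - 1; this is not the library's implementation.

```python
import numpy as np

def confusion_fractions(y, yhat, n_clusters):
    """Entry [i, j] is the fraction of cells in cluster i predicted as j."""
    y = np.asarray(y).ravel()
    yhat = np.asarray(yhat).ravel()
    counts = np.zeros((n_clusters, n_clusters))
    np.add.at(counts, (y, yhat), 1)
    row_totals = counts.sum(axis=1, keepdims=True)
    # Leave rows of empty clusters as zeros instead of dividing by zero.
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts), where=row_totals > 0)

y = np.array([[0], [0], [1], [1]])
yhat = np.array([[0], [1], [1], [1]])
cm = confusion_fractions(y, yhat, n_clusters=2)
```

Each row of a non-empty cluster sums to 1, so the diagonal gives the per-cluster accuracy.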

Print a message with the score


Print a full report

This uses plotly's iplot, so it should only be run in a Jupyter notebook, after init_notebook_mode has been called.


Returns the number of cells misclassified.
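
As a sketch of the count itself (the function name count_wrong is hypothetical): with (N, 1)-shaped label arrays it is safest to flatten both inputs first, since comparing an (N, 1) array against an (N,) array would broadcast to an (N, N) result.

```python
import numpy as np

def count_wrong(y, yhat):
    """Number of cells whose predicted label differs from the actual one."""
    y = np.asarray(y).ravel()
    yhat = np.asarray(yhat).ravel()
    return int(np.sum(y != yhat))

n_wrong = count_wrong(np.array([[0], [1], [2]]), np.array([[0], [2], [2]]))
```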

picturedrocks.performance.kfoldindices(n, k, random=False)

Generate indices for k-fold cross validation

Parameters:
  • n (int) – number of observations
  • k (int) – number of folds
  • random (bool) – determines whether to randomize the order

numpy.ndarray – array of indices in each fold
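
A minimal sketch of this kind of index generation, assuming the folds should be as equal in size as possible with the larger folds first. The name fold_indices_sketch and the seed parameter are hypothetical, and the sketch returns a list of arrays, which may not match how kfoldindices delivers its results.

```python
import numpy as np

def fold_indices_sketch(n, k, random=False, seed=None):
    """Split the indices 0..n-1 into k folds of near-equal size."""
    if random:
        indices = np.random.default_rng(seed).permutation(n)
    else:
        indices = np.arange(n)
    # array_split gives the first n % k folds one extra element.
    return np.array_split(indices, k)

ordered = fold_indices_sketch(10, 3)
shuffled = fold_indices_sketch(10, 3, random=True, seed=0)
```

With n = 10 and k = 3 the in-order folds have sizes 4, 3 and 3, so the first fold holds the first ceil(10 / 3) = 4 indices.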