Selecting Markers

PicturedRocks currently implements two categories of marker selection algorithms:
  • mutual information-based algorithms
  • 1-bit compressed sensing based algorithms

Mutual information

TODO: Explanation of how these work goes here.

Before running any mutual information-based algorithm, we need a discretized version of the gene expression matrix with a limited number of discrete values (because we do not make any assumptions about the distribution of gene expression). Such data is stored in a picturedrocks.markers.InformationSet object, but by default we suggest using picturedrocks.markers.makeinfoset() to generate such an object after appropriate normalization.
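To make the role of discretization concrete, here is a minimal sketch of how entropy and mutual information can be estimated directly from counts of discrete values, with no distributional assumptions. The function names and the toy data are hypothetical and are not part of the PicturedRocks API; the library's own estimators may differ.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical (plug-in) Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Empirical I(X; Y) = H(X) + H(Y) - H(X, Y) for two equal-length sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Hypothetical toy data: one discretized gene and one cluster label per cell.
gene = [0, 0, 1, 1, 2, 2, 0, 1]
label = [0, 0, 1, 1, 1, 1, 0, 1]

# Here the label is fully determined by the gene, so I(gene; label) = H(label).
mi = mutual_information(gene, label)
```

With only a few distinct values per gene, every joint outcome is observed often enough for such count-based estimates to be meaningful, which is one reason to limit the number of discrete levels.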

Iterative Feature Selection

All information-theoretic feature selection methods in PicturedRocks are greedy algorithms. In general, they implement the abstract class IterativeFeatureSelection. See Supervised Feature Selection and Unsupervised Feature Selection for specific algorithms.

class picturedrocks.markers.mutualinformation.iterative.IterativeFeatureSelection(infoset)

Abstract Class for Iterative Feature Selection

add(ind)

Select specified feature

Parameters: ind (int) – Index of feature to select
autoselect(n_feats)

Auto select features

This automatically selects n_feats features greedily by selecting the feature with the highest score at each iteration.

Parameters: n_feats (int) – The number of features to select
remove(ind)

Remove specified feature

Parameters: ind (int) – Index of feature to remove
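The greedy behavior described for autoselect can be sketched as follows. This is an illustration only: greedy_select, joint_entropy, and mim_score are hypothetical helpers, not PicturedRocks functions, and the score used here is the textbook mutual information between a gene and the label, which is assumed (not confirmed by this page) to correspond to the MIM criterion.

```python
from collections import Counter
from math import log2


def joint_entropy(rows, cols):
    """Empirical joint entropy (bits) of the given columns of a list of rows."""
    n = len(rows)
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def greedy_select(n_candidates, n_feats, score):
    """Select up to n_feats indices, adding the top-scoring unselected index each round.

    `score(i, selected)` is a hypothetical callable giving the score of
    candidate i given the list of indices selected so far.
    """
    selected = []
    for _ in range(min(n_feats, n_candidates)):
        remaining = [i for i in range(n_candidates) if i not in selected]
        best = max(remaining, key=lambda i: score(i, selected))
        selected.append(best)
    return selected


# Toy data: columns 0-2 are discretized genes, column 3 is the cluster label.
rows = [
    (0, 0, 1, 0),
    (0, 1, 0, 0),
    (1, 0, 1, 1),
    (1, 1, 0, 1),
    (2, 0, 0, 2),
    (2, 1, 1, 2),
]
Y = 3  # index of the label column (an assumption of this sketch)


def mim_score(i, selected):
    # I(x_i; y) = H(x_i) + H(y) - H(x_i, y); this score ignores `selected`.
    return (
        joint_entropy(rows, [i])
        + joint_entropy(rows, [Y])
        - joint_entropy(rows, [i, Y])
    )


chosen = greedy_select(n_candidates=3, n_feats=2, score=mim_score)
```

Gene 0 matches the label exactly in this toy table, so it is picked first. Criteria that depend on the already selected set only need a different score callable; the loop itself stays the same.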

Supervised Feature Selection

class picturedrocks.markers.mutualinformation.iterative.MIM(infoset)
class picturedrocks.markers.mutualinformation.iterative.CIFE(infoset)
class picturedrocks.markers.mutualinformation.iterative.JMI(infoset)

Unsupervised Feature Selection

class picturedrocks.markers.mutualinformation.iterative.UniEntropy(infoset)
class picturedrocks.markers.mutualinformation.iterative.CIFEUnsup(infoset)

Auxiliary Classes and Methods

class picturedrocks.markers.InformationSet(X, has_y=False)

Stores discrete gene expression matrix

Parameters:
  • X (numpy.ndarray) – a (num_obs, num_vars) shape array with dtype int
  • has_y (bool) – whether the array X has a target label column (a y column) as its last column
class picturedrocks.markers.SparseInformationSet(X, y=None)

Stores sparse discrete gene expression matrix

Parameters:
  • X (scipy.sparse.csc_matrix) – a (num_obs, num_vars) shape matrix with dtype int
  • has_y (bool) – whether the array X has a target label column (a y column) as its last column
picturedrocks.markers.makeinfoset(adata, include_y, k=5)

Discretize data and make a SparseInformationSet object

Parameters:
  • adata (anndata.AnnData) – The data to discretize. By default data is discretized as round(log2(X + 1)).
  • include_y (bool) – Determines if the y (cluster label) column is included in the InformationSet object
Returns:

An object that can be used to perform information theoretic calculations.

Return type:

SparseInformationSet
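The default discretization rule quoted above, round(log2(X + 1)), can be tried on a small example. This is a plain-Python sketch with hypothetical data and a hypothetical discretize helper; it does not call makeinfoset and does not produce a SparseInformationSet.

```python
from math import log2


def discretize(counts):
    """Map each non-negative value x to round(log2(x + 1))."""
    return [[round(log2(x + 1)) for x in row] for row in counts]


# Hypothetical normalized counts: 3 cells (rows) by 4 genes (columns).
counts = [
    [0, 1, 3, 100],
    [2, 0, 7, 15],
    [0, 0, 1, 1023],
]

discrete = discretize(counts)
# discrete == [[0, 1, 2, 7], [2, 0, 3, 4], [0, 0, 1, 10]]
```

Zeros stay zero, so sparsity is preserved, and counts spanning three orders of magnitude collapse to about ten levels.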

Interactive Marker Selection

class picturedrocks.markers.interactive.InteractiveMarkerSelection(adata, feature_selection, disp_genes=10, connected=True, show_cells=True, show_genes=True, dim_red='tsne')

Run an interactive marker selection GUI inside a Jupyter notebook

Parameters:
  • adata (anndata.AnnData) – The data to run marker selection on. If you want to restrict to a small number of genes, slice your anndata object.
  • feature_selection (picturedrocks.markers.mutualinformation.iterative.IterativeFeatureSelection) – An instance of an iterative feature selection algorithm class that corresponds to adata (i.e., the column indices in feature_selection should correspond to the column indices in adata)
  • disp_genes (int) – Number of genes to display as options. By default, the number of genes plotted on the tSNE plot is 3 * disp_genes, but this can be changed by setting the plot_genes property after initializing.
  • connected (bool) – Parameter to pass to plotly.offline.init_notebook_mode. If your browser does not have internet access, you should set this to False.
  • show_cells (bool) – Determines whether to display a tSNE plot of the cells with a drop-down menu to look at gene expression levels for candidate genes.
  • show_genes (bool) – Determines whether to display a tSNE plot of genes to visualize gene similarity
  • dim_red ({"tsne", "umap"}) – Dimensionality reduction algorithm

Warning

This class requires modules not explicitly listed as dependencies of picturedrocks. Specifically, please ensure that you have ipywidgets installed and that you use this class only inside a Jupyter notebook.

picturedrocks.markers.interactive.cife_obj(H, i, S)

The CIFE objective function for feature selection

Parameters:
  • H (function) – an entropy function, typically the bound method H on an instance of InformationSet. For example, if infoset is of type picturedrocks.markers.InformationSet, then pass infoset.H
  • i (int) – index of candidate gene
  • S (list) – list of features already selected
Returns:

the candidate feature’s score relative to the selected gene set S

Return type:

float
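As an illustration of how such an objective can be written in terms of an entropy function alone, here is a sketch of the textbook CIFE score, I(x_i; y) - sum over j in S of [I(x_i; x_j) - I(x_i; x_j | y)]. It is not the library's implementation: cife_score and make_entropy are hypothetical names, and the explicit y argument (index of the label column) is an assumption of this sketch that does not appear in the cife_obj signature.

```python
from collections import Counter
from math import log2


def make_entropy(rows):
    """Return H(cols), the empirical joint entropy (bits) of columns of `rows`."""
    n = len(rows)

    def H(cols):
        counts = Counter(tuple(row[c] for c in cols) for row in rows)
        return -sum((k / n) * log2(k / n) for k in counts.values())

    return H


def cife_score(H, i, S, y):
    """Textbook CIFE score of candidate i given selected features S and label column y."""
    score = H([i]) + H([y]) - H([i, y])  # relevance: I(x_i; y)
    for j in S:
        redundancy = H([i]) + H([j]) - H([i, j])  # I(x_i; x_j)
        conditional = H([i, y]) + H([j, y]) - H([i, j, y]) - H([y])  # I(x_i; x_j | y)
        score += conditional - redundancy
    return score


# Toy data: the label (column 3) is the XOR of genes 0 and 1; gene 2 duplicates gene 0.
rows = [(0, 0, 0, 0), (0, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0)]
H = make_entropy(rows)

complementary = cife_score(H, 1, [0], y=3)  # gene 1 complements gene 0
redundant = cife_score(H, 2, [0], y=3)  # gene 2 adds nothing beyond gene 0
```

In this toy table, gene 1 carries no information about the label on its own but determines it jointly with gene 0, so it scores one bit, while the duplicate gene 2 scores zero.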