# Selecting Markers

$\DeclareMathOperator{\argmax}{arg max}$

PicturedRocks can be used to implement numerous information-theoretic feature selection methods. In both the supervised and unsupervised cases, it is computationally intractable to optimize the true objective functions: we want to find a small set of genes $S$, with $|S| < n$, such that $I(S; y)$ is maximized in the supervised case and $H(S)$ is maximized in the unsupervised case. (For more information, see our paper.)

Because these objectives are computationally intractable to optimize directly, we have implemented the following approximations to $\argmax_{x_i} I(S \cup \{x_i\}; y)$ (the ideal supervised objective function):

• The Mutual Information Maximization (MIM) criterion is a univariate feature selection method: it does not consider interactions between variables and scores each feature in isolation.
$\displaystyle J_\text{mim}(x_i) = I(x_i; y)$
• The Joint Mutual Information (JMI) algorithm proposed by Yang and Moody uses a diminishing penalty on redundancy. Notably, it only penalizes relevant redundancy, which, as we explain in our paper, is desirable.
$\displaystyle J_\text{jmi}(x_i) = I(x_i; y) - \frac{1}{|S|} \sum_{x_j\in S} I(x_i; x_j; y)$
• The Conditional Infomax Feature Extraction (CIFE) algorithm, proposed by Lin and Tang (and independently by others), does not diminish its redundancy penalty (which it applies only to relevant redundancy, as desired). Although it can penalize redundancy somewhat aggressively, it is the easiest to analyze theoretically (see our paper).
$\displaystyle J_\text{cife}(x_i) = I(x_i; y) - \sum_{x_j\in S} I(x_i; x_j; y)$
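To make these objectives concrete, here is a minimal plug-in sketch of the MIM and CIFE scores for small discrete arrays. This is not the PicturedRocks implementation (which operates on `InformationSet` objects); it uses the identity $I(x_i; x_j; y) = I(x_i; x_j) - I(x_i; x_j \mid y)$ and empirical frequencies throughout:

```python
import numpy as np

def entropy(*cols):
    """Plug-in Shannon entropy (in bits) of the joint distribution
    of one or more discrete columns."""
    joint = np.stack(cols, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_info(a, b):
    # I(a; b) = H(a) + H(b) - H(a, b)
    return entropy(a) + entropy(b) - entropy(a, b)

def interaction_info(a, b, y):
    # I(a; b; y) = I(a; b) - I(a; b | y)
    cond_mi = entropy(a, y) + entropy(b, y) - entropy(a, b, y) - entropy(y)
    return mutual_info(a, b) - cond_mi

def mim_score(X, y, i):
    # J_mim(x_i) = I(x_i; y)
    return mutual_info(X[:, i], y)

def cife_score(X, y, S, i):
    # J_cife(x_i) = I(x_i; y) - sum over x_j in S of I(x_i; x_j; y)
    return mim_score(X, y, i) - sum(
        interaction_info(X[:, i], X[:, j], y) for j in S)
```

Note, for instance, that when $S = \{x_j\}$ and a candidate $x_i$ is identical to $x_j$, its CIFE score is zero: its relevance $I(x_i; y)$ is exactly cancelled by the redundancy penalty.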

## Mutual information

Before running any mutual-information-based algorithms, we need a discretized version of the gene expression matrix with a limited number of discrete values (because we make no assumptions about the distribution of gene expression). Such data is stored in an InformationSet object; by default, we suggest using makeinfoset() to generate one. The makeinfoset() function uses the recursive quantile transform quantile_discretize().

### Iterative Feature Selection

All information-theoretic feature selection methods in PicturedRocks are greedy algorithms and implement the abstract class IterativeFeatureSelection. See Supervised Feature Selection and Unsupervised Feature Selection below for specific algorithms.

class picturedrocks.markers.IterativeFeatureSelection(infoset)

Abstract class for iterative feature selection

add(ind)

Select the specified feature

Parameters: ind (int) – index of the feature to select

autoselect(n_feats)

Automatically select features

This greedily selects n_feats features, at each iteration picking the feature with the highest score.

Parameters: n_feats (int) – the number of features to select

remove(ind)

Remove the specified feature

Parameters: ind (int) – index of the feature to remove
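The greedy loop behind autoselect() can be sketched as follows. GreedySelector is a toy stand-in, not the PicturedRocks class: its score() is a placeholder (plain variance), whereas the real subclasses (MIM, JMI, CIFE) score candidates with the mutual-information objectives described in this section:

```python
import numpy as np

class GreedySelector:
    """Toy stand-in for IterativeFeatureSelection: autoselect()
    repeatedly adds the unselected feature with the highest score."""

    def __init__(self, X):
        self.X = X
        self.S = []  # indices of selected features, in selection order

    def score(self, i):
        # Placeholder objective; real subclasses use
        # mutual-information-based scores here.
        return self.X[:, i].var()

    def add(self, ind):
        self.S.append(ind)

    def remove(self, ind):
        self.S.remove(ind)

    def autoselect(self, n_feats):
        for _ in range(n_feats):
            candidates = [i for i in range(self.X.shape[1])
                          if i not in self.S]
            self.add(max(candidates, key=self.score))
        return self.S
```

With a mutual-information score, each call to score() would also depend on the already-selected set self.S, which is why selection proceeds one feature at a time.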

#### Supervised Feature Selection

class picturedrocks.markers.MIM(infoset)
class picturedrocks.markers.CIFE(infoset)
class picturedrocks.markers.JMI(infoset)

#### Unsupervised Feature Selection

class picturedrocks.markers.UniEntropy(infoset)
class picturedrocks.markers.CIFEUnsup(infoset)

### Auxiliary Classes and Methods

class picturedrocks.markers.InformationSet(X, has_y=False)

Stores a discrete gene expression matrix

Parameters:
• X (numpy.ndarray) – a (num_obs, num_vars) shape array with dtype int
• has_y (bool) – whether the array X has a target label column (a y column) as its last column
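As a quick illustration of the has_y=True layout (the label vector stored as the last column of X), the input array might be assembled like this (toy data):

```python
import numpy as np

X_disc = np.array([[0, 2],
                   [1, 0],
                   [0, 1]])        # discretized expression, (num_obs, num_vars)
y = np.array([0, 1, 1])            # integer cluster labels, one per observation
X_with_y = np.column_stack([X_disc, y])  # y appended as the last column
# InformationSet(X_with_y, has_y=True) would then treat column -1 as y
```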
class picturedrocks.markers.SparseInformationSet(X, y=None)

Stores a sparse discrete gene expression matrix

Parameters:
• X (scipy.sparse.csc_matrix) – a (num_obs, num_vars) shape matrix with dtype np.integer
• y (Union[NoneType, numpy.ndarray]) – the target label column if there is one; otherwise pass None
entropy(cols)

Entropy of an ensemble of columns

Parameters: cols (numpy.ndarray) – a 1-d array (of dtype int64) with indices of the columns to compute entropy over. To include the y column, use index -1 as the first entry of cols. All entries in cols except for the first entry must be non-negative.

Returns: the Shannon entropy of cols

Return type: numpy.int64
entropy_wrt(cols)

Compute multiple entropies at once

This method computes the entropy of cols + [i], iterating over all possible values of i, and returns an array of entropies (one for each column).

Parameters: cols (numpy.ndarray) – a 1-d array (of dtype int64) with indices of the columns to compute entropy with respect to. To include the y column, use index -1 as the first entry of cols. All entries in cols except for the first entry must be non-negative.

Note

To compute the entropy of each column individually, pass an empty numpy array of dtype int64. A quick way to create one is np.arange(0).

Returns: a 1-d array of entropies (where entry i corresponds to the entropy of the columns cols together with column i)

Return type: numpy.ndarray
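A dense, naive reference for these two methods might look like the following. This is a sketch only: the real SparseInformationSet works on sparse matrices and supports the y column via index -1, which this sketch omits:

```python
import numpy as np

def _joint_entropy(M):
    """Shannon entropy (in bits) of the joint distribution of the rows of M."""
    _, counts = np.unique(M, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def entropy_dense(X, cols):
    """Entropy of the ensemble of columns `cols` of X."""
    return _joint_entropy(X[:, cols])

def entropy_wrt_dense(X, cols):
    """Entropy of cols + [i] for every column i of X."""
    return np.array([_joint_entropy(np.column_stack([X[:, cols], X[:, [i]]]))
                     for i in range(X.shape[1])])
```

With cols empty (e.g. np.arange(0)), entropy_wrt_dense returns the entropy of each column on its own, matching the note above.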
picturedrocks.markers.makeinfoset(adata, include_y, k=5)

Discretize data and make a SparseInformationSet object

Parameters:
• adata (anndata.AnnData) – the data to discretize. By default, data is discretized as round(log2(X + 1)).
• include_y (bool) – whether the y (cluster label) column is included in the InformationSet object

Returns: an object that can be used to perform information-theoretic calculations

Return type: SparseInformationSet
picturedrocks.markers.mutualinformation.infoset.quantile_discretize(X, k=5)

Discretize a data matrix with a recursive quantile transform

Parameters:
• X (Union[numpy.ndarray, scipy.sparse.spmatrix]) – the input data matrix to transform
• k (int) – the number of bins to use in the discretization

Returns: the discretized data matrix

Return type: np.ndarray
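The basic idea of a quantile transform can be sketched with plain equal-frequency binning of a single column. This is a non-recursive simplification for illustration; the actual quantile_discretize applies its transform recursively and handles full (possibly sparse) matrices:

```python
import numpy as np

def quantile_bins(x, k=5):
    """Assign each value of a 1-d array to one of k roughly
    equal-frequency bins (non-recursive simplification of the
    recursive quantile transform used by quantile_discretize)."""
    # k - 1 interior quantile cut points
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(x, edges)
```

Equal-frequency (rather than equal-width) bins matter here because gene expression values are heavily skewed, so equal-width bins would leave most observations in a single bin.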