Selecting Markers¶
PicturedRocks can be used to implement numerous information-theoretic feature selection methods. In both the supervised and unsupervised cases, it is computationally intractable to optimize the true objective functions. We want to find a small set of genes \(S\), with \(|S| < n\) such that, in the supervised case \(I(S; y)\) is maximized and in the unsupervised case \(H(S)\) is maximized. (For more information, see our paper)
Because these are computationally intractable, we have implemented the following approximations to \(\argmax_{x_i} I(S \cup \{x_i\}; y)\) (the ideal supervised objective function)
- The Mutual Information Maximization (MIM) is a univariate feature selection method—it does not consider the interactions between variables and only considers features in isolation.
- \[\displaystyle J_\text{mim}(x_i) = I(x_i; y)\]
- The Joint Mutual Information (JMI) algorithm proposed by Yang and Moody uses a diminishing penalty on redundancy. However, it only penalizes relevant redundancy which, we explain in our paper, is desirable.
- \[\displaystyle J_\text{jmi}(x_i) = I(x_i; y) - \frac{1}{|S|} \sum_{x_j\in S} I(x_i; x_j; y)\]
- The (CIFE) algorithm proposed by Lin and Tang (and independently by others) does not diminish its redundancy penalty (which it applies only to relevant redundancy, as desired). Although it can penalize redundancy somewhat aggressively, it is the easiest to give theoretical analysis for (see our paper).
- \[\displaystyle J_\text{cife}(x_i) = I(x_i; y) - \sum_{x_j\in S} I(x_i; x_j; y)\]
Mutual information¶
Before running any mutual information based algorithms, we need a discretized
version of the gene expression matrix, with a limited number of discrete
values (because we do not make any assumptions about the distribution of gene
expression). Such data is stored in
InformationSet
, but by default, we suggest
using makeinfoset()
to generate such an object. The makeinfoset()
function uses the recursive quantile transform quantile_discretize()
.
Iterative Feature Selection¶
All information-theoretic feature selection methods in PicturedRocks are
greedy algorithms. In general, they implement the abstract class
IterativeFeatureSelection
class. See Supervised Feature Selection and Unsupervised Feature Selection below for specific algorithms.
-
class
picturedrocks.markers.
IterativeFeatureSelection
(infoset)¶ Abstract Class for Iterative Feature Selection
Auxiliary Classes and Methods¶
-
class
picturedrocks.markers.
InformationSet
(X, has_y=False)¶ Stores discrete gene expression matrix
Parameters: - X (numpy.ndarray) – a (num_obs, num_vars) shape array with
dtype
int
- has_y (bool) – whether the array X has a target label column (a y column) as its last column
- X (numpy.ndarray) – a (num_obs, num_vars) shape array with
-
class
picturedrocks.markers.
SparseInformationSet
(X, y=None)¶ Stores sparse discrete gene expression matrix
Parameters: - X (scipy.sparse.csc_matrix) – a (num_obs, num_vars) shape matrix with
dtype
int
- has_y (bool) – whether the array X has a target label column (a y column) as its last column
- X (scipy.sparse.csc_matrix) – a (num_obs, num_vars) shape matrix with
-
picturedrocks.markers.
makeinfoset
(adata, include_y, k=5)¶ Discretize data and make a Sparse InformationSet object
Parameters: - adata (anndata.AnnData) – The data to discretize. By default data is discretized as round(log2(X + 1)).
- include_y (bool) – Determines if the y (cluster label) column in included in the InformationSet object
Returns: An object that can be used to perform information theoretic calculations.
Return type:
-
picturedrocks.markers.mutualinformation.infoset.
quantile_discretize
(X, k=5)¶ Discretize data matrix with a recursive quantile transform
Parameters: - X (Union[numpy.ndarray, scipy.sparse.spmatrix]) – The input data matrix to transform.
- k (int) – The number of bins to use in the discretization.
Returns: The discretized data matrix
Return type: np.ndarray