Feature Selection Tutorial

In this Jupyter notebook, we’ll walk through the information-theoretic feature selection algorithms in PicturedRocks.

[1]:
import numpy as np
import scanpy as sc
import picturedrocks as pr
[2]:
adata = sc.datasets.paul15()
WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.
... storing 'paul15_clusters' as categorical
[3]:
adata
[3]:
AnnData object with n_obs × n_vars = 2730 × 3451
    obs: 'paul15_clusters'
    uns: 'iroot'

The process_clusts function copies the cluster column and precomputes various indices, etc. If you have multiple columns that could serve as target labels (e.g., different treatments, clusters from different clustering algorithms or parameters, or demographics), this sets and processes the given column as the one we’re currently examining.

This is necessary for supervised analysis and visualization tools in PicturedRocks that use cluster labels.

[4]:
pr.read.process_clusts(adata, "paul15_clusters")
[4]:
AnnData object with n_obs × n_vars = 2730 × 3451
    obs: 'paul15_clusters', 'clust', 'y'
    uns: 'iroot', 'num_clusts', 'clusterindices'

Normalize per cell and log transform the data

[5]:
sc.pp.normalize_per_cell(adata)
[6]:
sc.pp.log1p(adata)

The makeinfoset function creates a SparseInformationSet object with a discretized version of the data matrix. It is useful to have only a small number of discrete states that each gene can take so that entropy is a reasonable measurement. By default, makeinfoset performs an adaptive transform that we call a recursive quantile transform. This is implemented in pr.markers.mutualinformation.infoset.quantile_discretize. If you prefer a different discretization transformation, you can pass an already-transformed matrix directly to SparseInformationSet.
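
The exact recursive quantile transform lives in the library, but the basic idea of quantile binning can be sketched with plain NumPy. The function below is a simplified illustration (not the PicturedRocks implementation): it maps each continuous value to one of a handful of discrete states via quantile edges.

```python
import numpy as np

def simple_quantile_discretize(x, n_bins=5):
    """Bin a vector of expression values into n_bins discrete states
    by quantile. A simplified stand-in for a recursive quantile
    transform: each gene ends up with a small number of states, so
    entropies computed on it are meaningful."""
    # Interior quantile edges (excluding the 0% and 100% quantiles)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    # Assign each value the index of its quantile bin
    return np.searchsorted(edges, x, side="right")

rng = np.random.default_rng(0)
x = rng.lognormal(size=1000)
d = simple_quantile_discretize(x, n_bins=5)
print(d.min(), d.max(), np.unique(d).size)
```

With enough samples, each of the five states is roughly equally populated, which is exactly the property that makes entropy estimates well behaved.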

[7]:
infoset = pr.markers.makeinfoset(adata, True)

Because this dataset only has 3451 features, it is computationally easy to do feature selection without restricting the number of features. If we wanted to, we could do either supervised or unsupervised univariate feature selection (i.e., without considering any interactions between features).
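
Conceptually, the supervised univariate criterion (MIM) scores each gene by its mutual information with the cluster labels. Here is a rough NumPy sketch of that score on discretized, integer-coded data; this is illustrative only, not the PicturedRocks implementation:

```python
import numpy as np

def mutual_information(x, y):
    """I(X;Y) in nats for two discrete, non-negative integer-coded vectors."""
    # Joint distribution estimated from co-occurrence counts
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

# Toy data: one "gene" tracks the labels, the other is pure noise
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=500)
informative = (y + rng.integers(0, 2, size=500)) % 4  # correlated with y
noise = rng.integers(0, 4, size=500)                  # independent of y
scores = [mutual_information(g, y) for g in (informative, noise)]
print(scores)  # the informative gene scores much higher
```

Ranking all genes by this score and keeping the top k is exactly the univariate selection that MIM’s autoselect performs below.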

[8]:
# supervised
mim = pr.markers.mutualinformation.iterative.MIM(infoset)
most_relevant_genes = mim.autoselect(1000)
[9]:
# unsupervised
ue = pr.markers.mutualinformation.iterative.UniEntropy(infoset)
most_variable_genes = ue.autoselect(1000)

At this stage, we can slice our adata object as adata[:, most_relevant_genes] or adata[:, most_variable_genes] and create a new SparseInformationSet object for the sliced object. With so few genes, this isn’t strictly necessary here, but we will do it anyway for demonstration purposes.

Supervised Feature Selection

Let’s jump straight into supervised feature selection. Here we will use the CIFE objective, which scores each candidate gene by its relevance to the cluster labels while accounting for redundancy with already-selected genes.

[10]:
adata_mr = adata[:,most_relevant_genes]
infoset_mr = pr.markers.makeinfoset(adata_mr, True)
[11]:
cife = pr.markers.mutualinformation.iterative.CIFE(infoset_mr)
[12]:
cife.score[:20]
[12]:
array([1.15097367, 1.11953242, 1.04094902, 0.98779889, 0.89294165,
       0.82332825, 0.80324986, 0.79920393, 0.69325805, 0.68464788,
       0.66075136, 0.65738939, 0.6331728 , 0.62632343, 0.61487087,
       0.60934578, 0.59525158, 0.59172139, 0.58504638, 0.57874063])
[13]:
top_genes = np.argsort(cife.score)[::-1]
print(adata_mr.var_names[top_genes[:10]])
Index(['Prtn3', 'Mpo', 'Ctsg', 'Elane', 'Car2', 'Car1', 'H2afy', 'Calr',
       'Blvrb', 'Fam132a'],
      dtype='object')

Let’s select ‘Mpo’.

[14]:
ind = adata_mr.var_names.get_loc('Mpo')
[15]:
cife.add(ind)

Now, the top genes are:

[16]:
top_genes = np.argsort(cife.score)[::-1]
print(adata_mr.var_names[top_genes[:10]])
Index(['Car1', 'Apoe', 'H2afy', 'Fam132a', 'Car2', 'Mt1', 'Blvrb', 'Srgn',
       'Mt2', 'Prtn3'],
      dtype='object')

Observe that the order has changed based on redundancy (or lack thereof) with ‘Mpo’. Let’s add ‘Car1’.

[17]:
ind = adata_mr.var_names.get_loc('Car1')
cife.add(ind)
[18]:
top_genes = np.argsort(cife.score)[::-1]
print(adata_mr.var_names[top_genes[:10]])
Index(['Apoe', 'Ptprcap', 'Gpr56', 'Myb', 'Mcm5', 'Uqcrq', 'Lyar', 'Cox5a',
       'S100a10', 'Snrpd1'],
      dtype='object')

If we want to select the top gene repeatedly, we can use autoselect.

[19]:
cife.autoselect(5)
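
Conceptually, autoselect just repeats the argmax-and-add loop we performed by hand above. The sketch below uses a hypothetical stand-in for the selector (it only mimics the score/add/S interface we have been using); real objectives such as CIFE also re-score the remaining genes after each addition to account for redundancy:

```python
import numpy as np

class DummySelector:
    """Hypothetical stand-in exposing a per-feature `score` array, an
    `add` method, and the selected list `S`. Here, adding a feature
    merely removes it from contention; a real objective would also
    update the scores of all remaining features."""
    def __init__(self, scores):
        self.score = np.asarray(scores, dtype=float)
        self.S = []

    def add(self, i):
        self.S.append(i)
        self.score = self.score.copy()
        self.score[i] = -np.inf  # never pick the same feature twice

def autoselect(selector, n):
    # Greedy loop: repeatedly take the top-scoring feature and add it
    for _ in range(n):
        selector.add(int(np.argmax(selector.score)))

sel = DummySelector([0.2, 0.9, 0.5, 0.7])
autoselect(sel, 2)
print(sel.S)  # [1, 3]
```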

To look at the markers we’ve selected, we can examine cife.S.

[20]:
cife.S
[20]:
[1, 5, 34, 694, 904, 290, 597]
[21]:
adata_mr.var_names[cife.S]
[21]:
Index(['Mpo', 'Car1', 'Apoe', 'Srm', 'Atp5g3', 'Ncl', 'Rps3'], dtype='object')

User Interface

This process can also be done manually with a user interface, allowing you to incorporate domain knowledge into the process. Use the View dropdown to look at heatplots for candidate genes and already-selected genes.

[22]:
im = pr.markers.interactive.InteractiveMarkerSelection(adata_mr, cife, dim_red="umap", show_genes=False)
Running umap on cells...
/home/umang/anaconda3/envs/fastpr/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning:

invalid value encountered in sqrt

[23]:
im.show()

Note that because we passed the same cife object, any genes added or removed in the interface will affect the cife object.

[24]:
adata_mr.var_names[cife.S]
[24]:
Index(['Mpo', 'Car1', 'Apoe', 'Srm', 'Atp5g3', 'Ncl', 'Rps3'], dtype='object')

Unsupervised Feature Selection

This works very similarly. In the example below, we’ll autoselect 5 genes and then run the interface. Note that although the previous section required cluster labels, the following code does not.
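
Without labels, the univariate analogue of relevance is the entropy of each gene, which is what a class named UniEntropy presumably ranks by: genes spread over many discrete states carry more information than genes stuck in one state. A small NumPy illustration of that idea (not the library’s code):

```python
import numpy as np

def entropy(x):
    """Shannon entropy (in nats) of a discrete, non-negative
    integer-coded vector."""
    counts = np.bincount(x)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# A gene spread evenly over 4 states is maximally "variable";
# a gene stuck in a single state carries no information at all.
varied = np.array([0, 1, 2, 3] * 25)
flat = np.zeros(100, dtype=int)
print(entropy(varied), entropy(flat))
```

Ranking genes by this quantity needs no cluster labels, which is why the unsupervised code below runs on datasets without any y.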

[25]:
cife_unsup = pr.markers.mutualinformation.iterative.CIFEUnsup(infoset)
[26]:
cife_unsup.autoselect(5)

(If you ran the example above, this will load faster because the UMAP coordinates for genes and cells have already been computed.) You can also customize which plots are displayed with keyword arguments (e.g., InteractiveMarkerSelection(..., show_genes=False)). Future versions may allow arbitrary plots.

[27]:
im_unsup = pr.markers.interactive.InteractiveMarkerSelection(adata, cife_unsup, show_genes=False, show_cells=False, dim_red="umap")
[28]:
im_unsup.show()

Binary Feature Selection

We can also perform feature selection specifically for individual class labels (e.g., clusters). This is done by changing the SparseInformationSet’s y array. In the example below, we will target the class label “2Ery”. Notice that the features selected by MIM (which doesn’t consider redundancy) are only those that are informative about “2Ery” in particular.

Binary (i.e., not multiclass) feature selection can be performed with any information-theoretic feature selection algorithm (e.g., CIFE, JMI, MIM).
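
The one-vs-rest relabeling amounts to collapsing the multiclass cluster vector into a 0/1 indicator, which is what the comparison against '2Ery' in the next cell does on adata. A toy NumPy version of that step (with made-up cluster names):

```python
import numpy as np

# Collapse a multiclass label vector into a one-vs-rest 0/1 indicator
clusters = np.array(["2Ery", "7MEP", "2Ery", "12Baso", "2Ery"])
y_binary = (clusters == "2Ery").astype(int)
print(y_binary)  # [1 0 1 0 1]
```

Under this binary target, a gene’s univariate score measures only how well it distinguishes the chosen class from everything else.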

[29]:
# since we are changing y anyway, the value of include_y (True in the line below) doesn't matter
infoset2 = pr.markers.makeinfoset(adata, True)
infoset2.set_y((adata.obs['clust'] == '2Ery').astype(int).values)
[30]:
mim2 = pr.markers.mutualinformation.iterative.MIM(infoset2)
[31]:
im2 = pr.markers.interactive.InteractiveMarkerSelection(adata, mim2, show_genes=False, dim_red="umap")
Running umap on cells...
/home/umang/anaconda3/envs/fastpr/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning:

invalid value encountered in sqrt

[32]:
im2.show()