Feature Selection Tutorial

In this Jupyter notebook, we’ll walk through the information-theoretic feature selection algorithms in PicturedRocks.

In [1]:
import numpy as np
import scanpy.api as sc
import picturedrocks as pr
In [2]:
adata = sc.datasets.paul15()
WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.
... storing 'paul15_clusters' as categorical
In [3]:
adata
Out[3]:
AnnData object with n_obs × n_vars = 2730 × 3451
    obs: 'paul15_clusters'
    uns: 'iroot'

The process_clusts method copies the cluster column and precomputes various indices, etc. If you have multiple columns that could be used as target labels (e.g., different treatments, clusterings from different algorithms or parameters, or demographics), this sets and processes the given column as the one we’re currently examining.

This is necessary for supervised analysis and visualization tools in PicturedRocks that use cluster labels.
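Roughly, the integer labels and per-cluster indices that process_clusts stores can be sketched with plain pandas/numpy (a hypothetical re-implementation for intuition, not the actual PicturedRocks code):

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of what process_clusts derives from a label column:
# integer codes per cell ("y") and the row indices belonging to each cluster.
labels = pd.Series(["B", "A", "B", "C"], dtype="category")
y = labels.cat.codes.to_numpy()                  # integer label per cell
num_clusts = len(labels.cat.categories)
clusterindices = {k: np.flatnonzero(y == k) for k in range(num_clusts)}
```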

In [4]:
pr.read.process_clusts(adata, "paul15_clusters")
Out[4]:
AnnData object with n_obs × n_vars = 2730 × 3451
    obs: 'paul15_clusters', 'clust', 'y'
    uns: 'iroot', 'num_clusts', 'clusterindices'

Normalize per cell and log transform the data

In [5]:
sc.pp.normalize_per_cell(adata)
In [6]:
sc.pp.log1p(adata)

The makeinfoset function creates an InformationSet object with the following discretization: \([\log_2(X_{ij} + 1)]_{ij}\), where \(X_{ij}\) is the data matrix and \([\cdot]\) denotes rounding to the nearest integer. It is useful to have only a small number of discrete states that each gene can take so that entropy is a reasonable measurement.
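To see why a small number of states helps, here is the empirical entropy of one discretized gene (illustrative only; PicturedRocks computes joint entropies internally):

```python
import numpy as np

# Empirical Shannon entropy of a discretized gene: with only a handful of
# states, the distribution is estimated well even from modest cell counts.
x = np.array([0, 1, 1, 2, 2, 2, 2, 3])        # discretized expression values
p = np.bincount(x) / x.size                   # empirical distribution over states
H = -(p[p > 0] * np.log2(p[p > 0])).sum()     # entropy in bits
```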

In [7]:
infoset = pr.markers.makeinfoset(adata, True)

Alternatively, we could have made our own InformationSet object with our own discretization. Here we re-create infoset by hand. Note: At this time, InformationSet cannot handle scipy’s sparse matrices.

In [8]:
X_disc = np.log2(adata.X + 1).round().astype(int)
In [9]:
infoset2 = pr.markers.mutualinformation.infoset.InformationSet(np.c_[X_disc, adata.obs['y']], True)
In [10]:
np.array_equal(infoset2.X, infoset.X)
Out[10]:
True

Because this dataset only has 3451 features, it is computationally easy to do feature selection without restricting the number of features. If we wanted to, we could do either supervised or unsupervised univariate feature selection (i.e., without considering any interactions between features).
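For intuition, the supervised univariate score (as in MIM) is the mutual information between each discretized gene and the label. A numpy-only sketch (not the PicturedRocks implementation):

```python
import numpy as np

# Mutual information I(x; y) between a discretized gene and cluster labels,
# computed from the joint contingency table (a sketch, not PicturedRocks code).
def mutual_info(x, y):
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1.0)                # count co-occurrences
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px * py)[nz])).sum())

x = np.array([0, 0, 1, 1])                       # perfectly informative gene
y = np.array([0, 0, 1, 1])
```

A gene that determines the label exactly scores 1 bit here; an independent gene scores 0.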

In [11]:
# supervised
mim = pr.markers.mutualinformation.iterative.MIM(infoset)
most_relevant_genes = mim.autoselect(1000)
In [12]:
# unsupervised
ue = pr.markers.mutualinformation.iterative.UniEntropy(infoset)
most_variable_genes = ue.autoselect(1000)

At this stage, if we wanted to, we could slice our adata object as adata[:,most_relevant_genes] or adata[:,most_variable_genes] and create a new InformationSet object for this sliced object. We don’t need to do that here.
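The slicing step itself is just column restriction; a toy numpy version (with AnnData it would be adata[:, most_relevant_genes] followed by a fresh makeinfoset call):

```python
import numpy as np

# Column-restriction sketch on toy data: keep only the selected genes.
X = np.arange(12).reshape(3, 4)          # 3 cells x 4 genes
most_relevant_genes = [2, 0]             # hypothetical selected indices
X_small = X[:, most_relevant_genes]      # 3 cells x 2 selected genes
```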

Supervised Feature Selection

Let’s jump straight into supervised feature selection. Here we will use the CIFE objective.
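As a reminder of the objective (standard CIFE notation, not necessarily the exact form used internally): a candidate gene \(x\) is scored, given the already-selected set \(S\) and labels \(y\), by

```latex
J_{\text{CIFE}}(x) = I(x; y) \;-\; \sum_{s \in S} I(x; s) \;+\; \sum_{s \in S} I(x; s \mid y)
```

so a gene is rewarded for relevance to the labels and penalized for redundancy with already-selected genes, except where that redundancy is itself explained by the class label.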

In [13]:
cife = pr.markers.mutualinformation.iterative.CIFE(infoset)
In [14]:
cife.score
Out[14]:
array([0.06153028, 0.03873652, 0.10514698, ..., 0.13305092, 0.17305014,
       0.03877704])
In [15]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Ctsg', 'Prtn3', 'Car2', 'Elane', 'Car1', 'H2afy', 'Ermap', 'Mt2',
       'Fam132a', 'Klf1'],
      dtype='object')

Let’s select ‘Car1’.

In [16]:
ind = adata.var_names.get_loc('Car1')
In [17]:
cife.add(ind)

Now, the top genes are

In [18]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Elane', 'Ctsg', 'Prtn3', 'Apoe', 'Fam132a', 'P4hb', 'Gstm1', 'H2afy',
       'Alas1', 'Car2'],
      dtype='object')

Observe that the order has changed based on redundancy (or lack thereof) with ‘Car1’. Let’s add ‘Ctsg’.

In [19]:
ind = adata.var_names.get_loc('Ctsg')
cife.add(ind)
In [20]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Apoe', 'Mt1', 'Mcm5', 'Ran', 'Srm', 'Uqcrq', 'Psma7', 'Hspd1', 'Uqcr',
       'Cox5a'],
      dtype='object')

If we want to select the top gene at each step, we can use autoselect

In [21]:
cife.autoselect(10)
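Conceptually, autoselect is greedy forward selection: at each step it takes the argmax of the current score vector, adds that gene, and lets the scores update for redundancy. A toy sketch with a hypothetical scoring function (not the CIFE computation):

```python
import numpy as np

# Greedy forward selection sketch: repeatedly add the top-scoring feature.
def greedy_select(score_fn, n_select):
    S = []
    for _ in range(n_select):
        scores = score_fn(S)              # score every gene given current S
        scores[S] = -np.inf               # never re-pick selected genes
        S.append(int(np.argmax(scores)))
    return S

# Toy objective: base relevance minus a redundancy penalty per selected gene.
base = np.array([0.9, 0.5, 0.8])
penalty = np.array([[0.0, 0.1, 0.6],
                    [0.1, 0.0, 0.1],
                    [0.6, 0.1, 0.0]])

def score_fn(S):
    return base - penalty[S].sum(axis=0)
```

Here gene 2 is the second-most relevant on its own, but its redundancy with gene 0 pushes gene 1 ahead of it, mirroring the reordering we saw after adding ‘Car1’.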

To look at the markers we’ve selected, we can examine cife.S

In [22]:
cife.S
Out[22]:
[552, 815, 305, 1953, 3002, 2666, 2479, 596, 2645, 785, 412, 1503]
In [23]:
adata.var_names[cife.S]
Out[23]:
Index(['Car1', 'Ctsg', 'Apoe', 'Mt1', 'Taldo1', 'S100a10', 'Ptprcap', 'Ccnd2',
       'Rps8', 'Crip1', 'Atpase6', 'Hspa8'],
      dtype='object')

This process can also be done manually through a user interface.

In [24]:
im = pr.markers.interactive.InteractiveMarkerSelection(adata, cife, dim_red="umap")
Running umap on genes...
/home/uvarma/miniconda3/envs/scpr/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning:

invalid value encountered in sqrt

Running umap on cells...
/home/uvarma/miniconda3/envs/scpr/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning:

invalid value encountered in sqrt

In [25]:
im.show()

Note that because we passed the same cife object, any genes added or removed in the interface will affect the cife object.

In [26]:
adata.var_names[cife.S]
Out[26]:
Index(['Car1', 'Ctsg', 'Apoe', 'Mt1', 'Taldo1', 'S100a10', 'Ptprcap', 'Ccnd2',
       'Rps8', 'Crip1', 'Atpase6', 'Hspa8'],
      dtype='object')

Unsupervised Feature Selection

This works very similarly. In the example below, we’ll autoselect 5 genes and then run the interface. Note that although the previous section would not work without cluster labels, the following code will.

In [27]:
cife_unsup = pr.markers.mutualinformation.iterative.CIFEUnsup(infoset)
In [28]:
cife_unsup.autoselect(5)

(If you ran the example above, this will load faster because the UMAP coordinates for genes and cells have already been computed. You can also disable one or both of these plots with keyword arguments, e.g., InteractiveMarkerSelection(..., show_genes=False).)

In [29]:
im_unsup = pr.markers.interactive.InteractiveMarkerSelection(adata, cife_unsup, show_genes=False, dim_red="umap")
In [30]:
im_unsup.show()