Feature Selection Tutorial¶
In this Jupyter notebook, we’ll walk through the information-theoretic feature selection algorithms in PicturedRocks.
In [1]:
import numpy as np
import scanpy.api as sc
import picturedrocks as pr
In [2]:
adata = sc.datasets.paul15()
WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.
... storing 'paul15_clusters' as categorical
In [3]:
adata
Out[3]:
AnnData object with n_obs × n_vars = 2730 × 3451
obs: 'paul15_clusters'
uns: 'iroot'
The process_clusts
method copies the cluster column and precomputes
various indices, etc. If you have multiple columns that could serve as
target labels (e.g., different treatments, clusters from different
clustering algorithms or parameters, or demographics), this sets and
processes the given column as the one we’re currently examining.
This step is necessary for the supervised analysis and visualization tools in PicturedRocks that use cluster labels.
In [4]:
pr.read.process_clusts(adata, "paul15_clusters")
Out[4]:
AnnData object with n_obs × n_vars = 2730 × 3451
obs: 'paul15_clusters', 'clust', 'y'
uns: 'iroot', 'num_clusts', 'clusterindices'
Normalize per cell and log-transform the data.
In [5]:
sc.pp.normalize_per_cell(adata)
In [6]:
sc.pp.log1p(adata)
The makeinfoset
method creates an InformationSet
object with
the following discretization: \(\left[\log_2(X_{ij} + 1)\right]_{ij}\), where
\(X\) is the data matrix and \([\cdot]\) denotes rounding to the nearest
integer. It is useful to have only a small
number of discrete states that each gene can take so that entropy is a
reasonable measurement.
In [7]:
infoset = pr.markers.makeinfoset(adata, True)  # True: include the cluster labels 'y' in the information set
Alternatively, we could have made our own InformationSet object with our
own discretization. Here we re-create infoset by hand. Note: at this
time, InformationSet cannot handle scipy’s sparse matrices.
In [8]:
X_disc = np.log2(adata.X + 1).round().astype(int)  # same discretization that makeinfoset applies
In [9]:
infoset2 = pr.markers.mutualinformation.infoset.InformationSet(np.c_[X_disc, adata.obs['y']], True)
In [10]:
np.array_equal(infoset2.X, infoset.X)
Out[10]:
True
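As noted above, InformationSet cannot handle scipy’s sparse matrices, so if your AnnData object stores X in sparse form you would need to densify it before discretizing. A minimal sketch, assuming scipy is installed (paul15 is already dense, so this would be a no-op here):
import scipy.sparse as sp

# densify only if the data matrix is sparse
X_dense = adata.X.toarray() if sp.issparse(adata.X) else adata.X
X_disc = np.log2(X_dense + 1).round().astype(int)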
Because this dataset only has 3451 features, it is computationally easy to do feature selection without restricting the number of features. If we wanted to, we could do either supervised or unsupervised univariate feature selection (i.e., without considering any interactions between features).
In [11]:
# supervised
mim = pr.markers.mutualinformation.iterative.MIM(infoset)
most_relevant_genes = mim.autoselect(1000)
In [12]:
# unsupervised
ue = pr.markers.mutualinformation.iterative.UniEntropy(infoset)
most_variable_genes = ue.autoselect(1000)
At this stage, if we wanted to, we could slice our adata object as
adata[:, most_relevant_genes]
or adata[:, most_variable_genes]
and create a new InformationSet
object for the sliced object (see the sketch below). We don’t
need to do that here.
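For reference, the slicing would look something like this (a minimal sketch; the variable names are only illustrative):
# restrict to the 1000 genes selected by MIM and rebuild the information set
adata_relevant = adata[:, most_relevant_genes].copy()
infoset_relevant = pr.markers.makeinfoset(adata_relevant, True)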
Supervised Feature Selection¶
Let’s jump straight into supervised feature selection. Here we will use
the CIFE objective.
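For reference, the CIFE (Conditional Infomax Feature Extraction) criterion scores each candidate gene \(X_k\) by its relevance to the labels \(y\), penalized by redundancy with the already-selected genes \(S\) and rewarded for class-conditional complementarity. In its standard form (this is the textbook criterion, not a line-by-line account of the PicturedRocks implementation):
\[J_{\mathrm{CIFE}}(X_k) = I(X_k; y) - \sum_{X_j \in S} I(X_k; X_j) + \sum_{X_j \in S} I(X_k; X_j \mid y)\]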
In [13]:
cife = pr.markers.mutualinformation.iterative.CIFE(infoset)
In [14]:
cife.score
Out[14]:
array([0.06153028, 0.03873652, 0.10514698, ..., 0.13305092, 0.17305014,
0.03877704])
In [15]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Ctsg', 'Prtn3', 'Car2', 'Elane', 'Car1', 'H2afy', 'Ermap', 'Mt2',
'Fam132a', 'Klf1'],
dtype='object')
Let’s select ‘Car1’.
In [16]:
ind = adata.var_names.get_loc('Car1')
In [17]:
cife.add(ind)
Now, the top genes are:
In [18]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Elane', 'Ctsg', 'Prtn3', 'Apoe', 'Fam132a', 'P4hb', 'Gstm1', 'H2afy',
'Alas1', 'Car2'],
dtype='object')
Observe that the order has changed based on redundancy (or lack thereof) with ‘Car1’. Let’s add ‘Ctsg’.
In [19]:
ind = adata.var_names.get_loc('Ctsg')
cife.add(ind)
In [20]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Apoe', 'Mt1', 'Mcm5', 'Ran', 'Srm', 'Uqcrq', 'Psma7', 'Hspd1', 'Uqcr',
'Cox5a'],
dtype='object')
If we want to select the top gene at each step, we can use autoselect.
In [21]:
cife.autoselect(10)
To look at the markers we’ve selected, we can examine cife.S
In [22]:
cife.S
Out[22]:
[552, 815, 305, 1953, 3002, 2666, 2479, 596, 2645, 785, 412, 1503]
In [23]:
adata.var_names[cife.S]
Out[23]:
Index(['Car1', 'Ctsg', 'Apoe', 'Mt1', 'Taldo1', 'S100a10', 'Ptprcap', 'Ccnd2',
'Rps8', 'Crip1', 'Atpase6', 'Hspa8'],
dtype='object')
This process can also be done manually with a user interface.
In [24]:
im = pr.markers.interactive.InteractiveMarkerSelection(adata, cife, dim_red="umap")
Running umap on genes...
/home/uvarma/miniconda3/envs/scpr/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning:
invalid value encountered in sqrt
Running umap on cells...
/home/uvarma/miniconda3/envs/scpr/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning:
invalid value encountered in sqrt
In [25]:
im.show()
Note that because we passed the same cife object, any genes added/removed in the interface will affect the cife object.
In [26]:
adata.var_names[cife.S]
Out[26]:
Index(['Car1', 'Ctsg', 'Apoe', 'Mt1', 'Taldo1', 'S100a10', 'Ptprcap', 'Ccnd2',
'Rps8', 'Crip1', 'Atpase6', 'Hspa8'],
dtype='object')
Unsupervised Feature Selection¶
This works very similarly. In the example below, we’ll autoselect 5 genes and then run the interface. Note that while the previous section required cluster labels, the following code does not.
In [27]:
cife_unsup = pr.markers.mutualinformation.iterative.CIFEUnsup(infoset)
In [28]:
cife_unsup.autoselect(5)
(If you ran the example above, this will load faster because the UMAP
coordinates for genes and cells have already been computed. You can also
disable one or both of these plots with keyword arguments, e.g.,
InteractiveMarkerSelection(..., show_genes=False).)
In [29]:
im_unsup = pr.markers.interactive.InteractiveMarkerSelection(adata, cife_unsup, show_genes=False, dim_red="umap")
In [30]:
im_unsup.show()