Exploring Feature Importances with MMoCHi

Authors: Daniel Caron and William Specht

In this notebook, we will evaluate individual features for their importance during random forest classification. This can provide insight into how the random forests are performing and identify novel transcriptomic and proteomic markers of cell subsets.

Exploring Feature importances

What are feature importances and why use them?

Unlike some supervised classifiers, which rely on strict feature selection or dimensionally reduced feature sets, random forests perform well with high-dimensional data. When training decision trees, features are selected for their ability to reduce impurity at each branch. To reduce overfitting, random forests average over multiple decision trees trained under similar conditions. This naturally lends itself well to ranking features based on their importance for classification. Although there are many methods to calculate feature importance, here we use scikit-learn’s implementation of impurity-based feature_importances where importance is calculated from the mean decrease in Gini impurity across the trees in a forest. A ranking of these importances can then be generated for each classification layer providing insight into differences between subsets at each level of the classifier. Note that other methods to calculate important features, including permutation-based methods, tend to be much slower to calculate, but could also be used.

Feature importances can serve a few purposes. First, they can be used to understand how the random forest is classifying cell types and may reveal why specific events are or are not being labeled correctly. While well-performing classifiers will have high feature importances for highly expressed cell type markers, poorly performing classifiers may give high feature importance to batch-specific artifacts. If this is the case, the training data or feature set given to the random forest for classification can be altered to improve performance. Alternatively, if cell type markers are given a high performance but classification is still sub-par, random forest hyperparameters (with the clf_kwargs argument in mmc.Hierarchy.add_classification()) can be altered to increase fit and improve performance. More information on random forest hyperparameters here.

Another useful application of feature importances is as a discovery tool for novel cell type markers. Newly identified markers may inform high-confidence thresholding and fluorescence-based gating strategies. For this application, it’s worth noting that feature importances are not associated with individual subsets, so one must take care during interpretation of highly important features, especially in multiclass scenarios. To associate features with a subset, we suggest using differential expression analysis. By providing a window into profiles of our cell types, features identified may also pave the way for new avenues of study, similar to differential expression. See our preprint for more details.

Import packages

[1]:

import anndata
import mmochi as mmc
import os
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 150
mmc.log_to_file('mmochi_feature_importance')

Load in hierarchy used for classification and classified data

Run Integrated Classification First!

This tutorial builds off of the Integrated Classification tutorial, and uses the trained MMoCHi classifier and classified AnnData object saved during that tutorial.

Let’s load in the previously saved AnnData and MMoCHi Hierarchy objects:

[2]:

assert os.path.isfile('data/classified_events.h5ad') & os.path.isfile('data/IntegratedClassifier.hierarchy'), \
    'Objects not found. Run the Integrated Clasification tutorial to generate them!'
adata = anndata.read_h5ad('data/classified_events.h5ad')
h = mmc.Hierarchy(load='data/IntegratedClassifier')
h.display(True)

Loading classifier from data/IntegratedClassifier...

Loaded data/IntegratedClassifier.hierarchy

_images/Exploring_Feature_Importances_6_2.png

Once the random forest classifiers have been trained and stored in the hierarchy, feature importances can be extracted as a list using mmc.feature_importances(), which returns a DataFrame of all features sorted by importance for the specified level. Within MMoCHi, features are labeled by their modality (separated by '_mod_'). You can see here that for lymphocyte subsets, there are many genes whose expression is highly important for subset identification:

[3]:

mmc.feature_importances(h, 'Lymphoid').head(25)

[3]:

	Feature	Importance
0	RPS23_mod_GEX	0.015662
1	CKAP4_mod_GEX	0.011250
2	EAF2_mod_GEX	0.010595
3	SEC61B_mod_GEX	0.010389
4	SEC14L1_mod_GEX	0.010384
5	HLA-DRA_mod_GEX	0.009776
6	FAU_mod_GEX	0.009667
7	CD74_mod_GEX	0.008734
8	CD19_mod_landmark_protein	0.008237
9	CD52_mod_GEX	0.008201
10	CTSW_mod_GEX	0.008054
11	NKG7_mod_GEX	0.007987
12	CD56_mod_landmark_protein	0.007987
13	TYROBP_mod_GEX	0.007819
14	RPS27_mod_GEX	0.007532
15	CD37_mod_GEX	0.007452
16	PRDX1_mod_GEX	0.007328
17	MS4A1_mod_GEX	0.007310
18	RPS14_mod_GEX	0.007162
19	PPIB_mod_GEX	0.007092
20	BANK1_mod_GEX	0.006815
21	JUND_mod_GEX	0.006707
22	TNFRSF17_mod_GEX	0.006542
23	ITM2C_mod_GEX	0.006396
24	HSP90B1_mod_GEX	0.006389

If you are interested in focusing on proteins and protein coding transcript markers, you may want to limit classification only protein coding genes. Additionally, limiting to only protein coding genes may help to remove some sequencing artifacts, and improve classification performance. This can be accomplished by removing these genes prior to classification, or using the limit argument in mmc.classify. See docs for more details.

On other levels, such as CD4+ vs CD8+ T cells, you can see various proteins played a more important role in cell type classification. Note that the much higher values for importance here are in part due to the hyperparameters used to train the random forest at this level (see the Hierarchy Design tutorial for more details).

[4]:

mmc.feature_importances(h, 'CD4_CD8').head(25)

[4]:

	Feature	Importance
0	CD8A_mod_GEX	0.209048
1	CD8a_mod_landmark_protein	0.207746
2	CD4_mod_landmark_protein	0.153321
3	CD8B_mod_GEX	0.060999
4	CD14_mod_landmark_protein	0.045218
5	CTSW_mod_GEX	0.031086
6	CD3_mod_landmark_protein	0.021159
7	CD4_mod_GEX	0.021029
8	CCL5_mod_GEX	0.016837
9	LINC02446_mod_GEX	0.012924
10	CD25_mod_landmark_protein	0.011975
11	CD40LG_mod_GEX	0.011359
12	GZMM_mod_GEX	0.010361
13	CST7_mod_GEX	0.010055
14	NKG7_mod_GEX	0.009144
15	MT-CO3_mod_GEX	0.009041
16	ARPC2_mod_GEX	0.006803
17	GZMK_mod_GEX	0.006338
18	MT-ATP6_mod_GEX	0.006118
19	HCST_mod_GEX	0.005961
20	DUSP2_mod_GEX	0.005052
21	LTB_mod_GEX	0.005037
22	MT-CO2_mod_GEX	0.004820
23	RPS3_mod_GEX	0.003996
24	MT-CYB_mod_GEX	0.003914

Visualization to evaluate classification

The plotting function below displays the top 25 most important features for the specified classification layer(s) in violin plots. These plots highlight which subsets express the most important features. By default, these plots group events by their high-confidence label and their predicted classification. This is incredibly useful for exploring discrepancies between high-confidence thresholding and classification, or for understanding performance on cells that were not high-confidence thresholded. For example, here we can identify that in the Lymphoid classifier, many of the events not labeled by high-confidence thresholding, but labeled NK/ILC ('? -> NK_ILC') actually expressed CD3 protein. Although this only represents ~200 events, it suggests there is room for improvement in the high-confidence thresholding or the hierarchy design. Note that these plots also display the expression of these markers on cells not included at this classification level ('nan -> nan') for reference.

[5]:

mmc.plot_important_features(adata, 'Lymphoid', h)

... storing 'hc -> class' as categorical

WARNING: dendrogram data not found (using key=dendrogram_hc -> class). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: You’re trying to run this on 33538 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
WARNING: dendrogram data not found (using key=dendrogram_hc -> class). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

_images/Exploring_Feature_Importances_13_2.png

To best compare expression levels between classified subsets, ‘classification’ can be passed as an argument to display all subsets at an individual classification level.

[6]:

mask = (~adata.obsm['lin']['CD4_CD8_hc'].isin(['nan']))
mmc.plot_important_features(adata[mask].copy(), ['CD4_CD8'], h, reference = 'classification')

WARNING: Dendrogram not added. Dendrogram is added only when the number of categories to plot > 2
WARNING: Dendrogram not added. Dendrogram is added only when the number of categories to plot > 2

_images/Exploring_Feature_Importances_15_1.png

Great!

You can now gain more insight into the decisions your classifier makes, and use feature importances to identify cell type markers