Exploring Feature Importances with MMoCHi

Authors: Daniel Caron and William Specht

In this notebook, we will evaluate individual features for their importance during random forest classification. This can provide insight into how the random forests are performing and identify novel transcriptomic and proteomic markers of cell subsets.

Exploring Feature importances

What are feature importances and why use them?

      Unlike some supervised classifiers, which rely on strict feature selection or dimensionally reduced feature sets, random forests perform well with high-dimensional data. When training decision trees, features are selected for their ability to reduce impurity at each branch. To reduce overfitting, random forests average over multiple decision trees trained under similar conditions. This naturally lends itself well to ranking features based on their importance for classification. Although there are many methods to calculate feature importance, here we use scikit-learn’s implementation of impurity-based feature_importances where importance is calculated from the mean decrease in Gini impurity across the trees in a forest. A ranking of these importances can then be generated for each classification layer providing insight into differences between subsets at each level of the classifier. Note that other methods to calculate important features, including permutation-based methods, tend to be much slower to calculate, but could also be used.

      Feature importances can serve a few purposes. First, they can be used to understand how the random forest is classifying cell types and may reveal why specific events are or are not being labeled correctly. While well-performing classifiers will have high feature importances for highly expressed cell type markers, poorly performing classifiers may give high feature importance to batch-specific artifacts. If this is the case, the training data or feature set given to the random forest for classification can be altered to improve performance. Alternatively, if cell type markers are given a high performance but classification is still sub-par, random forest hyperparameters (with the clf_kwargs argument in mmc.Hierarchy.add_classification()) can be altered to increase fit and improve performance. More information on random forest hyperparameters here.

      Another useful application of feature importances is as a discovery tool for novel cell type markers. Newly identified markers may inform high-confidence thresholding and fluorescence-based gating strategies. For this application, it’s worth noting that feature importances are not associated with individual subsets, so one must take care during interpretation of highly important features, especially in multiclass scenarios. To associate features with a subset, we suggest using differential expression analysis. By providing a window into profiles of our cell types, features identified may also pave the way for new avenues of study, similar to differential expression. See our paper at Cell Reports Methods for more details.

Import packages

[7]:
import anndata
import mmochi as mmc
import os
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 150
mmc.log_to_file('mmochi_feature_importance')

Load in hierarchy used for classification and classified data

Run Integrated Classification First!

This tutorial builds off of the Integrated Classification tutorial, and uses the trained MMoCHi classifier and classified AnnData object saved during that tutorial.

Let’s load in the previously saved AnnData and MMoCHi Hierarchy objects:

[8]:
assert os.path.isfile('data/classified_events.h5ad') & os.path.isfile('data/IntegratedClassifier.hierarchy'), \
    'Objects not found. Run the Integrated Classification tutorial to generate them!'
adata = anndata.read_h5ad('data/classified_events.h5ad')
h = mmc.Hierarchy(load='data/IntegratedClassifier')
h.display(True)
Loading classifier from data/IntegratedClassifier...
Loaded data/IntegratedClassifier.hierarchy
_images/Exploring_Feature_Importances_6_2.png

      Once the random forest classifiers have been trained and stored in the hierarchy, feature importances can be extracted as a list using mmc.feature_importances(), which returns a DataFrame of all features sorted by importance for the specified level. Within MMoCHi, features are labeled by their modality (separated by '_mod_'). You can see here that for lymphocyte subsets, there are many genes whose expression is highly important for subset identification:

[9]:
mmc.feature_importances(h, 'Lymphoid').head(25)
[9]:
Feature Importance
0 MZB1_mod_gex 0.028209
1 ITM2C_mod_gex 0.020467
2 HSP90B1_mod_gex 0.017774
3 SSR3_mod_gex 0.014550
4 RPS27_mod_gex 0.013955
5 RPL19_mod_gex 0.013943
6 SEC61B_mod_gex 0.013772
7 HLA-DRB1_mod_gex 0.013506
8 FKBP11_mod_gex 0.011392
9 FOS_mod_gex 0.011151
10 RPL11_mod_gex 0.010829
11 TYROBP_mod_gex 0.010296
12 TXNDC11_mod_gex 0.010145
13 ST6GALNAC4_mod_gex 0.009714
14 EIF1_mod_gex 0.009353
15 DERL1_mod_gex 0.008687
16 MAN1A1_mod_gex 0.008624
17 GSPT1_mod_gex 0.008461
18 RPS23_mod_gex 0.008381
19 RPL34_mod_gex 0.008015
20 RPN1_mod_gex 0.007977
21 HLA-DQB1_mod_gex 0.007746
22 CD45RA_mod_landmark_protein 0.007548
23 JCHAIN_mod_gex 0.007463
24 RPL30_mod_gex 0.007312

      If you are interested in focusing on proteins and protein coding transcript markers, you may want to limit classification only protein coding genes. Additionally, limiting to only protein coding genes may help to remove some sequencing artifacts, and improve classification performance. This can be accomplished by removing these genes prior to classification, or using the limit argument in mmc.classify. See docs for more details.

      On other levels, such as CD4+ vs CD8+ T cells, you can see various proteins played a more important role in cell type classification. Note that the much higher values for importance here are in part due to the hyperparameters used to train the random forest at this level (see the Hierarchy Design tutorial for more details).

[10]:
mmc.feature_importances(h, 'CD4_CD8').head(25)
[10]:
Feature Importance
0 CD8a_mod_landmark_protein 0.049247
1 CD8A_mod_gex 0.041444
2 CD4_mod_landmark_protein 0.041034
3 CD8B_mod_gex 0.028655
4 CD14_mod_landmark_protein 0.019218
5 NKG7_mod_gex 0.015018
6 CD4_mod_gex 0.011821
7 CTSW_mod_gex 0.010518
8 CD25_mod_landmark_protein 0.010272
9 HCST_mod_gex 0.009794
10 CCL5_mod_gex 0.008704
11 CD40LG_mod_gex 0.006574
12 MT-CO3_mod_gex 0.006571
13 RPS27_mod_gex 0.006326
14 DUSP2_mod_gex 0.006306
15 CD45RA_mod_landmark_protein 0.006077
16 LINC02446_mod_gex 0.005983
17 CD3_mod_landmark_protein 0.005397
18 CST7_mod_gex 0.005114
19 GZMA_mod_gex 0.004686
20 GZMK_mod_gex 0.004410
21 RPL30_mod_gex 0.004340
22 MYL6_mod_gex 0.003844
23 GZMM_mod_gex 0.003840
24 FHIT_mod_gex 0.003468

Visualization to evaluate classification

      The plotting function below displays the top 25 most important features for the specified classification layer(s) in violin plots. These plots highlight which subsets express the most important features. By default, these plots group events by their high-confidence label and their predicted classification. This is incredibly useful for exploring discrepancies between high-confidence thresholding and classification, or for understanding performance on cells that were not high-confidence thresholded. For example, here we can identify that in the Lymphoid classifier, many of the events not labeled by high-confidence thresholding, but labeled NK/ILC ('? -> NK_ILC') actually expressed CD3 protein. Although this only represents ~200 events, it suggests there is room for improvement in the high-confidence thresholding or the hierarchy design. Note that these plots also display the expression of these markers on cells not included at this classification level ('nan -> nan') for reference.

[11]:
mmc.plot_important_features(adata, 'Lymphoid', h)
... storing 'hc -> class' as categorical
WARNING: dendrogram data not found (using key=dendrogram_hc -> class). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
/home/ubuntu/miniconda3/envs/mmochi/lib/python3.11/site-packages/scanpy/tools/_utils.py:40: UserWarning: You’re trying to run this on 33538 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
  warnings.warn(
WARNING: dendrogram data not found (using key=dendrogram_hc -> class). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
_images/Exploring_Feature_Importances_13_4.png

To best compare expression levels between classified subsets, ‘classification’ can be passed as an argument to display all subsets at an individual classification level.

[12]:
mask = (~adata.obsm['lin']['CD4_CD8_hc'].isin(['nan']))
mmc.plot_important_features(adata[mask].copy(), ['CD4_CD8'], h, reference = 'classification')
WARNING: Dendrogram not added. Dendrogram is added only when the number of categories to plot > 2
WARNING: Dendrogram not added. Dendrogram is added only when the number of categories to plot > 2
_images/Exploring_Feature_Importances_15_1.png

Great!

You can now gain more insight into the decisions your classifier makes, and use feature importances to identify cell type markers