mmc.classify()

mmochi.classifier.classify(adata, hierarchy, key_added='lin', data_key=utils.DATA_KEY, x_data_key=None, x_modalities=None, batch_key=utils.BATCH_KEY, retrain=False, plot_thresholds=False, reduce_features_min_cells=25, allow_spike_ins=True, enforce_holdout=True, probabilities_suffix='_probabilities', resample_method='oversample', weight_integration=False, X=None, features_used=None, probability_cutoff=0, features_limit=None, skip_to=None, end_at=None, reference=None, external_holdout=False)

Classify subsets using the provided hierarchy. If the hierarchy has not yet been trained, this function trains the hierarchy and predicts cell types. If the hierarchy has been trained, this classifier will not retrain random forests unless explicitly asked to (retrain=True). Many classifier kwargs are defined in the hierarchy on a node-by-node basis. Please see mmc.Hierarchy() and mmc.add_classification() for additional details.

Classification information for every level is recorded in .obsm[key_added] ('lin' by default). [classification_layer]_class denotes the MMoCHi classification for each event. [classification_layer]_hc denotes the high-confidence thresholded category for each event. [classification_layer]_holdout denotes whether the event is part of the explicit hold-out for that layer. [classification_layer]_train is a boolean column indicating whether the event was used in training (noise events can be neither _holdout nor _train). [classification_layer]_traincounts denotes the number of times an event was used in training (amplified events can have _traincounts > 1). If SMOTE or its variations are used to balance training data, _traincounts will refer to rebalanced training counts and will not account for generated training events. [classification_layer]_probability indicates the confidence of the classifier at the specified layer for each event.

Note: Only the [classification_layer]_hc and [classification_layer]_class columns are created for cutoff layers.
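
The column naming described above can be sketched as a small helper. This is illustrative only: `expected_columns` is a hypothetical function, not part of MMoCHi, and the level names `level_1` and `level_2` are placeholders.

```python
# Hypothetical helper (not part of MMoCHi): lists the columns documented
# above for one classification level in adata.obsm[key_added].
SUFFIXES = ("_class", "_hc", "_holdout", "_train", "_traincounts", "_probability")
CUTOFF_SUFFIXES = ("_hc", "_class")  # cutoff layers only get these two


def expected_columns(level, is_cutoff=False):
    """Return the documented column names for a single level."""
    suffixes = CUTOFF_SUFFIXES if is_cutoff else SUFFIXES
    return [level + suffix for suffix in suffixes]


print(expected_columns("level_1"))
print(expected_columns("level_2", is_cutoff=True))
```

A list like this can be compared against the columns actually present in the results DataFrame to check which levels have been run.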

We recommend following this function with mmc.terminal_names() to put classifications into a .obs column
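
A typical call sequence might look like the sketch below. It assumes `adata` and `hierarchy` have already been prepared and that MMoCHi is importable as `mmochi`. The wrapper name `classify_and_label` is hypothetical, and `mmc.terminal_names()` is called with its defaults, which is assumed here to read from the default 'lin' key and to return the updated AnnData.

```python
def classify_and_label(adata, hierarchy):
    """Train (or reuse) the hierarchy, then write terminal names to .obs."""
    import mmochi as mmc  # imported lazily; assumes MMoCHi is installed

    # classify() returns both the annotated AnnData and the trained hierarchy.
    # retrain=False (the default) reuses any classifiers already in the hierarchy.
    adata, hierarchy = mmc.classify(adata, hierarchy, key_added="lin", retrain=False)

    # Assumption: terminal_names() with default arguments reads .obsm['lin']
    # and returns the AnnData with classifications added to .obs.
    adata = mmc.terminal_names(adata)
    return adata, hierarchy
```

Keeping the returned hierarchy is worthwhile, since it holds the trained classifiers and the thresholds used.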

Note: If this function terminates prematurely, adata.obs_names may be corrupted. The original adata.obs_names can be found in adata.obs['MMoCHi_obs_names'].
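
If a run is interrupted, the original names could be restored along these lines. This is a sketch rather than an MMoCHi function: `restore_obs_names` is a hypothetical helper, and it assumes the `MMoCHi_obs_names` column is still present in `adata.obs`.

```python
def restore_obs_names(adata, column="MMoCHi_obs_names"):
    """Copy the saved names in adata.obs[column] back to adata.obs_names.

    Returns True if names were restored, False if the backup column is absent.
    """
    if column not in adata.obs:
        # Nothing to restore: the backup column was never written or was removed.
        return False
    adata.obs_names = list(adata.obs[column])
    return True
```

Checking the return value makes it clear whether anything was changed before classification is rerun.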

Parameters:

adata (AnnData) – Object containing expression of one modality in the .X and optionally expression data for another modality in the .obsm[data_key]
hierarchy (Hierarchy) – Hierarchy design specifying one or more classification levels and subsets.
key_added (str (default: 'lin')) – Key in adata.obsm to store the resulting DataFrame of the classifier’s results
data_key (Union[str, list, None] (default: utils.DATA_KEY)) – Key(s) in adata.obsm or .var[utils.MODALITY_COLUMN] to be used for high-confidence thresholding.
x_data_key (Union[str, list, None] (default: None)) – Key(s) in adata.obsm or .var[utils.MODALITY_COLUMN] to be used, if undefined, defaults to data_key
x_modalities (Optional[str] (default: None)) – Column in adata.var to find modality labels, or name of the modality of the data in the .X. If None, defaults to 'gex'
batch_key (Optional[str] (default: utils.BATCH_KEY)) – Name of a column in adata.obs that corresponds to a batch for use in the classifier
retrain (bool (default: False)) – If the classification level of a hierarchy is already defined, whether to replace that classifier with a retrained one. If False, no new classifier would be trained, and that classifier is used for prediction of the data. Overrides reduce_features_min_cells to 0 (as this early-stage feature reduction often breaks re-running models).
plot_thresholds (bool (default: False)) – Whether to display histograms of expression distribution while performing high-confidence thresholding
reduce_features_min_cells (int (default: 25)) – Remove features that are expressed in fewer than this number of cells, passed to _reduce_features. Feature reduction can be a very powerful tool for improving classifier performance. Only used if X and features_used are not provided and retrain is True.
allow_spike_ins (bool (default: True)) – Whether to allow spike-ins when working with batched data. Spike-ins are performed if a subset is below the minimum threshold within an individual batch, but not the overall dataset. If False, errors may occur if there are no cells of a particular subset within a batch. Warning: spike-ins currently do not respect held-out data, meaning far less than 20% of the data may be held out if spike-ins are frequent.
enforce_holdout (bool (default: True)) – Whether to enforce a hold out of an additional 20% at each classification level, to prevent spike-ins from using that data.
probabilities_suffix (Optional[str] (default: '_probabilities')) – If defined, probability outputs of the full df of predict_probabilities will be saved to the .uns[level+probabilities_suffix]
resample_method (Optional[str] (default: 'oversample')) – Method used to resample high-confidence data prior to training, passed to _balance_training_classes. “Oversample” or None. Other methods are experimental. See _balance_training_classes and the imblearn package for details.
weight_integration (bool (default: False)) – Whether to have each batch represented equally in the forest, or weight the number of trees from each batch to their representation in the total dataset
X (Optional[csr_matrix] (default: None)) – Optional setup data to circumvent the initial classifier setup functions. Provide together with features_used; the presence of both overrides any predefined feature reduction techniques.
features_used (Optional[listlike of str] (default: None)) – Optional setup data to circumvent the initial classifier setup functions. Provide together with X; the presence of both overrides any predefined feature reduction techniques.
probability_cutoff (float (default: 0)) – Between 0 and 1, the minimum proportion of trees that must agree on a call for that event to be labeled by the classifier. This is experimental.
features_limit (Union[str, List[str], dict, None] (default: None)) – A list of str specifying features to limit to, in the format [feature_name]_mod_[modality], e.g.
skip_to (Optional[str] (default: None)) – Name of the hierarchy classification level to start at. This is extremely useful when debugging particular levels of the classifier.
end_at (Optional[str] (default: None)) – Name of the hierarchy classification level to end at prematurely. This is extremely useful when debugging particular levels of the classifier. The order of levels corresponds to the order they are defined in the hierarchy. Requires that the AnnData object has a predefined .obsm[key_added]. If skip_to is defined, it replaces only the columns that occur at and beyond that level.
reference (Optional[str] (default: None)) – Column in the adata.obs to be used for comparing (in the log file) the results of high-confidence thresholding to predetermined annotations or clusters
external_holdout (bool (default: False)) – Whether to omit events that are True in adata.obsm[key_added]['external_holdout'] from training, calibration, and hyperparameter optimization. These events will still have high-confidence thresholds applied and will be classified by applying the final model for each layer of the classifier. External hold-out can be defined using mmc.define_external_holdout()
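
To make probability_cutoff concrete, the sketch below applies a cutoff to per-tree votes for a single event. It is a plain-Python illustration, not MMoCHi's implementation: `label_with_cutoff` is hypothetical, the vote list stands in for a random forest's per-tree calls, and both the `>=` comparison and returning `None` for unlabeled events are assumptions.

```python
from collections import Counter


def label_with_cutoff(tree_votes, probability_cutoff=0.0):
    """Return the majority label if enough trees agree, otherwise None."""
    top_label, top_count = Counter(tree_votes).most_common(1)[0]
    proportion = top_count / len(tree_votes)  # share of trees backing the call
    return top_label if proportion >= probability_cutoff else None


votes = ["T cell"] * 6 + ["B cell"] * 4   # 60% of trees agree on "T cell"
print(label_with_cutoff(votes))           # cutoff 0: the event is always labeled
print(label_with_cutoff(votes, 0.75))     # 0.6 < 0.75: the event is left unlabeled
```

With the default cutoff of 0, every event receives a label, which matches the documented default behavior of labeling all events.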

Return type:

Tuple[AnnData, Hierarchy]

Returns:

adata (AnnData object) – Object with a .obsm[key_added] DataFrame containing, for each level, columns for the high-confidence classification (“_hc”), the held-out data not used for training (“_holdout”), and the prediction (“_class”). If probabilities_suffix is defined, the .uns also contains “_probabilities” keys corresponding to the prediction probabilities of each class for each event.
Hierarchy (Hierarchy object) – A trained hierarchy with classifiers built into it and a record of thresholds and other settings used.