mmc.Classification

class mmochi.hierarchy.Classification(markers, min_events=None, class_weight=None, in_danger_noise_checker=None, classifier=None, features_limit=None, feature_names=None, is_cutoff=False, max_training=None, force_spike_ins=[], calibrate=None, clf_kwargs={})

A Hierarchy building block that describes subsetting rules and whose parent is a subset (or “all”). Classifications can be added to a Hierarchy using the .add_classification() method.

Parameters:

markers (List[str]) – The features that will be used for high-confidence thresholding to define subsets beneath this classification. During thresholding, matching or similar feature names are looked up first in the provided data_key, then in the .var. See mmc.utils.marker for details on feature lookup.
min_events (Union[int, float, None] (default: None)) – The minimum number (or proportion of the total) of high-confidence events that must be identified in order to train a random forest classifier for each Subset. If not enough events are identified, that Subset will be skipped.
class_weight (Union[dict, List[dict], None] (default: None)) – The class_weight strategy for handling scoring (“balanced” or “balanced_subsample”). This is passed to sklearn.ensemble.RandomForestClassifier.
in_danger_noise_checker (Union[str, bool, None] (default: None)) – Whether to check for (and amplify or remove, respectively) in danger and noise events. In danger events are high-confidence events at classification boundaries. Events labeled noise are high-confidence events whose nearest neighbors do not share the same label, and are thus likely mislabeled. Can be a boolean, “in danger only”, or “noise only” for only amplifying danger or removing noise respectively.
classifier (default: None) – The classifier to be used for classification. If defined, one must also define feature_names.
features_limit (Optional[List[str]] (default: None)) – List-like of str, or a dictionary in the format {'modality_1': ['gene_1', 'gene_2', …], 'modality_2': 'All'}. Specifies the default features allowed for training the classifier.
feature_names (Optional[List[str]] (default: None)) – Names of features used to train this classifier. Not set if classifier is None.
is_cutoff (Optional[bool] (default: False)) – The default for whether Classification nodes should be treated as a cutoff, triggering only high-confidence thresholding (True), or whether a random forest should be created and trained to perform classification (False). Cutoff layers can also be used with categorical or boolean data to subset down to a single tissue site or other relevant metadata.
max_training (Optional[int] (default: None)) – Specifies the default maximum number of events used for training. This directly affects training speed.
force_spike_ins (default: []) – The default list of Subsets for which training events should be sampled with spike-ins from across batches, even if individual batches have enough events for training. This can be useful for cell types that are very heterogeneous across batches.
calibrate (Optional[bool] (default: None)) – Default for whether to perform calibration on the prediction probabilities of the random forest classifier. Uncalibrated values reflect the % of trees in agreement. Calibrated values more closely reflect the % of calls correctly made at any given confidence level.
clf_kwargs (dict (default: {})) – The keyword arguments for classification. For more information about other kwargs that can be set, please see: sklearn.ensemble.RandomForestClassifier. In the case of batch-integrated classification, n_estimators refers to the (approximate) total trees in the forest.
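As an illustration of the min_events rule described above, the following is a minimal sketch, not mmochi's implementation. The helper names resolve_min_events and subsets_to_train are hypothetical. The sketch assumes that an int is read as an absolute count and a float as a proportion of the total number of events, rounded up. It does not cover the None default.

```python
import math
from typing import Dict, List, Union


def resolve_min_events(min_events: Union[int, float], total_events: int) -> int:
    """Convert min_events into an absolute event count (hypothetical helper)."""
    if isinstance(min_events, float):
        # Assumption: a float is a proportion of all events, rounded up.
        return math.ceil(min_events * total_events)
    # Assumption: an int is already an absolute number of events.
    return int(min_events)


def subsets_to_train(
    high_conf_counts: Dict[str, int],
    min_events: Union[int, float],
    total_events: int,
) -> List[str]:
    """Return the Subsets with enough high-confidence events to train on.

    Subsets below the threshold are skipped, mirroring the documented rule.
    """
    threshold = resolve_min_events(min_events, total_events)
    return [name for name, n in high_conf_counts.items() if n >= threshold]


# Toy counts of high-confidence events per Subset (made-up numbers).
counts = {"T cell": 420, "B cell": 35, "NK cell": 8}

kept_by_count = subsets_to_train(counts, 20, total_events=1000)
kept_by_proportion = subsets_to_train(counts, 0.05, total_events=1000)
print(kept_by_count)
print(kept_by_proportion)
```

With these toy counts, an absolute minimum of 20 keeps the two larger Subsets, while a proportion of 0.05 of 1000 events keeps only the largest one.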

Methods