mmc.Classification

class mmochi.hierarchy.Classification(markers, min_events=None, class_weight=None, in_danger_noise_checker=None, classifier=None, features_limit=None, feature_names=None, is_cutoff=False, max_training=None, force_spike_ins=[], calibrate=None, optimize_hyperparameters=None, hyperparameter_order=None, hyperparameters=None, hyperparameter_min_improvement=None, hyperparameter_optimization_cap=None, clf_kwargs=None)

A Hierarchy building block that describes subsetting rules and whose parent is a subset (or “all”). Classifications are added to a Hierarchy using the .add_classification() method.

Parameters:
  • markers (List[str]) – The features that will be used for high-confidence thresholding to define subsets beneath this classification. During thresholding, matching or similar feature names are looked up first in the provided data_key, then in the .var. See mmc.utils.marker for details on feature lookup.

  • min_events (Union[int, float, None] (default: None)) – The minimum number (or proportion of the total) of high-confidence events that must be identified in order to train a random forest classifier with each Subset. If too few events are identified, that Subset will be skipped.

  • class_weight (Union[dict, List[dict], None] (default: None)) – The class_weight strategy for handling scoring (“balanced” or “balanced_subsample”). This is passed to sklearn.ensemble.RandomForestClassifier.

  • in_danger_noise_checker (Union[str, bool, None] (default: None)) – Whether to check for (and amplify or remove, respectively) in danger and noise events. In danger events are high-confidence events at classification boundaries. Events labeled noise are high-confidence events whose nearest neighbors do not share the same label, and are thus likely mislabeled. Can be a boolean, “in danger only”, or “noise only” for only amplifying danger or removing noise respectively.

  • classifier (default: None) – The classifier to be used for classification. If defined, one must also define feature_names.

  • features_limit (Optional[List[str]] (default: None)) – List-like of str, or a dictionary in the format {'modality_1': ['gene_1', 'gene_2', …], 'modality_2': 'All'}. Specifies the default features allowed for training the classifier.

  • feature_names (Optional[List[str]] (default: None)) – Names of features used to train this classifier. Not set if classifier is None.

  • is_cutoff (Optional[bool] (default: False)) – The default for whether Classification nodes should be treated as a cutoff that triggers only high-confidence thresholding (True), or whether a random forest should be created and trained to make classifications (False). Cutoff layers can also be used with categorical or boolean data to subset down to a single tissue site or other relevant metadata.

  • max_training (Optional[int] (default: None)) – Specifies the default maximum number of events used for training. This directly affects training speed.

  • force_spike_ins (default: []) – The default list of Subsets for which training events should be sampled with spike-ins from across batches, even if individual batches have enough events for training. This can be useful for cell types that are very heterogeneous across batches.

  • calibrate (Optional[bool] (default: None)) – Default for whether to perform calibration on the prediction probabilities of the random forest classifier. Uncalibrated values reflect the % of trees in agreement. Calibrated values more closely reflect the % of calls correctly made at any given confidence level.

  • optimize_hyperparameters (Optional[bool] (default: None)) – Whether to optimize the classifier over the provided hyperparameters and their possible values, using the balanced accuracy score as the optimization’s cost function. Note: Optimization validation will be performed on untrained and uncalibrated data, taken from the randomly set aside data indicated by .obsm[key_added][level + '_opt_holdout'].

  • hyperparameter_order (Optional[list] (default: None)) – The order in which the parameters in hyperparameters are searched using linear optimization. Note: the values in hyperparameter_order must match the keys of hyperparameters.

  • hyperparameters (Optional[dict] (default: None)) – Key-value pairs where each key is the name of a hyperparameter and each value lists the possible values to check for that hyperparameter during optimization, e.g. bootstrap: [True, False].

  • hyperparameter_min_improvement (Union[float, dict, None] (default: None)) – Minimum increase in performance (as measured by balanced_accuracy) of n_estimators and max_features hyperparameters before stopping optimization. A dictionary may also be provided, with the feature as the key and the minimum increase as the value. Use -1 to prevent any early stopping based on minimum improvement.

  • hyperparameter_optimization_cap (Optional[float] (default: None)) – Value for the balanced accuracy score at which hyperparameter optimization will stop for that level. If none is provided, 1.0 (perfect) will be used.

  • clf_kwargs (dict (default: None)) – The keyword arguments for classification. For more information about other kwargs that can be set, please see: sklearn.ensemble.RandomForestClassifier. In the case of batch-integrated classification, n_estimators refers to the (approximate) total trees in the forest.
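
The interplay of hyperparameters, hyperparameter_order, hyperparameter_min_improvement and hyperparameter_optimization_cap described above can be illustrated with a small, self-contained sketch. This is not MMoCHi's implementation: the function sequential_search, the toy scoring function, and the exact stopping rule (stop searching a hyperparameter once a step improves the score by less than the threshold) are assumptions made for illustration, and toy_score merely stands in for a balanced accuracy measured on held-out events.

```python
from typing import Any, Callable, Dict, List, Union


def sequential_search(
    score_fn: Callable[[Dict[str, Any]], float],
    hyperparameters: Dict[str, List[Any]],
    hyperparameter_order: List[str],
    min_improvement: Union[float, Dict[str, float]] = 0.0,
    cap: float = 1.0,
) -> Dict[str, Any]:
    """Tune one hyperparameter at a time, in the given order (illustrative only)."""
    # The order must name exactly the hyperparameters that have candidate values.
    if set(hyperparameter_order) != set(hyperparameters):
        raise ValueError("hyperparameter_order must match the keys of hyperparameters")

    # Start from the first candidate value of every hyperparameter.
    best = {name: hyperparameters[name][0] for name in hyperparameter_order}
    best_score = score_fn(best)

    for name in hyperparameter_order:
        # A dict gives a per-hyperparameter threshold; a float applies to all.
        if isinstance(min_improvement, dict):
            threshold = min_improvement.get(name, 0.0)
        else:
            threshold = min_improvement
        for value in hyperparameters[name][1:]:
            if best_score >= cap:
                return best  # score is already at the cap: stop everything
            trial = {**best, name: value}
            gain = score_fn(trial) - best_score
            if gain > 0:
                best, best_score = trial, best_score + gain
            # Assumed stopping rule. Scores lie in [0, 1], so gain >= -1 and a
            # threshold of -1 can never trigger this early stop.
            if gain < threshold:
                break
    return best


def toy_score(params: Dict[str, Any]) -> float:
    """Hypothetical stand-in for balanced accuracy on a held-out set."""
    trees = {50: 0.80, 100: 0.90, 200: 0.905, 400: 0.906}[params["n_estimators"]]
    bonus = {"sqrt": 0.0, "log2": 0.02}[params["max_features"]]
    return trees + bonus


grid = {"n_estimators": [50, 100, 200, 400], "max_features": ["sqrt", "log2"]}
order = ["n_estimators", "max_features"]

best = sequential_search(toy_score, grid, order, min_improvement=0.01)
print(best)
```

With a threshold of 0.01 the search over n_estimators ends once a step gains less than 0.01, whereas min_improvement=-1 tries every candidate value. Lowering cap ends the whole search as soon as the score reaches it.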
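
To make the in_danger_noise_checker description more concrete, the sketch below tags events using their nearest neighbors. It is a simplified illustration, not MMoCHi's code: the function flag_events, the choice of k, the brute-force Euclidean neighbor search, and the thresholds (all neighbors disagree means noise, at least half disagree means in danger, a rule borrowed from Borderline-SMOTE) are assumptions, and the points and labels are made-up toy data.

```python
from math import dist
from typing import List, Sequence, Tuple


def flag_events(
    points: Sequence[Tuple[float, float]],
    labels: Sequence[str],
    k: int = 3,
) -> List[str]:
    """Tag each event as 'safe', 'in danger' or 'noise' from its k nearest neighbors."""
    flags = []
    for i, p in enumerate(points):
        # Indices of the k nearest other events (brute force, Euclidean distance).
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        disagree = sum(labels[j] != labels[i] for j in neighbors)
        if disagree == len(neighbors):
            flags.append("noise")      # no neighbor shares the label
        elif 2 * disagree >= len(neighbors):
            flags.append("in danger")  # at least half of the neighbors disagree
        else:
            flags.append("safe")
    return flags


# Toy data: two well-separated clusters, one likely mislabeled event sitting
# inside the "T" cluster, and one "T" event lying close to the "B" cluster.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),  # "T" cluster
    (5, 0), (5, 1), (6, 0), (6, 1),  # "B" cluster
    (0.5, 0.5),                      # labeled "B" but surrounded by "T"
    (3.4, 0.5),                      # labeled "T" but nearest to "B"
]
labels = ["T"] * 4 + ["B"] * 4 + ["B", "T"]

flags = flag_events(points, labels)
for point, label, flag in zip(points, labels, flags):
    print(point, label, flag)
```

In this toy data, the event inside the opposite cluster is tagged as noise (a candidate for removal), the event near the other cluster is tagged as in danger (a candidate for amplification), and the remaining events are safe.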

Methods