mmc.Classification
- class mmochi.hierarchy.Classification(markers, min_events=None, class_weight=None, in_danger_noise_checker=None, classifier=None, features_limit=None, feature_names=None, is_cutoff=False, max_training=None, force_spike_ins=[], calibrate=None, optimize_hyperparameters=None, hyperparameter_order=None, hyperparameters=None, hyperparameter_min_improvement=None, hyperparameter_optimization_cap=None, clf_kwargs=None)
A Hierarchy building block describing subsetting rules, whose parent is a subset (or “all”). Classifications can be added to a Hierarchy using the .add_classification() method.
- Parameters:
markers (List[str]) – The features that will be used for high-confidence thresholding to define subsets beneath this classification. During thresholding, matching or similar feature names are looked up first in the provided data_key, then in the .var. See mmc.utils.marker for details on feature lookup.
min_events (Union[int, float, None] (default: None)) – The minimum number (or proportion of the total) of high-confidence events that must be identified in order to train a random forest classifier with each Subset. If not enough events are identified, that Subset will be skipped.
class_weight (Union[dict, List[dict], None] (default: None)) – The class_weight strategy for handling scoring (“balanced” or “balanced_subsample”). This is passed to sklearn.ensemble.RandomForestClassifier.
in_danger_noise_checker (Union[str, bool, None] (default: None)) – Whether to check for in danger and noise events, amplifying the former and removing the latter. In danger events are high-confidence events at classification boundaries. Noise events are high-confidence events whose nearest neighbors do not share the same label, and are thus likely mislabeled. Can be a boolean, or “in danger only” or “noise only” to only amplify in danger events or only remove noise events, respectively.
classifier (default: None) – The classifier to be used for classification. If defined, feature_names must also be defined.
features_limit (Optional[List[str]] (default: None)) – Listlike of str, or a dictionary in the format {'modality_1': ['gene_1', 'gene_2', …], 'modality_2': 'All'}. Specifies the default features allowed for training the classifier.
feature_names (Optional[List[str]] (default: None)) – Names of the features used to train this classifier. Not set if classifier is None.
is_cutoff (Optional[bool] (default: False)) – The default for whether Classification nodes should be treated as a cutoff, triggering only high-confidence thresholding (True), or whether a random forest should be created and trained to make the classification (False). Cutoff layers can also be used with categorical or boolean data to subset down to a single tissue site or other relevant metadata.
max_training (Optional[int] (default: None)) – Specifies the default maximum number of events used for training. This directly affects training speed.
force_spike_ins (default: []) – The default list of Subsets for which training events should be sampled with spike-ins from across batches, even if individual batches have enough events for training. This can be useful for cell types that are very heterogeneous across batches.
calibrate (Optional[bool] (default: None)) – Default for whether to perform calibration on the prediction probabilities of the random forest classifier. Uncalibrated values reflect the percentage of trees in agreement. Calibrated values more closely reflect the percentage of calls correctly made at any given confidence level.
optimize_hyperparameters (Optional[bool] (default: None)) – Whether to optimize the classifier over the provided hyperparameters and their possible values, using the balanced accuracy score as the optimization's cost function. Note: validation during optimization is performed on data not used for training or calibration, namely the randomly set-aside events indicated by .obsm[key_added][level + '_opt_holdout'].
hyperparameter_order (Optional[list] (default: None)) – The order in which the parameters in hyperparameters are searched during linear optimization. Note: the values in hyperparameter_order must match the keys of hyperparameters.
hyperparameters (Optional[dict] (default: None)) – Key-value pairs where each key is the name of a hyperparameter and each value is the list of possible values to check for that hyperparameter during optimization, e.g. bootstrap: [True, False].
hyperparameter_min_improvement (Union[float, dict, None] (default: None)) – Minimum increase in performance (as measured by balanced accuracy) required when optimizing the n_estimators and max_features hyperparameters; optimization stops once the improvement falls below this value. A dictionary may also be provided, with the hyperparameter as the key and the minimum increase as the value. Use -1 to prevent any early stopping based on minimum improvement.
hyperparameter_optimization_cap (Optional[float] (default: None)) – Value of the balanced accuracy score at which hyperparameter optimization will stop for that level. If none is provided, 1.0 (perfect) will be used.
clf_kwargs (dict (default: None)) – The keyword arguments for classification. For more information about other kwargs that can be set, see sklearn.ensemble.RandomForestClassifier. In the case of batch-integrated classification, n_estimators refers to the (approximate) total number of trees in the forest.
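The interaction between hyperparameters, hyperparameter_order, hyperparameter_min_improvement and hyperparameter_optimization_cap may be easier to follow with a small sketch. The code below is not MMoCHi's implementation; it is a minimal, pure-Python illustration of a one-at-a-time ("linear") search under stated assumptions. The names linear_search, score_fn and toy_score are hypothetical, and toy_score merely stands in for a balanced accuracy measured on held-out events.

```python
from typing import Any, Callable, Dict, List, Tuple


def linear_search(
    hyperparameters: Dict[str, List[Any]],
    hyperparameter_order: List[str],
    score_fn: Callable[[Dict[str, Any]], float],
    min_improvement: float = 0.0,
    optimization_cap: float = 1.0,
) -> Tuple[Dict[str, Any], float]:
    """Tune one hyperparameter at a time, in the given order (illustrative only)."""
    if set(hyperparameter_order) != set(hyperparameters):
        raise ValueError("hyperparameter_order must match the keys of hyperparameters")

    # Assumption: start from the first candidate value of every hyperparameter.
    best = {name: values[0] for name, values in hyperparameters.items()}
    best_score = score_fn(best)

    for name in hyperparameter_order:
        for value in hyperparameters[name][1:]:
            if best_score >= optimization_cap:
                return best, best_score  # cap reached: stop the whole search
            trial = {**best, name: value}
            gain = score_fn(trial) - best_score
            if gain > 0:
                best, best_score = trial, best_score + gain
            # Stop tuning this hyperparameter once the gain is too small.
            # With min_improvement=-1 this never triggers for scores in [0, 1].
            if gain < min_improvement:
                break
    return best, best_score


def toy_score(params: Dict[str, Any]) -> float:
    # Hypothetical stand-in for balanced accuracy on held-out events.
    score = 0.70
    score += {50: 0.00, 100: 0.10, 200: 0.12}[params["n_estimators"]]
    score += {"sqrt": 0.00, "log2": -0.05}[params["max_features"]]
    return round(score, 2)


search_space = {"n_estimators": [50, 100, 200], "max_features": ["sqrt", "log2"]}
search_order = ["n_estimators", "max_features"]

best, best_score = linear_search(
    search_space, search_order, toy_score, min_improvement=0.05
)
print(best, best_score)
```

In this sketch, the search over n_estimators stops after the step from 100 to 200 because the gain there is below the 0.05 threshold, and it then moves on to max_features. Lowering optimization_cap ends the whole search as soon as the current best score reaches the cap.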
Methods