mmc.Hierarchy
- class mmochi.hierarchy.Hierarchy(default_min_events=0.001, default_class_weight='balanced_subsample', default_clf_kwargs=dict(max_depth=20, n_estimators=100, n_jobs=-1, bootstrap=True, verbose=True, max_features='sqrt'), default_in_danger_noise_checker=True, default_is_cutoff=False, default_features_limit=None, default_max_training=20000, default_force_spike_ins=[], default_calibrate=True, default_optimize_hyperparameters=False, default_hyperparameter_order=['n_estimators', 'max_depth', 'max_features', 'bootstrap'], default_hyperparameters={'n_estimators': [50, 100, 200, 400, 800, 1200], 'max_depth': [None, 10, 25], 'max_features': ['log2', 'sqrt', 0.05, 0.1], 'bootstrap': [True, False]}, default_hyperparameter_min_improvement={'n_estimators': 0.004, 'max_features': 0.004}, default_hyperparameter_optimization_cap=0.98, load=None)
Class to organize a MMoCHi hierarchy. The Hierarchy is a tree with alternating subset and classification nodes for progressively annotating cell types. Subset nodes define cell populations and the Hierarchy is initialized with a root Subset “All”, representing all events in the dataset. All other Subsets originate from a Classification node. Classification nodes are defined with a list of markers (for high-confidence labeling), and normally trigger selection of high-confidence events, training of a random forest classifier, and prediction. If a classification node is a cutoff, it will only trigger a selection of high-confidence events, and only those events will be cast into subsets. Subset nodes also contain the cell type definitions used for high-confidence thresholding.
When initializing the Hierarchy, you can also define classification defaults, which can be further customized for each Classification node.
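The alternating structure described above can be illustrated with a small stand-in class. This is a minimal sketch, not MMoCHi's implementation: `ToyHierarchy` and `ToyNode` are hypothetical names, the node names are made up, and the sketch omits the markers and high-confidence definitions that the real `add_classification` and `add_subset` take. It only shows the root Subset "All", the rule that Classifications sit under Subsets (and vice versa), and a hierarchy-wide default that a single node can override.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyNode:
    """One node in the toy tree; `kind` is 'subset' or 'classification'."""
    name: str
    kind: str
    parent: Optional[str] = None
    is_cutoff: Optional[bool] = None  # only meaningful for classification nodes
    children: List[str] = field(default_factory=list)


class ToyHierarchy:
    """Hypothetical stand-in that only enforces the alternating structure."""

    def __init__(self, default_is_cutoff: bool = False) -> None:
        self.default_is_cutoff = default_is_cutoff
        # The tree always starts from a root Subset named "All".
        self.nodes: Dict[str, ToyNode] = {"All": ToyNode("All", "subset")}

    def _add(self, name: str, kind: str, parent: str, parent_kind: str,
             is_cutoff: Optional[bool] = None) -> None:
        if name in self.nodes:
            raise ValueError(f"{name!r} already exists")
        if parent not in self.nodes or self.nodes[parent].kind != parent_kind:
            raise ValueError(f"{parent!r} is not a {parent_kind} node")
        self.nodes[name] = ToyNode(name, kind, parent, is_cutoff)
        self.nodes[parent].children.append(name)

    def add_classification(self, name: str, parent_subset: str,
                           is_cutoff: Optional[bool] = None) -> None:
        # A per-node setting wins; otherwise fall back to the hierarchy default.
        cutoff = self.default_is_cutoff if is_cutoff is None else is_cutoff
        self._add(name, "classification", parent_subset, "subset", cutoff)

    def add_subset(self, name: str, parent_classification: str) -> None:
        self._add(name, "subset", parent_classification, "classification")


# Hypothetical node names, for illustration only.
toy = ToyHierarchy()
toy.add_classification("Gross", "All")
toy.add_subset("Lymphocyte", "Gross")
toy.add_subset("Myeloid", "Gross")
toy.add_classification("Lymphoid", "Lymphocyte", is_cutoff=True)
```

Attaching a Subset directly to another Subset raises a `ValueError` in this sketch, mirroring the rule that every Subset other than "All" originates from a Classification node.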
- Parameters:
default_min_events (Union[int,float] (default: 0.001)) – The default minimum number of (or proportion of total) high-confidence events that must be identified in order to train a random forest classifier with each Subset. If not enough events are identified, that Subset will be skipped.
default_class_weight (Union[str,dict,List[dict]] (default: 'balanced_subsample')) – The default class_weight strategy for handling scoring. This is passed to sklearn.ensemble.RandomForestClassifier.
default_clf_kwargs (dict (default: dict(max_depth=20, n_estimators=100, n_jobs=-1, bootstrap=True, verbose=True, max_features='sqrt'))) – The default keyword arguments for classification. For more information about other kwargs that can be set, see sklearn.ensemble.RandomForestClassifier. In the case of batch-integrated classification, n_estimators refers to the (approximate) total number of trees in each forest.
default_in_danger_noise_checker (Union[str,bool] (default: True)) – The default for whether to check for in danger and noise events (which are amplified and removed, respectively). In danger events are high-confidence events at classification boundaries. Noise events are high-confidence events whose nearest neighbors do not share the same label and are thus likely mislabeled. Can be a boolean, or 'in danger only' / 'noise only'.
default_is_cutoff (Union[bool,str] (default: False)) – Whether Classification nodes should be treated as a cutoff by default (triggering only high-confidence thresholding) or non-cutoff (where a random forest is trained and all events are classified).
default_features_limit (Union[List[str],Dict[str,List[str]],None] (default: None)) – Listlike of str or dictionary in the format {'modality_1': ['gene_1', 'gene_2', …], 'modality_2': 'All'}. Specifies the default features allowed for training the classifier.
default_max_training (int (default: 20000)) – Specifies the default maximum number of events used for training. This directly affects training speed.
default_force_spike_ins (List[str] (default: [])) – The default list of Subsets for which training events should be sampled with spike-ins from across batches, even if individual batches have enough events for training. This can be useful for cell types that are very heterogeneous across batches.
default_calibrate (bool (default: True)) – Default for whether to perform calibration on the prediction probabilities of the random forest classifier. Uncalibrated values reflect the percent of trees in agreement. Calibrated values more closely reflect the percent of calls correctly made at any given confidence level.
default_optimize_hyperparameters (bool (default: False)) – Whether to perform hyperparameter optimization of the random forest by default. Optimization occurs via a linear search of potential parameters and significantly slows down the classification process. Note: Aside from n_estimators, the first provided value for each parameter will be used for optimization of earlier hyperparameters.
default_hyperparameter_order (List[str] (default: ['n_estimators', 'max_depth', 'max_features', 'bootstrap'])) – If optimize_hyperparameters is true, the order in which to perform the linear hyperparameter optimization. Optimization tests all variations of one hyperparameter, then uses the best value found when optimizing the next. This list should have the same values as the keys of hyperparameters.
default_hyperparameters (Dict[str,list] (default: {'n_estimators': [50, 100, 200, 400, 800, 1200], 'max_depth': [None, 10, 25], 'max_features': ['log2', 'sqrt', 0.05, 0.1], 'bootstrap': [True, False]})) – If optimize_hyperparameters is true, a dictionary mapping each hyperparameter name to the possible values to check for that hyperparameter. The classifier will be fit a number of times equal to the number of values in this dictionary.
default_hyperparameter_min_improvement (Union[float,dict,None] (default: {'n_estimators': 0.004, 'max_features': 0.004})) – Default minimum increase in performance (as measured by balanced_accuracy) of the n_estimators and max_features hyperparameters before stopping optimization. A dictionary may also be provided, with the hyperparameter as the key and the minimum increase as the value. Use -1 to prevent any early stopping based on minimum improvement.
default_hyperparameter_optimization_cap (Optional[float] (default: 0.98)) – Default value for the balanced accuracy score at which hyperparameter optimization will stop for that level. If none is provided, 1.0 (perfect) will be used.
load (Optional[str] (default: None)) – Either None (to initiate a new hierarchy) or a path to a hierarchy to load (exclude .hierarchy in the path). Note that loading a hierarchy overrides all other defaults.
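The linear hyperparameter search described by the optimization parameters can be sketched as a one-at-a-time sweep. This is a rough illustration under stated assumptions, not MMoCHi's implementation: `linear_search` and `toy_score` are hypothetical names, the scoring function is a made-up stand-in for balanced accuracy, every hyperparameter (including n_estimators) simply starts from its first listed value, and the early-stopping rule shown (stop sweeping a hyperparameter once a candidate's gain falls below its minimum improvement) is one plausible reading of the description.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def linear_search(
    order: List[str],
    grid: Dict[str, list],
    score_fn: Callable[[Dict[str, Any]], float],
    min_improvement: Optional[Dict[str, float]] = None,
    cap: float = 1.0,
) -> Tuple[Dict[str, Any], float]:
    """Optimize one hyperparameter at a time, in the given order."""
    min_improvement = min_improvement or {}
    # Assumption: start from the first listed value of every hyperparameter.
    current = {name: grid[name][0] for name in order}
    best_score = score_fn(current)
    if best_score >= cap:
        return current, best_score
    for name in order:
        # -1 means "never stop early" for this hyperparameter.
        threshold = min_improvement.get(name, -1)
        for value in grid[name][1:]:
            trial = {**current, name: value}
            score = score_fn(trial)
            if threshold >= 0 and score - best_score < threshold:
                break  # gain too small: stop sweeping this hyperparameter
            if score > best_score:
                current, best_score = trial, score
            if best_score >= cap:
                return current, best_score  # good enough: stop everything
    return current, best_score


def toy_score(params: Dict[str, Any]) -> float:
    # Hypothetical stand-in for a cross-validated balanced accuracy.
    base = {50: 0.90, 100: 0.93, 200: 0.935, 400: 0.936}[params["n_estimators"]]
    bonus = {None: 0.0, 10: -0.02, 25: 0.01}[params["max_depth"]]
    return base + bonus


grid = {"n_estimators": [50, 100, 200, 400], "max_depth": [None, 10, 25]}
best, score = linear_search(
    order=["n_estimators", "max_depth"],
    grid=grid,
    score_fn=toy_score,
    min_improvement={"n_estimators": 0.004},
    cap=0.98,
)
```

With these toy scores, the sweep over n_estimators stops at 200 because moving to 400 gains only about 0.001, and max_depth then settles on 25. The number of fits grows with the sum of the list lengths rather than their product, which is why a linear search is cheaper than a full grid search.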
Methods
add_classification(name, parent_subset, markers) – Add a Classification beneath a Subset.
add_subset(name, parent_classification, values) – Add a Subset beneath a Classification node.
batchless_thresholds([name, batch]) – Removes batch-specific thresholds and sets each threshold to its average across batches.
check_all_markers(adata[, data_key]) – Asserts all markers in hierarchy identified by .get_all_markers() are in adata.X or .obsm[data_key].
classification_markers(name) – Provides markers used in one Classification node paired with the high-confidence definitions for each of its Subset nodes.
classification_parents(name) – Provides the names of a node's parent and grandparent.
color_dict([new_color_palette, mode, ...]) – Provides a dictionary of colors associated with each subset in the hierarchy.
copy() – Performs a hard copy of the hierarchy (completely unlinked to the original).
display([plot, return_graph, ...]) – Display the hierarchy in a user-friendly format.
drop_threshold(marker[, name, batch]) – Remove thresholds from the database.
flatten_children(parent_subset_to_dissolve) – Flattens child nodes of the hierarchy.
get_all_markers() – Provides a list of all the markers used for high-confidence thresholding.
Provides a list of all classification (or cutoff) nodes in the hierarchy.
get_clf(name[, base]) – Gets the classifier and feature names of a given node.
get_clf_kwargs([levels, kwargs]) – Provides the default clf kwargs from the hierarchy.
get_info(name, info_type) – Gets specified information for a node in the hierarchy.
get_optimal_clf_kwargs([levels, kwargs]) – Provides the clf kwargs used in classification at specified levels of the classifier.
get_threshold_info(marker, name[, batch, ...]) – Identifies and returns threshold information, with support for searching all levels or batches if the specified location lacks information.
has_clf(name) – Checks whether a given node has a trained classifier defined.
load_thresholds(df[, verbose]) – Loads in thresholds from a .csv file.
publication_export([adata, batches, ...]) – Creates a formatted .csv file of your hierarchy, including its high-confidence definitions and thresholds.
Removes all thresholds from thresholds DataFrame.
run_all_thresholds(adata[, data_key, ...]) – Runs thresholding using the thresholding.threshold() function.
save(name) – Saves the Hierarchy as a .hierarchy file.
save_thresholds([save_path, non_destructive]) – Saves thresholds as a .csv file. With non_destructive saving, the old file is loaded and new definitions are appended to it.
set_clf(name, clf, feature_names) – Stores a trained classifier and a list of features used for training of a specified classification level.
set_threshold(marker, thresholds, interactive) – Sets a threshold in the Hierarchy for one marker.
subsets_info(name) – Provides information on the subsets beneath a classification layer and their high-confidence threshold definitions.
to_graphviz([supress_labels, node_width, ...]) – Exports the tree in the dot format of the graphviz software, which can be useful for plotting.
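The averaging that batchless_thresholds is described as performing can be pictured with plain dictionaries. This is only a sketch of the idea under assumptions: the Hierarchy itself keeps thresholds in a DataFrame, whereas here each threshold is assumed to be a (lower, upper) pair keyed by (marker, node name, batch). The function name `average_across_batches`, the batch label "All", and the marker and batch names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]    # (marker, node name, batch)
Bounds = Tuple[float, float]  # assumed (lower, upper) threshold pair


def average_across_batches(
    thresholds: Dict[Key, Bounds], all_label: str = "All"
) -> Dict[Key, Bounds]:
    """Collapse batch-specific thresholds into one averaged entry per marker and node."""
    grouped: Dict[Tuple[str, str], List[Bounds]] = defaultdict(list)
    for (marker, name, _batch), bounds in thresholds.items():
        grouped[(marker, name)].append(bounds)
    return {
        (marker, name, all_label): (
            mean(b[0] for b in group),  # average lower bound
            mean(b[1] for b in group),  # average upper bound
        )
        for (marker, name), group in grouped.items()
    }


# Hypothetical markers, batches, and values.
per_batch = {
    ("CD3", "All", "batch1"): (1.0, 2.0),
    ("CD3", "All", "batch2"): (2.0, 4.0),
    ("CD19", "All", "batch1"): (0.5, 1.5),
}
batchless = average_across_batches(per_batch)
```

After the call, every batch-specific entry has been replaced by a single entry per marker and node, so the same threshold applies regardless of batch.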