Change Log
All notable changes to this package will be documented in this file.
The format is based on Keep a Changelog and attempts to adhere to Semantic Versioning.
To update, replace your repository with a fresh clone, and install using pip, as before:
git clone https://github.com/donnafarberlab/mmochi.git
conda activate mmochi
cd mmochi
pip install .
Current version
[0.3.6dev] - 02OCT25
Fixed
Fixed loading and saving functions for mmc.Hierarchy to not append excessive ".hierarchy" file extensions if the user has already provided it
Fixed mmc.log_to_file to not append excessive ".log" file extensions if the user has already provided it
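The extension handling described in these fixes can be illustrated with a small sketch. The helper name `ensure_extension` is hypothetical and is not part of MMoCHi; it only shows the idea of appending an extension when the user has not already supplied it.

```python
def ensure_extension(filename: str, extension: str) -> str:
    """Append `extension` to `filename` unless it already ends with it."""
    # Hypothetical helper for illustration; not MMoCHi's actual implementation.
    if filename.endswith(extension):
        return filename
    return filename + extension


print(ensure_extension("immune", ".hierarchy"))            # immune.hierarchy
print(ensure_extension("immune.hierarchy", ".hierarchy"))  # immune.hierarchy
```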
[0.3.5] - 24MAY25
Added
Added force_model argument to mmc.run_all_thresholds to allow for forced calculation of a Gaussian mixture model for selected features (can be used to override default thresholds of 0 and .5 for genes)
Added user-readable output for mmc.Hierarchy objects listing the classification layers, terminal subset identities, and whether classifiers appear to have been trained.
Added links in documentation to published paper
Changed
Streamlined installation instructions and linked troubleshooting instructions for the Jupyter Widgets extension
Simplified tox testing and expanded to Python versions 3.11, 3.12, and 3.13
Replaced outdated Python3_8_requirements.txt file with a satisfactory set of dependencies identified by conda during tox-testing. These environment files can be found in the example_envs directory
Fixed
Fixed mmc.stacked_density_plots function to allow gene plotting
Fixed mmc.utils.obsm_to_X and mmc.utils.marker to appropriately support the data_key structure introduced in v0.3.2
Bug fix in mmc.plot_important_features preventing pytest completion
Fixed typos in “Extended human immune subsets (v1)” example hierarchy
Fixed spelling errors throughout documentation
Removed dtype argument from anndata.AnnData due to its pending deprecation
Replaced references to the .A property in sparse matrices with .toarray due to deprecation
Explicit casting of some masks to arrays to prevent dtype errors
Explicit casting of various dtypes to suppress warnings
Fixed bug in mmc.utils.generate_exclusive_features when reading in new AnnData objects
Removed unintentional return statements in testing scripts
Added explicit request for lxml_html_clean to the docs installation to satisfy sphinx dependencies
[0.3.2] - 23FEB24
Added
MMoCHi now supports more than 2 modalities by including each modality in the .X, then defining a .var column with each modality's name. This modality column name should be defined as mmc.MODALITY_COLUMN. This functionality is in beta
MMoCHi now supports a list of multiple data keys for mmc.utils.DATA_KEY and all relevant functions. E.g. when using two modalities, DATA_KEY could equal ['morphology','landmark_protein'] to pull both modalities when relevant and only 'landmark_protein' for methods explicitly using the .obsm.
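As a rough illustration of how a single data key or a list of data keys could select features by modality, consider the sketch below. It uses plain Python stand-ins (the `FEATURE_MODALITY` mapping, the feature names, and the `select_features` helper are all hypothetical) rather than an AnnData object or any MMoCHi function; in MMoCHi the modality labels would live in the .var column named by mmc.MODALITY_COLUMN.

```python
from typing import Dict, List, Union

# Hypothetical stand-in for a .var modality column: feature name -> modality.
FEATURE_MODALITY: Dict[str, str] = {
    "CD3": "landmark_protein",
    "CD19": "landmark_protein",
    "cell_area": "morphology",
    "CD3E": "gex",
}


def select_features(data_key: Union[str, List[str]]) -> List[str]:
    """Return features whose modality matches one key or any key in a list."""
    keys = [data_key] if isinstance(data_key, str) else list(data_key)
    return [f for f, modality in FEATURE_MODALITY.items() if modality in keys]


print(select_features("landmark_protein"))                  # ['CD3', 'CD19']
print(select_features(["morphology", "landmark_protein"]))  # ['CD3', 'CD19', 'cell_area']
```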
Fixed
Minor typos in docstrings.
Changed
Fancy thresholding sliders can now increment by 0.01 from 0.1.
X_modality default changed from ‘GEX’ to ‘gex’
Invalid utils.marker() calls will now raise AssertionErrors instead of ValueErrors if unable to find feature in data_key
[0.3.1] - 26JAN24
Added
Added community-submitted hierarchy for γδ T cells [#2]
Changed
Improved Hyperparameter Optimization tutorial.
Reorganized Example Hierarchies page
[0.3.0] - 23JAN24
Added
Users can now optionally define an “external hold out”. This hold out is defined before high-confidence thresholding, and thus can be isolated from all steps of MMoCHi’s training and preprocessing. While “internal hold out” (previously the only available hold out) is used to evaluate the fit of MMoCHi’s random forests at each level, external hold out can be used to evaluate MMoCHi’s overall classification performance. The new function mmc.define_external_hold_out() allows users to define a random subset of the data to be used as an external hold out. This adds a new column to the .obsm['lin'] called external_hold_out, which can be used in various functions.
Added an automated linear search hyperparameter optimization, which can be activated using new parameters in mmc.Hierarchy, and can be customized with hyperparameter_min_improvement and hyperparameter_optimization_cap. See this new tutorial for more information: Hyperparameter Optimization
Added mmc.Hierarchy.get_optimal_clf_kwargs and mmc.Hierarchy.get_clf_kwargs to get a dataframe displaying either the optimized or the original keyword arguments (hyperparameters) passed to Random Forest.
Added mmc.Hierarchy.get_clf_kwargs and mmc.Hierarchy.get_optimal_clf_kwargs functions to display default random forest classifier kwargs and ones selected by hyperparameter optimization (after mmc.classify has been run) respectively
Added a page to documentation for sharing default marker_bandwidth values for various antibodies and added more examples to the Example Hierarchies page.
Optional inclusion_mask can be passed to landmark registration so peaks can be detected on only a subset of events, but the warping can still be applied to all events
Added h.publication_export(), a helper function (currently in beta) for exporting supplementary tables describing the design of a MMoCHi hierarchy and thresholds used
Added a copy button to all code blocks in the documentation
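The idea behind an external hold out, a random subset set aside before any thresholding or training, can be sketched in plain Python. The function below is a hypothetical stand-in, not the signature of mmc.define_external_hold_out(), and the `fraction` and `seed` parameters are assumptions made for the example.

```python
import random
from typing import List


def random_hold_out_mask(n_events: int, fraction: float, seed: int = 0) -> List[bool]:
    """Flag roughly `fraction` of events as an external hold out."""
    # Hypothetical helper: picks a fixed-size random subset of event indices.
    rng = random.Random(seed)
    n_held = round(n_events * fraction)
    held = set(rng.sample(range(n_events), n_held))
    return [i in held for i in range(n_events)]


mask = random_hold_out_mask(n_events=1000, fraction=0.2)
print(sum(mask))  # 200 events flagged as held out
```

Events flagged True would be excluded from every training and preprocessing step and used only to evaluate overall classification performance.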
Fixed
Fixed how calibration uses held-out data, such that hold out data used for calibration is now separate from the data used for performance validation. When calibration (or hyperparameter optimization) is enabled, the internal hold out data is split in half. The half used for optimization/calibration is now indicated by a new column in the .obsm['lin'], called [level]_opt_hold_out.
Added requirement for matplotlib above version 3.6.1, as lower versions break scanpy’s handling of cmaps
Added requirement capping AnnData below version 0.10.2, as that version breaks anndata.concat (due to bugfix #1189)
Added requirement for scanpy to be at or above version 1.8.0, as there were strange issues with scanpy import at lower versions.
Fixed formatting issues with hierarchy .display() method
Bug making h.publication_export() unable to identify batch specific thresholds
Bug where clf_kwargs parameter was overwritten in the mmc.Classifier objects with optimal kwargs that were selected by hyperparameter optimization
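The hold-out split described in the first fix of this section can be pictured as follows. This is a minimal sketch with hypothetical names (`split_hold_out` is not an MMoCHi function), assuming a random, even split of the internal hold-out events.

```python
import random
from typing import List, Sequence, Tuple


def split_hold_out(hold_out_ids: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split hold-out events into an optimization/calibration half and a validation half."""
    ids = list(hold_out_ids)
    random.Random(seed).shuffle(ids)
    midpoint = len(ids) // 2
    return ids[:midpoint], ids[midpoint:]


opt_half, validation_half = split_hold_out(range(10))
# The two halves never overlap, so validation is not contaminated by calibration.
print(len(opt_half), len(validation_half))  # 5 5
```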
Changed
‘hold_out’ was replaced with ‘holdout’ for consistency (this was previously only partially executed in the codebase)
Silenced deprecation warnings from the numba package, as it is only used indirectly (e.g. in the umap package)
Silenced many warnings when running pytest
Import threshold and run_threshold from the mmc.thresholding module in __init__.py for consistency (so they are now accessible by mmc.threshold or mmc.run_threshold)
Past versions
[0.2.3] - 21AUG23
Fixed
Optimized a few steps to reduce peak memory usage by ~30%
Directly calculate train_counts (instead of mapping backwards)—improving accuracy in a few edge cases.
Fixed error where _traincounts would be set to 1.
Clarified nomenclature: events spiked in to train one batch and used as hold out in another are now set as _hold_out=False.
Updated pytests for additional debugging clarity.
Fixed typos in Input_Output_Specs.md
Removed temporary requirement limiting scikit-learn to below 1.3.0, as imbalanced-learn has updated to support scikit-learn==1.3.0
Adjusted dependencies specified for tox testing
Updated displays in tutorial notebooks to reflect bug fixes
Improved mmc.utils.umap_thresh() by preventing features that are columns in the .obs from being selected when markers is set to None, and by adding a plt.show() to the loop so that plots are shown progressively.
Updated Python3_8_requirements.txt to remove unnecessary packages and convert to conda format
Added
Added bins parameter to threshold plotting so that users can control the number of histogram bins
[0.2.2] - 30JUN23
Added
Automated documentation hosted by ReadTheDocs
Issue templates
Example MMoCHi hierarchies
Examples in Input/Output Specifications for how to correctly format various objects
Fixed
Updated documentation of various functions
Added temporary requirement limiting scikit-learn to below 1.3.0, as that update breaks imbalanced-learn
Removed mmochi package from Python3_8_requirements.txt
Prevented plot windows from opening while running pytest on some systems
[0.2.1] - 17JUN23
Added
Functionality to restore adata.obs_names if mmc.classify is interrupted.
Internally, mmc.classify converts adata.obs_names to indices, and remaps them at the end of the function.
Now, adata.obs_names can be restored from the temporary column: adata.obs['MMoCHi_obs_names']
New option for mmc.density_plot whether to hide events with 0 expression (default: True)
Changed
Nomenclature changed for columns in the .obsm: '_tcounts' was renamed to '_traincounts', '_proba' to '_probability' or '_probabilities' (in .uns), and the 'conf' default column name to 'certainty' (in the .obs). Note that although MMoCHi functions will not support this old format, previously generated classifications can be renamed to match this new format.
Default log-normalization in preprocess_adatas changed for GEX to counts per 10k and ADT to counts per 1k
Updated tutorials, docstrings, README, and input_output_specs for clarity
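Because previously generated classifications can be renamed to match the new format, a small renaming helper may be useful. The sketch below is hypothetical and operates on a plain list of column names; it assumes that the old names appear as suffixes after the level name (the level name "Lymphocyte" is only an example), which should be checked against your own data.

```python
from typing import List

# Old suffix -> new suffix, taken from the nomenclature change described above.
SUFFIX_RENAMES = {"_tcounts": "_traincounts", "_proba": "_probability"}


def rename_legacy_columns(columns: List[str]) -> List[str]:
    """Rename old-style column names; names without an old suffix are unchanged."""
    renamed = []
    for name in columns:
        for old, new in SUFFIX_RENAMES.items():
            if name.endswith(old):
                name = name[: -len(old)] + new
                break
        renamed.append(name)
    return renamed


print(rename_legacy_columns(["Lymphocyte_tcounts", "Lymphocyte_proba", "Lymphocyte_hc"]))
# ['Lymphocyte_traincounts', 'Lymphocyte_probability', 'Lymphocyte_hc']
```

With a pandas DataFrame `df`, the result could be applied with `df.columns = rename_legacy_columns(list(df.columns))`.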
Fixed
Removed leidenalg from requirements.txt to prevent conda install from hanging while trying to solve the environment
Bug handling np.nan in mmc.utils.umap_interrogate_level
Disabled some internal warnings in mmc.utils
[0.2.0] - 26MAY23
Added
Introduced new tutorials for Hierarchy Design, High Confidence Thresholding, Exploring Feature Importances, and Pretrained Classification
Edited the tutorials for Integrated Classification and Landmark Registration
Created a new helper function (mmc.umap_interrogate_level()) for plotting UMAPs to understand classification performance
Added variables for setting the default data_key and batch_key for many functions by changing mmc.DATA_KEY and mmc.BATCH_KEY. Note that the defaults for these two have now been changed for some functions
Added checks to ensure _obs, _gex, and _mod_ are not in the feature names inputted into MMoCHi
Changed
‘Ground Truth’ and ‘gt’ have been renamed to ‘High Confidence’ and ‘hc’ throughout the package for clarity
‘hold_out’ was replaced with ‘holdout’ for consistency
Nomenclature of column names in adata.obsm['lin'] has changed. Note that although MMoCHi functions will not support this old format, previously generated .obsm['lin'] dataframes contain enough data to be converted to the new format.
Columns in the .obsm['lin'] now include (see Input/Output Specifications for more detail):
{level}_hc: High-confidence threshold identity
{level}_holdout: Events explicitly held out from training data selection
{level}_train: Events used for random forest training
{level}_tcounts: Number of times events were duplicated during random forest training.
Notably, events that are identified as ‘noise’ or excluded during subsampling during training data selection will be False for both {level}_holdout and {level}_train
Coloring of mmc.run_all_thresholds() histograms was reversed to match those of mmc.umap_thresh().
Made many functions private (this should not affect any functions the user interacts with).
Fixed
Events held out from training were sometimes given NaN for the {level}_tcounts and {level}_train, which could be misinterpreted when typecasting to bool
Added garbage collection to reduce wasteful memory usage buildup after plotting UMAPs (due to occasional object duplication).
Improved pytests to validate post-classification structure of the .obsm['lin']
Fixed typos in some log and print statements
Updated docstrings and tutorials for compatibility with sphinx.
Replaced explicit reference to .obsm['lin'] with key_added parameter for mmc.plot_important_features()
Removed
Unnecessary call to clf.feature_importances_ in mmc.feature_importances()
Removed load_dir argument in mmc.Hierarchy(). Hierarchies can now be loaded from a full file path provided by the load parameter.
Removed optional-requirements.txt. All mandatory requirements are now listed in requirements.txt
[0.1.4] - 09MAR23
Added
Support for predicting on extra-large datasets stored in int64-indexed sparse matrices. (These are not supported by scikit-learn, so they are split into bite-sized chunks.)
Loading saved hierarchies from other directories, using the new load_dir argument to the mmc.Hierarchy() constructor
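Splitting a large matrix into row chunks, as described for int64-indexed sparse matrices, might look like the sketch below. The helper `iter_row_chunks` and the chunk size are hypothetical; the actual chunking logic inside MMoCHi may differ.

```python
from typing import Iterator, Tuple


def iter_row_chunks(n_rows: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row ranges that together cover all rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


# Each range could be sliced out (e.g. X[start:stop]) and predicted separately,
# with the per-chunk predictions concatenated afterwards.
print(list(iter_row_chunks(n_rows=10, chunk_size=4)))  # [(0, 4), (4, 8), (8, 10)]
```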
Changed
During ground-truth cleanup, the PCA is now run on scaled highly variable features, such that highly expressed features, or differences in expression levels between modalities, do not dominate.
Disable reduce_features_min_cells in mmc.classify when retrain == True, so that features are not filtered out when projecting a classifier onto a new dataset. If highly expressed features need to still be removed, this can be performed prior to inputting into the mmc.classify function
Fixed
**kwargs can now be passed through mmc.plot_confusion to sklearn.metrics.ConfusionMatrixDisplay.from_predictions()
Added support for recent versions of scikit-fda, which should include support for the current version: skfda==0.8.1.
[0.1.3] - 22DEC22
Added
Updated tutorials for Integrated Classification and Landmark Registration
Docstrings and typing hints have been written for most user functions, and many internal functions for code clarity.
The n_estimators used for each batch during batch-integration can now be weighted by representation in the dataset (weight_integration=True), or kept weighted equally in the forest (old performance, and the default)
min_cluster_size parameter to mmc.classifier.borderline_balance added
base parameter to mmc.hierarchy.Hierarchy.get_clf function to (if needed) strip the outer calibration classifier, and return a random forest
Added mmc.landmarking.update_peak_overrides as a convenience function for creating a dictionary of manual peak overrides, and mmc.landmarking.update_landmark_register as a function to test new landmark registration settings on a single batch-marker (after landmark registration has been run on the entire dataset)
Peak overrides can now be specified as single-positives, by passing a [None, float]
A new option for density plotting, mmc.landmarking.density_plot_total, has been added, to display the density of a single batch, single marker, in front of the density plot of the rest of the dataset (useful when integrating in one new batch)
Added mmc.landmarking.save_peak_overrides and mmc.landmarking.load_peak_overrides to save and load this object as a JSON. There is also added support for defining peak_overrides in mmc.landmarking.landmark_register_adts using the path to a JSON for automatic loading.
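To illustrate why JSON is a natural format for peak overrides, including single-positive entries written as [None, float], here is a round trip using only the standard library. The nested batch-then-marker layout and the names below are assumptions for the example and may not match the structure MMoCHi uses; mmc.landmarking.save_peak_overrides and mmc.landmarking.load_peak_overrides are not called here.

```python
import json

# Hypothetical layout: batch -> marker -> list of peak positions.
peak_overrides = {
    "batch_1": {
        "CD4": [0.5, 3.2],   # two peaks specified manually
        "CD8": [None, 2.8],  # single-positive entry written as [None, float]
    }
}

text = json.dumps(peak_overrides, indent=2)  # None is written as JSON null
restored = json.loads(text)                  # null is read back as None

print(restored == peak_overrides)  # True
```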
Changed
Moved most classification settings from mmc.classifier.classify to hierarchy to be edited on a per-node basis.
During classifier batch-integration, the n_estimators defined in the clf_kwargs now refers to the total number of trees in the forest (for more consistent performance with classification without batch-integration).
Removed the option for sigmoid calibration to address inconsistencies with calibration performance and imbalanced classes. Advanced calibration settings will be revisited in a future version. Currently, the calibrated classifier is trained on data that was not used for initial training, was defined in ground truthing, and (assuming there are enough events) only uses 1/3rd of that data.
The keywords for batch_id and batch have been updated to batch and batch_key in mmc.landmarking.stacked_density_plots and mmc.landmarking.density_plot for consistency with other functions.
Changed the order of returns for mmc.thresholding.threshold so that regardless of whether run=False or run=True, thresh is the first returned object, and the thresholded data is an optional extra returned object.
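The difference between equal and representation-weighted allocation of a fixed total number of trees across batches can be sketched as follows. This is an illustrative stand-in, not MMoCHi's implementation: `allocate_trees` is a hypothetical helper, and the rounding strategy (largest remainder) is an assumption chosen so that the per-batch counts always sum to the total n_estimators.

```python
from typing import Dict


def allocate_trees(total_trees: int, batch_sizes: Dict[str, int], weighted: bool) -> Dict[str, int]:
    """Divide `total_trees` among batches, equally or in proportion to batch size."""
    # Equal weighting is the same as pretending every batch has one event.
    sizes = batch_sizes if weighted else {b: 1 for b in batch_sizes}
    denom = sum(sizes.values())
    trees = {b: total_trees * n // denom for b, n in sizes.items()}
    remainders = {b: total_trees * n % denom for b, n in sizes.items()}
    # Hand out any leftover trees to the batches with the largest remainders.
    leftover = total_trees - sum(trees.values())
    for b in sorted(sizes, key=lambda b: remainders[b], reverse=True)[:leftover]:
        trees[b] += 1
    return trees


sizes = {"A": 6000, "B": 3000, "C": 1000}
print(allocate_trees(100, sizes, weighted=True))   # {'A': 60, 'B': 30, 'C': 10}
print(allocate_trees(100, sizes, weighted=False))  # as even as possible, summing to 100
```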
Fixed
Improved pytest coverage of all modules, to test broader use-cases.
Specified peak overrides in mmc.landmarking._landmark_registration should now more accurately reflect the closest possible mapping of those values to the proper position in the FDA function.
Using return_graph=True and plot=False together on mmc.hierarchy.Hierarchy.display will now return None (expected behavior) instead of throwing an error.
When initializing a new mmc.hierarchy.Classification with a predefined classifier, there is now error checking to ensure feature_names are also provided
Warnings about any cython import failures during the import of scikit-fda are now silenced.
Fixed performance of mmc.utils.generate_exclusive_features if adatas is given as a list of str. Previously this would mistakenly return an empty list.
Fixed reading in 10x formatted .h5 files without url backend using mmc.utils.preprocess_adatas
Fixed get_data function when passed a keyword including _obs in the variable name
Added fixes for cases where thresholds in umap_thresh were out of the bounds of the data.
Fixed error where if features were limited on a per-classification-level basis, the wrong set of features were passed to generate the training matrices.
Fixed error if no clusters needed to be balanced in mmc.classifier.balance_borderline
Removed
Printing of the feature limit after set up of the classification.
Known Issues
No support currently for MuData objects. All I need to do is add a wrapper to convert a mudata object to an acceptable anndata object (just need to create a concat version, where a col in .var refers to modality, and another col corresponds to any “to_use”-type columns, joined together)
Currently, there is only full support for two modalities: the .X, and a data_key corresponding to a .obsm location. There is partial support for AnnData objects with multiple modalities in the .var, but this is not yet supported by the mmc.utils.get_data or mmc.utils.marker functions, and results in errors when used for classification. In the future, there will be a data_key location referring to either the .obsm or to a column in the .var, allowing for a .var column that specifies many (not just 2) modalities.
There is currently no validation of data_keys or feature names, but they cannot include _obs, _gex, or _mod_.