
v1.4.0

@Bribak released this 11 Nov 13:11
f9358d4

Change Log

For Version 1.4.0

  • Added an example workflow/tutorial for differential glycomics analysis to the Examples tab in the documentation
  • Added additional tests via pytest
  • Cleaned up repo with more stringent .gitignore, removing unnecessary files
  • Added hover-over tooltips to the glycoworkGUI, describing how the input files should be formatted
  • Exposed more keyword arguments of get_heatmap in GUI (CLR transformation + tick label control)

glycan_data

  • Broadened the motif definition of “Mucin_elongated_core2” in motif_list
  • Refined the motif definitions of the O-glycan core motifs in motif_list to prevent overlaps
  • Larger (and cleaner) datasets for: df_glycan, df_species, df_tissue, df_disease, and glycan_binding
  • Updated lib from 2,366 to 2,565 glycoletters

loader

  • Added the glycoproteomics_data_loader, to request stored glycoproteomics datasets
  • Added human_milk_N_PMID34087070 and human_keratinocytes_PMID37956981 as example datasets for glycoproteomics_data_loader (entries are identified in the “Glycosite” column in the format protein_site_composition)
  • Added HexOS and HexNAcOS monosaccharide lists to be used in downstream functions
  • Added modification_map to map which monosaccharides can be modified with which post-biosynthetic modification
  • Added DataFrameSerializer to have a version-independent serializer for handling df_glycan

stats

  • Added get_glycoform_diff to aggregate the differential expression of glycoforms across glycopeptides or glycoproteins via Fisher’s Combined Probability Test
  • Fixed a pandas deprecation warning in replace_outliers_winsorization (for pandas >= 2.2.2)
  • Added get_glm and process_glm_results to fit and analyze generalized linear models, with interaction terms, to grouped glycoproteomics data
  • Added partial_corr to calculate regularized partial correlations
  • Added estimate_technical_variance and perform_tests_monte_carlo to account for technical variation in glycomics data
  • Added the “cap_side” keyword argument to replace_outliers_with_IQR_bounds and replace_outliers_winsorization to allow users to cap outliers on the “both”, “upper”, or “lower” sides; default: “both”
  • Fixed the global NumPy RNG for clr_transformation and alr_transformation to ensure reproducibility
  • Added the “correction_method” keyword argument to correct_multiple_testing, to allow users to switch between regular Benjamini-Hochberg and two-stage Benjamini-Hochberg
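
As an illustration of the Fisher’s Combined Probability Test mentioned for get_glycoform_diff, here is a minimal, standard-library sketch of the statistic itself. The function name fisher_combined_pvalue and the example p-values are hypothetical; this is not glycowork’s implementation, which may differ in how it groups glycoforms and handles edge cases.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Test statistic X = -2 * sum(ln p_i); under the null hypothesis it
    # follows a chi-square distribution with 2k degrees of freedom.
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom the chi-square survival function is
    # sf(x; 2k) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical per-glycoform p-values for one protein (made-up numbers).
glycoform_p_values = [0.04, 0.20, 0.01]
statistic, combined_p = fisher_combined_pvalue(glycoform_p_values)
print(f"X = {statistic:.3f}, combined p = {combined_p:.4f}")
```

With a single p-value the combined result equals that p-value, which is a quick sanity check on the closed-form survival function used here.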

motif

processing

  • Added support for sulfated monosaccharides to get_possible_monosaccharides
  • Added parse_glycoform, infer_features_from_composition, and process_for_glycoshift as helper functions in glycoproteomics data analysis
  • Expanded canonicalize_composition to deal with compositions of type “9 2 0 0”
  • Fine-tuned canonicalize_iupac so that it no longer breaks the formatting of sequences ending in “GlcOP-ol”
  • Added de_wildcard_glycoletter to retrieve a random specified monosaccharide/linkage of the general type present as a wildcard (e.g., Hex->Gal)
  • Added get_class to return the glycan class as a string, given a glycan sequence
  • If choose_correct_isoform is provided with isomers that have different amounts of ambiguities, it will now prioritize the isomers with the fewest ambiguities
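
The new prioritization in choose_correct_isoform can be pictured with a small sketch. It assumes that ambiguity can be approximated by counting “?” characters in an IUPAC-condensed string, which is a simplification made for illustration; the helper names below are hypothetical and the real function applies its own, more complete logic.

```python
def count_ambiguities(glycan):
    """Count '?' placeholders as a rough proxy for structural ambiguity (assumption)."""
    return glycan.count("?")


def pick_least_ambiguous(isomers):
    """Return the isomers that share the lowest ambiguity count."""
    fewest = min(count_ambiguities(g) for g in isomers)
    return [g for g in isomers if count_ambiguities(g) == fewest]


candidates = ["Gal(b1-?)GlcNAc", "Gal(b1-4)GlcNAc", "Gal(?1-?)GlcNAc"]
print(pick_least_ambiguous(candidates))
```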

graph

  • Added support for mixing monosaccharide and modification wildcards in compare_glycans and subgraph_isomorphism (e.g., “HexNAcOS”)
  • Added the handle_negation decorator and subgraph_isomorphism_with_negation to process motif annotation with restrictions (e.g., “Gal(b1-3)[!GlcNAc(b1-6)]GalNAc” to prevent annotating core2 O-glycans as core1)
  • subgraph_isomorphism is now decorated with handle_negation, such that if the “motif” argument contains a negating operator (“!”), the function will actually execute subgraph_isomorphism_with_negation
  • Added the “allowed_disaccharides” keyword argument to get_possible_topologies to support filtering possible extensions by physiological glycan extensions
  • Added a filter to get_possible_topologies to maintain chemically feasible structures by checking that the same carbon does not get two linkages
  • Added support for post-biosynthetic modifications in get_possible_topologies, e.g., allowing inputs such as “{6S}Gal(b1-3)[GlcNAc(b1-6)]GalNAc”, with uncertainty about where the sulfate is attached
  • Refactored graph_to_string_int to build the IUPAC-condensed string by recursively constructing a depth-first search tree
  • Added support for monosaccharide-only graphs in generate_graph_features
  • Added deduplicate_glycans to remove duplicate glycans (with different IUPAC strings) from a list of glycans
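
The dispatch pattern described for handle_negation (route to a negation-aware routine whenever the motif contains “!”) can be sketched as below. All names here are hypothetical, and the matching is a deliberately crude toy: it only compares sets of “Residue(linkage)” units and ignores topology and the reducing-end residue, so it is not subgraph isomorphism and not glycowork’s code.

```python
import functools
import re

# Matches units such as "Gal(b1-3)", optionally prefixed by the negation operator "!".
UNIT = re.compile(r"(!?)([A-Za-z0-9]+\([ab?][0-9?]-[0-9?]\))")


def linked_units(sequence):
    """Split a sequence into (required, forbidden) sets of 'Residue(linkage)' units."""
    required, forbidden = set(), set()
    for negated, unit in UNIT.findall(sequence):
        (forbidden if negated else required).add(unit)
    return required, forbidden


def toy_match_with_negation(glycan, motif):
    """Toy matcher: all required units present and no forbidden unit present."""
    required, forbidden = linked_units(motif)
    present, _ = linked_units(glycan)
    return required <= present and not (forbidden & present)


def handle_negation_sketch(negation_aware):
    """Decorator factory: dispatch to `negation_aware` if the motif contains '!'."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(glycan, motif):
            if "!" in motif:
                return negation_aware(glycan, motif)
            return func(glycan, motif)
        return wrapper
    return decorator


@handle_negation_sketch(toy_match_with_negation)
def toy_subgraph_match(glycan, motif):
    """Toy matcher without negation support: required units must be present."""
    required, _ = linked_units(motif)
    present, _ = linked_units(glycan)
    return required <= present


core1 = "Gal(b1-3)GalNAc"
core2 = "Gal(b1-3)[GlcNAc(b1-6)]GalNAc"
restricted = "Gal(b1-3)[!GlcNAc(b1-6)]GalNAc"
for glycan in (core1, core2):
    print(glycan, toy_subgraph_match(glycan, core1), toy_subgraph_match(glycan, restricted))
```

In this toy, the unrestricted core1 motif also matches the core2 glycan, while the motif with the negated branch matches only core1, which mirrors the use case given in the note above.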

analysis

  • Added the “glycoproteomics” and “level” keyword arguments to get_differential_expression to support the analysis of glycoproteomics data if “glycoproteomics=True”. “level” indicates whether different glycoforms should be analyzed at the level of glycopeptides or glycoproteins
  • Added get_glycoshift_per_site to analyze whether, and in which way, glycosylation changes between conditions for each glycosylation site (controlling for protein expression etc.) via generalized linear models (GLM) adapted for compositional data (i.e., CLR-transformation)
  • Added preprocess_data as a centralization of data preprocessing for easier maintenance
  • Moved preprocessing code from get_differential_expression, get_glycanova, get_biodiversity, and get_roc into preprocess_data
  • Fixed an issue in clean_up_heatmap in which sometimes the longer string instead of the longer sequence was picked for deduplication (e.g., “Internal_LewisX” vs “SialylLewisX”)
  • Moved clean_up_heatmap into motif.annotate
  • Added Omega-squared as an effect size output to get_glycanova
  • Fixed an issue in get_heatmap in which sometimes the function did not correctly rescue an input by transposing it, if the index contained special characters
  • Fixed an issue in get_pca in which the input of a dataframe for group specification resulted in an error
  • Disabled Levene’s test in get_differential_expression if either group has fewer than three samples, for numerical stability
  • Added the “partial_correlations” keyword argument to get_SparCC. If set to True, it will instead use regularized partial correlations to reduce multicollinearity and enrich associations that represent direct effects (i.e., getting rid of bystander effects)
  • Added the “monte_carlo” keyword argument (default False) to preprocess_data and get_differential_expression. If True, technical variation is simulated by sampling 128 Monte Carlo instances from a Dirichlet distribution for each sample. For now, this only works for sequences and CLR. It substantially increases runtime and is considerably more conservative in yielding significant differences between conditions. Use with caution.
  • In get_differential_expression, glycans that had been filtered out by variance filtering now still have their mean abundance and log2FC recorded in the output table
  • Added the “show_all” keyword argument to get_heatmap to force all tick labels to display, even if they visually overlap
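
To make the “monte_carlo” idea more concrete, the following standard-library sketch draws 128 Dirichlet instances for one sample and CLR-transforms each of them. The parameterization (abundance plus a pseudocount of 0.5 as concentration parameters), the function names, and the example abundances are assumptions for illustration, not a description of what preprocess_data does internally.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def clr(composition):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def monte_carlo_clr(abundances, n_instances=128, pseudocount=0.5, seed=0):
    """Return n_instances CLR-transformed Dirichlet draws for one sample."""
    rng = random.Random(seed)  # local RNG so results are reproducible
    # Assumption: concentration parameters are abundance + pseudocount,
    # which also keeps zero abundances strictly positive.
    alphas = [a + pseudocount for a in abundances]
    return [clr(sample_dirichlet(alphas, rng)) for _ in range(n_instances)]


# Hypothetical abundances of four glycans in one sample (made-up numbers).
instances = monte_carlo_clr([120.0, 30.0, 5.0, 0.0])
print(len(instances), "instances; first:", [round(v, 3) for v in instances[0]])
```

Downstream tests would then be run on every instance and summarized across them, which is why runtime grows and results become more conservative.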

annotate

  • Added annotate_glycan_topology_uncertainty to probe whether motifs can be annotated in the case of structural ambiguity (e.g., {Fuc(a1-3)} in N-glycans, to still annotate Lewis X)
  • Expanded annotate_dataset to let it automatically switch between annotate_glycan and annotate_glycan_topology_uncertainty, depending on whether structural ambiguity is present in a glycan (the latter is much more costly in terms of computation)
  • Added the “remove_redundant” keyword argument (default: True) to quantify_motifs, which calls clean_up_heatmap on the output to remove redundant motifs
  • Dynamically generated terminal motifs now have the prefix “Terminal_” in all outputs
  • Resolved a recent deprecation warning from pandas in get_k_saccharides
  • Added a warning to annotate_dataset that will print all features in “feature_set” that are not being recognized
  • Added support for “terminal1” as a synonym for the original “terminal” in “feature_set”

draw

  • Added support for the new “Terminal_” prefix in GlycoDraw and annotate_figure

tokenization

  • Added support for sulfated HexA and HexN in map_to_basic
  • Added calculate_adduct_mass to calculate the mass for generic molecular formulae (e.g., C2H4O2)
  • Added support for chemical tags or adducts in composition_to_mass, glycan_to_mass, and mz_to_composition via the new “adduct” keyword argument
  • Added “Pen” to get_core
  • The default “glycan_class” in mz_to_composition is now “all” (but it can of course still be user-specified)
  • Added the new keyword argument “extras” to mz_to_composition, to allow users to switch off the consideration of adducts or doubly-charged input masses (the default now is to opt out of adducts but users can add that to “extras”)
  • composition_to_mass now copies the input dictionary to prevent any in-place modification of the keys
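
For readers unfamiliar with mass calculation from a generic molecular formula such as C2H4O2, here is a small self-contained sketch. The function formula_mass, the restricted element table, and the flat-formula parsing (no brackets, charges, or isotopes) are assumptions for illustration; calculate_adduct_mass itself may support more elements and a different interface.

```python
import re

# Monoisotopic masses (Da) of a few common elements, rounded to 8 decimals.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307400,
    "O": 15.99491462,
    "P": 30.97376200,
    "S": 31.97207117,
}


def formula_mass(formula):
    """Sum monoisotopic masses for a flat formula such as 'C2H4O2'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"cannot parse formula: {formula!r}")
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element not in MONOISOTOPIC_MASS:
            raise ValueError(f"unknown element: {element!r}")
        # A missing count means a single atom, e.g. the 'O' in 'H2O'.
        mass += MONOISOTOPIC_MASS[element] * (int(count) if count else 1)
    return mass


print(f"C2H4O2: {formula_mass('C2H4O2'):.5f} Da")
```

An adduct or chemical tag would then contribute its own formula mass on top of the glycan mass.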

network

biosynthesis

  • Made network construction faster via code optimizations
  • Added the “mode” keyword argument to choose_path, find_diamonds, trace_diamonds, and evoprune_network to allow for biosynthetic motif analysis to use information from relative abundances
  • We now support the use of longitudinal data in get_differential_biosynthesis to analyze whether biosynthetic flows change over time
  • Fixed an issue in get_differential_biosynthesis in which N-glycans with high-mannose sequences caused errors (due to the backward direction of synthesis)
  • Fixed an issue in get_differential_biosynthesis in which N-glycans containing many unobserved intermediate sequences had capacity bottleneck issues
  • Added the “min_default” keyword argument to estimate_weights, to allow class-dependent fine-tuning of the minimum capacity
  • Modified construct_network to disallow the transfer of modified monosaccharides (e.g., GlcNAc6S), only retaining the sequential assembly in accordance with known biosynthesis (e.g., GlcNAc, then 6S)
  • Added extend_glycans, edges_for_extension, and extend_network to extend the biosynthetic network based on observed reactions and permitted disaccharide extensions
  • Deprecated safe_max and find_ptm; their logic is now handled in-line instead

ml

  • Updated trained models for new lib

processing

  • Made dataset_to_graphs faster if there were any duplicates in the input glycans
  • Added augment_glycan and AugmentedGlycanDataset to support glycan data augmentation during training of deep learning models. Currently, the only supported data augmentation is wildcarding of monosaccharides/linkages (e.g., Gal->Hex, b1-4->?1-?) and the inverse (de-wildcarding)
  • Added the keyword arguments “augment_prob” and “generalization_prob” to split_data_to_train to control the likelihood of augmenting a glycan and the proportion of the glycan to be (de-)wildcarded if it is augmented
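
The wildcarding direction of this augmentation can be sketched as follows. The mapping table, the regular expression, and the reading of “generalization_prob” as a per-token probability are all assumptions made for illustration; augment_glycan may select and replace tokens differently, and de-wildcarding is not shown.

```python
import random
import re

# Hypothetical, incomplete mapping from specific monosaccharides to wildcards.
MONO_TO_WILDCARD = {
    "GlcNAc": "HexNAc",
    "GalNAc": "HexNAc",
    "Glc": "Hex",
    "Gal": "Hex",
    "Man": "Hex",
}
# Longer names come first so that "GlcNAc" is not matched as "Glc".
TOKEN = re.compile(r"GlcNAc|GalNAc|Glc|Gal|Man|\([ab]\d-\d\)")


def wildcard_token(token):
    """Generalize one token: Gal -> Hex, (b1-4) -> (?1-?)."""
    if token.startswith("("):
        # Keep the first carbon number, hide the anomer and the target position.
        return "(?" + token[2] + "-?)"
    return MONO_TO_WILDCARD[token]


def augment_sketch(glycan, augment_prob, generalization_prob, rng):
    """With probability augment_prob, wildcard each token with generalization_prob."""
    if rng.random() >= augment_prob:
        return glycan  # leave this glycan untouched

    def maybe_wildcard(match):
        if rng.random() < generalization_prob:
            return wildcard_token(match.group(0))
        return match.group(0)

    return TOKEN.sub(maybe_wildcard, glycan)


rng = random.Random(0)
print(augment_sketch("Gal(b1-4)GlcNAc", augment_prob=1.0, generalization_prob=1.0, rng=rng))
```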

inference

  • Added an unwrap call to get_lectin_preds to fix the output format

models

  • Set “weights_only = True” for torch.load to prevent FutureWarning

model_training

  • Added support for already one-hot-encoded multilabel labels in Poly1CrossEntropyLoss