Change Log

For Version 1.4.0

Added an example workflow/tutorial for differential glycomics analysis to the Examples tab in the documentation
Added additional tests via pytest
Cleaned up repo with more stringent .gitignore, removing unnecessary files
Added hover-over tooltips to the glycoworkGUI, describing how the input files should be formatted
Exposed more keyword arguments of get_heatmap in GUI (CLR transformation + tick label control)

glycan_data

Broadened the motif definition of “Mucin_elongated_core2” in motif_list
Refined the motif definitions of the O-glycan core motifs in motif_list to prevent overlaps
Updated df_glycan, df_species, df_tissue, df_disease, and glycan_binding with larger (and cleaner) datasets
Updated lib from 2,366 to 2,565 glycoletters

loader

Added the glycoproteomics_data_loader, to request stored glycoproteomics datasets
Added human_milk_N_PMID34087070 and human_keratinocytes_PMID37956981 as example datasets for glycoproteomics_data_loader (entries are identified in the “Glycosite” column in the format protein_site_composition)
Added HexOS and HexNAcOS monosaccharide lists to be used in downstream functions
Added modification_map to map which monosaccharides can be modified with which post-biosynthetic modification
Added DataFrameSerializer to have a version-independent serializer for handling df_glycan

stats

Added get_glycoform_diff to aggregate the differential expression of glycoforms across glycopeptides or glycoproteins via Fisher’s Combined Probability Test
Fixed a pandas deprecation warning in replace_outliers_winsorization (for pandas >= 2.2.2)
Added get_glm and process_glm_results to fit and analyze generalized linear models, with interaction terms, to grouped glycoproteomics data
Added partial_corr to calculate regularized partial correlations
Added estimate_technical_variance and perform_tests_monte_carlo to account for technical variation in glycomics data
Added the “cap_side” keyword argument to replace_outliers_with_IQR_bounds and replace_outliers_winsorization to let users cap outliers on the “upper” side, the “lower” side, or “both”; default: “both”
Fixed the seed of the global NumPy RNG in clr_transformation and alr_transformation to ensure reproducibility
Added the “correction_method” keyword argument to correct_multiple_testing, to allow users to switch between regular Benjamini-Hochberg and two-stage Benjamini-Hochberg
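
As an illustration of the Fisher’s Combined Probability Test named for get_glycoform_diff above, here is a minimal standard-library sketch. The function name fisher_combined_pvalue is hypothetical and this is not glycowork’s implementation; it only shows how several independent p-values (e.g., one per glycoform) can be merged into one.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the joint null hypothesis.
    (Hypothetical helper, not part of glycowork.)
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0 or p > 1 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom (2k), the chi-square survival function
    # has the closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Two glycoforms that are each borderline on their own
statistic, combined_p = fisher_combined_pvalue([0.05, 0.05])
```

With a single p-value the combined value equals the input, which is a quick sanity check for the closed form.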

motif

processing

Added support for sulfated monosaccharides to get_possible_monosaccharides
Added parse_glycoform, infer_features_from_composition, and process_for_glycoshift as helper functions in glycoproteomics data analysis
Expanded canonicalize_composition to deal with compositions of type “9 2 0 0”
Fine-tuned canonicalize_iupac so that it no longer breaks the formatting of sequences ending in “GlcOP-ol”
Added de_wildcard_glycoletter to retrieve a random specified monosaccharide/linkage of the general type present as a wildcard (e.g., Hex->Gal)
Added get_class to return the glycan class as a string, given a glycan sequence
If choose_correct_isoform is provided with isomers that have different amounts of ambiguities, it will now prioritize the isomers with the fewest ambiguities
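
The prioritization described for choose_correct_isoform can be pictured with the following sketch. It assumes that ambiguity is measured by counting “?” placeholders in the sequence string; both that metric and the helper names are assumptions made for illustration, not the library’s actual logic.

```python
def count_ambiguities(glycan):
    """Count '?' placeholders, used here as a simple proxy for ambiguity."""
    return glycan.count("?")


def least_ambiguous(candidates):
    """Return the candidates that share the lowest ambiguity count.

    Hypothetical helper: it only narrows the candidate list and leaves any
    further tie-breaking to the caller.
    """
    if not candidates:
        return []
    fewest = min(count_ambiguities(g) for g in candidates)
    return [g for g in candidates if count_ambiguities(g) == fewest]


candidates = [
    "Gal(b1-?)GlcNAc(b1-3)Gal",
    "Gal(b1-4)GlcNAc(b1-3)Gal",
    "Gal(?1-?)GlcNAc(b1-3)Gal",
]
best = least_ambiguous(candidates)
```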

graph

Added support for mixing monosaccharide and modification wildcards in compare_glycans and subgraph_isomorphism (e.g., “HexNAcOS”)
Added the handle_negation decorator and subgraph_isomorphism_with_negation to process motif annotation with restrictions (e.g., “Gal(b1-3)[!GlcNAc(b1-6)]GalNAc” to prevent annotating core2 O-glycans as core1)
subgraph_isomorphism is now decorated with handle_negation, such that if the “motif” argument contains a negating operator (“!”), the function will actually execute subgraph_isomorphism_with_negation
Added the “allowed_disaccharides” keyword argument to get_possible_topologies to support filtering possible extensions by physiological glycan extensions
Added a filter to get_possible_topologies to maintain chemically feasible structures by checking that the same carbon does not get two linkages
Added support for post-biosynthetic modifications in get_possible_topologies, e.g., allowing inputs such as “{6S}Gal(b1-3)[GlcNAc(b1-6)]GalNAc”, in which the attachment site of the sulfate is uncertain
Refactored graph_to_string_int to recursively construct a depth-first search tree to construct the IUPAC-condensed string
Added support for monosaccharide-only graphs in generate_graph_features
Added deduplicate_glycans to remove duplicate glycans (with different IUPAC strings) from a list of glycans
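
The dispatch pattern described for handle_negation (route the call to a negation-aware routine whenever the motif contains “!”) can be sketched roughly as follows. All names and the placeholder matching functions are hypothetical stand-ins; the real decorator’s signature and matching logic may differ.

```python
import functools


def handle_negation_sketch(negation_aware_func):
    """Build a decorator that reroutes calls whose motif contains '!'."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(glycan, motif, *args, **kwargs):
            if "!" in motif:
                # A negating operator is present: use the specialised routine
                return negation_aware_func(glycan, motif, *args, **kwargs)
            return func(glycan, motif, *args, **kwargs)
        return wrapper
    return decorator


def match_with_negation(glycan, motif):
    # Placeholder: real code would match the motif while requiring that
    # the negated branch is absent from the glycan
    return ("negation", motif)


@handle_negation_sketch(match_with_negation)
def match(glycan, motif):
    # Placeholder for ordinary subgraph matching
    return ("plain", motif)


glycan = "Gal(b1-3)[GlcNAc(b1-6)]GalNAc"
plain_route = match(glycan, "Gal(b1-3)GalNAc")
negated_route = match(glycan, "Gal(b1-3)[!GlcNAc(b1-6)]GalNAc")
```

The benefit of this design is that callers keep using a single entry point and do not need to know which routine handles a given motif.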

analysis

Added the “glycoproteomics” and “level” keyword arguments to get_differential_expression to support the analysis of glycoproteomics data if “glycoproteomics=True”. “level” indicates whether different glycoforms should be analyzed at the level of glycopeptides or glycoproteins
Added get_glycoshift_per_site to analyze whether, and in which way, glycosylation changes between conditions for each glycosylation site (controlling for protein expression etc.) via generalized linear models (GLM) adapted for compositional data (i.e., CLR-transformation)
Added preprocess_data as a centralization of data preprocessing for easier maintenance
Moved preprocessing code from get_differential_expression, get_glycanova, get_biodiversity, and get_roc into preprocess_data
Fixed an issue in clean_up_heatmap in which sometimes the longer string instead of the longer sequence was picked for deduplication (e.g., “Internal_LewisX” vs “SialylLewisX”)
Moved clean_up_heatmap into motif.annotate
Added Omega-squared as an effect size output to get_glycanova
Fixed an issue in get_heatmap in which sometimes the function did not correctly rescue an input by transposing it, if the index contained special characters
Fixed an issue in get_pca in which the input of a dataframe for group specification resulted in an error
Disabled Levene’s test in get_differential_expression if either group has fewer than three samples, for numerical stability
Added the “partial_correlations” keyword argument to get_SparCC. If set to True, it will instead use regularized partial correlations to reduce multicollinearity and enrich associations that represent direct effects (i.e., getting rid of bystander effects)
Added the “monte_carlo” keyword argument (default False) to preprocess_data and get_differential_expression. If True, this will simulate technical variation by sampling 128 Monte Carlo instances from a Dirichlet distribution for each sample. Only works for sequences & CLR for now. This will substantially increase runtime and be considerably more conservative in yielding significant differences between conditions. Use with caution.
In get_differential_expression, glycans that were removed by variance filtering now still have their mean abundance and log2FC recorded in the output table
Added the “show_all” keyword argument to get_heatmap to force all tick labels to display, even if they visually overlap
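
To make the “monte_carlo” option above more concrete, the sketch below draws 128 Dirichlet instances for one sample and CLR-transforms each of them, using only the standard library. How the Dirichlet concentration parameters are derived from the measured abundances is not stated in this change log; the scale factor and the 0.5 offset used here are purely illustrative assumptions, as are the function names.

```python
import math
import random


def dirichlet_sample(alphas, rng):
    """Draw one sample from Dirichlet(alphas) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def clr(composition):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def monte_carlo_clr(abundances, n_instances=128, scale=1000.0, seed=0):
    """Simulate technical variation for one sample (hypothetical sketch)."""
    rng = random.Random(seed)
    total = sum(abundances)
    # Assumption: concentration = relative abundance * scale, plus a small
    # offset so that zero abundances still give a valid parameter
    alphas = [scale * a / total + 0.5 for a in abundances]
    return [clr(dirichlet_sample(alphas, rng)) for _ in range(n_instances)]


# Relative abundances of three glycans in one sample
instances = monte_carlo_clr([60.0, 30.0, 10.0])
```

Downstream tests would then be run across the instances rather than on a single point estimate, which is why this mode is slower and more conservative.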

annotate

Added annotate_glycan_topology_uncertainty to probe whether motifs can be annotated in the case of structural ambiguity (e.g., {Fuc(a1-3)} in N-glycans, to still annotate Lewis X)
Expanded annotate_dataset to let it automatically switch between annotate_glycan and annotate_glycan_topology_uncertainty, depending on whether structural ambiguity is present in a glycan (the latter is much more costly in terms of computation)
Added the (default: True) keyword argument “remove_redundant” to quantify_motifs that will call clean_up_heatmap on the output to remove redundant motifs
Dynamically generated terminal motifs now have the prefix “Terminal_” in all outputs
Resolved a recent deprecation warning from pandas in get_k_saccharides
Added a warning to annotate_dataset that will print all features in “feature_set” that are not being recognized
Added support for “terminal1” as a synonym for the original “terminal” in “feature_set”

draw

Added support for the new “Terminal_” prefix in GlycoDraw and annotate_figure

tokenization

Added support for sulfated HexA and HexN in map_to_basic
Added calculate_adduct_mass to calculate the mass for generic molecular formulae (e.g., C2H4O2)
Added support for chemical tags or adducts in composition_to_mass, glycan_to_mass, and mz_to_composition via the new “adduct” keyword argument
Added “Pen” to get_core
The default “glycan_class” in mz_to_composition is now “all” (but it can of course still be user-specified)
Added the new keyword argument “extras” to mz_to_composition, to allow users to switch off the consideration of adducts or doubly-charged input masses (the default now is to opt out of adducts but users can add that to “extras”)
composition_to_mass now copies the input dictionary to prevent any in-place modification of the keys
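
For readers unfamiliar with mass calculation from molecular formulae (as in the C2H4O2 example given for calculate_adduct_mass), the following sketch parses a simple formula and sums monoisotopic masses. The function name formula_mass and the small element table are illustrative assumptions, not the library’s implementation, and the parser does not handle brackets or charges.

```python
import re

# Monoisotopic masses (Da) for a handful of common elements
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "P": 30.97376163,
    "S": 31.97207100,
}


def formula_mass(formula):
    """Sum monoisotopic masses for a flat formula such as 'C2H4O2'."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject anything the simple pattern could not fully consume
    if "".join(element + count for element, count in tokens) != formula:
        raise ValueError(f"cannot parse formula: {formula!r}")
    mass = 0.0
    for element, count in tokens:
        if element not in MONOISOTOPIC:
            raise KeyError(f"no mass stored for element {element!r}")
        # A missing count means the element occurs once
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


acetate_like = formula_mass("C2H4O2")
```

For C2H4O2 this comes to roughly 60.021 Da.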

network

biosynthesis

Made network construction faster via code optimizations
Added the “mode” keyword argument to choose_path, find_diamonds, trace_diamonds, and evoprune_network to allow for biosynthetic motif analysis to use information from relative abundances
Added support for longitudinal data in get_differential_biosynthesis to analyze whether biosynthetic flows change over time
Fixed an issue in get_differential_biosynthesis in which N-glycans with high-mannose sequences caused errors (due to the backward direction of synthesis)
Fixed an issue in get_differential_biosynthesis in which N-glycans with many unobserved intermediate sequences caused capacity bottlenecks
Added the “min_default” keyword argument to estimate_weights, to allow class-dependent fine-tuning of the minimum capacity
Modified construct_network to disallow the transfer of modified monosaccharides (e.g., GlcNAc6S), only retaining the sequential assembly in accordance with known biosynthesis (e.g., GlcNAc, then 6S)
Added extend_glycans, edges_for_extension, and extend_network to extend the biosynthetic network based on observed reactions and permitted disaccharide extensions
Deprecated safe_max and find_ptm; their logic is now handled in-line instead

ml

Updated trained models for new lib

processing

Made dataset_to_graphs faster if there were any duplicates in the input glycans
Added augment_glycan and AugmentedGlycanDataset to support glycan data augmentation during training of deep learning models. Currently, the only supported data augmentation is wildcarding of monosaccharides/linkages (e.g., Gal->Hex, b1-4->?1-?) and the inverse (de-wildcarding)
Added the keyword arguments “augment_prob” and “generalization_prob” to split_data_to_train to control the likelihood of augmenting a glycan and the proportion of the glycan to be (de-)wildcarded if it is augmented
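
The wildcarding augmentation described above might look roughly like the sketch below. It treats “generalization_prob” as a per-token probability and uses a tiny hand-written wildcard table; both are assumptions for illustration, the function name is hypothetical, and the inverse (de-wildcarding) is left out.

```python
import random
import re

# Hypothetical wildcard table; a real one would cover many more residues
WILDCARDS = {
    "Gal": "Hex", "Glc": "Hex", "Man": "Hex",
    "GlcNAc": "HexNAc", "GalNAc": "HexNAc",
}

# Linkages look like (b1-4); monosaccharide names start with a capital letter
TOKEN = re.compile(
    r"(?P<linkage>\([ab?][0-9?]-[0-9?]\))|(?P<mono>[A-Z][A-Za-z0-9]*)"
)


def augment_glycan_sketch(glycan, augment_prob=0.5,
                          generalization_prob=0.2, rng=None):
    """Randomly replace monosaccharides/linkages with wildcards."""
    rng = rng or random.Random()
    if rng.random() >= augment_prob:
        return glycan  # this glycan is left untouched

    def wildcard(match):
        token = match.group(0)
        if rng.random() >= generalization_prob:
            return token  # keep this token as it is
        if match.group("linkage"):
            return "(?1-?)"
        # Residues without a table entry are returned unchanged
        return WILDCARDS.get(token, token)

    return TOKEN.sub(wildcard, glycan)


fully_wildcarded = augment_glycan_sketch(
    "Gal(b1-4)GlcNAc", augment_prob=1.0, generalization_prob=1.0,
    rng=random.Random(0),
)
```

Passing a seeded random.Random instance keeps the augmentation reproducible across training runs.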

inference

Added an unwrap call to get_lectin_preds to fix the output format

models

Set “weights_only = True” for torch.load to prevent FutureWarning

model_training

Added support for already one-hot encoded multilabel labels in Poly1CrossEntropyLoss
