Bulk signature analysis #83

Open · gwaybio opened this issue Sep 18, 2020 · 5 comments
Labels: Discussion and Notes (Documenting ideas/discussions), Experiments (Tracking experimental questions, results, or analysis)

gwaybio commented Sep 18, 2020

In #82, I describe an analysis of bulk (aggregated) signatures from two compiled datasets. #58 was an initial attempt at this analysis, but it used earlier (and lower quality) data.

I summarize the experiment immediately below, then describe the results in more detail further down.

Summary

The Clone AE results are promising. The signature and method clearly work in both training and testing splits, and there appears to be some sort of dose response. What this dose response means biologically is unclear, but technically (in the resistant lines) it means that the signature features become less extreme in their ranking. In other words, the absolute values of the signature features are higher in the Wildtype_parental profiles.

The Four Clone signature applied to the Clone AE data is odd. The results (at least for the DMSO-treated samples) are mostly outside the null, and the score is less extreme than with the Clone AE signature, but the sign is flipped! This could be the result of some programmatic anomaly in fitting linear models (I confirmed that one likely culprit isn't responsible), a metadata label mixup in the four clone dataset, or the method not being robust across batches.

The signatures applied to the four clone dataset (even the four clone signature itself) are less conclusive. I am not nearly as confident in these data as I am in the Batch 8 profiles.

Next steps

Signature titration

The number of features in the signature is high. Since one goal of the project is to identify a smaller set of features to potentially use as a biomarker of drug resistance, I will perform a "signature titration" analysis in which I systematically add features (starting with the most significant) and quantify the average difference in test set TotalScore between sensitive and resistant clones. This approach will give us a way to select the minimal set of features required to separate the clone types (see the sketch below).
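A minimal sketch of the titration loop, assuming a hypothetical `score_fn` that maps profiles and a feature subset to per-sample TotalScores, and a hypothetical `Metadata_clone_type` column (neither is the exact name used in the repo):

```python
import pandas as pd

def signature_titration(test_df, ranked_features, score_fn):
    """ranked_features: signature features sorted by training-set
    significance (most significant first).
    score_fn: (profiles, feature_subset) -> per-sample TotalScore Series.
    """
    records = []
    for k in range(1, len(ranked_features) + 1):
        scores = score_fn(test_df, ranked_features[:k])
        resistant = scores[test_df.Metadata_clone_type == "resistant"]
        sensitive = scores[test_df.Metadata_clone_type == "sensitive"]
        records.append({"n_features": k,
                        "score_gap": resistant.mean() - sensitive.mean()})
    return pd.DataFrame(records)

# The smallest k whose score_gap plateaus near the full signature's gap
# would be a candidate minimal biomarker panel.
```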

More data collection

We will work with the Rockefeller team to decide next steps in data collection. I see three additional datasets we could collect (note: I have not yet processed batches 9 or 10):

  • Four plates (like in batch 8) except with Clone A/E, Wildtype parental, and wildtype clones
  • Four plates (like in batch 8) except with the four wildtype clones, four resistant clones, and wildtype parental
  • Redo batch 3 (lots of WT and resistant clones with more wildtype parental lines) with updated protocols

I would also like to double-check the platemap metadata labels for batches 4, 5, 6, and 7.

Data

Clone A/E

  • Batch 8 profiles (four plates)
  • Only DMSO treated samples
  • Cell lines:
    • Resistant: CloneA, CloneE
    • Polyclonal wildtype: WT_parental
    • n = 240
  • Feature selection:
    • Operations: variance_threshold, correlation_threshold, drop_na_columns, blocklist, drop_outliers
    • Correlation threshold: 0.95
    • Performed independently
    • p = 3538 before feature selection, 434 after
  • Caveat:
    • We're comparing WT_parental to two clones. The signature may include features representing clonal selection.

Four Clone

  • Batches 4, 5, 6, and 7 (seven plates)
  • Only DMSO treated samples
  • Cell lines:
    • Wildtype (sensitive): WT002, WT008, WT009, WT009
    • Resistant: BZ001, BZ008, BZ017, BZ018
    • Polyclonal wildtype: WT_parental
    • n = 420
  • Feature selection:
    • Operations: variance_threshold, correlation_threshold, drop_na_columns, blocklist, drop_outliers
    • Correlation threshold: 0.95
    • Performed independently
    • p = 3538 before feature selection, 338 after

Signature generation

Procedure

For each dataset, I perform the following procedure:

  1. I split the data into an 85% training and 15% testing set, balanced by cell line. I built the signature using only the training set.
  2. Using the features retained after feature selection, fit a linear model per feature with the following covariates (see the sketch after this list):
    • Metadata_clone_type_indicator (resistant vs. sensitive)
    • Metadata_batch (four_clone only; cloneAE is one batch)
    • Metadata_Plate
    • Metadata_clone_number (clone id)
  3. Perform a TukeyHSD test to adjust p values for within-feature multiple comparisons.
    • (e.g., in the cloneAE dataset, testing an individual feature against the plate covariate is actually 6 pairwise comparisons across the four plates!)
  4. Further adjust the TukeyHSD-adjusted p values with a Bonferroni correction.
  5. Select all features with a p value below this adjusted rate for the Metadata_clone_type_indicator covariate; this set is the "PreSignature".
  6. Remove all features from the "PreSignature" with a p value below the adjusted rate for the Metadata_Plate and Metadata_batch covariates.
    • Effect: This reduces the impact of technical artifacts when the signature is applied.
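A minimal sketch of steps 2-4 for a single feature (column names follow the covariates above, but `test_one_feature` is illustrative, not the exact code used; a faithful implementation would loop over all features and run Tukey on each categorical covariate, not only the plate term):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def test_one_feature(df, feature, n_features, alpha=0.05):
    """df: one bulk profile per row, metadata plus morphology columns."""
    # Step 2: linear model with clone type and technical covariates
    model = smf.ols(
        f"{feature} ~ C(Metadata_clone_type_indicator)"
        " + C(Metadata_Plate) + C(Metadata_clone_number)",
        data=df,
    ).fit()

    # Step 3: TukeyHSD handles the within-feature pairwise comparisons
    # (four plates -> 6 pairwise plate comparisons)
    plate_tukey = pairwise_tukeyhsd(df[feature], df["Metadata_Plate"], alpha=alpha)

    # Step 4: Bonferroni across all features tested
    bonferroni_alpha = alpha / n_features
    return model, plate_tukey, bonferroni_alpha
```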

Volcano plots

These plots visualize feature significance for each linear model covariate.

Clone AE

[Figure: bulk_tukey_volcano_cloneAE]

Four Clone

[Figure: bulk_tukey_volcano_four_clone]

Result

The signatures contain many features, and we make a distinction between features "up" and features "down":

  • Four Clone: 76 features
    • Up: 39
    • Down: 37
  • CloneAE: 188 features
    • Up: 95
    • Down: 93

Apply signatures

Approach

Because we have two datasets and two signatures, I applied each signature to each dataset independently. I also applied each signature with 1,000 random permutations to define a null distribution (sketched below).
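A sketch of one common construction of the permutation null (random feature sets matching the real signature's up/down sizes; `apply_signature` is a placeholder for the scoring step described under Method):

```python
import numpy as np

def permutation_null(profiles, up_features, down_features, all_features,
                     apply_signature, n_permutations=1000, seed=0):
    """Build a null distribution by scoring random signatures that match
    the real signature's up/down sizes."""
    rng = np.random.default_rng(seed)
    k_up, k_down = len(up_features), len(down_features)
    null_scores = []
    for _ in range(n_permutations):
        shuffled = rng.permutation(all_features)
        null_scores.append(
            apply_signature(profiles, shuffled[:k_up],
                            shuffled[k_up:k_up + k_down])
        )
    return np.asarray(null_scores)
```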

Method

I use the singscore method (Foroutan et al., 2018). This is a "single sample" method to detect signature enrichment. It is a relatively simple, rank-based approach bounded between -1 and 1, where a score of 1 means that the sample is enriched for signature features.
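For intuition, here is a simplified re-implementation of the rank-based idea. The actual analysis uses the published singscore implementation; this sketch glosses over details such as tie handling and the optional dispersion component:

```python
import pandas as pd

def rank_signature_score(profile, up_features, down_features):
    """Simplified singscore-style score for one bulk profile.

    profile: pd.Series of feature values indexed by feature name.
    Returns a score in [-1, 1]; 1 means the up features rank highest
    and the down features rank lowest.
    """
    n = len(profile)
    ranks = profile.rank()  # 1 = smallest value

    def centered_mean_rank(features, reverse=False):
        r = (n + 1 - ranks[features]) if reverse else ranks[features]
        k = len(features)
        # theoretical min/max of the mean rank of k features among n
        lo, hi = (k + 1) / 2, (2 * n - k + 1) / 2
        return (r.mean() - lo) / (hi - lo) - 0.5  # in [-0.5, 0.5]

    return (centered_mean_rank(up_features)
            + centered_mean_rank(down_features, reverse=True))
```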

Results

Comparison 1: Clone AE Dataset - Clone AE Signature

[Figure: data_cloneAE_signature_cloneAE_apply_singscore]

Comparison 2: Clone AE Dataset - Four Clone Signature

[Figure: data_cloneAE_signature_four_clone_apply_singscore]

Comparison 3: Four Clone Dataset - Clone AE Signature

[Figure: data_four_clone_signature_cloneAE_apply_singscore]

Comparison 4: Four Clone Dataset - Four Clone Signature

[Figure: data_four_clone_signature_four_clone_apply_singscore]


gwaybio commented Sep 18, 2020

cc @AnneCarpenter @shntnu

I think that we're seeing signal with this approach, and combined with a couple of orthogonal approaches we've discussed already (see below), I think these results could fit nicely into a larger story.

[Figure]

Besides the analyses and future data collection efforts we discussed today (and outlined above), it would be great to get your opinions, so we can discuss them and so I can better manage/prioritize my time. Thanks!

AnneCarpenter commented

I think the most compelling finding would be:

  1. Find one signature.
  2. Reassure that it's real by interpreting the meaning of the features that compose it (qualitative analysis).
  3. Show it's robust across batches (which might benefit from trimming to as few features as possible).
  4. Show it can prospectively predict sensitivity/resistance (either binary resistant/sensitive or continuous, using EC50 for the latter).

Note that (3) isn't strictly necessary if you can do (4); you could always include controls run in the same batch as the query cell lines if necessary. The bullets under robustness are a lot of work and not particularly crucial unless they help you develop approaches that make (4) work well. As discussed in check-in today, if the WT parental line is not aligned well across batches to classify properly, that's a problem. I'd focus on normalization methods first before attempting (4).

Showing the signature is similar across other classes of drugs or other drugs of the same class (your first set of bullets) is neat but not required for a complete story IMO. eLife has a "Research Advance" format that would suit that addendum if it turns out to be interesting.


gwaybio commented Sep 24, 2020

Thanks for these thoughts! They will be great to refer to when deciding next steps. I definitely agree that different normalization methods would help, and it is something that we can try next.

> (3) show it's robust across batches

I dug a bit deeper into this by applying the CloneAE signature to batch 1 and batch 2 data. These are the only other batches containing cloneA and cloneE data that I've processed so far.

[Figure: validation_data_cloneAE_validation_signature_cloneAE_apply_singscore]

We're getting cloneA and cloneE, but the wildtype parental line is a bit wonky.

Next steps for bulk analysis

Normalization

I will try sphering (aka whitening) the bulk profile data before input into the signature builder. The catch is that sphering requires a negative control, and we don't have a good one in these platemaps.
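For reference, a minimal ZCA sphering sketch (numpy only; `control_profiles` stands in for whatever population we decide to treat as the negative control):

```python
import numpy as np

def fit_sphering(control_profiles, eps=1e-6):
    """Fit a ZCA whitening transform on control wells.

    control_profiles: (n_controls, n_features) array.
    Returns (mu, W) such that (X - mu) @ W has approximately identity
    covariance under the control distribution.
    """
    mu = control_profiles.mean(axis=0)
    cov = np.cov(control_profiles - mu, rowvar=False)
    # eigendecomposition; eps regularizes near-zero eigenvalues
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mu, W

# apply to all profiles: sphered = (profiles - mu) @ W
```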

We could try harmonizing the single cells before generating bulk profiles and then proceed. The downside here is that Harmony places the data in a non-CellProfiler feature space, which would be bad from the interpretation perspective. I can try the inverse-Harmony approach that @sMyn42 is assessing, but these are still uncharted waters.

This exhausts the list of normalization/batch effect correction approaches the lab has used (at least recently). I could try other approaches outside our comfort zone, but this may not be the best use of time.

Signature subsampling

We know that the cloneAE signature is robust within batch (from a training/test set perspective). Ignoring the fact that this could be a signature of clonal selection, I can still perform a systematic signature reduction experiment to test feature redundancy and try to find a reduced signature (see the sketch below).
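A sketch of that reduction experiment, complementary to the titration above: instead of adding features by significance, draw random subsets of the signature at several sizes and check how much of the separation survives. `score_fn` is again a placeholder, and `labels` is assumed to be a pandas Series sharing an index with the scores:

```python
import numpy as np
import pandas as pd

def subsample_signature(profiles, labels, signature, score_fn,
                        sizes=(5, 10, 25, 50, 100), n_draws=100, seed=0):
    """Score random subsets of the signature to test feature redundancy."""
    rng = np.random.default_rng(seed)
    records = []
    for size in sizes:
        for _ in range(n_draws):
            subset = rng.choice(signature, size=size, replace=False)
            scores = score_fn(profiles, subset)
            gap = (scores[labels == "resistant"].mean()
                   - scores[labels == "sensitive"].mean())
            records.append({"size": size, "score_gap": gap})
    return pd.DataFrame(records)
```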

AnneCarpenter commented

Could you use a subset of the parental WT line as the negative control? It IS a baseline of sorts, after all. Then the held-out parental WT samples let you check whether the alignment worked well (ideally you'd choose the held-out ones from particular plate locations or something, not just a literal random subset of the parental WTs).


gwaybio commented Sep 25, 2020

> Could you use a subset of the parental WT line as the negative control?

This would work for single cell profiles, but not for well-aggregated bulk profiles (we only have three replicates). It could work with site-aggregated bulk (~27 pseudo-replicates), but we previously decided against site-level aggregation in #70.

There were two reasons not to use this approach: (1) we observed quite a bit of site-to-site variability, and (2) we've never done it before. I can see a path forward on point 1, since we are trying a normalization approach that would adjust for site-to-site variability.
