Compare to clustermatch correlation coefficient #7

Open
rmflight opened this issue Jun 17, 2022 · 4 comments

Comments

@rmflight
Member

The Greene lab published a very interesting paper on a correlation coefficient (the clustermatch correlation coefficient, CCC) that uses a different kind of measure.

Would be nice to see how well we compare in terms of speed and relationships detected.

@rmflight
Member Author

At least in terms of speed on a single comparison, for a random sample with 5000 entries, I got

  • icikt (using Python all around so comparisons are valid): 0.06 s, -0.003 kendall-tau
  • ccc: 0.14 s, 0.0008

So we are still roughly 2x faster on a single comparison (0.06 s vs. 0.14 s), and in this random case both coefficients are close to 0.
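For anyone wanting to repeat this kind of single-comparison timing, here is a minimal, stdlib-only sketch. It is not the script behind the numbers above: `kendall_tau_naive` and `time_single_comparison` are hypothetical helpers, the O(n²) pair loop is far slower than either icikt or ccc, and `n` is kept at 500 rather than 5000 so that it finishes quickly.

```python
import random
import time


def kendall_tau_naive(x, y):
    """Kendall tau-a via an O(n^2) pair loop.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(x)
    concordant = discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def time_single_comparison(func, x, y):
    """Return (elapsed seconds, coefficient) for one call of func(x, y)."""
    start = time.perf_counter()
    value = func(x, y)
    return time.perf_counter() - start, value


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 500                 # assumption: smaller than the 5000 entries used above
x = [rng.random() for _ in range(n)]
y = [rng.random() for _ in range(n)]

elapsed, tau = time_single_comparison(kendall_tau_naive, x, y)
print(f"kendall tau: {tau:.4f} in {elapsed:.3f} s")
```

Any other coefficient function with an `(x, y)` signature can be passed to `time_single_comparison` in the same way, which keeps the timing code identical across methods.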

Would be nice to run all of the GTEx tissues and see if the ICI-Kt tracks with CCC.

@rmflight
Member Author

Sooo, the paper claims that Spearman only picks up linear relationships, and they give examples of actual Spearman correlation coefficients that look like they are missing some relationships that their CCC picks up on. And from reading around the web, Kendall-tau supposedly gives similar values to Spearman.

So if we wanted to go further with this, we would have to investigate how well Kendall-tau matches CCC, or at least tracks with it, especially for non-linear examples. Otherwise, they've definitely made something that seems superior.
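As a starting point for that investigation, here is a small stdlib-only sketch comparing Kendall-tau and Spearman on a monotonic non-linear pattern and on a symmetric quadratic. The helper names (`ranks`, `pearson`, `spearman`, `kendall_tau`) and the simulated data are my own assumptions, not anything from the paper or from icikt, and CCC itself is not computed here.

```python
def ranks(values):
    """Rank values 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a; tied pairs count as neither concordant nor discordant."""
    n = len(x)
    total = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            total += (s > 0) - (s < 0)
    return total / (n * (n - 1) / 2)


# Simulated data (assumption): 101 points symmetric around zero.
x = [i / 10 for i in range(-50, 51)]
cubic = [v ** 3 for v in x]      # non-linear but monotonic
quadratic = [v ** 2 for v in x]  # non-linear and non-monotonic

for name, y in [("cubic", cubic), ("quadratic", quadratic)]:
    print(f"{name}: spearman={spearman(x, y):.3f}, kendall={kendall_tau(x, y):.3f}")
```

Both rank-based coefficients agree with each other here: they are at 1 for the monotonic cubic and at 0 for the symmetric quadratic, even though `y` is fully determined by `x` in both cases. Running CCC on the same vectors would show whether it separates the two cases differently.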

@hunter-moseley
Member

hunter-moseley commented Jun 24, 2022 via email

@rmflight
Member Author

Right, that definitely makes sense. And there are two aspects to this:

  1. Picking up relationships that are interesting, and that the other coefficients don't capture.
  2. Returning coefficients closer to zero where other coefficients stray from zero.

In Figure 1, they are showing CCC picking up on relationships that Pearson and Spearman do not. However, in Figure 2, CCC actually provides more gene-gene coefficients closer to 0 across the whole blood expression than the other two coefficients. I'm guessing it holds for the other tissues as well.

Figure 1
Different types of relationships in data.
Each panel contains a set of simulated data points described by two generic variables: x and y. The first row shows Anscombe’s quartet with four different datasets (from Anscombe I to IV) and 11 data points each. The second row contains a set of general patterns with 100 data points each. Each panel shows the correlation value using Pearson (p), Spearman (s) and CCC (c). Vertical and horizontal red lines show how CCC clustered data points using x and y.

Figure 2
Distribution of coefficient values on gene expression (GTEx v8, whole blood).
a) Histogram of coefficient values. b) Corresponding cumulative histogram. The dotted line maps the coefficient value that accumulates 70% of gene pairs. c) 2D histogram plot with hexagonal bins between all coefficients, where a logarithmic scale was used to color each hexagon.
