Fix: UpdateSymbolList incorrectly renames genes #8179
base: develop
Hi Seurat Team,

I still don't have a perfect answer, but I thought I would provide an update here on an alternative (though more conservative) method for updating genes. I have been testing a function in a dev branch of scCustomize (branch: file_cache_dev; https://github.com/samuel-marsh/scCustomize/blob/c14063f8cd34f3f9f94903c7e82980c68cbd3a84/R/Utilities.R#L1868) that handles things slightly differently.

The function pulls the entire HGNC data csv (stored as a cache using BiocFileCache, which avoids downloading it every time but still allows updating via a cache update). It then filters the input symbols down to those that are NOT currently approved symbols, and checks only those unapproved symbols against the list of previous symbols. Where a match is found, it provides the updated approved symbol.

I say it's more conservative because, while it prevents mis-naming of genes, it can leave some gene names un-updated. There are examples of genes that have swapped symbols with each other; because each of those symbols is currently approved, they are filtered out and not updated. How conservative this is in practice probably depends on the age of the input gene set, as more recent gene sets are less likely than older ones to contain genes that go un-renamed.

Again, I don't really know what the perfect solution is when the only input is gene symbols and not Entrez/Ensembl IDs, but I thought I would make you aware of this potential solution in addition to the problems described in the first post here.

Best,
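For readers who want to see the shape of the conservative approach described above, here is a minimal base R sketch. It is not the scCustomize implementation: the function name `update_symbols_conservative`, the toy `hgnc` table, its column names, and every gene symbol in it are made up for illustration, and no download or caching is shown.

```r
# Toy stand-in for the HGNC table; all symbols below are hypothetical.
hgnc <- data.frame(
  approved = c("GENEA", "GENEB", "GENEC"),
  previous = c("OLDA", "GENEA", "OLDC|OLDC2"),  # "|"-separated previous symbols
  stringsAsFactors = FALSE
)

update_symbols_conservative <- function(symbols, hgnc) {
  # Expand to one row per (previous symbol, approved symbol) pair.
  prev_list <- strsplit(hgnc$previous, "|", fixed = TRUE)
  lookup <- data.frame(
    previous = unlist(prev_list),
    approved = rep(hgnc$approved, lengths(prev_list)),
    stringsAsFactors = FALSE
  )
  # Only symbols that are NOT currently approved are candidates for renaming.
  needs_check <- !(symbols %in% hgnc$approved)
  idx <- match(symbols[needs_check], lookup$previous)
  updated <- symbols
  # Keep the input symbol when no previous-symbol match is found.
  updated[needs_check] <- ifelse(is.na(idx), symbols[needs_check], lookup$approved[idx])
  updated
}

# GENEA is approved, so it is left alone even though it is also listed
# as a previous symbol of GENEB; OLDC2 is updated; UNKNOWN1 is untouched.
update_symbols_conservative(c("GENEA", "OLDC2", "UNKNOWN1"), hgnc)
```

The trade-off mentioned above is visible here: if GENEA had genuinely been renamed to GENEB and its old symbol reassigned, this sketch would never update it, because any currently approved symbol is skipped.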
Is there any update on this front? Is this still maintained/recommended or has this been solved in a different way meanwhile?
Hi @mschilli87, I’ve created a function in my package scCustomize that can handle this now. It also works offline after the first use with an internet connection. https://samuel-marsh.github.io/scCustomize/articles/Update_Gene_Symbols.html Best,
So maybe this PR should be replaced by one removing Seurat's own implementation and importing that one instead then? If this got merged, the code would be duplicated and need to be maintained in two places.
So this PR doesn’t implement that function from scCustomize, as I was aiming to minimize dependencies and keep the output format of the existing function as close to unchanged as possible. I leave it to the Seurat team to decide how they want to implement or change it. Best,
Hi Seurat Team,
This is a PR that builds on the previous fix described in #4545. It is by no means a perfect fix (explanation below), so I leave it to you to decide the path forward. If you decide a different solution is warranted, I'm happy to help with a PR if desired.
The issue is the potential for `UpdateSymbolList` to inappropriately rename genes. In the original case, the search for alias symbols was removed from the internals of `UpdateSymbolList` by manually setting the parameter to previous symbols only. Unfortunately, that still causes issues, because a number of previous symbols are now the symbols of different genes. For instance, MCM2, MCM7, and CCNL1 are all currently approved symbols, yet in its current form `UpdateSymbolList` would still rename them.

Using the most recent 10X human reference genome (which is filtered, so this is not the full extent of the potential issue), I have found >100 genes which would be inappropriately swapped.
The "simple" solution in this PR is a new parameter that requires an object to be specified, so that synonyms are only used if they are genes not already found in the object. This limits the function to use with a Seurat object but protects against inappropriate renames (though not completely).
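As a rough illustration of that guard (not the code in this PR), the base R sketch below accepts a rename only when the proposed symbol is not already among the object's features. The function `update_symbols_guarded`, the `lookup` vector, and all gene symbols are hypothetical, and reading "synonyms" as the proposed replacement symbols is an assumption. In practice the feature vector would come from the Seurat object, for example via `rownames()`.

```r
# Hypothetical previous -> approved lookup; all symbols are made up.
lookup <- c(OLDA = "GENEA", GENEB = "GENEC")

update_symbols_guarded <- function(symbols, lookup, object_features) {
  # Proposed replacement for each symbol; NA where there is no match.
  proposed <- unname(lookup[symbols])
  # Accept a rename only if the proposed symbol is not already a feature.
  accept <- !is.na(proposed) & !(proposed %in% object_features)
  symbols[accept] <- proposed[accept]
  symbols
}

# All genes present: GENEB is kept because GENEC already exists in the object.
features_full <- c("OLDA", "GENEB", "GENEC")
update_symbols_guarded(features_full, lookup, features_full)

# GENEC filtered out of the object: the guard no longer sees it,
# so GENEB is renamed to GENEC.
features_filtered <- c("OLDA", "GENEB")
update_symbols_guarded(features_filtered, lookup, features_filtered)
```

The second call shows why the guard is only a partial protection: it can only check against features that survived filtering.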
The reason it's not a complete solution is that most Seurat objects are filtered versions of the count matrix, which often results in objects with only half of the genes present in the annotation file. The function therefore still leaves open the possibility of inappropriately renaming a gene if the conflicting gene was filtered out during object creation. Avoiding this completely would require a different fix, with Seurat storing the full feature list from the `counts` input somewhere in the object to check against, rather than checking against the current features with `Features`.

Again, if the current PR solution is not desired that is totally fine, but I wanted you to be aware of the issue.
Best,
Sam
Note/Edit: The CI failure appears to be related to a BiocManager install error, not this PR.