Understanding and reformatting the knowledge graph. #2

Open
sgfin opened this issue Jul 14, 2018 · 2 comments


@sgfin

sgfin commented Jul 14, 2018

Thanks for sharing this knowledge graph! I would love to compare and contrast it with some other methods, and ideally expand it a bit by joining it with other resources.

My apologies for the naive question, but as a preliminary step, I am trying to convert the knowledge graph into a simpler triple format that I can load as a flat file into something like numpy. As such, I want to be sure I correctly understand the structure.

Could you confirm whether I am reading this correctly? It appears that each triple forms two rows that look like this:

<http://www.ncbi.nlm.nih.gov/gene/448835> <http://purl.obolibrary.org/obo/RO_0000085> <http://aber-owl.net/go/instance_0> .
<http://aber-owl.net/go/instance_0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/GO_0031424> .

Of the six sets of brackets, it appears the first identifies the source node, the second encodes the edge's relationship, and the sixth identifies the target node. The third and fourth appear to be an identifier of the tuple, and the fifth appears to be the same everywhere.
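A minimal Python sketch of this reading, assuming every statement has the plain `<subject> <predicate> <object> .` form with no literals or blank nodes. The function names `parse_triples` and `collapse_instances` are hypothetical, not part of the repo. It replaces each intermediate instance node with the class it is typed as, and drops any statement whose object is not such an instance node.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Matches one "<s> <p> <o> ." statement made only of bracketed URIs (assumption).
TRIPLE_RE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")


def parse_triples(text):
    """Return every <s> <p> <o> . statement in text as an (s, p, o) tuple."""
    return TRIPLE_RE.findall(text)


def collapse_instances(triples):
    """Rewrite (source, relation, instance) rows to (source, relation, class)."""
    # Map each instance node to the class given by its rdf:type row.
    instance_class = {s: o for s, p, o in triples if p == RDF_TYPE}
    return [
        (s, p, instance_class[o])
        for s, p, o in triples
        if p != RDF_TYPE and o in instance_class
    ]


sample = (
    "<http://www.ncbi.nlm.nih.gov/gene/448835> "
    "<http://purl.obolibrary.org/obo/RO_0000085> "
    "<http://aber-owl.net/go/instance_0> .\n"
    "<http://aber-owl.net/go/instance_0> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://purl.obolibrary.org/obo/GO_0031424> .\n"
)
edges = collapse_instances(parse_triples(sample))
print(edges)
```

For the two example rows this yields a single edge from the gene URI, via the `RO_0000085` relation, to the `GO_0031424` class.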

Is the above interpretation correct? If so, is there an easy way to build up a simple dictionary of the node/edge URLs? I'd prefer to encode them as simple numbers with a separate table mapping each number to a string, but couldn't find a node/edge dictionary in the repo.
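One way such a number-to-string table could be built by hand is sketched below in Python. This is only an illustration: `encode_edges` is a hypothetical helper, nodes and relations are given separate ID spaces as a design assumption, and the second edge uses a made-up `example.org` target purely to show a second node ID being assigned.

```python
def encode_edges(edges):
    """Map URIs to consecutive integers, in order of first appearance."""
    node_ids = {}
    relation_ids = {}
    encoded = []
    for source, relation, target in edges:
        # setdefault assigns the next free integer only for unseen URIs.
        s = node_ids.setdefault(source, len(node_ids))
        r = relation_ids.setdefault(relation, len(relation_ids))
        t = node_ids.setdefault(target, len(node_ids))
        encoded.append((s, r, t))
    return encoded, node_ids, relation_ids


example_edges = [
    (
        "http://www.ncbi.nlm.nih.gov/gene/448835",
        "http://purl.obolibrary.org/obo/RO_0000085",
        "http://purl.obolibrary.org/obo/GO_0031424",
    ),
    (
        "http://www.ncbi.nlm.nih.gov/gene/448835",
        "http://purl.obolibrary.org/obo/RO_0000085",
        "http://example.org/hypothetical_target",  # made-up URI for illustration
    ),
]
encoded, node_ids, relation_ids = encode_edges(example_edges)
# Reverse table: integer ID back to the original URI string.
id_to_node = {i: uri for uri, i in node_ids.items()}
print(encoded)
```

The `encoded` list of integer triples can be passed directly to something like `numpy.array`, with `id_to_node` kept alongside as the lookup table.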

Thanks so much again for all your work, and I hope this isn't a pain for you to answer.

@monaalsh
Collaborator

monaalsh commented Jul 15, 2018

Thanks for your interest.
Yes, every row represents a triple. Please refer to the paper for details about representing instances and ontology classes.
As for the graph output, you can use the RDFWrapper.groovy script, which takes this input graph and can output an edge list (which can be used for creating a Python dictionary) and a mapping file to map each URI to an integer ID.

@sgfin
Author

sgfin commented Jul 17, 2018

Thanks so much for your response. My apologies, but one more question:

Do you by any chance have a mapping file between URIs and either UMLS CUIs or their original source IDs (PubChem, GO, etc.)? It looks like the code may reference some of these files in a data folder (which may also be used to test performance grouped by class?), but I don't see them.

Alternatively, do you know if there is a package in Python that would facilitate the URI -> original ID conversion, perhaps by utilizing the links out to the ontology? I am hoping to integrate with some other datasets, so I can't simply use custom integer values, and the URIs are not completely trivial to parse into their original IDs, though it looks like I may be able to hand-engineer such a parser by inspection.
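A hand-engineered parser of this kind might start out like the Python sketch below. It covers only the two URI shapes that appear in the example rows earlier in this thread: OBO PURLs, where `obo/GO_0031424` conventionally corresponds to `GO:0031424`, and NCBI gene URLs. The function name `uri_to_curie` and the `NCBIGene` prefix label are assumptions chosen for illustration, and any other URI shape in the graph would need its own rule.

```python
import re

# OBO PURL convention: .../obo/PREFIX_LOCALID corresponds to PREFIX:LOCALID.
OBO_RE = re.compile(
    r"^http://purl\.obolibrary\.org/obo/([A-Za-z][A-Za-z0-9]*)_(\w+)$"
)
NCBI_GENE_RE = re.compile(r"^http://www\.ncbi\.nlm\.nih\.gov/gene/(\d+)$")


def uri_to_curie(uri):
    """Return a 'PREFIX:ID' string for known URI shapes, else None."""
    match = OBO_RE.match(uri)
    if match:
        return f"{match.group(1)}:{match.group(2)}"
    match = NCBI_GENE_RE.match(uri)
    if match:
        # "NCBIGene" is an arbitrary label picked for this sketch.
        return f"NCBIGene:{match.group(1)}"
    return None  # unknown shape: leave it to the caller to handle


for uri in (
    "http://purl.obolibrary.org/obo/GO_0031424",
    "http://www.ncbi.nlm.nih.gov/gene/448835",
    "http://aber-owl.net/go/instance_0",
):
    print(uri, "->", uri_to_curie(uri))
```

Returning `None` for unrecognized URIs makes it easy to list which shapes still lack a rule before joining with another dataset.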

Thanks again
