
Strip comments #7

Open
DJTB opened this issue Jan 21, 2017 · 2 comments

Comments


DJTB commented Jan 21, 2017

Hey hey, I love what you've done here!

It seems a bit ridiculous, though, that “the” is in the top 10 (edit: for JavaScript at least) when the occurrences all(?) come from comments.
It would be great to see a dataset that doesn't include comments.

Owner

anvaka commented Jan 21, 2017

Thank you!

I think having the ability to parse actual code and categorize each line is a very powerful idea. However, that would be too expensive and time-consuming for me to do. I guess one way to do it would be to translate each file into a language-specific abstract syntax tree using user-defined functions in BigQuery, and then emit categorized lines.

Or maybe there is an easier way?
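
As a rough illustration of what "categorized lines" could look like for a single language, here is a minimal sketch that labels each line of a Python source file using the standard-library `tokenize` module rather than a full AST. The helper name `categorize_lines` is hypothetical; nothing here is part of this project or of BigQuery, and other languages would each need their own tokenizer.

```python
import io
import tokenize

# Token types that carry no content of their own (layout only).
_LAYOUT = {
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def categorize_lines(source: str) -> dict[int, str]:
    """Label each 1-based line of Python source as 'code', 'comment', or 'blank'.

    A line that mixes code and a trailing comment is labelled 'code'.
    """
    line_count = len(source.splitlines())
    labels = {row: "blank" for row in range(1, line_count + 1)}

    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _LAYOUT:
            continue
        if tok.type == tokenize.COMMENT:
            # Only a comment-only line gets the 'comment' label.
            if labels.get(tok.start[0]) == "blank":
                labels[tok.start[0]] = "comment"
        else:
            # Multi-line tokens (e.g. triple-quoted strings) span several rows.
            for row in range(tok.start[0], tok.end[0] + 1):
                labels[row] = "code"
    return labels
```

With labels like these, a word count could simply skip every line marked `comment`.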

Author

DJTB commented Jan 22, 2017

I'm not sure about other languages, but for web-related tech you could first run everything through something like https://github.com/vitaly-t/decomment
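
To make the idea concrete, here is a minimal sketch of the kind of preprocessing such a tool performs. It is not decomment's implementation or API, just a small hand-written scanner, and the function name `strip_js_comments` is hypothetical. It assumes JavaScript-style `//` and `/* */` comments, treats template literals as plain strings, and does not recognize regex literals, so a pattern like `/\/\//` would be mishandled.

```python
def strip_js_comments(source: str) -> str:
    """Remove // and /* */ comments from JavaScript-like source text.

    Comment markers inside string literals are left untouched.
    Assumption: regex literals and ${...} inside templates are not handled.
    """
    out = []
    i, n = 0, len(source)
    quote = None  # quote character of the string literal we are inside, if any

    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""

        if quote:
            out.append(ch)
            if ch == "\\" and nxt:
                out.append(nxt)  # keep the escaped character verbatim
                i += 2
                continue
            if ch == quote:
                quote = None  # string literal closed
            i += 1
        elif ch in "'\"`":
            quote = ch
            out.append(ch)
            i += 1
        elif ch == "/" and nxt == "/":
            # Line comment: skip up to (but not including) the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif ch == "/" and nxt == "*":
            # Block comment: skip past the closing marker, or to end of input.
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1

    return "".join(out)
```

Running the word count on the stripped text instead of the raw file would drop occurrences of “the” that only appear in comments, while keeping any that appear in identifiers or string literals.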
