Strip comments #7

DJTB · 2017-01-21T14:53:26Z

Hey hey, I love what you've done here!

It seems a bit ridiculous, though, that “the” is in the top 10 (edit: for JavaScript at least) when the occurrences are all(?) from comments.
It would be great to see a dataset that doesn't include comments.

anvaka · 2017-01-21T22:27:12Z

Thank you!

I think having the ability to parse actual code and categorize each line is a very powerful idea. However, that would be too expensive and time-consuming for me to do. I guess one way to do it would be to translate each file into a language-specific abstract syntax tree using user-defined functions in BigQuery, and then emit categorized lines.

Or maybe there is an easier way?
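
For illustration only, here is a rough sketch of the "categorize each line" idea on a small scale. It is not the project's BigQuery pipeline: it assumes Python source as input and uses the standard library `tokenize` module (a tokenizer rather than a full abstract syntax tree). The names `categorize_lines`, `count_words`, and `SAMPLE` are hypothetical, and docstrings are treated as code here, which may or may not be what a real dataset wants.

```python
import io
import re
import tokenize
from collections import Counter

# Token types that carry no content of their own (layout only).
_LAYOUT = {
    tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER,
}


def _tokens(source: str):
    """Yield tokens for a string of Python source."""
    return tokenize.generate_tokens(io.StringIO(source).readline)


def categorize_lines(source: str) -> dict:
    """Label each non-blank line as 'code', 'comment', or 'mixed'."""
    has_comment, has_code = set(), set()
    for tok in _tokens(source):
        if tok.type == tokenize.COMMENT:
            has_comment.add(tok.start[0])
        elif tok.type not in _LAYOUT:
            # A token (e.g. a triple-quoted string) may span several lines.
            has_code.update(range(tok.start[0], tok.end[0] + 1))
    labels = {}
    for line in sorted(has_comment | has_code):
        if line in has_comment and line in has_code:
            labels[line] = "mixed"
        elif line in has_comment:
            labels[line] = "comment"
        else:
            labels[line] = "code"
    return labels


def count_words(source: str, include_comments: bool) -> Counter:
    """Count identifiers/keywords, optionally adding words from comments."""
    counts = Counter()
    for tok in _tokens(source):
        if tok.type == tokenize.NAME:
            counts[tok.string] += 1
        elif tok.type == tokenize.COMMENT and include_comments:
            counts.update(re.findall(r"[A-Za-z_]+", tok.string))
    return counts


# Hypothetical input, used only to show the effect of dropping comments.
SAMPLE = (
    "# compute the sum of the list\n"
    "def total(items):\n"
    "    return sum(items)  # the built-in does the work\n"
)

if __name__ == "__main__":
    print(categorize_lines(SAMPLE))
    print(count_words(SAMPLE, include_comments=True).most_common(3))
    print(count_words(SAMPLE, include_comments=False).most_common(3))
```

On this sample, “the” is the most frequent word when comments are included and disappears entirely when they are excluded, which is the effect described in the original report. Each other language would need its own tokenizer or parser, so the cost concern raised above still applies.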

DJTB · 2017-01-22T02:39:27Z

I'm not sure about other languages, but for web-related tech you could first run everything through something like https://github.com/vitaly-t/decomment
