Hey hey, I love what you've done here!
It seems a bit ridiculous, though, that “the” is in the top 10 (edit: for JavaScript at least) when the occurrences are all(?) from comments.
Would be great to see a dataset that doesn't include comments.
I think having the ability to parse actual code and categorize each line is a very powerful idea. However, that would be too expensive and time-consuming for me to do. I guess one way to do it would be to translate each file into a language-specific abstract syntax tree using user-defined functions in BigQuery, and then emit categorized lines.
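To make the "emit categorized lines" idea concrete, here is a minimal local sketch. It makes several assumptions that are not from this thread: it handles Python source only, it uses the standard library tokenizer rather than a full abstract syntax tree, it runs as a plain script rather than as a BigQuery user-defined function, and the function name `categorize_lines` and the four labels are made up for illustration.

```python
import io
import tokenize

# Token types that carry no content of their own (layout only).
_LAYOUT = {
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def categorize_lines(source: str) -> dict[int, str]:
    """Label each 1-based line as 'code', 'comment', 'mixed', or 'blank'.

    Hypothetical helper: docstrings are string tokens, so they count as code.
    Raises an error from the tokenizer if the source is not valid Python.
    """
    has_code: set[int] = set()
    has_comment: set[int] = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            has_comment.add(tok.start[0])
        elif tok.type not in _LAYOUT:
            # A token such as a triple-quoted string can span several lines.
            has_code.update(range(tok.start[0], tok.end[0] + 1))

    labels: dict[int, str] = {}
    for number in range(1, len(source.splitlines()) + 1):
        if number in has_code and number in has_comment:
            labels[number] = "mixed"
        elif number in has_code:
            labels[number] = "code"
        elif number in has_comment:
            labels[number] = "comment"
        else:
            labels[number] = "blank"
    return labels
```

Word counts could then be computed only over lines labelled `code`. Doing the same for every language in the dataset would need one tokenizer or parser per language, which is where the cost mentioned above comes from.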
I'm not sure about other languages, but for web-related tech you could first run everything through something like https://github.com/vitaly-t/decomment
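For anyone who wants to try the "strip comments first, then count" approach without extra dependencies, here is a rough sketch. It is not how decomment works internally; it is a small hand-written scanner, and the names `strip_js_comments` and `word_counts` are hypothetical. It assumes JavaScript-style `//` and `/* */` comments, and it deliberately ignores regex literals and `${...}` interpolation inside template strings, so it will misread some real-world files.

```python
import re
from collections import Counter


def strip_js_comments(source: str) -> str:
    """Remove // and /* */ comments, leaving string literals untouched."""
    out = []
    i, n = 0, len(source)
    quote = None  # current string delimiter, or None when outside a string
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if quote is not None:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character so \" does not end the string.
                out.append(nxt)
                i += 2
                continue
            if ch == quote:
                quote = None
            i += 1
        elif ch in "'\"`":
            quote = ch
            out.append(ch)
            i += 1
        elif ch == "/" and nxt == "/":
            # Line comment: skip to the end of the line, keep the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif ch == "/" and nxt == "*":
            # Block comment: replace with one space so tokens do not merge.
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
            out.append(" ")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def word_counts(source: str) -> Counter:
    """Count lowercase identifier-like words after comments are removed."""
    return Counter(re.findall(r"[a-z_]+", strip_js_comments(source).lower()))
```

Calling `word_counts(text).most_common(10)` on a file's contents gives a per-file top 10 that excludes comment text. Words inside string literals are still counted, so "the" can still show up from user-facing messages.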