Add Solidity (ethereum smart contract language) #50

Open
ghost opened this issue Sep 23, 2021 · 3 comments

Comments

ghost commented Sep 23, 2021

I would love to see Solidity added as a language, as it usually gets detected as JavaScript, Dart, Lua or other languages.

Solidity has a set of reserved keywords that differ markedly from those of other languages, which should allow training to perform very well on it.

Owner

yoeo commented Sep 27, 2021

Hello @vbersier,

Indeed, it would be a good idea to add Solidity, as Ethereum, smart contracts and NFTs are everywhere lately.
However, there are currently not enough Solidity example files on GitHub to feed Guesslang (~50k files required): https://github.com/search?q=language%3ASolidity&type=repositories.

I propose that we wait for the number of Solidity projects on GitHub to grow before adding this language.

Author

ghost commented Sep 28, 2021

Hi @yoeo

There are thousands of source code examples available from etherscan.io and bscscan.com. I wonder if it would be possible to somehow scrape them with their API?

https://docs.etherscan.io/api-endpoints/contracts#get-contract-source-code-for-verified-contract-source-codes
https://docs.bscscan.com/api-endpoints/contracts#get-contract-source-code-for-verified-contract-source-codes
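As a rough illustration of what such a scraper might look like, here is a minimal Python sketch. It assumes the `module=contract&action=getsourcecode` query parameters and the `status`/`result`/`SourceCode` response fields described in the docs linked above; the base URL, the function names and the error handling are my own assumptions, so check the current API docs before relying on any of it.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL taken from the API docs linked above; it may have changed.
ETHERSCAN_API = "https://api.etherscan.io/api"


def build_source_url(address, api_key, base_url=ETHERSCAN_API):
    """Build the 'getsourcecode' request URL for one contract address."""
    query = urllib.parse.urlencode({
        "module": "contract",
        "action": "getsourcecode",
        "address": address,
        "apikey": api_key,
    })
    return f"{base_url}?{query}"


def extract_sources(payload):
    """Return the non-empty SourceCode strings from a decoded API response."""
    if payload.get("status") != "1":
        return []  # error responses carry a message instead of a result list
    result = payload.get("result")
    if not isinstance(result, list):
        return []
    return [
        item["SourceCode"]
        for item in result
        if isinstance(item, dict) and item.get("SourceCode")
    ]


def fetch_sources(address, api_key, base_url=ETHERSCAN_API):
    """Download and parse the verified source of one contract (needs network)."""
    url = build_source_url(address, api_key, base_url)
    with urllib.request.urlopen(url, timeout=30) as response:
        return extract_sources(json.load(response))
```

Passing a different `base_url` should be enough to target bscscan.com, assuming both services keep the same request and response shape. A real scraper would also need a list of verified contract addresses and would have to respect the API rate limits.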

In total, that is already 13k files with an open source license.

GitHub search seems glitchy, as the number of code results changes with every refresh, from 700 to 63k results.

Finally, I think this particular language will require a smaller training set than most other languages because, as I explained, its reserved keywords are highly distinctive.
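To make the keyword argument concrete, here is a small Python sketch that counts a few Solidity-specific tokens in a snippet. The token list is hand-picked for illustration and is not a complete or official keyword list, and `solidity_hint_count` is a hypothetical helper. Guesslang is trained on example files rather than on a keyword table, so this only shows why the signal should be easy to pick up.

```python
import re

# Illustrative, hand-picked tokens; not an official or complete keyword list.
SOLIDITY_HINTS = (
    "pragma solidity",
    "contract",
    "mapping",
    "uint256",
    "payable",
    "msg.sender",
    "modifier",
    "emit",
)


def solidity_hint_count(source):
    """Count how many distinct hint tokens appear as whole tokens in source."""
    found = 0
    for hint in SOLIDITY_HINTS:
        # Reject matches that are part of a longer identifier or attribute.
        pattern = r"(?<![\w.])" + re.escape(hint) + r"(?!\w)"
        if re.search(pattern, source):
            found += 1
    return found


solidity_snippet = """pragma solidity ^0.8.0;
contract Token {
    mapping(address => uint256) public balances;
    function deposit() public payable {
        balances[msg.sender] += msg.value;
    }
}"""

javascript_snippet = """const balances = {};
function deposit(sender, value) {
    balances[sender] += value;
}"""

print(solidity_hint_count(solidity_snippet))    # 6
print(solidity_hint_count(javascript_snippet))  # 0
```

Even this tiny contract hits six of the eight tokens, while a JavaScript function that does the same job hits none.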

Owner

yoeo commented Sep 29, 2021

> GitHub search seems glitchy

You're right; I now see that the GitHub search results are not stable for Solidity.
Depending on how many files I can actually retrieve from GitHub, I could perhaps add Solidity to the next batch of supported languages.

> I wonder if it would be possible to somehow scrape them with their API?

Currently, the dataset is generated by this script: https://github.com/yoeo/guesslangtools/
All the source code is retrieved from GitHub, but it should be possible to add other sources, including etherscan.io and bscscan.com. Of course, any contribution toward this addition is warmly welcome.
