Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data subset pull script is problematic. #114

Open
Nottlespike opened this issue Oct 15, 2024 · 5 comments
Open

Data subset pull script is problematic. #114

Nottlespike opened this issue Oct 15, 2024 · 5 comments

Comments

@Nottlespike
Copy link
Contributor

Related to PR #111
The script assumes that the user has Git LFS installed but it's not in the documentation on how to install then setup then initialize Git LFS. For most users this is very likely going to break if they attempt to use said script. A likely much better implementation would be to use this Github gist, https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f#file-hfd-sh, integrated into the project as it uses only git and aria2c/wget as dependencies. There are also other advantages to integrating this script as you would not need to use a SSH keys as that adds complexity to the project and issues with ports as it's likely that port may be busy as I SSH into my cloud instance to setup prime.

@Jackmin801
Copy link
Member

Thanks for the suggestion! It would definitely be easier if we could pull huggingface shards through their http server. My experience with this is that the http server eventually rate limits you, which is not the case for git lfs pulls. This was true last year and could be different now though.

@Nottlespike
Copy link
Contributor Author

@Jackmin801 So I used this script to help a LOT of people download L3.1 405B when Git LFS timed out for them. You can also pass the HF read token from the env file to prevent rate limiting but I have never seen that with this script even pulling as anonymous.

@Jackmin801
Copy link
Member

Ah ok. Lets change it then. Do you think you could write the PR for this? Otherwise I can do it tmr.
Thanks again for the suggestion :)

@Nottlespike
Copy link
Contributor Author

I'm going to take a shot at the PR as I have integrated it before and use it extensively personally.

@Nottlespike
Copy link
Contributor Author

@Jackmin801 I have submitted a PR and fixed a possible "security" issue introduced by the current ~/10B/H100.toml. It is best to verify that the subset logic is working as intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants