How to add new language #40

Open
yjmm10 opened this issue Aug 5, 2021 · 7 comments

yjmm10 commented Aug 5, 2021

Hello, if I want to do transfer learning based on your model, can I use the trained model?
I tried to load the trained model, but it had no effect. I hope to get your reply.

Author

yjmm10 commented Aug 5, 2021

When I download the dataset with guesslangtools, many repositories do not exist, and the GitHub server rejects my requests.

Owner

yoeo commented Aug 5, 2021

Hello @yjmm10

> I tried to load the trained model, but it had no effect. I hope to get your reply.

I think that the current model doesn't suit transfer learning very well. The list of supported languages is embedded in the model graph itself, which means that you'll have to hack the graph somehow to add the new languages' information.
That might change in future versions, but currently there are a few blockers (I can go into more detail if required).
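To illustrate why a label list baked into a model blocks naive extension, here is a toy sketch. It is not Guesslang's architecture: `ToyClassifier` and `add_label_naively` are hypothetical names, and the only point is that the trained weights and the label list must agree in size.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ToyClassifier:
    """Toy linear classifier (hypothetical, NOT Guesslang's real model)."""

    labels: Tuple[str, ...]
    weights: Tuple[Tuple[float, ...], ...]  # one row of weights per label

    def __post_init__(self) -> None:
        # The trained weights only make sense for the labels they were trained on.
        if len(self.weights) != len(self.labels):
            raise ValueError("label count does not match the trained weights")

    def predict(self, features: Sequence[float]) -> str:
        # Score each label with a dot product and return the best one.
        scores = [
            sum(w * x for w, x in zip(row, features)) for row in self.weights
        ]
        return self.labels[scores.index(max(scores))]


def add_label_naively(model: ToyClassifier, new_label: str) -> ToyClassifier:
    """Append a label without retraining: fails, there are no weights for it."""
    return ToyClassifier(model.labels + (new_label,), model.weights)


trained = ToyClassifier(
    labels=("Python", "Rust"),
    weights=((1.0, 0.0), (0.0, 1.0)),
)
```

Calling `add_label_naively(trained, "Zig")` raises `ValueError` in this toy, which mirrors the reason retraining on a dataset that contains the new language is the recommended path.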

Today, the only recommended way to add new languages is to build a dataset that includes the new languages with guesslangtools.

Owner

yoeo commented Aug 5, 2021

> When I download the dataset with guesslangtools, many repositories do not exist

Yes, that's expected: the GitHub public repository list that I use was last updated in January 2020: https://zenodo.org/record/3626071/
You can safely ignore this warning.

Owner

yoeo commented Aug 5, 2021

> The GitHub server rejects my requests

Strange... The main Guesslangtools workflow only relies on git commands because, as far as I know, they are not (yet) restricted by GitHub servers. The GitHub website & API are heavily restricted, though.

Can you share the errors that you're getting?

Author

yjmm10 commented Aug 6, 2021

> > The GitHub server rejects my requests
>
> Strange... The main Guesslangtools workflow only relies on git commands because, as far as I know, they are not (yet) restricted by GitHub servers. The GitHub website & API are heavily restricted, though.
>
> Can you share the errors that you're getting?

Thank you for your reply. This is the error message. It often happens when downloading the zip file:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "I:/Private/guesslangtools/guesslangtools/__main__.py", line 104, in <module>
    main()
  File "I:/Private/guesslangtools/guesslangtools/__main__.py", line 89, in main
    run_workflow()
  File "I:\Private\guesslangtools\guesslangtools\app.py", line 14, in run_workflow
    compressed_repositories.download()
  File "I:\Private\guesslangtools\guesslangtools\common.py", line 112, in wrapped
    result = func(*args, **kw)
  File "I:\Private\guesslangtools\guesslangtools\workflow\compressed_repositories.py", line 100, in download
    for step, row in enumerate(pool_imap(_download_repository, rows), 1):
  File "I:\Private\guesslangtools\guesslangtools\common.py", line 213, in pool_imap
    for result in pool.imap(_apply, iterable):
  File "D:\.conda\envs\base\lib\multiprocessing\pool.py", line 868, in next
    raise value
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))

Process finished with exit code 1

or

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "I:/Private/guesslangtools/guesslangtools/__main__.py", line 104, in <module>
    main()
  File "I:/Private/guesslangtools/guesslangtools/__main__.py", line 89, in main
    run_workflow()
  File "I:\Private\guesslangtools\guesslangtools\app.py", line 14, in run_workflow
    compressed_repositories.download()
  File "I:\Private\guesslangtools\guesslangtools\common.py", line 112, in wrapped
    result = func(*args, **kw)
  File "I:\Private\guesslangtools\guesslangtools\workflow\compressed_repositories.py", line 97, in download
    for step, row in enumerate(pool_imap(_download_repository, rows), 1):
  File "I:\Private\guesslangtools\guesslangtools\common.py", line 213, in pool_imap
    for result in pool.imap(_apply, iterable):
  File "D:\.conda\envs\base\lib\multiprocessing\pool.py", line 868, in next
    raise value
ValueError: check_hostname requires server_hostname

Process finished with exit code 1

Owner

yoeo commented Aug 7, 2021

Okay @yjmm10, it looks like you are using an older version of guesslangtools (version < 1.0).
Older versions of guesslangtools downloaded the repositories directly from GitHub's HTTP servers.
Because of the restrictions on those servers (like the ones you are experiencing), I switched to using Git commands instead.
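As a rough illustration of fetching a repository through the `git` command rather than over the GitHub HTTP download endpoints, here is a minimal sketch. It is not the actual guesslangtools implementation: `build_clone_command` and `clone_repository` are hypothetical helpers, and the shallow-clone flags are an assumption about what a dataset builder would want.

```python
import os
import subprocess
from pathlib import Path
from typing import List


def build_clone_command(owner: str, repo: str, destination: Path) -> List[str]:
    """Build a shallow `git clone` command for a public GitHub repository."""
    url = f"https://github.com/{owner}/{repo}.git"
    return ["git", "clone", "--depth", "1", "--quiet", url, str(destination)]


def clone_repository(
    owner: str, repo: str, destination: Path, timeout: float = 600.0
) -> bool:
    """Return True if the clone succeeded, False otherwise.

    A repository that no longer exists simply yields False, so the caller
    can skip it and continue with the next one.
    """
    command = build_clone_command(owner, repo, destination)
    # Disable interactive credential prompts so a missing repository
    # fails fast instead of waiting for input.
    environment = {**os.environ, "GIT_TERMINAL_PROMPT": "0"}
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout,
            env=environment,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Returning a boolean instead of raising keeps one deleted or renamed repository from aborting a long batch run.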

You can install the latest version of guesslangtools with the following commands:

# Clone the latest version of the code
git clone https://github.com/yoeo/guesslangtools.git
cd guesslangtools

# Edit the language description file to add the new languages information
vi data/languages.yaml

# Install the new Guesslangtools on your system
pip install -Ue .
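After installing, it may help to confirm that the environment really picked up a 1.0 or later release. The sketch below uses the standard `importlib.metadata` module. The distribution name `guesslangtools` is an assumption (check `pip list` if it differs), and `parse_major_minor` and `is_at_least_1_0` are hypothetical helper names.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_major_minor(version_text: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '0.9.3' or '1.0.0rc1'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"unrecognized version: {version_text!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def is_at_least_1_0(distribution: str = "guesslangtools") -> Optional[bool]:
    """Return True/False for installed packages, None if not installed.

    The default distribution name is an assumption, not taken from the thread.
    """
    try:
        installed = version(distribution)
    except PackageNotFoundError:
        return None
    return parse_major_minor(installed) >= (1, 0)


if __name__ == "__main__":
    print(is_at_least_1_0())
```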

Owner

yoeo commented Aug 7, 2021

After installing guesslangtools you can run it to generate the dataset:

# You can change the --nb-xxx parameters to have more or less examples in your dataset
gltool /path/to/new/dataset

It will take hours. When it is done, you can train Guesslang:

# Clone Guesslang
git clone https://github.com/yoeo/guesslang.git
cd guesslang

# Install Guesslang in "developer mode"
pip install -Ue .

# Copy the language mapping generated in the dataset (`languages.json`) into the Guesslang repository
cp /path/to/new/dataset/languages.json ./data/languages.json

# Run the training
guesslang --train /path/to/new/dataset/files --steps 10000 --model /path/to/new/model
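Before starting a training run that takes hours, a quick check that the new languages actually made it into `languages.json` can save time. This sketch assumes the file is a JSON object whose keys are language names. That layout is an assumption, not confirmed in this thread, and `missing_languages` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Iterable, List


def missing_languages(mapping_path: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected language names that are absent from the mapping.

    Assumes the mapping file is a JSON object keyed by language name.
    """
    with open(mapping_path, encoding="utf-8") as handle:
        mapping = json.load(handle)
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object keyed by language name")
    return [name for name in expected if name not in mapping]
```

For example, `missing_languages(Path("data/languages.json"), ["Zig"])` would return `["Zig"]` if that language was not picked up when the dataset was built.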

I'm using Linux command-line syntax here, and I hope it won't be hard to convert these commands into Windows shell commands.
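For the `cp` step, one way to avoid shell differences altogether is a short Python script, since `pathlib` and `shutil` behave the same on Linux and Windows. This is only a sketch: `copy_language_mapping` is a hypothetical helper, and both directory arguments stand in for the placeholder paths used in the commands above.

```python
import shutil
from pathlib import Path


def copy_language_mapping(dataset_dir: Path, guesslang_repo: Path) -> Path:
    """Copy <dataset_dir>/languages.json to <guesslang_repo>/data/languages.json.

    Unlike plain `cp`, this also creates the `data` directory if it is missing.
    """
    source = dataset_dir / "languages.json"
    target = guesslang_repo / "data" / "languages.json"
    if not source.is_file():
        raise FileNotFoundError(f"no language mapping found at {source}")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return target
```

The `git clone`, `pip install -Ue .`, `gltool`, and `guesslang --train` commands themselves should not need translation, apart from using Windows-style paths.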

yoeo added the question label Sep 27, 2021