Meaning of C and D #15

Open
Maxscha opened this issue Aug 3, 2023 · 1 comment

Maxscha commented Aug 3, 2023

Thanks for this amazing library. I'm looking forward to training and adapting some models with it.

After creating my first vocabulary I noticed that a lot of the tokens contain uppercase C and uppercase D. Do those have a special meaning? I could also see them referenced in the code, but I could not find the meaning.

Thanks in advance

Example:

tokens:
    - token:   "D"
      id:      35
      score:   0.006828829
      encoded: true
    - token:   " und"
      id:      2657
      score:   0.0047021606
      encoded: true
    - token:   " der"
      id:      2099
      score:   0.0032128973
      encoded: true
    - token:   "C"
      id:      34
      score:   0.0031624683
      encoded: true
    - token:   " die"
      id:      2105
      score:   0.002436903
      encoded: true
    - token:   " von"
      id:      2684
      score:   0.0021727835
      encoded: true
    - token:   ".C"
      id:      271
      score:   0.0020115946
      encoded: true
    - token:   " für"
      id:      5997
      score:   0.0017581019
      encoded: true
    - token:   "-DC"
      id:      1163
      score:   0.0017092729
      encoded: true
    - token:   " des"
      id:      2100
      score:   0.0016576286
      encoded: true
    - token:   " mit"
      id:      2407
      score:   0.0014818916
      encoded: true
    - token:   " in"
      id:      993
      score:   0.0014810717
      encoded: true
    - token:   ",C"
      id:      259
      score:   0.0014182056
      encoded: true
    - token:   ","

@alasdairforsythe (Owner)

D, C and W are 'capcode' markers used at capcode level 2. At capcode level 1, only ord(127) is used instead.
D means delete the next space.
C means uppercase the next character.
W means uppercase the next word.
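
To make the three marker meanings above concrete, here is a minimal Python sketch of how such markers could be applied when turning encoded text back into plain text. It is an illustration only, not the library's actual decoder: the function name `decode_capcode_sketch` is hypothetical, and it assumes that the encoded text is otherwise lowercase (so an uppercase D, C or W can only be a marker) and that a "word" is a run of alphabetic characters.

```python
def decode_capcode_sketch(encoded: str) -> str:
    """Apply D/C/W markers to otherwise-lowercase text (illustrative only)."""
    out = []
    delete_space = False    # set by D: drop the next space
    upper_char = False      # set by C: uppercase the next letter
    pending_word = False    # set by W: uppercase the next word
    in_upper_word = False   # True while inside the word being uppercased

    for ch in encoded:
        # Markers are consumed and never emitted.
        if ch == "D":
            delete_space = True
            continue
        if ch == "C":
            upper_char = True
            continue
        if ch == "W":
            pending_word = True
            continue

        # D: remove the next space we encounter.
        if ch == " " and delete_space:
            delete_space = False
            in_upper_word = False
            continue

        if ch.isalpha():
            if pending_word:
                # First letter after W starts the uppercased word.
                pending_word = False
                in_upper_word = True
            if in_upper_word:
                ch = ch.upper()
            elif upper_char:
                ch = ch.upper()
                upper_char = False
        else:
            # Any non-letter ends the current uppercased word.
            in_upper_word = False

        out.append(ch)

    return "".join(out)


print(decode_capcode_sketch("DC hello"))       # Hello
print(decode_capcode_sketch("W nasa rocks"))   # " NASA rocks" (leading space kept)
```

Under these assumptions, a token such as "-DC" from the vocabulary above would read as: a hyphen, then "delete the next space", then "uppercase the next character".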
