
docker image fails to start on mac m3 #257

Open · domdorn opened this issue Jan 2, 2024 · 8 comments

domdorn commented Jan 2, 2024

Hi,

I stumbled upon this project while searching for a code-generation plugin for IntelliJ that I could run locally on my MacBook Pro M3.
GPT4All and LM Studio work here.
When I try to start the Docker container according to the README, I get the following error:

⇒  docker run -d --rm -p 8008:8008 -v perm-storage:/perm_storage --gpus all smallcloud/refact_self_hosting
Unable to find image 'smallcloud/refact_self_hosting:latest' locally
latest: Pulling from smallcloud/refact_self_hosting
6b851dcae6ca: Pull complete
4586c00479c6: Pull complete
4304fa233a80: Pull complete
afa3f70b397f: Pull complete
d963a42bc712: Pull complete
68cd1e6a2dfe: Pull complete
c4a5e6c74f13: Pull complete
afec03310895: Pull complete
44d8a5c35cf0: Pull complete
e1bab5cae66b: Pull complete
e5f5c15a6664: Pull complete
8171a8ea64b5: Pull complete
81498b0353e9: Pull complete
434102192d11: Pull complete
164ea2687875: Pull complete
330be2f7dde3: Pull complete
ec975c0f6b6b: Pull complete
d0684ebbd31e: Pull complete
3101cbbf5939: Pull complete
3b09cacfe58c: Pull complete
f6985351b850: Pull complete
d4592264fd4d: Pull complete
dbb5483af9ca: Pull complete
894f0a0dd390: Pull complete
Digest: sha256:d77bb965d665cefdcf9e575d3c826fd2b5f54835a002212acb8568c4e4f500ed
Status: Downloaded newer image for smallcloud/refact_self_hosting:latest
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
5aa05d92b342e408a04d59ea1bbc8050f63c8c077fc0f7154ec6baa85fd93094
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Any help appreciated!

P.S.: Happy New Year!
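On the platform warning: the image is built for linux/amd64 only, so on an M3 it would run under emulation even if it started. Assuming the image is already pulled, this confirms it:

docker image inspect smallcloud/refact_self_hosting --format '{{.Os}}/{{.Architecture}}'

which should print linux/amd64.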


m0ngr31 commented Jan 5, 2024

It requires an Nvidia GPU for the self-hosted version to work.
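The "could not select device driver" error comes specifically from --gpus all: Docker looks for the NVIDIA container runtime, which doesn't exist on macOS. As a sketch, assuming you only want the container to boot (without CUDA it still won't serve models, and amd64 emulation will be slow), you can drop the flag:

docker run -d --rm -p 8008:8008 -v perm-storage:/perm_storage smallcloud/refact_self_hosting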

olegklimov (Contributor) commented

Code completion doesn't work very well on Apple Silicon or CPUs. The context is just too big for completion to be fast enough. Chat without much context is of course possible on CPUs, but that's not a complete product.

domdorn (Author) commented Jan 10, 2024

> The context is just too big for completion to be fast enough.

I have 64 GB of shared memory, i.e. memory that is usable by the GPU as well as the CPU.
An Nvidia RTX 4090 "only" has 24 GB available.

Apart from that, I guess the problem is that Apple Silicon support just isn't there in the underlying libraries/apps?

olegklimov (Contributor) commented

I have an M1 in my MacBook Air. I've tested the smallest reasonable models, for example the 1B StarCoder running on llama.cpp:

"-m", "starcoder-1b-q8_0.gguf"
897.71 ms /   557 tokens (    1.61 ms per token,   620.47 tokens per second)
1334.68 ms /    49 runs   (   27.24 ms per token,    36.71 tokens per second)

The problem is not the 49 generated tokens, it's the 557 prefill tokens (in this example) that take ~900 ms (557 tokens at 1.61 ms per token). For a typical 2k or 4k context that becomes 4-8 seconds. Using cloud Refact, you normally get a 250-400 ms typical single-line completion, depending on where in the world you are. So there's a 10x gap.

I looked at the M2 specs; it's not that much faster. Maybe the M3 is? @domdorn, if you have interesting suggestions, that would be awesome!

domdorn (Author) commented Jan 10, 2024

Hmm, not sure I'm doing this right, as it's the first time I've run LLMs from the CLI.

1. Build:
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
make -j
2. Download starcoder-1b:
mkdir models; cd models;
wget https://huggingface.co/TabbyML/StarCoder-1B/resolve/main/ggml/q8_0.v2.gguf
mv q8_0.v2.gguf starcoder-1b-q8_0.gguf
cd ..
3. Run:
./main -m models/starcoder-1b-q8_0.gguf
The output is:
⇒  ./main -m models/starcoder-1b-q8_0.gguf
Log start
main: build = 1808 (57d016b)
main: built with Apple clang version 14.0.0 (clang-1400.0.29.202) for arm64-apple-darwin23.2.0
main: seed  = 1704900282
llama_model_loader: loaded meta data with 18 key-value pairs and 293 tensors from models/starcoder-1b-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = starcoder
llama_model_loader: - kv   1:                               general.name str              = StarCoder
llama_model_loader: - kv   2:                   starcoder.context_length u32              = 8192
llama_model_loader: - kv   3:                 starcoder.embedding_length u32              = 2048
llama_model_loader: - kv   4:              starcoder.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                      starcoder.block_count u32              = 24
llama_model_loader: - kv   6:             starcoder.attention.head_count u32              = 16
llama_model_loader: - kv   7:          starcoder.attention.head_count_kv u32              = 1
llama_model_loader: - kv   8:     starcoder.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                          general.file_type u32              = 7
llama_model_loader: - kv  10:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr[str,49152]   = ["<|endoftext|>", "<fim_prefix>", "<f...
llama_model_loader: - kv  12:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  13:                      tokenizer.ggml.merges arr[str,48891]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  14:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  17:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  194 tensors
llama_model_loader: - type q8_0:   99 tensors
llm_load_vocab: special tokens definition check successful ( 19/49152 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = starcoder
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 49152
llm_load_print_meta: n_merges         = 48891
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 1.24 B
llm_load_print_meta: model size       = 1.23 GiB (8.51 BPW)
llm_load_print_meta: general.name     = StarCoder
llm_load_print_meta: BOS token        = 0 '<|endoftext|>'
llm_load_print_meta: EOS token        = 0 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: LF token         = 145 'Ä'
llm_load_tensors: ggml ctx size       =    0.11 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  1257.52 MiB, ( 1257.59 / 49152.00)
llm_load_tensors: system memory used  = 1255.96 MiB
....................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/domdorn/work/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     6.00 MiB, ( 1269.22 / 49152.00)
llama_new_context_with_model: KV self size  =    6.00 MiB, K (f16):    3.00 MiB, V (f16):    3.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 1269.23 / 49152.00)
llama_build_graph: non-view tensors processed: 583/583
llama_new_context_with_model: compute buffer total size = 107.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   104.00 MiB, ( 1373.22 / 49152.00)

system_info: n_threads = 12 / 16 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 [end of text]

llama_print_timings:        load time =     131.08 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =      -0.09 ms
ggml_metal_free: deallocating
Log end

Is this what you did?

olegklimov (Contributor) commented

Yes, just give it a prompt of 1000 tokens.

Here's my script you can try:

code = """import pygame
import numpy as np
import attractgame_particle


W = 640
H = 480


def draw_hello_world(
    screen: pygame.Surface,
    message: str,
    color: tuple = (0, 255, 255),
    font_name: str = "Arial",
) -> None:
    font = pygame.font.SysFont(font_name, 32)
    text = font.render(message, True, color)
    text_rect = text.get_rect()
    text_rect.center = (W / 2, H / 2)


particles = [
    attractgame_particle.Particle(
        np.random.uniform(0, W),
        np.random.uniform(0, H),
        np.random.uniform(-2, 2),
        np.random.uniform(-2, 2)
    ) for i in range(10)]


def main_loop():
    screen = pygame.display.set_mode((W, H))
    quit_flag = False
    while not quit_flag:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                quit_flag = True
        screen.fill((0, 0, 0))
        pygame.draw.circle(screen, (255, 255, 255), (W / 2, H / 2), 10)
        for p in particles:
            pygame.draw.circle(screen, (255, 255, 255), (p.x, p.y), 10)
        draw_hello_world(screen, "Hello World!", (0, 255, 0))
        pygame.display.flip()
        pygame.time.Clock().tick(60)
        for p in particles:
            p.calc_forward(W, H)


class Point:
    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

    def __str__(self):
        return f"({self.x}, {self.y})"

    def __add__(self, other):
        return Point(self.x + other.x, self.y + other.y)


if __name__ == '__main__':
    pygame.init()
    pygame.display.set_caption("Attract Game")
    main_loop()
    pygame.quit()

class Point3d:"""


cmd = [
    "./main",
    # pick one model to benchmark:
    #"-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
    "-m", "Refact-1_6B-fim/refact-1.6B-fim-q4_0.gguf",
    #"-m", "./starcoder-1b-f16.gguf",
    #"-m", "starcoder-1b-q8_0.gguf",
    "-c", "2048",     # context size
    "-n", "50",       # tokens to generate
    "-p", code,       # the ~550-token prompt above
    "--temp", "0.2",
    "--top-p", "1.0",
]

import subprocess
p = subprocess.run(cmd)
print(p)


#"-m", "starcoder-1b-q8_0.gguf",
#  897.71 ms /   557 tokens (    1.61 ms per token,   620.47 tokens per second)
# 1334.68 ms /    49 runs   (   27.24 ms per token,    36.71 tokens per second)

#"-m", "./starcoder-1b-f16.gguf",
#  841.99 ms /   557 tokens (    1.51 ms per token,   661.53 tokens per second)
#  243.18 ms /    49 runs   (   45.78 ms per token,    21.84 tokens per second)

"-m", "Refact-1_6B-fim/refact-1.6B-fim-q4_0.gguf",
# 1246.57 ms /   557 tokens (    2.24 ms per token,   446.82 tokens per second)
# 1176.39 ms /    49 runs   (   24.01 ms per token,    41.65 tokens per second)

#"-m", "./Refact-1_6B-fim/ggml-model-f16.gguf",
#  175.27 ms /   557 tokens (    2.11 ms per token,   473.93 tokens per second)
#  962.51 ms /    49 runs   (   60.46 ms per token,    16.54 tokens per second)


domdorn (Author) commented Jan 11, 2024

This is the output that main from llama.cpp gives me with starcoder-1b-q8_0.gguf:

⇒  python3 run.py
Log start
main: build = 1808 (57d016b)
main: built with Apple clang version 14.0.0 (clang-1400.0.29.202) for arm64-apple-darwin23.2.0
main: seed  = 1704994947
llama_model_loader: loaded meta data with 18 key-value pairs and 293 tensors from ./models/starcoder-1b-q8_0.gguf (version GGUF V3 (latest))
[... model loading output identical to the run above ...]
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/domdorn/work/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    24.00 MiB, ( 1287.22 / 49152.00)
llama_new_context_with_model: KV self size  =   24.00 MiB, K (f16):   12.00 MiB, V (f16):   12.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 1287.23 / 49152.00)
llama_build_graph: non-view tensors processed: 583/583
llama_new_context_with_model: compute buffer total size = 107.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   104.00 MiB, ( 1391.22 / 49152.00)

system_info: n_threads = 12 / 16 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 1.000, min_p = 0.050, typical_p = 1.000, temp = 0.200
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 2048, n_batch = 512, n_predict = 50, n_keep = 0


[... the 557-token prompt from the script is echoed back ...]
class Point3d:
    def __init__(self, x: float, y: float, z: float):
        self.x = x
        self.y = y
        self.z = z

    def __str__(self):
        return f"({self
llama_print_timings:        load time =     303.02 ms
llama_print_timings:      sample time =       6.02 ms /    50 runs   (    0.12 ms per token,  8305.65 tokens per second)
llama_print_timings: prompt eval time =     161.88 ms /   557 tokens (    0.29 ms per token,  3440.71 tokens per second)
llama_print_timings:        eval time =     332.86 ms /    49 runs   (    6.79 ms per token,   147.21 tokens per second)
llama_print_timings:       total time =     510.23 ms
ggml_metal_free: deallocating
Log end
CompletedProcess(args=['./main', '-m', './models/starcoder-1b-q8_0.gguf', '-c', '2048', '-n', '50', '-p', '<prompt elided>', '--temp', '0.2', '--top-p', '1.0'], returncode=0)

This finished in less than a second:
python3 run.py  0.18s user 0.08s system 32% cpu 0.799 total

I'm not sure how to read these results, but it feels like the M3 would be enough to run this... am I correct?

I'll try to set up Refact locally on my Mac without Docker. It seems to be a bit complicated, as my Python installations are a bit of a mess :-/
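For the Python mess, a generic way to isolate the install is a fresh virtual environment (a standard-tooling sketch, nothing Refact-specific):

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

and then do the project install inside it.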

olegklimov (Contributor) commented

Hey @domdorn, thanks for your results!

That should be less than a second for a 2k context; maybe we'll think about official support 🤔

> refact locally on my mac without docker

It's all about finetuning, file filtering, and efficient model hosting on GPUs. I think your best bet is to run the model inside llama.cpp, add an OpenAI-style HTTP server to it, and set up a caps file for refact-lsp, similar to https://inference.smallcloud.ai/coding_assistant_caps.json
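As a rough sketch of the llama.cpp side (assuming a build recent enough to include the server example; the flags mirror the main binary used above):

./server -m models/starcoder-1b-q8_0.gguf -c 2048 --host 127.0.0.1 --port 8080

That gives you an HTTP completion endpoint the caps file can point at.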

Here's how to run refact-lsp:

cargo build && target/debug/refact-lsp --address-url mycaps.json --api-key ANY-WILL-WORK --http-port 8001 --lsp-port 8002 --logs-stderr

Then test whether it works in the console, using an example like this:

https://github.com/smallcloudai/refact-lsp/blob/main/examples/http_completion.sh
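A hedged sketch of such a test; the port matches the --http-port above, but the endpoint path is an assumption on my part, so take the exact path and request body from the linked script:

curl http://127.0.0.1:8001/v1/code-completion -H 'Content-Type: application/json' -d @request.json

where request.json holds the JSON body from http_completion.sh.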

If it does, the last step is to put the path to your caps.json as the address URL in the plugin. It should work, but you will be the first to try. Fun!
