
darwin arm64 regression trying "Example: Build on mac", llama-cpp ''../ggml-common.h' file not found' #4274

Closed
mintyleaf opened this issue Nov 27, 2024 · 2 comments · Fixed by #4279
Labels
bug (Something isn't working), unconfirmed

Comments

@mintyleaf
Contributor

mintyleaf commented Nov 27, 2024

LocalAI version:
e8128a3

Environment, CPU architecture, OS, and Version:
MacBook Air (M3), macOS arm64

Describe the bug
Trying to build and load the phi-2.Q2_K model fails for all llama-cpp backend variants.
Doing exactly the same on the latest stable tag v2.23.0 successfully loads the model with the first llama-cpp backend, and it works as intended.

To Reproduce
Build and run the "Example: Build on mac" instructions on an arm64 Mac; a sketch of the steps follows.
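For reference, roughly the steps from that docs example (command names and the model URL are as I remember them from the docs at the time; treat them as illustrative rather than exact):

```sh
# install build dependencies (per the mac example in the docs)
brew install abseil cmake go grpc protobuf wget

git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build

# download phi-2 into models/ (URL as in the docs example)
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q2_K.gguf -O models/phi-2.Q2_K

# run with debug logging, then load the model through the API
./local-ai --models-path=./models/ --debug=true
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "phi-2.Q2_K",
  "messages": [{"role": "user", "content": "How are you?"}]
}'
```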

Logs

...
9:09PM DBG [llama-cpp-fallback] llama-cpp variant available
9:09PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-fallback
9:09PM DBG GRPC Service for Phi2 will be running at: '127.0.0.1:63236'
9:09PM DBG GRPC Service state dir: /var/folders/d7/46zkm5yj39nbb6dp9dtrs_d00000gn/T/go-processmanager3758471652
9:09PM DBG GRPC Service Started
9:09PM DBG Wait for the service to start up
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stdout Server listening on 127.0.0.1:63236
9:09PM DBG GRPC Service Ready
9:09PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:phi-2.Q2_K ContextSize:512 Seed:1785535671 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/Users/mintyleaf/Projects/work/LocalAI/models/phi-2.Q2_K Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false ModelPath:/Users/mintyleaf/Projects/work/LocalAI/models LoraAdapters:[] LoraScales:[]}
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_load_model_from_file: using device Metal (Apple M3) - 5461 MiB free
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /Users/mintyleaf/Projects/work/LocalAI/models/phi-2.Q2_K (version GGUF V3 (latest))
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   0:                       general.architecture str              = phi2
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   1:                               general.name str              = Phi2
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  10:                          general.file_type u32              = 10
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - kv  19:               general.quantization_version u32              = 2
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - type  f32:  195 tensors
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - type q2_K:   33 tensors
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - type q3_K:   96 tensors
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_model_loader: - type q6_K:    1 tensors
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab: missing pre-tokenizer type, using: 'default'
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab:                                             
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab: ************************************        
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab: CONSIDER REGENERATING THE MODEL             
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab: ************************************        
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab:                                             
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab: special tokens cache size = 944
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_vocab: token to piece cache size = 0.3151 MB
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: format           = GGUF V3 (latest)
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: arch             = phi2
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: vocab type       = BPE
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_vocab          = 51200
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_merges         = 50000
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: vocab_only       = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_ctx_train      = 2048
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_embd           = 2560
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_layer          = 32
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_head           = 32
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_head_kv        = 32
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_rot            = 32
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_swa            = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_embd_head_k    = 80
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_embd_head_v    = 80
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_gqa            = 1
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_embd_k_gqa     = 2560
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_embd_v_gqa     = 2560
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: f_norm_eps       = 1.0e-05
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: f_clamp_kqv      = 0.0e+00
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: f_logit_scale    = 0.0e+00
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_ff             = 10240
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_expert         = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_expert_used    = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: causal attn      = 1
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: pooling type     = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: rope type        = 2
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: rope scaling     = linear
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: freq_base_train  = 10000.0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: freq_scale_train = 1
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: n_ctx_orig_yarn  = 2048
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: rope_finetuned   = unknown
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: ssm_d_conv       = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: ssm_d_inner      = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: ssm_d_state      = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: ssm_dt_rank      = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: ssm_dt_b_c_rms   = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: model type       = 3B
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: model ftype      = Q2_K - Medium
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: model params     = 2.78 B
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: model size       = 1.09 GiB (3.37 BPW) 
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: general.name     = Phi2
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: LF token         = 128 'Ä'
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: EOG token        = 50256 '<|endoftext|>'
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_print_meta: max token length = 256
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_tensors: tensor 'token_embd.weight' (q2_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_backend_metal_log_allocated_size: allocated buffer, size =  1076.52 MiB, ( 1076.59 /  5461.34)
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_tensors: offloading 32 repeating layers to GPU
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_tensors: offloading output layer to GPU
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_tensors: offloaded 33/33 layers to GPU
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_tensors: Metal_Mapped model buffer size =  1076.51 MiB
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llm_load_tensors:   CPU_Mapped model buffer size =    41.02 MiB
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr .........................................................................................
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: n_seq_max     = 1
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: n_ctx         = 512
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: n_ctx_per_seq = 512
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: n_batch       = 512
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: n_ubatch      = 512
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: flash_attn    = 0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: freq_base     = 10000.0
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: freq_scale    = 1
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (2048) -- the full capacity of the model will not be utilized
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_metal_init: allocating
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_metal_init: found device: Apple M3
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_metal_init: picking default device: Apple M3
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_metal_init: default.metallib not found, loading from source
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_metal_init: loading '/private/tmp/localai/backend_data/backend-assets/grpc/ggml-metal.metal'
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:7:10: fatal error: '../ggml-common.h' file not found
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr #include "../ggml-common.h"
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr          ^~~~~~~~~~~~~~~~~~
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr " UserInfo={NSLocalizedDescription=program_source:7:10: fatal error: '../ggml-common.h' file not found
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr #include "../ggml-common.h"
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr          ^~~~~~~~~~~~~~~~~~
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr }
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr ggml_backend_metal_device_init: error: failed to allocate context
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr llama_new_context_with_model: failed to initialize Metal backend
9:09PM DBG GRPC(Phi2-127.0.0.1:63236): stderr common_init_from_params: failed to create context with model '/Users/mintyleaf/Projects/work/LocalAI/models/phi-2.Q2_K'
...
@mintyleaf mintyleaf added the bug (Something isn't working) and unconfirmed labels on Nov 27, 2024
@mintyleaf mintyleaf changed the title from arm64 regression trying "Example: Build on mac", llama-cpp 'computeFunction must not be nil' to darwin arm64 regression trying "Example: Build on mac", llama-cpp 'computeFunction must not be nil' on Nov 27, 2024
@mintyleaf mintyleaf changed the title from darwin arm64 regression trying "Example: Build on mac", llama-cpp 'computeFunction must not be nil' to darwin arm64 regression trying "Example: Build on mac", llama-cpp ''../ggml-common.h' file not found' on Nov 27, 2024
@mintyleaf
Contributor Author

After further investigation, I discovered that the llama-cpp binaries are copied into backend-assets and then cached in the /tmp directory, which isn't cleaned up by make clean.
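For anyone hitting the same thing, a manual cleanup along these lines works around the stale cache (the asset path is taken from the debug log above; adjust if your setup extracts it elsewhere):

```sh
# `make clean` leaves the extracted backend assets behind, so remove them by hand
rm -rf /tmp/localai/backend_data
make clean && make build
```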

With a freshly compiled llama-cpp I then got another error,
exactly the same as in ggerganov/llama.cpp#6608.

@mudler
Owner

mudler commented Nov 27, 2024

@mintyleaf can you try #4279? I don't have a Mac to test this on, but it should fix your issues.
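If it helps, one way to test that PR locally (assuming the GitHub CLI is installed; fetching the PR ref with plain git works too):

```sh
# check out the PR branch, then rebuild from scratch
gh pr checkout 4279
rm -rf /tmp/localai/backend_data   # clear the stale cached assets noted above first
make clean && make build
```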
