LLaMA.CPP

LLaMA.CPP is an open-source project that enables inference of Large Language Models (LLMs) like LLaMA on various hardware. Written in C/C++, it boasts minimal dependencies and supports diverse platforms, from Apple Silicon to NVIDIA GPUs. Notably, it excels in quantization techniques, reducing model sizes and accelerating inference speeds. LLaMA.CPP democratizes access to powerful AI capabilities, allowing users to run sophisticated language models on consumer-grade devices.

LLaMA.CPP uses n_predict instead of max_tokens; however, you can safely use max_tokens because it is converted automatically. To use embeddings, you will also need to start your web server with the --embedding argument and an appropriate model. The expected port is 8080.
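For example, a length-limited request might look like the following sketch. It assumes sendMessage accepts an options object as its third argument (an assumption based on the wider llm-interface documentation; see the Example Usage below), with max_tokens translated to n_predict before the request reaches the llama.cpp server.

const { LLMInterface } = require('llm-interface');

async function limitedReply() {
  // max_tokens is accepted here and converted to n_predict for llama.cpp
  const response = await LLMInterface.sendMessage(
    'llamacpp',
    'Summarize the benefits of quantization in one paragraph.',
    { max_tokens: 150 }, // illustrative value
  );
  console.log(response.results);
}

limitedReply();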

Interface Name

  • llamacpp

Example Usage

const { LLMInterface } = require('llm-interface');

// LLaMA.CPP does not use a traditional API key (see "Getting an API Key" below);
// the interface is registered the same way as other providers.
LLMInterface.setApiKey({'llamacpp': process.env.LLAMACPP_API_KEY});

async function main() {
  try {
    const response = await LLMInterface.sendMessage('llamacpp', 'Explain the importance of low latency LLMs.');
    console.log(response.results);
  } catch (error) {
    console.error(error);
    throw error;
  }
}

main();

Model Aliases

The following model aliases are provided for this provider; a usage sketch follows the list.

  • default: gpt-3.5-turbo
  • large: gpt-3.5-turbo
  • small: gpt-3.5-turbo
  • agent: openhermes
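As a sketch of how an alias might be selected (this assumes the options object accepts a model field, as in other llm-interface provider examples; the alias and token limit shown are purely illustrative):

const { LLMInterface } = require('llm-interface');

async function askSmallModel() {
  // 'small' refers to the alias listed above; passing it via the options
  // object is an assumption based on the broader llm-interface documentation.
  const response = await LLMInterface.sendMessage(
    'llamacpp',
    'Explain the importance of low latency LLMs.',
    { model: 'small', max_tokens: 150 },
  );
  console.log(response.results);
}

askSmallModel();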

Embeddings Model Aliases

  • default: none
  • large: none
  • small: none

Options

The following parameters can be passed through options; a combined example follows the list.

  • cache_prompt: Details not available, please refer to the LLM provider documentation.
  • dynatemp_exponent: Details not available, please refer to the LLM provider documentation.
  • dynatemp_range: Details not available, please refer to the LLM provider documentation.
  • frequency_penalty: Penalizes new tokens based on their existing frequency in the text so far, reducing the likelihood of repeating the same line. Positive values reduce the frequency of tokens appearing in the generated text.
  • grammar: Details not available, please refer to the LLM provider documentation.
  • id_slot: Details not available, please refer to the LLM provider documentation.
  • ignore_eos: Whether to ignore the end-of-sequence token.
  • image_data: Details not available, please refer to the LLM provider documentation.
  • json_schema: Details not available, please refer to the LLM provider documentation.
  • logit_bias: An optional parameter that modifies the likelihood of specified tokens appearing in the model-generated output.
  • max_tokens: The maximum number of tokens that can be generated in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length.
  • min_keep: Details not available, please refer to the LLM provider documentation.
  • min_p: Minimum probability threshold for token selection.
  • mirostat: Details not available, please refer to the LLM provider documentation.
  • mirostat_eta: Details not available, please refer to the LLM provider documentation.
  • mirostat_tau: Details not available, please refer to the LLM provider documentation.
  • n_keep: Details not available, please refer to the LLM provider documentation.
  • n_probs: Details not available, please refer to the LLM provider documentation.
  • penalize_nl: Details not available, please refer to the LLM provider documentation.
  • penalty_prompt: Details not available, please refer to the LLM provider documentation.
  • presence_penalty: Penalizes new tokens based on whether they appear in the text so far, encouraging the model to talk about new topics. Positive values increase the likelihood of new tokens appearing in the generated text.
  • repeat_last_n: Details not available, please refer to the LLM provider documentation.
  • repeat_penalty: Details not available, please refer to the LLM provider documentation.
  • samplers: Details not available, please refer to the LLM provider documentation.
  • seed: A random seed for reproducibility. If specified, the system will attempt to sample deterministically, ensuring repeated requests with the same seed and parameters return the same result. Determinism is not guaranteed.
  • stop: Up to 4 sequences where the API will stop generating further tokens.
  • stream: If set, partial message deltas will be sent, similar to ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.
  • system_prompt: Details not available, please refer to the LLM provider documentation.
  • temperature: Controls the randomness of the AI's responses. A higher temperature results in more random outputs, while a lower temperature makes the output more focused and deterministic. Generally, it is recommended to alter this or top_p, but not both.
  • tfs_z: Details not available, please refer to the LLM provider documentation.
  • top_k: The number of highest probability vocabulary tokens to keep for top-k sampling.
  • top_p: Controls the cumulative probability of token selections for nucleus sampling. It limits the tokens to the smallest set whose cumulative probability exceeds the threshold. It is recommended to alter this or temperature, but not both.
  • typical_p: Details not available, please refer to the LLM provider documentation.
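The sketch below combines several of the parameters above in a single request. The values are illustrative only, and the third-argument options object follows the same assumption as the earlier examples.

const { LLMInterface } = require('llm-interface');

async function tunedRequest() {
  const response = await LLMInterface.sendMessage(
    'llamacpp',
    'Write a haiku about quantized models.',
    {
      max_tokens: 100,     // converted to n_predict for llama.cpp
      temperature: 0.7,    // alter this or top_p, but not both
      top_k: 40,           // keep the 40 most likely tokens at each step
      repeat_penalty: 1.1, // llama.cpp repetition control
      stop: ['\n\n'],      // stop at the first blank line
      seed: 42,            // best-effort determinism
    },
  );
  console.log(response.results);
}

tunedRequest();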

Features

  • Streaming
  • Embeddings
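A rough sketch of requesting embeddings follows. It assumes the package exposes an LLMInterface.embeddings() helper (check the llm-interface documentation for the exact call), and it requires the llama.cpp server to have been started with the --embedding argument and an appropriate model, as noted above.

const { LLMInterface } = require('llm-interface');

async function embed() {
  // Assumes an embeddings helper of this shape; the llama.cpp server must be
  // started with --embedding and an embedding-capable model for this to work.
  const embedding = await LLMInterface.embeddings(
    'llamacpp',
    'Explain the importance of low latency LLMs.',
  );
  console.log(embedding.results);
}

embed();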

Getting an API Key

No API Key (Local URL): This is not a traditional hosted API, so no API key is required. However, a URL pointing to your running llama.cpp server is required to use this service. (Ensure you have the matching models installed locally.)

There is no LLaMA.CPP account to create; simply start a local llama.cpp server (by default on port 8080) and point the interface at its URL.

LLaMA.CPP documentation is available here.