IPFS_Accelerate_Py

This is a vLLM/OpenVINO/Hugging Face model-server endpoint multiplexer, a local model server, and a libp2p connector. Its purpose is to handle inference requests that come from a node on the same computer, from a node holding the same DID key, or, if the user is feeling generous, from the general public.

The local model server supports Hugging Face Transformers on CUDA or CPU, llama_cpp, and Intel OpenVINO. It reads a model's config.json to determine the model architecture, then uses a "skillset" directory to attach an "endpoint handler" method to the server object. The endpoint handler is keyed by a two-dimensional table of "model_name" and "endpoint" (e.g. "cuda:0" or "http://127.0.0.1/embed"); sketches of this flow follow below. For OpenVINO/optimum-intel the endpoint specifies which pipeline to use, and the models are compiled for the specific Intel hardware (CPU, GPU, or NPU). For llama_cpp there is also a converter that quantizes the models; in both the llama and OpenVINO cases the default bit width is typically 4 bits per weight.
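A minimal sketch of the config.json detection and skillset dispatch described above. All names here (`detect_architecture`, `SKILLSET`, `attach`) are hypothetical illustrations of the mechanism, not the package's actual API:

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict

def detect_architecture(model_dir: str) -> str:
    # Most Hugging Face checkpoints list their class in config.json under
    # "architectures", e.g. ["BertModel"] or ["LlamaForCausalLM"].
    config = json.loads(Path(model_dir, "config.json").read_text())
    return config.get("architectures", ["unknown"])[0]

# Hypothetical stand-in for the "skillset" directory: one handler factory
# per architecture family, registered by architecture name.
SKILLSET: Dict[str, Callable[[str], Callable[..., Any]]] = {}

def skill(arch: str):
    def register(factory):
        SKILLSET[arch] = factory
        return factory
    return register

@skill("BertModel")
def make_bert_handler(endpoint: str) -> Callable[..., Any]:
    def handler(text: str):
        # A real handler would run inference against `endpoint`
        # ("cuda:0", "cpu", or a URL such as "http://127.0.0.1/embed").
        return {"endpoint": endpoint, "input": text}
    return handler

# The two-dimensional table the endpoint handler is keyed by:
# endpoint_handler[model_name][endpoint] -> callable.
endpoint_handler: Dict[str, Dict[str, Callable[..., Any]]] = {}

def attach(model_name: str, model_dir: str, endpoint: str) -> None:
    arch = detect_architecture(model_dir)
    endpoint_handler.setdefault(model_name, {})[endpoint] = SKILLSET[arch](endpoint)
```

After `attach("my-bert", "models/my-bert", "cuda:0")`, a request would be served via `endpoint_handler["my-bert"]["cuda:0"](text)`.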
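For the OpenVINO path, here is a hedged sketch using optimum-intel's published API, showing 4-bit weight quantization and device-specific compilation. The model id is an assumed example; this page doesn't state which models or pipeline classes the server actually instantiates:

```python
from optimum.intel import OVModelForFeatureExtraction, OVWeightQuantizationConfig

# Export the PyTorch checkpoint to OpenVINO IR, quantizing weights to the
# 4-bit default mentioned above, then compile for an Intel device.
model = OVModelForFeatureExtraction.from_pretrained(
    "BAAI/bge-small-en-v1.5",  # assumed example model, not from this page
    export=True,               # convert from the original checkpoint
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.to("GPU")   # target device: "CPU", "GPU", or "NPU"
model.compile()   # compile the graph for that device ahead of time
```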
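The page doesn't show the llama converter's interface; the following sketch assumes it follows the standard two-step llama.cpp flow (GGUF export, then quantization to a ~4-bit scheme such as Q4_K_M). The script and binary names come from the upstream llama.cpp repo, and all paths are placeholders:

```python
import subprocess

# Step 1: export the Hugging Face checkpoint to GGUF (llama.cpp's format).
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "models/my-model",
     "--outfile", "my-model-f16.gguf"],
    check=True,
)

# Step 2: quantize to roughly 4 bits per weight with llama.cpp's tool.
subprocess.run(
    ["llama-quantize", "my-model-f16.gguf", "my-model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```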
