IPFS_Accelerate_Py

This is a vLLM/OpenVINO/Hugging Face model-server endpoint multiplexer, a local model server, and a libp2p connector. Its purpose is to handle inference requests that come from a node on the same computer, from a node holding the same DID key, or, if the user is feeling generous, from the general public.

The local model server supports Hugging Face Transformers on CUDA or CPU, llama_cpp, and Intel OpenVINO. It reads a model's config.json to determine the model architecture, then uses a "skillset" directory to attach an "endpoint handler" method to the server object. The endpoint handler is keyed by a two-dimensional table of "model_name" and "endpoint" (e.g. "cuda:0" or "http://127.0.0.1/embed"); sketches of this flow follow below. For OpenVINO/optimum-intel the endpoint specifies which pipeline to use, and the models are compiled for the specific Intel hardware (CPU, GPU, or NPU). For llama_cpp there is also a converter that quantizes the models; in both the llama and OpenVINO cases the default bit width is typically 4 bits per weight.
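A minimal sketch of the config.json detection and skillset dispatch described above. All names here (`detect_architecture`, `SKILLSET`, `attach`) are hypothetical illustrations of the mechanism, not the package's actual API:

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict

def detect_architecture(model_dir: str) -> str:
    # Most Hugging Face checkpoints list their class in config.json under
    # "architectures", e.g. ["BertModel"] or ["LlamaForCausalLM"].
    config = json.loads(Path(model_dir, "config.json").read_text())
    return config.get("architectures", ["unknown"])[0]

# Hypothetical stand-in for the "skillset" directory: one handler factory
# per architecture family, registered by architecture name.
SKILLSET: Dict[str, Callable[[str], Callable[..., Any]]] = {}

def skill(arch: str):
    def register(factory):
        SKILLSET[arch] = factory
        return factory
    return register

@skill("BertModel")
def make_bert_handler(endpoint: str) -> Callable[..., Any]:
    def handler(text: str):
        # A real handler would run inference against `endpoint`
        # ("cuda:0", "cpu", or a URL such as "http://127.0.0.1/embed").
        return {"endpoint": endpoint, "input": text}
    return handler

# The two-dimensional table the endpoint handler is keyed by:
# endpoint_handler[model_name][endpoint] -> callable.
endpoint_handler: Dict[str, Dict[str, Callable[..., Any]]] = {}

def attach(model_name: str, model_dir: str, endpoint: str) -> None:
    arch = detect_architecture(model_dir)
    endpoint_handler.setdefault(model_name, {})[endpoint] = SKILLSET[arch](endpoint)
```

After `attach("my-bert", "models/my-bert", "cuda:0")`, a request would be served via `endpoint_handler["my-bert"]["cuda:0"](text)`.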
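For the OpenVINO path, here is a hedged sketch using optimum-intel's published API, showing 4-bit weight quantization and device-specific compilation. The model id is an assumed example; this page doesn't state which models or pipeline classes the server actually instantiates:

```python
from optimum.intel import OVModelForFeatureExtraction, OVWeightQuantizationConfig

# Export the PyTorch checkpoint to OpenVINO IR, quantizing weights to the
# 4-bit default mentioned above, then compile for an Intel device.
model = OVModelForFeatureExtraction.from_pretrained(
    "BAAI/bge-small-en-v1.5",  # assumed example model, not from this page
    export=True,               # convert from the original checkpoint
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.to("GPU")   # target device: "CPU", "GPU", or "NPU"
model.compile()   # compile the graph for that device ahead of time
```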
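The page doesn't show the llama converter's interface; the following sketch assumes it follows the standard two-step llama.cpp flow (GGUF export, then quantization to a ~4-bit scheme such as Q4_K_M). The script and binary names come from the upstream llama.cpp repo, and all paths are placeholders:

```python
import subprocess

# Step 1: export the Hugging Face checkpoint to GGUF (llama.cpp's format).
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "models/my-model",
     "--outfile", "my-model-f16.gguf"],
    check=True,
)

# Step 2: quantize to roughly 4 bits per weight with llama.cpp's tool.
subprocess.run(
    ["llama-quantize", "my-model-f16.gguf", "my-model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```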
