Releases: madroidmaq/mlx-omni-server
v0.3.1
What's New
- Support more MLX inference parameters, such as adapter_path, top_k, min_tokens_to_keep, min_p, presence_penalty, etc. - closes #12
Usage Examples
OpenAI SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",  # MLX Omni Server endpoint
    api_key="not-needed"
)

# Pass adapter_path through extra_body
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[
        {"role": "user", "content": "What's the weather like today?"}
    ],
    extra_body={
        "adapter_path": "path/to/your/adapter",  # Path to fine-tuned adapter
    }
)
cURL:
curl http://localhost:10240/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "What'\''s the weather like today?"
      }
    ],
    "adapter_path": "path/to/your/adapter"
  }'
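The remaining sampling parameters can be passed the same way; a minimal sketch, assuming top_k, min_p, and min_tokens_to_keep are accepted as top-level extra_body fields alongside adapter_path (values are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

# presence_penalty is a standard OpenAI field; the MLX-specific sampling
# parameters are assumed to go through extra_body, like adapter_path above.
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Write a haiku about autumn."}],
    presence_penalty=0.5,
    extra_body={
        "top_k": 40,
        "min_p": 0.05,
        "min_tokens_to_keep": 1,
    },
)
print(response.choices[0].message.content)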
Full Changelog: v0.3.0...v0.3.1
v0.3.0
What's Changed
- Support Structured Output by @madroidmaq in #11
Structured Output Examples
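A minimal sketch of requesting structured output through the OpenAI SDK, assuming the server follows the OpenAI-style response_format convention (the schema and prompt are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

# Ask the model to answer as JSON matching a simple schema.
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "List three primary colors."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "colors",
            "schema": {
                "type": "object",
                "properties": {
                    "colors": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["colors"],
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"colors": ["red", "blue", "yellow"]}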
Full Changelog: v0.2.1...v0.3.0
v0.2.1
What's Changed
- Support parallel invocation via the worker parameter by @madroidmaq in #10
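With multiple workers, concurrent requests can be served in parallel; a rough client-side sketch (the worker count itself is configured when launching the server, and the model and prompts here are illustrative):

from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

def ask(prompt: str) -> str:
    # Each call is an independent request that a separate worker can serve.
    response = client.chat.completions.create(
        model="mlx-community/Llama-3.2-1B-Instruct-4bit",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, ["Hello!", "What is MLX?", "Tell me a joke."]):
        print(answer)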
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Key Features
- Enhanced Function Calling (Tools) parsing accuracy to mitigate LLM output instability issues
- Added model caching support to eliminate reload time when using the same model multiple times
Function Calling test results on the madroid/glaive-function-calling-openai dataset:
For the Llama 3.2 3B 4-bit model:
- Accuracy improved from 2.9% to 99.6%
- Average latency reduced from 10.81s to 4.24s
For the Qwen2.5 3B 4-bit model:
- Accuracy improved from 48.4% to 99.0%
- Average latency reduced from 13.22s to 4.89s
Performance comparison with Ollama:
- MLX achieves higher TPS (77.6) compared to Ollama (57.6)
- 34.7% speed advantage while generating more tokens
Example: Web Search with Function Calling
Thanks to the significant improvement in function calling accuracy, you can now perform web searches with a phidata web agent even with a 4-bit quantized 3B model. Here's how it works:
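One possible setup, assuming phidata's OpenAIChat model wrapper and its DuckDuckGo tool (the model id and prompt are illustrative):

from phi.agent import Agent
from phi.model.openai import OpenAIChat
from phi.tools.duckduckgo import DuckDuckGo

# Point phidata at the local MLX Omni Server instead of the OpenAI API.
agent = Agent(
    model=OpenAIChat(
        id="mlx-community/Llama-3.2-3B-Instruct-4bit",
        base_url="http://localhost:10240/v1",
        api_key="not-needed",
    ),
    tools=[DuckDuckGo()],
    show_tool_calls=True,
)

agent.print_response("What's the latest news about Apple's MLX framework?")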
New Features
- Added prefill response support for pre-populating LLM outputs
- Implemented stream_options for token statistics in streaming responses (see the sketch after this list)
- Added support for configuring custom stop tokens
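A brief sketch of the streaming additions via the OpenAI SDK, assuming the standard stream_options and stop fields (values are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Count from one to five."}],
    stream=True,
    stream_options={"include_usage": True},  # request token statistics
    stop=["6"],  # illustrative custom stop token
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:  # usage arrives on the final chunk when include_usage is set
        print("\n", chunk.usage)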
Improvements
- Reorganized code structure for better maintainability
- Added more code examples
Full Changelog: v0.1.2...v0.2.0
v0.1.2
What's Changed
- Remove the printing of SSE event Body content by @madroidmaq in #5
Full Changelog: https://github.com/madroidmaq/mlx-omni-server/commits/v0.1.2
v0.1.1
MLX Omni Server v0.1.1
MLX Omni Server is a local inference server powered by Apple's MLX framework, specifically designed for Apple Silicon (M-series) chips. It implements OpenAI-compatible API endpoints, enabling seamless integration with existing OpenAI SDK clients while leveraging the power of local ML inference.
🚀 Key Features
OpenAI Compatible API Endpoints
- /v1/chat/completions - Support for chat, tools/function calling, and LogProbs
- /v1/audio/speech - Text-to-Speech capabilities
- /v1/audio/transcriptions - Speech-to-Text processing
- /v1/models - Model listing and management
- /v1/images/generations - Image generation
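All of these sit behind the same base URL, so the official OpenAI SDK can drive them directly; a brief sketch (the TTS model id and voice are illustrative and depend on the models you have available):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

# List the models the server currently exposes (/v1/models).
print([m.id for m in client.models.list().data])

# Text-to-Speech against /v1/audio/speech.
speech = client.audio.speech.create(
    model="lucasnewman/f5-tts-mlx",  # illustrative TTS model id
    voice="alloy",                   # illustrative voice name
    input="Hello from MLX Omni Server!",
)
speech.write_to_file("hello.wav")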
Core Capabilities
- Optimized for Apple Silicon (M1/M2/M3 series) chips
- Full local inference for privacy
- Multiple AI capabilities:
- Audio Processing (TTS & STT)
- Chat Completion
- Image Generation
- High Performance with hardware-accelerated local inference
- Privacy-First: All processing happens locally on your machine
- SDK Support: Works with official OpenAI SDK and other compatible clients