
Releases: madroidmaq/mlx-omni-server

v0.3.1

06 Jan 18:17

What's New

  • Support for more MLX inference parameters, such as adapter_path, top_k, min_tokens_to_keep, min_p, and presence_penalty; see the usage examples below
  • Closes #12

Usage Examples

OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",  # MLX Omni Server endpoint
    api_key="not-needed"
)

# Pass adapter_path via extra_body
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[
        {"role": "user", "content": "What's the weather like today?"}
    ],
    extra_body={
        "adapter_path": "path/to/your/adapter",  # Path to fine-tuned adapter
    }
)
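
The new sampling parameters go through the same extra_body mechanism. A minimal sketch, assuming each parameter is accepted as a top-level request field (the values here are illustrative):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)

# Sampling parameters forwarded to the MLX backend via extra_body
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[
        {"role": "user", "content": "Tell me a short joke."}
    ],
    extra_body={
        "top_k": 40,
        "min_p": 0.05,
        "min_tokens_to_keep": 1,
        "presence_penalty": 0.5,
    }
)
print(response.choices[0].message.content)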

curl:

curl http://localhost:10240/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "What'\''s the weather like today?"
      }
    ],
    "adapter_path": "path/to/your/adapter"
  }'

Full Changelog: v0.3.0...v0.3.1

v0.3.0

04 Jan 16:14

What's Changed

Structured Output Examples

(Screenshots attached to the release: a code example and its structured output result via phidata.)
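
For reference, a structured-output request can be issued through the OpenAI SDK. This is a sketch that assumes the server honors OpenAI's response_format JSON mode; the prompt and expected keys are illustrative:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)

# Ask for JSON output; response_format is the standard OpenAI JSON mode
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[
        {"role": "user", "content": "List three MLX features as a JSON object with a 'features' array of name/description entries."}
    ],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)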

Full Changelog: v0.2.1...v0.3.0

v0.2.1

19 Dec 17:01

What's Changed

Full Changelog: v0.2.0...v0.2.1

v0.2.0

16 Dec 15:26

Key Features

  • Enhanced Function Calling (Tools) parsing accuracy to mitigate LLM output instability issues (see the example below)
  • Added model caching support to eliminate reload time when using the same model multiple times
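
Function calling goes through the standard OpenAI tools parameter. A minimal sketch; the weather tool is a hypothetical example, not part of the release:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)

# Hypothetical tool definition in the standard OpenAI schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)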

Function Calling test results using the madroid/glaive-function-calling-openai dataset:

For the Llama 3.2 3B 4-bit model:

  • Accuracy improved from 2.9% to 99.6%
  • Average latency reduced from 10.81s to 4.24s

For the Qwen2.5 3B 4-bit model:

  • Accuracy improved from 48.4% to 99.0%
  • Average latency reduced from 13.22s to 4.89s

Performance comparison with Ollama:

  • MLX achieves higher throughput (77.6 TPS) than Ollama (57.6 TPS)
  • A 34.7% speed advantage while generating more tokens

Example: Web Search with Function Calling
Thanks to the significant improvement in function calling accuracy, you can now perform web searches with a phidata web agent even with a 4-bit quantized 3B model. Here's how it works:

Implementation: (screenshot)

Result: (screenshot)
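
A web-search agent along these lines can be assembled with phidata. This is a sketch, not the exact setup from the screenshots; the DuckDuckGo tool and model id are assumptions:

from phi.agent import Agent
from phi.model.openai import OpenAIChat
from phi.tools.duckduckgo import DuckDuckGo

# Point phidata's OpenAI-compatible model at the local server
agent = Agent(
    model=OpenAIChat(
        id="mlx-community/Llama-3.2-3B-Instruct-4bit",
        base_url="http://localhost:10240/v1",
        api_key="not-needed",
    ),
    tools=[DuckDuckGo()],  # web search exposed as a function call
    show_tool_calls=True,
)

agent.print_response("What's the latest news about Apple Silicon?")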

New Features

  • Added prefill response support for pre-populating LLM outputs
  • Implemented stream_options for token statistics in stream responses
  • Added support for custom stop tokens configuration (all three features are sketched below)
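
A sketch combining the three new features. The assumption here is that a trailing assistant message is treated as the prefill prefix the model continues from; stream_options and stop follow the standard OpenAI shapes:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[
        {"role": "user", "content": "List three colors."},
        # Assumed prefill convention: a trailing assistant message
        # whose content the model continues from
        {"role": "assistant", "content": "1."},
    ],
    stop=["\n\n"],  # custom stop tokens
    stream=True,
    stream_options={"include_usage": True},  # token statistics in the stream
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:  # final chunk carries the usage statistics
        print("\n", chunk.usage)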

Improvements

  • Reorganized code structure for better maintainability
  • Added more code examples

Full Changelog: v0.1.2...v0.2.0

v0.1.2

07 Dec 18:10

What's Changed

  • Remove printing of SSE event body content by @madroidmaq in #5

Full Changelog: https://github.com/madroidmaq/mlx-omni-server/commits/v0.1.2

v0.1.1

05 Dec 13:01

MLX Omni Server v0.1.1

MLX Omni Server is a local inference server powered by Apple's MLX framework, specifically designed for Apple Silicon (M-series) chips. It implements OpenAI-compatible API endpoints, enabling seamless integration with existing OpenAI SDK clients while leveraging the power of local ML inference.
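
Installation and startup, assuming the PyPI package and its console script keep the project name (the server listens on port 10240 in the examples throughout these notes):

pip install mlx-omni-server
mlx-omni-server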

🚀 Key Features

OpenAI Compatible API Endpoints

  • /v1/chat/completions - Support for chat, tools/function calling, and LogProbs
  • /v1/audio/speech - Text-to-Speech capabilities
  • /v1/audio/transcriptions - Speech-to-Text processing
  • /v1/models - Model listing and management
  • /v1/images/generations - Image generation
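
All of these endpoints are reachable with the official OpenAI SDK pointed at the local server. A quick sketch; the TTS/STT model names and the voice are placeholders, swap in whichever MLX models you use:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)

# Chat completion
chat = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Text-to-Speech: write the synthesized audio to a file
speech = client.audio.speech.create(
    model="tts-model-placeholder",
    voice="alloy",
    input="Hello from MLX Omni Server",
)
speech.write_to_file("hello.wav")

# Speech-to-Text: transcribe the file we just created
with open("hello.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="stt-model-placeholder", file=f
    )

# List the models the server can serve
for m in client.models.list():
    print(m.id)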

Core Capabilities

  • Optimized for Apple Silicon (M1/M2/M3 series) chips
  • Full local inference for privacy
  • Multiple AI capabilities:
    • Audio Processing (TTS & STT)
    • Chat Completion
    • Image Generation
  • High Performance with hardware-accelerated local inference
  • Privacy-First: All processing happens locally on your machine
  • SDK Support: Works with official OpenAI SDK and other compatible clients