Releases: madroidmaq/mlx-omni-server
v0.3.1
What's New
- Support more MLX inference parameters, such as adapter_path, top_k, min_tokens_to_keep, min_p, presence_penalty, etc. - closes #12
Usage Examples
OpenAI SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",  # MLX Omni Server endpoint
    api_key="not-needed"
)

# Pass adapter_path through extra_body
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[
        {"role": "user", "content": "What's the weather like today?"}
    ],
    extra_body={
        "adapter_path": "path/to/your/adapter",  # Path to fine-tuned adapter
    }
)
cURL:
curl http://localhost:10240/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "What'\''s the weather like today?"
      }
    ],
    "adapter_path": "path/to/your/adapter"
  }'
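The remaining sampling parameters can be passed the same way; a minimal sketch, assuming top_k, min_p, and min_tokens_to_keep are accepted as top-level extra_body fields alongside adapter_path (values are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

# presence_penalty is a standard OpenAI field; the MLX-specific sampling
# parameters are assumed to go through extra_body, like adapter_path above.
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Write a haiku about autumn."}],
    presence_penalty=0.5,
    extra_body={
        "top_k": 40,
        "min_p": 0.05,
        "min_tokens_to_keep": 1,
    },
)
print(response.choices[0].message.content)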
Full Changelog: v0.3.0...v0.3.1
v0.3.0
What's Changed
- Support Structured Output by @madroidmaq in #11
Structured Output Examples
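A minimal sketch of requesting structured output through the OpenAI SDK, assuming the server follows the OpenAI-style response_format convention (the schema and prompt are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

# Ask the model to answer as JSON matching a simple schema.
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "List three primary colors."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "colors",
            "schema": {
                "type": "object",
                "properties": {
                    "colors": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["colors"],
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"colors": ["red", "blue", "yellow"]}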
Full Changelog: v0.2.1...v0.3.0
v0.2.1
What's Changed
- Support parallel invocation via the worker parameter by @madroidmaq in #10
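With multiple workers, concurrent requests can be served in parallel; a rough client-side sketch (the worker count itself is configured when launching the server, and the model and prompts here are illustrative):

from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

def ask(prompt: str) -> str:
    # Each call is an independent request that a separate worker can serve.
    response = client.chat.completions.create(
        model="mlx-community/Llama-3.2-1B-Instruct-4bit",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, ["Hello!", "What is MLX?", "Tell me a joke."]):
        print(answer)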
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Key Features
- Enhanced Function Calling (Tools) parsing accuracy to mitigate LLM output instability issues
- Added model caching support to eliminate reload time when using the same model multiple times
Function Calling test results on the madroid/glaive-function-calling-openai dataset:
For the Llama 3.2 3B 4-bit model:
- Accuracy improved from 2.9% to 99.6%
- Average latency reduced from 10.81s to 4.24s
For the Qwen2.5 3B 4-bit model:
- Accuracy improved from 48.4% to 99.0%
- Average latency reduced from 13.22s to 4.89s
Performance comparison with Ollama:
- MLX achieves higher TPS (77.6) compared to Ollama (57.6)
- 34.7% speed advantage while generating more tokens
Example: Web Search with Function Calling
Thanks to the significant improvement in function calling accuracy, you can now perform web searches with a phidata web agent even with a 4-bit quantized 3B model. Here's how it works:
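One possible setup, assuming phidata's OpenAIChat model wrapper and its DuckDuckGo tool (the model id and prompt are illustrative):

from phi.agent import Agent
from phi.model.openai import OpenAIChat
from phi.tools.duckduckgo import DuckDuckGo

# Point phidata at the local MLX Omni Server instead of the OpenAI API.
agent = Agent(
    model=OpenAIChat(
        id="mlx-community/Llama-3.2-3B-Instruct-4bit",
        base_url="http://localhost:10240/v1",
        api_key="not-needed",
    ),
    tools=[DuckDuckGo()],
    show_tool_calls=True,
)

agent.print_response("What's the latest news about Apple's MLX framework?")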
New Features
- Added prefill response support for pre-populating LLM outputs
- Implemented stream_options for token statistics in streaming responses (see the sketch after this list)
- Added support for configuring custom stop tokens
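A brief sketch of the streaming additions via the OpenAI SDK, assuming the standard stream_options and stop fields (values are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Count from one to five."}],
    stream=True,
    stream_options={"include_usage": True},  # request token statistics
    stop=["6"],  # illustrative custom stop token
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:  # usage arrives on the final chunk when include_usage is set
        print("\n", chunk.usage)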
Improvements
- Reorganized code structure for better maintainability
- Added more code examples
Full Changelog: v0.1.2...v0.2.0
v0.1.2
What's Changed
- Remove the printing of SSE event Body content by @madroidmaq in #5
Full Changelog: https://github.com/madroidmaq/mlx-omni-server/commits/v0.1.2
v0.1.1
MLX Omni Server v0.1.1
MLX Omni Server is a local inference server powered by Apple's MLX framework, specifically designed for Apple Silicon (M-series) chips. It implements OpenAI-compatible API endpoints, enabling seamless integration with existing OpenAI SDK clients while leveraging the power of local ML inference.
🚀 Key Features
OpenAI Compatible API Endpoints
- /v1/chat/completions - Support for chat, tools/function calling, and LogProbs
- /v1/audio/speech - Text-to-Speech capabilities
- /v1/audio/transcriptions - Speech-to-Text processing
- /v1/models - Model listing and management
- /v1/images/generations - Image generation
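All of these sit behind the same base URL, so the official OpenAI SDK can drive them directly; a brief sketch (the TTS model id and voice are illustrative and depend on the models you have available):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

# List the models the server currently exposes (/v1/models).
print([m.id for m in client.models.list().data])

# Text-to-Speech against /v1/audio/speech.
speech = client.audio.speech.create(
    model="lucasnewman/f5-tts-mlx",  # illustrative TTS model id
    voice="alloy",                   # illustrative voice name
    input="Hello from MLX Omni Server!",
)
speech.write_to_file("hello.wav")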
Core Capabilities
- Optimized for Apple Silicon (M1/M2/M3 series) chips
- Full local inference for privacy
- Multiple AI capabilities:
- Audio Processing (TTS & STT)
- Chat Completion
- Image Generation
- High Performance with hardware-accelerated local inference
- Privacy-First: All processing happens locally on your machine
- SDK Support: Works with official OpenAI SDK and other compatible clients