Cortex Design Doc
Cortex is a C++ application designed to run and customize machine learning models on any hardware. It provides a lightweight, high-performance runtime for inference, supporting multiple models and custom backends: you build the logic in your favorite tool, and Cortex does the rest.
AI has taken the world by storm since the release of GPT-3, Stable Diffusion, and other models, both in consumer products and in enterprise-level applications, but it hasn't yet made it into the mainstream of robotics. There is no software yet, open source or proprietary, that can serve as the brain of a robot (the pip install a-robots-brain of the field), and that's what we will shape Cortex to be:
The open-source brain for robots: vision, speech, language, tabular, and action -- the cloud is optional.
At its core, Cortex offers the following functionalities:
- A server for communication between the host and the instruction provider
- An inference engine
- Model management capabilities
- Storage
Model Management and Inference
- Users can download models from external sources (Hugging Face, custom repositories).
- Users can store models in a structured, efficient format.
- Multiple models can be loaded and switched dynamically.
- Users can execute inference requests via CLI or REST API.
- Support for different quantization strategies and formats (GGUF, FP16, INT8, etc.).
- Performance optimizations using CPU and GPU acceleration.
API
- Expose REST endpoints similar to OpenAI’s API (example request below):
  - /v1/chat/completions
  - /v1/embeddings
  - /v1/fine_tuning
- Support structured outputs and function calling.
- Users can check model load status and resource usage.
- Provide telemetry on memory and GPU utilization.
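As a sketch of the OpenAI-compatible surface, a chat completion request could be issued with curl as below. The port (39281) and the model name are assumptions for illustration; use whatever your local server is configured with.

```sh
# Hedged example: assumes a Cortex server is running locally on port 39281
# and that the llama3.2 model has already been pulled and loaded.
curl http://localhost:39281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Summarize this design doc in one sentence."}
    ]
  }'
```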
Installation
- Cross-platform installation via standalone executables.
- Prebuilt binaries for Linux (Deb, Arch), macOS, and Windows.
Non-Functional Requirements
- Must handle 7B models with at least 8GB of RAM (at 4-bit quantization, 7B weights occupy roughly 3.5 GB, leaving headroom for the KV cache and runtime).
- Response times for inference should be under 500ms for small queries.
- Local execution ensures no data is transmitted externally.
- Secure API with optional authentication.
- Multi-threaded execution to utilize available CPU cores.
- Future support for distributed inference across devices.
Storage
- Downloaded models are stored on-device.
- Interactions with models are stored in a SQLite (or SQLite-flavored) database.
- Metrics and embeddings are stored in a separate analytical database (e.g., LanceDB or libSQL).
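As an illustration of this storage design, the sketch below peeks into the interaction store with the sqlite3 CLI. The database path and the `interactions` table are hypothetical; the actual file layout and schema are implementation details.

```sh
# Hypothetical path and schema; purely illustrative, not the shipped layout.
sqlite3 ~/.cortex/cortex.db \
  "SELECT model, prompt, created_at FROM interactions ORDER BY created_at DESC LIMIT 5;"
```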
Cortex consists of three main layers:
- CLI / REST API Interface – Handles user interactions.
- Engine Layer – Loads models, manages execution, and optimizes runtime.
- Inference Backend – Executes computations using different backends (llama.cpp, ONNX Runtime).
Command-Line Interface (CLI)
- Commands: cortex pull, cortex run, cortex ps. Provides simplified management of models.
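A minimal session with these commands might look like the following (the model name matches the example later in this doc; the comments are descriptive, not literal output):

```sh
cortex pull llama3.2   # download the model to local storage
cortex run llama3.2    # load the model and start serving it
cortex ps              # list running models
```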
API Server
- Runs as a local server, exposing AI capabilities via HTTP.
Engine Layer
- Manages model loading, unloading, and switching.
- Uses optimized quantized formats for faster inference.
- Supports multiple inference engines (e.g., llama.cpp).
- GPU acceleration where applicable.
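Because the server mirrors OpenAI’s protocol, an embeddings request follows the same shape; as before, the port and model name are assumptions for illustration:

```sh
# Hedged sketch: OpenAI-style embeddings request against a local Cortex server.
curl http://localhost:39281/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "input": "robots need brains"}'
```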
Installer Types
- Local Installer: Standalone package with all dependencies bundled.
- Network Installer: Downloads dependencies dynamically.
Supported Platforms
- Linux: .deb package, .tar.gz for Arch, generic shell script.
- macOS: .pkg installer.
- Windows: .exe installer.
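For example, installing on Linux might look like the following; the artifact names are placeholders, not actual release file names:

```sh
# Hypothetical artifact names; substitute the real release files.
sudo dpkg -i cortex-<version>-amd64.deb    # Debian/Ubuntu
tar -xzf cortex-<version>-linux.tar.gz     # Arch / generic Linux
```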
Use Cases
- Robot manipulation
- On-device AI service
A developer downloads a Llama3 model and runs a chatbot locally using:

```sh
cortex pull llama3.2
cortex run llama3.2
```
A developer integrates Cortex.cpp into a VS Code extension for offline code suggestions.
A researcher runs a fine-tuned AI model to analyze and categorize documents.
Future Work
- Model fine-tuning support.
- Integration with AMD and Apple Silicon.
- Support for multi-modal AI (text, audio, vision).