Cortex Design Doc

Abstract

Cortex is a C++ application designed to run and customize machine learning models on any hardware. It provides a lightweight, high-performance runtime for inference, supporting multiple models and custom backends: you build the logic in your favorite tool, and Cortex does the rest.

Motivation

AI has taken the world by storm since the release of GPT-3, Stable Diffusion, and other models, in both consumer products and enterprise-level applications, but it hasn't yet made it into the mainstream of robotics. There is no software (open source or proprietary) that can serve as the brain of a robot (no pip install a-robots-brain), and that's what we will shape Cortex to be:

The open-source brain for robots: vision, speech, language, tabular, and action -- the cloud is optional.

Functional Requirements

At its core, Cortex offers the following functionalities:

  • A server for communication between the host and the instruction provider
  • An inference engine
  • Model management capabilities
  • Storage

Model Management

  • Users can download models from external sources (Hugging Face, custom repositories); a download sketch follows this list.
  • Users can store models in a structured, efficient format.
  • Multiple models can be loaded and switched dynamically.
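
As a hedged illustration of the download path, the snippet below fetches a quantized model file from the Hugging Face Hub using the huggingface_hub library; the repository and filename are illustrative placeholders, not Cortex defaults.

```python
# Sketch: fetching a quantized model file from Hugging Face.
# The repo_id and filename below are illustrative placeholders.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",   # assumed repository
    filename="llama-2-7b.Q4_K_M.gguf",    # assumed quantized file
)
print(f"model stored at: {path}")
```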

Inference Execution

  • Users can execute inference requests via CLI or REST API.
  • Support for different quantization strategies (GGUF, FP16, INT8, etc.).
  • Performance optimizations using CPU and GPU acceleration.

API Compatibility

  • Expose REST endpoints similar to OpenAI’s API:
    • /v1/chat/completions
    • /v1/embeddings
    • /v1/fine_tuning
  • Support structured outputs and function calling.
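
As a minimal sketch of what this compatibility implies, the request below sends an OpenAI-style chat completion to a local Cortex server; the host, port, and model name are assumptions, not confirmed defaults.

```python
# Sketch: an OpenAI-style chat request against a local Cortex server.
# The host, port, and model name are assumptions, not confirmed defaults.
import requests

resp = requests.post(
    "http://localhost:39281/v1/chat/completions",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Hello, Cortex!"}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the request and response shapes match OpenAI's API, existing client libraries can typically be pointed at the local server by overriding the base URL.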

System Monitoring

  • Users can check model load status and resource usage.
  • Provide telemetry on memory and GPU utilization.
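
As a hedged sketch of what memory telemetry could look like, the snippet below samples host RAM with psutil; psutil is a stand-in here, since the document does not specify Cortex's telemetry mechanism.

```python
# Sketch: sampling host memory usage for telemetry.
# psutil is a stand-in; Cortex's actual telemetry layer is unspecified.
import psutil

vm = psutil.virtual_memory()
print(f"RAM used: {vm.used / 2**30:.1f} GiB of {vm.total / 2**30:.1f} GiB")
# GPU utilization would come from vendor tooling (e.g., NVML on NVIDIA),
# which is outside the scope of this sketch.
```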

Platform Support

  • Cross-platform installation via standalone executables.
  • Prebuilt binaries for Linux (Deb, Arch), macOS, and Windows.

Nonfunctional Requirements

Performance

  • Must run 7B-parameter models on systems with at least 8 GB of RAM.
  • Inference response times should be under 500 ms for small queries; a timing sketch follows.
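
One way to check the latency target, sketched under the same assumed local endpoint as above (URL, port, and payload are placeholders):

```python
# Sketch: timing a small inference request against the local server.
# The endpoint, port, and payload are assumptions for illustration.
import time
import requests

start = time.perf_counter()
requests.post(
    "http://localhost:39281/v1/chat/completions",
    json={"model": "llama3.2",
          "messages": [{"role": "user", "content": "ping"}]},
    timeout=5,
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"latency: {elapsed_ms:.0f} ms (target: < 500 ms)")
```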

Security

  • Local execution ensures no data is transmitted externally.
  • Secure API with optional authentication.

Scalability

  • Multi-threaded execution to utilize available CPU cores.
  • Future support for distributed inference across devices.

Storage

  • Downloaded models are stored on-device.
  • Interactions with models are stored in a SQLite (or a flavor of SQLite) database; a schema sketch follows this list.
  • Metrics and embeddings are stored in a separate analytical database (e.g., LanceDB or libSQL).
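
A minimal sketch of what the interactions table could look like in SQLite; the table and column names are assumptions, not Cortex's actual schema.

```python
# Illustrative sketch: a possible SQLite schema for model interactions.
# Table and column names are assumptions, not Cortex's actual schema.
import sqlite3

conn = sqlite3.connect("cortex.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS interactions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        model TEXT NOT NULL,
        prompt TEXT NOT NULL,
        response TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.commit()
conn.close()
```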

System Architecture

High-Level Design

Cortex consists of three main layers:

  • CLI / REST API Interface – Handles user interactions.
  • Engine Layer – Loads models, manages execution, and optimizes runtime.
  • Inference Backend – Executes computations using different backends (llama.cpp, ONNX Runtime).

Key Components

Command-Line Interface (CLI)

  • Commands: cortex pull, cortex run, cortex ps. Provides simplified management of models.

REST API

  • Runs as a local server, exposing AI capabilities via HTTP.

Engine Layer

  • Manages model loading, unloading, and switching. Uses optimized quantized formats for faster inference.
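
A minimal sketch of the load/switch bookkeeping this layer implies; the class and method names are illustrative assumptions, not Cortex's actual interfaces.

```python
# Sketch of dynamic model loading, unloading, and switching.
# Names and structure are assumptions, not Cortex's real engine API.
class EngineLayer:
    def __init__(self, backend):
        self.backend = backend  # e.g., a wrapper around llama.cpp
        self.loaded = {}        # model name -> backend handle
        self.active = None

    def switch_to(self, name):
        """Load the model on first use, then make it the active one."""
        if name not in self.loaded:
            self.loaded[name] = self.backend.load(name)
        self.active = name
        return self.loaded[name]

    def unload(self, name):
        """Free a model's resources and clear it if it was active."""
        handle = self.loaded.pop(name, None)
        if handle is not None:
            self.backend.free(handle)
        if self.active == name:
            self.active = None
```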

Inference Backend

  • Supports multiple engines (e.g., llama.cpp, ONNX Runtime).
  • GPU acceleration where applicable.

Deployment & Installation

Installation Methods

  • Local Installer: Standalone package with all dependencies.
  • Network Installer: Downloads dependencies dynamically.

Supported Platforms

  • Linux: .deb, .tar.gz for Arch, and a generic shell script.
  • macOS: .pkg installer.
  • Windows: .exe installer.

Use Cases

  • Robot manipulation
  • On-device AI service

Local AI Chatbot

A developer downloads a Llama 3.2 model and runs a chatbot locally using:

cortex pull llama3.2
cortex run llama3.2

AI-Powered Code Completion

A developer integrates Cortex.cpp into a VS Code extension for offline code suggestions.

Private AI Search Engine

A researcher runs a fine-tuned AI model to analyze and categorize documents.

Future Enhancements

  • Model fine-tuning support.
  • Integration with AMD and Apple Silicon.
  • Support for multi-modal AI (text, audio, vision).