Releases: modular/max

MAX 24.6 (released 17 Dec)

We are excited to announce the release of MAX 24.6, featuring a preview of MAX GPU: the first vertically integrated Generative AI serving stack that eliminates the dependency on vendor-specific computation libraries like NVIDIA's CUDA.

MAX GPU is built on two groundbreaking technologies. The first is MAX Engine, a high-performance AI model compiler and runtime built with innovative Mojo GPU kernels for NVIDIA GPUs, free from CUDA or ROCm dependencies. The second is MAX Serve, a sophisticated Python-native serving layer engineered specifically for LLM applications. MAX Serve handles complex request batching and scheduling, delivering consistent and reliable performance even under heavy workloads.
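MAX Serve exposes an OpenAI-compatible HTTP API, so a deployment can be exercised with standard client code. Below is a minimal sketch of a chat completion request against a local instance; the address, port, and model name are illustrative assumptions, not values prescribed by this release:

```python
# Minimal sketch: query a local MAX Serve instance through its
# OpenAI-compatible chat completions endpoint. The host, port, and model
# name are assumptions for illustration; substitute your deployment's values.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed local address
    json={
        "model": "modularai/llama-3.1",  # hypothetical model identifier
        "messages": [
            {"role": "user", "content": "Write a haiku about GPUs."}
        ],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```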

For additional details, check out the changelog and the release announcement.

MAX 24.5 (released 26 Sep)

We are excited to announce the release of MAX 24.5! This release includes support for installing MAX as a conda package with magic, a powerful new package and virtual environment manager. We're also introducing two new Python APIs, for MAX Graph and MAX Driver, which will ultimately provide the same low-level programming interface as the Mojo Graph API. MAX Engine performance has also improved: 24.5 generates tokens for Llama 3 between 15% and 48% faster than the previous release. Lastly, this release adds support for Python 3.12 and drops support for Python 3.8 and Ubuntu 20.04.
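As a taste of the new Python Graph API, here is a minimal sketch that builds a graph adding two tensors, then compiles and runs it with MAX Engine. It is loosely modeled on the 24.5 examples; the module paths and call signatures are assumptions from that era of the docs and may differ in later releases:

```python
# Sketch of the MAX Graph Python API (24.5-era): define a graph that adds
# two tensors, compile it with MAX Engine, and run it on NumPy inputs.
# Exact names and signatures are assumptions and may vary between releases.
import numpy as np

from max import engine
from max.dtype import DType
from max.graph import Graph, TensorType, ops

input_type = TensorType(dtype=DType.float32, shape=(1,))
with Graph("simple_add", input_types=(input_type, input_type)) as graph:
    lhs, rhs = graph.inputs          # symbolic graph inputs
    graph.output(ops.add(lhs, rhs))  # single output: the elementwise sum

session = engine.InferenceSession()
model = session.load(graph)          # compile the graph for the local device

a = np.array([1.0], dtype=np.float32)
b = np.array([2.0], dtype=np.float32)
print(model.execute(a, b))           # expected: a tensor holding [3.0]
```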

For additional details, check out the changelog and the release announcement.

MAX 24.4 (released 07 Jun)

Today, we are thrilled to announce the release of MAX 24.4, which introduces a powerful new quantization API for MAX Graphs and extends MAX's reach to macOS. Together, these unlock a new industry-standard paradigm in which developers can leverage a single toolchain to build Generative AI pipelines locally and seamlessly deploy them to the cloud, all with industry-leading performance. Leveraging the quantization API reduces the latency and memory cost of Generative AI pipelines by up to 8x on desktop platforms like macOS, and by up to 7x on cloud CPU architectures like Intel and Graviton, without requiring developers to rewrite models or update any application code.
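The memory arithmetic behind figures like these follows from the encoding itself: a 4-bit block format such as Q4_0 stores each block of 32 float32 weights as 4-bit integers plus one float16 scale, 18 bytes in place of 128. The sketch below is a generic Python illustration of that scheme, not the MAX quantization API (which in 24.4 is exposed through the Mojo Graph API):

```python
# Generic illustration of Q4_0-style block quantization, not the MAX API:
# each block of 32 float32 weights becomes 32 4-bit integers plus one
# float16 scale (18 bytes instead of 128, roughly a 7x reduction).
import numpy as np

BLOCK = 32

def quantize_q4_0(weights: np.ndarray):
    """Quantize a flat float32 array whose length is a multiple of 32."""
    blocks = weights.reshape(-1, BLOCK)
    # The scale maps each block's max-magnitude weight onto the int4
    # extreme value -8, following the usual Q4_0 convention.
    maxes = blocks[np.arange(len(blocks)), np.abs(blocks).argmax(axis=1)]
    scales = np.where(maxes == 0.0, 1.0, maxes / -8.0)
    # Kept as int8 for clarity; a packed layout stores two values per byte.
    q = np.clip(np.round(blocks / scales[:, None]), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q4_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales[:, None].astype(np.float32)).ravel()

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4_0(w)
print("max abs error:", np.abs(w - dequantize_q4_0(q, s)).max())
```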

Check out the changelog and the full release blog for additional details.