2025-05-29

Unlocking AI Model Performance with Mako on Microsoft Azure

Mako improves the performance of vLLM and SGLang
By Waleed Atallah, David Levy

Scaling AI workloads efficiently requires the right combination of hardware and software optimizations. In this technical exploration, we'll examine how strategic hyperparameter optimization and GPU kernel selection can enhance GenAI performance across different GPU architectures.

We recently benchmarked Flux.1-Dev, Llama-3.3-70B, and DeepSeek-R1 on Microsoft Azure ND MI300X v5 and ND H100 v5 virtual machines, using Mako’s automated kernel selection and tuning to improve performance across these models where possible.

“Our mission at Mako is to unlock every GPU’s full potential with zero manual effort,” said Mohamed Abdelfattah, Chief Science Officer, Mako. “By bringing these optimizations to Microsoft Azure customers, we’re letting teams focus on building great AI products instead of low-level tuning.”

In reference to the benchmarks, Tom Davis, Partner, Microsoft for Startups program, noted, “Mako’s GPU kernel optimization capabilities and Microsoft Azure’s AI infrastructure make it easier to scale AI workloads.”

Details below, but whether you’re a startup or an enterprise, you can get a free benchmark report for the model(s) of your choice across different GPU VMs via the Mako app in the Azure Marketplace! Use the benchmark report to discover the optimal configuration for your AI workloads before you deploy.

Mako’s Dynamic Kernel Optimizations

AI models are only as efficient as the compute they run on. While GPU selection plays a role, software-level optimizations can deliver significant performance gains.

To demonstrate this, we tested Flux.1-Dev, Llama-3.3-70B, and DeepSeek-R1 on ND MI300X v5 and ND H100 v5 virtual machines before and after applying optimizations. On top of the optimizations already available in the inference stack, Mako dynamically selects and applies the most efficient GPU kernels, leveraging multiple providers to maximize speed and efficiency. It also performs comprehensive hyperparameter optimization on the inference engine to further enhance performance.
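To make the kernel-selection idea concrete, here is a minimal sketch (not Mako’s actual implementation) of the underlying pattern: benchmark several interchangeable implementations of the same operation and keep the fastest one for the current hardware and shapes. It uses PyTorch’s built-in scaled-dot-product-attention backends as stand-ins for multiple kernel providers; the tensor shapes and iteration counts are illustrative, and it assumes a recent PyTorch (2.3+) on a CUDA- or ROCm-capable GPU.

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Three interchangeable "providers" for the same attention operation.
backends = {
    "flash": SDPBackend.FLASH_ATTENTION,
    "mem_efficient": SDPBackend.EFFICIENT_ATTENTION,
    "math": SDPBackend.MATH,
}

# Illustrative shapes: (batch, heads, sequence length, head dim).
q = torch.randn(8, 32, 2048, 128, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

def time_backend(backend, warmup=10, iters=50):
    # Time one backend with CUDA events after a short warmup.
    with sdpa_kernel(backend):
        for _ in range(warmup):
            F.scaled_dot_product_attention(q, k, v)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

timings = {name: time_backend(b) for name, b in backends.items()}
best = min(timings, key=timings.get)
print(timings, "-> selected:", best)

A production optimizer searches a far larger space (vendor libraries, Triton and hand-written kernels, fusion choices) and caches the winner per shape and device, but the select-by-measurement loop is the same basic idea.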

The result? Faster inference, lower memory overhead, and more efficient AI workloads — all without requiring changes to the underlying model architecture.

Benchmark Results: Before and After Mako Optimization

AI models like Flux.1-Dev, Llama-3.3-70B, and DeepSeek-R1 have baseline (“off the shelf”) performance metrics that do not represent the true capability of the hardware that serves these models. Implementing simple optimization techniques like torch.compile has a material impact on performance. But Mako’s automated kernel selection and other optimization methods eke out every last bit of performance from these GPU architectures so that models run as effectively as possible given a particular compute platform, dataset, and target metric.
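For reference on the “simple” end of that spectrum, torch.compile can be applied to an existing PyTorch model in a couple of lines. The snippet below is an illustrative sketch, not the exact benchmark configuration: it assumes access to the gated Llama 3.3 checkpoint on Hugging Face and enough GPU memory to load it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # assumes access to this gated checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# torch.compile traces and fuses the forward pass; "max-autotune" spends more
# compile time searching for faster kernel configurations.
model = torch.compile(model, mode="max-autotune")

inputs = tokenizer("Benchmark me.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))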

Summary results

  • Flux.1-Dev: Achieved a 43% latency improvement (measured in ms) on ND MI300X v5 VMs with lower memory overhead after Mako’s optimizations.

  • Llama-3.3-70B: Optimizations improved inference speed by 29% and 11% on ND MI300X v5 and ND H100 v5 virtual machines, respectively (as measured in tps), reducing latency and improving efficiency on both VMs.

  • DeepSeek-R1: Mako improved DeepSeek-R1 performance by 22% and 64% (tps) with torch.compile and optimized kernel selection, respectively.

[Chart: Mako benchmark results before and after optimization]

By leveraging Mako’s automated optimization engine, AI teams can achieve these performance gains on top of the existing optimizations provided by Azure.

Details

Our benchmarking ran entirely on generously provided Microsoft Azure hardware. For the AMD trials we used the ND MI300X v5 instance type, which packages eight AMD MI300X GPUs with two 4th-generation Intel Xeon Scalable processors, for 96 physical CPU cores in total. Inside the VM, the GPUs are tied together by 4th-gen AMD Infinity Fabric links that deliver 128 GB/s per GPU, or 896 GB/s of aggregate bandwidth across the node.

The NVIDIA experiments ran on two ND H100 v5 nodes. Each VM likewise contains eight NVIDIA H100 Tensor Core GPUs backed by the same class of dual 4th-gen Intel Xeon Scalable CPUs (96 physical cores). Intra-VM GPU communication relies on NVLink 4.0, while inter-node traffic rides dedicated 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand links for every GPU, with GPUDirect RDMA and automatic wiring inside a scale set that can extend to thousands of GPUs. Both platforms therefore give us eight flagship accelerators and identical CPU resources in a single VM, providing a fair, apples-to-apples test bed across vendors.

We test across a variety of inference engines, including vLLM and SGLang, using the latest versions of all software stacks (CUDA, ROCm, PyTorch, etc.).
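As a concrete example of what an off-the-shelf baseline looks like before tuning, here is a minimal vLLM run via its Python API, assuming 8-way tensor parallelism to span all eight GPUs in one of these VMs; the model, prompt, and sampling settings are illustrative, not the benchmark harness itself.

from vllm import LLM, SamplingParams

# Off-the-shelf baseline: default engine settings, 8-way tensor parallelism
# to use all eight GPUs in a single ND MI300X v5 or ND H100 v5 VM (assumed setup).
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=8)

prompts = ["Explain the difference between latency and throughput in one sentence."]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)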

How It Works

Improving hardware performance with the Mako Optimization Platform is as easy as one line of code. Simply install the Mako software, then run:

mako tune meta-llama/Llama-3.3-70B-Instruct

From here, you’ll see the optimization dashboard pop up, and you can watch as our search-based optimization engine tunes both kernels and inference-engine hyperparameters. As it explores the search space of possible implementations, it discovers faster implementations automatically.

[Demo: the Mako Optimization Platform tuning dashboard]

Running a Mako-optimized model is as simple as tuning one; just execute the following:

mako serve meta-llama/Llama-3.3-70B-Instruct

And that’s it! Your model is served from an OpenAI-compatible endpoint using the inference engines you know and love, vLLM and SGLang.
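Once the endpoint is up, it can be queried with any OpenAI-compatible client. The snippet below is a minimal sketch using the standard openai Python package; the base URL, port, and API key are assumptions, so use whatever address mako serve reports when it starts.

from openai import OpenAI

# The base_url and api_key below are placeholders for whatever address and
# credentials `mako serve` reports; adjust them for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Give me one tip for reducing LLM inference latency."}],
    max_tokens=128,
)
print(response.choices[0].message.content)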

Building the Future of AI with Mako on Azure

Take the guesswork out of deploying AI efficiently. Because GPUs represent the most capital-intensive portion of the stack, any performance left unused is money left on the table. The Mako Optimization Platform removes that waste with a single line of code: it automatically tunes your workload, supports every major accelerator architecture, plugs straight into today’s open-source inference engines, and consistently delivers best-in-class throughput and latency.

For startups and enterprises building AI applications on Azure, this means:

  • Seamless performance improvements across different GPU architectures.

  • Lower infrastructure costs by maximizing hardware utilization.

  • Faster time-to-market without needing deep GPU tuning expertise.

With Mako’s automated optimization, companies can get more out of their compute investments while reducing overall AI operational costs.

🚀 Get a free benchmark report now for the model(s) of your choice across different hardware via the Mako app in the Azure Marketplace to discover the optimal configuration for your AI workloads before you deploy!
