2025-02-04

Kernels Together Strong 🦧 Improving Performance using Multiple Kernel Providers

Achieve state-of-the-art latency on FLUX.1-schnell by leveraging multiple executor backends
By Kimbo Chen, Waleed Atallah, Łukasz Dudziak

There are a ton of different kernel libraries available to the public, but making use of them all to create an optimal implementation can be a challenge! In this blog post, we show that (1) there is no single "best" kernel for a given layer, like attention, and (2) using a combination of multiple kernel libraries delivers state-of-the-art results on the AMD MI300X when running the Flux.1-schnell model at a given configuration. The Mako Compiler automates the selection of kernel combinations, achieving up to 60% improvement over the PyTorch baseline.

Intro

Today, we are blessed by a blooming open-source GPU kernel ecosystem. Projects and communities such as ThunderKittens, Liger Kernel, and the GPU Mode Discord have pulled the curtain back on the art of high-performance GPU kernel creation. But leveraging such open-source technologies requires a non-trivial amount of specialized expertise and manual integration. Engineers need to wrangle different programming APIs, understand the advantages and limitations of different hardware platforms, and choose from a wide variety of implementations of the very same functions and operations. Properly integrating and benchmarking every kernel library and then stitching together the best combination is time-consuming and difficult.

The Mako Compiler automates this process by integrating multiple kernel libraries and execution backends and searching through them automatically. We found that the best results are usually a combination of all of these different libraries. State-of-the-art inference latency for FLUX.1-schnell on the AMD MI300X GPU can be achieved with the Mako compiler by systematically evaluating viable kernel combinations across supported backends and determining the highest-performing configuration.

Missing from this blog post is a detailed technical analysis on how the different kernel implementations behave under different circumstances. We save this for future work.

FLUX Benchmarking and Optimization

This blog presents benchmark results from our FLUX.1-schnell implementation on AMD MI300X hardware. We start with a baseline implementation that uses Hugging Face's Diffusers library, then swap out different kernel implementations from there.

Our benchmarking setup is as follows:

  • 1x AMD MI300X GPU
  • ROCm 6.2
  • PyTorch 2.5
  • Huggingface Diffusers v0.31
  • Flux.1-Schnell from https://huggingface.co/black-forest-labs/FLUX.1-schnell
  • BF16 Data type

The optimal configuration combines four components:

  1. AMD Composable Kernel's Flash Attention implementation
  2. A GEMM tuning via PyTorch TunableOps
  3. Liger Kernel's LayerNorm, GroupNorm, and RMSNorm (at high batch only)
  4. Torch.compile for… well everything else 🙂

Here is a sneak peek of the final results at batch sizes of 1, 16, and 32, because no one likes to scroll to the bottom these days:

Batch size | Num steps | Baseline Latency (ms) | Fully optimized Latency (ms)
1 | 4 | 1249.160 | 816.1 (53.08% speedup)
16 | 4 | 16291.305 | 10827.2 (50.47% speedup)
32 | 4 | 32066.775 | 19870.2 (61.38% speedup)

Background: Image Generation with Diffusion

Before we dive into the specific optimizations, let us provide a high-level overview of the image generation process. Feel free to skip this section if you are familiar with how diffusion works.

The core component of any diffusion pipeline is the denoising model - it is used to iteratively transform a sample from a Gaussian distribution (basically random noise) into an image, by predicting a delta that should be added to the current value (called epsilon). A single denoising step consists of: 1) invoking the denoising model to predict an epsilon for the current latents, and 2) updating the latents using the predicted epsilon. Formally, for a process consisting of T steps, we can write:
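As a rough sketch in standard diffusion notation (the exact formulation in the original post may differ), with x_t denoting the latents at step t and ε_θ the denoising model:

$$
\epsilon_t = \epsilon_\theta(x_t, t), \qquad x_{t-1} = \mathrm{step}(x_t, \epsilon_t, t), \qquad t = T, T-1, \ldots, 1
$$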

Furthermore, so that the user can exert some control over the generation process, denoising models are usually additionally conditioned on a user-provided prompt, which describes the intended content of the image. This prompt is usually represented using a continuous representation produced by one of many pretrained language models, which helps the denoising model understand synonyms and the like. In the case of FLUX, two such models (called text encoders) are employed. The prompt encoding stays the same throughout the iterative denoising process explained above, and is supposed to steer it towards generating images that match the text provided by the user. With that, we can update the previous equations as:
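Sketching this in the same notation, with E_1 and E_2 standing in for FLUX's two text encoders and c for the resulting prompt encoding:

$$
c = \big(E_1(\text{prompt}),\, E_2(\text{prompt})\big), \qquad \epsilon_t = \epsilon_\theta(x_t, t, c), \qquad x_{t-1} = \mathrm{step}(x_t, \epsilon_t, t)
$$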

Finally, modern diffusion models like FLUX perform what is called latent-space diffusion. This means that throughout the generation process an image is represented using a model-specific continuous representation, rather than the typical discrete representation that encodes the values of individual pixels. This allows us to represent images using less data, making the denoising process more efficient, but it also means that in order to obtain the final image we have to decode the latents into the usual pixel-oriented representation suitable for display. This decoding is performed by the decoder part of an Auto-Encoder (AE) model specific to the diffusion model in question. Putting everything together, the process of generating an image I over T steps can be described as:
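Again as a sketch, with D denoting the AE decoder and \mathbf{I} the identity matrix of the Gaussian prior:

$$
x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad x_{t-1} = \mathrm{step}\big(x_t,\, \epsilon_\theta(x_t, t, c),\, t\big) \ \ \text{for } t = T, \ldots, 1, \qquad I = D(x_0)
$$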

Initial Reactions

The first thing we note is that torch.compile is really good! Like many before us have commented, it's basically free performance if your situation allows for its use.

Secondly, the latency improvements are higher at larger batch sizes. This aligns with expectations as a well-optimized kernel can leverage the increased workload to maximize GPU utilization and hide latency.

Third, a non-intuitive result we observed was that at different batch sizes, different kernel combinations produced the best result. At the smaller batches, for example, using the standard torch.compile to handle layernorms had better results than adding the dedicated Liger Kernel. At higher batches, the Liger Kernel implementations were better!

Lastly, this blog post was only made possible by the tireless work of the community when it comes to getting high performance results on all sorts of hardware, be it Nvidia, AMD, Apple, or others.

Applying Optimizations

While the Hugging Face Diffusers pipeline offers a great place to start, we can do better with Mako.

1. torch.compile - it's free money!

In this case, we hit torch.compile first. It's a pretty impressive piece of software, although it isn't perfect for every scenario and it isn't always the best first move. In fact, if we switched the order in which we apply these optimizations, you'd see different improvements at each step (while landing at the same eventual conclusion). Here, running torch.compile yields a significant improvement.
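As a minimal sketch of what this step might look like (the prompt and compile options are illustrative; the exact configuration behind the numbers below may differ):

```python
# Baseline Diffusers pipeline for FLUX.1-schnell, plus torch.compile.
# On ROCm builds of PyTorch, the MI300X is addressed through the "cuda" device.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the two heavy components: the denoising transformer and the AE decoder.
# "max-autotune" enables the more aggressive autotuning (and CUDA/HIP graphs);
# at large batch sizes, "max-autotune-no-cudagraphs" may be needed to fit in memory.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune")

images = pipe(
    "an orangutan writing GPU kernels",  # any prompt works here
    num_inference_steps=4,
    guidance_scale=0.0,  # the usual setting for the schnell variant
).images
```

With compilation in place, the latencies look like this: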

Batch size | Num steps | Baseline Latency (ms) | +torch.compile
1 | 4 | 1249.160 | 908.4
16 | 4 | 16291.305 | 11801.7
32 | 4 | 32066.775 | 20355.7

2. GEMM Tuning

A GEMM library typically ships thousands of candidate kernels for a given problem, but only one of them is going to be the fastest for your exact shapes. PyTorch TunableOp addresses how to select the fastest implementation. On AMD ROCm in particular, it can search through the rocBLAS and hipBLASLt libraries to identify the best kernels, and it provides an interface that allows us to do this easily.

With GEMM tuning, we find that different GEMM kernels are selected in different parts of the model. The optimal GEMM kernel depends on all sorts of factors, like batch size, sequence length, and more. Interestingly, at batch sizes of 1 and 16 we get improvements of 3.8% and 4.7%, while at a batch size of 32 the model actually gets slower! One guess we have is that at the smaller batch sizes, we're able to run torch.compile with max-autotune enabled. At the higher batch, there wasn't enough memory to do this, so we had to use max-autotune with no CUDA graphs. This is part of the tradeoffs involved in figuring out how to run your model optimally.

Applying TunableOp is as simple as setting a couple of environment variables.
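A rough sketch of what enabling it can look like, using PyTorch's documented PYTORCH_TUNABLEOP_* flags (the filename and warm-up flow here are illustrative):

```python
# Enable TunableOp before torch initializes its GEMM backends.
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # benchmark rocBLAS/hipBLASLt kernels on first use
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "flux_gemm_tunings.csv"  # cache the winning kernels

import torch  # imported after the flags so TunableOp picks them up

# Run a warm-up pass of the pipeline so every GEMM shape gets tuned once;
# subsequent runs (with PYTORCH_TUNABLEOP_TUNING="0") reuse the cached selections.
```

With tuning enabled, the latency changes as follows: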

Batch size | Num steps | Baseline Latency (ms) | +torch.compile | +tunableop
1 | 4 | 1249.160 | 908.4 | 873.5
16 | 4 | 16291.305 | 11801.7 | 11245.3
32 | 4 | 32066.775 | 20355.7 | 22466.0

3. Attention Kernels

One of the most interesting choices a developer has these days is which implementation of attention to choose. Here we see some really incredible innovation, whether it's Tri Dao's Flash Attention series, FlashInfer, or vendor-provided versions. PyTorch's Scaled Dot Product Attention (SDPA) is a good implementation with both Nvidia- and AMD-compatible incarnations. AMD also has its own attention kernel as part of its Composable Kernel (CK) library, which we find offers competitive performance.

There is a lot of nuance when it comes to selecting the fastest kernel, though. They behave differently under different use cases. A future write-up will examine how the different attention kernels behave under varying circumstances. For now, we show PyTorch SDPA vs AMD's CK Attention.

PyTorch SDPA vs CK Attention, Batch Size = Num Steps = 1
Mode | PyTorch SDPA | CK Attention | Speedup
Denoising Only | 295 ms | 266 ms | 11 %
AE Decoder Only | 154 ms | 128 ms | 20 %
End to End | 406.2 ms | 386.6 ms | 5 %

Flux.1-schnell has multiple head dimension sizes throughout the model architecture. Most of them are n=256, but the head dimension in the very last attention layer is n=512, and CK Attention only supports dimensions up to 256! So we use CK Attention everywhere it applies and fall back to SDPA for the n=512 attention layer. Combining the CK Attention kernel and SDPA in this way gives the performance numbers below. Again, because of memory constraints, the high-batch configuration is compiled without CUDA graphs.

Batch size | Num steps | Baseline Latency (ms) | +torch.compile | +tunableop | +CK Attention
1 | 4 | 1249.160 | 908.4 | 873.5 | 816.0
16 | 4 | 16291.305 | 11801.7 | 11245.3 | 10827.2
32 | 4 | 32066.775 | 20355.7 | 22466.0 | 21698.6
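For illustration, the per-layer dispatch described above could look roughly like the sketch below. It assumes a CK-backed flash_attn_func (e.g. from the ROCm flash-attention build); the actual integration in our pipeline may differ.

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func  # CK-backed on ROCm; import path assumed

CK_MAX_HEAD_DIM = 256  # CK flash attention supports head dims up to 256

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Dispatch between CK flash attention and PyTorch SDPA based on head dim.

    q, k, v are (batch, heads, seq, head_dim) tensors in bf16/fp16.
    """
    if q.shape[-1] <= CK_MAX_HEAD_DIM:
        # flash_attn_func expects (batch, seq, heads, head_dim)
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2))
        return out.transpose(1, 2)
    # Fall back to SDPA for the n=512 layer that CK does not cover
    return F.scaled_dot_product_attention(q, k, v)
```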

4. Normalization Layers

The final change we make is to leverage different normalization kernel implementations. This isn't a particularly powerful change, because normalization is such a small component in the grand scheme of the Flux pipeline, but it's interesting to investigate nonetheless, and it can have a much more significant impact in other use cases. We leverage the Liger Kernel library for these. One of our most interesting findings here was how inconsistent the performance gains were from applying some combinations of them: at lower batch sizes, the torch.compile normalization layers were the best, while at a batch size of 32, the Liger Kernel normalization layers gave a relatively larger improvement!

Batch size | Num steps | Baseline Latency (ms) | +torch.compile | +tunableop | +CK Attention | +Liger layernorm
1 | 4 | 1249.160 | 908.4 | 873.5 | 816.0 | 845.6
16 | 4 | 16291.305 | 11801.7 | 11245.3 | 10827.2 | 11516.0
32 | 4 | 32066.775 | 20355.7 | 22466.0 | 21698.6 | 19870.2
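For illustration, wiring in the Liger Kernel layers amounts to swapping the pipeline's normalization modules before compilation. A rough sketch for LayerNorm is below (the import path and constructor are assumptions based on the Liger Kernel package layout; check its docs for the exact API, and the same pattern applies to GroupNorm and RMSNorm):

```python
import torch.nn as nn
from liger_kernel.transformers import LigerLayerNorm  # import path assumed

def swap_layernorms(module: nn.Module) -> None:
    """Recursively replace nn.LayerNorm submodules with Liger Kernel equivalents."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm) and child.elementwise_affine:
            liger = LigerLayerNorm(child.normalized_shape[-1], eps=child.eps)
            liger.weight = child.weight  # reuse the pretrained affine parameters
            liger.bias = child.bias
            setattr(module, name, liger)
        else:
            swap_layernorms(child)

# e.g. swap_layernorms(pipe.transformer) before calling torch.compile
```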

Conclusion and Next Steps

Given such a wide variety of available kernel libraries, it can be tricky to find the optimal implementation for a given model and use case. GPU kernels are not performance-portable between different hardware vendors, or even between different generations of hardware from the same vendor. To take the pain out of this manual optimization process, we have designed the Mako compiler to identify optimal kernel combinations for you while leveraging the bustling open-source GPU kernel ecosystem.

With the Mako Compiler, we can achieve up to a 60% improvement over the PyTorch baseline when generating images. The best kernel combination was not always intuitive, and it changed depending on the batch size.

Batch size | Num steps | Baseline Latency (ms) | Fully optimized Latency (ms)
1 | 4 | 1249.160 | 816.1 (53.08% speedup)
16 | 4 | 16291.305 | 10827.2 (50.47% speedup)
32 | 4 | 32066.775 | 19870.2 (61.38% speedup)

This blog post certainly leaves much to be desired in terms of the compiler's actual architecture and a more detailed analysis of how, when, and why different kernels perform the way they do, but fear not: this will be addressed in future posts. If these kinds of investigations are interesting to you, apply to join our incredible team of GPU kernel and ML compiler engineers at jobs.mako-dev.com!
