AlphaBlend
MMX and SSE asm routines for per-pixel alpha-blending

The basic idea of this ongoing project is to create the fastest possible per-pixel alpha-blending routine while keeping it as simple as possible to use: pass two buffers, one containing the background image and the other the source image, and the routine overlays the second on top of the first, taking the alpha channel into account.

Possible uses include digital television applications, which rely heavily on per-pixel alpha blending, as well as complex graphics libraries, games, and so on. I tried to make the source code as clear and simple as possible to ease integration into your own projects.

So far there are three implementations of the basic functionality: a plain C version (AlphaBlt), an MMX version (AlphaBltMMX), and an SSE version (AlphaBltSSE).

The functions take the same parameters and are designed to be as simple as possible. The parameters are the destination image, the source image, and the width and height of the images (the source and destination must have the same size). Both are 32-bit images in RGBA pixel format (alpha channel in the most significant byte, red channel in the least significant byte).
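To make the pixel layout concrete, here is a minimal C sketch. The helper names `pack_rgba` and `alpha_of` are hypothetical and not part of the library, and the positions of the green and blue channels (bytes 1 and 2) are an assumption, since the text above only fixes alpha and red.

```c
#include <stdint.h>

/* Hypothetical helper: build one 32-bit pixel with alpha in the most
   significant byte and red in the least significant byte.
   Assumption: green sits in byte 1 and blue in byte 2. */
static uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)r;
}

/* Hypothetical helper: extract the alpha channel from a packed pixel. */
static uint8_t alpha_of(uint32_t pixel)
{
    return (uint8_t)(pixel >> 24);
}
```

On a little-endian machine such as x86, a pixel packed this way is stored in memory as the bytes R, G, B, A, which is how an `unsigned char *` buffer would see it.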

Following is a description of the alpha-blending routines, which, again, have exactly the same parameters:

void AlphaBlt(unsigned char *dst, unsigned char *src, int w, int h)

void AlphaBltMMX(unsigned char *dst, unsigned char *src, int w, int h)

void AlphaBltSSE(unsigned char *dst, unsigned char *src, int w, int h)


Parameters:

dst: Destination image buffer.
src: Source image buffer.
w: Width of image.
h: Height of image.


On return, dst holds the resulting alpha-blended image.
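As an illustration of what such a routine computes, here is a minimal sketch of a plain C blend loop with the same parameter list. It is not the author's implementation: the name `AlphaBltSketch` is hypothetical, and it assumes straight (non-premultiplied) alpha, the byte order R, G, B, A in memory, and a destination alpha channel that is left untouched.

```c
/* Sketch only, not the library's code. For each colour channel:
   dst = (src * a + dst * (255 - a)) / 255, rounded to nearest. */
void AlphaBltSketch(unsigned char *dst, const unsigned char *src, int w, int h)
{
    int n = w * h;   /* source and destination have the same size */
    int i, c;

    for (i = 0; i < n; ++i) {
        unsigned a = src[4 * i + 3];          /* source alpha, 0..255 */
        for (c = 0; c < 3; ++c) {             /* R, G, B channels */
            unsigned s = src[4 * i + c];
            unsigned d = dst[4 * i + c];
            dst[4 * i + c] =
                (unsigned char)((s * a + d * (255 - a) + 127) / 255);
        }
        /* Assumption: destination alpha (byte 3) is left unchanged. */
    }
}
```

With this formula a source alpha of 255 replaces the destination pixel and a source alpha of 0 leaves it untouched.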


The MMX and SSE versions require the image width to be word aligned (divisible by two).
Note that the routines do not issue an EMMS instruction, so floating-point code that runs right after them will not work properly. To solve this, execute an EMMS instruction after using the MMX and/or SSE routines and before any floating-point code. This instruction was left out to allow more precise benchmarks.
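The two remarks above can be sketched as a pair of small helpers. Both names, `simd_width_ok` and `clear_mmx_state`, are hypothetical and not part of the library. The sketch assumes a compiler that provides the `_mm_empty()` intrinsic from `<mmintrin.h>`, which emits EMMS; on targets without MMX the helper does nothing.

```c
#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>   /* declares _mm_empty(), which emits EMMS */
#endif

/* Hypothetical helper: nonzero when the width is positive and even,
   as the MMX and SSE versions require. */
int simd_width_ok(int w)
{
    return w > 0 && (w % 2) == 0;
}

/* Hypothetical helper: issue EMMS so that x87 floating-point code can
   safely follow MMX code. Compiles to nothing where MMX is absent. */
void clear_mmx_state(void)
{
#if defined(__MMX__) || defined(_M_IX86)
    _mm_empty();
#endif
}
```

A caller might test `simd_width_ok(w)` before choosing AlphaBltMMX or AlphaBltSSE, fall back to AlphaBlt otherwise, and call `clear_mmx_state()` once after the last blit and before any floating-point work.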

Final considerations:

The MMX version gives an almost 200% speed improvement over the basic, unoptimized C implementation, while the SSE version adds a further boost, for an almost 240% improvement over the basic version.

The downloads include a demo program, which also includes some benchmarks. The application blits a bitmap containing an alpha channel over another bitmap and measures the frame rate. The following keys are handled by the program:

The benchmarks measure the time it takes to make 10,000 blits using the selected implementation.
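A benchmark of this kind can be sketched in portable C as follows. This is not the demo program's code: `BlitFn`, `time_blits`, and `NullBlit` are hypothetical names, and the sketch uses the standard `clock()` function, which reports processor time and has a coarse resolution on some platforms.

```c
#include <time.h>

/* Hypothetical function-pointer type matching the blit signatures. */
typedef void (*BlitFn)(unsigned char *dst, unsigned char *src, int w, int h);

/* Hypothetical helper: seconds spent performing `count` blits with the
   given implementation, as measured by clock(). */
double time_blits(BlitFn blit, unsigned char *dst, unsigned char *src,
                  int w, int h, int count)
{
    clock_t start = clock();
    int i;

    for (i = 0; i < count; ++i)
        blit(dst, src, w, h);

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Stand-in so the sketch compiles on its own; a real benchmark would
   pass AlphaBlt, AlphaBltMMX or AlphaBltSSE instead. */
static void NullBlit(unsigned char *dst, unsigned char *src, int w, int h)
{
    (void)dst; (void)src; (void)w; (void)h;
}
```

Calling `time_blits` with a count of 10000 for each implementation on the same pair of buffers gives timings that can be compared directly.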

Warning: Tested only with Visual Studio 6 SP5 with the Processor Pack.

Coming soon: an SSE2 implementation.


    Version: 1.0
    License: GPL
    OS: Windows
    Development Tools: Microsoft Visual Studio 6 SP5
    Last Update: November 25th, 2005
