VMX2SPU Engine
Offload KB - features
Old Content Alert
Please note that this is an old archived document; its content is most likely out-dated or superseded by other products, and it is kept here purely for historical purposes.
Offload™ features a fully optimizing VMX2SPU engine. This means that any code the Offload™ compiler finds that needs duplicating for SPU use is transformed from optimized Altivec intrinsics into optimal SPU intrinsics. This is incredibly useful, as it can unlock significant performance with very little programmer effort.
Like the VMX2SPU header file included in some Cell implementations, each Altivec intrinsic is transformed into a single SPU intrinsic or a series of them, depending on the operation.
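For instance, a one-to-one mapping and a one-to-many mapping can look like the following sketch, modeled on the style of the public vmx2spu.h header (the wrapper names here are illustrative, not Offload™ internals):

#include <spu_intrinsics.h>

// One-to-one: vec_add on floats maps directly to spu_add.
static inline vector float altivec_add(vector float a, vector float b)
{
    return spu_add(a, b);
}

// One-to-many: vec_perm honours only the low 5 bits of each control byte,
// so it becomes a spu_and to mask the pattern, followed by a spu_shuffle.
static inline vector float altivec_perm(vector float a, vector float b,
                                        vector unsigned char c)
{
    return spu_shuffle(a, b, spu_and(c, (unsigned char)0x1F));
}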
Where Offload™ really shines, however, is that because the VMX2SPU implementation is integrated into the compiler, it can take full advantage of the compiler's knowledge of data alignment and movement.
For example:
//PPU code for loading a vector from an unaligned pointer
inline vector float loadUnaligned(float* p)
{
    vector float left = vec_lvlx(0, p);   // bytes from p up to the next 16-byte boundary
    vector float right = vec_lvrx(16, p); // remaining bytes from the following quadword
    return vec_or(left, right);           // combine the two halves
}
On the SPU, traditional VMX2SPU headers would transform this into the following:
inline __offload vector float loadUnaligned(float* p)
{
    // vec_lvlx: load the quadword containing p, shifted left by the
    // misalignment so the wanted bytes begin at element 0
    vector unsigned char *p1 = (vector unsigned char *)((unsigned char *)(p) + 0);
    vector float left = (vector float)spu_slqwbyte(*(__typeof(p1))(((int)p1) & ~0xF), (unsigned int)p1 & 0xF);
    // vec_lvrx: load the following quadword, rotated right with zero fill
    // so the remaining bytes land at the end of the vector
    vector unsigned char *p2 = (vector unsigned char *)((unsigned char *)(p) + 16);
    vector float right = (vector float)spu_rlmaskqwbyte(*(__typeof(p2))(((int)p2) & ~0xF), ((int)p2 & 0xF) - 16);
    return spu_or(left, right);
}
Using Offload™, however, allows that code to be optimized further: if an aligned pointer is passed into loadUnaligned and the function is inlined, the two loads and the spu_or are optimized away into a single load operation. This yields faster code and can also produce much smaller code.
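As an illustrative sketch (the buffer and variable names here are hypothetical, and the alignment attribute assumes a GCC-style toolchain), a call site where the compiler can prove alignment might reduce as follows:

// 'data' is quadword aligned, so once loadUnaligned is inlined the
// compiler can prove ((int)p & 0xF) == 0, fold the shift and rotate
// amounts to constants, and discard the second load and the spu_or.
__attribute__((aligned(16))) float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
vector float v = loadUnaligned(data);
// ...which after optimization is equivalent to a single aligned load:
vector float w = *(vector float *)data;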