Data Locality
Offload KB - performance
Old Content Alert
Please note that this is an old document from the archive; it will most likely be out-dated or superseded by other products, and is kept purely for historical purposes.
Any outer data accesses within an Offload block go through our software cache, which allows code to be offloaded to the SPU very quickly. When iterating on that code with performance in mind, though, one of the first and simplest optimizations is to bring the data local instead.
static void func()
{
    int foo1[1000];
    // ... assume foo1 is filled with data here ...
    __blockingoffload()
    {
        // First loop: every read of foo1 is an outer access
        // that goes through the software cache.
        int bar1 = 0;
        for(unsigned int i = 0; i < 1000; i++)
        {
            bar1 += foo1[i];
        }
        // Copy the whole array into SPU-local storage in one larger DMA.
        int foo2[1000];
        __builtin_memcpy(foo2, foo1, sizeof(foo1));
        // Second loop: every read of foo2 is a local memory access.
        int bar2 = 0;
        for(unsigned int i = 0; i < 1000; i++)
        {
            bar2 += foo2[i];
        }
        // bar1 & bar2 will have the same value!
    };
}
In the above example, each of the loops performs identical operations, and the resulting bar values will be the same. The performance difference between the two loops, however, is notable. The first loop goes through the software cache: each element of foo1 has to be DMA'ed in individually. In the second loop, since the array foo1 is copied locally in its entirety in one larger DMA, all accesses to foo2 within the loop hit local memory and are thus substantially faster!

Handily, __builtin_memcpy works on all combinations of outer/inner memory (as do normal memory copies invoked by dereferencing pointers to large data). This means that the one compiler intrinsic can be used for all of your memcpy needs!
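As a minimal sketch of both copy directions, assuming the same Offload block syntax as the example above (the Particles struct, the update function, and the parameter names are illustrative, not part of any real API):

struct Particles { float x[1000]; };

static void update(Particles *particles, int *results)
{
    __blockingoffload()
    {
        // Dereferencing a pointer to outer data copies the whole
        // struct into SPU-local storage in one transfer.
        Particles local = *particles;

        int localResults[1000];
        for(unsigned int i = 0; i < 1000; i++)
        {
            localResults[i] = (int)local.x[i];
        }

        // The same intrinsic copies inner (local) data back out
        // to outer (PPU) memory.
        __builtin_memcpy(results, localResults, sizeof(localResults));
    };
}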
See /kb/135.html for further examples of caching PPU data in SPU-local storage.