
The Software Cache

Offload KB - features

Old Content Alert

Please note that this is an old archived document. It will most likely be outdated or superseded by other products, and is kept here purely for historical purposes.

Offload™ automatically uses a Software Cache on the SPU to manage random reads and writes of data from and to main memory locations. This cache allows for very quick migration of code to the SPU: when writing SPU source code by hand, you usually need to patch pointers within structures so that the data they refer to is resident in local memory before it is used. In addition, the cache has a constant local store overhead, which is ideal when you do not know the size of your datasets.

From version 2.0.0 of Offload™, there are two versions of the Software Cache available to developers.

Standard Software Cache

The Standard Software Cache is implemented as 4-way set associative. There is no cache line replacement algorithm as such: the cache simply chooses a line ID in a linear fashion across all sets, unless that line has a prefetch pending, in which case it chooses another line. If all lines within a set are prefetching, it waits for a line to become available and then immediately replaces it with new data.
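The line selection described above can be sketched as a simple rotating counter. This is only an illustrative model under stated assumptions, not the Offload runtime's code: the names Line, Set and VictimPicker are hypothetical, and the single shared counter is one reading of "in a linear fashion across all sets".

```cpp
#include <array>
#include <cassert>

// Hypothetical model for illustration; not the Offload runtime's data layout.
constexpr unsigned kWays = 4;  // 4-way set associative (from the text)

struct Line {
    bool prefetchPending = false;  // true while a prefetch DMA is in flight
};

struct Set {
    std::array<Line, kWays> ways;
};

// Linear victim choice. Assumption: one counter is shared by every set.
struct VictimPicker {
    unsigned next = 0;

    // Returns the way to replace, or -1 if every way in the set is still
    // prefetching (the caller would then wait for a line to become available).
    int pick(const Set& set) {
        for (unsigned i = 0; i < kWays; ++i) {
            const unsigned way = (next + i) % kWays;
            if (!set.ways[way].prefetchPending) {
                next = (way + 1) % kWays;  // continue after the chosen way
                return static_cast<int>(way);
            }
        }
        return -1;
    }
};
```

Because no usage history is kept, a frequently used line can be evicted as readily as a stale one; the benefit is that choosing a line costs only a counter increment and a flag check.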

The size of the Software Cache can be modified with the -cachesets=n command line option. Currently this sets the caches of all offload blocks within the translation unit to 2^n sets. If n = 0, the Slim Cache, outlined below, is used. Each cache line is 128 bytes.
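As a rough guide to how -cachesets=n translates into local store usage, the following sketch computes the data footprint, assuming 2^n sets of 4 lines of 128 bytes each. It ignores tags and other bookkeeping, whose size this document does not state. The setIndex function is a hypothetical illustration of a conventional set mapping; the actual mapping used by Offload is not documented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWays = 4;         // lines per set (from the text)
constexpr std::size_t kLineBytes = 128;  // bytes per line (from the text)

// Data bytes held by a cache built with -cachesets=n, assuming 2^n sets.
// Excludes tag and bookkeeping storage, which is not documented.
constexpr std::size_t cacheDataBytes(unsigned n) {
    return (std::size_t{1} << n) * kWays * kLineBytes;
}

// Hypothetical set mapping: low bits of the main-memory line number.
// This is the conventional choice, assumed here for illustration only.
constexpr std::size_t setIndex(std::uint64_t mainAddress, unsigned n) {
    return static_cast<std::size_t>(
        (mainAddress / kLineBytes) & ((std::uint64_t{1} << n) - 1));
}
```

Under these assumptions, n = 0 gives 512 bytes of cached data (matching the Slim Cache's 4 lines of 128 bytes) and n = 5 gives 16384 bytes.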

Performance

The following numbers do not include the time taken to actually perform the read or write, only the time taken for the cache system to perform its lookups and return a valid SPU local store address.

Cache hit (for read): 28 Cycles

Cache hit (for write): 62 Cycles

Cache miss: 600 Cycles (consisting of DMA transfer, may also require a cache writeback)

Cache line writeback: The cost depends on how much of the line was modified. If the whole line is modified, a single DMA operation to main memory occurs. If only a subset of the line is modified, a series of atomic DMA operations is required: a DMA read of the latest main memory view of the cache line is performed, and the modifications are merged into that before being committed back to main memory.

Cache flush: Depends on size of cache.
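The partial-line writeback described above amounts to a read, merge and write sequence. The sketch below models only the merge step on plain arrays. It assumes byte-granular dirty tracking, which this document does not confirm, and it does not model the atomic DMA operations themselves. The name mergeForWriteback is hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineBytes = 128;  // cache line size (from the text)
using LineData = std::array<std::uint8_t, kLineBytes>;

// Builds the line that would be committed back to main memory.
// local      : the SPU's copy of the line
// dirty      : one bit per byte modified on the SPU (assumed granularity)
// latestMain : a fresh read of the same line from main memory
LineData mergeForWriteback(const LineData& local,
                           const std::bitset<kLineBytes>& dirty,
                           const LineData& latestMain) {
    if (dirty.all()) {
        return local;  // whole line modified: no merge needed
    }
    LineData merged = latestMain;  // start from main memory's current view
    for (std::size_t i = 0; i < kLineBytes; ++i) {
        if (dirty[i]) {
            merged[i] = local[i];  // overlay only the bytes the SPU wrote
        }
    }
    return merged;
}
```

Merging into the latest main memory view preserves changes that other processors made to untouched bytes of the same line, which is why a partially modified line costs more to write back than a fully modified one.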

Slim Software Cache

The Slim Software Cache is enabled when the command line option -cachesets=n is given with n = 0.

It is implemented as 4-way set associative, with a single set. There is no cache line replacement algorithm as such: the cache simply chooses a line ID in a linear fashion, unless that line has a prefetch pending, in which case it chooses another line. If all lines within the set are prefetching, it waits for a line to become available and then immediately replaces it with new data.

The size of the Slim Software Cache cannot be modified; it is always 4 lines of 128 bytes each. Global registers are used to make this version of the cache as fast as possible, which is ideal when Offload blocks keep most of their data in Local Store and use the cache infrequently.

Performance

The following numbers do not include the time taken to actually perform the read or write, only the time taken for the cache system to perform its lookups and return a valid SPU local store address.

Cache hit (for read): 20 Cycles

Cache hit (for write): 50 Cycles

Cache miss: 600 Cycles (consisting of DMA transfer, may also require a cache writeback)

Cache line writeback: The cost depends on how much of the line was modified. If the whole line is modified, a single DMA operation to main memory occurs. If only a subset of the line is modified, a series of atomic DMA operations is required: a DMA read of the latest main memory view of the cache line is performed, and the modifications are merged into that before being committed back to main memory.

Cache flush: Depends on size of cache.
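To compare the two caches, the read figures quoted in this document can be combined into an expected overhead per access. This is a back-of-envelope model only. It assumes every miss costs a flat 600 cycles with no writeback, and the hit rate is a value you would have to measure for your own workload. The function name is hypothetical.

```cpp
#include <cassert>

// Figures quoted in this document (cycles of cache overhead per access).
constexpr double kStandardReadHit = 28.0;
constexpr double kSlimReadHit = 20.0;
constexpr double kMiss = 600.0;

// Expected cache overhead per read for a hit rate in [0, 1].
// Assumption: a miss costs a flat kMiss cycles and never needs a writeback.
inline double expectedReadCycles(double hitCycles, double hitRate) {
    assert(hitRate >= 0.0 && hitRate <= 1.0);
    return hitRate * hitCycles + (1.0 - hitRate) * kMiss;
}
```

At a 90% hit rate this model gives about 85 cycles per read for the Standard cache and 78 for the Slim cache. The miss term dominates in both cases, so the Slim cache's cheaper hits only pay off if its four lines achieve a comparable hit rate.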

Issues

PS3 Implementation: