Mobile GPU computing, just like any other technology domain, is subject to buzz and hype words. When marketing teams get a bit overexcited, it is easy to assume that something is a lot more important than it really is. I believe that a couple of areas require some comments, including: coherency, heterogeneous compute and, again, FP64.

Coherency

Coherency is the concept where multiple processing engines, e.g. multiple CPU cores in an SoC, have the same “coherent” view of memory. More specifically, if one CPU core updates some data in memory, the other cores (ideally) know about it immediately, without any special effort from software.

From a system efficiency point of view, such coherency sounds very valuable, as historically these types of operations have required synchronisation between cores, manual cache invalidates (ensuring that outdated data is no longer used) and manual cache flushes (ensuring that new data is written to memory for use by all other processors).
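To make the manual steps concrete, here is a minimal toy model of two cores with private write-back caches and no coherency hardware at all. The `WriteBackCache` class and its method names are my own illustration, not a real driver or hardware API, and real caches work on cache lines rather than single addresses.

```python
class WriteBackCache:
    """Toy private write-back cache in front of a shared memory dict (no coherency)."""

    def __init__(self, memory):
        self.memory = memory   # shared "system memory": address -> value
        self.lines = {}        # locally cached copies: address -> value
        self.dirty = set()     # addresses written locally but not yet in memory

    def read(self, addr):
        # A cached copy is returned as-is, even if memory has changed since.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # The write stays in the local cache until it is flushed.
        self.lines[addr] = value
        self.dirty.add(addr)

    def flush(self):
        # Manual flush: push new data out to memory for other processors.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

    def invalidate(self):
        # Manual invalidate: drop local copies so outdated data is no longer used.
        # (In this toy model any unflushed writes are discarded as well.)
        self.lines.clear()
        self.dirty.clear()


memory = {0x100: 1}
core_a = WriteBackCache(memory)
core_b = WriteBackCache(memory)

core_b.read(0x100)                  # core B caches the old value
core_a.write(0x100, 42)             # core A updates only its own cache
stale = core_b.read(0x100)          # B still sees the old value
core_a.flush()                      # new value reaches system memory
still_stale = core_b.read(0x100)    # B still reads its outdated cached copy
core_b.invalidate()
fresh = core_b.read(0x100)          # only now does B see the update
print(stale, still_stale, fresh)
```

Both the flush on the writer and the invalidate on the reader are needed before the update becomes visible, which is exactly the bookkeeping that coherent hardware is meant to take away.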

Handling all of this automatically would offer a lot more efficiency and far fewer software concerns. However, in practice, it is of course not quite as trivial as this. Most coherent implementations basically end up having a unified coherent cache, while the higher-level caches (those closer to the compute units themselves) are still subject to flushes and invalidates, as linking all caches at all levels would be a connection nightmare (and would also be slow in itself). So typically, we see “coherent” CPU systems where multiple CPU cores share an L2 cache and all higher levels automatically handle flushing, invalidates and synchronisation. There is no doubt that this helps, but it is also a highly complex mechanism. Any mistake in the implementation quickly leads to undesired results, forcing a design back to manual efforts. Given this complexity, full coherency is typically implemented only in multicore CPU systems today.

A lot of hype these days is made about GPU / CPU coherency. The words would have you believe it’s full coherency, just as described above. Unfortunately, this is not the case: GPU / CPU coherency as it’s available today on embedded processor architectures is really not coherency, but rather a simplistic derivative with far fewer (if any) benefits.

GPU / CPU coherency today at the technical level really just boils down to the ability of the GPU to set some flags on memory accesses, indicating that the data it’s fetching may be already within the CPU cache, and hence the infrastructure will snoop (have a look in) the CPU cache (hit) before falling back to looking in system memory (miss). From this description, it’s hopefully immediately clear that this coherency is not coherency at all. It’s a one-way mechanism where the GPU snoops in the CPU cache, and the other direction is not supported. It is just a mechanism to try and use the CPU cache where speculatively we hope to save some bandwidth by not going to system memory (and possibly save a forced flush of this CPU data to system memory for the GPU to read it – note there are alternatives here e.g. uncached memory usage).
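The one-way nature of this mechanism can be sketched in a few lines. This is only a toy model: the function name `gpu_read`, the `snoop` flag and the dictionaries standing in for the CPU cache and system memory are all my own assumptions for illustration, not an actual hardware or driver interface.

```python
def gpu_read(addr, cpu_cache, system_memory, snoop=True):
    """Toy GPU read: optionally snoop the CPU cache, else go to system memory.

    Returns (value, source) so the caller can see where the data came from.
    """
    if snoop and addr in cpu_cache:
        return cpu_cache[addr], "cpu_cache"          # snoop hit
    return system_memory[addr], "system_memory"      # snoop miss, or snooping off


# The CPU holds a newer copy of 0x10; 0x20 was written straight to memory.
cpu_cache = {0x10: "new"}
system_memory = {0x10: "old", 0x20: "frame"}

hit = gpu_read(0x10, cpu_cache, system_memory)
miss = gpu_read(0x20, cpu_cache, system_memory)
no_snoop = gpu_read(0x10, cpu_cache, system_memory, snoop=False)
print(hit, miss, no_snoop)
```

Note that nothing in this model lets the CPU look into a GPU cache: the mechanism only ever runs in one direction.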

[Figure: GPU / CPU coherency today]

So is this a big deal? To be able to judge that, we again need to look at common usage scenarios and their link to this snooping functionality:

First of all, like any cache, you need to have data that is in the cache for it to be a benefit. Unfortunately, caches are not predictable elements. Their contents change continuously based on specific rules. Simplified: the CPU needs to generate (or use) data for it to be present in its cache, and this same data must then almost immediately be used by the GPU to ensure that it is actually accessed through the CPU cache. Without immediate re-access by the GPU, there is a significant risk that the data will be flushed to system memory and replaced by other data that is useful to the CPU.
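A rough feel for this timing problem can be had from a toy simulation. The sketch below assumes a fully associative cache with least-recently-used replacement, which is a simplification (real CPU caches are set-associative and use a variety of policies); the function name and the sizes are invented for illustration only.

```python
from collections import OrderedDict


def surviving_lines(cache_lines, buffer_lines, other_accesses):
    """Count buffer lines still cached after the CPU touches other data.

    Toy model: fully associative cache with LRU replacement (an assumption).
    """
    cache = OrderedDict()

    def touch(tag):
        if tag in cache:
            cache.move_to_end(tag)         # mark as most recently used
        else:
            if len(cache) >= cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[tag] = True

    for i in range(buffer_lines):          # CPU produces the shared buffer
        touch(("buf", i))
    for i in range(other_accesses):        # CPU carries on with unrelated work
        touch(("other", i))
    return sum(("buf", i) in cache for i in range(buffer_lines))


# 1024-line cache, 256-line buffer, increasing delay before the GPU reads.
immediate = surviving_lines(1024, 256, 0)
short_delay = surviving_lines(1024, 256, 900)
long_delay = surviving_lines(1024, 256, 1024)
print(immediate, short_delay, long_delay)
```

In this model, once the CPU has touched as many unrelated lines as the cache can hold, none of the buffer is left for the GPU to snoop.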

Now the chance of a GPU doing something immediately is rather small, as GPUs are almost invariably multi-tasking i.e. they are working on their primary task — graphics — which never stops. As you are using your mobile device, compute is really a secondary task which will be scheduled as resources allow (smooth graphics are the most important task to ensure a good user experience).

This means that the chance that the GPU will immediately look at this data is… well… very small. Furthermore, if we think about practical usage scenarios for mobile compute, we think about image processing, video processing, augmented reality, camera vision… What all of these use cases have in common is that the data that must be operated on is not coming from the CPU. It’s coming from video decoders and from camera interfaces, which all write directly to memory. So again… this coherency is of very little use, as the data never ends up in the CPU cache in the first place (barring inefficient system implementations that require CPU post-processing of image/video data or that do not support zero-copy access to memory by the GPU).

So to date, CPU / GPU coherency is really overhyped. It’s not full coherency, it’s just cache snooping. Its practical compute usage scenarios and benefits are unproven and unknown (if you know a working case, let us know). As always, like a good IP provider, we do fully support these coherency mechanisms, and it is up to our partners to decide if they want to take advantage of this functionality as part of the GPU integration into their SoC.

Heterogeneous compute

Heterogeneous compute is a term which describes using different devices at the same time to handle a compute task, e.g. using both a CPU and a GPU to execute a processing task. Very often, this term is linked to the above-described CPU and GPU coherency, and this is proclaimed as the key usage scenario and benefit. But again, is this link valid, or just part of the overhyping of the coherency terminology?

There are different ways to understand heterogeneous compute. First of all, if you remember, I explained that parallel compute is ideally all about independent, parallel, non-divergent processing workloads. As the workload items (ideally thousands, if not millions) are all independent of each other, the most obvious way to implement heterogeneous compute is to distribute the workload across all capable processors. As an example: when doing image processing, assign 75% of the image pixels to be processed by the GPU, and the remaining 25% by the CPU. This usage scenario implements heterogeneous compute, as we are using two different processing resources at the same time to handle the compute task. Now, if you’ve been paying attention, you’ll immediately understand that coherency and this type of processing have no link at all, since the CPU and GPU are processing different parts of the image, which means there is no sharing of information possible in the CPU cache.
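As a minimal sketch of this first interpretation, the split can be expressed as two disjoint row ranges of the image. The function name `split_rows` and the choice to split by rows are my own assumptions; an actual implementation would hand each range to the relevant compute API.

```python
def split_rows(height, gpu_share=0.75):
    """Split image rows into two disjoint ranges: one for the GPU, one for the CPU."""
    gpu_rows = round(height * gpu_share)
    return range(0, gpu_rows), range(gpu_rows, height)


gpu_range, cpu_range = split_rows(1080)
print(len(gpu_range), len(cpu_range))   # 810 rows for the GPU, 270 for the CPU

# The two ranges never overlap, so neither processor touches the other's pixels.
overlap = set(gpu_range) & set(cpu_range)
print(len(overlap))
```

Because the ranges are disjoint, there is simply nothing in the CPU cache that the GPU would ever want to snoop.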

[Figure: heterogeneous processing today]

However, there is a second way to understand heterogeneous compute: splitting the algorithm itself into multiple parts, some of which are serial in nature and others parallel in nature. The serial parts are best handled by the CPU, and the parallel parts by the GPU. This is quite different from the above case: using our example of image processing again, this would mean that all pixels would be processed by both the CPU and the GPU (part of the algorithm would run on the CPU and part of the algorithm would run on the GPU).
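A small sketch of this second interpretation, with an example algorithm that I have made up purely for illustration: a per-pixel threshold (every pixel is independent, so this is the part that would suit a GPU) followed by run-length encoding (each step depends on the previous one, so this is the part that would suit a CPU). Both stages are plain Python here; nothing actually runs on a GPU.

```python
def threshold_stage(pixels, level=128):
    """Parallel-friendly stage: each output depends on one input pixel only."""
    return [1 if p >= level else 0 for p in pixels]


def run_length_stage(bits):
    """Serial stage: each step depends on the run built up so far."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((bit, 1))               # start a new run
    return runs


pixels = [10, 200, 220, 30, 30, 250]
mask = threshold_stage(pixels)      # every pixel passes through this stage
runs = run_length_stage(mask)       # and then through this one
print(mask, runs)
```

Here the complete intermediate result is handed from one stage to the next, which is why sharing data between the two processors starts to matter in this scenario.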

Thinking about this scenario, coherency does come to mind as a benefit… if only the coherency we had today was real coherency where CPU and GPU can collaborate in a bi-directional mechanism of sharing data and synchronisation and scheduling of workloads. But as we’ve discussed, this is not the case, as we just have simplistic CPU cache snooping.

[Figure: heterogeneous processing in the future]

This last usage scenario is very interesting though, and Imagination recognized this early on as a big opportunity to move processing to the next level. This is why Imagination is one of the founding members of the HSA Foundation (http://hsafoundation.com/), which is looking at all of the complexities of optimally implementing this type of use case. Rather than just simplistic cache snooping, this effort is looking at what is really required at the hardware and system levels, and also at the software level. Discussing HSA in this article would not do it justice, so this is left to another article. For now, if you are intrigued by this potential future industry revolution, have a look at the HSA Foundation website and these blog articles (1) (2) that give an overview of Imagination, mobile GPU computing and the HSA Foundation. Before you do, just note that today’s CPU cache snooping mechanisms are far away from offering an HSA-compliant solution.

FP64

64-bit is one of the hot topics of the moment, not so much because of FP64 in the compute context, but more often as 64-bit memory addressing. As 32-bit memory addressing limits a system to about 4GBytes of usable memory space, and as phones and tablets get an ever-increasing amount of memory (2GBytes is no longer just an exception), we do need to worry, at the system level, about how to address larger memory spaces.

Marketing teams, however, often fail to grasp the difference between 64-bit memory addressing and 64-bit floating point, which are two entirely different things. With confused marketing messages about 64-bit, it’s easy to mislead people into thinking that all types of 64-bit must be mandatory. I see marketing material confusing these two topics on the same slide way too often. Just remember that 64-bit memory addressing and 64-bit floating point have no link at all. Most systems today actually support 40-bit addressing, which makes the whole 64-bit story far less interesting, since 40-bit already puts us in the TByte range. This amount of addressable memory is likely to be sufficient for mobile for at least a few years.
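The arithmetic behind these address ranges is easy to check. The snippet below is just a worked calculation; the variable names are mine, and it uses binary units (1 GByte taken as 2^30 bytes, 1 TByte as 2^40 bytes).

```python
import struct

GBYTE = 2**30
TBYTE = 2**40

# Address width decides how many bytes can be addressed.
addr32_gbytes = 2**32 // GBYTE      # 32-bit addressing
addr40_gbytes = 2**40 // GBYTE      # 40-bit addressing, in GBytes
addr40_tbytes = 2**40 // TBYTE      # 40-bit addressing, in TBytes
print(addr32_gbytes, addr40_gbytes, addr40_tbytes)   # 4 1024 1

# Floating-point width is a property of the data format, not of addressing.
fp32_bytes = struct.calcsize("<f")  # single precision
fp64_bytes = struct.calcsize("<d")  # double precision
print(fp32_bytes, fp64_bytes)       # 4 8
```

A system can combine any of these address widths with either floating-point format; the two choices are independent.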

As discussed previously, FP64 is not a typical usage scenario for mobile compute. This is further backed up by what we see in the market today, which is that no mobile GPU actually exposes 64-bit floating point, no matter the marketing claims being made. This can easily be seen from the absence of the cl_khr_fp64 extension when browsing through online feature databases such as the one offered by Kishonti’s CLBenchmark.
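If you want to check this yourself, an OpenCL device reports its extensions as a space-separated string (obtained through clGetDeviceInfo with CL_DEVICE_EXTENSIONS), so the test boils down to a string lookup. The sketch below assumes you already have that string; the helper name `supports_fp64` and the two example strings are hypothetical and not taken from any real device.

```python
def supports_fp64(extensions):
    """Return True if a space-separated OpenCL extension string lists cl_khr_fp64."""
    return "cl_khr_fp64" in extensions.split()


# Hypothetical extension strings, made up for illustration only.
device_without_fp64 = "cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics"
device_with_fp64 = "cl_khr_byte_addressable_store cl_khr_fp64"

print(supports_fp64(device_without_fp64))
print(supports_fp64(device_with_fp64))
```

Splitting on whitespace before testing avoids a false match on a longer extension name that merely contains the same text.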


About the author: Kristof Beets


Kristof Beets is Senior Business Development Manager for PowerVR Graphics at Imagination Technologies where he leads the in-house demo development team and works on product messaging. He has a background in electrical engineering and received a master's degree in artificial intelligence. Prior to joining the Business Development Group he worked on SDKs and tools for both PC and mobile products as a member of the PowerVR Developer Relations Team. Previous work has been published in ShaderX2, X5 & X6, ARM IQ Magazine, and online by the Khronos Group, Beyond3D and 3Dfx Interactive. Kristof has spoken at GDC, SIGGRAPH, Embedded Technology, MWC and too many other conferences to remember.
