In my last blog post, I explained how the timing data and high-level counters in our PVRTune GPU performance recording tool enabled us to identify that the Fantasy Warrior 3D demo was CPU limited when running on our PowerVR-based target device.

In this article, we will describe some of the advanced profiling counters provided by PVRTune and explain how they can be used to analyze the GPU's workload in more detail.

Vertices per triangle

This counter is the average number of vertices per triangle, calculated as the number of input vertices processed divided by the number of input triangles processed. It indicates how efficiently transformed vertices are shared between triangles.

This value ranges from a maximum of 3 (no sharing at all: every triangle uses its own three vertices) down to a number close to or below 1. The lower this number is, the more efficiently the geometry can be processed.

So if we sort the geometry data properly, we can improve the efficiency of vertex processing. The general rule to follow is this: the more vertices that adjacent triangles share in the Index Buffer, the lower the Vertices per triangle value.
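
To make this rule concrete, the sketch below estimates the ratio for an indexed triangle list. It is only an illustration: the FIFO cache policy, the 16-entry cache size and the function name vertices_per_triangle are assumptions made for this example, not a description of how PowerVR hardware or PVRTune computes the counter.

```python
from collections import deque


def vertices_per_triangle(indices, cache_size=16):
    """Estimate transformed vertices per triangle for an indexed triangle list.

    Assumes a simple FIFO post-transform cache (an illustrative model only).
    """
    if not indices or len(indices) % 3 != 0:
        raise ValueError("index count must be a non-zero multiple of 3")
    cache = deque(maxlen=cache_size)  # oldest entry is evicted when full
    transformed = 0
    for index in indices:
        if index not in cache:
            # Cache miss: the vertex has to be transformed (again).
            transformed += 1
            cache.append(index)
    return transformed / (len(indices) // 3)


# Two triangles that share nothing: every index is a new vertex.
print(vertices_per_triangle([0, 1, 2, 3, 4, 5]))  # 3.0

# Two triangles sharing an edge (a quad).
print(vertices_per_triangle([0, 1, 2, 2, 1, 3]))  # 2.0

# 100 triangles where each one reuses two vertices of the previous one.
strip = [i + k for i in range(100) for k in range(3)]
print(vertices_per_triangle(strip))  # 1.02
```

In this model the ratio drops as consecutive triangles reuse recently transformed vertices, which is the effect that sorting the index buffer is meant to achieve.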

For this game, the Vertices per triangle value is 1.5, which is slightly higher than the ideal value (vertices per triangle <= 1). For optimal performance, triangles should be sorted by spatial locality to improve post-transform vertex cache efficiency.

We can use the triangle sorting algorithm provided by Imagination to optimize the meshes; the sorting algorithm has been included in the latest PowerVR Graphics SDK v4.0 release (check out the PVRTGeometry.h file).

Triangles culled

This value represents the percentage of post-transform triangles culled before data is written to the Parameter Buffer. These culled triangles include sub-pixel, back-face and off-screen polygons.

The value for this game (80.2%) is very high. It means the GPU is wasting a lot of time (GPU – Tiler load) processing polygons that the Tiler will then reject. We can use PVRTrace to determine whether there are a lot of off-screen polygons or whether back-face culling is always disabled.
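
As a rough illustration of what such a percentage measures, the sketch below classifies post-transform triangles and reports how many would be culled. It rests on several assumptions that are not from PVRTune: positions are 2D normalized device coordinates in [-1, 1], front faces wind counter-clockwise, sub-pixel culling is ignored, and all function names are hypothetical.

```python
def signed_area(triangle):
    """Signed area of a 2D triangle; positive when wound counter-clockwise."""
    (x0, y0), (x1, y1), (x2, y2) = triangle
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


def is_off_screen(triangle):
    """True when the triangle's bounding box lies entirely outside [-1, 1]."""
    xs = [x for x, _ in triangle]
    ys = [y for _, y in triangle]
    return max(xs) < -1.0 or min(xs) > 1.0 or max(ys) < -1.0 or min(ys) > 1.0


def culled_percentage(triangles):
    """Percentage of triangles that are off-screen, back-facing or zero-area."""
    culled = sum(
        1 for t in triangles if is_off_screen(t) or signed_area(t) <= 0.0
    )
    return 100.0 * culled / len(triangles)


triangles = [
    ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0)),  # visible, counter-clockwise
    ((0.0, 0.0), (0.0, 1.0), (1.0, 0.0)),  # back-facing (clockwise)
    ((2.0, 2.0), (3.0, 2.0), (2.0, 3.0)),  # entirely off-screen
    ((0.0, 0.0), (0.5, 0.5), (1.0, 1.0)),  # degenerate, zero area
]
print(culled_percentage(triangles))  # 75.0
```

Every triangle counted here has already been transformed before it is rejected, which is why a high percentage points to wasted geometry work.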

HSR efficiency

This counter represents the effectiveness of the Hidden Surface Removal (HSR) engine (more detail about HSR can be found here) at rejecting obscured pixels before they get processed. It tells you the percentage of pixels sent to be shaded out of the total number of pixels submitted.

Any pixel occluded by an opaque polygon and not visible is rejected at this early stage. This avoids the expensive processing and texturing of pixels that are not visible, maximizing processing performance and saving memory bandwidth.

The HSR efficiency in this game (21.0%) is a little low. We can use PVRTrace to check whether there are too many blended objects in the scene, and whether all the opaque draws are submitted before the translucent ones.
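
To see why submission order matters, here is a toy model of a single pixel. It is a deliberately simplified sketch and not a description of the actual HSR hardware: it assumes smaller depth values are closer, that translucent fragments do not write depth, and that a visible opaque fragment is only shaded once a translucent fragment arrives in front of it or the pass ends. The function name shaded_fragments is made up for this example.

```python
import math


def shaded_fragments(fragments):
    """Count fragments shaded for one pixel in a toy deferred-shading model.

    fragments: list of (depth, is_opaque) tuples in submission order.
    """
    nearest_opaque = math.inf  # depth of the closest opaque fragment so far
    pending_opaque = False     # a visible opaque fragment not yet shaded
    shaded = 0
    for depth, is_opaque in fragments:
        if depth >= nearest_opaque:
            continue  # hidden behind an opaque fragment: rejected unshaded
        if is_opaque:
            # Replaces any pending opaque fragment, which is never shaded.
            nearest_opaque = depth
            pending_opaque = True
        else:
            # Blending needs the colour underneath, so shade it first.
            if pending_opaque:
                shaded += 1
                pending_opaque = False
            shaded += 1
    if pending_opaque:
        shaded += 1
    return shaded


# Same three fragments, two submission orders.
opaque_first = [(0.9, True), (0.5, True), (0.8, False)]
interleaved = [(0.9, True), (0.8, False), (0.5, True)]
print(shaded_fragments(opaque_first))  # 1
print(shaded_fragments(interleaved))   # 3
```

In this model, submitting the opaque fragments first lets the hidden translucent fragment and the far opaque fragment be rejected without any shading work.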

ISP overload

The Image Synthesis Processor (ISP) is the part of the PowerVR GPU that fetches the primitive data and performs Hidden Surface Removal (HSR), along with depth and stencil tests. This counter indicates if this unit has become a bottleneck.

The ISP overload counter for this game shows a high value of 75.6%. ISP overload events are usually quite rare; they occur when one or several tiles have to process a large number of overlapping polygons (i.e. high overdraw). We can use PVRTrace to check whether there is high overdraw in the scene.

Z load/store

For most renders, depth and stencil buffers only contain temporary data required to complete the associated render pass. In the PowerVR GPU architecture, on-chip depth and stencil buffers are used to store this data. When the appropriate API mechanisms are utilized, an application will never cause data to be uploaded to or written from these on-chip buffers. This enables an application to avoid redundant system memory transfers for these temporary buffers.

A Z load/store event indicates that there has been an upload or resolve of depth/stencil data to/from the GPU. Unless a given application requires depth or stencil information to be preserved, an application should always use the appropriate API mechanisms to avoid these costly data transfer operations.

A value above 0% indicates a Z load/store event has occurred. To avoid these operations, you should ensure depth and stencil buffers that do not need to be preserved are cleared at the start and invalidated at the end of each render pass.
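
The rule can be written down as a small check. The sketch below is a toy model, and the RenderPass type and z_transfers function are hypothetical names for this example. In an OpenGL ES application, the clear would typically be a glClear call covering the depth and stencil bits, and the invalidate would be glInvalidateFramebuffer (OpenGL ES 3.0) or glDiscardFramebufferEXT (EXT_discard_framebuffer extension).

```python
from dataclasses import dataclass


@dataclass
class RenderPass:
    """Hypothetical description of how one pass treats depth/stencil."""
    clears_at_start: bool
    invalidates_at_end: bool


def z_transfers(render_pass):
    """Return (load_needed, store_needed) for depth/stencil in this toy model.

    Without a clear, the previous contents have to be loaded on-chip;
    without an invalidate, the final contents have to be written back out.
    """
    load_needed = not render_pass.clears_at_start
    store_needed = not render_pass.invalidates_at_end
    return load_needed, store_needed


print(z_transfers(RenderPass(clears_at_start=True, invalidates_at_end=True)))
# (False, False)
print(z_transfers(RenderPass(clears_at_start=False, invalidates_at_end=True)))
# (True, False)
```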

For this game, this value is 0% so there is no upload or resolve of depth/stencil data to/from the GPU.

Texturing load

This counter represents the average load of Texturing Units compared to their peak throughput.

A high value (e.g. beyond 50%) indicates that the Texturing Units are spending a significant amount of time fetching texture data from system memory and/or performing linear interpolation filtering operations.

When the load is high, you should refer to the following counter for more information about the bottleneck:

  • Texture Overload (%): If Texturing Load is high, it is likely that texture overload events will occur. These events can reduce Active Slot Occupancy, in turn reducing the Shader Processor's ability to hide latency caused by data dependencies.

The Texturing load is 80.6% in this game, which is a very high value. This might be caused by high overdraw or too many blended objects in the scene. The Texture Overload counter should help us narrow down the bottleneck.

Texture overload

This counter shows when texture overload events have occurred. Each event indicates that a texture sample request queue is full, i.e. the shader processing units are submitting requests faster than the Texturing Unit can process them.

The Texture overload counter shows a high value of 11.3% for this game because the pixel shader is too simple to hide the latency of texture fetch requests. We can use PVRTrace to investigate further and Texture Warm-up to improve texture fetch operations. We can also use PVRShaderEditor to optimize the pixel shader and reduce Dependent Texture Reads.
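
The relationship between shader simplicity and a full request queue can be sketched with a toy queue simulation. Everything here is an assumption made for illustration: the cycle-based model, the parameter names, the queue capacity, and the simplification that a request arriving at a full queue is counted as an overload event and dropped rather than retried. It does not reproduce how PVRTune derives the counter.

```python
def texture_overload_percent(issue_interval, service_interval,
                             queue_capacity=4, cycles=1000):
    """Percentage of cycles in which a texture request found the queue full.

    issue_interval: cycles of shader work between texture requests.
    service_interval: cycles the texturing unit needs per request.
    """
    queued = 0
    overload_events = 0
    for cycle in range(cycles):
        # The texturing unit retires one request every service_interval cycles.
        if cycle % service_interval == 0 and queued > 0:
            queued -= 1
        # The shader issues one request every issue_interval cycles.
        if cycle % issue_interval == 0:
            if queued < queue_capacity:
                queued += 1
            else:
                overload_events += 1  # queue full: overload event
    return 100.0 * overload_events / cycles


# A shader doing 4 cycles of work per sample never fills the queue here.
print(texture_overload_percent(issue_interval=4, service_interval=2))

# A trivial shader that samples every cycle quickly saturates it.
print(texture_overload_percent(issue_interval=1, service_interval=4))
```

In this model, the shader that does more arithmetic between samples gives the texturing unit time to drain the queue, while the trivial shader keeps it full.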

Summary

This game is a typical CPU-limited case, but we have also identified the following issues:

  • Need to apply triangle sorting to improve the vertices per triangle ratio.
  • Need to check if there are too many off-screen polygons or if back-face culling is always disabled.
  • Need to check if there are too many translucent objects or if translucent objects are drawn before opaque objects.
  • Need to reduce overdraw and optimize texture usage.
  • Need to optimize the shader code.

These changes can improve the performance of the game after we solve the render thread issue. In the next post, I plan to explain how we used PVRTrace to isolate OpenGL ES API call inefficiencies and will also demonstrate the improvements delivered by the changes above to the original code.

Please let us know if you have any feedback on the materials published on the blog and leave a comment on what you’d like to see next. Make sure you also follow us on Twitter (@ImaginationPR, @GPUCompute and @PowerVRInsider) for more news and announcements from Imagination.

About the author: Sun Kevin

Kevin Sun is a leading PowerVR developer technology engineer for Imagination Technologies.
