Unity UI Profiling Tools
There are several profiling tools useful for analyzing Unity UI’s performance. The key tools are:
- Unity Profiler
- Unity Frame Debugger
- Xcode’s Instruments or Intel VTune
- Xcode’s Frame Debugger or Intel GPA
The external tools provide method-level CPU profiling with millisecond (or better) resolution, as well as detailed draw-call and shader profiling. Instructions for setting up and using the above tools lie beyond the scope of this guide. Note that the Xcode Frame Debugger and Instruments are only usable on IL2CPP builds for Apple platforms, and therefore can currently only be used to profile iOS builds.
The primary use for the Unity Profiler is to perform comparative profiling: enabling and disabling elements of a UI while the Unity Profiler is running can quickly narrow down the portions of a UI hierarchy that are most responsible for performance issues.
To analyze this, watch the Canvas.BuildBatch and Canvas.SendWillRenderCanvases lines in the profiler’s output.
Canvas.BuildBatch contains the native-code calculations that perform the Canvas Batch Building process, as described previously.
Canvas.SendWillRenderCanvases contains the invocation of the C# scripts that are subscribed to the Canvas component’s willRenderCanvases Event. Unity UI’s CanvasUpdateRegistry class receives this event and uses it to run the Rebuild process, described previously. It is expected that any dirty UI components will update their Canvas Renderers at this time.
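The comparative approach described above can be sketched as a small script. This is a minimal illustration, not part of Unity: it assumes the per-frame timings (in milliseconds) of the two markers have already been copied out of the Profiler by hand into plain dictionaries, and the run names used in the usage note ("Inventory", "HUD") are hypothetical.

```python
from statistics import mean

# The two Profiler markers this guide recommends watching.
MARKERS = ("Canvas.BuildBatch", "Canvas.SendWillRenderCanvases")


def ui_cost(frames):
    """Mean per-frame time (ms) spent in the two UI markers."""
    return mean(sum(frame[m] for m in MARKERS) for frame in frames)


def rank_savings(baseline, runs):
    """Rank each 'element disabled' run by how much UI time it saved.

    baseline: list of per-frame dicts captured with the full UI enabled.
    runs: dict mapping the name of the disabled element to its frames.
    """
    base = ui_cost(baseline)
    savings = {name: base - ui_cost(frames) for name, frames in runs.items()}
    return sorted(savings.items(), key=lambda item: item[1], reverse=True)
```

Calling rank_savings with a baseline capture and one capture per disabled subtree (for example "Inventory" and "HUD") returns the subtrees ordered by the time saved when they were disabled, so the first entry is the most likely culprit.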
Note: To more easily see differences in UI performance, it is generally advisable to disable all of the trace categories aside from “Rendering”, “Scripts” and “UI”. This can be done by clicking on the colored boxes beside the name of the trace category on the left-hand side of the CPU Usage profiler. The categories can also be re-ordered in the CPU profiler by clicking and dragging the names of the categories upwards or downwards.
The UI category is new in Unity 2017.1 and up. Unfortunately, parts of the UI update process are not categorized correctly, so be careful when looking at the UI curve because it may not contain all UI related calls. For example, Canvas.SendWillRenderCanvases is categorized as "UI", but Canvas.BuildBatch is categorized as “Others” and “Rendering”.
In 2017.1 and up, there’s also a new UI Profiler. By default, this profiler is the last one in the Profiler window. It consists of two timelines and a batch viewer:
The first timeline shows the CPU time spent in two categories, respectively computing layout and rendering. Note that it suffers from the same problem described previously and some UI functions may not be accounted for.
The second timeline shows the total number of batches, vertices and also displays event markers. In the previous screenshot, you can see a couple of button click events. These markers can help you determine what caused a CPU spike.
Finally, the most useful feature of the UI Profiler is the batch viewer at the bottom. On the left, there’s a tree view of all your canvases and underneath each of them, a list of the batches they generated. The columns provide interesting details about each canvas or batch, but there’s one in particular that is crucial to better understand how to optimize your UI and it’s the Batch Breaking Reason.
This column shows why the selected batch couldn’t be merged with the previous one. Reducing the number of batches is one of the most effective ways of improving UI performance, so it’s important to understand what breaks batching.
One of the most frequent reasons, as shown in the screenshot, is a UI element using a different texture or material. In many cases, this can easily be fixed by using sprite atlases. The last column shows the names of the game objects associated with the batch. You can double-click a name to select the game object in the editor (this is particularly helpful when you have several objects with the same name).
As of Unity 2017.3, the batch viewer only works in the editor. Batching should usually be the same on device, so the viewer is still very helpful. If you suspect that batches differ on device, you can use the Frame Debugger, described next.
The Unity Frame Debugger is a useful tool for reducing the number of draw calls generated by a Unity UI. This built-in tool can be accessed via the Window menu within the Unity Editor. When enabled, it will display all draw calls generated by Unity, including those generated by Unity UI.
Notably, the frame debugger will update itself with the draw calls generated to display the Game View in the Unity Editor, and therefore can be used to try out different UI configurations without even entering Play Mode.
The location of the Unity UI draw calls depends on the Render Mode selected on the Canvas component being drawn:
- Screen Space – Overlay will appear within the Canvas.RenderOverlays group
- Screen Space – Camera will appear within the Camera.Render group of the selected Render Camera, as a subgroup of Render.TransparentGeometry
- World Space will appear as a subgroup of Render.TransparentGeometry for each World Space camera in which the Canvas is visible
All UIs can be identified by the “Shader: UI/Default” line in the group or draw call’s details (assuming that the UI shader has not been replaced with a custom shader). See the highlighted red boxes in the screenshot below.
By watching this set of lines while tweaking a UI, it is relatively simple to maximize the Canvas’ ability to combine UI elements into batches. The most common design-related cause of broken batches is unintentional overlap.
All Unity UI components generate their geometry as a series of quads. However, many UI sprites or UI text glyphs occupy only a fraction of the quads used to represent them, with the rest being empty space. As a result, it is quite common to find that the UI’s designer has unintentionally overlapped multiple different quads whose textures come from different materials and therefore cannot be batched.
As Unity UI operates entirely in the transparent queue, any quads that have unbatchable quads overlaid atop them must be drawn before the unbatchable quads, and therefore cannot be batched with other quads placed atop the unbatchable quads.
Consider a case of three quads, A, B, and C. Assume all three quads overlap one another, and also assume quads A and C use the same material while quad B uses a separate material. Quad B therefore cannot be batched with A or C.
If the order in the hierarchy (from top to bottom) is A, B, C then A and C cannot be batched, because B must be drawn atop A and beneath C. However, if B is placed before or after the batchable quads, then the batchable quads can actually be batched – B needs only to be drawn before or after the batched quads and does not interpose them.
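The A, B, C case can be reproduced with a small model. This is a simplified sketch, not Unity's actual batching code: it assumes each quad is an axis-aligned rectangle, gives each quad a depth one greater than any overlapped quad beneath it that uses a different material, then sorts by depth and material and counts runs of identical materials. The Quad type, the material names and the rectangle values are all hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Quad:
    name: str
    material: str
    rect: tuple  # (x_min, y_min, x_max, y_max)


def overlaps(a, b):
    """True if the two axis-aligned rectangles share any area."""
    ax0, ay0, ax1, ay1 = a.rect
    bx0, by0, bx1, by1 = b.rect
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def count_batches(quads):
    """Count batches for quads listed in hierarchy (draw) order."""
    depths = []
    for i, quad in enumerate(quads):
        depth = 0
        for j in range(i):
            lower = quads[j]
            if overlaps(quad, lower):
                # A different material beneath forces this quad one level up.
                step = 0 if lower.material == quad.material else 1
                depth = max(depth, depths[j] + step)
        depths.append(depth)
    order = sorted(range(len(quads)),
                   key=lambda i: (depths[i], quads[i].material))
    # Consecutive quads with the same material form one batch.
    return sum(1 for _ in groupby(order, key=lambda i: quads[i].material))


a = Quad("A", "sprites", (0, 0, 10, 10))
b = Quad("B", "font", (5, 5, 15, 15))
c = Quad("C", "sprites", (8, 8, 18, 18))

print(count_batches([a, b, c]))  # 3: B interposes A and C
print(count_batches([b, a, c]))  # 2: B is drawn first, A and C batch
print(count_batches([a, c, b]))  # 2: A and C batch, B is drawn last
```

In this model, moving B out from between A and C removes one batch, which matches the reasoning above.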
Instruments & VTune
Xcode’s Instruments and Intel’s VTune allow for extremely deep profiling of Unity UI rebuilds and Canvas batch calculations on Apple or Intel CPUs, respectively. The method names are nearly identical to the profiler labels discussed above in the Unity Profiler section:
Canvas::SendWillRenderCanvases is the C++ parent that calls the Canvas.SendWillRenderCanvases C# method and governs that line in the Unity Profiler. It will contain the code used to run the Rebuild process, as described in the previous chapter.
Canvas::UpdateBatches corresponds to Canvas.BuildBatch, but includes additional boilerplate code not covered by the Unity Profiler label. It runs the actual Canvas Batch Building process, described above.
When used in conjunction with a Unity app built via IL2CPP, these tools can be used to drill down deeper into the transpiled C# code of Canvas::SendWillRenderCanvases. Of primary interest will be the cost of the following methods. (Note: transpiled method names are approximate.)
- IndexedSet_Sort and CanvasUpdateRegistry_SortLayoutList are used to sort the list of dirty Layout components before the layouts are recalculated. As described above, this involves calculating the number of parent transforms above each Layout component.
- ClipperRegistry_Cull calls all registered implementers of the IClipRegion interface. Built-in implementers include RectMask2D, which uses the IClippable interface. During ClipperRegistry.Cull calls, RectMask2D components loop over all clippable elements contained within their hierarchy and ask them to update their culling information.
- Graphic_Rebuild will contain the cost of actually calculating the meshes needed to represent Image, Text or other Graphic-derived components. Beneath this will be several other methods like Graphic_UpdateGeometry and, most notably, Text_OnPopulateMesh.
- Text_OnPopulateMesh is generally a hotspot when Best Fit is enabled. This is discussed in more detail later in this guide.
- Mesh modifiers, such as Shadow_ModifyMesh and Outline_ModifyMesh, will also run here. The cost of calculating component drop shadows, outlines and other special effects can be seen via these methods.
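To illustrate why the layout sort in the first item above can become costly, the following sketch models a sort keyed on the number of parent transforms. It is an illustration under stated assumptions, not Unity's source: the Transform class is a hypothetical stand-in, and the ascending order (shallowest first) is assumed here.

```python
class Transform:
    """Hypothetical stand-in for a transform with a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def parent_count(transform):
    """Walk up the hierarchy and count the parents above this transform."""
    count = 0
    while transform.parent is not None:
        count += 1
        transform = transform.parent
    return count


def sort_dirty_layouts(dirty):
    """Order dirty layout components by hierarchy depth, shallowest first."""
    return sorted(dirty, key=parent_count)
```

Every call to parent_count walks the full chain of parents, so deep hierarchies with many dirty Layout components make the sort more expensive. Python's sorted evaluates the key once per element; a comparison function that recomputes the count on every comparison walks the chain even more often.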
Xcode Frame Debugger & Intel GPA
Low-level frame debugging tools are essential for profiling the cost of individual portions of the batched UI as well as monitoring the cost of UI overdraw. UI overdraw is discussed in more detail later in this guide.
Using the Xcode Frame Debugger
To test whether a given UI is overstressing the GPU, Xcode’s built-in GPU diagnostics tools can be employed. First, configure the project in question to use Metal or OpenGLES3, then make a build and open the resulting Xcode project. Some Xcode version and device combinations may support OpenGLES 2 frame captures, but this is not guaranteed to work.
Note: On some versions of Xcode, it is necessary to select the appropriate Graphics API in the Build Scheme in order to make the graphics profiler work. To do this, go to the Product menu in Xcode, expand the Scheme menu item, and choose Edit Scheme.... Select the Run target and go to the Options tab. Change the GPU Frame Capture option to match the API used by your project. Assuming the Unity project is set up to automatically select a graphics API, then most modern iPads will default to using Metal. If in doubt, start the project and look at the debug logs in Xcode. One of the early lines should indicate which rendering path (Metal, GLES3 or GLES2) is being initialized.
Build and run the project on an iOS device. The GPU profiler can be found by showing the Debug pane in Xcode’s Navigator sidebar, and clicking on the FPS entry.
The first point of interest in the GPU profiler is the set of three bars in the center of the screen, labeled “Tiler”, “Renderer”, and “Device”. Of these three:
- “Tiler” is generally a measure of how stressed the GPU is by processing geometry, which includes time spent in vertex shaders. Generally, a high “Tiler” usage indicates either excessively slow vertex shaders or an excessive number of vertices being drawn.
- “Renderer” is generally a measure of how stressed the GPU’s pixel pipelines are. Generally, high “Renderer” usage indicates that an application is exceeding the maximum fill-rate of the GPU, or has inefficient fragment shaders.
- “Device” is a composite measure of overall GPU usage, which includes both “Tiler” and “Renderer” performance. It can generally be ignored, as it will roughly track the higher of the “Tiler” or “Renderer” measurements.
For more information on Xcode’s GPU profiler, see this documentation article.
Xcode’s Frame Debugger can be triggered by clicking on the small ‘Camera’ icon hidden at the bottom of the GPU profiler. It is highlighted by an arrow and a red box in the following screenshot.
After a brief pause, the Frame Debugger’s summary view should appear, like so:
Assuming the default UI shader has not been replaced with a custom shader, the cost of rendering geometry generated by the Unity UI system will show up under the “UI/Default” shader pass. This shader is visible in the above screenshot as Render Pipeline “UI/Default”.
Unity UI only generates quads and so the vertex shader is unlikely to stress the tiler pipeline of the GPU. Any problems that appear in this shader pass are likely due to fill-rate issues.
Analyzing profiler results
After gathering profiling data, several conclusions might be drawn. If Canvas.BuildBatch or Canvas::UpdateBatches seems to be using an excessive amount of CPU time, then the likely problem is an excessive number of Canvas Renderer components on a single Canvas. See the Splitting Canvases section of the Canvas chapter.
If an excessive amount of time is spent drawing the UI on the GPU, and the frame debugger indicates that the fragment shader pipeline is the bottleneck, then the UI is likely exceeding the GPU’s pixel fill rate. The most likely cause is excessive UI overdraw. See the Remediating fill-rate issues section of the Fill-rate, Canvases and input chapter.
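As a rough way to reason about overdraw, the sketch below estimates how many times, on average, each screen pixel is shaded by a set of UI quads. It is a back-of-the-envelope model with several assumptions: every quad is an axis-aligned rectangle in screen pixels, every quad is drawn in full (nothing is culled or masked), and the function name and inputs are hypothetical.

```python
def overdraw_factor(quads, screen_w, screen_h):
    """Average number of times each screen pixel is drawn.

    quads: iterable of (x_min, y_min, x_max, y_max) in screen pixels.
    Areas are clamped to the screen; off-screen quads contribute nothing.
    """
    drawn = 0
    for x0, y0, x1, y1 in quads:
        width = max(0, min(x1, screen_w) - max(x0, 0))
        height = max(0, min(y1, screen_h) - max(y0, 0))
        drawn += width * height
    return drawn / (screen_w * screen_h)


# A full-screen background, a half-screen panel and a centered icon.
quads = [(0, 0, 100, 100), (0, 0, 100, 50), (25, 25, 75, 75)]
print(overdraw_factor(quads, 100, 100))  # 1.75
```

A factor well above 1 suggests that large transparent quads are stacked on top of each other, which is the situation the fill-rate remediation section addresses.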
If Graphic Rebuilds are using excessive CPU, as seen by a large portion of CPU time going to Canvas.SendWillRenderCanvases or Canvas::SendWillRenderCanvases, then deeper analysis is needed. Some portion of the Graphic Rebuild process is likely responsible.
If a large portion of Canvas.SendWillRenderCanvases is spent inside IndexedSet_Sort or CanvasUpdateRegistry_SortLayoutList, then time is being spent sorting the list of dirty Layout components. Consider reducing the number of Layout components on the Canvas. See the Replacing layouts with RectTransforms and Splitting Canvases sections for possible remediations.
If excessive time seems to be spent in Text_OnPopulateMesh, then the culprit is simply the generation of text meshes. See the Best Fit and Disabling Canvases sections for possible remediations, and consider the advice inside Splitting Canvases if much of the text being rebuilt is not actually having its underlying string data changed.
If time is spent inside Shadow_ModifyMesh or Outline_ModifyMesh (or any other implementation of ModifyMesh), then the problem is excessive time spent calculating mesh modifiers. Consider removing these components and achieving their visual effect via static images.
If there is no particular hotspot within Canvas.SendWillRenderCanvases, or it appears to be running every frame, then the problem is likely that dynamic elements have been grouped together with static elements and are forcing the entire Canvas to rebuild too frequently. See the Splitting Canvases section.
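The CPU-side conclusions in this section can be summarized as a lookup table. The sketch below is only a restatement of the guidance above in code form: the dictionary and function names are hypothetical, and the section names are the ones referenced in this guide.

```python
# Hotspot (profiler label or transpiled method name) -> suggested remediation.
REMEDIATIONS = {
    "Canvas.BuildBatch": ["Splitting Canvases"],
    "Canvas::UpdateBatches": ["Splitting Canvases"],
    "IndexedSet_Sort": ["Replacing layouts with RectTransforms",
                        "Splitting Canvases"],
    "CanvasUpdateRegistry_SortLayoutList": [
        "Replacing layouts with RectTransforms", "Splitting Canvases"],
    "Text_OnPopulateMesh": ["Best Fit", "Disabling Canvases",
                            "Splitting Canvases"],
    "Shadow_ModifyMesh": ["Replace the effect with static images"],
    "Outline_ModifyMesh": ["Replace the effect with static images"],
}


def suggest(hotspot=None):
    """Return the remediations for a hotspot.

    Pass None when Canvas.SendWillRenderCanvases shows no particular
    hotspot or runs every frame; unknown hotspots return an empty list.
    """
    if hotspot is None:
        return ["Splitting Canvases"]
    return REMEDIATIONS.get(hotspot, [])
```

For example, suggest("Text_OnPopulateMesh") returns the three sections named above for text mesh generation, while suggest() with no argument covers the case where dynamic and static elements share a Canvas.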