One scene, 10 graphics optimizations for console developers
Last updated: January 2019
What you will get from this page: Console graphics optimization tips, courtesy of Unity’s Rob Thompson, a console graphics developer (who presented them at Unite), and our demo team. These optimizations were made to an especially difficult scene to ensure smooth 30fps performance.
AAA quality with Unity’s Scriptable Render Pipelines
The Book of the Dead (in this article we refer to it as BOTD) was produced by Unity's demo team. It’s a real-time rendered animated short, essentially created to test-drive and show off the capabilities of the new (and still experimental) Scriptable Render Pipelines (SRP). The SRPs allow you to code the core of your render loop in C#, thereby giving you much more flexibility for customizing how your scene is drawn to make it specific to your content.
There are two SRPs available: the High-Definition Render Pipeline (HDRP), which offers all of the features that you'd expect from a modern, AAA-quality, high-fidelity renderer, and the Lightweight Render Pipeline (LWRP), which maintains responsive performance when scaling for mobile. BOTD uses the HDRP, and you can find advanced learning resources for the SRPs at the end of this article.
All of the assets and all of the script code for BOTD are available for you in the Asset Store.
A major optimization pass to get to 1080p at 30fps (or better) on consoles
The objective with the demo was to offer an interactive experience in which people could wander around the environment, get their hands on it, and experience it in a way that was familiar from a traditional AAA games perspective. Specifically, we wanted to show BOTD running on Xbox One and PS4. The performance requirement was 1080p at 30fps or better.
As it’s a demo, and not a full game, the main focus for optimizations was on the rendering.
Generally, performance for BOTD is fairly consistent as it doesn’t have any scenes where, for example, there are thousands of particles suddenly spawning into life or loads of animated characters appearing.
To start with, Rob and the demo team found the view that performed worst in terms of GPU load. Here’s a screenshot of it:
What's going on in the scene is pretty much constant; what varies is what’s within the view of the camera. If they could make savings on this scene, they’d ultimately increase performance throughout the entire demo.
The reason why this scene performed poorly is that it's an exterior view of the level looking into the center of it, so the vast majority of assets in the scene are in the camera frustum. This results in a lot of draw calls.
In brief, here is how this scene was rendered:
- With the HDRP
- Most of the artist authored textures are between 1K and 2K sized maps, with a handful at 4K.
- It uses Baked Occlusion and Baked GI for the indirect lighting, and a Single Dynamic Shadow Casting Light source for direct lighting from the sun.
- It issues around a few thousand draw calls at any point (draw calls and compute shader dispatches)
- At the start of the optimization pass, the view was GPU-bound on PS4 Pro at around 45 milliseconds per frame.
What the GPU Frame showed before optimization pass
At the start of the optimization pass, Rob and the team looked at the GPU frame step by step, and saw the following performance:
- The Gbuffer was at 11ms (you can find a description of the Gbuffer layouts for HDRP in this post by Unity lead graphics developer Sebastien Lagarde)
- Motion Vectors and Screen Space Ambient Occlusion were pretty fast, at 0.25ms and 0.6ms respectively
- Shadow maps from the dynamic shadow-casting directional light came in at a whopping 13.9ms
- Deferred lighting was at 4.9ms
- Atmospheric scattering was at 6.6ms
This is what their GPU frame looked like, from start to finish:
As you can see, they’re at 45 milliseconds and the two vertical orange lines show where they needed to be to hit 30fps and 60fps respectively.
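To put those numbers in context, here is a minimal sketch of the frame-budget arithmetic. The pass timings and the 45ms total are the figures quoted above; the function and variable names are illustrative only and are not part of the BOTD project.

```python
# Frame-time budgets versus the pass timings quoted above (all values in ms).

def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / target_fps

# GPU pass timings listed above for the unoptimized view.
pass_timings_ms = {
    "gbuffer": 11.0,
    "motion_vectors": 0.25,
    "ssao": 0.6,
    "shadow_maps": 13.9,
    "deferred_lighting": 4.9,
    "atmospheric_scattering": 6.6,
}

frame_time_ms = 45.0  # approximate total quoted for PS4 Pro

listed_total_ms = sum(pass_timings_ms.values())
budget_30_ms = frame_budget_ms(30)
budget_60_ms = frame_budget_ms(60)
needed_saving_ms = frame_time_ms - budget_30_ms

print(f"listed passes: {listed_total_ms:.2f} ms of {frame_time_ms:.0f} ms")
print(f"30fps budget: {budget_30_ms:.1f} ms, 60fps budget: {budget_60_ms:.1f} ms")
print(f"saving needed for 30fps: {needed_saving_ms:.1f} ms")
```

The listed passes do not add up to the full frame, because the frame also contains work that is not itemized here.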
10 ways they optimized rendering for Book of the Dead
Control the batch count
CPU performance was not a big issue for the team because BOTD is a demo, not a game, meaning it didn’t have the complexities of the script code that goes along with all of the systems that are necessary for a full game. Again, the main concern was optimizing the rendering work.
However, keeping the batch count low is still a valuable tip for any platform. The team did this by using Occlusion Culling and, primarily, GPU instancing. Avoid using Dynamic batching on consoles unless you are sure it’s providing a performance win.
In this case, GPU instancing was the most useful method. Without it there would have been 4500 batches for this scene. By using it, they reduced that number to 1832.
Another takeaway: The number of individual assets used to create this scene is actually very small. By using good quality assets, and placing them intelligently, the team created complex scenes that don't look repetitive, and kept the batch count low with GPU instancing.
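As a rough illustration of why a small asset set helps, the sketch below counts batches for a made-up scene. It assumes that objects sharing the same mesh and material can be drawn as one instanced batch; real engines apply further limits (instance counts per batch, per-object lighting data and so on), and the scene contents here are hypothetical, not taken from BOTD.

```python
from typing import NamedTuple


class Renderable(NamedTuple):
    mesh: str
    material: str


def count_batches(objects, instancing: bool) -> int:
    """Count draw batches for a list of renderables.

    Without instancing every object is its own batch. With instancing,
    objects that share a (mesh, material) pair collapse into one batch.
    This ignores real-world limits such as maximum instances per batch.
    """
    if not instancing:
        return len(objects)
    return len({(o.mesh, o.material) for o in objects})


# Hypothetical scene: many placements of a handful of assets.
scene = (
    [Renderable("pine_tree", "bark")] * 200
    + [Renderable("rock", "granite")] * 150
    + [Renderable("fern", "foliage")] * 50
)

batches_plain = count_batches(scene, instancing=False)
batches_instanced = count_batches(scene, instancing=True)
print(batches_plain, batches_instanced)
```

The fewer distinct mesh and material combinations a scene uses, the more placements each instanced batch can absorb.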
Here's the scene with no instancing.
And here it is with instances:
Use the multiple cores available on consoles
Both Xbox One and PS4 are multi core devices, and in order to get the best CPU performance, we need to try and keep those cores busy all of the time.
If you follow Unity news, you’ll know that Unity is currently developing a high performance multithreaded system, that will make it possible for your game to fully utilise the multicore processors available today and in the future. (Very) briefly, this system comprises three sub-systems: the Entity Component System, the C# Job System and the Burst Compiler.
The new multithreaded system is still in early experimental mode. In Unity’s graphics system, you can try it out via the Graphics Jobs mode (again, in experimental mode). You can find the Graphics Jobs controls under Player Settings -> Other Settings.
Graphics Jobs mode will get you a performance improvement in almost all circumstances on console, unless you're literally drawing only a handful of batches. There are two types available:
- Legacy Jobs, available on PS4, and DirectX 11 for Xbox One
  - Takes pressure off the main thread by distributing work to other cores. Be aware that in very large scenes it can cause a bottleneck in the “Render Thread”, a thread that Unity uses to talk to the platform holder’s graphics API.
- Native Jobs, available on PS4, and DirectX 12 for Xbox One (coming soon)
  - Distributes the most work across available cores and is the best option for large scenes.
  - Should always be the best option from 2018.2 onwards (in versions 2018.1 or earlier, it can put more work on the main thread, causing performance regressions).
Use the platform holder’s performance analysis tools
Microsoft and Sony provide excellent tools for analyzing your project’s performance both on the CPU and on the GPU. These tools are available to you for free if you're working on console. Learn them early on and keep using them throughout your development cycle. Pix for Xbox One is Microsoft's offering and the Razor Suite is Sony's; as Rob says, they are the main tools in your arsenal when it comes to optimization on these platforms.
Profile your post-process effects
Rob says that he’s profiled Unity games on PS4 where, unbeknownst to the developers, up to ⅔ of the frame time was taken up by post-processing. This is often caused by downloading post-processing assets from the Asset Store that are authored primarily for PC. They appear to run fine on console, but in fact their performance characteristics are really bad.
So, when applying such effects, profile how long they take on the GPU, and iterate until you find a happy balance between visual quality and performance. And then leave them alone: they are a static cost in every scene, so you know how much GPU time is left over to work with.
Avoid using tessellation (unless for a very good reason)
In general, it’s not advisable to use tessellation in console game graphics. In most cases, you’re better off using the equivalent artist-authored assets than tessellating them at runtime on the GPU.
But, in the case of BOTD, they had a good reason for using tessellation, and that was for rendering the bark of the trees.
Tessellated displacement allowed them to add the deep recesses and gnarly details into the geometry that will self-shadow correctly in a way that normal mapping won't.
As the trees are “hero” objects in much of BOTD, it was justified. This was done by using the same mesh for the trees at LOD 0 and LOD 1. The difference between them is simply that the tessellated displacement is scaled back so that it's no longer in effect by the time the trees reach LOD 1.
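The article does not say how the scale-back was driven, so the following is only a sketch of one plausible approach: a linear fade of the displacement amount with camera distance. The function name and the `fade_start` and `fade_end` parameters are hypothetical.

```python
def displacement_scale(distance: float, fade_start: float, fade_end: float) -> float:
    """Return a 0..1 multiplier for tessellated displacement.

    Assumption (not from the article): displacement is at full strength up
    to fade_start, fades linearly, and is fully off at fade_end, where the
    LOD 1 range is taken to begin.
    """
    if fade_end <= fade_start:
        raise ValueError("fade_end must be greater than fade_start")
    t = (distance - fade_start) / (fade_end - fade_start)
    t = min(max(t, 0.0), 1.0)  # clamp to the fade range
    return 1.0 - t


for d in (5.0, 10.0, 20.0, 30.0, 100.0):
    print(d, displacement_scale(d, fade_start=10.0, fade_end=30.0))
```

Because both LODs share a mesh, fading the displacement to zero before the switch avoids a visible pop at the transition.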
Aim for healthy wavefront occupancy at all times on the GPU
OK, that’s a mouthful, but wavefront occupancy is very much worth understanding.
You can think of a wavefront as a packet of GPU work. When you submit a draw call to the GPU, or a compute shader dispatch, that work is split into many wavefronts, and those wavefronts are distributed throughout all of the SIMDs within all of the compute units that are available on the GPU.
Each SIMD has a maximum number of wavefronts that can be running at any one time, and therefore there is a maximum total number of wavefronts that can be running in parallel on the GPU at any one point. How many of those wavefronts we are using is referred to as wavefront occupancy, and it’s a very useful metric for understanding how well you are using the GPU's potential for parallelism.
Pix and Razor can show wavefront occupancy in great detail. The graphs above are from Pix for Xbox One. On the left we have an example of good wavefront occupancy. Along the bottom on the green strip we can see some vertex shader wavefronts running and above that in blue we can see some pixel shader wavefronts running.
On the right, though, we can see there’s a performance issue. It’s showing a lot of vertex shader work that's not resulting in much pixel shader activity. This is an underutilization of the GPU's potential.
How does this come about? This scenario is typical when we're doing vertex shader work that doesn't result in pixels. This brings us to the next optimization tip.
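To make the occupancy metric described in this section concrete, here is a minimal sketch of the arithmetic. The hardware figures passed in are placeholders chosen for illustration and are not the specification of any particular console; Pix and Razor report the real values.

```python
def max_wavefronts(compute_units: int, simds_per_cu: int, waves_per_simd: int) -> int:
    """Upper bound on wavefronts that can be resident on the GPU at once."""
    return compute_units * simds_per_cu * waves_per_simd


def occupancy(active_waves: int, compute_units: int, simds_per_cu: int,
              waves_per_simd: int) -> float:
    """Fraction of the GPU's wavefront slots that are currently in use."""
    return active_waves / max_wavefronts(compute_units, simds_per_cu, waves_per_simd)


# Placeholder hardware description (illustrative values only).
gpu = dict(compute_units=18, simds_per_cu=4, waves_per_simd=10)

capacity = max_wavefronts(**gpu)
healthy = occupancy(active_waves=540, **gpu)   # most slots busy
starved = occupancy(active_waves=72, **gpu)    # e.g. vertex work producing few pixels
print(capacity, healthy, starved)
```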
Utilize Depth Prepass
Some more analysis on Pix and Razor showed that the team were getting a lot of overdraw during the Gbuffer pass. This is particularly bad on console when looking at alpha-tested objects.
On console, if you issue pixel discard instructions or write directly to depth in your pixel shader, you can’t take advantage of early depth rejection. Those pixel shader wavefronts get run anyway, even though the work is going to be thrown out at the end.
The solution here was to add a Depth Prepass. A Depth Prepass involves rendering the scene in advance to depth only, using very light shaders, that then can be the basis of more intelligent depth rejection where you’ve got your heavier Gbuffer shaders bound.
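The effect of a Depth Prepass on heavy shader invocations can be sketched with a toy fragment counter. This is a simplified model, not HDRP code: it assumes the worst case described above, where early depth rejection is unavailable in the main pass, and the fragment list is made up.

```python
def heavy_invocations_without_prepass(fragments) -> int:
    """No early depth rejection: every fragment runs the heavy shader."""
    return len(fragments)


def heavy_invocations_with_prepass(fragments) -> int:
    """Depth-only pass first, then shade only the nearest fragment per pixel."""
    # Pass 1: cheap depth-only rendering records the nearest depth per pixel.
    nearest = {}
    for pixel, depth in fragments:
        if pixel not in nearest or depth < nearest[pixel]:
            nearest[pixel] = depth
    # Pass 2: the heavy shader runs only where the depth matches pass 1.
    return sum(1 for pixel, depth in fragments if depth == nearest[pixel])


# Hypothetical fragments as (pixel, depth); smaller depth is closer.
fragments = [
    ((0, 0), 0.9), ((0, 0), 0.5), ((0, 0), 0.2),  # three layers of overdraw
    ((1, 0), 0.3), ((1, 0), 0.7),                 # two layers of overdraw
]

without_prepass = heavy_invocations_without_prepass(fragments)
with_prepass = heavy_invocations_with_prepass(fragments)
print(without_prepass, with_prepass)
```

The prepass itself still touches every fragment, but with a much lighter shader, which is why it tends to pay off when the Gbuffer shaders are expensive.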
The HDRP now includes a Depth Prepass for all alpha tested objects, but you can also switch on a full Depth Prepass if you want. The settings for controlling HDRP, what render passes are used, and features enabled, are all made available via the .
If you search in an HDRP project for the HD Render Pipeline Asset, you'll find a great big swath of checkboxes that control everything that HDRP is doing.
For BOTD, using Depth Prepass was a great GPU win, but keep in mind that it does have the overhead of adding more batches for the CPU to submit.
Reduce the size of your shadow mapping render targets
As mentioned earlier, the shadow maps in this scene are generated from a single shadow-casting directional light. Four shadow map splits were used, and initially they were rendering to a 4K shadow map at 32-bit depth, as this is the default for HDRP projects. When rendering to shadow maps, the resolution of the shadow map is almost always the limiting factor; this was backed up by analysis in Pix and Razor.
Reducing the resolution of the shadow map was the obvious solution, even though it could affect quality.
The shadow map resolution was dropped to 3K, which provided a perfectly acceptable trade-off against performance. The demo team also added an option specifically to allow developers to render to 16-bit depth shadow maps. If you want to give that a go for yourself, download the project assets.
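A quick back-of-the-envelope calculation shows what these changes mean for the render target. The sketch assumes that "4K" and "3K" mean square 4096 and 3072 textures, which the article does not state explicitly, and it ignores any padding or compression the hardware may apply.

```python
def shadow_map_bytes(resolution: int, bits_per_texel: int) -> int:
    """Uncompressed size of a square depth target, ignoring padding."""
    return resolution * resolution * bits_per_texel // 8


MIB = 1024 * 1024

size_4k_32 = shadow_map_bytes(4096, 32)  # assumed original default
size_3k_32 = shadow_map_bytes(3072, 32)  # after the resolution drop
size_3k_16 = shadow_map_bytes(3072, 16)  # with the optional 16-bit depth

texel_ratio = (3072 * 3072) / (4096 * 4096)

print(size_4k_32 // MIB, size_3k_32 // MIB, size_3k_16 // MIB, texel_ratio)
```

Under these assumptions, the resolution drop alone leaves a little over half the texels to fill, and 16-bit depth halves the memory footprint again.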
Finally, by changing the resolution of their Shadow map, they also had to change some settings on the light.
At this point, the team had made their shadow map revisions and repositioned their shadow mapping camera to get the best utilization out of the new, reduced resolution. So, what did they do next?
Only draw the last (most zoomed-out) Shadow map split once on level load
As the shadow mapping camera doesn’t move much, they could get away with this. That most zoomed out split is typically used for rendering the shadows that are furthest from the Player camera.
They did not see a drop in quality. It turned out to be a very clever optimization because it both saved GPU frame time and reduced batch numbers on the CPU.
After this series of optimizations, their shadow map creation phase went from 13ms to just under 8ms; the lighting pass went from 4.9ms to 4.4ms, and the atmospherics pass went from 6.6ms to 4.2ms.
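The idea of rendering the farthest split once and reusing it can be sketched as follows. This is a conceptual outline rather than the BOTD implementation: the class, the `render_split` callback and the string "shadow maps" are all hypothetical stand-ins.

```python
class CachedCascadeShadows:
    """Re-renders the near splits every frame and reuses the farthest split."""

    def __init__(self, split_count, render_split):
        self.split_count = split_count
        self.render_split = render_split  # hypothetical callback: index -> shadow map
        self.cached_far_split = None

    def on_level_load(self):
        # Render the last (most zoomed-out) split once and keep the result.
        self.cached_far_split = self.render_split(self.split_count - 1)

    def render_frame(self):
        if self.cached_far_split is None:
            raise RuntimeError("call on_level_load() before render_frame()")
        near = [self.render_split(i) for i in range(self.split_count - 1)]
        return near + [self.cached_far_split]


render_log = []  # records which split indices were actually rendered


def fake_render_split(index):
    render_log.append(index)
    return f"shadow-map-{index}"


shadows = CachedCascadeShadows(split_count=4, render_split=fake_render_split)
shadows.on_level_load()
for _ in range(3):
    frame_maps = shadows.render_frame()

print(len(render_log), render_log.count(3))
```

A cache like this is only safe while the shadow camera and the casters in that far range stay effectively static, which is the condition the team relied on.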
This is where the team was at the end of the shadow mapping optimization. They’re now within the boundary where they could run at 30fps on PS4 Pro.
Utilize Async Compute
Async Compute is a method for filling periods of underutilization on the GPU with useful compute shader work. It’s currently only supported on PS4, but is coming to DX12 for Xbox One very soon. It's accessible through Unity's Command Buffer interface. It's not exclusively for the SRP but is primarily aimed at it. Code examples are available in the BOTD assets, or look at the HDRP sources.
The depth only phase, which is what you’re doing with shadow mapping, is traditionally a point where you’re not making full use of the GPU's potential. Async Compute allows you to move your compute shader work to run in parallel with your graphics queue, thereby making use of resources that the graphics queue is underutilizing.
BOTD uses Async Compute for its tiled light list gather, which is part of the deferred lighting, most of which is done with compute shaders on console in HDRP. It also uses it for its SSAO calculations. Both of these overlap with the shadow map rendering to fill in the gaps in wavefront utilization.
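The scheduling benefit can be illustrated with a deliberately idealized timing model. It is not Unity API code, and all durations are hypothetical placeholders; it assumes the compute work overlaps perfectly with shadow rendering, whereas in practice the two queues compete for GPU resources, so the real gain is smaller.

```python
def serial_ms(shadows: float, light_list: float, ssao: float, lighting: float) -> float:
    """Everything runs back to back on the graphics queue."""
    return shadows + light_list + ssao + lighting


def overlapped_ms(shadows: float, light_list: float, ssao: float, lighting: float) -> float:
    """Light list gather and SSAO run on the compute queue during shadows.

    The lighting pass needs both the shadow maps and the light list, so it
    waits for whichever queue finishes last (the role a fence plays).
    """
    compute_queue = light_list + ssao
    return max(shadows, compute_queue) + lighting


# Hypothetical durations in milliseconds, chosen only for illustration.
timings = dict(shadows=8.0, light_list=1.5, ssao=0.5, lighting=4.0)

serial_total = serial_ms(**timings)
overlapped_total = overlapped_ms(**timings)
print(serial_total, overlapped_total)
```

In this model the compute work is hidden entirely behind the depth-only shadow pass, which is the best case that Async Compute aims for.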
For a run-through of some conceptual code where Async Compute is employed, tune into Rob’s Unite session at 35:30.