Unity's evolving best practices
Last updated: December 2018
What you will get from this talk: updated scripting performance and optimization tips from Ian Dundore that reflect the evolution of Unity’s architecture to support data-oriented design.
Unity’s evolving, and the old tricks might no longer be the best ways to squeeze performance out of the engine. In this article, you’ll get a rundown of a few Unity changes (from Unity 5.x to 2018.x) and how you can take advantage of them. All of these practices come from Ian Dundore's talk, so, if you prefer, you can go straight to the source.
One of the most difficult tasks during optimization is choosing how to optimize code once a hot-spot has been discovered. There are many different factors involved: platform-specific details of the operating system, CPU, and GPU; threading; memory access; and several others. It is difficult to know in advance which optimization will produce the biggest real-world benefit.
Generally, it’s best to prototype optimizations in small test projects – you can iterate much faster. However, isolating code into a test project poses its own challenge: simply isolating a piece of code changes the environment in which it runs. Thread timings may differ; the managed heap may be smaller or less fragmented. Therefore, it’s important to be careful in designing your tests.
Start by considering the inputs to your code and how the code reacts when you change its inputs.
- How does it react to highly-coherent data that's located serially in memory?
- How does it handle cache-incoherent data?
- How much code have you removed from the loop in which your code runs? Have you altered the usage of the processor’s instruction cache?
- What hardware is it running on? How well does that hardware implement branch prediction? How well does it execute micro-operations out-of-order? Does it have SIMD support?
- If your code is heavily multi-threaded, how does it react when running on systems with more cores versus fewer cores?
- What are the scaling parameters of your code? Does it scale linearly as its input set grows, or more than linearly?
Effectively, you must think about exactly what you are measuring with your test harness.
As an example, consider the following test of a simple operation: comparing two strings.
When most C# APIs compare two strings, they perform locale-specific conversions to ensure that characters can match equivalent characters from other cultures – and that is pretty slow.
While most string APIs in C# are culture-sensitive, one is not: String.Equals.
If you opened up String.cs in Microsoft's .NET reference source on GitHub and looked at String.Equals, this is what you would see: a very simple function that does a couple of checks before passing control to a function called EqualsHelper, a private method that you can't call directly without reflection.
EqualsHelper is a simple method. It walks through the strings 4 bytes at a time, comparing the raw bytes of the input strings. If it finds a mismatch, it stops and returns "false".
But there are other ways to check for string equality. The most innocuous-looking is an overload of String.Equals which accepts two parameters: the string to compare against, and an Enum called StringComparison.
We’ve already determined that the single-parameter overload of String.Equals does only a little work before passing control to EqualsHelper. What does the two-parameter overload do?
If you look at the two-parameter overload’s code, it performs a few additional checks before entering a large switch statement. This statement tests the value of the input StringComparison enum. Since we’re seeking parity with the single-parameter overload, we want an ordinal comparison – a byte-by-byte comparison. In this case, control will flow past 4 checks before arriving at the StringComparison.Ordinal case, where the code then looks very similar to the single-parameter String.Equals overload. This means that, if you used the two-parameter overload of String.Equals instead of the single-parameter overload, the processor would perform a few extra comparison operations. One could expect this to be slower, but it’s worth testing.
You don’t want to stop at testing just String.Equals when you are interested in all the ways to compare strings for equality. There is an overload of String.Compare which can perform ordinal comparisons, as well as a method called String.CompareOrdinal which itself has two different overloads.
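For reference, these are the call shapes under comparison (the class and method here are just illustrative scaffolding):

```csharp
using System;

public static class ComparisonKinds
{
    public static void Demo(string a, string b)
    {
        bool e1 = a.Equals(b);                                   // single-parameter overload
        bool e2 = a.Equals(b, StringComparison.Ordinal);         // two-parameter overload
        bool c1 = String.Compare(a, b, StringComparison.Ordinal) == 0;
        bool c2 = String.CompareOrdinal(a, b) == 0;
        bool op = (a == b);                                      // also calls String.Equals
    }
}
```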
As a reference implementation, here’s a simple hand-coded example: just a little function with a length check that iterates over the characters of the two input strings and compares them.
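A sketch of what such a hand-coded reference implementation might look like (the exact code from the talk isn't reproduced here):

```csharp
public static class HandCoded
{
    public static bool Equal(string a, string b)
    {
        // Length check first: strings of different lengths can't be equal.
        if (a.Length != b.Length)
            return false;

        // Then compare character by character, stopping at the first mismatch.
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
                return false;
        }
        return true;
    }
}
```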
After examining the code for all of these, there are four different test cases that seem immediately useful:
- Two identical strings, to test worst-case performance.
- Two strings of random characters but of identical length, to bypass length checks.
- Two strings of random characters and identical length with an identical first character, to bypass an interesting optimization found only in String.CompareOrdinal.
- Two strings of random characters and differing lengths, to test best-case performance.
After running a few tests, String.Equals is the clear winner. This remains true regardless of platform, of scripting runtime version, or whether we’re using Mono or IL2CPP.
It’s worth noting that String.Equals is the method used by the string equality operator, ==, so don’t run out and change a == b to a.Equals(b) all over your code!
Actually, examining the results, it’s odd how much worse the hand-coded reference implementation is. Examining the IL2CPP code, we can see that Unity injects a bunch of array bounds checks and null checks when our code is cross-compiled.
These can be disabled. In your Unity install folder, find the IL2CPP subfolder; inside it, you'll find Il2CppSetOptionAttribute.cs. Drag this file into your project, and you'll get access to the Il2CppSetOptionAttribute.
You can decorate types and methods with this attribute. You can configure it to disable automatic null checks, automatic array bounds checks, or both. This can speed up code – sometimes by a substantial amount. In this particular test case, it speeds up the hand-coded string comparison method by about 20%!
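As an illustration, the attribute can be applied per-type or per-method like this (FastPaths and Sum are hypothetical names; the attribute and Option enum come from the Il2CppSetOptionAttribute.cs file mentioned above):

```csharp
using Unity.IL2CPP.CompilerServices;

public static class FastPaths
{
    // Disable IL2CPP's injected checks for this method only.
    // Only do this for code you have profiled and trust not to
    // dereference null or index out of range.
    [Il2CppSetOption(Option.NullChecks, false)]
    [Il2CppSetOption(Option.ArrayBoundsChecks, false)]
    public static int Sum(int[] values)
    {
        int total = 0;
        for (int i = 0; i < values.Length; i++)
            total += values[i];
        return total;
    }
}
```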
Transforms
You wouldn’t know it by simply looking at the Hierarchy in the Unity Editor, but the Transform component has changed a lot between Unity 5 and Unity 2018. These changes also present some interesting new possibilities for performance improvements.
Back in Unity 4 and Unity 5.0, an object would be allocated somewhere in Unity’s native memory heap whenever you created a Transform. That object could be anywhere in the native memory heap – there was no guarantee that two Transforms allocated sequentially would be allocated near one another, and there was no guarantee that a child Transform would be allocated near its parent.
This meant that, when iterating through a Transform Hierarchy linearly, we were not iterating linearly through a contiguous region of memory. This caused the processor to repeatedly stall as it waited for Transform data to be fetched from the L2 cache or from main memory.
In Unity's backend, every time a Transform’s position, rotation or scale changed, that Transform would send an OnTransformChanged message. This message had to be received by all child Transforms so they could update their own data, and so they could notify any other Components interested in Transform changes. For example, a child Transform with a Collider attached must update the Physics system any time the child Transform or its parent changes.
This unavoidable message caused a lot of performance problems, especially because there was no built-in way to avoid spurious messages. If you were going to change a Transform, and you knew you were also going to change its children, there was no way to prevent Unity from sending an OnTransformChanged message after each change. This wasted a lot of CPU time.
Because of this detail, one of the most common pieces of advice in older versions of Unity was to batch up changes to Transforms: capture a Transform’s position and rotation once, at the start of a frame, use and update those cached values over the course of the frame, and apply the changes to position and rotation only once, at the end of the frame. This was good advice, all the way up to Unity 2017.2.
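A minimal sketch of that caching pattern, assuming a simple mover script (all names illustrative):

```csharp
using UnityEngine;

public class CachedMover : MonoBehaviour
{
    Vector3 cachedPosition;
    Quaternion cachedRotation;

    void Start()
    {
        // Capture the Transform's state once.
        cachedPosition = transform.position;
        cachedRotation = transform.rotation;
    }

    void Update()
    {
        // Work against the cached values during the frame...
        cachedPosition += Vector3.forward * Time.deltaTime;
        cachedRotation *= Quaternion.Euler(0f, 90f * Time.deltaTime, 0f);
    }

    void LateUpdate()
    {
        // ...and write them back to the Transform exactly once.
        transform.SetPositionAndRotation(cachedPosition, cachedRotation);
    }
}
```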
Fortunately, as of Unity 2017.4 and 2018.1, OnTransformChanged is dead. A new system, TransformChangeDispatch, has replaced it.
TransformChangeDispatch was first introduced in Unity 5.4. In this version, Transforms were no longer lonely objects that could be located anywhere in Unity’s native memory heap. Instead, each root Transform in a scene would be represented by a contiguous data buffer. This buffer, called a TransformHierarchy structure, contains all of the data for all of the Transforms beneath a root Transform.
In addition, a TransformHierarchy also stores metadata about each Transform inside it. This metadata includes a bitmask which indicates whether a given Transform is "dirty" – if its position, rotation or scale has changed since the last time the Transform was marked as being “clean”. It also includes a bitmask to track which of Unity’s other systems is interested in changes to a specific Transform.
With this data, Unity can now create a list of dirty Transforms for each of its other internal systems. For example, the Physics system can query the TransformChangeDispatch to fetch a list of Transforms whose data have changed since the last time the Physics system ran a FixedUpdate.
However, to assemble this list of changed Transforms, the TransformChangeDispatch system cannot simply iterate over all Transforms in a scene. That would become very slow if a scene contained a lot of Transforms – especially since, in most cases, very few Transforms will have changed.
To fix this, the TransformChangeDispatch tracks a list of dirty TransformHierarchy structures. Whenever a Transform changes, it marks itself as dirty, marks its children as dirty, and then registers the TransformHierarchy in which it is stored with the TransformChangeDispatch system. When another system inside of Unity requests a list of changed Transforms, the TransformChangeDispatch iterates over each Transform stored inside each of the dirty TransformHierarchy structures. Transforms with the appropriate dirty and interest bits set are added to a list and this list is returned to the system making the request.
Because of this architecture, the more you split up your Transform hierarchies, the more granularly Unity can track changes. The more Transforms exist at the root of a scene, the fewer Transforms Unity has to examine when looking for changes.
There’s another implication, though. TransformChangeDispatch uses Unity’s internal multithreading system to split up the work it needs to do when examining TransformHierarchy structures. This splitting-up, and the merging of results, adds a little bit of overhead each time a system needs to request a list of changes from TransformChangeDispatch.
Most of Unity’s internal systems request updates once per frame, immediately before they run. For example, the Animation system requests updates immediately before it evaluates all the active Animators in your scene. Similarly, the Rendering system requests updates for all active Renderers in your scene before it begins culling the list of visible objects.
One system is a little bit different: Physics.
In Unity 2017.1 (and older), Physics updates were synchronous. When you moved or rotated a Transform with a Collider attached to it, we immediately updated the Physics scene. This ensured that the Collider’s changed position or rotation was reflected in the Physics world, so that Raycasts and other Physics queries would be accurate.
When we migrated Physics to use TransformChangeDispatch in Unity 2017.2, this necessary change could also create problems: any time you did a Raycast, we had to query TransformChangeDispatch for a list of changed Transforms and apply them to the Physics world. That could be expensive, depending on how large your Transform hierarchies were and how your code called Physics APIs.
This behavior is governed by a new setting, Physics.autoSyncTransforms. From Unity 2017.2 to Unity 2018.2, this setting defaults to "true", and Unity will automatically synchronize the Physics world to Transform updates each time you call a Physics query API like Raycast or Spherecast.
This setting can be changed, either in your Physics Settings in the Unity Editor or at runtime by setting the Physics.autoSyncTransforms property. If you set it to "false" and disable automatic Physics synchronization, then the Physics system will only query the TransformChangeDispatch system for changes at a specific time: immediately before running FixedUpdate.
If you’re seeing performance problems when calling Physics query APIs, there are two additional ways to deal with them.
First, you can set Physics.autoSyncTransforms to "false". This will eliminate spikes due to TransformChangeDispatch and Physics scene updates from Physics queries.
However, if you do this, changes to Colliders will not be synchronized into the Physics scene until the next FixedUpdate. This means that, if you disable autoSyncTransforms, move a Collider and then call Raycast with a Ray directed at the Collider’s new position, the Raycast might not hit the collider – this is because the Raycast is operating on the last-updated version of the Physics scene, and the Physics scene has not yet been updated with the Collider’s new position.
This can result in strange bugs, and you should carefully test your game to ensure that disabling automatic Transform synchronization does not cause problems. If you need to force Physics to update the Physics scene with Transform changes, you can call Physics.SyncTransforms. This API is slow, so you do not want to call it multiple times per frame!
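A sketch combining these APIs (variable names illustrative):

```csharp
using UnityEngine;

public class ManualSync : MonoBehaviour
{
    void MoveAndQuery(Transform target, Ray ray)
    {
        Physics.autoSyncTransforms = false;   // opt out of per-query syncs

        // Move a Collider's Transform...
        target.position += Vector3.up * 2f;

        // ...then force the Physics scene to catch up before querying.
        // SyncTransforms is slow – call it once, not once per query.
        Physics.SyncTransforms();

        RaycastHit hit;
        if (Physics.Raycast(ray, out hit))
            Debug.Log("Hit " + hit.collider.name);
    }
}
```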
Note that, from Unity 2018.3 onwards, Physics.autoSyncTransforms will default to "false".
The second way to optimize away the time spent querying TransformChangeDispatch is to rearrange the order in which you are querying and updating the Physics scene to be friendlier to the new system.
Remember: with Physics.autoSyncTransforms set to "true", every Physics query will check for changes with TransformChangeDispatch. However, if TransformChangeDispatch does not have any dirty TransformHierarchy structures to check, and the Physics system has no updated Transforms to apply to the Physics scene, then there will be almost no overhead added to the Physics query.
So, you could perform all of your Physics queries in a batch, and then apply all Transform changes in a batch. Effectively, do not mix changes to Transforms with calls to Physics query APIs.
This example illustrates the difference:
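In sketch form (movers and offsets are illustrative; this is a reconstruction, not the talk's original figure):

```csharp
using UnityEngine;

public class QueryBatching : MonoBehaviour
{
    public Transform[] movers;
    public Vector3[] offsets;

    // Slow: with Physics.autoSyncTransforms enabled, every Raycast that
    // follows a Transform change forces the Physics scene to re-sync.
    void Interleaved()
    {
        for (int i = 0; i < movers.Length; i++)
        {
            movers[i].position += offsets[i];
            Physics.Raycast(movers[i].position, Vector3.down);
        }
    }

    // Fast: all changes first, then all queries – at most one re-sync.
    void Batched()
    {
        for (int i = 0; i < movers.Length; i++)
            movers[i].position += offsets[i];

        for (int i = 0; i < movers.Length; i++)
            Physics.Raycast(movers[i].position, Vector3.down);
    }
}
```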
The performance difference between these two examples is striking, and becomes even more dramatic when a scene contains only small Transform hierarchies.
The Audio system
Internally, Unity uses a system called FMOD to play AudioClips. FMOD runs on its own threads, which are responsible for decoding and mixing audio together. However, audio playback isn’t entirely free. There is some work performed on the main thread for each Audio Source active in a scene. Also, on platforms with fewer cores (such as older mobile phones), it’s possible for FMOD’s audio threads to compete for processor cores with Unity’s main and rendering threads.
Each frame, Unity loops over all active Audio Sources. For each Audio Source, Unity calculates the distance between the audio source and the active audio listener, and a few other parameters. This data is used to calculate volume attenuation, doppler shift, and other effects that can affect individual Audio Sources.
A common issue comes from the "Mute" checkbox on an Audio Source. You might think that setting Mute to true would eliminate any computation related to the muted Audio Source – but it doesn’t!
Instead, the “Mute” setting simply clamps the Volume parameter to zero after all other Volume-related calculations are performed, including the distance check. Unity will also submit the muted Audio Source to FMOD, which will then ignore it. The calculation of Audio Source parameters and the submission of Audio Sources to FMOD show up as AudioSystem.Update in the Unity Profiler.
If you notice a lot of time allocated to that Profiler marker, check to see if you have a lot of active Audio Sources which are muted. If they are, consider disabling the muted Audio Source components instead of muting them, or disabling their GameObject. You can also call AudioSource.Stop, which will stop playback.
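For example (source is an illustrative AudioSource reference; pick the one approach that fits your game's structure):

```csharp
using UnityEngine;

public class MuteAlternatives : MonoBehaviour
{
    public AudioSource source;

    void Silence()
    {
        // source.mute = true;              // still pays per-frame parameter costs
        source.Stop();                      // stop playback entirely, or:
        source.enabled = false;             // disable the component, or:
        source.gameObject.SetActive(false); // disable the whole GameObject
    }
}
```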
Another thing you can do is clamp the voice count in Unity’s Audio Settings. To do this, you can call AudioSettings.GetConfiguration, which returns a structure containing two values of interest: the virtual voice count, and the real voice count.
Reducing the number of Virtual Voices will reduce the number of Audio Sources which FMOD will examine when determining which Audio Sources to actually play back. Reducing the Real Voice count will reduce the number of Audio Sources which FMOD actually mixes together to produce your game’s audio.
To change the number of Virtual or Real voices that FMOD uses, you should change the appropriate values in the AudioConfiguration structure returned by AudioSettings.GetConfiguration, then reset the Audio system with the new configuration by passing the AudioConfiguration structure as a parameter to AudioSettings.Reset. Note that this interrupts audio playback, so it’s recommended to do this when players won’t notice the change, like during a loading screen or at startup time.
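A sketch of that sequence (the voice counts are illustrative – profile to find limits your mix can afford):

```csharp
using UnityEngine;

public class VoiceLimiter : MonoBehaviour
{
    void Start()
    {
        // Do this at startup or during a loading screen:
        // AudioSettings.Reset interrupts audio playback.
        AudioConfiguration config = AudioSettings.GetConfiguration();
        config.numVirtualVoices = 128;  // voices FMOD will examine
        config.numRealVoices = 16;      // voices FMOD will actually mix
        AudioSettings.Reset(config);
    }
}
```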
Animations
There are two different systems that can be used to play animations in Unity: the Animator system and the Animation system.
By "Animator system" we mean the system involving the Animator component, which is attached to GameObjects in order to animate them, and the AnimatorController asset, which is referenced by one or more Animators. This system was historically called Mecanim, and is very feature-rich.
In an Animator Controller, you define states. These states can be either an Animation Clip or a Blend Tree. States can be organized into Layers. Each frame, the active state on each Layer is evaluated, and the results from each Layer are blended together and applied to the animated model. When transitioning between two states, both states are evaluated.
The other system is one we call the “Animation system.” It is represented by the Animation component, and it is very simple. Each frame, each active Animation component linearly iterates through all the curves in its attached Animation Clip, evaluates those curves, and applies the results.
The difference between these two systems is not just features, but also underlying implementation details.
The Animator system is heavily multithreaded. Its performance changes dramatically across different CPUs with different numbers of cores. In general, it scales less-than-linearly as the number of curves in its Animation Clips increases. Therefore, it performs very well when evaluating complex animations with a large number of curves. However, it has a fairly high overhead cost.
The Animation system, being simple, has almost no overhead. Its performance scales linearly with the number of curves in the Animation Clips being played.
This difference is most striking when the two systems are compared when playing back identical Animation Clips.
When playing back Animation Clips, try to choose the system that best suits the complexity of your content and the hardware on which your game will be running.
Another common problem is the overuse of Layers in Animator Controllers. When an Animator is running, it evaluates all of the Layers in its Animator Controller every frame – including Layers whose Layer Weight is set to zero, even though they make no visible contribution to the final animation results.
Each additional Layer will add additional computation to each Animator Controller each frame. So, in general, try to use Layers sparingly. If you have debug, demo or cinematic Layers in an Animator Controller, try to refactor them. Merge them into existing Layers, or eliminate them before shipping your game.
Generic vs humanoid rig
By default, Unity imports animated models with the Generic rig, but it’s common for developers to switch to the Humanoid rig when they’re animating a character. However, this isn’t free.
The Humanoid rig adds two additional features to the Animator System: inverse kinematics and animation retargeting. Animation retargeting is great, allowing you to re-use animations across different avatars.
However, even if you’re not using IK or Animation Retargeting, a Humanoid-rigged character’s Animator will still compute IK and Retargeting data each frame. This consumes about 30-50% more CPU time than the Generic rig, which does not do these calculations.
If you're not using the Humanoid rig’s features, you should be using the Generic rig.
Object Pooling is a key strategy for avoiding performance spikes during gameplay. However, Animators have historically been difficult to use with Object Pooling. Whenever an Animator’s GameObject is enabled, it must rebuild buffers of intermediate data for use when evaluating the Animator’s Animator Controller. This is called an Animator Rebind, and shows up in the Unity Profiler as Animator.Rebind.
Prior to Unity 2018, the only workaround was disabling the Animator component instead of its GameObject. This had side effects: if you had any MonoBehaviours, Mesh Colliders, or Mesh Renderers on your character, you had to disable those as well to recover all of the CPU time the character was using. This adds complexity to your code and is easy to break.
With Unity 2018.1, we introduced the Animator.KeepControllerStateOnEnable API. This property defaults to false, meaning the Animator will behave the same way it always has – deallocating its intermediate data buffers when it is disabled and re-allocating them when it is enabled.
However, if you set this property to true, Animators will retain their buffers while they’re disabled. This means there will be no Animator.Rebind when that Animator is re-enabled. Animators can finally be pooled!
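A pooling sketch; note that the property name has varied across Unity versions (the talk calls it KeepControllerStateOnEnable, while 2018.x releases expose Animator.keepAnimatorControllerStateOnDisable with the behavior described above), so verify the exact name against your Unity version's Animator API:

```csharp
using UnityEngine;

public class PooledCharacter : MonoBehaviour
{
    Animator animator;

    void Awake()
    {
        animator = GetComponent<Animator>();
        // Keep the Animator's intermediate buffers alive while disabled,
        // so re-enabling the pooled object triggers no Animator.Rebind.
        animator.keepAnimatorControllerStateOnDisable = true;
    }

    // Hypothetical pool hooks:
    public void OnReturnToPool() { gameObject.SetActive(false); }
    public void OnTakeFromPool() { gameObject.SetActive(true); }
}
```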