Unity3D : Adventures in creating huge outdoor areas (terrains) — Part 4: The real adventure of GPU Instancing begins

Jul 22, 2020

All three previous parts were (sometimes somewhat wordy) descriptions of the easy stuff, basically just looking around all the default options and seeing how they perform and why, to have some kind of baseline.

The real adventure starts now, as well as the real “journaling” of this adventure, since from now on, I’m writing literally along actually doing the stuff.

I have found a good article focusing on creation and optimization of the kind of environments I want, forests. It talks about many useful techniques. I don’t really want to use any of those. I don’t like mesh combining, at least for this project, because if I have 6 meshes repeating all the time, why would I want to recombine them into x thousand meshes that don’t repeat at all? It will save on drawcalls, sure, but it sure as hell won’t help the memory demands, which I expect will become a challenge somewhat soon down the line. Other techniques the article suggests are related to creation of the assets themselves, which is, at least at this time, unusable for me, since I’m not creating the assets.

It is a good article though, good for when your workflow is similar to the author’s, and even if not, it’s a good read about many other optimization techniques, so that you know what’s out there.

I want to get into GPU instancing. I have wanted to for a long time, but until now I didn’t have a real reason. Now I do, so I’m going.

Oh, and, yes, there’s also already many tools and libraries and assets that provide the functionality I need, as well as many, many other features. For example, GPU Instancer looks pretty much amazing. For now, I will pretend it doesn’t exist. I want to get my hands dirty with understanding how it’s done and what’s going on in the insides, by making my own solution, tailor-made specifically for what I need it for.

Ok, disclaimers out of the way, let’s go.

There are two functions at the core of the whole thing: Graphics.DrawMeshInstanced and Graphics.DrawMeshInstancedIndirect. As I understand it so far, they both do the same thing, but the non-indirect one is made to be a bit more user-friendly, and is also limited to a maximum of 1023 instances per call. I will start with that one, even though my intention is to use the Indirect one later. It seems silly to me to draw 10 thousand trees in 10 calls if I can do it in one with a bit more work. The only reason I’m starting with the non-indirect one is that it’s (supposedly) easier to use, so it will be easier and faster to establish whether I’m on the right track with what I’m doing.

It seems a bit strange to me that I should do this call in Update phase… Won’t it cause visual collisions with other stuff which has been rendered normally? We’ll see.

Let’s start with this idea, then:

I manually add a large number of GameObjects to the scene. I tag them by adding an empty script, let’s say “ToBeGpuInstanced”, to each of them.
When the game runs, my instancing script will collect all of those objects, their positions, rotations, scales and meshes, remove them from the scene, and start drawing them by instancing instead.

This is, of course, going to be clunky, but I’ll figure out a better way to define and store all the data sometime later.

First funny (but logical) thing — mesh doesn’t even store link to its materials, renderer does. And there’s nothing explicitly linking submesh of a mesh to its material, it’s just… submesh[0] uses material[0] of the renderer, submesh[1] uses material[1] of the renderer, etc. So I’ll need a structure to store this. There we go.
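The structure itself is not shown here, so the following is only a guess at its shape, sketched in plain Python rather than Unity C#. `InstancedMesh`, its fields, and the strings standing in for meshes and materials are all hypothetical. The one rule it encodes comes from the paragraph above: submesh i is drawn with material i of the renderer.

```python
from dataclasses import dataclass, field


@dataclass
class InstancedMesh:
    """Hypothetical container: one shared mesh, its materials, all placements."""
    mesh: str                       # stand-in for a Unity Mesh reference
    materials: list                 # materials[i] is used for submesh i
    transforms: list = field(default_factory=list)  # one entry per instance

    def draw_pairs(self):
        """Return (submesh_index, material) pairs, matched purely by index."""
        return list(enumerate(self.materials))


tree = InstancedMesh(mesh="Tree01", materials=["Bark", "Leaves"])
tree.transforms.append((10.0, 0.0, 5.0))  # placeholder for a real matrix
print(tree.draw_pairs())
```

Keeping the materials next to the mesh means the drawing loop never has to go back to a Renderer, which matters once the original GameObjects are gone.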

Oh, why am I using an empty MonoBehaviour to tag objects, when Tag exists? Because an object can only have one Tag, and I might have better uses for that one later. And when you make a tag script like this, don’t forget to actually delete all the functions from it, even the empty Start and Update. I suspect Unity inspects MonoBehaviours by reflection and adds their callbacks to its call list if it finds those functions, even when they’re empty.

Now, I’ll have to be careful, too. One gameobject can have a whole hierarchy of other objects (and more importantly, Renderers!) in itself, so I need to traverse the whole thing and collect all of them. Each renderer’s position gets added to the appropriate InstancedMesh object for that renderer’s mesh.

Oh, actually, nonsense (yeah, as I said, I’m writing this as I’m working on it, so stuff like this WILL happen), Renderer only holds materials, it’s MeshFilter that holds the mesh!

…which raises an interesting question that never occurred to me: if I have a hierarchy of GOs, where only the parent has a renderer, and the children have different MeshFilters, but no renderers… what happens? Let’s test it real quick.

Tested, and… nothing happens. Nothing gets rendered. Renderer only renders the mesh in MeshFilter attached to the same GO. I thought that’s the likely case, but I was hoping maybe I’d get a different, more interesting result.

Back to where we were, then. That means it doesn’t matter whether we look for Renderers or MeshFilters in the children of our tagged object: wherever we find one, we’ll find the other as well (unless someone screwed up), and we’ll need to pull out both. Okay, easy. Also, in that case I’m looking for MeshFilters, since my instanced mesh dictionary is keyed on the meshes they hold.

Also, don’t forget to not be stupid and read the MeshFilter.sharedMesh, not MeshFilter.mesh. Why? The second one creates a separate instance/copy for each MeshFilter you read from, even if those meshes are all identical (shared), so you’ll needlessly pollute memory.
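The collection pass described above can be sketched roughly like this. It is plain Python with a hypothetical `Node` class standing in for a GameObject; `mesh` plays the role of MeshFilter.sharedMesh and `materials` the role of the Renderer's material list. It is not the code used in the project, only the idea: walk the whole hierarchy and group everything by the shared mesh.

```python
from collections import defaultdict


class Node:
    """Hypothetical stand-in for a GameObject with optional mesh and children."""
    def __init__(self, name, mesh=None, materials=None, children=None):
        self.name = name
        self.mesh = mesh                  # role of MeshFilter.sharedMesh
        self.materials = materials or []  # role of the Renderer's materials
        self.children = children or []


def collect(node, groups):
    """Depth-first walk: record every node that has both a mesh and materials."""
    if node.mesh is not None and node.materials:
        groups[node.mesh].append(node.name)
    for child in node.children:
        collect(child, groups)


root = Node("Tree", children=[
    Node("Trunk", mesh="TrunkMesh", materials=["Bark"]),
    Node("Crown", mesh="CrownMesh", materials=["Leaves"],
         children=[Node("Extra", mesh="CrownMesh", materials=["Leaves"])]),
    Node("Orphan", mesh="LooseMesh"),  # mesh without a renderer: skipped
])
groups = defaultdict(list)
collect(root, groups)
print(dict(groups))
```

Because the dictionary key is the shared mesh, identical trees end up in one group no matter how deep in the hierarchy they sit.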

Now, first actual interesting question:
When making the Matrix4x4 which the instance drawing function uses, I’m doing Matrix4x4.TRS(transform.position, transform.rotation, transform.lossyScale). Position and rotation are clear in my opinion, they need to be global (but we’ll see in a moment). But what about the scale? I assume also global, since this instance drawing thing erases the whole original hierarchy. Huh. Okay, question probably answered, but we’ll see.
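For anyone wondering what a TRS matrix actually contains, here is a small sketch in plain Python. It is not Unity's implementation, and `trs` and `transform_point` are made-up helper names. It assumes the usual convention that the matrix scales first, then rotates, then translates, with the rotation given as a unit quaternion (x, y, z, w).

```python
import math


def trs(position, rotation, scale):
    """Build a 4x4 row-major matrix equal to Translate * Rotate * Scale."""
    x, y, z, w = rotation
    sx, sy, sz = scale
    px, py, pz = position
    # Standard rotation matrix of a unit quaternion.
    r = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]
    # Scaling multiplies each column of the rotation; translation is column 4.
    return [
        [r[0][0] * sx, r[0][1] * sy, r[0][2] * sz, px],
        [r[1][0] * sx, r[1][1] * sy, r[1][2] * sz, py],
        [r[2][0] * sx, r[2][1] * sy, r[2][2] * sz, pz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(m, p):
    """Apply the matrix to a 3D point (implicit w = 1)."""
    return tuple(
        m[i][0] * p[0] + m[i][1] * p[1] + m[i][2] * p[2] + m[i][3]
        for i in range(3)
    )


# No rotation, uniform scale 2, moved to (1, 2, 3).
m = trs((1.0, 2.0, 3.0), (0.0, 0.0, 0.0, 1.0), (2.0, 2.0, 2.0))
print(transform_point(m, (1.0, 1.0, 1.0)))

# 90 degrees around the Y axis, no scale, no translation.
h = math.sqrt(0.5)
turn = trs((0.0, 0.0, 0.0), (0.0, h, 0.0, h), (1.0, 1.0, 1.0))
print(transform_point(turn, (1.0, 0.0, 0.0)))
```

Since the instanced draw only ever sees this one matrix per instance, everything the parent hierarchy contributed has to be baked into it, which is why world-space values are the natural choice.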

GameObject collecting code done. Let’s check if it actually works as it should, so that there are no surprises when I write the actual b… oh. Oh no, it will not work correctly, at least not for the trees I’m using. Or for anything else that uses the LOD system. Because that all uses separate child GameObjects with their own MeshFilters and Renderers as the LOD stages. Meaning that in the next step, I will have to start caring about those too. Welp. For now, I’ll remove all the LODs except 0. One step at a time, right?

Let’s start small. 5 trees of the same type, one of those I’m using in my terrain, each with one mesh with two submeshes and one renderer with two materials, one for the trunk submesh and one for the leaves submesh.

Oh, and possibly with a custom shader that will be grossly incompatible with GPU instancing, because why not. If this won’t work, I’ll revert to plain old cubes for a while. Now I’m just testing the mesh collection. I should get one entry in my mesh dictionary, with two materials, and five matrices.

Yeah, I can still do at least the basic stuff without larger issues, good to know.

Let’s get to the interesting part then.

The interesting part, for now ignoring the 1023-instance limit.

Does it work?

OH MY GOD YES IT DOES! Notice the absence of any tree gameobjects in the hierarchy, because they’re all instanced meshes done in one call.

…I think. Wait. Why are there 15 batches and 15 SetPass calls? Let’s look at that after I celebrate for a few more seconds. Oh, actually, of course there are going to be more draw calls, the trees aren’t the only thing, right? Let’s look at what they are, anyway.

Yeah, so there are actually 6 draw calls for all the trees, but it’s for ALL of them (I’ll prove that in a moment). It’s all of the Draw Mesh (instanced) entries. The first two up there in RenderDeferred.GBuffer are for trunk and leaves. The other two pairs down there are… well… for shadows. To be honest, I’m not sure why there are four, I’d expect only two: one for trunk and one for leaves again, but there are two for trunk and two for leaves.

Unity says …oh, wait, yes, I know, of course. I have two shadow cascades enabled, so of course it does one draw per each shadow cascade! (You can see the shadows in game view, but not in scene view because I have lighting turned off in there).
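The count adds up once it is written down. A tiny sketch of the arithmetic, in Python, where `tree_draw_calls` is a made-up helper and the assumption is one draw per submesh for each pass that renders the trees (one GBuffer pass plus one pass per shadow cascade):

```python
def tree_draw_calls(submeshes, gbuffer_passes=1, shadow_cascades=2):
    """Draws needed when every pass draws every submesh exactly once."""
    return submeshes * (gbuffer_passes + shadow_cascades)


# Trunk + leaves, one GBuffer pass, two shadow cascades.
print(tree_draw_calls(2))
```

With these numbers the result is 6, matching the two GBuffer draws and four shadow draws seen in the frame debugger.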

Wait… but that also means… that in my current fullsize terrain… if I turn the shadow cascades off, I will reduce the rendering weight of the trees by two thirds…? I’ve got to check, we’ll continue with instancing in a moment.

Huh… No, not really. Okay. Obviously there’s lots I still don’t understand. And yeah, those numbers don’t seem to make much sense, that’s because they’re not from exactly the same place. But I’m also not right, because if I were, the difference would be so visible that the precise spot wouldn’t matter. Never mind.

Back to the main program, instancing. Let’s test it some more. Let’s actually stress-test it, even. Next up, 640 trees:

Ladies and gentlemen, six hundred and forty trees in glorious tw… wait, what? FOUR draw calls? Why? Unity is actually splitting that one call to Graphics.DrawMeshInstanced into two GPU draw calls?

…why? I mean, it’s still awesome performance, but… WHY?

It is true that by default the maxcount for an instanced shader is 500. It is also true that it is due to the fact that we have objectToWorld and worldToObject matrices in a single constant buffer, with the 64KB limitation in size.
https://forum.unity.com/threads/understanding-instancing-and-drawmeshinstanced.445995/#post-2885177
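The numbers in that quote can be checked with a bit of arithmetic. This Python sketch assumes 4x4 matrices of 32-bit floats and takes the 500 figure from the quote as given; `gpu_draws` is a made-up helper that assumes a call is simply split into ceiling(instances / 500) GPU draws per submesh.

```python
import math

CBUFFER_BYTES = 64 * 1024     # constant buffer size limit from the quote
MATRIX_BYTES = 4 * 4 * 4      # 16 floats, 4 bytes each
MATRICES_PER_INSTANCE = 2     # objectToWorld and worldToObject

# Hard ceiling on instances that fit in one constant buffer.
upper_bound = CBUFFER_BYTES // (MATRIX_BYTES * MATRICES_PER_INSTANCE)
print(upper_bound)

SHADER_MAX = 500              # default per-draw count named in the quote


def gpu_draws(instances, submeshes, per_draw=SHADER_MAX):
    """GPU draws for one pass, assuming a plain split at per_draw instances."""
    return submeshes * math.ceil(instances / per_draw)


print(gpu_draws(640, 2))
```

So 512 instances is the most that could fit, 500 is the rounded default, and 640 trees with two submeshes come out as four draws in the GBuffer pass.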

…oh. Okay. That’s still glorious, though. But… does that also mean that I don’t have to manually split it into several draw calls at the max instance boundary? Let’s try it out! 2160 trees!

Incorrect. I still have to maintain this boundary on my own. Ok, let’s implement that next.

Luckily, ArraySegment exists, which means I won’t have to shuffle the values around, I can just move a 1023-cell-long “view” over the array. Hopefully it works the way I assume it works, because if it works differently, then I don’t see the point of its existence…

…waaait, what? This dude is bullshit! It doesn’t… it doesn’t do what I thought, and it doesn’t even copy that array segment! It… what the hell, it’s even more pointless than I thought, it just wraps the “this array, start offset, count” info, and nothing else! And the draw call wants a plain array of matrices anyway. That’s… utterly stupid. Okay, whatever, let’s not do this, since we’ll switch to DrawMeshInstancedIndirect anyway…

Correction, we’ll probably need to do this, even with InstancedIndirect… Because this is a thing defined in shaders, and I’d like to avoid mucking around with the shaders for as long as possible, and possibly even forever.

Ok, let’s do it differently. Our instanceTransforms won’t be an array of matrices, it’ll be an array of 1023-long arrays of matrices.
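The splitting idea, sketched in plain Python. `split_into_batches` and `MAX_PER_CALL` are hypothetical names, and integers stand in for the matrices. Every batch is a separate list, so each one could be handed to its own draw call without any index juggling at draw time.

```python
MAX_PER_CALL = 1023  # instance limit of a single non-indirect call


def split_into_batches(items, size=MAX_PER_CALL):
    """Cut a flat list into consecutive lists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


batches = split_into_batches(list(range(2160)))
print([len(b) for b in batches])
```

For the 2160-tree test this gives two full batches and one partial batch, so three calls per submesh.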

Ok, the collecting part code is getting a bit mindbendy now, and… oh, I broke something. The trees are not being deleted from the scene now, and …uh. Give me a moment.

(To the tune of Jingle Bells: “Stepping through the code, with one hand on F10, hoping we will find the bug and everything will be great! Bughunting, bughunting, favorite hobby of ours…”)

Ah, yeah, as I said, the collection code is getting a bit mindfucky, with me using a Dictionary<Mesh, List<List<Transform>>>… Index troubles, nothing else, but very hard to spot in code like that.
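One way to keep that kind of nested bookkeeping readable is to hide the "is the last batch full?" check in a single function. This is a Python sketch of the idea, not the project's code; `add_instance` and `batches_by_mesh` are hypothetical names, a string stands in for the mesh and integers for the transforms.

```python
from collections import defaultdict

MAX_PER_BATCH = 1023


def add_instance(batches_by_mesh, mesh, transform, limit=MAX_PER_BATCH):
    """Append to the mesh's last batch, opening a new batch when it is full."""
    batches = batches_by_mesh[mesh]
    if not batches or len(batches[-1]) >= limit:
        batches.append([])
    batches[-1].append(transform)


batches_by_mesh = defaultdict(list)
for i in range(2160):
    add_instance(batches_by_mesh, "Tree01", i)

print([len(b) for b in batches_by_mesh["Tree01"]])
```

The caller never touches an inner index, so the only place an off-by-one can hide is the single comparison against the limit.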

Do you know what’s funny, though? In this testing scene, Unity has no issues with batching all the trees correctly, the same way as my code is doing it. Except it gives the bonus of collision boxes. Well, that makes my “achievement” a bit more underwhelming, doesn’t it? :)

Okay, what’s next? Let’s test with putting some more different tree models in there.

Two is enough, I’m getting tired and irritated. Keep in mind, all of them are LOD0, no distance or occlusion culling. 9.5 Million triangles. Not bad, I would say. Notice especially that rendering is taking slightly below a millisecond.

Also, yes, under these conditions, this is slightly less performant than all of them just being game objects. Under these specific conditions, unity can batch them better than I do.

…so it’s actually placing them using the terrain tools, that tanks it.

This is so strange. That terrain tool is so… horrible!

But! Let’s not let that detract from the fact that I’ve gotten through the hardest, first step in manual GPU instancing, which is great!

Tomorrow we’ll do some more performance tests. We’ll try mass-placing gameobject trees onto the terrain, to see what happens. And then we’ll try mass-placing our manually instanced ones. And then we’ll know where to go from there.

Written by Miroslav Martinovič