How Video Cards Render What You See


Posting because I found it to be a good read. :D

 

Since some people like to argue about which features and visual effects affect performance, whether or not feature X is possible on card Y, or what DirectX version Z can do, without necessarily knowing much about the actual rendering process, and since (many?) others seem interested in a general understanding of what determines overall performance, I thought I'd try to make sure these discussions are at least based in reality. If people know some of the facts involved, they might not resort to namecalling and flamewars so quickly. (A naive hope, but hey, worth a try.)

It would be fun if we could one day have these discussions without getting the thread locked.

 

I'll try to describe what goes on as the game renders what you see, and (to serve the latter group) point out how each step affects performance and which part of your computer becomes the bottleneck there, to give you an idea of which parts you should upgrade if you run into performance problems in a game.

WARNING: This is going to be a long read. Don't read it if you don't have a lot of time to kill...

 

Now, so you know how seriously (not) to take me, a few words on my qualifications.

I am not a professional game developer.

I have not shipped any games, commercial or otherwise. So I can't tell you everything about how "real" games work inside.

I am a Computer Science student. Got my bachelor's degree a year or so ago, aiming for a master's at the moment.

I have implemented many graphical effects in an engine, including fragment (pixel) shaders for most common effects as well as a few more advanced ones like bloom lighting, and vertex shaders (for animating clouds or grass waving in the wind, as well as skeleton-based animation with blending between multiple animations and scheduling of transitions between them).

I have written a software renderer (which means I had to code everything myself, instead of just relying on the GPU to do all the hard work)

 

So I'm not a graphics guru, but I know much of what goes on inside a game engine. With that out of the way, you should know not to take my word as gospel, but also that I have *some* experience, and am not just making this up. So correct me if you think you know better than me.

I know I'm not the only programmer here, and I'm sure some people will be able to add, clarify or correct some of the following, and I hope they do. At worst, it means more correct information for you, and at best, it means I'll learn something too.

And of course, ask questions if there's anything you're curious about, or that I didn't explain properly.

 

With that out of the way, let's get started:

Preparing for rendering

Resource loading:

When you load a level in a game, a lot of resources have to be read from the hard drive and stored in RAM. This includes geometry (meshes), textures, shaders, and sounds.

Depending on the game, only part of the level might be loaded when you start, and other parts loaded on the fly while the game is running. (Look at Oblivion. The installation is what, 3-4GB? The game is one huge level. If they had to load everything at once, it'd 1) take tens of minutes, and 2) use a huge amount of RAM.)

 

Still, the data that is going to be needed has to be loaded sooner or later.

This is responsible for the often long loading times, and also for the stuttering that may occur when you move between different areas in the game. (Oblivion divides the world into rectangular cells, and it's pretty easy to see when you leave one cell: suddenly everything in front of you becomes more detailed, and the game might stutter for a moment. That is one, rather simple, way to do loading on the fly. You can even run along these cell boundaries (which are just straight lines) and watch the effect as you repeatedly cross over them.)

 

So, this only accounts for the occasional *loading* time, which is usually easy to recognize. Performance implications: The hard drive is what matters here. A faster hard drive will do a lot to decrease loading time. What may be more surprising is that often, a RAID-0 setup will *not* make a significant difference here: games often load resources that are not located near each other, which means the drive's seek time becomes important, and the actual transfer speed less so. Also, sufficient RAM is vital. If you don't have enough RAM, some of the loaded data ends up in the pagefile, which results in more hard drive thrashing and lousy performance.

 

Initializing data in memory

A second part of this is to actually structure the data. Sure, we might have loaded the "table" mesh and the "evil terrorist with turban" textures, but we also need to decide how many of these should be rendered, and where and when.

 

This information is often kept in some kind of scene graph. That is, we build a big tree structure, where the root is the scene itself, and then add child nodes for each object in the world, keeping track of object data like the position, as well as which texture and shader(s) should be used. Some objects might "belong" to other objects (such as the AK-47 the terrorist is carrying: we want to indicate that this object should follow the guy when he moves), so we make that a child of the relevant object. This way, we can fit the entire world into one big tree, where each branch indicates "objects that belong to the one we branched off from". It also helps us keep track of how many instances of each model must be rendered. (There might be dozens of cars on the level, so there are dozens of nodes in this tree, each storing different world coordinates, but pointing to the same mesh.)
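To make that concrete, here's a rough sketch of what a scene graph node might look like in C++ (all names are made up; a real engine stores much more per node, like bounding volumes and render state):

```cpp
#include <vector>

// A hypothetical, stripped-down scene graph node: a tree where each node
// carries a transform relative to its parent, plus references into the
// pools of loaded resources.
struct SceneNode {
    float position[3];                 // position relative to the parent node
    float rotation[4];                 // orientation, as a quaternion
    int meshId = -1;                   // index into the loaded mesh pool (-1 = none)
    int textureId = -1;                // index into the loaded texture pool
    std::vector<SceneNode*> children;  // objects that "belong" to this one

    void addChild(SceneNode* child) { children.push_back(child); }
};

// The AK-47 is simply a child of the soldier node, so moving the soldier
// automatically moves the gun along with him, and dozens of car nodes can
// all point at the same meshId while storing different positions.
```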

 

So now we know how to render (we have the necessary meshes, shaders, textures and everything else we need), and we know where to render as well (our scene graph contains information about which objects we have on the level, and where they are)

 

Now, our imaginary game has got off the ground, it's finished loading, and the first frame has to be rendered.

 

Rendering a frame

If our level is really small and simple (like most tech demos, which contain only a few dozen objects), we can just run through the entire scene graph and ask the GPU to render each object. The GPU will happily render each object, pixel by pixel, and its depth buffer makes sure that whenever two objects overlap, the nearer one ends up on top, no matter which order we draw them in. So this works fine, and gives us the expected image, even though we're actually asking the GPU to render a number of objects that aren't visible (for example, the ones that are behind the camera, or ones that are hidden behind other objects). It's safe to do so, the GPU will make sure they're removed properly, but it still renders them, which takes time.
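Building on the SceneNode sketch above, this naive approach is literally just a recursive walk over the tree (drawMesh() here is a hypothetical stand-in for whatever draw call the graphics API provides):

```cpp
void drawMesh(int meshId, int textureId);  // hypothetical API draw call

// Naive rendering: submit every object in the scene graph and let the
// GPU's depth buffer sort out which pixels actually end up visible.
void renderNaive(const SceneNode& node) {
    if (node.meshId != -1)
        drawMesh(node.meshId, node.textureId);
    for (const SceneNode* child : node.children)
        renderNaive(*child);  // attached objects follow their parent
}
```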

 

Space partitioning, visibility graphs

On larger scenes, this doesn't really work. (Think of Oblivion: how many tens of thousands of objects are there in the game world? I don't know, but more than a few. Or even something as primitive as a level in Doom (1 or 2): just try to count the number of barrels + monsters + doors + weapons + ammo + health kits. Now compare that number of objects to how many are actually visible on the screen (let's say 4 monsters, if it's a crowded area, and maybe a handful of barrels). And this is for a game that had to run on 15-year-old computers!) Obviously, today's games have vastly bigger levels with far more objects in them. In other words, rendering everything in the scene graph is horribly inefficient for anything bigger than a tech demo.

 

Doom's big innovation (the one single feature that made it possible to run the game at all back then) was to partition the level into smaller chunks, based on what can be seen at each point in the level. The actual way this is done might get a bit hairy, but the idea is that for any position on the map, you can easily compute which partitions are guaranteed *not* visible. (Imagine you have a big wall intersecting the level, with only a small doorway in it. It's fairly obvious that unless you're looking at the door, you can never see anything on the other side of the wall. So let's save that information. Then every frame we just check: is the player looking at the door? If yes, render everything. If no, render only the half of the level that we're in. And of course, this can be done in a much more fine-grained way; this example is just to show the general idea.)
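In code, that door check might look something like this (a minimal sketch with invented names; real engines generalize the idea with portals or a precomputed PVS, a "potentially visible set"):

```cpp
struct AABB { float min[3], max[3]; };  // axis-aligned bounding box
struct Camera { /* position, orientation, field of view... */ };

// Assumed to exist elsewhere in our hypothetical engine:
bool frustumContains(const Camera& cam, const AABB& box);  // rough view test
void renderPartition(int partitionId);                     // draw everything in it

// The wall-with-a-doorway example, hard-coded for two halves of a level.
void renderLevel(const Camera& cam, int playerHalf, int farHalf,
                 const AABB& doorway) {
    renderPartition(playerHalf);          // our own half is always drawn
    if (frustumContains(cam, doorway))    // looking at the door?
        renderPartition(farHalf);         // only then can the far side show up
}
```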

And as the above example also shows, this method is a lot easier if we have lots of walls obscuring the view. Indoor areas are wonderful to render because you can never see much more than a single room. And most importantly, since the terrain and walls are static and never move around, we can do these visibility calculations *in advance*, when the map is generated. So while rendering, we don't have to generate all this data; we just have to read it from the pre-generated data structure.

 

Outdoor areas are a lot tougher, and can't be partitioned as easily. There are still methods for it, and they work OK, but they're not as good as the ones for indoor areas. (Which is one reason why games that take place indoors tend to have more detailed scenes: they can be rendered much more efficiently, so a higher detail level is possible.) The same goes for destructible terrain. That screws with our space partitioning, making more things visible than we'd like, so it's a pain to do efficiently as well, and hardly any games have even attempted it.

 

Using space partitioning, we can discard most of the scene *before* involving the GPU. We will still end up rendering some hidden objects, but at least we've gotten rid of a lot of them. This is done on the CPU, which isn't particularly fast (compared to the GPU), so we only do this if we're sure it's worth it. It almost always is, though, so it's almost always done.

 

CPU culling

Another, less universal, trick is to perform another pass with the CPU, looking at each individual object and testing whether *any* part of it intersects the viewing volume (the part of the world that you can see). This is done very roughly (because it has to be fast, to avoid hogging expensive CPU time), using some kind of bounding box or sphere to represent the model. Imagine a huge box around an enemy character: assuming the character keeps all limbs inside the box, we can simply check whether the box comes near the viewing volume. If it does, we *might* be able to see part of the character, so we keep the entire character for rendering. If no part of the box is anywhere near the viewing volume, we can safely throw away the entire model. This further eliminates objects that aren't visible, but at the cost of more CPU time, which is why it isn't used as often; in some games, performing this extra pass would only lower performance.

Performance implications: All this is done on the CPU. We're doing a lot of work on the CPU to avoid having to do even more work on the GPU (every object that is sent to the GPU has to be rendered, whether or not it's actually visible). So a slow CPU will give us problems here. On the other hand, the GPU doesn't even come into the picture. This part would be unaffected if you were running a Voodoo 2 card, because the GPU isn't used.
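The classic bounding-sphere version of this test looks roughly like the sketch below (types and names are mine, not from any particular engine). The viewing volume is described by six inward-facing planes, and a model gets culled only if its sphere lies entirely behind one of them:

```cpp
struct Vec3 { float x, y, z; };
struct Plane { Vec3 normal; float d; };  // plane equation: dot(normal, p) + d = 0

// Signed distance from point to plane (positive = on the inside).
float distance(const Plane& pl, const Vec3& p) {
    return pl.normal.x * p.x + pl.normal.y * p.y + pl.normal.z * p.z + pl.d;
}

// If the sphere is entirely behind any one of the six frustum planes,
// no part of the model can be visible, so we never send it to the GPU.
bool sphereInFrustum(const Plane frustum[6], const Vec3& center, float radius) {
    for (int i = 0; i < 6; ++i)
        if (distance(frustum[i], center) < -radius)
            return false;  // fully outside this plane: cull the whole model
    return true;           // possibly visible: keep it for rendering
}
```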

 

Now we've gotten rid of a lot (but not all) of the objects that fall outside the currently viewed area of the scene, but the CPU isn't done yet.

 

Preparing data for the GPU

We still need to perform a lot of other processing to prepare everything for the GPU.

We need to move and rotate every object to their current positions. (Or more specifically, we have to compute how much the object should be moved and rotated. Actually doing this to every single vertex in the mesh is done later, by the GPU)

And we need to do at least part of the animation. We might have a character model, which has a skeleton of, say, 100 bones (this is used to simplify the animation. Instead of the animator having to move every single vertex that makes up the arm, he can just move the arm bone, and the actual mesh will follow). But this means that for each bone in each character, we have to compute how the attached vertices should be moved.

This in itself might be a major operation. Think Total War, where you might have 3000 soldiers on screen, each with maybe 15 bones (they have to use such simple skeletons to get any kind of decent performance). That means around 45,000 bone transformations that have to be computed every frame. And some of these bones depend on the transformation of others (your hand should follow if the arm moves, *in addition* to the hand's own animation).

And of course, we also want the animations to be smooth, which means we might have to blend multiple animations (the character is swinging a sword while running, so we have to mix the run and attack animations). That means every bone has *two* positions it wants to go to, which is already twice as much work, *plus* the extra work of actually blending them (we may not just want an average: if I'm just starting to run, we only want the run animation to have a little influence, while we mostly use the 'stand still' animation).

And because it looks like crap if we instantly switch animations, we want to schedule it too. (Don't switch from run to walk while my feet are in the air. Wait for them to hit the ground before gradually fading into the walk animation).
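A rough sketch of what blending one bone between two animations can look like (this is weighted nlerp; a real engine would also handle the quaternion sign flip when the two rotations lie in opposite hemispheres, and would blend whole arrays of bones at once):

```cpp
#include <cmath>

// One bone's contribution to a pose: a translation plus a rotation
// (stored as a quaternion). Types and names invented for illustration.
struct BonePose {
    float px, py, pz;      // translation
    float qx, qy, qz, qw;  // rotation quaternion
};

// Blend one bone between two animations, e.g. "run" and "attack".
// weight = 0 gives pure run, weight = 1 gives pure attack; the game ramps
// weight up over a fraction of a second so the transition fades in
// instead of snapping.
BonePose blendBone(const BonePose& run, const BonePose& attack, float weight) {
    auto lerp = [weight](float a, float b) { return a + (b - a) * weight; };
    BonePose out;
    out.px = lerp(run.px, attack.px);
    out.py = lerp(run.py, attack.py);
    out.pz = lerp(run.pz, attack.pz);
    out.qx = lerp(run.qx, attack.qx);  // nlerp: lerp the quaternion...
    out.qy = lerp(run.qy, attack.qy);
    out.qz = lerp(run.qz, attack.qz);
    out.qw = lerp(run.qw, attack.qw);
    // ...then renormalize so it's a valid rotation again
    float len = std::sqrt(out.qx*out.qx + out.qy*out.qy +
                          out.qz*out.qz + out.qw*out.qw);
    out.qx /= len; out.qy /= len; out.qz /= len; out.qw /= len;
    return out;
}
```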

All this has to be done by the CPU, which *then* passes everything to the GPU's vertex shader, which, for every single vertex, in every single (animated) model, has to apply the computed transformations.

All in all, animations are a ton of work, for both the CPU and GPU.

Performance implications: Mainly the CPU suffers here. The GPU also has to do a ton of work in the vertex shaders, but 1) these are pretty fast to begin with, and 2) the GPU is usually held back by the fragment (pixel) shaders, which means the vertex shaders may be able to afford this extra work (if they're just waiting for the fragment shaders to catch up anyway, they might as well do something useful in the meantime)

 

Finally, we're moving to the GPU side.

 

Vertex Shader

The vertex shader moves all the vertices into position, according to the transformation data computed by the CPU above.
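Written out in C++ for illustration (on real hardware this runs as a GLSL or HLSL shader, executed for thousands of vertices in parallel), the core of it is one matrix-vector multiply per vertex:

```cpp
struct Vec4 { float v[4]; };
struct Mat4 { float m[4][4]; };  // row-major 4x4 matrix

// Matrix * vector: the heart of every vertex shader.
Vec4 transform(const Mat4& M, const Vec4& p) {
    Vec4 out{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out.v[row] += M.m[row][col] * p.v[col];
    return out;
}

// The CPU computed modelViewProjection once per object; the GPU applies
// it to every single vertex, taking it from object space to screen space.
Vec4 vertexShader(const Mat4& modelViewProjection, const Vec4& objectSpacePos) {
    return transform(modelViewProjection, objectSpacePos);
}
```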

 

GPU culling

Afterwards, the GPU can perform yet another pass of culling. We now know where each polygon has ended up, which means we can discard individual polygons if they're outside the viewing area.

 

Fragment/pixel Shader

Finally, the fragment (pixel) shader kicks into gear. This has to figure out the color of every pixel on the screen. For each polygon, it runs through every pixel (or fragment, technically speaking), and computes a color based on whatever information we're interested in (for example, the distance and angle to a light source, plus the color of the texture at this location, and any "default" color of the polygon).
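As a concrete example, the simplest lighting model, diffuse (Lambert) shading, boils down to a handful of multiplications per fragment. Here it is in C++ for illustration, reusing the Vec3 type from the culling sketch above (on the GPU this would be a GLSL or HLSL shader):

```cpp
#include <algorithm>

struct Color { float r, g, b; };

// Runs (conceptually) once per fragment. normal and lightDir are assumed
// to be unit vectors; texel is the color sampled from the texture.
Color shadeFragment(const Color& texel, const Vec3& normal,
                    const Vec3& lightDir, float lightIntensity) {
    // Lambert's law: surfaces facing the light are bright; surfaces
    // facing away get nothing, hence the clamp to zero.
    float ndotl = normal.x * lightDir.x + normal.y * lightDir.y
                + normal.z * lightDir.z;
    float diffuse = std::max(0.0f, ndotl) * lightIntensity;
    return { texel.r * diffuse, texel.g * diffuse, texel.b * diffuse };
}
```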

When we do this, we might decide to output the result directly to the screen, or we might want to save it to a texture. If we do the latter, we can then run the process again (maybe from a different viewing angle and/or using a different shader), and use the previously generated texture. This trick may be used for all of the following:

- Generating shadows (render the scene from the point of view of the light source, to figure out what the light can "see", and then in the following pass, use this info to generate shadows)

- Post-processing (bloom lighting, motion blur and such: render the scene to a texture, then render that texture using another fragment shader to generate the bloomy highlights, or blend it with the textures generated from the last 3 frames to make motion blur; a rough bloom sketch follows after this list)

- In-game cameras (HL2 has cameras placed around the world, which render the scene from their point of view, and then put the result onto the ingame monitors)
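Here's roughly how the bloom case might be structured (every name here is a hypothetical stand-in for real render-to-texture API calls, not any particular engine's interface):

```cpp
struct Texture { /* handle to a GPU texture */ };

// Hypothetical stand-ins for actual graphics API / engine calls:
Texture renderSceneToTexture();               // pass 1: draw the frame off-screen
Texture extractBrightPixels(const Texture&);  // pass 2: keep only the highlights
Texture gaussianBlur(const Texture&);         // pass 3: smear them out
void drawFullscreen(const Texture&);
void drawFullscreenAdditive(const Texture&);

void renderFrameWithBloom() {
    Texture scene  = renderSceneToTexture();
    Texture bright = extractBrightPixels(scene);
    Texture glow   = gaussianBlur(bright);
    drawFullscreen(scene);         // show the normal frame...
    drawFullscreenAdditive(glow);  // ...with the glow layered on top
}
```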

Performance implications: Pain... This can easily bring any GPU to its knees. At 1600x1200, we have nearly two million pixels on screen. Each pixel might be rendered a dozen times (because some polygons overlap), some may not even lie inside the visible area at all but still have to be rendered, and *then* we might decide to start all over on a second pass to generate shadows or motion blur! And we want to do this 60 times per second! (As a rough back-of-the-envelope figure: 1,920,000 pixels x 12 x 60 is on the order of 1.4 billion fragment shader runs per second.)

Resolution plays a big role here: lower resolution means each polygon consists of fewer pixels, so we don't have to run the shader as many times. (This is also a good way to test whether your performance is bottlenecked by the fragment shaders. Lower the resolution and see if it makes a difference. If it does, fragment shaders are the problem. If it makes no difference, you're held back by the CPU and/or the vertex shaders in the previous steps instead.)

 

So now we've rendered *one* frame. And we can start all over again.

 

In between frames

Before we render the next frame, the character has probably moved, which means different objects might be visible, which means we have to run through our space partitioning tree to find out which partitions are visible, then do all the culling on CPU and GPU, and generate all the movement/rotation info again, and the animations and everything.

 

Usually, most of these are smallish changes (we can still see *mostly* the same objects as before, so we don't have to rebuild our render graph; we can just take out a few objects, add a few new ones, and reuse the rest).

 

But if something significant changes (you're teleported to the other end of the level, or even an apparently small thing like turning around to face a doorway, which means entire new areas might become *potentially* visible), we have to make a *lot* of changes and process a ton of new objects. So initially, we have to spend time making large changes to the render graph, and afterwards, we might have a much bigger graph to consider (because more objects are now potentially visible, and have to be at least considered during culling).

 

So moving around may have a large impact on performance in a game. Not so much because of the movement itself, but because it might change which objects have to be considered during rendering, and which can be ignored completely. This is easily noticeable if you walk out of a building. In most games, the framerate will dip noticeably because suddenly we can no longer just ignore huge parts of the scene. You might see a short freeze (as the game rebuilds the render graph and shuffles around data in general to get everything organized, and maybe even has to load a few things off the hard drive), and after that, framerates will be lower than before because we now have to process more objects every frame.

 

So there you have it. A vastly simplified game engine (or at least the rendering part of it; I haven't even touched on sound, AI, the actual gameplay code and so on).

 

The point of this was primarily to show that moving around in large scenes *may* have a big impact on performance, and that small scenes are never representative of performance (because, as you might have noticed if you read the above, most of the work is centered around trying to throw objects away, which doesn't really matter, and is easier to do, in a small scene).

We also don't have the actual game code, which takes up most of the CPU time and may cause bottlenecks that didn't occur in our test scene with 4 cars and a dozen soldiers in a 5x5m area.

 

No matter how many amazing graphical effects you cram into such a scene (HDR, physics-based animation, soft shadows, radiosity lighting and so on), it is still in no way comparable to a real game. It might run at 500 frames per second and *still* be too slow for use in actual games. Or it might run at 80 frames per second and actually be acceptable for a game (because it stresses different parts of the hardware than the rest of the game does).

It's just not a valid benchmark, and can't be used to show that "this effect is possible in games on this card using this version of DirectX/OpenGL".

 

3DMark suffers from the same thing to a certain extent. They use bigger, more detailed scenes to try to simulate games more accurately, but they're still missing a fairly essential part... the game. With no game code running in the background, their results become inherently skewed: a lower 3DMark score might translate into better game performance, and vice versa.

But at least they make an effort to simulate game performance. Tech demos like the ones Nvidia, ATI and the DirectX team make don't even try to do that. They're all about showing off eye candy, and *not* about comparing performance. (Nor do they claim to be valid performance indicators.)

 

Hope someone will find this useful and/or interesting...

 

Cheers,

 

Updates: (As people comment, point out weak explanations or ask questions, I'll add stuff here, to make it easier to find)

- .pak files

A lot of games use a few huuuuge files to store all their data (usually with the extension .pak or .wad or .gcf (isn't that what Steam uses?), and some even just use regular .zip files).

The purpose of this is mainly to speed up loading. Instead of having to pick through 300 small files, each containing a single sound effect, mesh or texture, it's all bundled into one huge archive which serves as a virtual file system. That is, inside the file, they store a bunch of files, complete with folder paths and everything. This is mainly to avoid fragmentation. If the game has one big 3GB file, any defragger will try to place it contiguously, so that anything we read from the file comes from the same region on the hard drive. If we had 4000 individual files of sizes from 4KB to 20MB, they'd easily get scattered all over the disk, and a defragger would only make sure that each individual file is not fragmented, but wouldn't care much about whether all 4000 files are located near each other. (mujtaba, is that what you meant? Otherwise just gimme a yell with a correction)
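In code, the virtual-file-system idea might look something like this (format details and names invented for illustration; real .pak/.wad/.gcf formats all differ):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// One big archive file containing a directory that maps virtual file
// paths to (offset, size) pairs, so "loading a file" becomes one seek
// plus one read inside a single, hopefully unfragmented, archive.
struct PakEntry { std::uint64_t offset; std::uint64_t size; };

class PakFile {
    std::ifstream archive;                                // the one big file
    std::unordered_map<std::string, PakEntry> directory;  // parsed from its header
public:
    // (Opening the archive and parsing the directory omitted for brevity.)
    std::vector<char> read(const std::string& virtualPath) {
        const PakEntry& e = directory.at(virtualPath);    // e.g. "sounds/reload.wav"
        std::vector<char> data(e.size);
        archive.seekg(static_cast<std::streamoff>(e.offset));  // one seek...
        archive.read(data.data(), static_cast<std::streamsize>(e.size));  // ...one read
        return data;
    }
};
```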

 

- BSP trees are generated off-line, before the game is started. That's the clever thing Doom did (Wolfenstein didn't, as far as I know, so it had to settle for far more limited graphics). When you generate the level (in a level editor), one of the things it produces is a BSP tree containing all this visibility information, telling the game which areas are visible depending on where the player's camera is.

Games generally still do this. If you look at the output from Source's Hammer level editor, you'll see that one of the things it does when compiling the level is... building BSP trees. And that means games are still stuck with the limitation that the actual map geometry has to be static. You can't destroy walls or make craters in the ground. It's the most efficient way there is to render a big 3D map, but you have to live with your geometry being static. If you want everything to be destructible, this optimization becomes useless, and you have to deal with lower performance. But BSP trees are the only reason Doom was possible so long ago. They're also the reason why FPS games tend to have better graphics than most other genres: they tend to take place indoors, which means we can partition the scene very efficiently and only have to render a tiny area around the player.
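For the curious, the core of a BSP walk fits in a dozen lines (a sketch reusing the Plane/Vec3 types and distance() function from the culling example earlier; each node's plane splits space in two, and recursing into the camera's side first visits the level front to back):

```cpp
struct BspNode {
    Plane splitter;            // the plane this node divides space with
    BspNode* front = nullptr;  // subtree in front of the plane
    BspNode* back  = nullptr;  // subtree behind it
    // ...plus the polygons lying on the splitter itself...
};

// Visit nearer geometry first; with a depth buffer (or Doom-style column
// occlusion tracking) this minimizes how much gets drawn and overdrawn.
void drawFrontToBack(const BspNode* node, const Vec3& cameraPos) {
    if (!node) return;
    bool cameraInFront = distance(node->splitter, cameraPos) >= 0.0f;
    drawFrontToBack(cameraInFront ? node->front : node->back, cameraPos);
    // ...draw this node's own polygons here...
    drawFrontToBack(cameraInFront ? node->back : node->front, cameraPos);
}
```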


It's quite a bit to chew on, but it grounds in reality what's actually happening when you're throwing your hands up in the air over choppy, stuttering gameplay.

 

To summarize, the CPU handles objects, movement, sprite locations, and anything that has to deal with what and where things are. The CPU passes this data down to the GPU, which draws out these lines, vertices, and whatnot where they should be, then "fleshes" them out with preloaded textures, THEN applies effects like shadows, bloom, lighting, etc. This happens at LEAST 50 times PER second... so in reality, it's quite a toll on the hardware to render a scene in a computer game. Makes you wonder how amazingly quickly technology has come along. Pretty crazy, really.

 

The information about packages really just means that the files (textures, sound clips, etc.) are broken up into accessible chunks that make it easier for the main program to extract and execute when needed. That's why we have those lovely archive files that are 1GB+ in size, chock full of data for one or more stages or types of data for the game.

 

I hope this opens up the eyes of any non-technical person who wonders why their games run slowly... and why we pay big bucks to update our video components. :o

 

Source
