Real-Time-3D Game-Engine Taxonomy

Introduction

Many computer games create virtual worlds, and the more recent ones can create and display remarkably detailed ones. Most interestingly, many of them can display virtual worlds in real time; that is, they display these worlds with essentially instant response to inputs and with automatic updating. This distinguishes them from typical 3D-modeling software, which generally offers only preview modes in real time, and from games such as Myst, which use entirely prerendered scenes. So I will be focusing on such game engines.

In particular, I will be focusing on game engines that do real-time 3D rendering; this is rendering that uses three-dimensional geometrical information. However, I will be discussing 2D-geometry game engines as appropriate, and as I will note, many 3D engines retain various 2D features.

I'd originally considered this question when working out which game engines it would be worth porting maps to from Bungie's Marathon series; since then, I've decided to consider this question more carefully, and the result is this taxonomy.

Overall Environment

The highest-level classification is overall environment; this is the overall type of geometry, and it determines much of the rest of the world geometry.

These names describe what sort of scenes the engines are best adapted to. Indoor engines have floors and walls and ceilings, outdoor engines have essentially one big floor, and outer-space engines have no boundary surfaces. However, these are not absolutely fixed distinctions, since one type of engine can have features of another. In particular, indoor and outdoor engines can have entities that fly or swim (or both!), making their physics much like that of entities in outer-space games. Also, indoor engines can do outdoor scenes by making some of their surfaces look like distant landscapes, while outdoor engines can do indoor scenes with appropriately-placed cliffs. And some recent games appear to have hybrid indoor/outdoor engines with indoor-engine segments added to outdoor-engine ones; "Drakan" and "Outcast", for example.

I will say little about outer-space engines here, since they have very little by way of world geometry, and because the rendering of their inhabitants parallels that of other kinds of game engines.

Outdoor engines, with the exception of flight simulators and the like, generally feature a top-down or a slanted view; these engines are essentially 2D, with only recent ones having some 3D features. Bungie's Myth series was the first to have real 3D features such as terrain elevations and 3D projectile physics.

Indoor engines are the ones with the most advanced 3D rendering, and it is these that I will discuss in the most detail. Since this is a fairly big field, I will subdivide it into generations, each with characteristic rendering features.

Generation Zero

I'm including this generation for completeness; it is the sidescroller, a kind of 2D engine where the view direction is horizontal. All the inhabitants are sprites (2D pictures), and landscapes may be scrolled at some fractional speed to create the appearance of perspective. Their outdoor-engine counterparts are 2D engines with top-down or slanted views.
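The fractional-speed scrolling mentioned above can be sketched as follows. This is only an illustration: the layer names and scroll factors are made up for the example, and a real sidescroller would also have to handle wrapping and drawing.

```python
# Hypothetical layers: a factor of 0.0 never moves (very distant),
# 1.0 moves in lockstep with the camera (the playfield itself).
LAYERS = {"sky": 0.0, "mountains": 0.25, "trees": 0.5, "playfield": 1.0}


def scroll_positions(camera_x):
    """Return how far each layer has scrolled for a given camera position.

    Layers with smaller factors scroll more slowly, which makes them
    look farther away.
    """
    return {name: camera_x * factor for name, factor in LAYERS.items()}


if __name__ == "__main__":
    for name, shift in scroll_positions(100.0).items():
        print(f"{name:10s} scrolled by {shift:6.1f} pixels")
```

With the camera 100 pixels into the level, the mountains have moved only 25 pixels and the trees 50, while the playfield has moved the full 100.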

First Generation

These were the first of the indoor-engine real-time-3D games to appear; they include such notable examples as id's Wolfenstein 3D and Bungie's Pathways into Darkness in the early 1990's. They share these features:

Second Generation

This generation started to appear in the mid-1990's; it includes id's Doom series, Bungie's Marathon series, LucasArts's Dark Forces, 3D Realms's Duke Nukem, etc. They generally share these features:

The first one happens because several of the examples (Doom, Dark Forces, Duke Nukem) have their horizontal geometry specified by Binary Space Partitions (BSP's), which do not allow stacked floors. However, stacked floors can be produced by dividing the level into submaps, some of whose surfaces act as portals to other submaps. Marathon is an exceptional case, because it does not use BSP's, thus allowing stacked floors to be created in a very natural manner. However, each Marathon map sector/polygon could be interpreted as a submap with portals comparable to those in the other engines mentioned.

The last one happens because the view direction stays horizontal, while the view window (what one sees on screen) gets moved up or down; this is only a fake sort of view-direction change. And this is what limits the vertical range, since too high a vertical shift would cause serious distortion. A true vertical-direction change, as in what happens with third-generation engines, would make vertical lines appear to converge at some point, and not stay parallel, as they do here.
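The difference between the two kinds of looking up can be sketched numerically. The code below is an illustration under my own assumptions, not any particular engine's code: the camera sits at the origin looking along +z with y pointing up, `focal` is a made-up projection scale, and the sign of `shift` is an arbitrary convention.

```python
import math


def project_sheared(point, focal, shift):
    """Second-generation style: keep the view direction horizontal and
    just slide the projected picture up or down by `shift`."""
    x, y, z = point
    return (focal * x / z, focal * y / z + shift)


def project_pitched(point, focal, pitch):
    """True view-direction change: rotate the camera up by `pitch`
    radians about the x axis, then do the perspective divide."""
    x, y, z = point
    depth = y * math.sin(pitch) + z * math.cos(pitch)   # along the new view direction
    height = y * math.cos(pitch) - z * math.sin(pitch)  # along the new "up"
    return (focal * x / depth, focal * height / depth)


if __name__ == "__main__":
    # Two points on the same vertical line (same x and z, different y).
    bottom, top = (1.0, 0.0, 4.0), (1.0, 2.0, 4.0)
    print("sheared:", project_sheared(bottom, 1.0, 0.3), project_sheared(top, 1.0, 0.3))
    print("pitched:", project_pitched(bottom, 1.0, math.radians(30)),
          project_pitched(top, 1.0, math.radians(30)))
```

In the sheared case both points get the same screen x, so the vertical line stays vertical; in the pitched case the upper point lands closer to the screen center, which is the convergence described above.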

One trend that became noticeable in this generation is the licensing of various game engines to other game companies; Doom's engine has been licensed to create Heretic and Hexen, Duke Nukem's Build engine has been licensed to create Shadow Warrior, Redneck Rampage, etc., and Marathon's engine has been licensed to create ZPC, Damage Inc., and Prime Target. Licensing means that licensees can save on much of the software development needed to create a new game engine, and focus on whatever specific features they may want to offer.

Third Generation

This generation started to appear in 1996, with the release of id's Quake. It has since been followed by Tomb Raider, Quake 2, Unreal, and several others; it is now the dominant sort of indoor game engine. They generally share these features:

As noted, sprites are generally not used for very much; mostly explosion effects, flames, and lens flare. This is partly because 3D models automatically have the correct appearance in all directions; with sprites, one has to create sets of them representing some entity viewed from different directions, and not surprisingly, some of the more recent sprite makers have been known to use 3D-modeling software for that task.

3D-model character animation is more complicated than sprite animation, which simply plays a series of pictures. The most common way of doing it is the Quake approach, which uses a sequence of vertex sets for a single continuous model. An alternative is the Tomb Raider approach, which breaks the models up into segments and animates by moving those segments relative to each other; this is a form of skeletal animation. A hybrid method, now used in games like Half-Life, uses Quake's continuous-skin approach, but with the vertices positioned in Tomb Raider fashion or a generalization of it.
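As a rough sketch of the vertex-set approach, an animation can be stored as a list of frames, each frame being a list of vertex positions for the same model. Blending linearly between two neighboring frames, as done here, is my own assumption for the sake of illustration; an engine could just as well step from one stored frame to the next.

```python
def blend_frames(frame_a, frame_b, t):
    """Blend two vertex sets of the same model.

    frame_a, frame_b: lists of (x, y, z) tuples, same length and order.
    t: 0.0 gives frame_a, 1.0 gives frame_b, values between mix the two.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("both frames must describe the same vertices")
    return [
        tuple(a + (b - a) * t for a, b in zip(va, vb))
        for va, vb in zip(frame_a, frame_b)
    ]


if __name__ == "__main__":
    # A made-up two-vertex "model" with two stored frames.
    stand = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    lean = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
    print(blend_frames(stand, lean, 0.5))
```

The segment-based approach would instead store one rotation per segment per frame and leave the vertices of each segment fixed relative to that segment.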

Licensing has only intensified in this generation, with the Quake family and Unreal currently having several licensees (Hexen 2, Heretic 2, SiN, Half-Life, Klingon Honor Guard, etc.), with more to come. The saving of development effort is clearly substantial here; John Carmack, id's master programmer, has compared writing a new game engine to writing an operating system.

One interesting oddity here is the Tomb Raider engine. Though its view and inhabitants are third-generation, its world geometry is really second-generation, though with sloped floors and ceilings and extensive use of both horizontal and vertical submap portals. As a result, the world geometry approximates the third-generation appearance.

Programming Issues

The progression of generations has been made possible by the ever-increasing quantities of CPU cycles available for rendering each frame; as the quantity increases, the shortcuts of previous generations become less necessary. And in recent years, video cards have become increasingly important as sources of processor cycles for 3D rendering. This is because much of the more common real-time-3D rendering can be done by rendering a lot of texture-mapped triangles, a task that the cards can be specialized for. The result is a big jump in processing power, which makes it possible to do a variety of nice rendering effects in real time, such as smoothed close-up textures, semitransparent surfaces, etc. Though 3D acceleration was at first a novelty, it is now almost universal in recent real-time-3D game releases, and some games are now hardware-acceleration-only.

This has led to some interesting "API wars" over which Application Programming Interface to use in giving rendering instructions to 3D cards. There has been an abundance of these, though the field has been winnowing down. The first type is 3D-card-specific API's; out of the several that have been created, only one has survived, and that is Glide, for 3dfx cards. The second type is API's associated with specific operating systems, such as Microsoft's Direct3D and Apple's QuickDraw 3D RAVE. These have been doing well, with Microsoft and (until recently) Apple continuing to support them. The third type is API's not tied to any specific card or OS; only one is widely used, and that is OpenGL.

3D-card-specific API's had been created because of a lack of more widespread alternatives, but the emergence of such alternatives has caused most of them to be dropped, due to the unwillingness of game programmers to write several versions of their rendering code. This suggests that Glide will eventually have the same fate. However, the OS-specific API's have had a better fate because their being supported by the OS makers guarantees their wide support. But one of the two main such examples, RAVE, is being abandoned by Apple in favor of OpenGL, because of RAVE's very limited support. The other one, Direct3D, is most likely being continued for political reasons, more specifically, to create captive markets and to squash competition with them.

Future Directions

One interesting question is whether there are any further generations beyond the third generation of indoor engines, or whether we will essentially see elaborations of the third-generation paradigms. I suspect the latter, since 3D-modeling software for doing high-quality renderings generally uses 3rd-generation-style modeling with elaborations such as fog, curved surfaces, reflections, and bump mapping. And these are elaborations that are starting to appear in RT3D game engines.

Another interesting question is that of possible improvements in rendering techniques. All those currently used for RT3D games have something in common: they do not do explicit ray tracing, but instead use an implicit form of it, projecting objects onto view space and rendering those projections. This is computationally much cheaper than "real" ray tracing, which means that it is likely to continue being used in the near future, even as computer performance increases. Instead, those extra CPU cycles will likely be used for fancier and more detailed rendering, such as the sorts of elaborations described earlier.

This suggests that the next big jumps are likely to take place in other fields, such as sound, game physics, and Artificial Intelligence (AI). I will now examine these.

Sound

Sound is generally handled by replaying sound samples; this is done for sound effects and voices, and sometimes for background music. Continuous sound is created by looping the sample, continuously repeating it. Variations can be created by selecting among several samples for the sound to use. For music, however, there are higher-order techniques that are often used, something like doing 3D models instead of sprites.

An important such technique is sequencing, which is reading off of a list of notes to be played and playing a sample of each note. This makes for much more efficient storage of instrumental-music information, since each note sound has to be stored only once -- and different pitches can be created by replaying samples at different speeds. This is how the Amiga MOD music-file format works. The MIDI format ("Musical Instrument Digital Interface") only specifies sequences, though "General MIDI" is a standard set of note-sound assignments.
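Sequencing with a single stored sample can be sketched as below. The function names are made up for this illustration, the replay uses crude nearest-sample lookup, and the semitone-to-speed formula assumes the usual twelve-tone equal temperament; none of this is taken from the MOD format itself.

```python
def replay(sample, speed):
    """Replay a list of sample values at `speed` times normal speed.

    Speed 2.0 skips every other value (an octave up, half as long);
    speed 0.5 repeats each value (an octave down, twice as long).
    """
    out = []
    pos = 0.0
    while pos < len(sample):
        out.append(sample[int(pos)])
        pos += speed
    return out


def semitone_speed(semitones):
    """Replay speed for a pitch shift in semitones (equal temperament assumed)."""
    return 2.0 ** (semitones / 12.0)


def play_sequence(sample, notes):
    """Play a list of semitone offsets, one after another, from one sample."""
    out = []
    for semitones in notes:
        out.extend(replay(sample, semitone_speed(semitones)))
    return out


if __name__ == "__main__":
    sample = [0, 1, 2, 3, 4, 5, 6, 7]   # stand-in for a recorded note
    print(replay(sample, 2.0))           # [0, 2, 4, 6]
    print(len(play_sequence(sample, [0, 12, -12])))
```

Only the one sample and the short list of semitone offsets need to be stored, which is where the saving comes from.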

This principle can be extended further with fancier algorithms for generating note sounds, such as doing filtering and varying the overall amplitude, pitch (frequency modulation), and other such quantities. This can be used to make an exponentially decaying sound with higher frequencies decaying faster, for example. In addition, more than one sample can be used, for example, one sample for the initial part and one sample for the sustaining part. Furthermore, the samples themselves can be made very short, such as one cycle long, and can be such idealized mathematical shapes as sine waves, square waves, and sawtooth waves.
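As one possible reading of the decaying-sound example, a note can be built from one-cycle sine shapes at several multiples of the base frequency, each with its own exponential envelope. The sample rate, number of harmonics, and decay constants below are arbitrary choices for illustration.

```python
import math


def decaying_note(freq, duration, rate=8000, harmonics=4, base_decay=3.0):
    """Sum of sine harmonics whose amplitudes decay exponentially.

    Harmonic k (at k * freq) decays k times as fast as the fundamental,
    so the tone gets duller as it fades.
    """
    samples = []
    for i in range(int(duration * rate)):
        t = i / rate
        value = 0.0
        for k in range(1, harmonics + 1):
            amplitude = math.exp(-base_decay * k * t) / k
            value += amplitude * math.sin(2.0 * math.pi * freq * k * t)
        samples.append(value)
    return samples


if __name__ == "__main__":
    note = decaying_note(220.0, 0.5)
    print(len(note), "samples")
    print("peak near start:", max(abs(s) for s in note[:100]))
    print("peak near end:  ", max(abs(s) for s in note[-100:]))
```

Swapping the sine for a square or sawtooth shape, or changing the envelope, gives the other variations mentioned above.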

There are alternative techniques, such as physical modeling. For example, many pitched instruments have linear oscillators (strings, air columns, etc.), and these can be modeled with the digital equivalent of delay lines (some long first-in-first-out buffers), while analog electronics can be modeled as an appropriate set of differential equations. Though some of these techniques tend to be computationally expensive, I doubt that they would be too expensive for some present-day CPU's.
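A well-known small instance of the delay-line idea is the Karplus-Strong plucked-string algorithm, sketched below. I am naming it as one example of the technique, not as what any particular game or synthesizer uses; the sample rate and damping value are arbitrary.

```python
import random
from collections import deque


def pluck(freq, duration, rate=8000, damping=0.996, seed=0):
    """Karplus-Strong plucked string.

    A first-in-first-out buffer one period long is filled with noise
    (the "pluck"). Each output value is fed back into the buffer after
    being averaged with its neighbor, which slowly damps the string and
    removes high frequencies first.
    """
    rng = random.Random(seed)
    period = max(2, int(rate / freq))            # delay-line length in samples
    line = deque(rng.uniform(-1.0, 1.0) for _ in range(period))
    out = []
    for _ in range(int(duration * rate)):
        first = line.popleft()
        out.append(first)
        line.append(damping * 0.5 * (first + line[0]))
    return out


if __name__ == "__main__":
    tone = pluck(200.0, 0.5)
    print(len(tone), "samples, first few:", [round(s, 3) for s in tone[:5]])
```

The inner loop is one addition and two multiplications per output sample, which supports the point that at least the simpler physical models are cheap.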

What can be used for music can also be used for sound effects; such algorithms can be used to make explosions (for example) sound different every time.

The "ultimate" sound is, of course, the human voice, and there have been various efforts at doing speech synthesis, with varying degrees of success. This is more difficult than might first appear, because one has to get intonation (timing, pitch, intensity) right. As a result, current computer-generated speech often seems stereotypically robotic. Getting the intonation right may require serious AI (see the appendix on spelling), and such AI may also be necessary for clumsy spelling systems such as that for English, which has an ugly mixture of quasiphonetic and logographic spellings. For these reasons, having game characters speak from ASCII text in game-data files has generally not been done, although there has been significant progress on the text-to-speech front, such as what is bundled with Apple's MacOS. However, even that does not sound quite right, and for that reason, the MacOS text-to-speech parser can be given hints in the text.

Game Physics

There are prospects for improvement here also. For example, most game characters have a limited set of states and degrees of freedom, and their animations are usually composed in advance. More specifically, they typically have only a position, a direction, and a state (stationary, moving, etc.). However, there have been a few efforts to move beyond that stage, such as the behavior of the Tomb Raider series' player character's animated ponytail (present in all but the first of the series), which is hardcoded in the engine. And future productions may well go beyond that, to direct simulations of walking, running, and so forth. Not to mention accurate physics for a variety of other objects.

Artificial Intelligence (AI)

Finally, the ultimate in computer-game improvements. One attraction of multiplayer games is that human opponents and collaborators can still be smarter than much of the Artificial Intelligence that is available for computer-game characters. However, AI has proven to be a much more difficult subject than had been thought in the first few decades of computing; one reason is that the world we live in has numerous features that have to be learned -- features that we often take for granted. Some examples of this phenomenon can be found in the appendix on spelling. Another source of examples is natural-language translation; this was one AI application that had gotten a lot of hype in the early years of computers, but the reality is that natural languages have much more complexity than appears at first sight, meaning that one often does not get neat mappings from one language to another. For examples of this problem, try out the translator server Babelfish; try some round-trip translations, especially translations of idiomatic phrases (those with a different meaning than what the combined words might suggest). So it is difficult for me to come to conclusions about the ultimate prospects of game AI.

The most successful game AI to date has been in rather stylized sorts of games with simple game worlds, such as chess. The difficulty there results from the abundant combinatorics of the different kinds of pieces and their possible positions. The usual method of deciding on moves in such games is to do a tree search of the various possible moves, while being careful to avoid following up on obviously bad ones. The search depth can be extended as far as one has the computational ability to do so, and some chess software can provide impressive competition for champion (human) players.

However, this sort of AI is unsuited for the sorts of games discussed here, which have a much more real-world-ish quality than chess. To give one example, Bungie's Myth 2: Soulblighter has a multiplayer-game mode called "Assassin", in which one tries to kill one's opponents' assassin targets, thus making a close analogy with chess. However, there are numerous differences between Myth and chess. Myth is real-time rather than turn-based, and its time is essentially continuous. Characters' positions are also essentially continuous, they can move to any territory accessible to them, their attacks can be long-distance, and their attacks are by sending out projectiles which generally do not kill at the first hit. AI suitable for Myth is thus very different from AI suitable for chess. And Myth does use some AI for its troops, because otherwise, chess-like control of one's troops would mean a painful amount of micromanagement. One designates destinations and targets by clicking on them, and one's troops will then find their way and attack. They can go around obstacles, and they will even be careful to avoid attacking non-hostile characters.

But there has still been some progress in interesting AI. One example is the "bot", a kind of non-player character that imitates a (human) player in first-person shooters like Quake and Unreal. Some of these have been fairly successful in imitating some human-player strategies. Some of the best bot designs use such AI techniques as Fuzzy Logic, the logic of partial category membership, which has gotten an abundance of use in embedded-system controllers.
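Partial category membership can be sketched with a toy bot rule: "retreat if health is low AND an enemy is near". The thresholds, the function names, and the use of the minimum for AND are illustrative assumptions of mine, not taken from any actual bot.

```python
def falling(x, full, zero):
    """Membership that is 1.0 at or below `full`, 0.0 at or above `zero`,
    and falls off linearly in between."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def retreat_urge(health, enemy_distance):
    """Degree (0.0 to 1.0) to which the bot should retreat."""
    low_health = falling(health, 20.0, 60.0)          # "health is low"
    enemy_near = falling(enemy_distance, 5.0, 25.0)   # "enemy is near"
    return min(low_health, enemy_near)                # fuzzy AND


if __name__ == "__main__":
    print(retreat_urge(40.0, 5.0))    # 0.5: half-low health, enemy very near
    print(retreat_urge(100.0, 1.0))   # 0.0: healthy, so no urge to retreat
```

Unlike a hard threshold, the result changes smoothly as health drops or the enemy approaches, which tends to make a bot's behavior look less mechanical.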

Given the disappointment that AI has been, at least to me, I'm not willing to claim much by way of expected progress. But I'm not going to complain much if I am proved wrong.


Appendix: Spelling Systems

Here I will discuss features of various spelling systems and indicate why some require more sophisticated Artificial Intelligence to interpret than others. I will start with the simplest sort of spelling, phonetic spelling, and move to other sorts of spelling. On the way, I will examine English spelling and show that it is hard to fit into simple categories -- and show why it is so difficult.

The ideal of phonetic spelling would be a translation of every alphabet letter into a speech sound; this has the potential to become very clumsy. One problem is allophonic variation: sets of similar sounds, each of which occurs in a different context. In English, voiceless stop consonants at the beginnings of words or of stressed syllables have a puff of air after them, which other voiceless stops do not ("tick" vs. "stick", "poke" vs. "spoke", "cope" vs. "scope", etc.). This can be taken care of by using the same letter for each set of such sounds, as is done in the above examples. One ingenious example of this kind of economy is in Korean's Hangul alphabet, in which the silent placeholder at the start of a vowel-initial syllable and the -ng at the end of a syllable are represented by the same letter. A related problem is the lack of direct intonation information in most writing; intonation is generally hinted at by punctuation. For example, questions are often delivered with rising-pitch intonation; the question mark is an intonation hint here.

Even with such shortcuts, an alphabet to represent high-level speech sounds (phonemes) can still be large; however, such "alphabets" have been constructed with various devices such as letter combinations (repeated letters for long sounds, for example) and marks on letters. English can be called quasiphonetic, because it is usually possible to unambiguously pronounce nonsense words, even though the necessary parsing is heavily context-dependent. For example, e's at the ends of words are often silent, but when present, they usually lengthen previous vowels. Consider "bit" and "bite", for example. However, there are numerous exceptions; an example cited by George Bernard Shaw is using "ghoti" for "fish", using the "gh" in "rough", the "o" in "women", and the "ti" in "nation". The interesting feature of this example is that English speakers have no trouble assigning a pronunciation to "ghoti" -- one very different from that of "fish".

Phonetic spelling need not have a symbol for each sound; it can have a symbol for each syllable, a variation that has sometimes been used. Japanese, for example, has two such syllabaries (hiragana and katakana). One problem with syllabaries, however, is that they require more symbols than (phonetic) alphabets.

Having tackled the question of phonetic spelling, we now turn to logographic spelling. This is using a single symbol to represent each word. This quickly becomes very clumsy, which is why phonetic spelling systems have been introduced, usually by way of specializing some logographic symbols to represent certain sounds or syllables. Such spellings have historically started out as pictures, but many words express concepts that are not easily picturable. The Chinese approach has been to combine a semantic hint and a phonetic hint; thus the sign for "mother" is a combination of the signs for "woman" (semantic) and "horse" (phonetic; in Chinese, they are both ma, but with different tones). The Egyptian and Mesopotamian approach has been to create an alphabet or syllabary, but in addition to use determinative symbols. Thus, in ancient Egyptian, people's names are written phonetically, but are followed by a sign for "man" or "woman". Japanese also has a mixture of phonetic and logographic spelling, the latter being borrowed Chinese symbols.

Alphabets may seem to make logographic spelling unnecessary, but creeping logography nevertheless happens. This can happen with words whose spellings are borrowed intact, as happens with English (some languages respell words to conform to their spelling conventions), and this can happen when spelling goes out of sync with pronunciation, as has happened with English. Consider the vowel sounds in "beet", "beat", and "receive" -- most of them are the same, but spelled differently. The first two were once pronounced differently, while the third one is a borrowed word, complete with spelling. The first two suggest an interesting use of logographic spelling -- to distinguish same- or similar-sounding words that otherwise would be spelled the same. This is sometimes done on purpose; Italian spelling features an initial silent h on some words, despite a tendency to spell words according to its phonetic-spelling rules. Logographic spelling can be taken to the extreme of constructing words from sets of alphabet symbols without regard to pronunciation rules; Chinese uses something like that system in its writing, in which each symbol is constructed from members of a set of about 200 basic symbols (214 in the traditional Kangxi dictionary scheme).

Although English does not go nearly that far, there are cases where one needs a word's context to determine its pronunciation. For example, "read" is pronounced differently in present tense and past tense, and "record" is pronounced differently, depending on whether it is currently a noun or a verb ("I will record this record"). Making this selection thus requires some sort of sentence-level parsing, if not natural-language understanding.
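To make the "record" example concrete, here is a deliberately crude sketch that guesses noun or verb from the preceding word alone. The cue-word lists and the rough respellings are made up for illustration; real text-to-speech systems need far more than this, which is the point of the paragraph above.

```python
# Hypothetical respellings; capital letters mark the stressed syllable.
PRONUNCIATIONS = {
    ("record", "noun"): "REK-erd",
    ("record", "verb"): "rih-KORD",
}
# Words that, in this toy model, signal that a verb follows.
VERB_CUES = {"will", "to", "can"}


def pronounce_record(sentence):
    """Return a pronunciation for each occurrence of "record" in `sentence`.

    The word is treated as a verb if the previous word is a verb cue,
    and as a noun otherwise.
    """
    words = sentence.lower().split()
    result = []
    for previous, word in zip([""] + words, words):
        if word != "record":
            continue
        role = "verb" if previous in VERB_CUES else "noun"
        result.append(PRONUNCIATIONS[(word, role)])
    return result


if __name__ == "__main__":
    print(pronounce_record("I will record this record"))
    # ['rih-KORD', 'REK-erd']
```

A sentence such as "They record everything" already defeats this rule, since nothing in the preceding word marks "record" as a verb.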

So while some spelling systems can easily be translated into pronunciation, some others require more advanced Artificial Intelligence to do so, and one example of that is English spelling.