In particular, I will be focusing on game engines that do real-time 3D rendering; this is rendering that uses three-dimensional geometrical information. However, I will be discussing 2D-geometry game engines as appropriate, and as I will note, many 3D engines retain various 2D features.
I'd originally considered this question when working out which game engines would be worth porting maps to from Bungie's Marathon series; since then, I've decided to consider this question more carefully, and the result is this taxonomy.
I will say little about outer-space engines here, since they have very little by way of world geometry, and because the rendering of their inhabitants parallels that of other kinds of game engines.
Outdoor engines, with the exception of flight simulators and the like, generally feature a top-down or a slanted view; these engines are essentially 2D, with only recent ones having some 3D features. Bungie's Myth series was the first to have real 3D features such as terrain elevations and 3D projectile physics.
Indoor engines are the ones with the most advanced 3D rendering, and it is these that I will discuss in the most detail. Since this is a fairly big field, I will subdivide it into generations, each with characteristic rendering features.
The last one happens because the view direction stays horizontal, while the view window (what one sees on screen) gets moved up or down; this is only a fake sort of view-direction change. And this is what limits the vertical range, since too high a vertical shift would cause serious distortion. A true vertical-direction change, as in what happens with third-generation engines, would make vertical lines appear to converge at some point, and not stay parallel, as they do here.
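The difference between the two kinds of vertical view change can be sketched in a few lines of Python. This is only an illustration under my own assumptions (a pinhole projection, with x sideways, depth forward, and z up; the function names are hypothetical), not any particular engine's code.

```python
import math

def project_sheared(x, depth, z, shift, focal=1.0):
    """Fake look up/down: the view direction stays horizontal and the
    view window is simply slid vertically by `shift`."""
    sx = focal * x / depth
    sy = focal * z / depth - shift
    return sx, sy

def project_pitched(x, depth, z, pitch, focal=1.0):
    """True look up/down: rotate the view direction by `pitch` radians
    about the sideways axis, then project."""
    c, s = math.cos(pitch), math.sin(pitch)
    depth2 = depth * c + z * s      # distance along the tilted view direction
    z2 = -depth * s + z * c         # height relative to the tilted view direction
    return focal * x / depth2, focal * z2 / depth2

# Two points on the same vertical line (same x and depth, different heights).
bottom, top = (1.0, 4.0, 0.0), (1.0, 4.0, 2.0)
print(project_sheared(*bottom, shift=0.3), project_sheared(*top, shift=0.3))
print(project_pitched(*bottom, pitch=0.3), project_pitched(*top, pitch=0.3))
```

With the shifted window, both points get the same horizontal screen coordinate, so the line stays vertical on screen; with the true rotation, the horizontal coordinate depends on height, which is the convergence of vertical lines.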
One trend that became noticeable in this generation is the licensing of various game engines to other game companies; Doom's engine has been licensed to create Heretic and Hexen, Duke Nukem's Build engine has been licensed to create Shadow Warrior, Redneck Rampage, etc., and Marathon's engine has been licensed to create ZPC, Damage Inc., and Prime Target. Licensing means that licensees can save on much of the software development needed to create a new game engine, and focus on whatever specific features they may want to offer.
3D-model character animation is more complicated than sprite animation, which simply plays back a series of images. The most common way of doing it is the Quake approach, which uses a sequence of vertex sets for a single continuous model. An alternative is the Tomb Raider approach, which is to break the models up into segments, and animate by moving those segments relative to each other; this is a form of skeletal animation. A hybrid method, now used in games like Half-Life, uses Quake's continuous-skin approach, but with the vertices set in Tomb Raider fashion or a generalization of it.
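The vertex-set approach can be sketched as follows. The frame layout, the `pose_at` name, and the frame rate are my own assumptions for illustration, and the optional linear blending between neighboring frames is an addition that an engine may or may not do.

```python
def pose_at(frames, t, fps=10.0, blend=True):
    """Return the model's vertex positions at time t (seconds, t >= 0).

    `frames` is a list of vertex sets; each vertex set is a list of
    (x, y, z) tuples, with the same vertex order in every frame.
    """
    pos = t * fps
    i = int(pos) % len(frames)          # current keyframe (looping)
    j = (i + 1) % len(frames)           # next keyframe
    f = (pos - int(pos)) if blend else 0.0
    return [
        tuple(a + (b - a) * f for a, b in zip(va, vb))
        for va, vb in zip(frames[i], frames[j])
    ]

# A one-vertex "model" with two frames, sampled halfway between them.
frames = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
print(pose_at(frames, 0.5, fps=1.0))
print(pose_at(frames, 0.5, fps=1.0, blend=False))
```

The cost of this approach is storage: every frame carries a full copy of the vertex positions, which is part of what makes the segment-based alternative attractive.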
Licensing has only intensified in this generation, with the Quake family and Unreal currently having several licensees (Hexen 2, Heretic 2, SiN, Half-Life, Klingon Honor Guard, etc.), with more to come. The saving of development effort is clearly substantial here; John Carmack, id's master programmer, has compared writing a new game engine to writing an operating system.
One interesting oddity here is the Tomb Raider engine. Though its view and inhabitants are third-generation, its world geometry is really second-generation, though with sloped floors and ceilings and extensive use of both horizontal and vertical submap portals. As a result, the world geometry approximates the third-generation appearance.
This has led to some interesting "API wars", over what Application Programming Interface to use in giving rendering instructions to 3D cards. There has been an abundance of these, though the field has been winnowing down. The first type is the 3D-card-specific API's; out of the several that have been created, only one has survived, and that is Glide, for 3dfx cards. The second type is those associated with specific operating systems, such as Microsoft's Direct3D and Apple's QuickDraw 3D RAVE. These have been doing well, with Microsoft and (until recently) Apple continuing to support them. The third type is those not tied to any specific card or OS; only one is widely used, and that is OpenGL.
3D-card-specific API's had been created because of a lack of more widespread alternatives, but the emergence of such alternatives has caused most of them to be dropped, due to the unwillingness of game programmers to write several versions of their rendering code. This suggests that Glide will eventually have the same fate. However, the OS-specific API's have had a better fate because their being supported by the OS makers guarantees their wide support. But one of the two main such examples, RAVE, is being abandoned by Apple in favor of OpenGL, because of its very limited support. The other one, Direct3D, is most likely being continued because of political reasons, more specifically, to create captive markets and to squash competition with them.
One interesting question is what improvements in rendering techniques are possible. All those currently used for RT3D games have some things in common: they do not do explicit ray tracing, but instead use an implicit form of it, of projecting objects onto view space and rendering those projections. This is computationally much cheaper than "real" ray tracing, which means that it is likely to continue being used in the near future, even as computer performance increases. Instead, those extra CPU cycles will likely be used for fancier and more detailed rendering, such as for doing some of the sorts of rendering described earlier.
This suggests that the next big jumps are likely to take place in other fields, such as sound, game physics, and Artificial Intelligence (AI). I will now examine these.
An important such technique is sequencing, which is reading off of a list of notes to be played and playing a sample of each note. This makes for much more efficient storage of instrumental-music information, since each note sound has to be stored only once -- and different pitches can be created by replaying samples at different speeds. This is how the Amiga MOD music-file format works. The MIDI format ("Musical Instrument Digital Interface") only specifies sequences, though "General MIDI" is a standard set of note-sound assignments.
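Here is a minimal sketch of sequencing with one stored sample. The function names are hypothetical, the pitch steps assume equal-tempered semitones (a speed ratio of 2^(n/12)), and the nearest-sample lookup is the crudest possible resampling; none of this is meant to reproduce the actual MOD file layout.

```python
def resample(sample, speed, length):
    """Read `sample` at `speed` times its normal rate, looping at the end.
    Uses nearest-sample lookup; real players usually interpolate."""
    return [sample[int(i * speed) % len(sample)] for i in range(length)]

def semitone_ratio(semitones):
    """Playback-speed ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

def render(sequence, sample, note_len):
    """Play each note in `sequence` (semitone offsets from the sample's
    own pitch) for `note_len` output samples."""
    out = []
    for semitones in sequence:
        out.extend(resample(sample, semitone_ratio(semitones), note_len))
    return out

ramp = [0, 1, 2, 3, 4, 5, 6, 7]          # stand-in for a recorded note
print(render([0, 12], ramp, 4))           # original pitch, then an octave up
```

Playing the sample an octave higher just means stepping through it twice as fast, so the sequence itself needs only a few bytes per note.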
This principle can be extended further with fancier algorithms for generating note sounds, such as doing filtering and varying the overall amplitude, pitch (frequency modulation), and other such quantities. This can be used to make an exponentially decaying sound with higher frequencies decaying faster, for example. In addition, more than one sample can be used, for example, one sample for the initial part and one sample for the sustaining part. Furthermore, the samples themselves can be made very short, such as one cycle long, and can be such idealized mathematical shapes as sine waves, square waves, and sawtooth waves.
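As an illustration of the decaying-sound example, here is a sketch that builds a tone out of sine waves, with each higher harmonic given a faster exponential decay. The parameter names and default values are arbitrary choices of mine.

```python
import math

def decaying_tone(freq, duration, rate=8000, harmonics=4, base_decay=3.0):
    """Sum of sine-wave harmonics of `freq`; harmonic k has amplitude 1/k
    and decays as exp(-base_decay * k * t), so higher ones die off faster."""
    out = []
    for i in range(int(duration * rate)):
        t = i / rate
        value = 0.0
        for k in range(1, harmonics + 1):
            envelope = math.exp(-base_decay * k * t)
            value += envelope * math.sin(2.0 * math.pi * k * freq * t) / k
        out.append(value)
    return out

tone = decaying_tone(220.0, 0.5)
print(len(tone), max(abs(v) for v in tone[:400]), max(abs(v) for v in tone[-400:]))
```

The result starts out bright and ends as a nearly pure, quiet fundamental, which is roughly how a plucked or struck instrument behaves.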
There are alternative techniques, such as physical modeling. For example, many pitched instruments have linear oscillators (strings, air columns, etc.), and these can be modeled with the digital equivalent of delay lines (some long first-in-first-out buffers), while analog electronics can be modeled as an appropriate set of differential equations. Though some of these techniques tend to be computationally expensive, I doubt that they would be too expensive for some present-day CPU's.
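One well-known delay-line technique is the Karplus-Strong plucked-string algorithm, which I use here as a concrete stand-in for the general idea; the function name and parameter values are my own. A first-in-first-out buffer one period long is filled with noise, and each sample that comes out is averaged with its neighbor, damped, and fed back in.

```python
import random
from collections import deque

def pluck(freq, duration, rate=8000, damping=0.996, seed=0):
    """Karplus-Strong plucked string: a noise burst circulating in a
    delay line with a slight low-pass filter in the feedback path."""
    rng = random.Random(seed)
    period = max(2, int(rate / freq))            # delay-line length in samples
    line = deque(rng.uniform(-1.0, 1.0) for _ in range(period))
    out = []
    for _ in range(int(duration * rate)):
        first = line.popleft()
        out.append(first)
        # Average with the next sample and damp: high frequencies fade fastest.
        line.append(damping * 0.5 * (first + line[0]))
    return out

samples = pluck(200.0, 0.5)
print(len(samples), samples[:3])
```

The per-sample cost is one addition and a couple of multiplications, which supports the point that at least the simpler physical models are cheap enough for real-time use.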
What can be used for music can also be used for sound effects; such algorithms can be used to make explosions (for example) sound different every time.
The "ultimate" sound is, of course, the human voice, and there have been various efforts at doing speech synthesis, with varying degrees of success. This is more difficult than might first appear, because one has to get intonation (timing, pitch, intensity) right. As a result, current computer-generated speech often seems stereotypically robotic. Getting the intonation right may require serious AI (see the appendix on spelling), and such AI may also be necessary for clumsy spelling systems such as that for English, which has an ugly mixture of quasiphonetic and logographic spellings. For these reasons, having game characters speak from ASCII text in game-data files has generally not been done, although there has been significant progress on the text-to-speech front, such as what is bundled with Apple's MacOS. However, even that does not sound quite right, and for that reason, the MacOS text-to-speech parser can be given hints in the text.
The most successful game AI to date has been in rather stylized sorts of games with simple game worlds, such as chess. The difficulty there results from the abundant combinatorics of the different kinds of pieces and their possible positions. The usual method of deciding on moves in such games is to do a tree search of the various possible moves, while being careful to avoid following up on obviously bad ones. The search depth can be extended as far as one has the computational ability to do so, and some chess software can provide impressive competition for champion (human) players.
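A tree search of this kind can be sketched on a game small enough to search completely. Chess is far too big for a short example, so I substitute a simple Nim variant (take 1 to 3 stones; whoever takes the last stone wins), and I use alpha-beta pruning as one standard way of not following up on lines that are already known to be bad. The function names are hypothetical.

```python
def negamax(stones, depth, alpha=-1, beta=1):
    """Score for the player to move: +1 win, -1 loss, 0 unknown (depth cutoff)."""
    if stones == 0:
        return -1                     # the opponent just took the last stone
    if depth == 0:
        return 0                      # search horizon reached
    best = -1
    for take in (1, 2, 3):
        if take > stones:
            break
        # The opponent's best score, negated, is our score for this move.
        score = -negamax(stones - take, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break                     # prune: this line cannot change the result
    return best

def best_move(stones, depth=10):
    """Pick the move whose resulting position is worst for the opponent."""
    moves = [t for t in (1, 2, 3) if t <= stones]
    return max(moves, key=lambda t: -negamax(stones - t, depth - 1))

print(best_move(5), best_move(7), negamax(4, 10))
```

In this game the losing positions are the multiples of 4, and the search finds the moves that leave the opponent on one. A chess program has the same skeleton, but with a position evaluator in place of the zero returned at the depth cutoff.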
However, this sort of AI is unsuited for the sorts of games discussed here, which have a much more real-world-ish quality than chess. To give one example, Bungie's Myth 2: Soulblighter has a multiplayer-game mode called "Assassin", in which one tries to kill one's opponents' assassin targets, thus making a close analogy with chess. However, there are numerous differences between Myth and chess. Myth is real-time rather than turn-based, and its time is essentially continuous. Characters' positions are also essentially continuous; they can move to any territory accessible to them, their attacks can be long-distance, and their attacks work by sending out projectiles, which generally do not kill at the first hit. AI suitable for Myth is thus very different from AI suitable for chess. And Myth does use some AI for its troops, because otherwise, chess-like control of one's troops would mean a painful amount of micromanagement. One designates destinations and targets by clicking on them, and one's troops will then find their way and attack. They can go around obstacles, and they will even be careful to avoid attacking non-hostile characters.
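I do not know how Myth itself finds paths, but the general idea of going around obstacles can be sketched with a breadth-first search over a grid. The grid is a simplifying assumption (Myth's positions are essentially continuous), and the map format and function name are hypothetical.

```python
from collections import deque

def find_path(grid, start, goal):
    """Shortest 4-connected path from start to goal as a list of (row, col),
    or None if the goal is unreachable. '.' is open ground, '#' an obstacle."""
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:          # walk back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == "." and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

terrain = ["....",
           ".##.",
           "...."]
print(find_path(terrain, (1, 0), (1, 3)))
```

The player only supplies the destination; the detour around the obstacle is worked out by the search, which is the kind of micromanagement the troop AI takes off one's hands.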
But there has still been some progress in interesting AI. One example is the "bot", a kind of non-player character that imitates a (human) player in first-person shooters like Quake and Unreal. Some of these have been fairly successful in imitating some human-player strategies. Some of the best bot designs use such AI techniques as Fuzzy Logic, the logic of partial category membership, which has gotten an abundance of use in embedded-system controllers.
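To show what partial category membership means in practice, here is a sketch of a single fuzzy rule a bot might use: "retreat if health is low AND an enemy is near". The rule, the thresholds, and the names are invented for illustration and are not taken from any actual bot.

```python
def ramp_down(x, lo, hi):
    """Membership that is 1 at or below `lo`, 0 at or above `hi`,
    and falls linearly in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)

def retreat_urge(health, enemy_distance):
    """Degree (0 to 1) to which the rule 'health is low AND enemy is near' holds."""
    low_health = ramp_down(health, 20.0, 60.0)
    enemy_near = ramp_down(enemy_distance, 5.0, 30.0)
    return min(low_health, enemy_near)       # a common choice for fuzzy AND

print(retreat_urge(10, 2), retreat_urge(40, 2), retreat_urge(100, 2))
```

Instead of a hard cutoff at some health value, the urge to retreat rises smoothly as health drops, which tends to make behavior look less mechanical.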
Given the disappointment that AI has been, at least to me, I'm not willing to claim much by way of expected progress. But I'm not going to complain much if I am proved wrong.
The ideal of phonetic spelling would be a translation of every alphabet letter into a speech sound; this has the potential to become very clumsy. One problem is allophonic variation: sets of similar sounds, each of which occurs in a different context. In English, voiceless consonants at the beginnings of words or after unstressed syllables have a puff of air after them, which other voiceless consonants do not ("tick" vs. "stick", "poke" vs. "spoke", "cope" vs. "scope", etc.). This can be taken care of by using the same letter for each set of such sounds, as is done in the above examples. One ingenious example of this is in Korean's Hangul alphabet, in which h- and -ng are represented by the same letter. A related problem is the lack of direct intonation information in most writing; intonation is generally hinted by supplying punctuation. For example, questions are often delivered with rising-pitch intonation; the question mark is an intonation hint here.
Even with such shortcuts, an alphabet to represent high-level speech sounds (phonemes) can still be large; however, such "alphabets" have been constructed with various devices such as letter combinations (repeated letters for long sounds, for example) and marks on letters. English can be called quasiphonetic, because it is usually possible to unambiguously pronounce nonsense words, even though the necessary parsing is heavily context-dependent. For example, e's at the ends of words are often silent, but when present, they usually lengthen previous vowels. Consider "bit" and "bite", for example. However, there are numerous exceptions; an example cited by George Bernard Shaw is using "ghoti" for "fish", using the "gh" in "rough", the "o" in "women", and the "ti" in "nation". The interesting feature of this example is that English speakers have no trouble assigning a pronunciation to "ghoti" -- one very different from that of "fish".
Phonetic spelling need not have a symbol for each sound; it can have a symbol for each syllable, a variation that has sometimes been used. Japanese, for example, has two such syllabaries (hiragana and katakana). One problem with syllabaries, however, is that they require more symbols than (phonetic) alphabets.
Having tackled the question of phonetic spelling, we now turn to logographic spelling. This is using a single symbol to represent each word. This quickly becomes very clumsy, which is why phonetic spelling systems have been introduced, usually by way of specializing some logographic symbols to represent certain sounds or syllables. Such spellings have historically started out as pictures, but many words express concepts that are not easily picturable. The Chinese approach has been to combine a semantic hint and a phonetic hint; thus the sign for "mother" is a combination of the signs for "woman" (semantic) and "horse" (phonetic; in Chinese, they are both ma, but with different tones). The Egyptian and Mesopotamian approach has been to create an alphabet or syllabary, but in addition to use determinative symbols. Thus, in ancient Egyptian, people's names are written phonetically, but are followed by a sign for "man" or "woman". Japanese also has a mixture of phonetic and logographic spelling, the latter being borrowed Chinese symbols.
Alphabets may seem to make logographic spelling unnecessary, but creeping logography nevertheless happens. This can happen with words whose spellings are borrowed intact, as happens with English (some languages respell words to conform to their spelling conventions), and this can happen when spelling goes out of sync with pronunciation, as has happened with English. Consider the vowel sounds in "beet", "beat", and "receive" -- they are all the same, but spelled differently. The first two were once pronounced differently, while the third one is a borrowed word, complete with spelling. The first two suggest an interesting use of logographic spelling -- to distinguish same- or similar-sounding words that otherwise would be spelled the same. This is sometimes done on purpose; Italian spelling features an initial silent h on some words, despite a tendency to spell words according to its phonetic-spelling rules. Logographic spelling can be taken to the extreme of constructing words from sets of alphabet symbols without regard to pronunciation rules; Chinese uses something like that system in its writing, in which each symbol is constructed from members of a set of 214 basic symbols (the traditional radicals).
Although English does not go nearly that far, there are cases where one needs a word's context to determine its pronunciation. For example, "read" is pronounced differently in present tense and past tense, and "record" is pronounced differently, depending on whether it is currently a noun or a verb ("I will record this record"). Making this selection thus requires some sort of sentence-level parsing, if not natural-language understanding.
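The lookup side of this problem is easy to sketch; the hard part is producing the context tags, which is where the parsing comes in. In the sketch below the tags are simply supplied by hand, and the table, the tag names, and the rough respellings are all my own ad hoc inventions.

```python
# Hypothetical table: (word, context tag) -> rough respelling.
PRONUNCIATIONS = {
    ("read", "present"): "reed",
    ("read", "past"): "red",
    ("record", "noun"): "REK-ord",
    ("record", "verb"): "ri-KORD",
}

def pronounce(word, tag=None):
    """Look up a context-dependent pronunciation; words not in the table
    are passed through unchanged (lowercased) as a placeholder."""
    key = (word.lower(), tag)
    if key in PRONUNCIATIONS:
        return PRONUNCIATIONS[key]
    return word.lower()

def speak(tagged_words):
    """Join the pronunciations of a list of (word, tag) pairs."""
    return " ".join(pronounce(word, tag) for word, tag in tagged_words)

sentence = [("I", None), ("will", None), ("record", "verb"),
            ("this", None), ("record", "noun")]
print(speak(sentence))
```

A real text-to-speech system would have to work out "verb" and "noun" for itself from the rest of the sentence, and that step is what calls for parsing or something stronger.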
So while some spelling systems can easily be translated into pronunciation, some others require more advanced Artificial Intelligence to do so, and one example of that is English spelling.