Examples from
Exploring Baseball with Retrosheet and Mathematica

©2007 Ken Levasseur
klevasseur@mac.com
http://homepage.mac.com/klevasseur/erm

The data files used here were obtained free of charge from, and are copyrighted by, Retrosheet.  Interested parties may contact Retrosheet at "www.retrosheet.org".

Contents

Palette

A note on timing

Ballpark Testing

Game Log Testing

Event File Testing

Roster Files

FromPlayerCode

TeamsPlayedFor

SearchPlayer

Season Plotting

Play Results

Innings, InningsWithPitchers and Matchup

Select from Games

What's Next?

Palette

If you select the next cell and then choose "Generate Palette from Selection" from the Palette menu, you can save this first version of the "Baseball Palette."

"BaseballTesting_4.gif"
"BaseballTesting_5.gif"
"BaseballTesting_6.gif"

Load the package

Note:  To replicate the results you see here, you need Mathematica and the game logs and event files for the years used in the examples.  These files can be downloaded from http://www.retrosheet.org/

"BaseballTesting_7.gif"

Evaluating an expression consisting of a question mark followed by any name above gives information on that name.  For example:

"BaseballTesting_8.gif"

"BaseballTesting_9.gif"

"BaseballTesting_10.gif"

"BaseballTesting_11.gif"

Currently, TeamCodeList works only for years for which Event Files are available.  A message hinting at that fact appears when you try an earlier year.

"BaseballTesting_12.gif"

"BaseballTesting_13.gif"

"BaseballTesting_14.gif"

A note on timing

Functions that load files, particularly GameLogs, Events, and AllEvents, take a noticeable amount of time to evaluate, but their values are held in memory, so subsequent calls for an already-loaded year/team take negligible time.  Here is the timing for reading the event file of Red Sox home games for 1960.  Note:  Each game's list of events is formatted to show just the teams and date, but the full information is accessible from the output.

"BaseballTesting_15.gif"

"BaseballTesting_16.gif"

"BaseballTesting_17.gif"

Retrieving the same list of games a second time takes far less time.

"BaseballTesting_18.gif"

"BaseballTesting_19.gif"

"BaseballTesting_20.gif"

Of course, timing will also vary by computer.  I use a PowerBook G4 at home and a Mac Pro at work.  The Mac Pro takes roughly 15-20% of the time that the PowerBook takes in a typical call involving file input.  The timings above are from the PowerBook.
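
The caching described in this section is what Mathematica programmers usually call memoization. The following is a minimal sketch of that technique, not the package's actual code: slowLoad and cachedLoad are hypothetical names, and Pause stands in for reading a file.

```mathematica
(* Hypothetical stand-in for a slow file read; not part of the Baseball` package *)
slowLoad[team_String, year_Integer] := (Pause[1]; {team, year, "games"})

(* Memoized wrapper: the first call stores its value as a new, specific definition *)
cachedLoad[team_String, year_Integer] :=
  cachedLoad[team, year] = slowLoad[team, year]

First[AbsoluteTiming[cachedLoad["BOS", 1960]]]  (* about one second *)
First[AbsoluteTiming[cachedLoad["BOS", 1960]]]  (* close to zero *)
```

The second call finds the stored value for cachedLoad["BOS", 1960] and never reaches slowLoad. AbsoluteTiming is used rather than Timing because Pause consumes wall-clock time, not CPU time.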

Ballpark Testing

"BaseballTesting_21.gif"

"BaseballTesting_22.gif"

Major League parks located in Massachusetts.

"BaseballTesting_23.gif"

"BaseballTesting_24.gif"

"New" ballparks of the 1950's.  None are still active.

"BaseballTesting_25.gif"

"BaseballTesting_26.gif"

Ballparks that were retired in the 1910's.

"BaseballTesting_27.gif"

"BaseballTesting_28.gif"

"BaseballTesting_29.gif"

"BaseballTesting_30.gif"

"BaseballTesting_31.gif"

"BaseballTesting_32.gif"

Game Log Testing

The function GameLogs is fundamental to the use of game logs.  It reads the specified game log and keeps the value in memory for the current session.

"BaseballTesting_33.gif"

"BaseballTesting_34.gif"

Who won the first game played in the 20th century?

"BaseballTesting_35.gif"

"BaseballTesting_36.gif"

Of course, the first game in the game log might not really have been the first game played, since time of day isn't part of the data for 1900.  Here are all the games played on Opening Day 1900.

"BaseballTesting_37.gif"

"BaseballTesting_38.gif"

And here are the winning teams.

"BaseballTesting_39.gif"

"BaseballTesting_40.gif"

Generating standings for selected games is easy to do with the Standings function.  Here are the National League standings in 1908, a very good year for the Cubs!
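
Behind any standings table is a tally of wins and losses per team. The following is a rough sketch of that tally using only built-in functions; it is not the package's Standings implementation, the {winner, loser} pair layout is an assumption, and the results are made up for illustration rather than taken from a game log.

```mathematica
(* Made-up results as {winner, loser} pairs; not real game-log data *)
results = {{"CHN", "PIT"}, {"CHN", "NY1"}, {"NY1", "PIT"},
   {"PIT", "CHN"}, {"NY1", "CHN"}};

teams = Union[Flatten[results]];
wins[t_] := Count[results, {t, _}];
losses[t_] := Count[results, {_, t}];

(* One row per team: code, wins, losses, winning percentage; best record first *)
standings = Sort[
   Table[{t, wins[t], losses[t], N[wins[t]/(wins[t] + losses[t])]}, {t, teams}],
   #1[[4]] > #2[[4]] &];

TableForm[standings, TableHeadings -> {None, {"Team", "W", "L", "Pct"}}]
```

With this sample data the rows come out as NY1 (2-1), CHN (2-2), and PIT (1-2).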

"BaseballTesting_41.gif"

"BaseballTesting_42.gif"

What National League team dominated the years of WWII?

"BaseballTesting_43.gif"

"BaseballTesting_44.gif"

What are the standings in games played on the 4th of July since 2000?

"BaseballTesting_45.gif"

"BaseballTesting_46.gif"

What was the average playing time of each team's home games in 1998?

"BaseballTesting_47.gif"

"BaseballTesting_48.gif"

Some information on individual games can be displayed.  Here is opening day at Fenway Park in 1967.

"BaseballTesting_49.gif"

"BaseballTesting_50.gif"

The line score can eventually be added to the scoreboard.

"BaseballTesting_51.gif"

"BaseballTesting_52.gif"

Who were the top ten winning pitchers in 1950?  Right now, this package has no easy way to convert player codes to names for years prior to Event Files.

"BaseballTesting_53.gif"

"BaseballTesting_54.gif"

Who were the losing pitchers for the Red Sox in 1966?

"BaseballTesting_55.gif"

"BaseballTesting_56.gif"

Some information on Bronson Arroyo's wins in 2004.

"BaseballTesting_57.gif"

"BaseballTesting_58.gif"

"BaseballTesting_59.gif"

"BaseballTesting_60.gif"

Here is the final record of the 2004 Red Sox.

"BaseballTesting_61.gif"

"BaseballTesting_62.gif"

Here is how the Red Sox got to the final record.

"BaseballTesting_63.gif"

There are several ways to look at this history.  One is by game-by-game winning percentage.

"BaseballTesting_64.gif"

"BaseballTesting_65.gif"

Another view is to plot the actual successive records.  I like this better since streaks are easier to see.
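
Both views reduce to running sums over the sequence of wins and losses. The following is a minimal sketch under stated assumptions: the outcomes list is made up, and the "games above .500" margin is only one possible reading of "successive records", not necessarily what the package plots.

```mathematica
(* Made-up sequence of results for illustration: 1 = win, 0 = loss *)
outcomes = {1, 1, 0, 1, 0, 0, 1, 1, 1, 0};

(* Winning percentage after each game *)
pct = N[Accumulate[outcomes]/Range[Length[outcomes]]];

(* Wins minus losses after each game: +1 for a win, -1 for a loss *)
margin = Accumulate[2 outcomes - 1];

ListLinePlot[pct, PlotRange -> {0, 1}, AxesLabel -> {"Game", "Pct"}]
ListLinePlot[margin, AxesLabel -> {"Game", "W - L"}]
```

In the margin plot a winning streak shows up as a steady climb and a losing streak as a steady drop, while the percentage curve flattens as the season goes on.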

"BaseballTesting_66.gif"

"BaseballTesting_67.gif"

Event File Testing

The Events function loads a team's event file for a specific year, which holds that team's home games, while AllEvents loads all of the team's games for that year, home and away.  The latter takes significantly longer, since the event files of all teams that the selected team played must also be loaded.  When AllEvents is later evaluated for other teams in the same year, the process is much faster.

"BaseballTesting_68.gif"

"BaseballTesting_69.gif"

The Red Sox opened at home against the Blue Jays in 2004.

"BaseballTesting_70.gif"

"BaseballTesting_71.gif"

The actual value of the output above is a Mathematica expression with head retrogame that contains arguments that correspond to the lines in the event file for that game.  The reason you see "TOR @ BOS on 2004/04/09" as an output is that the expression is formatted because of its length.   Here is a listing of the actual contents:

"BaseballTesting_72.gif"

"BaseballTesting_73.gif"

When AllEvents for a team/year is called for the first time, a list of regular season opponents for that year is printed.  This is likely to be removed in a future version; it is currently in place for testing.

"BaseballTesting_74.gif"

"BaseballTesting_75.gif"

"BaseballTesting_76.gif"

Game log data from a retrogame is easily available.  Here is the data for the Red Sox opening game in 2004.

"BaseballTesting_77.gif"

"BaseballTesting_78.gif"

Rosters are also available for Event File years.

"BaseballTesting_79.gif"

"BaseballTesting_80.gif"

Outside of the Baseball` context you can define other functions, and some can be added to the package in later versions.  Here is one that scans a list of games to identify whether a player appeared in each game.  The Boolean vector of 0's and 1's (1 = "appeared") is displayed next to the player code.
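
A minimal sketch of the idea, assuming each game has been reduced to a plain list of the player codes that appeared in it (the package's retrogame expressions are richer than this). The name appearanceVector and the player codes are hypothetical.

```mathematica
(* Hypothetical stand-in data: each game is just a list of player codes *)
games = {{"playa001", "playb001"}, {"playb001"},
   {"playa001", "playc001"}, {"playb001", "playc001"}};

(* 1 if the player appears in the game, 0 otherwise *)
appearanceVector[player_String, gameList_List] :=
  Boole[MemberQ[#, player]] & /@ gameList

v = appearanceVector["playa001", games]   (* {1, 0, 1, 0} *)
Total[v]                                   (* 2 games played *)
```

Because the vector holds 0's and 1's, Total gives the number of games in which the player appeared.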

"BaseballTesting_81.gif"

This reminds us that Trot Nixon missed a significant part of the 2004 season.

"BaseballTesting_82.gif"

"BaseballTesting_83.gif"

Counting the 1's produces a result that agrees with the stats on Trot's Retrosheet record.

"BaseballTesting_84.gif"

"BaseballTesting_85.gif"

Here are the "appearance vectors" for the pitchers on the 2004 Red Sox roster.   

"BaseballTesting_86.gif"

"BaseballTesting_87.gif"

A density plot of the vectors shows some interesting patterns.  It should be clear which rows correspond to starters.  Of course this "work in progress" could be improved by displaying the player code (or even better, name) to the side of each row in the plot.

This is a Mathematica 6.0 issue:  The density plot that follows is blurred, unlike the previous version's evaluation of the same expression.  This needs to be addressed.

"BaseballTesting_88.gif"

"BaseballTesting_89.gif"

Team code lists such as this one for 2006 can be generated for years in which event files are present.

"BaseballTesting_90.gif"

"BaseballTesting_91.gif"

The team code lists are used for the function FromTeamCode.

"BaseballTesting_92.gif"

"BaseballTesting_93.gif"

Here are the Indians' regular season opponents in 2006.

"BaseballTesting_94.gif"

"BaseballTesting_95.gif"

Conversion from player code to player name is currently limited.  You need to enter the team and year with the code and it works only for event file years.

"BaseballTesting_96.gif"

"BaseballTesting_97.gif"

"BaseballTesting_98.gif"

"BaseballTesting_99.gif"

The functions OffensiveStarter and StartingBattingOrder extract starter information.

"BaseballTesting_100.gif"

"BaseballTesting_101.gif"

"BaseballTesting_102.gif"

"BaseballTesting_103.gif"

"BaseballTesting_104.gif"

"BaseballTesting_105.gif"

Here is a sampling of batting orders from the 1960 Red Sox home games.

"BaseballTesting_106.gif"

"BaseballTesting_107.gif"

Not much has been done to develop tools for working with play data, but some things can be easily programmed.  Here are the 11 unintentional four-pitch walks to David Ortiz in 2006.

"BaseballTesting_108.gif"

"BaseballTesting_109.gif"

"BaseballTesting_110.gif"

"BaseballTesting_111.gif"

"BaseballTesting_112.gif"

A function to count pitches in an event has been written.  Here is the number of pitches to David Ortiz in each of his 2006 at-bats.

"BaseballTesting_113.gif"

"BaseballTesting_114.gif"

Here is a sorted list for Kevin Youkilis.

"BaseballTesting_115.gif"

"BaseballTesting_116.gif"

Weather data can be extracted for some years.  The Red Sox opening game in 2006 was in Texas, on the 93rd day of the year.
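
The day number can be checked with built-in date arithmetic. This sketch assumes the opener was played on April 3, 2006, a date that is not stated in the text; dayOfYear is a hypothetical helper, not a package function.

```mathematica
(* Day of the year, counting January 1 as day 1 *)
dayOfYear[{y_Integer, m_Integer, d_Integer}] :=
  Round[(AbsoluteTime[{y, m, d}] - AbsoluteTime[{y, 1, 1}])/86400] + 1

dayOfYear[{2006, 4, 3}]   (* 93 *)
```

January, February, and March of 2006 account for 31 + 28 + 31 = 90 days, so April 3 is day 93.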

"BaseballTesting_117.gif"

"BaseballTesting_118.gif"

The temperature was, as expected, warm.

"BaseballTesting_119.gif"

"BaseballTesting_120.gif"

Plots for the whole year show that in both 2000 and 2006 the Red Sox played some cool early season games.

"BaseballTesting_121.gif"

The downward spike just before day 225 in the 2006 plot is due to an error in the White Sox event file for 2006.

"BaseballTesting_122.gif"

"BaseballTesting_123.gif"

"BaseballTesting_124.gif"

"BaseballTesting_125.gif"

Roster Files

Do I have access to the Roster file for a specific team and year?

"BaseballTesting_126.gif"

"BaseballTesting_127.gif"

"BaseballTesting_128.gif"

"BaseballTesting_129.gif"

"BaseballTesting_130.gif"

"BaseballTesting_131.gif"

"BaseballTesting_132.gif"

"BaseballTesting_133.gif"

What is the roster for the St. Louis Cardinals in 1911?

"BaseballTesting_134.gif"

"BaseballTesting_135.gif"

Who played for both the 1975 and 1986 Boston Red Sox?

"BaseballTesting_136.gif"

"BaseballTesting_137.gif"

Who played on all three Oakland A's championship teams in the 1970's?
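
Questions like this reduce to a set intersection of rosters. The following is a minimal sketch with made-up rosters and hypothetical player codes; real rosters would come from the Retrosheet roster files for 1972, 1973, and 1974.

```mathematica
(* Made-up rosters for illustration only *)
roster1972 = {"playa001", "playb001", "playc001", "playd001"};
roster1973 = {"playb001", "playc001", "playe001"};
roster1974 = {"playc001", "playb001", "playf001"};

(* Players common to all three lists, returned in sorted order *)
Intersection[roster1972, roster1973, roster1974]   (* {"playb001", "playc001"} *)
```

Intersection accepts any number of lists, so the two-team question about the 1975 and 1986 Red Sox is the same call with two arguments.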

"BaseballTesting_138.gif"

"BaseballTesting_139.gif"

FromPlayerCode

"BaseballTesting_140.gif"

"BaseballTesting_141.gif"

"BaseballTesting_142.gif"

"BaseballTesting_143.gif"

"BaseballTesting_144.gif"

"BaseballTesting_145.gif"

"BaseballTesting_146.gif"

"BaseballTesting_147.gif"

Did anyone play in all the Baltimore Orioles' home games in 1988?

"BaseballTesting_148.gif"

"BaseballTesting_149.gif"

"BaseballTesting_150.gif"

I think almost everyone knew about one of them!

I always liked Forbes Field.  Let's look at what happened in the last game there.  First, what was the last year the Pirates played there?

"BaseballTesting_151.gif"

"BaseballTesting_152.gif"

The last home game in 1970, I guess.  Who played in that game?

"BaseballTesting_153.gif"

"BaseballTesting_154.gif"

It was a game against the Cubs.

"BaseballTesting_155.gif"

"BaseballTesting_156.gif"

"BaseballTesting_157.gif"

"BaseballTesting_158.gif"

Who won?

"BaseballTesting_159.gif"

"BaseballTesting_160.gif"

A bit more detail.

"BaseballTesting_161.gif"

"BaseballTesting_162.gif"

A happy ending!

"BaseballTesting_163.gif"

"BaseballTesting_164.gif"

Any home runs?

"BaseballTesting_165.gif"

"BaseballTesting_166.gif"

Al Oliver, right?

"BaseballTesting_167.gif"

"BaseballTesting_168.gif"

TeamsPlayedFor

Don Mossi was one of my favorite players when I was growing up.  His career started before 1957, so this result is partial.

"BaseballTesting_169.gif"

"BaseballTesting_170.gif"

What path did members of the 2004 Boston Red Sox take in getting to Boston in 2004?  Wanted:  A nice way to format and/or visualize this result.

"BaseballTesting_171.gif"

"BaseballTesting_172.gif"

SearchPlayer

"BaseballTesting_173.gif"

"BaseballTesting_174.gif"

"BaseballTesting_175.gif"

What is Joe Torre's player code?  I'm guessing that it is torrj101, but just to be sure, this will generate all player codes that start with torr.

"BaseballTesting_176.gif"

"BaseballTesting_177.gif"

"BaseballTesting_178.gif"

"BaseballTesting_179.gif"

SearchPlayer is one of the most time-consuming functions.  It loads all rosters for all Event File years.  The default setting searches 26×2×3=156 keys in each roster.  A bit of time can be saved if you know that the name you are working with is not common.  The depth can be set to 1 or 2 to reduce the keys to 52 or 104.  After the first search the results come a bit faster, but not blazing fast!

"BaseballTesting_180.gif"

"BaseballTesting_181.gif"

"BaseballTesting_182.gif"

"BaseballTesting_183.gif"

ClearRosters will clear the rosters loaded by SearchPlayer, recovering a significant part of the memory that is used by that function.  The output of ClearRosters is the amount of memory recovered.

"BaseballTesting_184.gif"

"BaseballTesting_185.gif"

Season Plotting

This speaks for itself.  The array that is printed before the graphic may not stay; it is a side effect.

"BaseballTesting_186.gif"

"BaseballTesting_187.gif"

"BaseballTesting_188.gif"

In this version, only Event File season plots can be generated.

Here is the plot of another great season, won by the Giants in a three-game playoff.  And then there were the Mets.

"BaseballTesting_189.gif"

"BaseballTesting_190.gif"

"BaseballTesting_191.gif"

If you're from Philadelphia, you don't want to see this, so I'll leave it unevaluated.

"BaseballTesting_192.gif"

Play Results

A temporary play parsing function:

"BaseballTesting_193.gif"

"BaseballTesting_194.gif"

Did Ted Williams really hit a home run in his last game at Fenway Park?  Of course he did, but let's verify that the data says so.

"BaseballTesting_195.gif"

"BaseballTesting_196.gif"

"BaseballTesting_197.gif"

"BaseballTesting_198.gif"

"BaseballTesting_199.gif"

"BaseballTesting_200.gif"

Innings, InningsWithPitchers and Matchup

On September 9, 1965, Sandy Koufax pitched a perfect game.  Let's take a look at that game. You would need to know that the game was played in LA.

"BaseballTesting_201.gif"

"BaseballTesting_202.gif"

"BaseballTesting_203.gif"

"BaseballTesting_204.gif"

Let's look at the play-by-play of the game.

"BaseballTesting_205.gif"

"BaseballTesting_206.gif"

"BaseballTesting_207.gif"

inning(play(1,0,yound103,10,BX,4/P),play(1,0,beckg101,2,CFC,K/C),play(1,0,willb104,22,CCBBC,K/C))
inning(play(1,1,willm102,10,BX,13),play(1,1,gillj102,1,KX,8#),play(1,1,daviw102,32,BBBCCX,5/P#))
inning(play(2,0,santr102,21,BSBX,2/FL#),play(2,0,banke101,12,BFSS,K),play(2,0,browb101,22,CSBBX,8/L89D))
inning(play(2,1,johnl104,1,KX,3/FL#),play(2,1,fairr101,12,KBCX,8#),play(2,1,lefej101,0,X,8/F8XD))
inning(play(3,0,krugc101,22,SCBBFX,8/F89),play(3,0,kessd101,2,KKX,9/F9#),play(3,0,hendb101,2,CSC,K/C))
inning(play(3,1,parkw101,0,X,63),play(3,1,torbj101,20,BBX,53/G5L),play(3,1,koufs101,1,CX,13/G25))
inning(play(4,0,yound103,2,CFX,3/FL),play(4,0,beckg101,1,CX,9/F9L),play(4,0,willb104,22,BFBFC,K/C))
inning(play(4,1,willm102,21,BFBX,3/G3),play(4,1,gillj102,1,CX,8/F78),play(4,1,daviw102,0,X,4/FL))
inning(play(5,0,santr102,12,CCBX,7/F7),play(5,0,banke101,32,BCFBBFS,K),play(5,0,browb101,32,BBSSBX,63))
inning(play(5,1,johnl104,32,FBFBBB,W),play(5,1,fairr101,0,X,14/SH.1-2),play(5,1,lefej101,0,C,SB3.2-H(E2/TH)(UR)),play(5,1,lefej101,22,C.FBFFBS,K),play(5,1,parkw101,1,FX,13/G1))
inning(play(6,0,krugc101,22,BCSFBX,63),play(6,0,kessd101,1,FX,53/G5S),play(6,0,hendb101,22,FCBBS,K))
inning(play(6,1,torbj101,0,X,8/F8),play(6,1,koufs101,2,SFC,K/C),play(6,1,willm102,0,X,43))
inning(play(7,0,yound103,22,BFBSS,K),play(7,0,beckg101,0,X,9/F9S),play(7,0,willb104,32,BBBCFX,7/F7))
inning(play(7,1,gillj102,11,CBX,53/G56),play(7,1,daviw102,22,BSBFX,3/G3),play(7,1,johnl104,10,BX,D3/F3D),play(7,1,fairr101,31,CBBBX,63/G6))
inning(play(8,0,santr102,0,,NP),sub(kennj105,John Kennedy,1,2,5),play(8,0,santr102,12,BFSC,K/C),play(8,0,banke101,2,FFS,K),play(8,0,browb101,12,SBSS,K))
inning(play(8,1,lefej101,32,BSBSBS,K),play(8,1,parkw101,10,BX,13/G1),play(8,1,torbj101,1,FX,7/F7DW))
inning(play(9,0,krugc101,0,,NP),sub(tracd101,Dick Tracewski,1,6,4),play(9,0,krugc101,22,CSBFFBS,K),play(9,0,kessd101,0,,NP),sub(amalj101,Joey Amalfitano,0,8,11),play(9,0,amalj101,2,CFS,K),play(9,0,hendb101,0,,NP),sub(kuenh101,Harvey Kuenn,0,9,11),play(9,0,kuenh101,22,CBBSS,K))

Both starting pitchers pitched complete games, so the InningsWithPitchers function doesn't provide much additional information, but reading the plays is a bit easier with this output.

"BaseballTesting_208.gif"

"BaseballTesting_209.gif"

Viewing a more recent game with multiple pitchers shows the usefulness of InningsWithPitchers.

"BaseballTesting_210.gif"

"BaseballTesting_211.gif"

"BaseballTesting_212.gif"

"BaseballTesting_213.gif"

Here are the innings with pitchers in the second column.

"BaseballTesting_214.gif"

"BaseballTesting_215.gif"

Bill Mueller's walk-off home run ended the game.  How did Mueller do against Mariano Rivera that year?

"BaseballTesting_216.gif"

"BaseballTesting_217.gif"

This is a bit easier to read.  The NP play is removed, but DI is a stolen base/defensive indifference notation.  So Mueller was 1 for 4 against Rivera in 2004, but the hit was a big one.

"BaseballTesting_218.gif"

"BaseballTesting_219.gif"

How did Tim Wakefield do against the Yankees in 2004, the year after he allowed the walk-off home run to Aaron Boone?

To get an idea of how long this takes, let's load all Red Sox and Yankee home games for 2004.

"BaseballTesting_220.gif"

Now let's look at the match-ups and record how long it takes to generate them.

"BaseballTesting_221.gif"

"BaseballTesting_222.gif"

Aaron Boone?  He didn't play at all in 2004.

"BaseballTesting_223.gif"

"BaseballTesting_224.gif"

Don Drysdale and Juan Marichal were both good hitters.  How did they do against one another in 1963, one of their prime years?

"BaseballTesting_225.gif"

"BaseballTesting_226.gif"

Marichal was 4 for 9 against Drysdale.

"BaseballTesting_227.gif"

"BaseballTesting_228.gif"

Drysdale was 2 for 8 against Marichal.

Select from Games

"BaseballTesting_229.gif"

"BaseballTesting_230.gif"

How many walk-off home runs did the Red Sox have in 2004?

"BaseballTesting_231.gif"

"BaseballTesting_232.gif"

Finding the same result for the 1967 Red Sox brings up an interesting observation on timing.

The order of logical expressions makes a difference in timing.  Normally, I would suggest that the most restrictive condition in a conjunction be first, but the following is a counterexample.  My take on the difference here is that, because Winner consults the GameLogs, it takes longer to evaluate than EndsInResultQ.

"BaseballTesting_233.gif"

"BaseballTesting_234.gif"

"BaseballTesting_235.gif"

"BaseballTesting_236.gif"

Loading a list of games outside of SelectGames - in this case, only loading "BOS" games - speeds up the selection process if you are going to then use the games for another reason.  The order of the conditions is still significant.
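
The effect of condition order comes from the way And (&&) evaluates its arguments from left to right and stops at the first one that is False. The following is a minimal sketch with hypothetical predicates: slowQ plays the role of a test that consults a large data set, and fastQ plays the role of a cheap test on the game itself.

```mathematica
(* Hypothetical predicates: one slow and very selective, one fast and less selective *)
slowQ[n_] := (Pause[0.01]; Mod[n, 50] == 0)
fastQ[n_] := EvenQ[n]

data = Range[100];

(* fastQ first: slowQ runs only for the 50 even numbers *)
First[AbsoluteTiming[Select[data, fastQ[#] && slowQ[#] &]]]

(* slowQ first: slowQ runs for all 100 numbers *)
First[AbsoluteTiming[Select[data, slowQ[#] && fastQ[#] &]]]
```

Both selections return {50, 100}, but the first takes about half as long, even though slowQ is the more restrictive condition. The cost of each test matters as much as how many elements it rejects.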

"BaseballTesting_237.gif"

"BaseballTesting_238.gif"

"BaseballTesting_239.gif"

"BaseballTesting_240.gif"

"BaseballTesting_241.gif"

A third, optional, argument for SelectGames is a label for the search.

"BaseballTesting_242.gif"

"BaseballTesting_243.gif"

"BaseballTesting_244.gif"

How many games in 2005 were played in 95°F or hotter weather?

"BaseballTesting_245.gif"

"BaseballTesting_246.gif"

Three pitch innings

The following function doesn't take into account at-bats that are split by events such as stolen bases or substitutions.

"BaseballTesting_247.gif"

"BaseballTesting_248.gif"

One application of the function is to identify three pitch innings.  I suspect that these were extremely common in the "old days" but they are quite rare now.  Here is a function to identify them.  The last logical condition filters out walk-off endings.

"BaseballTesting_249.gif"

1995 isn't the "old days" but let's compare the number of three pitch innings with 2006.

"BaseballTesting_250.gif"

"BaseballTesting_251.gif"

That's 16 - let's assume that only one such inning occurs in each game.  Let's look at one of them.

"BaseballTesting_252.gif"

"BaseballTesting_253.gif"

Who was the pitcher in the top of the 9th in that game?

"BaseballTesting_254.gif"

"BaseballTesting_255.gif"

"BaseballTesting_256.gif"

"BaseballTesting_257.gif"

Let's look at 2006.

"BaseballTesting_258.gif"

"BaseballTesting_259.gif"

The number is down to 6!    Who was the pitcher in the first game?

"BaseballTesting_260.gif"

"BaseballTesting_261.gif"

"BaseballTesting_262.gif"

"BaseballTesting_263.gif"

"BaseballTesting_264.gif"

"BaseballTesting_265.gif"

A bit of extra coding could generate lists of these innings and pitchers.  
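
As a rough sketch of the pitch counting behind this section, the following works on pitch-sequence strings like those shown in the play-by-play listing earlier (for example "C.FBFFBS"). It is not the package's code. It assumes that every letter in a sequence is a pitch except N ("no pitch"), and that digits, ".", ">", "*" and "+" mark pickoff throws and other non-pitch events; pitchCount and threePitchInningQ are hypothetical names. Like the function above, it ignores at-bats that are split by other events, and it does not filter out walk-off endings.

```mathematica
(* Count the letters in a pitch sequence, skipping N and all non-letter markers *)
pitchCount[seq_String] :=
  Length[Select[Characters[seq], LetterQ[#] && # =!= "N" &]]

pitchCount /@ {"BX", "CFC", "C.FBFFBS", "FBFBBB"}   (* {2, 3, 7, 6} *)

(* An inning is represented here by the list of its pitch-sequence strings *)
threePitchInningQ[pitchSeqs_List] := Total[pitchCount /@ pitchSeqs] == 3

(* The first two innings use sequences from the listing; the third is made up *)
innings = {{"BX", "CFC", "CCBBC"}, {"X", "BBX", "CX"}, {"X", "X", "X"}};
Select[innings, threePitchInningQ]   (* {{"X", "X", "X"}} *)
```

Mapping a test like this over every half inning of every game, and recording the pitcher alongside each match, would produce the lists mentioned above.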

What's Next?

More needs to be done with play data in event files.  A refined system for parsing the play data is a definite need.

It should be possible to generate teamcode and roster files for pre-Event File years.  That would extend some of the functionality that relies on event file directories.

A good franchise list for the purposes of converting between franchise codes and team names would be nice.

Transactions, postseason, and schedules have not been touched.


Created by Wolfram Mathematica 6.0 (02 July 2007)