ESTIMATING PARK FACTORS FOR THE NEGRO LEAGUES
February 28, 2019 by Kevin Johnson
Most serious baseball fans understand that ballparks can have a large impact on statistical performance. However, trying to measure that exact impact often proves difficult. In work we’ve done on Major League ballparks at Seamheads.com, it takes about three full years of data, regressed by about one additional season’s worth of games, to get a decent prediction of how a park will impact runs scored in year four. Some analysts have argued that about TEN (!) years of data would be the proper sample for understanding park effects IF parks did not change during that time. Unfortunately for those of us who would like to calculate an effect, parks do often get changed, old parks are abandoned, and new parks are built, all making it more difficult to pinpoint exactly how a park might impact offense at any point in time.
A ballpark factor is a measure of how a ballpark influences batting events, with run scoring being the primary event measured. It’s usually calculated as an index with 100 representing a neutral park for a league, above 100 meaning a park makes run scoring easier than an average park, and below 100 meaning a park makes scoring harder than an average park. For example, a factor of around 82 would mean run scoring in that park is reduced by around 18% compared to a neutral park. On the other hand, a ballpark factor close to 109 would mean run scoring is increased by around 9% for games played in that particular park. Park factors can also be expressed as plus-or-minus run values, with a -1.00 indicating the park reduces runs scored per game by 1 run for each team.
Typically, for the major leagues, park factors are calculated by looking at home versus away statistics for a given team, perhaps making some adjustments for a team not facing its own pitchers and batters. For the Negro Leagues, however, we of course have complicating factors. For one, teams did not play an even number of home versus away games against each team. For another, teams often played at alternative or neutral sites. Then there’s the problem of having a limited number of games against other blackball teams in a given year. Finally, there are the unbalanced schedules where a team may have played an ‘easier’ mix of opponents at home as opposed to on the road, or vice versa.
Of course at Seamheads we don’t let these little issues stop us from trying to make sense of the Negro Leagues, so we’ll use what we have and see where it takes us.
First, let’s look at all known major blackball versus blackball team scores from 1902 through 1948 (note: some seasons are not complete), and see which parks have hosted the most games:
Schorling Park in Chicago, home of the Chicago White Sox before Comiskey Park was built, easily tops the list, hosting over 8% of all the games we have in our database.
We said earlier that for major league parks you need about three years’ worth of data, which is around 240 games, to do a ‘good’, reliable park factor calculation. Looking at the table, only thirteen parks have hosted more than 240 games, and those are spread out over 30 years in some cases. At this point we can forget about even considering doing a ‘one year’ park factor given the small number of games, but we’ll still hope we can do some type of factor that will encompass multiple years.
We do have one factor working in our favor. Schorling Park, the park used more than any other, was formerly a major league park, known as South Side Park (III). We have major league park factors for this park, and those can give us a hint of what type of park this might be, and even a point of reference to compare the Negro Leagues with the American League.
Here are the calculated one-year runs factors from the seamheads.com ballparks database for Schorling Park, aka South Side Park, with the ranking indicating how much of a pitcher’s park it was in each season:
Using one-year park factors, South Side Park was four times the most pitcher-friendly of the 16 primary MLB parks. This will be a good reference point for us going forward.
Let’s start our analysis by just looking at the historical run scoring in each park:
For our time period, the Negro Leagues averaged just over 10 runs per game (5 per team per game) so that’s a nice number to be able to calculate against. Just looking at total runs we see Schorling Park averaged 8.4 runs per game, which would calculate to a park factor of 82, with a run adjustment per team of almost 1 run less expected to be scored per game than average. Highlighting a few other extreme parks, Stars Park in St. Louis and Catholic Protectory Oval in New York appear to be ‘Coors-like’ in their impact on offense, while Crosley Field, Yankee Stadium and Comiskey Park substantially reduce offense.
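As a rough illustration of the arithmetic in the paragraph above, here is a minimal Python sketch. The function name is hypothetical, and the league average of 10.25 runs per game is an assumed stand-in for “just over 10”, chosen only because it is consistent with the factor of 82 quoted for Schorling Park; it is not a figure taken from the database.

```python
def raw_park_factor(park_runs_per_game, league_runs_per_game):
    """Return (index, per-team run adjustment) for a park, before any corrections."""
    # Index form: 100 is neutral, below 100 favors pitchers.
    index = 100.0 * park_runs_per_game / league_runs_per_game
    # Both teams bat in every game, so the per-team adjustment is half the gap.
    per_team_adjustment = (park_runs_per_game - league_runs_per_game) / 2.0
    return index, per_team_adjustment


# 8.4 R/G for Schorling Park is from the article.
# ASSUMPTION: 10.25 stands in for "just over 10" league runs per game.
index, adjustment = raw_park_factor(8.4, 10.25)
print(round(index), adjustment)
```

With these inputs the index rounds to 82 and the per-team adjustment comes out a little under one run below average, matching the description above. This is the raw, unadjusted number; it still carries the era and team biases discussed next.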
This is a nice start, but there are some biases here we need to get rid of. For one, some parks had more games in the higher offense eras, while others were primarily used in low scoring years. Here’s a chart on run scoring in the Negro Leagues by year:
We see there was a dead ball era in the Negro Leagues pre-1920, except for a bump up in the 1911-1914 seasons, and we see 1921 was the beginning of the ‘live’ ball for the Negro Leagues. So this is one bias we’ll need to correct for.
The other bias we need to adjust for is team bias. Maybe Stars Park had so many runs scored because the St. Louis Stars were such a good hitting club?
The Math-y Details
Feel free to skip this section on methodology, although we’re going to try to avoid high-level mathematics and statistics as much as possible. To take season run scoring and team quality out of play, but still have a good statistical sample size, the method to calculate park factors will be as follows:
Summarize for each season by team the number of runs that team scored in every park. One positive factor for us here about the Negro Leagues is that teams played almost the same lineup every game, as there simply weren’t many bench players to sub in. This means the quality of the offense should be relatively stable for each game. Here’s an example of summing by season by team by park:
1923, St. Louis Stars, Stars Park, 394 Runs, 58 Games, 6.8 R/G
1923, St. Louis Stars, Rickwood Field, 28 Runs, 5 Games, 5.6 R/G
And we do that for every park the Stars played in during 1923. Then we compare the runs scored by each team in each park. In this example we would compare the 6.8 runs per game in Stars Park to the 5.6 in Rickwood Field, and we’d give Stars Park a +1.2 vs. Rickwood, and give Rickwood a -1.2 vs. Stars Park. (We’re going to have a similar pair calculation for the opponent teams in both parks.) We also must figure out how to weight this difference, as we’re looking at 58 games in one park vs. only 5 in the other park. For statistical reasons that we won’t get into because they’re above my head, we can use the harmonic mean of 58 and 5, which is about 9.2, as the weight for this 1.2 R/G difference. We do this for every pair of parks each team played in, then we sum them all together, and that gives us the total weighted +/- of any park against all the other parks.
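The pairwise comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Seamheads code: the function names and the row layout are hypothetical, and the sketch returns both the weighted sum and the total weight because the article does not say whether the final figure is divided by the weights.

```python
from collections import defaultdict
from itertools import combinations


def harmonic_mean(a, b):
    """Harmonic mean of two game counts, used as the weight for a park pair."""
    return 2.0 * a * b / (a + b)


def pairwise_plus_minus(rows):
    """rows: (park, runs, games) tuples for ONE team in ONE season.

    Returns {park: (weighted_diff_sum, weight_sum)}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for (park_a, runs_a, games_a), (park_b, runs_b, games_b) in combinations(rows, 2):
        diff = runs_a / games_a - runs_b / games_b  # R/G in park A minus R/G in park B
        weight = harmonic_mean(games_a, games_b)
        totals[park_a][0] += weight * diff  # plus for one park...
        totals[park_b][0] -= weight * diff  # ...minus for the other
        totals[park_a][1] += weight
        totals[park_b][1] += weight
    return {park: tuple(values) for park, values in totals.items()}


# The two 1923 St. Louis Stars rows quoted in the article.
stars_1923 = [("Stars Park", 394, 58), ("Rickwood Field", 28, 5)]
result = pairwise_plus_minus(stars_1923)
for park, (weighted_sum, weight) in result.items():
    print(park, round(weighted_sum, 2), round(weight, 1))
```

In a full run the same totals would be accumulated across every team and season, including the matching calculation for the opposing teams in each park.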
Some general comments on the methodology and its simplifying assumptions:
1. Comparing as a ‘plus/minus’ exercise versus other parks in the same season should adjust for three things at once: the era (high or low scoring); the mix of OTHER parks in the same season (we are now measuring each park against all other parks in the same season instead of against the historical runs/game); AND the impact of good pitch/no hit or good hit/no pitch teams on run scoring in a park.
2. We’re not considering the opposing team pitchers/defense directly. What we could have done is further restrict our sample to pairs of teams, such as St. Louis-Birmingham in Stars Park vs. St. Louis-Birmingham in Rickwood Field, etc. The argument for doing this would be that the quality of the opposition could impact the number of runs scored by a team in a park. While this is certainly true, it’s probably also true that the INDIVIDUAL pitcher has an even bigger impact, so just because the Stars are playing Birmingham in both parks doesn’t mean the quality of pitcher that they faced would be the same. Ideally, we’d adjust for the individual pitcher, but this gets very complicated just from a computational angle, and it’s not clear exactly how to separate the pitcher’s quality from HIS park. For example, if you use pitcher ERA to adjust the quality and Willie Foster is your opposing pitcher, and his ERA is 2.50, it’s not clear how much of the 2.50 is due to pitching in Schorling Park (which is the unknown we’re trying to calculate in the first place), and how much is due to Foster’s ability. Not restricting by opposition also has the advantage of allowing for a larger sample size. For example, St. Louis in 1923 played a neutral site game at Lebanon, IN against the ABCs. If we restricted by opposition, we would only have two other parks where St. Louis and the ABCs met to compare it to: Stars Park and Washington Park. But by not restricting by opposition, we can use the data point that St. Louis scored 6 runs in that game to compare the Lebanon park to all of the other parks the Stars played in that season, and all other parks the ABCs played in.
3. We ARE considering the fact the home team scores more and the visitor scores less. Home field advantage was historically around 0.5 runs per game in the Negro Leagues. We add expected runs for each home team, and we subtract expected runs for the visitors.
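To make the home field adjustment in point 3 concrete, here is a small Python sketch. It rests on an assumption the article does not spell out: that the 0.5-run advantage is the gap between home and visiting teams and is split evenly, so each side’s expected runs move by half of it. The function name is hypothetical, and the 5.0 baseline is the per-team league average mentioned earlier.

```python
HOME_FIELD_ADVANTAGE = 0.5  # runs per game, the historical figure quoted above


def expected_runs(baseline_runs_per_game, is_home, hfa=HOME_FIELD_ADVANTAGE):
    """Expected runs for one team in one game, shifted for home or road.

    ASSUMPTION: the 0.5-run edge is the home-minus-visitor gap and is split
    evenly, so each side moves by hfa / 2. The article does not confirm this.
    """
    shift = hfa / 2.0
    if is_home:
        return baseline_runs_per_game + shift  # add expected runs for the home team
    return baseline_runs_per_game - shift  # subtract expected runs for the visitors


home = expected_runs(5.0, is_home=True)
away = expected_runs(5.0, is_home=False)
print(home, away)
```

If the 0.5 runs were instead meant to apply to each team separately, the shift would simply be the full `hfa` rather than half of it.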
Back to the Results
We see some changes here. Schorling is now the most extreme pitcher’s park. We know it was a pitcher’s park even in the American League, and in the Negro National League it’s being compared to a few more ‘band box’ parks like Stars Park in St. Louis, so we would somewhat expect it to show as an extreme pitcher’s park.
Parks that were used more in the lower scoring 1940s, like Yankee Stadium, Comiskey Park, and Crosley Field, still show as pitcher friendly, but much less so than before adjusting for run environment.
Northwestern Park in Indianapolis, a pre-1920 ‘dead ball era’ park, now shows to be one of the better hitter parks.
Stars Park, with most of the games there played in the high scoring 1920s, now shows as much less extreme, but is still the best hitting park in the western Negro Leagues.
It’s the same story for Catholic Protectory Oval: 1920s offense adjustments show it to be less extreme than adding 3 runs per game of offense, but still the most hitter friendly park in the Negro Leagues.
Lewis Park and Rickwood Field go from pitcher friendly to neutral. Apparently, both the Memphis and Birmingham teams tended to have good pitching but poor offenses, which were biasing the original results.
One final step we need to do. The 132 games we have for Forbes Field give us data that certainly is not as reliable in the statistical sense as the 954 games we have for Schorling Park. We need to adjust for that uncertainty by regressing these park factors towards the mean. What the ‘right’ regression to apply should be is not an easy number to determine. We mentioned earlier that three years of MLB park data regressed by 80 games is usually a good sample. For the Negro Leagues, given that we’re using data over a park’s entire life, and perhaps introducing more ‘noise’, we’ll use 160 games as our regression point, roughly 2 seasons of MLB home games. If we have 160 games of data for a park calculation, we’ll regress that 50%, so that a +0.50 runs factor would become a regressed park factor estimate of +0.25.
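The regression step can be sketched as follows. The article only states the 160-game, 50% case; the general formula used here, games / (games + 160), is the usual way of regressing toward a mean of zero and is an assumption that happens to reproduce that case. The function name is hypothetical.

```python
def regress_to_mean(observed_runs_factor, games, regression_games=160):
    """Shrink an observed +/- runs factor toward zero (a neutral park).

    ASSUMPTION: the share of the observed factor that is kept is
    games / (games + regression_games), which gives 50% at 160 games.
    """
    weight = games / (games + regression_games)
    return observed_runs_factor * weight


# The worked example from the article: +0.50 observed over 160 games.
print(regress_to_mean(0.50, 160))

# Share of the observed factor kept for the two game counts mentioned above.
for park, games in [("Schorling Park", 954), ("Forbes Field", 132)]:
    print(park, round(games / (games + 160), 3))
```

Under this assumed formula a park with 954 games keeps roughly 86% of its observed factor, while one with 132 games keeps about 45%, which is why the less-used parks move so much more.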
Schorling, of course, barely regresses, but as we go down the list, our uncertainty about the observed park factors increases, so the extremes we saw for Northwestern Park and Catholic Protectory Oval get regressed down quite significantly. Stars Park is still a hitter haven, even with the regressed numbers.
One final note – these latest park factors, along with the calculations for home field advantage, and for ‘strength of schedule’, will very soon be used to update the OPS+ and ERA+ calculations for players on seamheads.com. When that happens, a few players may see some ‘significant’ changes in those calculations.
(NOTE: This post is an updated version of an earlier article that appeared in the October 31, 2012 issue of Outsider Baseball Bulletin).