In practice, it is quite common that PCA is used to project high-dimensional data onto a lower-dimensional space, then have the K-means clustering algorithm be applied in a PCA subspace. Below is the result of automatic segmentation of baseball hitters into groups by applying K-means clustering on the first few principal components that account for the lion’s share of the sample variance.
Caterwalling
Thursday, May 1, 2014
K-Means Clustering in a PCA Subspace
In a previous post, we saw a list of the top 250 offensive players in MLB history constructed by Principal Components Analysis (PCA), which transforms high-dimensional data to its lower-dimensional principal components while retaining a high percentage of the sample variation. The PCA projection of high-dimensional data onto a convenient lower-dimensional subspace allows accessible data visualization and provides an opportunity for data segmentation. Segmentation can add detail and structure to a PCA biplot like the one seen below for the reduced MLB players' career batting data.
In practice, it is quite common that PCA is used to project high-dimensional data onto a lower-dimensional space, then have the K-means clustering algorithm be applied in a PCA subspace. Below is the result of automatic segmentation of baseball hitters into groups by applying K-means clustering on the first few principal components that account for the lion’s share of the sample variance.
In practice, it is quite common that PCA is used to project high-dimensional data onto a lower-dimensional space, then have the K-means clustering algorithm be applied in a PCA subspace. Below is the result of automatic segmentation of baseball hitters into groups by applying K-means clustering on the first few principal components that account for the lion’s share of the sample variance.
Wednesday, April 30, 2014
Confidence Intervals for the Pythagorean Formula in Baseball
The so-called "Pythagorean Formula" was invented by sabermetrics pioneer Bill James in the early 1980's and is used to predict a baseball team's winning percentage on the basis of its runs scored (RS) and runs allowed (RA).
The exponent in James' original formula was lambda = 2, which reminded him of the Pythagorean theorem from Euclidean geometry, thus the name stuck. From a statistical perspective, the Pythagorean Formula is a logistic regression model where the response variable is a team's log-odds and the predictor variable is the logarithm of (RS/RA). Fitting a logistic regression model to a historical data set spanning the MLB seasons 1901-2013 gives a best fitting exponent of about 1.86.
Confidence intervals for a baseball team's winning percentage can be obtained by using a Scheffe-type simultaneous prediction band based on a fitted linear regression model that approximates the above logistic regression model. The formula for the confidence interval
is given below.
Confidence intervals for a baseball team's winning percentage can be obtained by using a Scheffe-type simultaneous prediction band based on a fitted linear regression model that approximates the above logistic regression model. The formula for the confidence interval
is given below.
Saturday, January 18, 2014
The Best Offensive Players in Major League Baseball History
Here is my attempt at compiling a list of the best offensive players in Major League Baseball (MLB) history. I wanted this list to be determined by letting the data speak for itself. I started by constructing a dataset of career batting records from the Lahman Baseball Database using the SQL code found here. Data is based on the most recent version (2013 MLB Season) of this database.
My approach to building this list is based on Principal Components Analysis (PCA), a multivariate statistical method for reducing a large number of possibly correlated variables to a few key underlying factors, called "principal components", that explain the variance-covariance structure of these variables. PCA can also be seen as a linear dimensionality reduction technique that projects high-dimensional data onto a lower-dimensional space without losing too much of the original data's sample variation.
I used PCA to derive a metric that I call Offensive Player Grade (OPG). Conceptually, the OPG is a statistic that grades players on a numerical scale and efficiently summarizes an individual player’s career offensive performance into a single number. My OPG statistic is a weighted average of both popular and esoteric (sabermetric) offensive statistics: Runs, Hits, Doubles, Triples, Home Runs, Runs-Batted-In, Stolen Bases, Caught Stealing, Bases-on-Balls, Strikeouts, Intentional Bases-On-Balls, Hit-By-Pitch, Sacrifice Hits, Sacrifice Flies, Grounded-Into-Double-Play, Total Bases, Runs Created, Batting Average, On-Base Percentage, Slugging Average, On-Base Percentage Plus Slugging Average, Total Average, Isolated Power, Secondary Average, and Runs Created Per Game.
Below is my list of the top 250 players in MLB history with respect to Offensive Player Grade (OPG).
My approach to building this list is based on Principal Components Analysis (PCA), a multivariate statistical method for reducing a large number of possibly correlated variables to a few key underlying factors, called "principal components", that explain the variance-covariance structure of these variables. PCA can also be seen as a linear dimensionality reduction technique that projects high-dimensional data onto a lower-dimensional space without losing too much of the original data's sample variation.
I used PCA to derive a metric that I call Offensive Player Grade (OPG). Conceptually, the OPG is a statistic that grades players on a numerical scale and efficiently summarizes an individual player’s career offensive performance into a single number. My OPG statistic is a weighted average of both popular and esoteric (sabermetric) offensive statistics: Runs, Hits, Doubles, Triples, Home Runs, Runs-Batted-In, Stolen Bases, Caught Stealing, Bases-on-Balls, Strikeouts, Intentional Bases-On-Balls, Hit-By-Pitch, Sacrifice Hits, Sacrifice Flies, Grounded-Into-Double-Play, Total Bases, Runs Created, Batting Average, On-Base Percentage, Slugging Average, On-Base Percentage Plus Slugging Average, Total Average, Isolated Power, Secondary Average, and Runs Created Per Game.
Below is my list of the top 250 players in MLB history with respect to Offensive Player Grade (OPG).
Sunday, December 1, 2013
SQL Code for MLB Players' Career Batting Data
The MS Access SQL code below can be used to query the Lahman Baseball Database for the MLB players' career batting dataset:
SELECT trips.*,
lookup_fielding.Primary_Position,
lookup_fielding.Games_at_Primary_Position
FROM (
SELECT deuce.*,
IIF(lookup_allstar.count_distinct_seasons IS NULL, 0, lookup_allstar.count_distinct_seasons) as AllStar_Seasons
FROM (
SELECT ace.*,
IIF(ace.finalGame IS NULL AND cdate(ace.debut) >= #01/01/1982#, 'Yes', 'No') as Active,
IIF(lookup_hof.inducted = 'Y', 'Yes', 'No') as Hall_Of_Fame,
IIF( IIF(lookup_hof.inducted = 'Y', 'Yes', 'No') = 'Yes', 'HOF',
IIF( IIF(ace.finalGame IS NULL AND cdate(ace.debut) >= #01/01/1982#, 'Yes', 'No') = 'Yes', 'Active', 'Not HOF')) as Status
FROM (
SELECT
Master.playerID, Master.hofID, Master.nameFirst, Master.nameLast, Master.debut, Master.finalGame,
sub.G, sub.AB, sub.R, sub.H, sub.Doubles, sub.Triples, sub.HR, sub.RBI, sub.SB, sub.CS, sub.BB, sub.SO, sub.IBB, sub.HBP, sub.SH, sub.SF, sub.GIDP,
(sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) as TB,
round( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF), 1) as RC,
round( sub.H/sub.AB, 3) as BA,
round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF), 3 ) as OBP,
round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3) as SLG,
round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF) + (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3) as OPS,
round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + sub.BB + sub.HBP + sub.SB - sub.CS)/(sub.AB - sub.H + sub.CS + sub.GIDP), 3) as TA,
round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB - sub.H/sub.AB, 3) as ISO,
round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) - sub.H + sub.BB + sub.SB - sub.CS)/sub.AB, 3) as SECA,
round( 27*( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF) )/(sub.AB - sub.H + sub.SH + sub.SF + sub.CS + sub.GIDP), 3) as RC27
FROM (
SELECT
Batting.playerID,
Sum(Batting.G) as G,
Sum(Batting.AB) as AB,
Sum(Batting.R) as R,
Sum(Batting.H) as H,
Sum(Batting.[2B]) as Doubles,
Sum(Batting.[3B]) as Triples,
Sum(Batting.HR) as HR,
Sum(Batting.RBI) as RBI,
Sum(Batting.SB) as SB,
IIF(Sum(Batting.CS) IS NULL, 0, Sum(Batting.CS)) as CS,
IIF(Sum(Batting.BB) IS NULL, 0, Sum(Batting.BB)) as BB,
IIF(Sum(Batting.SO) IS NULL, 0, Sum(Batting.SO)) as SO,
IIF(Sum(Batting.IBB) IS NULL, 0, Sum(Batting.IBB)) as IBB,
IIF(Sum(Batting.HBP) IS NULL, 0, Sum(Batting.HBP)) as HBP,
IIF(Sum(Batting.SH) IS NULL, 0, Sum(Batting.SH)) as SH,
IIF(Sum(Batting.SF) IS NULL, 0, Sum(Batting.SF)) as SF,
IIF(Sum(Batting.GIDP) IS NULL, 0, Sum(Batting.GIDP)) as GIDP
FROM Batting
GROUP BY Batting.playerID
) as sub INNER JOIN Master
ON sub.playerID = Master.playerID
WHERE sub.AB >= 100
GROUP BY Master.playerID, Master.hofID, Master.nameFirst, Master.nameLast, Master.debut, Master.finalGame,
sub.G, sub.AB, sub.R, sub.H, sub.Doubles, sub.Triples, sub.HR, sub.RBI, sub.SB, sub.CS, sub.BB, sub.SO, sub.IBB, sub.HBP, sub.SH, sub.SF, sub.GIDP,
(sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR),
round( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF), 1),
round( sub.H/sub.AB, 3),
round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF), 3 ),
round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3),
round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF) + (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3),
round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + sub.BB + sub.HBP + sub.SB - sub.CS)/(sub.AB - sub.H + sub.CS + sub.GIDP), 3),
round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB - sub.H/sub.AB, 3),
round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) - sub.H + sub.BB + sub.SB - sub.CS)/sub.AB, 3),
round( 27*( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF) )/(sub.AB - sub.H + sub.SH + sub.SF + sub.CS + sub.GIDP), 3)
ORDER BY Master.playerID asc
) as ace LEFT OUTER JOIN (
SELECT hofID, yearid, inducted, category
FROM HallOfFame
WHERE inducted = 'Y'
AND category = 'Player'
) as lookup_hof
ON ace.hofID = lookup_hof.hofID
) as deuce LEFT OUTER JOIN (
SELECT playerID,
COUNT(*) as count_distinct_seasons
FROM (
SELECT DISTINCT playerID,
yearID
FROM AllstarFull
) as temp
GROUP BY playerID
) as lookup_allstar
ON deuce.playerID = lookup_allstar.playerID
) as trips LEFT OUTER JOIN (
SELECT DISTINCT leo.playerID,
rex.Primary_Position,
leo.Games_at_Primary_Position
FROM (
SELECT DISTINCT playerID, max(Games_at_POS) as Games_at_Primary_Position
FROM (
SELECT DISTINCT playerID,
POS,
sum(G) as Games_at_POS
FROM Fielding
WHERE NOT(POS = 'OF' AND yearID >= 1954)
GROUP BY playerID, POS
ORDER BY playerID, sum(G) desc
) as sub_field
GROUP BY playerID
) as leo INNER JOIN (
SELECT DISTINCT playerID,
POS as Primary_Position,
sum(G) as Games_at_Primary_Position
FROM Fielding
WHERE NOT(POS = 'OF' AND yearID >= 1954)
GROUP BY playerID, POS
ORDER BY playerID, sum(G) desc
) as rex
ON leo.playerID = rex.playerID AND leo.Games_at_Primary_Position = rex.Games_at_Primary_Position
GROUP BY leo.playerID, rex.Primary_Position, leo.Games_at_Primary_Position
) as lookup_fielding
ON trips.playerID = lookup_fielding.playerID
ORDER BY trips.playerID asc
;
SELECT trips.*,
lookup_fielding.Primary_Position,
lookup_fielding.Games_at_Primary_Position
FROM (
SELECT deuce.*,
IIF(lookup_allstar.count_distinct_seasons IS NULL, 0, lookup_allstar.count_distinct_seasons) as AllStar_Seasons
FROM (
SELECT ace.*,
IIF(ace.finalGame IS NULL AND cdate(ace.debut) >= #01/01/1982#, 'Yes', 'No') as Active,
IIF(lookup_hof.inducted = 'Y', 'Yes', 'No') as Hall_Of_Fame,
IIF( IIF(lookup_hof.inducted = 'Y', 'Yes', 'No') = 'Yes', 'HOF',
IIF( IIF(ace.finalGame IS NULL AND cdate(ace.debut) >= #01/01/1982#, 'Yes', 'No') = 'Yes', 'Active', 'Not HOF')) as Status
FROM (
SELECT
Master.playerID, Master.hofID, Master.nameFirst, Master.nameLast, Master.debut, Master.finalGame,
sub.G, sub.AB, sub.R, sub.H, sub.Doubles, sub.Triples, sub.HR, sub.RBI, sub.SB, sub.CS, sub.BB, sub.SO, sub.IBB, sub.HBP, sub.SH, sub.SF, sub.GIDP,
(sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) as TB,
round( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF), 1) as RC,
round( sub.H/sub.AB, 3) as BA,
round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF), 3 ) as OBP,
round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3) as SLG,
round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF) + (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3) as OPS,
round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + sub.BB + sub.HBP + sub.SB - sub.CS)/(sub.AB - sub.H + sub.CS + sub.GIDP), 3) as TA,
round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB - sub.H/sub.AB, 3) as ISO,
round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) - sub.H + sub.BB + sub.SB - sub.CS)/sub.AB, 3) as SECA,
round( 27*( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF) )/(sub.AB - sub.H + sub.SH + sub.SF + sub.CS + sub.GIDP), 3) as RC27
FROM (
SELECT
Batting.playerID,
Sum(Batting.G) as G,
Sum(Batting.AB) as AB,
Sum(Batting.R) as R,
Sum(Batting.H) as H,
Sum(Batting.[2B]) as Doubles,
Sum(Batting.[3B]) as Triples,
Sum(Batting.HR) as HR,
Sum(Batting.RBI) as RBI,
Sum(Batting.SB) as SB,
IIF(Sum(Batting.CS) IS NULL, 0, Sum(Batting.CS)) as CS,
IIF(Sum(Batting.BB) IS NULL, 0, Sum(Batting.BB)) as BB,
IIF(Sum(Batting.SO) IS NULL, 0, Sum(Batting.SO)) as SO,
IIF(Sum(Batting.IBB) IS NULL, 0, Sum(Batting.IBB)) as IBB,
IIF(Sum(Batting.HBP) IS NULL, 0, Sum(Batting.HBP)) as HBP,
IIF(Sum(Batting.SH) IS NULL, 0, Sum(Batting.SH)) as SH,
IIF(Sum(Batting.SF) IS NULL, 0, Sum(Batting.SF)) as SF,
IIF(Sum(Batting.GIDP) IS NULL, 0, Sum(Batting.GIDP)) as GIDP
FROM Batting
GROUP BY Batting.playerID
) as sub INNER JOIN Master
ON sub.playerID = Master.playerID
WHERE sub.AB >= 100
GROUP BY Master.playerID, Master.hofID, Master.nameFirst, Master.nameLast, Master.debut, Master.finalGame,
sub.G, sub.AB, sub.R, sub.H, sub.Doubles, sub.Triples, sub.HR, sub.RBI, sub.SB, sub.CS, sub.BB, sub.SO, sub.IBB, sub.HBP, sub.SH, sub.SF, sub.GIDP,
(sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR),
round( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF), 1),
round( sub.H/sub.AB, 3),
round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF), 3 ),
round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3),
round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF) + (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3),
round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + sub.BB + sub.HBP + sub.SB - sub.CS)/(sub.AB - sub.H + sub.CS + sub.GIDP), 3),
round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB - sub.H/sub.AB, 3),
round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) - sub.H + sub.BB + sub.SB - sub.CS)/sub.AB, 3),
round( 27*( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF) )/(sub.AB - sub.H + sub.SH + sub.SF + sub.CS + sub.GIDP), 3)
ORDER BY Master.playerID asc
) as ace LEFT OUTER JOIN (
SELECT hofID, yearid, inducted, category
FROM HallOfFame
WHERE inducted = 'Y'
AND category = 'Player'
) as lookup_hof
ON ace.hofID = lookup_hof.hofID
) as deuce LEFT OUTER JOIN (
SELECT playerID,
COUNT(*) as count_distinct_seasons
FROM (
SELECT DISTINCT playerID,
yearID
FROM AllstarFull
) as temp
GROUP BY playerID
) as lookup_allstar
ON deuce.playerID = lookup_allstar.playerID
) as trips LEFT OUTER JOIN (
SELECT DISTINCT leo.playerID,
rex.Primary_Position,
leo.Games_at_Primary_Position
FROM (
SELECT DISTINCT playerID, max(Games_at_POS) as Games_at_Primary_Position
FROM (
SELECT DISTINCT playerID,
POS,
sum(G) as Games_at_POS
FROM Fielding
WHERE NOT(POS = 'OF' AND yearID >= 1954)
GROUP BY playerID, POS
ORDER BY playerID, sum(G) desc
) as sub_field
GROUP BY playerID
) as leo INNER JOIN (
SELECT DISTINCT playerID,
POS as Primary_Position,
sum(G) as Games_at_Primary_Position
FROM Fielding
WHERE NOT(POS = 'OF' AND yearID >= 1954)
GROUP BY playerID, POS
ORDER BY playerID, sum(G) desc
) as rex
ON leo.playerID = rex.playerID AND leo.Games_at_Primary_Position = rex.Games_at_Primary_Position
GROUP BY leo.playerID, rex.Primary_Position, leo.Games_at_Primary_Position
) as lookup_fielding
ON trips.playerID = lookup_fielding.playerID
ORDER BY trips.playerID asc
;
Sunday, July 28, 2013
Mandelbrot Set Animation
Just wanted to show off my 'Mandelbrot Set' (http://en.wikipedia.org/wiki/Mandelbrot_set) animation, which was produced entirely in the R statistical programming environment.
Sunday, May 26, 2013
The Ice Cream Containers Problem
Here's a fun little math problem (for folks of any age) to play with.
There are three different containers of ice cream. Each container has 100 gallons of ice cream of either all Chocolate (in container X), all Vanilla (in container Y), or all Strawberry (in container Z).
There are two robots with ice cream scoops controlled by an activation button. Each robot scoops a certain amount of ice cream from a container and deposits this into another container. Robot 1 simultaneously transfers 2/3 of the ice cream in container X to container Y, and 1/2 of the ice cream in container Y to container X. Robot 2 simultaneously transfers 1/6 of the ice cream in container Y to container Z, and transfers 1/3 of the ice cream in container Z to container X.
The activation button can be pushed again to repeat this process. Note that after the first button push, container X will no longer contain all Chocolate ice cream (it will actually contain 1/3 of the original amount of Chocolate ice cream plus 1/2 of the Vanilla ice cream originally in container Y plus 1/3 of the Strawberry ice cream originally in container Z).
Question 1: As the number of button pushes gets very large, how many gallons of ice cream are in containers X, Y and Z?
Question 2: As the number of button pushes gets very large, how much Chocolate, Vanilla, and Strawberry ice cream can be found in containers X, Y and Z?
Subscribe to:
Posts (Atom)