Thursday, May 1, 2014

K-Means Clustering in a PCA Subspace

In a previous post, we saw a list of the top 250 offensive players in MLB history constructed by Principal Components Analysis (PCA), which transforms high-dimensional data to its lower-dimensional principal components while retaining a high percentage of the sample variation. The PCA projection of high-dimensional data onto a convenient lower-dimensional subspace allows accessible data visualization and provides an opportunity for data segmentation. Segmentation can add detail and structure to a PCA biplot like the one seen below for the reduced MLB players' career batting data.




In practice, it is quite common that PCA is used to project high-dimensional data onto a lower-dimensional space, then have the K-means clustering algorithm be applied in a PCA subspace. Below is the result of automatic segmentation of baseball hitters into groups by applying K-means clustering on the first few principal components that account for the lion’s share of the sample variance.



Wednesday, April 30, 2014

Confidence Intervals for the Pythagorean Formula in Baseball

The so-called "Pythagorean Formula" was invented by sabermetrics pioneer Bill James in the early 1980's and is used to predict a baseball team's winning percentage on the basis of its runs scored (RS) and runs allowed (RA).


The exponent in James' original formula was lambda = 2, which reminded him of the Pythagorean theorem from Euclidean geometry, thus the name stuck. From a statistical perspective, the Pythagorean Formula is a logistic regression model where the response variable is a team's log-odds and the predictor variable is the logarithm of (RS/RA). Fitting a logistic regression model to a historical data set spanning the MLB seasons 1901-2013 gives a best fitting exponent of about 1.86.




Confidence intervals for a baseball team's winning percentage can be obtained by using a Scheffe-type simultaneous prediction band based on a fitted linear regression model that approximates the above logistic regression model. The formula for the confidence interval 
is given below.



Saturday, January 18, 2014

The Best Offensive Players in Major League Baseball History

Here is my attempt at compiling a list of the best offensive players in Major League Baseball (MLB) history. I wanted this list to be determined by letting the data speak for itself. I started by constructing a dataset of career batting records from the Lahman Baseball Database using the SQL code found here. Data is based on the most recent version (2013 MLB Season) of this database.

My approach to building this list is based on Principal Components Analysis (PCA), a multivariate statistical method for reducing a large number of possibly correlated variables to a few key underlying factors, called "principal components", that explain the variance-covariance structure of these variables. PCA can also be seen as a linear dimensionality reduction technique that projects high-dimensional data onto a lower-dimensional space without losing too much of the original data's sample variation. 

I used PCA to derive a metric that I call Offensive Player Grade (OPG). Conceptually, the OPG is a statistic that grades players on a numerical scale and efficiently summarizes an individual player’s career offensive performance into a single number. My OPG statistic is a weighted average of both popular and esoteric (sabermetric) offensive statistics: Runs, Hits, Doubles, Triples, Home Runs, Runs-Batted-In, Stolen Bases, Caught Stealing, Bases-on-Balls, Strikeouts, Intentional Bases-On-Balls, Hit-By-Pitch, Sacrifice Hits, Sacrifice Flies, Grounded-Into-Double-Play, Total Bases, Runs Created, Batting Average, On-Base Percentage, Slugging Average, On-Base Percentage Plus Slugging Average, Total Average, Isolated Power, Secondary Average, and Runs Created Per Game. 

Below is my list of the top 250 players in MLB history with respect to Offensive Player Grade (OPG).











Sunday, December 1, 2013

SQL Code for MLB Players' Career Batting Data

The MS Access SQL code below can be used to query the Lahman Baseball Database for the MLB players' career batting dataset:

SELECT trips.*,
    lookup_fielding.Primary_Position,
    lookup_fielding.Games_at_Primary_Position
FROM    (

    SELECT  deuce.*,
        IIF(lookup_allstar.count_distinct_seasons IS NULL, 0, lookup_allstar.count_distinct_seasons) as AllStar_Seasons
    FROM     (
        SELECT  ace.*,
            IIF(ace.finalGame IS NULL AND cdate(ace.debut) >= #01/01/1982#, 'Yes', 'No') as Active,
            IIF(lookup_hof.inducted = 'Y', 'Yes', 'No') as Hall_Of_Fame,
            IIF(  IIF(lookup_hof.inducted = 'Y', 'Yes', 'No') = 'Yes', 'HOF',
                      IIF( IIF(ace.finalGame IS NULL AND cdate(ace.debut) >= #01/01/1982#, 'Yes', 'No') = 'Yes', 'Active', 'Not HOF')) as Status
        FROM     (
            SELECT
                Master.playerID, Master.hofID, Master.nameFirst, Master.nameLast, Master.debut, Master.finalGame,
                 sub.G, sub.AB, sub.R, sub.H, sub.Doubles, sub.Triples, sub.HR, sub.RBI, sub.SB, sub.CS, sub.BB, sub.SO, sub.IBB, sub.HBP, sub.SH, sub.SF, sub.GIDP,
                (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) as TB,
                round( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF), 1) as RC,
                round( sub.H/sub.AB, 3) as BA,
                round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF), 3 ) as OBP,
                round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3) as SLG,
                round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF) + (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3) as OPS,
                round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + sub.BB + sub.HBP + sub.SB - sub.CS)/(sub.AB - sub.H + sub.CS + sub.GIDP), 3) as TA,
                round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB - sub.H/sub.AB, 3) as ISO,
                round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) - sub.H + sub.BB + sub.SB - sub.CS)/sub.AB, 3) as SECA,
                round( 27*( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF) )/(sub.AB - sub.H + sub.SH + sub.SF + sub.CS + sub.GIDP), 3) as RC27
            FROM     (
                SELECT
                    Batting.playerID,
                    Sum(Batting.G) as G,
                    Sum(Batting.AB) as AB,
                    Sum(Batting.R) as R,
                    Sum(Batting.H) as H,
                    Sum(Batting.[2B]) as Doubles,
                    Sum(Batting.[3B]) as Triples,
                    Sum(Batting.HR) as HR,
                    Sum(Batting.RBI) as RBI,
                    Sum(Batting.SB) as SB,
                    IIF(Sum(Batting.CS) IS NULL, 0, Sum(Batting.CS)) as CS,
                    IIF(Sum(Batting.BB) IS NULL, 0, Sum(Batting.BB)) as BB,
                    IIF(Sum(Batting.SO) IS NULL, 0, Sum(Batting.SO)) as SO,
                    IIF(Sum(Batting.IBB) IS NULL, 0, Sum(Batting.IBB)) as IBB,
                    IIF(Sum(Batting.HBP) IS NULL, 0, Sum(Batting.HBP)) as HBP,
                    IIF(Sum(Batting.SH) IS NULL, 0, Sum(Batting.SH)) as SH,
                    IIF(Sum(Batting.SF) IS NULL, 0, Sum(Batting.SF)) as SF,
                    IIF(Sum(Batting.GIDP) IS NULL, 0, Sum(Batting.GIDP)) as GIDP
                FROM         Batting
                GROUP BY     Batting.playerID
                ) as sub INNER JOIN Master
                      ON sub.playerID = Master.playerID
                WHERE   sub.AB >= 100
            GROUP BY     Master.playerID, Master.hofID, Master.nameFirst, Master.nameLast, Master.debut, Master.finalGame,
                     sub.G, sub.AB, sub.R, sub.H, sub.Doubles, sub.Triples, sub.HR, sub.RBI, sub.SB, sub.CS, sub.BB, sub.SO, sub.IBB, sub.HBP, sub.SH, sub.SF, sub.GIDP,
                     (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR),
                    round( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF), 1),
                    round( sub.H/sub.AB, 3),
                    round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF), 3 ),
                    round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3),
                    round( (sub.H + sub.BB + sub.HBP)/(sub.AB + sub.BB + sub.HBP + sub.SF) + (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB, 3),
                    round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + sub.BB + sub.HBP + sub.SB - sub.CS)/(sub.AB - sub.H + sub.CS + sub.GIDP), 3),
                    round( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR)/sub.AB - sub.H/sub.AB, 3),
                    round( ((sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) - sub.H + sub.BB + sub.SB - sub.CS)/sub.AB, 3),
                    round( 27*( (sub.H + sub.BB - sub.CS + sub.HBP - sub.GIDP)*( (sub.H + sub.Doubles + 2*sub.Triples + 3*sub.HR) + 0.26*(sub.BB - sub.IBB + sub.HBP) + 0.52*(sub.SH + sub.SF + sub.SB) )/(sub.AB + sub.BB + sub.HBP + sub.SH + sub.SF) )/(sub.AB - sub.H + sub.SH + sub.SF + sub.CS + sub.GIDP), 3)
            ORDER BY Master.playerID asc
            ) as ace LEFT OUTER JOIN (
                         SELECT hofID, yearid, inducted, category
                          FROM     HallOfFame
                          WHERE     inducted = 'Y'
                               AND category = 'Player'
                            ) as lookup_hof
                                  ON ace.hofID = lookup_hof.hofID
        ) as deuce LEFT OUTER JOIN (
                       SELECT playerID,
                          COUNT(*) as count_distinct_seasons
                          FROM (
                        SELECT DISTINCT   playerID,
                                  yearID
                        FROM AllstarFull
                            ) as temp
                          GROUP BY playerID
                              ) as lookup_allstar
                                 ON deuce.playerID = lookup_allstar.playerID
    ) as trips LEFT OUTER JOIN (         
                   SELECT DISTINCT     leo.playerID,
                            rex.Primary_Position,
                            leo.Games_at_Primary_Position
                   FROM    (
                    SELECT DISTINCT playerID, max(Games_at_POS) as Games_at_Primary_Position
                    FROM     (
                        SELECT DISTINCT playerID,
                                POS,
                                sum(G) as Games_at_POS
                        FROM     Fielding
                        WHERE NOT(POS = 'OF' AND yearID >= 1954)
                        GROUP BY playerID, POS
                        ORDER BY playerID, sum(G) desc
                        ) as sub_field
                    GROUP BY playerID
                    ) as leo INNER JOIN (
                                    SELECT DISTINCT     playerID,
                                            POS as Primary_Position,
                                            sum(G) as Games_at_Primary_Position
                                    FROM     Fielding
                                WHERE NOT(POS = 'OF' AND yearID >= 1954)              
                                    GROUP BY playerID, POS
                                    ORDER BY playerID, sum(G) desc
                                    ) as rex
                                   ON leo.playerID = rex.playerID AND leo.Games_at_Primary_Position = rex.Games_at_Primary_Position
                   GROUP BY leo.playerID, rex.Primary_Position, leo.Games_at_Primary_Position
                   ) as lookup_fielding
                         ON trips.playerID = lookup_fielding.playerID
ORDER BY trips.playerID asc
;

Sunday, July 28, 2013

Mandelbrot Set Animation

Just wanted to show off my 'Mandelbrot Set' (http://en.wikipedia.org/wiki/Mandelbrot_set) animation, which was produced entirely in the R statistical programming environment.


Sunday, May 26, 2013

The Ice Cream Containers Problem

Here's a fun little math problem (for folks of any age) to play with.

There are three different containers of ice cream. Each container has 100 gallons of ice cream of either all Chocolate (in container X), all Vanilla (in container Y), or all Strawberry (in container Z). 

There are two robots with ice cream scoops controlled by an activation button. Each robot scoops a certain amount of ice cream from a container and deposits this into another container. Robot 1 simultaneously transfers 2/3 of the ice cream in container X to container Y, and 1/2 of the ice cream in container Y to container X. Robot 2 simultaneously transfers 1/6 of the ice cream in container Y to container Z, and transfers 1/3 of the ice cream in container Z to container X.


The activation button can be pushed again to repeat this process. Note that after the first button push, container X will no longer contain all Chocolate ice cream (it will actually contain 1/3 of the original amount of Chocolate ice cream plus 1/2 of the Vanilla ice cream originally in container Y plus 1/3 of the Strawberry ice cream originally in container Z).

Question 1: As the number of button pushes gets very large, how many gallons of ice cream are in containers X, Y and Z?

Question 2: As the number of button pushes gets very large, how much Chocolate, Vanilla, and Strawberry ice cream can be found in containers X, Y and Z?