
Generalized Linear Models

In [1]: import numpy as np

In [2]: import statsmodels.api as sm

In [3]: from scipy import stats

In [4]: from matplotlib import pyplot as plt

Example of using GLM on binomial response data. The response vector in this case is N by 2 (successes, failures). These data are taken with permission from Jeff Gill (2000), Generalized Linear Models: A Unified Approach.

The response variable is (number of students above the national math median, number of students below).

The explanatory variables are (in column order):
the proportion of low income families, “LOWINC”;
the proportions of minority students, “PERASIAN”, “PERBLACK”, “PERHISP”;
the percentage of minority teachers, “PERMINTE”;
the median teacher salary including benefits in 1000s, “AVSALK”;
the mean teacher experience in years, “AVYRSEXP”;
the per-pupil expenditures in thousands, “PERSPENK”;
the pupil-teacher ratio, “PTRATIO”;
the percent of students taking college credit courses, “PCTAF”;
the percentage of charter schools in the district, “PCTCHRT”;
the percent of schools in the district operating year round, “PCTYRRND”;
and the interaction terms “PERMINTE_AVYRSEXP”, “PERMINTE_AVSAL”, “AVYRSEXP_AVSAL”,
“PERSPEN_PTRATIO”, “PERSPEN_PCTAF”, “PTRATIO_PCTAF”,
“PERMINTE_AVYRSEXP_AVSAL”, “PERSPEN_PTRATIO_PCTAF”.
In [5]: data = sm.datasets.star98.load()

In [6]: data.exog = sm.add_constant(data.exog)

The response variable is (successes, failures). E.g., the first observation is

In [7]: print data.endog[0]
[ 452.  355.]

The total number of trials for this observation is the sum of successes and failures:
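
print data.endog[0].sum()   # 452. + 355. = 807.0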

In [8]: glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())

In [9]: binom_results = glm_binom.fit()

The estimated model parameters are

In [10]: print binom_results.params
[-0.01681504  0.00992548 -0.01872421 -0.01423856  0.25448717  0.24069366
  0.08040867 -1.9521605  -0.33408647 -0.16902217  0.0049167  -0.00357996
 -0.01407656 -0.00400499 -0.0039064   0.0917143   0.04898984  0.00804074
  0.00022201 -0.00224925  2.95887793]

The corresponding t-values are

In [11]: print binom_results.tvalues
[-38.74908321  16.50473627 -25.1821894  -32.81791308   8.49827113
   4.21247925   5.7749976   -6.16191078  -5.45321673  -5.16865445
   3.92119964 -15.87825999  -7.39093058  -8.44963886  -4.05916246
   6.3210987    6.57434662   5.36229044   7.42806363  -6.44513698
   1.91301155]
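
These t-values are simply the coefficient estimates divided by their standard errors, which are held in the bse attribute. A quick check:

print binom_results.params / binom_results.bse   # reproduces binom_results.tvalues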

It is common in GLMs with interactions to compare first differences: we are interested in how a change in an explanatory variable affects the response variable. This example uses the interquartile difference for the percentage of low income households, holding the other explanatory variables constant at their means.

In [12]: means = data.exog.mean(axis=0)

In [13]: means25 = means.copy()

In [14]: means25[0] = stats.scoreatpercentile(data.exog[:,0], 25)

In [15]: means75 = means.copy()

In [16]: means75[0] = stats.scoreatpercentile(data.exog[:,0], 75)

In [17]: resp_25 = binom_results.predict(means25)

In [18]: resp_75 = binom_results.predict(means75)

In [19]: diff = resp_75 - resp_25

The interquartile first difference for the percentage of low income households in a school district is

In [20]: print diff*100
-11.8752509918
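
Since the Binomial family uses the logit link by default, predict here is just the inverse logit of the linear predictor. A minimal sketch reproducing resp_25 by hand:

xb = np.dot(means25, binom_results.params)   # linear predictor at the 25th-percentile profile
print 1. / (1. + np.exp(-xb))                # inverse logit; should match resp_25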

In [21]: means0 = means.copy()

In [22]: means100 = means.copy()

In [23]: means0[0] = data.exog[:,0].min()

In [24]: means100[0] = data.exog[:,0].max()

In [25]: resp_0 = binom_results.predict(means0)

In [26]: resp_100 = binom_results.predict(means100)

In [27]: diff_full = resp_100 - resp_0

In [28]: print """The full range difference is %2.4f %%""" % (diff_full*100)
The full range difference is -36.1636 %

In [29]: nobs = binom_results.nobs

In [30]: y = data.endog[:,0]/data.endog.sum(1)

In [31]: yhat = binom_results.mu
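
Here mu holds the fitted mean response, i.e. the predicted proportion of successes in each district; it should agree with calling predict on the original design matrix:

print np.allclose(yhat, binom_results.predict(data.exog))   # expected: True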

Plot of yhat vs y

In [32]: plt.figure()
Out[32]: <matplotlib.figure.Figure at 0x6d8ff90>

In [33]: plt.scatter(yhat, y)
Out[33]: <matplotlib.collections.PathCollection at 0x6194590>

In [34]: line_fit = sm.OLS(y, sm.add_constant(yhat)).fit().params

In [35]: fit = lambda x: line_fit[1]+line_fit[0]*x # better way in scipy?

In [36]: plt.plot(np.linspace(0,1,nobs), fit(np.linspace(0,1,nobs)))
Out[36]: [<matplotlib.lines.Line2D at 0x6e3f110>]

In [37]: plt.title('Model Fit Plot')
Out[37]: <matplotlib.text.Text at 0x6cd9150>

In [38]: plt.ylabel('Observed values')
Out[38]: <matplotlib.text.Text at 0x6adad50>

In [39]: plt.xlabel('Fitted values')
Out[39]: <matplotlib.text.Text at 0x6af1e50>
[Figure: Model Fit Plot (glm_fitted.png)]
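
An aside on the "better way in scipy?" comment above: numpy's poly1d builds the same linear function directly from the coefficient pair, since the slope comes first and add_constant appended the intercept last here:

fit = np.poly1d(line_fit)   # equivalent to the lambda above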

Plot of yhat vs. Pearson residuals

In [40]: plt.figure()
Out[40]: <matplotlib.figure.Figure at 0x69713d0>

In [41]: plt.scatter(yhat, binom_results.resid_pearson)
Out[41]: <matplotlib.collections.PathCollection at 0x66cd410>

In [42]: plt.plot([0.0, 1.0],[0.0, 0.0], 'k-')
Out[42]: [<matplotlib.lines.Line2D at 0x6430d10>]

In [43]: plt.title('Residual Dependence Plot')
Out[43]: <matplotlib.text.Text at 0x6924fd0>

In [44]: plt.ylabel('Pearson Residuals')
Out[44]: <matplotlib.text.Text at 0x66e4450>

In [45]: plt.xlabel('Fitted values')
Out[45]: <matplotlib.text.Text at 0x66b8990>
[Figure: Residual Dependence Plot (glm_resids.png)]
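
For reference, the Pearson residuals plotted above can be reproduced by hand, assuming the standard GLM definition (y - mu)/sqrt(V(mu)/n), with V(mu) = mu*(1 - mu) for the binomial family and n the number of trials per district:

n_trials = data.endog.sum(1)   # trials per observation (helper name introduced here)
pearson_manual = (y - yhat) / np.sqrt(yhat * (1. - yhat) / n_trials)
print np.allclose(pearson_manual, binom_results.resid_pearson)   # expected: True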

Histogram of standardized deviance residuals

In [46]: plt.figure()
Out[46]: <matplotlib.figure.Figure at 0x7392310>

In [47]: res = binom_results.resid_deviance.copy()

In [48]: stdres = (res - res.mean())/res.std()

In [49]: plt.hist(stdres, bins=25)
Out[49]: 
(array([ 3,  2,  4, 11, 11, 20, 27, 31, 33, 41, 28, 26, 23, 17, 11,  4,  2,
        3,  1,  2,  1,  0,  1,  0,  1]),
 array([-2.54284768, -2.27049293, -1.99813819, -1.72578344, -1.45342869,
       -1.18107394, -0.9087192 , -0.63636445, -0.3640097 , -0.09165496,
        0.18069979,  0.45305454,  0.72540929,  0.99776403,  1.27011878,
        1.54247353,  1.81482827,  2.08718302,  2.35953777,  2.63189252,
        2.90424726,  3.17660201,  3.44895676,  3.7213115 ,  3.99366625,
        4.266021  ]),
 <a list of 25 Patch objects>)

In [50]: plt.title('Histogram of standardized deviance residuals')
Out[50]: <matplotlib.text.Text at 0x7350f90>
[Figure: Histogram of standardized deviance residuals (glm_hist_res.png)]

QQ Plot of Deviance Residuals

In [51]: plt.figure()
Out[51]: <matplotlib.figure.Figure at 0x741f250>

In [52]: res.sort()

In [53]: p = np.linspace(0 + 1./(nobs-1), 1-1./(nobs-1), nobs)

In [54]: quants = np.zeros_like(res)

In [55]: for i in range(nobs):
   ....:      quants[i] = stats.scoreatpercentile(res, p[i]*100)
   ....: 
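
The loop can be avoided: recent versions of np.percentile accept an array of percentiles, so the same quantiles can be computed in one vectorized call:

quants = np.percentile(res, p * 100)   # equivalent to the loop above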

In [56]: mu = res.mean()

In [57]: sigma = res.std()

In [58]: y = stats.norm.ppf(p, loc=mu, scale=sigma)

In [59]: plt.scatter(y, quants)
Out[59]: <matplotlib.collections.PathCollection at 0x655c4d0>

In [60]: plt.plot([y.min(),y.max()],[y.min(),y.max()],'r--')
Out[60]: [<matplotlib.lines.Line2D at 0x655ce90>]

In [61]: plt.title('Normal - Quantile Plot')
Out[61]: <matplotlib.text.Text at 0x74061d0>

In [62]: plt.ylabel('Deviance Residuals Quantiles')
Out[62]: <matplotlib.text.Text at 0x7431750>

In [63]: plt.xlabel('Quantiles of N(0,1)')
Out[63]: <matplotlib.text.Text at 0x7424650>

In [64]: from statsmodels import graphics

In [65]: img = graphics.gofplots.qqplot(res, line='r')
[Figure: Normal quantile plot of deviance residuals (glm_qqplot.png)]

Example of using GLM with a Gamma family for a proportional count response. Brief description of the data and design:

In [66]: print sm.datasets.scotland.DESCRLONG

This data is based on the example in Gill and describes the proportion of
voters who voted Yes to grant the Scottish Parliament taxation powers.
The data are divided into 32 council districts.  This example's explanatory
variables include the amount of council tax collected in pounds sterling as
of April 1997 per two adults before adjustments, the female percentage of
total claims for unemployment benefits as of January, 1998, the standardized
mortality rate (UK is 100), the percentage of labor force participation,
regional GDP, the percentage of children aged 5 to 15, and an interaction term
between female unemployment and the council tax.

The original source files and variable information are included in
/scotland/src/


In [67]: data2 = sm.datasets.scotland.load()

In [68]: data2.exog = sm.add_constant(data2.exog)

In [69]: glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())

In [70]: glm_results = glm_gamma.fit()
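
The fitted Gamma model can be inspected the same way as the binomial one, e.g. via glm_results.params or glm_results.summary():

print glm_results.params   # estimated coefficients for the Gamma model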

Example of a Gaussian model with a noncanonical (log) link

In [71]: nobs2 = 100

In [72]: x = np.arange(nobs2)

In [73]: np.random.seed(54321)

In [74]: X = np.column_stack((x,x**2))

In [75]: X = sm.add_constant(X)

In [76]: lny = np.exp(-(.03*x + .0001*x**2 - 1.0)) + .001 * np.random.rand(nobs2)

In [77]: gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))

In [78]: gauss_log_results = gauss_log.fit()
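
Because the data were generated as exp(1 - .03*x - .0001*x**2) plus a small uniform noise term, the fitted coefficients should roughly recover the generating values (column order [x, x**2, const], since add_constant appended the constant last):

print gauss_log_results.params   # expect roughly [-0.03, -0.0001, 1.0]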

Check the summary of the binomial model fitted earlier:

In [79]: print binom_results.summary()
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           ['y1', 'y2']   No. Observations:                  303
Model:                            GLM   Df Residuals:                      282
Model Family:                Binomial   Df Model:                           20
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -2998.6
Date:                Fri, 29 Jun 2012   Deviance:                       4078.8
Time:                        21:38:44   Pearson chi2:                 4.05e+03
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0168      0.000    -38.749      0.000        -0.018    -0.016
x2             0.0099      0.001     16.505      0.000         0.009     0.011
x3            -0.0187      0.001    -25.182      0.000        -0.020    -0.017
x4            -0.0142      0.000    -32.818      0.000        -0.015    -0.013
x5             0.2545      0.030      8.498      0.000         0.196     0.313
x6             0.2407      0.057      4.212      0.000         0.129     0.353
x7             0.0804      0.014      5.775      0.000         0.053     0.108
x8            -1.9522      0.317     -6.162      0.000        -2.573    -1.331
x9            -0.3341      0.061     -5.453      0.000        -0.454    -0.214
x10           -0.1690      0.033     -5.169      0.000        -0.233    -0.105
x11            0.0049      0.001      3.921      0.000         0.002     0.007
x12           -0.0036      0.000    -15.878      0.000        -0.004    -0.003
x13           -0.0141      0.002     -7.391      0.000        -0.018    -0.010
x14           -0.0040      0.000     -8.450      0.000        -0.005    -0.003
x15           -0.0039      0.001     -4.059      0.000        -0.006    -0.002
x16            0.0917      0.015      6.321      0.000         0.063     0.120
x17            0.0490      0.007      6.574      0.000         0.034     0.064
x18            0.0080      0.001      5.362      0.000         0.005     0.011
x19            0.0002   2.99e-05      7.428      0.000         0.000     0.000
x20           -0.0022      0.000     -6.445      0.000        -0.003    -0.002
const          2.9589      1.547      1.913      0.057        -0.073     5.990
==============================================================================
