960.390 - Introduction to Computers for Statistics

960.390-01, Fall 1999, M 7,8 (6:10-9:00pm)

Meeting dates: 10/25, 11/1, 11/8, 11/15



 

REGRESSIONS:

Regression is a statistical methodology that studies the relationship between two or more variables so that one variable (called the DEPENDENT variable) can be predicted from the other or others (called the INDEPENDENT variable(s)). The basic regression procedures in SAS are PROC REG and PROC GLM. Generally speaking, PROC GLM uses more computing time and memory and has fewer options than PROC REG, but it can handle a broader class of models. In particular, PROC REG can only be used to analyze numerical variables, whereas PROC GLM can also fit models with categorical independent variable(s).

 

PROC REG Procedure:

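PROC REG is the basic SAS procedure that performs regression analysis for numerical variables. With various options and statements, it can be used to fit a regression model, plot the variables against one another, and save the predicted values and residuals in a new data set.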

 

The general form of the PROC REG procedure is

 
PROC REG DATA=dataset; 
   MODEL dependent_variable  = independent_variable(s) / <options>;
   PLOT variable1 * variable2 / <options>;
   OUTPUT OUT = newdata <keyword = name ...>;
RUN;

 

In the general form, the OUTPUT OUT = … statement produces a new data set that contains the original data used in the regression model, plus the predicted values and standardized residuals if requested by the appropriate keywords. This new data set can in turn be used by other procedures, for example PROC PLOT to draw residual plots or PROC UNIVARIATE to summarize the residuals.

See page 108 and page 120 for the choice of options.
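
As a rough illustration of the OUTPUT statement, the sketch below fits a small model and prints the new data set. The data set names `demo` and `fitted` and the variable names `yhat`, `resid`, and `sresid` are made up for this sketch, and the eight data rows are only a subset of the tree data used in the example on this page, so the numbers will not match the listing shown there.

```sas
/* Hypothetical sketch: demo, fitted, yhat, resid and sresid are
   names chosen for illustration only. */
data demo;
   input diameter height volume;
   x1 = log(diameter);
   x2 = log(height);
   y  = log(volume);
   cards;
8.3 70 10.3
10.5 72 16.4
11.0 66 15.6
12.9 85 33.8
14.0 78 34.5
16.3 77 42.6
17.9 80 58.3
20.6 87 77.0
;
proc reg data = demo;
   model y = x1 x2;
   /* P= predicted values, R= residuals, STUDENT= studentized residuals */
   output out = fitted p = yhat r = resid student = sresid;
run;
quit;

/* The new data set holds the original variables plus the requested ones */
proc print data = fitted;
   var y yhat resid sresid;
run;
```

The QUIT statement is added because PROC REG is an interactive procedure and otherwise stays active after RUN.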

 

Example:

 
options pagesize = 76 linesize = 70;
data tree;
    input diameter height volume @@;
       x1 = log(diameter);
       x2 = log(height);
       y = log(volume);
       x1x2 = x1*x2;
    cards;
8.3 70 10.3
8.6 65 10.3
8.8 63 10.2
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7
11.0 66 15.6
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
11.3 79 24.2
11.4 76 21.0
11.4 76 21.4
11.7 69 21.3
12.0 75 19.1
12.9 74 22.2
12.9 85 33.8
13.3 86 27.4
13.7 71 25.7
13.8 64 24.9
14.0 78 34.5
14.2 80 31.7
14.5 74 36.3
16.0 72 38.3
16.3 77 42.6
17.3 81 55.4
17.5 82 55.7
17.9 80 58.3
18.0 80 51.5
18.0 80 51.0
20.6 87 77.0
;
proc reg;
   model y = x1 x2;  
 
/* Remark: The model statement specifies the model to be fitted */
 
     output out = outdata r = resid p = yhat; 
 
/* Remark: This statement outputs the residuals and predicted values 
  into data set `outdata'*/
 
   plot  y*y = ' '  y*(x1 x2) = '*'
     x1*y = '*' x1*x1 = ' ' x1*x2 = '*'
     x2*(y x1) = '*' x2*x2=' ' /
       collect vplots = 5 hplots = 3; 
 
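/* Remark: This plot statement requests 9 scatter plots, arranged as a
     3 X 3 scatter plot matrix. The diagonal plots (y*y, x1*x1, x2*x2)
     use a blank plotting symbol and therefore appear empty. The options
     vplots = 5 and hplots = 3 reserve 5 X 3 = 15 plot positions per
     page; the unused positions are not shown here. */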
run;
 
proc plot data = outdata hpercent = 50 vpercent = 33; 
   plot resid*(x1 x2 yhat x1x2) = '*';  
run; 
 
/* Remark: The PROC PLOT step uses the data set `outdata' to make residual plots */ 
 
proc univariate plot data = outdata;
   var resid;
run; 
 
/* Remark: This PROC UNIVARIATE procedure is used to summarize the residuals */ 
 
                            The SAS System                           1
 
 
Model: MODEL1  
Dependent Variable: Y                                                 
 
 
                         Analysis of Variance
 
                         Sum of         Mean
Source          DF      Squares       Square      F Value       Prob>F
 
Model            2      8.12323      4.06161      613.195       0.0001
Error           28      0.18546      0.00662
C Total         30      8.30869
 
    Root MSE       0.08139     R-square       0.9777
    Dep Mean       3.27273     Adj R-sq       0.9761
    C.V.           2.48679
 
                         Parameter Estimates
 
                  Parameter      Standard    T for H0:               
 Variable  DF      Estimate         Error   Parameter=0    Prob > |T|
 
 INTERCEP   1     -6.631617    0.79978973        -8.292        0.0001
 X1         1      1.982650    0.07501061        26.432        0.0001
 X2         1      1.117123    0.20443706         5.464        0.0001
 
                            The SAS System                           2
 
 
    -+----+----+----+-     -+------+------+--     -+----+----+----+-
Y   |                | Y   |                | Y   |                |
  5 +                +   5 +                +   5 +                +
    |                |     |              * |     |            *   |
  4 +                +   4 +           **   +   4 +         **     +
    |                |     |        ** *    |     |       *****    |
  3 +                +   3 +     *****      +   3 +    * *****     +
    |                |     |  *   *         |     |    ***         |
  2 +                +   2 +                +   2 +                +
    |                |     |                |     |                |
    -+----+----+----+-     -+------+------+--     -+----+----+----+-
     2    3    4    5      2.0    2.5    3.0      4.0  4.2  4.4 4.6
         Y                      X1                     X2
 
      -+---+---+---+--       -+-----+-----+--       -+---+---+---+--
X 3.0 +         *    + X 3.0 +              + X 3.0 +         *    +
1     |        *     | 1     |              | 1     |        *     |
      |       *      |       |              |       |      **      |
      |     **       |       |              |       |   * *****    |
  2.5 +    * *       +   2.5 +              +   2.5 +     **  *    +
      |   ***        |       |              |       |    * ***     |
      |              |       |              |       |              |
      | *            |       |              |       |   * *        |
  2.0 +              +   2.0 +              +   2.0 +              +
      -+---+---+---+--       -+-----+-----+--       -+---+---+---+--
       2   3   4   5         2.0   2.5   3.0        4.0 4.2 4.4 4.6
          Y                      X1                     X2
 
      -+---+---+---+--       -+-----+-----+--       -+---+---+---+--
X     |              | X     |              | X     |              |
2 4.6 +              + 2 4.6 +              + 2 4.6 +              +
      |     *   *    |       |       *    * |       |              |
  4.4 +    *** *     +   4.4 +    ** ** **  +   4.4 +              +
      |   *****      |       |    ******    |       |              |
  4.2 + * ***        +   4.2 + **  ***      +   4.2 +              +
      | *            |       |  *           |       |              |
  4.0 +              +   4.0 +              +   4.0 +              +
      |              |       |              |       |              |
      -+---+---+---+--       -+-----+-----+--       -+---+---+---+--
       2   3   4   5         2.0   2.5   3.0        4.0 4.2 4.4 4.6
          Y                      X1                     X2
 
                            The SAS System                           3
 
 
      Plot of RESID*X1='*'.                Plot of RESID*X2='*'.
 
       |                                    |
   0.2 +                                0.2 +
       |                                    |
R      |       * *                   R      |               *  *
e      |       *   *  *              e      |         *  *   *
s      |       *  *   *              s      |              ***
i      |  *    *                     i      |       *  *  *
d  0.0 +   *  **  * **  *            d  0.0 +      *    * **    *
u      |      *                      u      |             *  **
a      |      *   *   *              a      |        *  *   *
l      |                             l      |
       |         *                          |            *
       |        **                          |             *    *
  -0.2 +                               -0.2 +
       -+-------+-------+-------+           -+-------+-------+-------+
       2.0     2.5     3.0    3.5           4.0     4.2     4.4    4.6
 
                   X1                                   X2
 
NOTE: 7 obs hidden.                  NOTE: 6 obs hidden.
 
 
     Plot of RESID*YHAT='*'.              Plot of RESID*X1X2='*'.
 
       |                                    |
   0.2 +                                0.2 +
       |                                    |
R      |        *  *                 R      |          * *
e      |        *   *  *             e      |         *   *  *
s      |        *   *   *            s      |          *  *  *
i      |  *     *                    i      |     *    *
d  0.0 +  *   * * *  **    *         d  0.0 +     *  ****  **   *
u      |        *                    u      |         **
a      |      *    **   *            a      |        *   **   *
l      |                             l      |
       |          *                         |           *
       |         *  *                       |          *  *
  -0.2 +                               -0.2 +
       -+-------+-------+-------+           -+-------+-------+-------+
        2       3       4       5           7.5    10.0    12.5   15.0
 
          Predicted Value of Y                         X1X2
 
NOTE: 6 obs hidden.                  NOTE: 4 obs hidden.
 
 
 
                            The SAS System                           4
 
 
                         Univariate Procedure
 
Variable=RESID         Residual
 
                               Moments
 
               N                31  Sum Wgts         31
               Mean              0  Sum               0
               Std Dev    0.078626  Variance   0.006182
               Skewness   -0.41341  Kurtosis   -0.15837
               USS        0.185463  CSS        0.185463
               CV                .  Std Mean   0.014122
               T:Mean=0          0  Pr>|T|       1.0000
               Num ^= 0         31  Num > 0          16
               M(Sign)         0.5  Pr>=|M|      1.0000
               Sgn Rank         14  Pr>=|S|      0.7888
 
 
                           Quantiles(Def=5)
 
                100% Max  0.129223       99%  0.129223
                 75% Q3   0.073269       95%  0.119002
                 50% Med  0.002431       90%  0.085102
                 25% Q1   -0.05266       10%  -0.07321
                  0% Min  -0.16856        5%  -0.16453
                                          1%  -0.16856
                Range     0.297784                    
                Q3-Q1     0.125929                    
                Mode      -0.16856                    
 
 
                               Extremes
 
                  Lowest    Obs     Highest    Obs
                -0.16856(      15) 0.083801(      14)
                -0.16453(      18) 0.085102(      26)
                -0.14655(      16) 0.113363(      23)
                -0.07321(      19) 0.119002(      17)
                -0.06778(      22) 0.129223(      11)
 
 
           Stem Leaf                     #             Boxplot
              1 123                      3                |   
              0 578889                   6             +-----+
              0 0111233                  7             *--+--*
             -0 4441100                  7             |     |
             -0 77665                    5             +-----+
             -1                                           |   
             -1 765                      3                |   
                ----+----+----+----+              
            Multiply Stem.Leaf by 10**-1          
 
 
                            Normal Probability Plot              
        0.125+                                      +*+*+++ *    
             |                               *****+*             
             |                         ******+                   
       -0.025+                   ******                          
             |             *+****                                
             |      +++++*+                                      
       -0.175+++++*+   *                                         
              +----+----+----+----+----+----+----+----+----+----+
                  -2        -1         0        +1        +2     
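
As a reading aid, the Parameter Estimates and Analysis of Variance tables above can be written out as follows. This is only a worked restatement of the printed numbers, using the fact that the SAS LOG function is the natural logarithm; the symbols D, H and V for diameter, height and volume are introduced here for convenience.

```latex
% Fitted model on the log scale (y = log V, x_1 = log D, x_2 = log H)
\hat{y} = -6.6316 + 1.9827\,x_1 + 1.1171\,x_2

% Back-transformed to the original scale
\hat{V} = e^{-6.6316}\, D^{1.9827}\, H^{1.1171} \approx 0.0013\, D^{1.98}\, H^{1.12}

% R-square and F value recomputed from the sums of squares
R^2 = \frac{8.12323}{8.30869} \approx 0.9777,
\qquad
F = \frac{8.12323 / 2}{0.18546 / 28} \approx 613.2
```

Both recomputed values agree with the listing up to rounding of the printed sums of squares.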
 
 
 

 

 

PROC GLM Procedure:

 

PROC GLM is a more general procedure than PROC REG. In addition to numerical variables, it can handle categorical independent variables, for example gender (male and female), food taste (bad, ok, good, and excellent), and blood type (O, A, B, AB). Although PROC GLM fits more types of models than PROC REG, in many cases it demands more computing resources and provides less output.
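
To make the difference concrete, the sketch below fits the same model two ways: with PROC GLM, where the categorical variable is named in a CLASS statement, and with PROC REG, where the categories have to be coded by hand as 0/1 indicator variables. The data set name `demo` and the indicator names `d2` and `d3` are assumptions made for this sketch, and the twelve rows are only a subset of the tree data, so the output will differ from the listings on this page.

```sas
/* Hypothetical sketch: demo, d2 and d3 are illustrative names. */
data demo;
   input diameter height volume;
   length x2 $ 1;
   x1 = log(diameter);
   y  = log(volume);
   if log(height) > 4.4 then x2 = "3";
   else if log(height) > 4.2 then x2 = "2";
   else x2 = "1";
   /* 0/1 indicators for levels 2 and 3; level 1 is the reference */
   d2 = (x2 = "2");
   d3 = (x2 = "3");
   cards;
8.6 65 10.3
8.8 63 10.2
11.0 66 15.6
13.8 64 24.9
8.3 70 10.3
10.5 72 16.4
14.0 78 34.5
16.3 77 42.6
10.8 83 19.7
12.9 85 33.8
17.5 82 55.7
20.6 87 77.0
;
/* PROC GLM builds the indicator columns itself from the CLASS variable */
proc glm data = demo;
   class x2;
   model y = x1 x2 / solution;
run;
quit;

/* PROC REG needs the numeric indicators d2 and d3 instead of x2 */
proc reg data = demo;
   model y = x1 d2 d3;
run;
quit;
```

The two fits should report the same error sum of squares and R-square. The individual coefficients are parameterized differently, because PROC GLM treats the last level of a CLASS variable as the reference while the hand-coded version above uses level 1.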

 

The general form of the PROC GLM procedure is

 

 
PROC GLM DATA=dataset; 
  CLASS variable(s); /* specify the categorical variable(s) */
  MODEL dependent_variable = independent_variable(s);
  MEANS effects / <options>; 
RUN;

 

Remarks:

  1. The CLASS statement is new. If you have a categorical variable in your model, you have to declare it in this statement. Otherwise, SAS will treat it as if it were continuous, and error messages will occur if your categorical variable is a character variable.
  2. The MEANS statement compares the means of the dependent variable across the levels of the listed categorical variables and determines which of them are statistically the same. The options "tukey" or "scheffe" indicate whether Tukey's or Scheffe's method is used to calculate the minimum significant difference.
  3. The call to PROC GLM can be very complicated, depending upon the complexity of the model and what you are trying to prove. You are advised to check the SAS manual for the details.
  4. There is a PROC ANOVA procedure that can also handle categorical independent variable(s). However, if there are both continuous (numerical) and categorical types of independent variables in the model, only PROC GLM is applicable.
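
Remarks 2 and 4 can be sketched as follows. The first step uses PROC ANOVA for a model with only a categorical variable; the second uses PROC GLM once a continuous variable is added, and requests both Tukey's and Scheffe's comparisons in one MEANS statement. The data set name `demo` is an assumption, and the rows are only a subset of the tree data, so the results will not match the listings on this page.

```sas
/* Hypothetical sketch: demo is an illustrative data set name. */
data demo;
   input diameter height volume;
   length x2 $ 1;
   x1 = log(diameter);
   y  = log(volume);
   if log(height) > 4.4 then x2 = "3";
   else if log(height) > 4.2 then x2 = "2";
   else x2 = "1";
   cards;
8.6 65 10.3
8.8 63 10.2
11.0 66 15.6
13.8 64 24.9
8.3 70 10.3
10.5 72 16.4
14.0 78 34.5
16.3 77 42.6
10.8 83 19.7
12.9 85 33.8
17.5 82 55.7
20.6 87 77.0
;
/* Only a categorical independent variable: PROC ANOVA is sufficient */
proc anova data = demo;
   class x2;
   model y = x2;
   means x2 / tukey;
run;
quit;

/* Continuous x1 plus categorical x2: use PROC GLM */
proc glm data = demo;
   class x2;
   model y = x1 x2;
   means x2 / tukey scheffe;
run;
quit;
```

Scheffe's method is generally the more conservative of the two for pairwise comparisons, which is also what the NOTE in the Scheffe output on this page says.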

Example:

 
options pagesize = 76 linesize = 70;
data tree;
    input diameter height volume @@;
       x1 = log(diameter);
       if (log(height)> 4.4) then x2 = "3";
       else if (log(height)> 4.2) then x2 = "2";
       else x2 = "1";
       y = log(volume);
       label x2 = "CATEGORIZED LOG-HEIGHT: 1 = SHORT 2 = MEDIUM 3 = TALL";
    cards;
8.3 70 10.3
8.6 65 10.3
8.8 63 10.2
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7
11.0 66 15.6
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
11.3 79 24.2
11.4 76 21.0
11.4 76 21.4
11.7 69 21.3
12.0 75 19.1
12.9 74 22.2
12.9 85 33.8
13.3 86 27.4
13.7 71 25.7
13.8 64 24.9
14.0 78 34.5
14.2 80 31.7
14.5 74 36.3
16.0 72 38.3
16.3 77 42.6
17.3 81 55.4
17.5 82 55.7
17.9 80 58.3
18.0 80 51.5
18.0 80 51.0
20.6 87 77.0
;
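proc plot hpercent = 50 vpercent = 33;
   plot  y*(x1 x2) = '*';
run;

/* Remark: This PROC PLOT step plots y against x1 and against the 
   categorized x2 */

proc glm; 
   class x2;

/* Remark: The class statement declares x2 as a categorical variable */

   model y = x1 x2;
     output out = outdata r = resid p = yhat;
   means x2/scheffe;

/* Remark: The means statement compares the means of y across the 
   levels of x2 using Scheffe's method */

run;
proc plot data = outdata hpercent = 50 vpercent = 33; 
   plot resid*(x1 x2 yhat) = '*'; 
run;

/* Remark: The PROC PLOT step uses the data set `outdata' to make residual plots */

proc univariate plot data = outdata;
   var resid;
run;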
 
                           The SAS System                           1
                                        12:33 Monday, October 12, 1998
 
        Plot of Y*X1='*'.                    Plot of Y*X2='*'.
 
Y |                                    Y |
5 +                                    5 +
  |                                      |
  |                                      |
  |                   *                  |                      *
4 +                **                  4 +            *         *
  |               *                      |            *
  |           * *                        |            *         *
  |         *  *                         |  *         *         *
3 +        ****                        3 +            *         *
  |       **                             |  *         *
  |                                      |
  |   **                                 |  *         *
2 +                                    2 +
  --+--------+--------+--------+-        ---+---------+---------+--
   2.0      2.5      3.0      3.5           1         2         3
 
                 X1                                  X2
 
NOTE: 15 obs hidden.                 NOTE: 16 obs hidden.
 
 
NOTE: 4 obs hidden.
                            The SAS System                           2
 
 
                   General Linear Models Procedure
                       Class Level Information
 
                      Class    Levels    Values
 
                      X2            3    1 2 3
 
 
               Number of observations in data set = 31
 
 
                            The SAS System                           3
 
 
                   General Linear Models Procedure
 
Dependent Variable: Y   
 
Source                  DF    Sum of Squares     F Value      Pr > F
 
Model                    3        8.05844483      289.82      0.0001
 
Error                   27        0.25024470
 
Corrected Total         30        8.30868953
 
                  R-Square              C.V.                  Y Mean
 
                  0.969882          2.941644              3.27273172
 
 
Source                  DF         Type I SS     F Value      Pr > F
 
X1                       1        7.92544582      855.11      0.0001
X2                       2        0.13299901        7.17      0.0032
 
Source                  DF       Type III SS     F Value      Pr > F
 
X1                       1        5.85152055      631.35      0.0001
X2                       2        0.13299901        7.17      0.0032
 
 
 
                            The SAS System                           4
 
 
                   General Linear Models Procedure
 
                    Scheffe's test for variable: Y
 
     NOTE: This test controls the type I experimentwise error rate 
           but generally has a higher type II error rate than Tukey's 
           for all pairwise comparisons.
 
         Alpha= 0.05  Confidence= 0.95  df= 27  MSE= 0.009268
                     Critical Value of F= 3.35413
 
  Comparisons significant at the 0.05 level are indicated by '***'.
 
                       Simultaneous            Simultaneous
                           Lower    Difference     Upper
             X2         Confidence    Between   Confidence
         Comparison        Limit       Means       Limit
 
        3    - 2         0.20870     0.33224     0.45577   ***
        3    - 1         0.81365     0.98091     1.14818   ***
 
        2    - 3        -0.45577    -0.33224    -0.20870   ***
        2    - 1         0.51314     0.64868     0.78421   ***
 
        1    - 3        -1.14818    -0.98091    -0.81365   ***
        1    - 2        -0.78421    -0.64868    -0.51314   ***
 
 
                            The SAS System                           5
 
 
                   General Linear Models Procedure
 
                    Scheffe's test for variable: Y
 
     NOTE: This test controls the type I experimentwise error rate 
           but generally has a higher type II error rate than REGWF 
           for all pairwise comparisons
 
                  Alpha= 0.05  df= 27  MSE= 0.009268
                     Critical Value of F= 3.35413
                Minimum Significant Difference= 0.1433
                  WARNING: Cell sizes are not equal.
                Harmonic Mean of cell sizes= 6.055046
 
     Means with the same letter are not significantly different.
 
            Scheffe Grouping              Mean      N  X2
                                   
                           A           3.63508      5  3
                                   
                           B           3.30285     22  2
                                   
                           C           2.65417      4  1
 
 
                            The SAS System                           6
 
 
      Plot of RESID*X1='*'.                Plot of RESID*X2='*'.
 
RESID |                               RESID |
  0.2 +                                 0.2 +
      |       *                             |           *
      |         *    *                      |           *        *
      |          *   *                      |           *
      |  *        *                         |  *        *
      |      **                             |           *
  0.0 +   *   *     ** *                0.0 +  *        *        *
      |  *   **  *   *                      |  *        *        *
      |      *     *                        |           *
      |                                     |
      |         *                           |                    *
      |        ***                          |           *
 -0.2 +                                -0.2 +
      -+-------+-------+-------+-           ---+--------+--------+--
      2.0     2.5     3.0     3.5              1        2        3
 
                   X1                                  X2
 
NOTE: 6 obs hidden.                  NOTE: 15 obs hidden.
 
 
     Plot of RESID*YHAT='*'.
 
RESID |
  0.2 +
      |        *
      |           *   *
      |            *   *
      |  *         *
      |       **
  0.0 +   *    *     * *  *
      |   *  * *** *   *
      |       *      *
      |
      |            *
      |         ***
 -0.2 +
      -+-------+-------+-------+-
       2       3       4       5
 
                  YHAT
 
NOTE: 4 obs hidden.
 
 
 
 
                            The SAS System                           7
 
 
                         Univariate Procedure
 
Variable=RESID
 
                               Moments
 
               N                31  Sum Wgts         31
               Mean              0  Sum               0
               Std Dev    0.091332  Variance   0.008341
               Skewness   -0.04576  Kurtosis   -0.06961
               USS        0.250245  CSS        0.250245
               CV                .  Std Mean   0.016404
               T:Mean=0          0  Pr>|T|       1.0000
               Num ^= 0         31  Num > 0          14
               M(Sign)        -1.5  Pr>=|M|      0.7201
               Sgn Rank         -7  Pr>=|S|      0.8935
 
 
                           Quantiles(Def=5)
 
                100% Max  0.182153       99%  0.182153
                 75% Q3   0.062393       95%  0.150755
                 50% Med  -0.01154       90%  0.127801
                 25% Q1   -0.03715       10%  -0.12842
                  0% Min  -0.17906        5%  -0.17852
                                          1%  -0.17906
                Range     0.361215                    
                Q3-Q1     0.099542                    
                Mode      -0.17906                    
 
 
                               Extremes
 
                  Lowest    Obs     Highest    Obs
                -0.17906(      15) 0.108172(      28)
                -0.17852(      16) 0.127801(      26)
                -0.15681(      19)  0.14478(      17)
                -0.12842(      18) 0.150755(       9)
                -0.07945(      24) 0.182153(      11)
 
 
           Stem Leaf                     #             Boxplot
              1 58                       2                |   
              1 134                      3                |   
              0 679                      3             +-----+
              0 001244                   6             |  +  |
             -0 4333332210              10             *-----*
             -0 855                      3                |   
             -1 3                        1                |   
             -1 886                      3                |   
                ----+----+----+----+              
            Multiply Stem.Leaf by 10**-1          
 
 
                            Normal Probability Plot              
        0.175+                                         *++++*    
             |                                   *+*+*++         
             |                               +***+               
             |                         ++*****                   
             |                 **********                        
             |               **+++                               
             |         ++++*+                                    
       -0.175+    *++++* *                                       
              +----+----+----+----+----+----+----+----+----+----+
                  -2        -1         0        +1        +2