REGRESSIONS:
Regression is a statistical methodology that studies the relationship
between two or more variables so that one variable (called the DEPENDENT
variable) can be predicted from the other, or others (called the
INDEPENDENT variable(s)).
The basic regression procedures in SAS are PROC REG and PROC GLM. Generally
speaking, PROC GLM uses more computing time and memory and has fewer
options than PROC REG, but it can handle a broader class of models. In
particular, PROC REG can only be used to analyze numerical variables,
whereas PROC GLM can in addition fit models with categorical independent
variable(s).
PROC REG Procedure:
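PROC REG is the basic SAS procedure that performs regression analysis for numerical variables. With various options and statements, it can be used to fit a model, plot variables against one another, and save the predicted values and residuals in a new data set.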
The general form of the PROC REG procedure is
   PROC REG DATA=dataset;
      MODEL dependent_variable = independent_variable(s) / <options>;
      PLOT variable1 * variable2 <options>;
      OUTPUT OUT = newdata <options>;
   RUN;
In the general form, the OUTPUT OUT= statement produces a new data set that contains the original data used in the regression model, plus the predicted values and standardized residuals if requested by the appropriate keywords. This new data set can in turn be used for further analysis, such as the residual plots and residual summary in the example below.
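The keywords mentioned above can be sketched as follows. This is only an illustration: the data set name mydata and the variable names y, x1, x2, yhat, resid, and stdres are placeholders, not names from this handout. P=, R=, and STUDENT= are the OUTPUT keywords for predicted values, residuals, and studentized residuals.

```sas
/* Sketch only: mydata, y, x1, x2 are placeholder names. */
proc reg data=mydata;
   model y = x1 x2;
   /* P= predicted values, R= raw residuals,
      STUDENT= studentized (standardized) residuals */
   output out=newdata p=yhat r=resid student=stdres;
run;

/* The new data set holds the original variables plus the
   three requested columns. */
proc print data=newdata;
   var y x1 x2 yhat resid stdres;
run;
```

Each keyword is followed by the name you want the new column to have, so the names on the right-hand side are free choices.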
Example:
   options pagesize = 76 linesize = 70;

   data tree;
      input diameter height volume @@;
      x1 = log(diameter);
      x2 = log(height);
      y = log(volume);
      x1x2 = x1*x2;
      cards;
    8.3 70 10.3    8.6 65 10.3    8.8 63 10.2   10.5 72 16.4
   10.7 81 18.8   10.8 83 19.7   11.0 66 15.6   11.0 75 18.2
   11.1 80 22.6   11.2 75 19.9   11.3 79 24.2   11.4 76 21.0
   11.4 76 21.4   11.7 69 21.3   12.0 75 19.1   12.9 74 22.2
   12.9 85 33.8   13.3 86 27.4   13.7 71 25.7   13.8 64 24.9
   14.0 78 34.5   14.2 80 31.7   14.5 74 36.3   16.0 72 38.3
   16.3 77 42.6   17.3 81 55.4   17.5 82 55.7   17.9 80 58.3
   18.0 80 51.5   18.0 80 51.0   20.6 87 77.0
   ;

   proc reg;
      model y = x1 x2;
      /* Remark: The model statement is used to specify the fitted
         model */
      output out = outdata r = resid p = yhat;
      /* Remark: This statement outputs the residuals and predicted
         values into data set `outdata' */
      plot y*y = ' ' y*(x1 x2) = '*'
           x1*y = '*' x1*x1 = ' ' x1*x2 = '*'
           x2*(y x1) = '*' x2*x2=' '
           / collect vplots = 5 hplots = 3;
      /* Remark: This plot statement produces 9 scatter plots.
         (Actually there are 5 X 3 = 15 panels, but the last six
         panels are empty and not plotted here) */
   run;

   proc plot data = outdata hpercent = 50 vpercent = 33;
      plot resid*(x1 x2 yhat x1x2) = '*';
   run;
   /* Remark: The PROC PLOT step uses the data set `outdata' to make
      residual plots */

   proc univariate plot data = outdata;
      var resid;
   run;
   /* Remark: This PROC UNIVARIATE procedure is used to summarize the
      residuals */
                            The SAS System                          1

Model: MODEL1
Dependent Variable: Y

                         Analysis of Variance

                         Sum of         Mean
Source          DF      Squares       Square      F Value    Prob>F

Model            2      8.12323      4.06161      613.195    0.0001
Error           28      0.18546      0.00662
C Total         30      8.30869

    Root MSE       0.08139     R-square       0.9777
    Dep Mean       3.27273     Adj R-sq       0.9761
    C.V.           2.48679

                         Parameter Estimates

                  Parameter      Standard    T for H0:
Variable  DF       Estimate         Error   Parameter=0    Prob > |T|

INTERCEP   1      -6.631617    0.79978973        -8.292        0.0001
X1         1       1.982650    0.07501061        26.432        0.0001
X2         1       1.117123    0.20443706         5.464        0.0001

[Page 2: 3 x 3 matrix of scatter plots from the PLOT statement. The rows
have Y, X1, and X2 on the vertical axis; the columns have Y, X1, and X2
on the horizontal axis. The diagonal panels (Y*Y, X1*X1, X2*X2) are
blank.]

[Page 3: residual plots from PROC PLOT, each with RESID (Residual) on
the vertical axis: RESID*X1 (NOTE: 7 obs hidden), RESID*X2 (NOTE: 6 obs
hidden), RESID*YHAT, labelled Predicted Value of Y (NOTE: 6 obs hidden),
and RESID*X1X2 (NOTE: 4 obs hidden).]

Page 4:

                         Univariate Procedure

Variable=RESID          Residual

                               Moments

        N                31      Sum Wgts         31
        Mean              0      Sum               0
        Std Dev    0.078626      Variance   0.006182
        Skewness   -0.41341      Kurtosis   -0.15837
        USS        0.185463      CSS        0.185463
        CV                .      Std Mean   0.014122
        T:Mean=0          0      Pr>|T|       1.0000
        Num ^= 0         31      Num > 0          16
        M(Sign)         0.5      Pr>=|M|      1.0000
        Sgn Rank         14      Pr>=|S|      0.7888

                          Quantiles(Def=5)

        100% Max   0.129223       99%   0.129223
         75% Q3    0.073269       95%   0.119002
         50% Med   0.002431       90%   0.085102
         25% Q1    -0.05266       10%   -0.07321
          0% Min   -0.16856        5%   -0.16453
                                   1%   -0.16856
        Range      0.297784
        Q3-Q1      0.125929
        Mode       -0.16856

                              Extremes

        Lowest      Obs     Highest     Obs
        -0.16856(    15)    0.083801(    14)
        -0.16453(    18)    0.085102(    26)
        -0.14655(    16)    0.113363(    23)
        -0.07321(    19)    0.119002(    17)
        -0.06778(    22)    0.129223(    11)

        Stem Leaf            #
           1 123             3
           0 578889          6
           0 0111233         7
          -0 4441100         7
          -0 77665           5
          -1
          -1 765             3
        Multiply Stem.Leaf by 10**-1

[The box plot printed beside the stem-and-leaf display and the Normal
Probability Plot of RESID (vertical axis from -0.175 to 0.125, horizontal
axis from -2 to +2) are not reproduced here.]
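The summary statistics on page 1 of the PROC REG listing can all be derived from the sums of squares and degrees of freedom in the Analysis of Variance table. The DATA step below is a small sketch of that arithmetic. The numbers are copied from the listing; the variable names are made up for illustration and are not part of the original program.

```sas
/* Sketch: recompute the PROC REG summary statistics from the
   Analysis of Variance table. Variable names are illustrative. */
data _null_;
   ss_model = 8.12323;  df_model = 2;
   ss_error = 0.18546;  df_error = 28;
   ss_total = 8.30869;  df_total = 30;
   dep_mean = 3.27273;

   ms_model = ss_model / df_model;      /* Mean Square, Model  */
   ms_error = ss_error / df_error;      /* Mean Square, Error  */
   f_value  = ms_model / ms_error;      /* F Value             */
   root_mse = sqrt(ms_error);           /* Root MSE            */
   r_square = ss_model / ss_total;      /* R-square            */
   adj_rsq  = 1 - (1 - r_square) * df_total / df_error;  /* Adj R-sq */
   cv       = 100 * root_mse / dep_mean;                 /* C.V.     */

   /* Write the results to the SAS log */
   put f_value= root_mse= r_square= adj_rsq= cv=;
run;
```

Because the inputs are the rounded values printed in the listing, the results should agree with the listing only to about three significant digits.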
PROC GLM Procedure:
PROC GLM is a more general procedure than PROC REG. In addition, it can
handle categorical independent variables, where a categorical variable
can be, for example, gender (Male and Female), food taste (bad, ok, good,
and excellent), or blood type (O, A, B, AB). Although PROC GLM fits more
types of models than PROC REG, in many cases it demands more computing
resources and provides less output.
The call to PROC GLM can be very complicated, depending upon the complexity
of the model and what you are trying to prove. The usage of PROC GLM is:
   PROC GLM DATA=dataset;
      CLASS variable(s);   /* specify the categorical variable(s) */
      MODEL dependent_variable = independent_variable(s);
      MEANS effects / <options>;
   RUN;
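As one illustration of the general form, the sketch below fills in the options. The names mydata, y, x, and group are placeholders. SOLUTION is the MODEL option that asks PROC GLM to print the parameter estimates, which it does not print by default when the model contains CLASS variables. TUKEY is one of the multiple-comparison options accepted by MEANS.

```sas
/* Sketch only: mydata, y, x, and group are placeholder names. */
proc glm data=mydata;
   class group;                     /* group is categorical        */
   model y = x group / solution;    /* also print the estimates    */
   means group / tukey;             /* pairwise comparisons of the
                                       group means                 */
run;
```

The CLASS statement must come before the MODEL statement, and every effect named in MEANS must be a CLASS variable that also appears in the model.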
Remarks:
Example:

   options pagesize = 76 linesize = 70;

   data tree;
      input diameter height volume @@;
      x1 = log(diameter);
      /* cut log(height) into three classes */
      if (log(height) > 4.4) then x2 = "3";
      else if (log(height) > 4.2) then x2 = "2";
      else x2 = "1";
      y = log(volume);
      label x2 = "CATEGORIZED LOG-HEIGHT: 1 = SHORT 2 = MEDIUM 3 = TALL";
      cards;
    8.3 70 10.3    8.6 65 10.3    8.8 63 10.2   10.5 72 16.4
   10.7 81 18.8   10.8 83 19.7   11.0 66 15.6   11.0 75 18.2
   11.1 80 22.6   11.2 75 19.9   11.3 79 24.2   11.4 76 21.0
   11.4 76 21.4   11.7 69 21.3   12.0 75 19.1   12.9 74 22.2
   12.9 85 33.8   13.3 86 27.4   13.7 71 25.7   13.8 64 24.9
   14.0 78 34.5   14.2 80 31.7   14.5 74 36.3   16.0 72 38.3
   16.3 77 42.6   17.3 81 55.4   17.5 82 55.7   17.9 80 58.3
   18.0 80 51.5   18.0 80 51.0   20.6 87 77.0
   ;

   proc plot hpercent = 50 vpercent = 33;
      plot y*(x1 x2) = '*';
   run;

   proc glm;
      class x2;
      model y = x1 x2;
      output out = outdata r = resid p = yhat;
      means x2/scheffe;
   run;

   proc plot data = outdata hpercent = 50 vpercent = 33;
      plot resid*(x1 x2 yhat) = '*';
   run;

   proc univariate plot data = outdata;
      var resid;
   run;
              The SAS System              1
                                 12:33 Monday, October 12, 1998

[Page 1: two scatter plots from PROC PLOT with Y on the vertical axis:
Y*X1 (X1 from 2.0 to 3.5) and Y*X2 (X2 = 1, 2, 3). The listing reports
NOTE: 15 obs hidden, NOTE: 16 obs hidden, and NOTE: 4 obs hidden.]

Page 2:

                    General Linear Models Procedure
                        Class Level Information

                       Class    Levels    Values
                       X2            3    1 2 3

                Number of observations in data set = 31

Page 3:

                    General Linear Models Procedure

Dependent Variable: Y

Source              DF    Sum of Squares     F Value     Pr > F
Model                3        8.05844483      289.82     0.0001
Error               27        0.25024470
Corrected Total     30        8.30868953

                  R-Square          C.V.          Y Mean
                  0.969882      2.941644      3.27273172

Source              DF         Type I SS     F Value     Pr > F
X1                   1        7.92544582      855.11     0.0001
X2                   2        0.13299901        7.17     0.0032

Source              DF       Type III SS     F Value     Pr > F
X1                   1        5.85152055      631.35     0.0001
X2                   2        0.13299901        7.17     0.0032

Page 4:

                    General Linear Models Procedure

                     Scheffe's test for variable: Y

NOTE: This test controls the type I experimentwise error rate but
      generally has a higher type II error rate than Tukey's for all
      pairwise comparisons.

         Alpha= 0.05  Confidence= 0.95  df= 27  MSE= 0.009268
                     Critical Value of F= 3.35413

   Comparisons significant at the 0.05 level are indicated by '***'.

                Simultaneous                 Simultaneous
                    Lower      Difference        Upper
      X2         Confidence      Between      Confidence
  Comparison        Limit         Means          Limit

  3    - 2         0.20870       0.33224        0.45577   ***
  3    - 1         0.81365       0.98091        1.14818   ***
  2    - 3        -0.45577      -0.33224       -0.20870   ***
  2    - 1         0.51314       0.64868        0.78421   ***
  1    - 3        -1.14818      -0.98091       -0.81365   ***
  1    - 2        -0.78421      -0.64868       -0.51314   ***

Page 5:

                    General Linear Models Procedure

                     Scheffe's test for variable: Y

NOTE: This test controls the type I experimentwise error rate but
      generally has a higher type II error rate than REGWF for all
      pairwise comparisons

                Alpha= 0.05  df= 27  MSE= 0.009268
                    Critical Value of F= 3.35413
               Minimum Significant Difference= 0.1433
                 WARNING: Cell sizes are not equal.
               Harmonic Mean of cell sizes= 6.055046

     Means with the same letter are not significantly different.

          Scheffe Grouping          Mean      N    X2
                         A       3.63508      5    3
                         B       3.30285     22    2
                         C       2.65417      4    1

[Page 6: residual plots from PROC PLOT, each with RESID on the vertical
axis: RESID*X1 (NOTE: 6 obs hidden), RESID*X2 (NOTE: 15 obs hidden), and
RESID*YHAT (NOTE: 4 obs hidden).]

Page 7:

                         Univariate Procedure

Variable=RESID

                               Moments

        N                31      Sum Wgts         31
        Mean              0      Sum               0
        Std Dev    0.091332      Variance   0.008341
        Skewness   -0.04576      Kurtosis   -0.06961
        USS        0.250245      CSS        0.250245
        CV                .      Std Mean   0.016404
        T:Mean=0          0      Pr>|T|       1.0000
        Num ^= 0         31      Num > 0          14
        M(Sign)        -1.5      Pr>=|M|      0.7201
        Sgn Rank         -7      Pr>=|S|      0.8935

                          Quantiles(Def=5)

        100% Max   0.182153       99%   0.182153
         75% Q3    0.062393       95%   0.150755
         50% Med   -0.01154       90%   0.127801
         25% Q1    -0.03715       10%   -0.12842
          0% Min   -0.17906        5%   -0.17852
                                   1%   -0.17906
        Range      0.361215
        Q3-Q1      0.099542
        Mode       -0.17906

                              Extremes

        Lowest      Obs     Highest     Obs
        -0.17906(    15)    0.108172(    28)
        -0.17852(    16)    0.127801(    26)
        -0.15681(    19)     0.14478(    17)
        -0.12842(    18)    0.150755(     9)
        -0.07945(    24)    0.182153(    11)

        Stem Leaf            #
           1 58              2
           1 134             3
           0 679             3
           0 001244          6
          -0 4333332210     10
          -0 855             3
          -1 3               1
          -1 886             3
        Multiply Stem.Leaf by 10**-1

[The box plot printed beside the stem-and-leaf display and the Normal
Probability Plot of RESID (vertical axis from -0.175 to 0.175, horizontal
axis from -2 to +2) are not reproduced here.]
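The introduction notes that PROC REG handles only numerical variables. One way to see what the CLASS statement does for you is to code the three levels of x2 as 0/1 indicator variables by hand and fit the same model with PROC REG. The sketch below assumes the data set tree from the PROC GLM example is still available. The names tree2, d2, and d3 are made up for illustration.

```sas
/* Sketch: reproduce the PROC GLM fit with hand-made indicator
   (dummy) variables. tree2, d2, and d3 are illustrative names. */
data tree2;
   set tree;
   d2 = (x2 = "2");   /* 1 if medium, 0 otherwise */
   d3 = (x2 = "3");   /* 1 if tall,   0 otherwise */
run;

proc reg data=tree2;
   model y = x1 d2 d3;
   /* Joint test that both indicator coefficients are zero,
      i.e. that x2 has no effect after adjusting for x1 */
   test d2 = 0, d3 = 0;
run;
```

Level 1 (short) is the reference class here, so the coefficients of d2 and d3 are differences from that class. The 2-degree-of-freedom TEST should correspond to the Type III test for X2 in the PROC GLM listing.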