960.390 - Introduction to Computers for Statistics

960.390-01, Fall 1999, M 7,8 (6:10-9:00pm)

Meeting dates: 10/25, 11/1, 11/8, 11/15


| Syllabus | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Home | Email  


 

DATA MANAGEMENT II

 

When there is a lot of data and under more complicated cases, it is often necessary to manipulate data sets within your SAS program, by adding variables, dropping variables, dropping a subset of the observations, or merging data sets. We have covered some of the topics last time. We are going to learn more in this class.

Create a new variable using a LAG function:

Sometimes we need to use a value from the previous observation. The LAG function can create a new variable which contains the value of the previous observation. The form is

new variable = LAG(variable); 
or,
new variable = variable - LAG(variable); /* new variable 
                values are the difference between two consecutive observations */

The following is an example SAS program:

OPTIONS NODATE LINESIZE=80;
DATA JUNE;

INPUT DATE $ WEIGHT;

LAGWT = LAG(WEIGHT);

LOSSWT = WEIGHT - LAG(WEIGHT);

CARDS;

JUNE_10 230

JUNE_20 225

June_30 223

;

RUN;

PROC PRINT;

RUN;

 
 

The SAS lst file is

 
 
                                              The SAS System     
                      OBS DATE    WEIGHT LAGWT LOSSWT
                        1 JUNE_10 230       .    .
                        2 JUNE_20 225     230   -5
                        3 JUNE_30 223     225   -2

DROP/KEEP Statements:

DROP variable list;

or,

KEEP variable list;

---DROP statement excludes the listed variables from the data (keeps the rest)

---KEEP statement keeps only the listed variables (drops the rest)

The following is an example SAS program and its lst file:

OPTIONS NODATE LINESIZE=80;
DATA JUNE;
   INPUT DATE $ HEIGHT WEIGHT;
   KEEP DATE WEIGHT; /*or DROP HEIGHT; */
CARDS;
JUNE_10 171 230
JUNE_20 171 225
JUNE_30 171 223
;
RUN;
PROC PRINT;
RUN;
 
                       The SAS System                                1
                       
                      OBS DATE    WEIGHT
                       1   JUNE_10 230
                       2   JUNE_20 225
                       3   JUNE_30 223                       

Use SET Statement to Combine Data Sets:

Example:

 
INPUT BANK $ ACCTNUM MONEY;
DATA CASH;
   INPUT BANK $ ACCTNUM MONEY;
CARDS;
CHASE 1536253 50.32 
CORESTATES 189273462 1563.82 
FLEET 287363 20000.00
;
RUN;
DATA CASH2;
SET CASH;
RUN;

Now the data set "cash2" is the same as that of "cash". The thing is, you can now add some extra lines to the second DATA step and change "cash" into something that you like better.

---- If BY statement is used, you need to sort the data sets first.

This program is an example of combining two sorted data sets (with BY statement).

The SAS program:

OPTIONS NODATE LINESIZE=80;
DATA JUNE;
   INPUT NAME $ HEIGHT WEIGHT;
CARDS;
ANNE 71 130
EDITH 69 160
CHARLIE 50 110
BERT 70 180
;
RUN;
DATA JULY;
INPUT NAME $ EYES $ HAIR $;
CARDS;
DAVID BROWN BROWN
BERT BLUE BLOND
;
RUN;
PROC SORT DATA=JUNE;
   BY NAME;
RUN;
PROC SORT DATA=JULY;
   BY NAME;
RUN;
DATA BOTH;
   SET JUNE JULY;
   BY NAME;
RUN;
PROC PRINT DATA=BOTH;
RUN;

and the lst file:

 
                       The SAS System 
               OBS   NAME   HEIGHT WEIGHT EYES  HAIR
               1     Anne     71    130 
               2     Bert     70    180 
               3     Bert      .      .   blue  blond
               4     Charlie  50    110 
               5     David     .      .  brown  brown
               6     Edith    69    160

Use MERGE Statement to Merge Data Sets:

MERGE data1 data2;

BY a common variable;

Example 1 (without a BY statement):

OPTIONS NODATE LINESIZE=80;
DATA ONE;
  INPUT DATE $ WEIGHT;
CARDS;
JUNE_10 230
JUNE_10 231
JUNE_20 225
JUNE_30 223
;
RUN;
DATA TWO;
   INPUT DATE $ HEIGHT;
CARDS;
JUNE_10 171
;
RUN;
DATA THREE;
  MERGE ONE TWO;
RUN;
PROC PRINT;
RUN;
 
               The SAS System                                1
 
   OBS DATE    WEIGHT HEIGHT
     1 JUNE_10 230    171
     2 JUNE_10 231     .
     3 JUNE_20 225     .
     4 JUNE_30 223     .

Example 2 (with a BY statement):

 
OPTIONS NODATE LINESIZE=80;
DATA ONE;
   INPUT DATE $ WEIGHT;
CARDS;
JUNE_10 230
JUNE_10 231
JUNE_20 225
JUNE_30 223
;
RUN;
DATA TWO;
   INPUT DATE $ HEIGHT;
CARDS;
JUNE_10 171
;
RUN;
DATA THREE;
   MERGE ONE TWO;
   BY DATE;
RUN;
PROC PRINT;
RUN;
 
               The SAS System                            2
 
        OBS DATE    WEIGHT  HEIGHT
          1 JUNE_10  230    171
          2 JUNE_10  231    171
          3 JUNE_20  225      .
          4 JUNE_30  223      .

Remark: Compare the list files of these two examples and find out their difference.

 

 

Read Limited Number of lines from External Files:

INFILE DATAFILE OBS= 5; /* read total 5 observations (lines) from the external data file */

INFILE DATAFILE FIRSTOBS=5; /* skip the first 4 observations (lines) in the external data file */

INFILE DATAFILE FIRSTOBS=6 OBS=12; /* skip the first 5 observations (lines) and read

the next 7 observations from the external data file */

 

Use PUT statement to Create New External Date Files:

---- The FILE statement is the complement of the INFILE. If a FILE statement is not used, SAS writes to the LOG file

Example:

DATA ONE;
INPUT DATE $ WEIGHT;
   LOSSWT = WEIGHT - LAG(WEIGHT);
CARDS;
JUNE_10 230
JUNE_20 225
JUNE_30 223
;
RUN;
FILENAME TEST 'c:\tmp.dat';
DATA TWO;
   SET ONE;
   FILE TEST;
   PUT DATE LOSSWT WEIGHT;
RUN;

Remark: create an external file 'tmp.dat':

 
JUNE_10 . 230
JUNE_20 -5 225
JUNE_30 -2 223 

SOME USEFUL COMMANDS:

Setting the Size of the Output:

---- Acceptable range from 64 through 256

---- Acceptable range from 15 to 32767

Example (for an 81/2 by 11 inch size paper):  

            OPTIONS LINESIZE=64 PAGESIZE=55;

LABEL Statements:

---- Can appear in DATA steps and PROC steps

Examples:

 
LABEL time = 'time needed to run the program';
LABEL student = 'studen''s names'  
        score = `midterm scores';  /* if label contains a single 
                   quote/apostrophe, use two single quotes */

TITTLE Statements/FOOTNOTE Statements

Example:

 
OPTIONS LINESIZE=64 PAGESIZE=40;
DATA ONE;
   INPUT DATE $ WEIGHT;
   LOSSWT = WEIGHT - LAG(WEIGHT);
CARDS;
JUNE_10 230
JUNE_20 225
JUNE_30 223
;
RUN;
PROC PRINT DATA=ONE;
TITLE "WE";
TITLE2 "LIKE";
TITLE3 "THIS";
FOOTNOTE "BUT";
FOOTNOTE2 "WE";
FOOTNOTE3 "DO NOT";
FOOTNOTE4 "LIKE";
FOOTNOTE5 "THAT";
RUN;

The SAS lst file look like this:

 
 
 
------------------------------------------------------------------------------
 
                                      WE                                      1
                                     LIKE
                                     THIS
 
                        OBS        DATE     WEIGHT   
 
                          1        JUNE_10  230   
                   2        JUNE_20  225                       
                   3        JUNE_30   2231  
 
 
 
 
 
                                     BUT
                                      WE
                                    DO NOT
                                     LIKE
                                     THAT
 
 

 

 

DISCRIPTIVE STATISTICS I:

 
 
 
PROC MEANS 

 

 
 
·                      PROC MEANS is a statistical procedure that everybody uses to describe data. 
 

The usage of the procedure is:

 
 
PROC MEANS options;
RUN;
 

See a web page prepared by Dr. Jesse Parelius here for more details.

 

PROC UNIVARIATE: 

 
 
 
·                      PROC UNIVARIATE generates many statistics to help determine of a sample is from a univariate normal distribution (hence the name). There are two main options: 
 
 
 

---- NORMAL computes a statistics to test whether or not the distribution is likely from a normal distribution

 

---- PLOT generates a stem-and-leaf plot of the data (or a pie chart), a box plot, and a normal probability plot

 
 

The usage of the procedure is:

 
 
PROC UNIVARIATE DATA = dataset <options>;
   BY variables;
   VAR variables;
   ID variables;
RUN;
 

Example:

 
OPTIONS LINESIZE=78 NODATE;
FILENAME DATAIN 'A:\NORMAL.DAT';
DATA MYDATA;
   INFILE DATAIN;
   INPUT VAL;
RUN;
PROC UNIVARIATE NORMAL PLOT DATA=MYDATA;
   VAR VAL;
RUN;
 
A tidy program, generating tons of output:
 
------------------------------------------------------------------------------
 
                                The SAS System                               1
 
                             Univariate Procedure
 
Variable=VAL
 
                                   Moments
 
                   N             10000  Sum Wgts      10000
                   Mean        500.133  Sum         5001330
                   Std Dev    15.80274  Variance   249.7265
                   Skewness   -0.05145  Kurtosis   -0.01344
                   USS        2.5038E9  CSS         2497015
                   CV         3.159707  Std Mean   0.158027
                   T:Mean=0   3164.851  Pr>|T|       0.0001
                   Num ^= 0      10000  Num > 0       10000
                   M(Sign)        5000  Pr>=|M|      0.0001
                   Sgn Rank   25002500  Pr>=|S|      0.0001
                   D:Normal    0.02028  Pr>D           <.01                                  Quantiles(Def=5)                      100% Max       559       99%       536                      75% Q3        511       95%       526                      50% Med       500       90%       520                      25% Q1        489       10%       480                       0% Min       437        5%       474                                               1%       463                     Range          122                                         Q3-Q1           22                                         Mode           506                                                          Extremes                        Lowest    Obs     Highest    Obs                          437(    4183)      551(     635)                          442(    6617)      551(    6023)                          444(    6876)      552(    9521)                          447(    4919)      557(    5035)                          447(    3350)      559(    3199)     ------------------------------------------------------------------------------                                   The SAS System                               2                               Univariate Procedure  Variable=VAL                            Histogram                          #         Boxplot   557.5+*                                                    2            0           .*                                                    3            0           .*                                                   11            0           .**                                                  32            |           .****                                                94            |           .*******                                            157            |           .************                                       311            |           .*******************                                474            |           .******************************                     769            |           .************************************               913         +-----+        .************************************************  1225         |     |        .***********************************************   1212         *--+--*   497.5+*********************************************     1164         |     |        .*******************************************       1105         |     |        .************************************               926         +-----+        .************************                           617            |           .******************                                 467            |           .**********                                         256            |           .******                                             140            |           .***                                                 61            |           .**                                                  33            0           .*                                                   19            0           .*                                                    6            0           .*                                                    2            0      437.5+*                                                    1            0            ----+----+----+----+----+----+----+----+----+---                       * may represent up to 26 counts                                    ------------------------------------------------------------------------------                                   The SAS System                               3                               Univariate Procedure  Variable=VAL                                  Normal Probability Plot                           557.5+                                                  *                  |                                                  *                  |                                                  *                  |                                                  *                  |                                               ****                  |                                            ****                     |                                        *****                        |                                     ****                            |                                  ****                               |                               ****                                  |                            ****                                     |                        *****                                   497.5+                      ***                                            |                  ****                                               |               ****                                                  |            ****                                                     |         ****                                                        |      ****                                                           |   ****                                                              |***                                                                  |*                                                                    |*                                                                    |*                                                                    |*                                                               437.5+*                                                                     +----+----+----+----+----+----+----+----+----+----+                       -2        -1         0        +1        +2      >