960.390 - Introduction to Computers for Statistics

960.390-01, Fall 1999, M 7,8 (6:10-9:00pm)

Meeting dates: 10/25, 11/1, 11/8, 11/15




 

BACKGROUND:

Statistics is often described as the science of data.

---- Statistics provide quantitative information about people, processes, events, and ideas. They are used to help make decisions and as guides for future actions. Everyone uses statistics!

Statistical analysis requires the use of computers and statistical computing software packages:

---- A data set often contains a large number of observations, and the calculations can be very tedious and complicated to do by hand.

SAS (Statistical Analysis System) is one of the most popular statistical software packages:

---- The SAS software package is used to read in, process, and output statistical information from data sets. It is very powerful and can analyze almost all types of statistical problems.

In this course, we will learn SAS for Windows.

---- The other versions of SAS (for example, SAS for Unix and SAS for Mac) are very similar, and we will not cover them here.

 

SAS BASIC I:

General Form of a SAS program:

 
DATA mydata;
INPUT variables; 
CARDS;
the lines of data
;
RUN;
PROC procedure options;
options;
RUN;

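·  SAS is organized into steps, which are like paragraphs. There are two types of steps: DATA steps, which read in and modify data, and PROC steps, which analyze or print the data.

·  SAS steps consist of statements, which are like sentences in a paragraph. Every statement ends with a semicolon.

·  Use a consistent layout pattern to make SAS programs easier to read and debug.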

 

The DATA Step:

·  The DATA statement names the data set

DATA data set;

A data set name can be almost anything you like, but it should not begin with a number, should not contain special characters like "=" or "^D", and must be 8 characters or fewer in length (later versions of SAS raise this limit to 32). For safety, stick to {0-9, A-Z, a-z} and "_".

Good Examples:

 
data one;
data hyper123;
data this_one;

Bad Examples:

 
data 1; /* Begins with a number */
data hyperspacemodulator123; /* Longer than 8 characters */
data this=one; /* Contains "=" */

·  INPUT is the keyword that defines the names of the variables in the data set

INPUT names types (if needed) column_designation (if needed);

The DATA statement is followed by an INPUT statement, which lists the variables in the data set and tells where on an input line they may be found. It is also possible (and in some cases necessary) to specify variable types (strings, dates, etc) in this line.

Assuming that the data is (aligned so that the name starts in column 8):

 
6 0    Michael
5 11   Fred
4 8.5  Isabel
1 11.5 Roxanne

Good Examples:

INPUT FEET INCH NAME $; /* If no locations given, separation by white space assumed */
INPUT FEET 1-1 INCH 3-6 NAME $ 8-20;
INPUT NAME $ 8-20 FEET 1-1 INCH 3-6; /* Ordering does not matter */
INPUT FEET 1-1 INCH 3-6 NAME $ 8-20 WHOLE $ 1-20; /* Sections can be reread */
INPUT FEET $ INCH NAME $; /* Just because a section is a number, does not mean it *must* be read in as such */

Bad Examples:

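INPUT HEIGHT HEIGHT NAME $; /* The same variable name is used twice */
INPUT FEET 1-1 INCH 3-6 NAME 8-20 $; /* The $ must come before the column range */
INPUT NAME 20-8 $ FEET 1-1 INCH 3-6; /* Column range is backwards, and the $ is misplaced */
INPUT 8FEET 1-1 IN;CH 3-6 NAME $ 8-20; /* A name begins with a digit; a semicolon sits inside a name */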
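
To tie the INPUT examples together, here is a minimal sketch of a complete program that reads the height data above with column input and prints it. The data set name HEIGHTS is chosen only for illustration, and the data lines are assumed to be aligned so that the name starts in column 8.

```sas
/* Column input: each variable is read from fixed column positions. */
/* HEIGHTS is a hypothetical data set name used for illustration.   */
DATA HEIGHTS;
   INPUT FEET 1-1 INCH 3-6 NAME $ 8-20;
   CARDS;
6 0    Michael
5 11   Fred
4 8.5  Isabel
1 11.5 Roxanne
;
RUN;

/* Print the variables in a different order than they were read. */
PROC PRINT DATA=HEIGHTS;
   VAR NAME FEET INCH;
RUN;
```

With list input (INPUT FEET INCH NAME $;) the same data could be read without any alignment, because the values are separated by blanks.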
 

·  The CARDS statement signals the beginning of the lines of data

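By default, each data line holds one data record. If the INPUT statement is constructed differently, more than one data record per line (or a data record that spans several lines) can be parsed.

Do not put a semicolon at the end of a data line! A single semicolon on its own line after the last data line marks the end of the data.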

Good Example:

DATA CASH;
INPUT BANK $ ACCTNUM MONEY;
CARDS;
CHASE 1536253 50.32
CORESTATES 189273462 1563.82
FLEET 287363 20000.00
;
RUN;

Bad Example:

DATA CASH;
INPUT BANK $ ACCTNUM MONEY;
CARDS;
BANKOFAMERICA 23423423 10000.03; /* Semicolon at end of line */
CHASE 1536253 50.32 CORESTATES 189273462 1563.82 /* Multiple elements per line */

·  RUN is an optional last statement in the DATA steps and the PROC steps (recommended in this class)
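
The CARDS discussion above mentions that a data record can span more than one line if the INPUT statement is written differently. One way to do this, sketched here, is the slash (/) line-pointer control, which is not covered in the notes above; it tells SAS to move to the next data line before reading the remaining variables. The layout of the data (bank name on one line, account number and balance on the next) is an assumption made for this illustration.

```sas
/* Each record occupies two data lines:        */
/*   line 1: bank name                         */
/*   line 2: account number and balance        */
DATA CASH;
   INPUT BANK $ / ACCTNUM MONEY;   /* the slash moves to the next data line */
   CARDS;
CHASE
1536253 50.32
FLEET
287363 20000.00
;
RUN;

PROC PRINT DATA=CASH;
RUN;
```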

 

The PROC PRINT Step:

The PROC PRINT statement tells SAS to print out the values of certain variables in the data set.

PROC PRINT DATA = mydata;

            VAR variable_name_1 variable_name_2 etc;

            RUN;

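·  DATA = mydata tells PROC PRINT to use the SAS data set named "mydata" (optional; if it is omitted, SAS uses the most recently created data set)

·  List the variables you want to print after VAR in the order you want them printed (if VAR is omitted, all variables are printed)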

Example:

DATA MYDATA;
INPUT NAME $ RANK $ SERIAL;
CARDS;
PILE PRIVATE 4323254
KLINK COLONEL 7574734
HOOK CAPTAIN 8573463
;
RUN;
PROC PRINT DATA=MYDATA;
VAR NAME SERIAL;
RUN;
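
As a small variation on the example above, this sketch leaves out both the DATA= option and the VAR statement to show the defaults of PROC PRINT. It reuses the same data; nothing else is assumed.

```sas
DATA MYDATA;
   INPUT NAME $ RANK $ SERIAL;
   CARDS;
PILE PRIVATE 4323254
KLINK COLONEL 7574734
HOOK CAPTAIN 8573463
;
RUN;

/* No DATA= option: the most recently created data set (MYDATA) is used. */
/* No VAR statement: all variables are printed, in data set order.       */
PROC PRINT;
RUN;
```

Naming the data set explicitly with DATA= is still the safer habit once a program contains more than one data set.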

SAS Program, Log, and List files:

In order to write a computer program, you must create a file using an editor. After you have written your SAS program file, you will need to save it and then submit it for batch processing.

When a SAS program is run, two files will usually be generated: the "log" file and the "list" file:

·  Log file: contains information about the run, warnings and errors

1. An ERROR line means that some part of the processing failed; you will need to fix the problem and run the analysis again.

2. WARNING lines are generally worth looking at, too.

·  List file: contains the output of the various procedures.

Comment Lines: 

---- A statement that begins with a star (*) and ends with a semicolon (;), or

---- Text that begins with /* and ends with */

 We have seen some examples above. Here are some more examples.

Example:

DATA CASH;
INPUT BANK $ ACCTNUM MONEY @@;
* this is a good example;
CARDS;
CHASE 1536253 50.32 CORESTATES 189273462 1563.82 FLEET 287363 20000.00
;
RUN;
PROC PRINT DATA = CASH; /* to print the data */ 
VAR BANK ACCTNUM MONEY;
RUN;

SAS BASIC II and DATA MANAGEMENT I 

Comparison of Character Formats:

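There are three character formats ($, $w., $charw.), which differ in how they handle blanks and missing values.

·  With list input ($): leading blanks are skipped and the value ends at the first blank; a single period is read as a missing value.

·  With $w. format: exactly w columns are read, so embedded blanks are kept; leading blanks are trimmed, and a single period is read as a missing value.

·  With $charw. format: exactly w columns are read; leading blanks are preserved, and a single period is read as the character "." rather than as a missing value.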

Good example:

INPUT SPORT $7. OCCUPATION $char8.;
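
To compare $w. and $charw. side by side, the following sketch reads the same eight columns twice, once with each format. The data set name, the variable names AS_IS and TRIMMED, and the two data lines are all invented for this illustration; the @1 column pointer moves back to column 1 before each read.

```sas
/* Read columns 1-8 of every line twice, once with each format. */
DATA SPORTS;
   INPUT @1 AS_IS   $CHAR8.   /* leading blanks are kept    */
         @1 TRIMMED $8.;      /* leading blanks are removed */
   CARDS;
   GOLF
.
;
RUN;

PROC PRINT DATA=SPORTS;
RUN;
```

For the first data line, AS_IS should keep the three leading blanks while TRIMMED should hold GOLF starting in its first position. For the second line, AS_IS should hold a period and TRIMMED should be missing (blank).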

Read More than One Data Record per Line

 If you have more than one data record per line, it is possible to read it in without reformatting.

·  Simply add an @@ at the end of the INPUT statement, just before the semicolon.

·  Do not use this with column markers, as it relies on white space.

Good Example:

DATA CASH;
INPUT BANK $ ACCTNUM MONEY @@;
CARDS;
CHASE 1536253 50.32 CORESTATES 189273462 1563.82 FLEET 287363 20000.00
;
RUN;

Bad Example:

DATA CASH;
INPUT @1 BANK $ ACCTNUM 7-13 MONEY 15-20 @@; /* column designations with @@ */
CARDS;
CHASE 1536253 50.32; CORESTATES 189273462 1563.82 /* semicolon in line */
;
RUN;

Also, study Example 1.3 on page 8.

Reading Data from A File:

Many data sets exist in a computer file somewhere. You can access these data by using a FILENAME statement *before* the DATA step:

FILENAME data_in 'a:\mydata.txt';

where data_in is any valid file reference name (it follows the same naming conventions as variable names) and 'a:\mydata.txt' is the name of the file from which you are reading the data (in the statement above, the file is on drive A). It may be necessary to give a fuller path name to your data file.

Once this is done, add an INFILE statement in the DATA step, before the INPUT line:

INFILE data_in;

Remember: do NOT add a "CARDS" statement or any raw data!

 

Good Example: If the file 'bigfile.txt' on drive C reads:

Chase 1536253 50.32 Corestates 189273462 1563.82 Fleet 287363 20000.00

FILENAME FOOBLITZ 'C:\BIGFILE.TXT';
DATA CASH;
INFILE FOOBLITZ;
INPUT BANK $ ACCTNUM MONEY @@;
RUN; 

Bad Examples:

FILENAME 123FOO=BLITZ 'C:\BIGFILE.DAT' /* bad filename */
FILENAME FOOBLITZ C:\BIGFILE.DAT /* file not in quotes */
FILENAME FOOBLITZ 'C:\BIGFILE.DAT' /* no ending semicolon */
DATA CASH; /* "infile fooblitz;" line missing */
INPUT BANK $ ACCTNUM MONEY @@;
RUN;
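
As an alternative to a separate FILENAME statement, the INFILE statement also accepts the quoted path of the file directly. The sketch below reuses the file from the good example above and assumes that it exists at that location.

```sas
/* The path is given directly on INFILE, so no FILENAME statement is needed. */
DATA CASH;
   INFILE 'C:\BIGFILE.TXT';
   INPUT BANK $ ACCTNUM MONEY @@;
RUN;

PROC PRINT DATA=CASH;
RUN;
```

A FILENAME statement is still convenient when the same file is used in several DATA steps, because the path then has to be changed in only one place.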

Creating New Variables, Subsetting the Data, and Creating New Data:

·  It's possible to make new variables from old ones within a data step:

After an INPUT statement, just set the new variable equal to some mathematical function of the old ones. All the usual mathematical operators work: {+, -, *, /, and ** for exponentiation}.

·  It is possible to subset the data set:

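Use IF or IF … THEN … ELSE statements to subset the data set. An IF statement with only a condition (for example, IF Y > 350;) keeps just the observations for which the condition is true.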

·  It is possible to use a SAS data set as the basis for creating a new SAS data set

We can have many SAS data sets in one program. The SET statement can be used to create a new SAS data set from an old one.

Good Example:

FILENAME XYZ "A:AAA.DAT";
DATA ONE;
INFILE XYZ;
INPUT NAME $ VT $ X Y;
RUN;
PROC PRINT;
BY Y; /* BY requires the data to be sorted by Y already; if not, run PROC SORT first */
RUN;
DATA TWO;
SET ONE;
IF Y > 350;
RUN;
PROC PRINT DATA = TWO;
RUN;
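
The three ideas in this section (new variables, IF … THEN … ELSE, and SET with a subsetting IF) can also be sketched in one short program. It reuses the height data from the INPUT examples; the data set names HEIGHTS and TALL, the variables TOTAL and GROUP, and the 72-inch cutoff are all chosen only for illustration.

```sas
DATA HEIGHTS;
   INPUT FEET INCH NAME $;
   TOTAL = 12*FEET + INCH;          /* new variable: height in inches */
   /* The longer label is assigned first so that GROUP is created    */
   /* with enough room for 5 characters.                             */
   IF TOTAL < 72 THEN GROUP = 'SHORT';
   ELSE GROUP = 'TALL';
   CARDS;
6 0 Michael
5 11 Fred
4 8.5 Isabel
1 11.5 Roxanne
;
RUN;

/* New data set built from the old one with SET. */
DATA TALL;
   SET HEIGHTS;
   IF TOTAL >= 72;                  /* subsetting IF: keep only these rows */
RUN;

PROC PRINT DATA=TALL;
   VAR NAME TOTAL GROUP;
RUN;
```

In this data only Michael (12*6 + 0 = 72 inches) satisfies the condition, so TALL should contain a single observation while HEIGHTS keeps all four.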