Data Analysis using the SAS Language/Data Step

From Wikiversity
Jump to: navigation, search

In order to understand the data step it is helpful to understand the SAS data set. A SAS data set consists of observations and variables, these are respectively the rows and columns of data. Using a database concept, a SAS dataset represents a table with records and fields represented as observations and variables respectively.

SAS datasets can be temporary, existing during the life of the program or they can be permanent and persist between programs. The SAS dataset is a proprietary format that can only be accessed by the SAS System. However, SAS datasets can be written to data base tables, text files, or to PC files such as Excel or CSV format. And, as can be expected, SAS data sets can be created from any of these sources as well. The benefit of SAS data sets is the speed in which SAS can load data and begin processing.

Progress-1000.svg Completion status: this resource is considered to be complete.
U+269B.svg Subject classification: this is a science resource.
Bellcurve.svg Subject classification: this is a statistics resource.
Gnome-fs-client.svg Subject classification: this is an information technology resource.
Sciences humaines.svg Educational level: this is a tertiary (university) resource.

Data Step Language[edit]

In the SAS Language, statements are written in a very free form with few rules. For example, SAS statements can span several lines or several statements can be placed on a single line. All SAS statements must end with a semicolon ";". However, it is useful to indent code in the appropriate places in order to make the program more readable. Comments can also be used to explain the purpose of each section of code.

SAS is not case sensitive; however, variable names retain the casing used from when they were first defined within the program. This means variable names will appear in reports using the case established when they were created. Within the data step, the SAS language provides the input, output and logic for manipulating data.

Each data step begins with the data statement which defines the name of the SAS data set created by this step, and ends with the run statement. The statements within the data and run statements are executed for each observation in the input data set. Looping through observations is automatic within a SAS data step. The following example is a program that reads in three variables from the data file, apples, and makes a calculation to create a new variable. The result is a SAS data set, also called apples, which contains every observation from the original file and four variables, three input and one calculated.

    Filename apples "c:\fruits\apples.txt";
    data apples;
         infile apples;
         input Type $15. Quantity 6. Price_per_unit 6.2;
         purchase_cost = Quantity * Price_per_unit ;
    run;

The file, apples.txt, contains three variables:

  • Type, width 15 characters, name of apple
  • Quantity, width 6 digits, amount of apples purchsed
  • Price_per_unit width 6 with 2 decimals

The first seven observations of the file, apples.txt, look like this.

   McIntosh          100  2.00
   Red Delicious      75  2.25
   Granny Smith      125  2.05
   Jonathon          120  1.95
   Rome              130  2.00
   Gala              150  1.95
   Fuji              200  2.25

The resulting sas data set, apples, will have four variables for each observation. Three were read from apple.txt and one was created in the data step. The new variable, purchase_cost, is the product of the variables quantity and purchase_price.

Statements[edit]

SAS has several types of statements used in the SAS Data Step. These statements provide the building blocks for designing powerful programming modules within the data step. There are several SAS procedures, or procs, that are closely tied to processing data in the data step. These procedures include proc format, proc print, proc sort, proc sql, and proc summary. They provide routines that work across several observations at once. Each proc has its own set of statements that provide parameters, options, variables, and output data sets. By interweaving data steps with the appropriate procedures, powerful SAS programs can be built. First we focus on the data step language statements.

Data Input and Output[edit]

Input and output statements are used to identify both the source and destination of data and how to read and write data to and from files. SAS has seperate statements for using non SAS data sets and for SAS data sets. However, SAS can treat data in a non SAS database as if it is a SAS data set. There are a couple of steps that must always be followed. First, a logical link is established to the location of the physical data file. This is done with a filename statement (for non SAS data sets) or a libname statement (for SAS data sets). The physical file could be a SAS data set in a SAS library, a text file, or a file from another vendor such as SAP, Oracle, or Microsoft. SAS data sets use a proprietary format optimized for the SAS system; SAS temporary files are also stored as SAS data sets. SAS can also access data from different database vendors such as Oracle, IBM DB2, and Microsoft Excel and Access as if they were SAS data sets. Text files are accessed using the filename statement with the infile or file statement.

Filename Statement[edit]

The filename statement specifies the name of a physical that will either contain data to be read, or a file that will be created and written to. Normally this is a text file. Variations of this statement allow access to files using FTP, HTTP, pipes, email and other protocols. Within the Data Step, the file and infile statements reference the filename statement. This is done by the name on the statement. In the example below, the infile statement, shipment, identifies the filename statement, shipment, associated with a physical file, fruit.txt.

    filename shipment "c:\foods\fruit.txt";
    data fruit_shipment;
         infile shipment;
         input shipment_number 5. +1 shipment_date mmddyy6. +1 
         type_item $15. @30 price 6.2 @40 quantity 5.;
    run;

File Statement[edit]

The file statement, apples, associates a filename statement, apples , associated with a physical file, apple_list.txt. This file will be used to store output. If the file, apple_list.txt exists it will be written over, otherwise it will be created and written to.

     filename apples "c:\foods\apple_list.txt";
     data _null_;
          set fruit_shipment;  
          file apples;
          if type_item="APPLE" then 
               put  shipment_number 5. price 6.2 @40 quantity 5.;
     run;

Input Statement[edit]

The input statement lists the names of the variables and the formats needed to read them. Formatting for input gives SAS the rules it needs to extract data from the input file. This involves positioning the input pointer to the correct position, giving the name for the new variable, the type (number of character) and the width or number of characters that the input variable uses in the input file. The statement below lists five variables to input from a file.

    input shipment_number 5. +1 shipment_date mmddyy6. +1 
         type_item $15. @30 price 6.2 @40 quantity 5.;

Formats are used for input and output with text files. The $ is used for character strings. A format descriptor always has a decimal either at the end of, or before the number of decimals. There are other format instructions which tell SAS where to move its input pointer. Below is a break down to explain the different format options used in the previous statement.

This tables explains the input statement above while demonstrating much of its functionality.

  • shipment_number 5. a five character numeric with no decimals
  • +1 tells the input pointer to skip ahead one character (one position)
  • shipment_date mmddyy6. a Six character date field 2 digit for month, day and year
  • +1 skip a character
  • type_item $15. a character string of length 15, the "$" indicates characters
  • @30 move the input pointer to position 30
  • price 6.2 a six character numeric field with 2 decimals
  • @40 move to column 40
  • quantity 5. a five character numeric with no decimals

Informat Statement[edit]

The informat statement specifies the default format for inputting data from a textdata file. This includes specifying the fields with embedded commas, currency and date fields. Informat information is also saved in a SAS data step.

Format Statement[edit]

The format statement specifies the default format for writing variables to reports, on SAS procedure output, and to text files. This includes writing fields with embedded commas, currency and date fields. A format statement in the data step attaches the format to the field for all subsequent output. A format statement can also be used in a proc step to override a default format. Format information is also saved in a SAS data step.

Put Statement[edit]

The put statement allows data to be output as text. It is similar to the input statement. Each variable listed is followed by its format.

    put type_fruit $15. @25 price 8.2 +5 special 8.2 +1 quantity 9.;

Libname Statement[edit]

The libname statement references to the location of the folder that will contain permanent SAS data sets. These are SAS data sets that will persist beyond the current program. The name myfruit will be used along with the dataset names, shipment and apple, to fully qualify the dataset name and where it will be stored.

    libname myfruit "c:\sasdata\fruit";
    data apples; 
         set myfruit.shipment;
         if type_fruit="APPLES";
    run;
    data myfruit.apples;
         set apples;
         pct=.10;
         *** change price by pct;
         price=price*(1+pct);
    run;

Data Statement[edit]

The data statement names one or more sata data sets. These data sets can be temporary (exist until the program finishes) or permanent (stored after program is finished). Permanent SAS data sets can used by other SAS programs. The benefit of a SAS data set is that all the data is stored for optimal use by your SA program. You do not need to know the names of the variables or their formats. That information is part of the data set.

     data apples;
     data myfruit.apples;

Set Statement[edit]

The set statement is the counterpart of the data statement. Set inputs data from a SAS data set. It can be from a permanent SAS data set saved by another SAS program or from a temporary SAS data set created in the same program.

     set myfruit.shipment;
     set apples;

Merge Statement[edit]

The merge statement, which usually has a companion by statement, merges SAS data sets row to row by a common value of a common variable. The by statement tells which variable the data sets have in common so they can merge on its value.

     proc sort data=fruit_shipment;
          by type_fruit;
     run;
     proc sort data=fruit_prices;
          by type_fruit;
     run;
     data fruit_costs oranges;
          merge fruit_shipment fruit_prices; 
          by type_fruit;
          if type_fruit="ORANGES" then 
               output oranges;
          output fruit_costs;
     run;

Output Statement[edit]

The output statement forces the current observation into the data set. It is implied at the end of a data step. However there may be reasons to place and output statement inside a data step to conditionally force an observation. A data step may have more than one output statement.

     output oranges;
     output fruit_costs;

By Statement[edit]

The by statement, while it is not strictly part of the input/output process, tells the SAS data step that the observations in both the input data sets fruit_costs and oranges are sorted. Sorted data can be merged on the common values of the sorted variable. When merging data it is important to avoid a many-to-many relationship.

     by type_fruit;

Assignment Statements[edit]

Assignment statements are the heart of most data steps. This is where expressions are evaluated and results assigned to new variables. Assignment statements are broken into 2 sections. On the left side is a variable either a new variable or an existing variable. On the right side is an expression. The two are connected by an equal sign. Variables, in the expression on the right side, are substituted by their values, next the expression is evaluated and the result is placed in the variable on the left. Expressions can consist of numeric, character, and logical expressions.

Caution should be given when mixing numbers and characters. SAS will attempt to perform an automatic conversion, but that is usually a sign that an incorrect variable is being used. There is a special value for numeric variables called a missing value. An expressions with a variable which has a missing value, in most cases, results in a missing value. Missing values are represented by a dot, ".".

Expressions[edit]

Expressions can use any combination of mathematical operators, string operators, logical operators and functions. Following are some example expressions for assignment statements.

     fruit = apples + bananas;
     revenue = price * Sales_Volume;
     fullName = "Billy"||" "||"Smith";
     hypotenuse = sqrt(side_a^2 + side_b^2);
     areaCode = substr(compress(PhoneNumber,' ()-.'),1,3);
     ShippingCode=.;
     items+1;
     total_visits+visits;

The last two statements deserve an explanation. The statement, items+1, which increments items by 1, is equivalent to items=items+1. The statement, total_visits+visits, which increases total_visits by visits is equivalent to total_visits=total_visits+visits. In each case, SAS automatically retains the result for use with the next observation. In the example below;

     Data carrot_totals;
          set shipment (end=done);
          total+carrots;
          truck+1;
          if done then output carrot_totals;
     run;

Mathematical operations have the following precedence.

    ^, **    exponentiate (raise to power)
    *        multiply 
    /        divide
    |        mod (remainder after integer division)
    +        add
    -        subtract  

Parantheses can be used to force the order of operations.

Strings have a special operator for concatenation, ||. The left and right strings are connected toether to make a single string.

Retain Statement[edit]

The retain statement keeps the last value of variable for use when the next observation is read.

     data total_apples;
          set shipment;
          retain total 0;
          input quantity;
          if type_item="APPLES" then 
               total=total+quantity;
     run;

Arrays[edit]

Arrays provide a useful way to operate on several variables in a SAS data set, using a loop statement.

     data apple_shelf_count;
          set apple_shelf;
          array apl (*) apple1-apple5;
          do i=1 to len(apl);
               if apl{i}<0 then apl=.;
          end;
     run;

Logic[edit]

The purpose of logic statements is to execute separate branches of code depending on value of a condition. Conditions are evaluated as either true or false. Logic statements consist of combinations of if...then...else and select..case statements. Logical expressions are built using comparison operators. The operators consist of the following mneumonics and their cooresponding symbols.

    LT     <      less than
    LE     <=     less than or equal to
    GT     >      greater than
    GE     >=     greater than or equal to
    EQ     =      equal to
    NE     ^=     not equal to
    AND    &      and conjunction, both sides must be true for expression to be true
    OR     |      or conjunction, either side must be true for expression to be true
    NOT    ^      negates the expression, not true is false
    IN     set membership, there is no symbol for this operator

The last operator, IN, is used to test membership in a set. A set is a list of value constants, either numeric or character strings. if the item is a member of the set the result is true otherwise it is false.

Like arithmetic operations, logical operators also have an order of precedence that can be controlled by parentheses. The default order is as follows; AND operators are done first followed by OR operators. Parentheses should be used to help with readability. Following are the rules for evaluating AND and OR.

  • AND expression: Both sides of the expression must be true for the expression to be true.
  • OR expression: Either side of the expression must be true for the expression to be true.

Subsetting If Statement[edit]

Subsetting. The following is a special form of the if statement that has no then statement, called a subsetting if statement. Only observations that meet this condition are included in the resulting data set.

     data gs_apples;
          set apple_shipment;
          if (apple_type="Granny Smith");
     run;

If ... Then[edit]

if...then. The if...then statement will do the then portion when the condition is true.

     data apples;
          set shipment;
          if (quantity < 100) then 
              put "Small volume purchase for " apple_type 3.;
      run;

If ... Then ... Else[edit]

if...then...else. The if...then...else statement will do the then portion when the condition is true, the else portion if the statement is false. There can be any number of nested if...then...else statements, however this increases the complexity of the code. Care must be taken to ensure the code is correct. Test each portion of the nested if statements to ensure the program executes properly. one way to do this is by creating a test data set that contains each combination of variables for each comparison of the nest.

     data regional_analysis;
          set shipment;
          if state in ('CA' 'OR' 'WA' 'HI' 'AK' 'ID' 'MT' 'AZ' 'NM' 
             'CO' 'WY' 'UT' 'TX' 'NV') then
             region='West';
          else if state in ('VA' 'FL' 'GA' 'NC' 'SC' 'AL' 'MS' 'LA' 'AR' 
                  'WV' 'KT' 'TN') then
                  region='South';
          else if state in ('MD' 'DC' 'DE' 'PA' 'NJ' 'NY') then 
                  region='Northeast';
          else if state in ('CT' 'RI' 'MA' 'NH' 'ME' 'VT') then 
                  region='New England';
          else if state in ('ND' 'SD' 'MO' 'IA' 'MN' 'WI' 'IL' 'IN' 'OH' 
                  'NE' 'KS' 'MI' 'OK') then 
                  region='Midwest';
     run;

Compound Statements[edit]

Statements can be single statements or a compound statement. A compound statement consists of a series of statements surrounded by do and end. The compound statements can be used after a then or an else statement.

     if (today="SATURDAY") then 
     do;
        discount=0.10
        special=(1-discount)*price;
     end;
     else 
        special=price;

Loops[edit]

Loops allow for the repetetively execution of code. Loops, or do blocks, are generally enclosed within do and end statements. Do blocks keep a group of statements together. A special do block is the compound statements used in if...then...else statements. These do blocks only execute once.

    if (today = "SATURDAY") then 
    do;
         new_price=price*1.10;
         banner="Today the price for apples is "||put(new_price,dollar6.2);
    end;
    else 
         banner="Today the price for apples is "||put(price,dollar6.2);

Thre are other do blocks that loop while a condition is true or until a condition is false. There is also an incremental do loop that automatically increaes or decreases a counter until a condition becomes false.

Do While Loop[edit]

The do while loop repeats the statements until the condition, count <= record_count, becomes false. When the statement becomes true, control goes to the statement following the loop.

data shipment_allocation;
    set package end=done; 
    max_load=20;
    load=0;
    do while( (load <= max_load) and not done);
       load+package_weight;
       set package end=done;
    end;
run;

Do Until Loop[edit]

The Do until loop repeats the statements while the condition, count = record_count, is false. When the statement becomes true, control goes to the statement following the loop.

data shipment_allocation;
    set package end=done; 
    max_load=20;
    load=0;
    do until ( (load >= max_load) or done);
       load+package_weight;
       set package end=done;
    end;
run;

Incremental Do Loop[edit]

The incremental do loop repeats the statements and automatically increments count by 1 until count is greater than record_count. When this occurs, control goes to the statment following the loop. The by 1 clause is the default and could have been left off of this example.

data shipment_allocation;
    set package end=done; 
    load=0;
    do i=1 to 20;
       load+package_weight;
       set package end=done;
       if done then goto nomore;
    end;
nomore:
run;