Data Analysis using the SAS Language/Procedures
Procedures in SAS provide the real horsepower. Procedures can perform sophisticated reporting, charting and statistical operations with a minimum of coding. Using a procedure involves supplying the procedure name, the data set, the variables to be used for the task and any parameters, options, or output data set instructions. Most procedures already have an output report design. The display of these reports can be manipulated using ODS and proc template.
Many procedures can also output results to a data set for later use. The options and parameters allow the user to select from a variety of techniques, statstics and tests for use in the analysis or for reporting.
Some of the procedures are language unto themselves. For example, proc SQL, implements data access using the SQL database language. Proc IML and Proc GA also contain virtually a mini-programming language.
Proc Import is used to create a SAS data set by importing data from various PC File formats including Microsoft Excel, Microsoft Access and delimited such as comma separated value (CSV) files. Import can take the variable names from the first input row of the import file. It examines the data to determine if it is character or numeric. Here is an import from an Excel spreadsheet.
proc import datafile="c:\weather\wthr_history.xls" dbms=excel replace out=weather; sheet="LOCAL"; getnames=YES; run;
The SAS data set will be called weather. It will contain the variables identified in the Sheet called LOCAL. The variables names will come from the Excel sheet.
Proc Export is used to create a file from a SAS data set. Data sets can be exported to various PC File formats including Microsoft Excel, Microsoft Access and delimited such as comma separated value (CSV) files. Proc export puts the SAS variable names on the first row of the export file. Here is an export to an Excel spreadsheet.
proc export outfile="c:\production\orchard_stats.xls" dbms=excel replace data=orchards; sheet="orchard growth"; run;
The SAS data comes from the SAS data set, orchards. Proc export creates a sheet named, orchard growth in an Excel file named orchard_stats.xls. The variable names are written to the first row of the file.
Some formats are preserved in the file transfer. It is important for dates to by formatted as a data field such as mmddyy8. so it is translated correctly in Excel. If dates are not formatted they will be transfered as a SAS date serial number using the SAS internal integer representation of a date. SAS date serial numbers are the number of days offset from 1 Jan 1960 (date serial number = 0). Another way to transfer dates is to parse them into month, day, year components and then rebuild the complete date field in Excel after the transfer.
Proc SQL provides nearly the complete ANSI SQL language for use on SAS data sets and through the libname engine to other non SAS databases. The SQL language needs a complete manual to describe it, but here is a useful example.
proc sql; create table apple_info as select a.shipment_id,a.shipment_date,a.bushels,a.type_item, a.destination,b.temperature,s.spoilage from shipment a, weather_info b where a.type_item="APPLE" and a.shipment_date>='1may2005'd and a.destination=b.station_location and a.shipment_date=b.weather_date; quit;
The prefixes a. and b. are shorthand for the source tables shipment and weather_info. The prefixes are required when the same field name is in more than one of the tables. Two files are merged together by the common values of shipment_date and weather_date and common values of destination and station_location. Also, only observations which match the criteria type_item equals APPLE and shipment_date greater than or equal to 1 May 2005. The result of this query will be a SAS Data Set named apple_info which will contain seven variables.
The same result can be done with a combination of SAS data steps and Proc Sort. But you will be able to see how SQL can streamline some operations.
proc sort data=shipment; by destination shipment_date; run; proc sort data=weather_info; by station_location weather_date; run; data apple_info_2; merge shipment(in=a) weather_info(in=b rename=(station_location=destination weather_date=shipment_date)); by destination shipment_date; if a and b; keep shipment_id shipment_date shipment type_item destination temperature spoilage; run;
This statements produce a data set, apple_info_2, similar to the table, apple_info created with the the SQL based procedure. You should also see some of the limitations of the data step method. First the two files must be sorted, then they must be matched on variables with the same names. This requires a rename option on the data set.
There are several procedures that produce descriptive statistics. Descriptive statistics provide information about the distribution of data such as mean, standard deviation, variance, range, number of observations and extreme observations. Some are univariate in that each variable is treated independently. Other procedures are bivariate in that they work with more than two variables at a time.
Proc means presents descriptive statistics for each variable listed in the var statement or for each numeric variable in the data set if there is no var statement. Here are some of the keywords that can be used to tell SAS which statistics you wish to see.
- n count of non missing variables
- sum summation of the variable
- range largest value minus smallest value
- mean average
- var variance
- stddev standard deviation
Below shows proc means using the data set apples to produce the statistics noted with the above keywords.
proc means data=apples n sum range mean var stddev; var bushels price; run;
This next example shows how to use proc summary to generate and produce store statistics in a SAS data set.. The SAS data set can be stored permanently. use later in the program, or printed in a report. Proc summary does not print any output on its own.
proc summary data=apples; var bushels price; output out=stats n(bushels) sum(bushels)=total_bushels range=rng_bushels rng_price mean(bushels price)=avg_bushels avg_price var(bushels price)=var_bushels var_price stddev stddev(bushels price)=std_bushels std_price; run;
Notice the keywords in the output statement. The statistics themselves are stored in the variables on the right of each equals sign.
Proc univariate provides a comprehensive set of descriptive statistics for each numeric in the data set or on the var statement. The statistics includes quantiles, five largest and smallest values, tests for normality, the moments (mean, variance, skewness and kurtosis), and plots such as normality and box and whisker.
proc univariate data=apples; var bushels price; run;
Proc corr provides a Pearson correlation matrix and an optional variance-covariance (VCV) matrix along with univariate descriptive statistics. The correlations are done an each pair of numeric variables in the SAS data set. The summary statstics, the correlation matrix, the VCV matrix, and the sum-of-squares crossproducts (SSCP) matrix can be stored as SAS data sets for use in other data steps. Many procedures use these matrices as input. Proc corr can also produce Spearman correlations and Kendall's tau-b.
proc corr data=apple_weather outp=spoilage_corr; var spoilage quantity weight temperature distance; run;
In the above eample, besides displaying the simple statistics and the correlation matrix, a SAS data set is created with these statistics.
Proc freq provides frequency counts of observation for each unique value of a variable. Proc freq can also perform chi-squared, , and other analysis of two-way tables. Chi-squared is a nonparametric test for the distribution of the data. The tables statement lists the frequency tables. Any options are placed after the forward slash which follows the last table request. Two variables separated by an "*" indicate a two-way table, the first variable is on the vertical side, the second is horizontal side of the table. Variable with no "*" indicate one-way tables or frequency tables. There are several different goodness-of-fit-tests which can be performed; however the chi-squared test is one of the most common goodness-of-fit tests. Use the option, chisq, to get this test.
proc freq data=shipment; tables region region*month/chisq; run;
Multivariate analysis involves procedures which can analyze multiple variables simultaneously. In many of these procedures there is no model statement like in regression analysis. Instead, transformations are made on the variance-covariance (VCV) matrix or sum-of-squares crossproduct (SSCP) matrix to produce clusters, grouping of similar observations, or factors, grouping of variables which appear to be driven by the same latent factor. The general method for most of these procedures is to find the eigenvalues for the matrix. The eigenvalues represent the transformation necessary to transform the input matrix to the identity matrix (or some other form depending on the assumptions). The eigenvalues help identify the number of factors or groups and provide an estimate of the portion of explanation obtained for each one.
Using a multivariate model usually involves the hold out technique. Hold out uses a sample of the available data to develop or train the model. From this initial model come the coefficients or scores. Next, the model is tested using the scores with another sample from the same data set. One test is to use predicted group membership compared to actual group membership to develop a hit-miss ratio. Due to the need for training and testing models as well as the number of pareters that are usually etimated, samples sizes need be large.
The CLUSTER procedure hierarchically clusters the observations in a SAS data set by using one of 11 methods. The data can be coordinates or distances. If the data are coordinates, PROC CLUSTER computes (possibly squared) Euclidean distances. If you want non-Euclidean distances, use the DISTANCE procedure to compute an appropriate distance data set that can then be used as input to PROC CLUSTER.
Proc discrim is a procedure for performing discriminant analysis. Discrimint analysis models predict group membership by fitting an affine function between two or more groups. A model statement provides the variables to use for fitting the affine. Least squares and maximum likelihood techniques are used to estimate the parameters of the function.
When developing a discriminant analysis, it is useful to use a hold out technique as outlined above. Proc Discrim is tuned for training and testing a model in one step.
proc discrim data=giftcardsurvey; class bought_gift_card; var age gender purchReason purchfor; run;
Proc factor is a procedure for variable reduction through factor analysis. Proc factor is commonly used for analyzing surveys, questionnaires and psychological tests, where several variables are captured for each observation. Factor analysis assumes that there are true latent factors which drive the variables measured by the instrument. Factor analysis discovers the number of latent factors and reports how they are correlated to the measurement variables in the data set. In general, factor analysis is an exploratory method as opposed to model building method. However there are extensions to factor analysis (see proc calis), called confirmatory factor analysis, that let the analyst provide a theoretical factor structure and then test it.
There are several methods available to factor analysis including principal components analysis, principal factor analysis, and maximum likelihood. There are also several rotation technique for specifying how the factors are related. These rotations include orthogonal, orthomax, varimax, and oblique among others. The output of factor analysis includes correlation between the variables and the factors and the eigenvalues. The variable-factor correlation, also called the factor scores can be used in the score procedure to produce scores for a hold sample of the data set used for the factor analysis.
proc factor data=apple_cust_survey; var satis1-satis3 freshness1-freshness3 crispness1-crispness3 sweetness1-sweetness3 service1-service3 price1-price3; run;
This is a principal components analysis of a customer survey that measures qualities of the product, service and price.
PROC FASTCLUS produces relatively little output. In most cases you should create an output data set and use another procedure such as PRINT, PLOT, CHART, MEANS, DISCRIM, or CANDISC to study the clusters. It is usually desirable to try several values of the MAXCLUSTERS= option. Macros are useful for running PROC FASTCLUS repeatedly with other procedures.
A simple application of PROC FASTCLUS with two variables to examine the 2- and 3-cluster solutions can proceed as follows:
proc standard mean=0 std=1 out=stan; var v1 v2; run;
proc fastclus data=stan out=clust maxclusters=2; var v1 v2; run;
proc plot; plot v2*v1=cluster; run;
Proc score combines two data sets, one with coefficients produced in an earlier procedure and another with raw. A new data set is created with the product of the raw data and the coorespnding coefficient.
Many hold-out methods use proc score to generate estimated or predicted values from coefficients determined from a training set. Proc score is often used with proc factor as an efficient method for generating forecasts or predicted values. Proc score can also be used with output from proc reg and other procedures.
proc factor data=appleTrees score outstat=appleTree_scores noprint; var age diameter bushels height canopy acidity; run; proc score data=appleTrees score=appleTree_scores out=scored_trees; var age diameter bushels height canopy acidity; id variety; run;
Graphics, Charts and Plots
These are high quality charts that can be exported to a web site. Charts include pie, radar, star, and bar. Some of these have 3D versions as well. These charts are used to present business information when it is nexessary to provide comparisons or proportions of a whole. There are several Plot procedures that create scientific plots that are useful for identifying underlying relationships of unlock general patterns in the data. There is also a procedure for presenting geograpic data including state and national boundaries, and U.S. counties.
Universal Graphic Statements
These statements are used to specify global parameters than can be used by several graphic procedures. The i following each keyword is replaced by a number from 1 to 99 to represent the statement number. The appropriate statement is referenced in the procedure to apply the parameters where they are required.
Proc GChart creates business charts such as bar and pie charts.
Proc GPlot creates scatterplots and other two variable plots.
data apples; input month date9. @12 sales 10.; infile datalines; format month monyy5.; datalines; 15jan2008 200 15feb2008 150 15mar2008 125 15apr2008 175 15may2008 210 15jun2008 230 15jul2008 320 15aug2008 350 15sep2008 425 15oct2008 380 15nov2008 300 15dec2008 250 ; run; symbol1 value=star interpol=join; axis1 order=("1jan2008"d to "1jan2009"d by month) offset=(3,3) label=none major=(height=2 ) minor=(number=1 height=1); title "Monthly Apple Sales"; proc gplot data=apples; plot sales*month=1/haxis=axis1; run; quit;
This plot is a timeseries of the monthly apple sales for a single year.
Proc GMap uses the SAS Map Data Sets along with other data to show data geographically. Using Proc Map requies two data sets, one is the map data set for the geography to be mapped, the other is the analysis data set. The analysis data set must have a variable that matches to a variable in the map data set such as county, province, state or country.
Proc GPlot creates three way scatterplots, parametric plots and surface plots.
proc gplot data=apples; plot grow_time*altitude=weight; run;
Simple reports can be created with proc print. These reports be made more complex through the use of by group processing and subtotals.
proc print data=apples_shipdate; by transport_vendor; id date_shipped; var quantity purchase_price total_cost spoilage spoilage_cost net_cost; sum quantity purchase_price total_cost spoilage spoilage_cost net_cost; sumby transport_vendor; pageby transport_vendor; run;
Proc Report creates complex, customizable reports by displaying existing fields or computing new summary or aggregate statistics for data. Using templates allows users to specify font, formatting, alignment, cell width, and display attributes. Proc Report can be used to generate standard statistical report elements such as cross tabulations, frequency tables, or listings. When coupled with the ODS system, Proc Report can generate deliverables of many different filetypes, most notably PDF or RTF.
* display the number and average height and weight of men and women in each age group for the SAS class dataset; proc report data=sashelp.class nowd; columns age sex, (height=countVar height weight); define age / group; define sex / across; define countVar / n "N"; define height / analysis mean format=6.0 ; define weight / analysis mean format=6.0; run;
Proc Tabulate creates complex one-way, two-way and three-way reports. This is the SAS version of a pivot table like you may have seen in Excel. Variables are divided into two types, class variables and metrics. Class variables are identified in a class statement, they specify the categorical variables used on the margins of the report. The metric variables are identified in the var statment. They specify the variables that are being analyzed and reported on. The Table statement specifies how the table is orgainized. The major parts of the table are the page, row, column and the body. The Table statement specifies the class variables and their relation to the page, row and column. The Table statement also specifies the metrics and what statistics to display. These can be counts, means, sums or any other univariate statistic.
data orchard_production; infile datalines; input orchardName $20. @22 harvestDate date9. @32 harvest 3.; datalines; Village Fields 10sep2007 25 Village Fields 15sep2007 30 Village Fields 27sep2007 35 Village Fields 12sep2008 27 Village Fields 17sep2008 28 Village Fields 27sep2008 28 Village Fields 10oct2008 30 Red Barn Orchard 15sep2007 45 Red Barn Orchard 21sep2007 32 Red Barn Orchard 28sep2007 35 Red Barn Orchard 08oct2007 45 Red Barn Orchard 17oct2007 40 Red Barn Orchard 12sep2008 27 Red Barn Orchard 17sep2008 28 Red Barn Orchard 27sep2008 28 Red Barn Orchard 10oct2008 19 ; run; Title "Apple Orchard Production"; proc tabulate data=orchard_production; class orchardName harvestDate; var harvest; format harvestDate year4.; tables (orchardName="Orchard Name" all="Total")*(n="Harvests"*format=8. sum="Total Bushels"*format=8. mean="Average Bushels"*format=8.1), harvestDate="Season"*harvest="" /rts=38 box="Apple orchard harvest statistics for last two years"; run;
Regression and Analysis of Variance
Proc ANOVA is used for analysis of variance, this is particularly useful for analyzing and testing experimental data. Class variable define the treatment that each observation is in. The response variable is the dependent variable in the model. The class variables are the independent variables in the model. Several tests can be performed including means tests and tests of interactions. Each class variable must have the same number of observations or a warning is printed on the output. These are known as equal cell size.
proc anova; class shipper; model spoilage=shipper; means shipper/scheffe duncan; run;
Proc Autoreg is for building regression models for timeseries data when the error terms (residuals) are autocorrelated or heteroskadastic over time. This is a common problem in econometric models. Autoreg lets you build a model that may give more forecasting power than regression alone.
Proc GLM, generalized linear models, is a combination of regression, and analysis of variance. It allows the use of continuous variables and categorical variables (treatments). GLM also relaxes the requirement in ANOVA that all cells be of equal size. GLM and ANOVA have many of the same tests available to them.
proc glm data=apple; class shipment_vendor; model spoilage=shipment_vendor quantity shipping_time; run;
Proc Logistic models the logit, or log of the probability of an event. In logistic regression, the dependent variable is descrete. It may be binary, ordinal or nominal. Since the dependent variable is discrete, this breaks the homoskedasticity assumption for regression. This assumption states that the dependent variable should have constant variance over the range of the independent variables. The logit transformation allows regression to be used to build the model.
proc logistic data=apple_delivery; model gift_card_Purchased=age gender purchReason; run;
Proc Mixed is an even more generalized procedure than GLM. It allows for multiple dependent varaibles and the modeling and testing of structures in the error term. Proc Mixed also allows for the regression coefficents to be random effects as opposed to fixed parameters.
Proc NLin is used for non-linear regression. Non linear regression fits a polynomial built from the exogenous variables to model the endogenous variable.
Proc Reg performs simple and multiple regression. Parameter estimates are generated along with their significance level. Additional tests can be done on the residuals for normality. Other tests can be done to idetnify outliers.
proc reg data=apple_spoilage; model spoilage=temperature humidity travel_distance; run; quit;
Proc Stepwise uses several techniques to produce the best linear regression parameter estimates. These methods, of which stepwise is one of them, include backword, forward, maxR and minR. These techinques generally look at the incremental explanatory power of each variable in the regression data set then either bring in the highest variable (of those not in the model) or drop the lowest variable (of those already in the model). Once a certain threshold is met, the procedure stops producing new models.
proc stepwise data=apple_spoilage; model spoilage=temperature humidity travel_distance/stepwise forward backward; run; quit;
Proc PHReg is the SAS procedure for performing proportional hazards regression, also known as Cox regression due to Sir David Cox. Proportional hazards regression is a regression technique for the analysis of time-to-event data, such as the failure of a lightbulb or development of cancer. This technique allows for observations to be censored at various times during the analysis period. Proportional hazards regression is semi-parametric, estimating parameters called hazard ratios while allowing for an arbitrary baseline hazard as any non-negative function of time. Note that the asterisk in this example is not actually a comment line, but denotes the product of the time-to-event variable and the event indicator.
proc phreg data=cancerData; model timeToDiagnosis * (positiveDiagnosis=0) = age smokingStatus familyHistory; run;
Proc Genmod is used to calculate parameter estimates from semiparametric generalized estimating equations (GEEs). GEEs are not strictly semiparametric but allow for the specification of variance structures for the analysis of repeated measures or autocorrelated data. GEEs are a generalization of the generalized linear model (GLM) where the link function and variance components both are specified in an appropriate fashion for the analysis. Using an independence variance structure (by omitting the repeated option) gives parameter estimates from the GLM using the corresponding link, but variance estimates are of the "robust sandwich" type which are consistent in the case of model misspecification or heteroscedasticity. GEEs can be contrasted with the general linear mixed model (GLMM) which is likelihood based. The GLMMs are computed using the EM algorithm and are computationally intensive while the GEE is a relatively simpler algorithm which tends to converge faster and more often. Estimates from the GLMM are interpretted as applying to the individual level since they integrate out the effects due to individual specific random variation in the case of a repeated measures analysis. Estimates from the GEE are interpretted as population averaged effects because they do not explicitly estimate the distribution of the random effects.
Utilities manipulate datasets as a whole, by making copies or deleting data sets from a library or catalog. Utilities also prepare data sets to be used for by group processing where the sme analysis iid performed for each group of observations with the same value of variables specified in the by statement.
Proc datasets is a series of utilities for managing SAS datasets and performing other housekeeping chores. This includes copying, moving and deleting datasets, creating indexes and other tasks.
* delete all the datasets in work library; proc datasets library=work kill; quit;
SAS allows the creation of custom formats displaying data. This is done using Proc Format. Formats can be coded within the proc format procedure, or proc format can load a file with the format specification directly into the format library. These user formats have more uses than just displaying data. The internal hash representation allows you to build a quick way to search for observations based on their formatted value. For example, a set of customer id numbers can be given the format "FOUND". This format can be applied by searching for the formatted value, "FOUND" rather than the original id numbers.
Proc Options lets users see the current value of the options in SAS. This is useful for debugging and for manipuating option values in order change parameters.
Proc Sort sort observations by the variables in the by statement. The sorted data set can replace the original input data set or be output to a new data set. Other parameters describe whether the variables are sorted in ascending or descending order. Proc Sort can remove observations with duplicate values of the sorted variables producing a data set with only unique values.
proc sort data=apples out=apples_shipdate; by transport_vendor date_shipped; run; proc print data=apples_shipdate; by transport_vendor; id date_shipped; var quantity purchase_price total_cost spoilage spoilage_cost net_cost; sum quantity purchase_price total_cost spoilage spoilage_cost net_cost; sumby transport_vendor; pageby transport_vendor; run;
The resulting data set, apples_shipdate, has the apples sorted by the transport_vendor and the date_shipped. The input data set has not been changed. In addition, you can see that sorting the data provides information to proc print that is seful for setting up a complex report. Proc Print will be covered in a another section.
Transpose is used to convert long datasets to wide datasets. Traditionally, most analysis datasets are rectangular in shape, meaning that row and column variables and indicies can be stated explicitly. For example, a dataset may have one row (or one record) per subject per visit in a health care utilization dataset for an HMO. Occassionally, some analyses require datasets to be restructured in terms of the row and column variables. This could be so that aggregate or summary measures can be calculated, so that the dimension of the number of records is reduced, or to improve the speed of sorting, indexing, or subsetting the data.
proc transpose data=HMOvisits; by subject_id; id visit; var copay_cost; run;