Data Analysis using the SAS Language
Since the birth of the World Wide Web and the growth of the Internet, a tremendous amount of data has been collected by organizations all around the world. Analysis of this data is important for decision makers, researchers and policy makers throughout industry, academia and government. The SAS Institute provides one of the leading tools for statistical analysis.
The SAS System is a suite of software products designed for accessing, analyzing and reporting on data for a wide variety of applications. It includes a programming language, the SAS language, designed to manipulate data and prepare it for analysis with the SAS procedures. SAS was originally developed in the 1970s for academic researchers by Dr. James Goodnight and colleagues at North Carolina State University. SAS includes a variety of components for accessing databases and flat, unformatted files, manipulating data, and producing graphical output for publication on web pages and other destinations. Statistical routines in SAS support everything from sales forecasting to pharmaceutical analysis and from educational/psychological testing to financial risk analysis. SAS is one of the main analytics platforms for academic research and data analysis in institutions, companies and organizations around the world.
This course has several goals, but above all it attempts to bridge the gap between data collection and analysis. The goals are to:
- introduce concepts for extracting data from a variety of sources
- manipulate data so it is properly prepared for analysis
- apply SAS procedures and interpret the output
- produce text and graphical reports for different types of media
SAS System Modules
Base SAS provides the SAS data step language and the ability to access data from several sources in a multitude of formats. Base SAS also includes several procedures for data manipulation, data set management, SQL, and descriptive statistics.
SAS/STAT (Statistics) provides the statistical horsepower for univariate and multivariate statistical modeling. These routines include linear and non-linear regression, analysis of variance, factor analysis, logistic regression and cluster analysis.
SAS/ETS (Econometric and Time Series) provides the modeling power for building econometric models and for time series analysis and forecasting. These routines include ARIMA for Box-Jenkins models, autoregression and transfer functions.
SAS/OR (Operations Research) builds constrained mathematical optimization models using linear programming, nonlinear programming, integer programming and goal programming. Problems such as network analysis and transportation problems can be set up and solved. In addition, there are new procedures which use genetic algorithms and local search optimization.
SAS/QC (Quality Control) provides tools for conducting quality control analysis. These tools include experimental design, sample selection, and process control charts.
SAS/GRAPH (Graphics) gives SAS the power to create and display graphs and business charts. These include scatter plots, 3-dimensional plots, surface and contour plots, bar charts and pie charts.
SAS/AF (Applications Framework) allows the construction of applications with an object-oriented graphical user interface. A separate language, SCL, works with the SAS language to build interactive applications.
SAS/Access is a series of interface components built for different vendor databases.
- Access to PC Files allows access to popular PC file formats, such as Microsoft Excel and Microsoft Access.
- Access to ODBC allows access to various databases which support Open Database Connectivity. A third-party database client from the vendor must be installed on your computer, and an ODBC data source name (DSN) must be defined in the system's data sources panel.
- Access to Oracle allows access to Oracle database servers. The Oracle client must be installed on your computer. SAS interfaces through the Oracle client entry for the database you are connecting to.
- Access to Sybase allows access to Sybase database servers. The Sybase client must be installed on your computer. SAS interfaces through the Sybase client entry for the database you are connecting to.
- There are other SAS/Access modules for other database vendors, including MySQL and SAP.
SAS/Connect allows SAS to connect and submit programs across platforms to other SAS installations. For example, SAS on the PC can submit code to SAS on Unix or SAS on MVS (IBM mainframe). Connect also allows for grid computing, where a problem is broken down into several independent threads which run simultaneously on SAS installations on different machines.
SAS/IML (Interactive Matrix Language) allows the construction of programs using matrix operations. Matrix programs are written efficiently in IML. This permits the construction of new statistical programs.
A SAS program consists of data steps and proc steps. Data steps manipulate data one observation at a time while proc steps perform complex operations on a complete data set. Proc steps provide the analytical horsepower of SAS by performing statistical analysis, graphics, summarizing and reporting. In order to analyze data, it must be input and prepared for the appropriate analysis. This preparation process is the responsibility of the data step.
The SAS data step language provides many of the procedural capabilities of a standard programming language such as PL/I, C, C++, or Java. Multiple data steps can link between the various proc steps. These are the building blocks for a complete SAS program.
All SAS statements must end with a semicolon (;). A statement can span several lines, and several statements can be placed on a single line. In general, SAS is not case sensitive; this applies to keywords, variable names and data set names, although character data values keep their case. The SAS language provides the input, output, loops and logic for manipulating data. SAS data sets can be saved for use by another SAS data step or procedure within the same program, or saved permanently for future use.
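As a rough illustration of the points above, the following minimal sketch pairs one data step with two proc steps. The data set name work.scores, its variables, and the values are hypothetical and chosen only for this example.

```sas
/* Hypothetical example data; names and values are assumptions. */
data work.scores;
    input name $ test1 test2;          /* one character and two numeric variables */
    average = (test1 + test2) / 2;     /* computed for each observation in turn */
    datalines;
Ann 90 80
Bob 70 85
Cara 88 92
;
run;

/* Keywords and variable names are not case sensitive. */
PROC MEANS DATA=work.scores MEAN;
    var AVERAGE;
RUN;

/* Several statements can share one line. */
proc print data=work.scores; title 'Test scores'; run;
```

The data step reads and transforms the data one observation at a time, while each proc step operates on the complete data set.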
SAS has a robust set of procedures. Base SAS includes procedures for sorting, printing, summarizing and reporting. SAS/STAT includes procedures for univariate, multivariate, regression, analysis of variance and nonparametric analysis. Options provide access to statistical heuristics and algorithms.
Program design should start with the type of analysis that needs to be performed or the type of reports that must be generated. Next it should examine the type of data available and what transformations need to be done in order to have it ready for the analysis or reporting procedure. Filling the gaps between these is what drives the programming requirements.
Programming problems should be broken down into small steps focusing on a few transformations at a time. Some programmers like to find the solution that does the most with the least code, but that method can result in confusing code, making debugging and future maintenance difficult. Useful comments also help the programmer describe the intent of each step in the program.
SAS Data Step
The data step provides a programming environment for input, output and data manipulation. The SAS language in the data step is the fundamental way to manipulate data. The data step can access SAS data files for input and permanent storage. The data step also allows SAS to interact with non-SAS data storage for both input and output. The data step identifies the name of the SAS data set to be created, how to format the data for input or output, and the logic for manipulating the data.
Manipulating the data may involve creating or calculating new fields, selecting a subset of the data, or merging multiple datasets together. The statements used in SAS are similar to the programming statements in other languages. Besides input and output, there are assignment, logic and loop statements.
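The kinds of manipulation described above might look like the following sketch. The data sets work.employees and work.salaries, their variables, and the 3% raise are assumptions made up for illustration.

```sas
/* Hypothetical input data sets; names and values are assumptions. */
data work.employees;
    input id name $ dept $;
    datalines;
1 Ann Sales
2 Bob IT
3 Cara Sales
;
run;

data work.salaries;
    input id salary;
    datalines;
1 50000
2 62000
3 58000
;
run;

/* Both inputs are already in id order here; otherwise sort them first. */
data work.sales_pay;
    merge work.employees work.salaries;    /* match-merge on the BY variable */
    by id;
    if dept = 'Sales';                     /* subsetting IF keeps matching observations */
    monthly = salary / 12;                 /* assignment creates a new variable */
    if salary >= 55000 then band = 'high'; /* logic statements */
    else band = 'low';
    projected = salary;
    do year = 1 to 3;                      /* loop: assumed 3% raise per year */
        projected = projected * 1.03;
    end;
    drop year;                             /* the loop counter is not needed in the output */
run;
```

Keeping the merge, the subset, and the calculations in one short step is a matter of taste; splitting them into separate data steps would work as well.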
SAS provides numerous procedures for processing large amounts of data for statistical analysis, reporting and other purposes. SAS procedures begin with the keyword proc and end with run or quit, depending on the procedure. SAS procedures are also used for managing SAS data sets, for example displaying metadata, copying or deleting data sets, and sorting observations by one or more variables. SAS also includes a procedure, proc SQL, that allows SQL (Structured Query Language) to be used on SAS data sets and on other vendors' database platforms.
Macro programming in SAS provides ways to simplify programming tasks through replication. Macro variables are a convenient way to set up constants at the start of the program that can be used throughout the program. One feature of macros is building code that can alter itself depending on the value of parameters. These parameters may be coded by the programmer at design time, or they may come from a SAS data set.
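A brief sketch of a few such procedures follows. It assumes the sashelp.class sample data set that ships with SAS is available; the output data set name work.class_sorted is hypothetical.

```sas
/* Sort observations by one or more variables. */
proc sort data=sashelp.class out=work.class_sorted;
    by sex descending height;
run;

/* Display data set metadata. */
proc contents data=work.class_sorted;
run;

/* Query a SAS data set with SQL; proc sql ends with quit rather than run. */
proc sql;
    select sex, count(*) as n, avg(height) as mean_height
        from work.class_sorted
        group by sex;
quit;
```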
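The sketch below shows one way these ideas might fit together. The macro name report, its parameters, and the macro variables dsn and minage are hypothetical; sashelp.class is assumed to be available.

```sas
/* Macro variables acting as constants for the whole program. */
%let dsn = sashelp.class;
%let minage = 13;

/* Hypothetical macro: the generated code depends on the parameters. */
%macro report(data=, var=, stats=no);
    title "Listing of &data where age >= &minage";
    proc print data=&data;
        where age >= &minage;
        var name age &var;
    run;
    %if &stats = yes %then %do;
        /* This step is generated only when stats=yes. */
        proc means data=&data mean min max;
            where age >= &minage;
            var &var;
        run;
    %end;
%mend report;

%report(data=&dsn, var=height)
%report(data=&dsn, var=weight, stats=yes)
```

The first call generates only a listing, while the second also generates a proc means step, so the same macro yields different code for different parameter values.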
Output Delivery System (ODS)
The SAS Output Delivery System (ODS) allows SAS to create output in a multitude of forms. These include Hypertext Markup Language (html), PostScript (ps), Adobe Portable Document Format (pdf), Microsoft Word (doc), and Rich Text Format (rtf). Very little of the existing code needs to change. Instead, wrapper statements are used before and after the procedures and data steps whose output you want to capture. You can even capture only certain parts of the output. These parts are called output objects. Output objects can also be captured as a SAS data set and used in subsequent procedures and data steps.
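As a minimal sketch of the wrapper approach, the example below sends one output object from proc univariate to a PDF file and also saves it as a data set. The file name class_report.pdf and the data set name work.measures are assumptions, and the session is assumed to have write access to its current directory.

```sas
ods pdf file='class_report.pdf';          /* hypothetical output file */
ods select BasicMeasures;                 /* keep only this output object */
ods output BasicMeasures=work.measures;   /* also save it as a SAS data set */

proc univariate data=sashelp.class;
    var height;
run;

ods pdf close;                            /* closing wrapper statement */

/* The captured output object can feed later steps. */
proc print data=work.measures;
run;
```

The proc univariate step itself is unchanged; only the ODS statements around it decide what is captured and where it goes.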
Look here for examples and code snippets. Users are also invited to submit their own examples to the SAS examples page.
- SAS Institute Home Page
- SAS Support including knowledge base and documentation
- sasCommunity.org worldwide SAS users collaborative wiki
- The SAS language does not provide the object-oriented capabilities found in most modern languages, although it does come with an optional add-in, SCL, which provides an object-oriented environment for constructing graphical user interfaces as front ends to complex SAS applications and dashboards.