Business Analytics With SAS

INTRODUCTION

options formdlim = '*'; * Specifies a character to delimit page breaks in SAS output;

data FIRST; * creating a dataset by name FIRST;

input ID NAME HEIGHT WEIGHT; * to read the variable names;

datalines;

01 VIJAY 5 26

02 AJAY 6 28

03 RAVI 4 25

;

*The DATALINES statement is used to indicate internal data;

*The DATALINES statement must be the last statement in the DATA step;

*All lines in the SAS program following the DATALINES statement are considered data until SAS encounters a semicolon;

*The CARDS statement and the DATALINES statement are synonymous;

proc print data=FIRST;

* prints the values for all variables and all observation in a SAS dataset;

run; * End of the program;

data SECOND;

input ID NAME $ HEIGHT WEIGHT; * The $ after Name indicates that it is a character variable;

datalines;

01 VIJAY 5 26

02 AJAY 6 28

03 RAVI 4 25

;

proc print data=SECOND;

run;

libname SMRITHA 'C:\SAS\SMRITHA';

* specify the libref and then the location of permanent dataset to be stored;

data SMRITHA.THREE;

* creating a permanent dataset in the specified location with two level naming;

input ID NAME $ HEIGHT WEIGHT;

datalines;

01 VIJAY 5 26

02 AJAY 6 28

03 RAVI 4 25

;

proc print data=SMRITHA.THREE;

run;

proc contents data=SMRITHA.THREE; * prints the description of a SAS dataset;

RUN;

IMPORTING DATA

options formdlim='.';

* INFILE;

* Use the INFILE statement to tell SAS the filename and path, if appropriate, of the external file containing the data.

* The INFILE statement follows the DATA statement and must precede the INPUT statement. After the INFILE keyword, the file path

and name are enclosed in quotation marks;

data EMP_DESIG;

infile 'C:\SAS\EXTERNAL FILES\SECOND.xls';

input ID NAME $ HEIGHT WEIGHT;

RUN;

/* LIST INPUT METHOD(FREE FORMATTED INPUT)*/

*The values in the raw data file are separated by atleast one space;

data EMP_DATA;

input EID NAME $;

datalines;

100 ram

101 krish

102 ranjith

;

run;

data EMP_DATA4;

input EID NAME $ AGE SAL; *Since the default character length is 8, therefore it truncates the character observation value;

datalines;

101 Ram 34 10000

102 Ramakrishanan 22 15000

103 Ranjith 23 12000

;

run;

Informats are useful anytime you have non-standard data. (Standard numeric data contain only

numerals, decimal points, minus signs, and E for scientific notation.) Numbers with embedded

commas or dollar signs are examples of non-standard data. Other examples include data in

hexadecimal or packed decimal formats. SAS has informats for reading these types of data as well.

data EMP_DATA5;

input EID NAME $13. AGE SAL;

*By using an informat in the input statement such as $13. to tell SAS that the variable field is 13 characters, then SAS will

read for 13 columns whether or not there is a space in those columns. This will inculde unwanted characters which appear after

the "NAME" on the dataline;

datalines;

101 Ram 34 10000

102 Ramakrishanan 22 15000

103 Ranjith 23 12000

;

run;

*COLON MODIFIER;

data EMP_DATA6;

input EID NAME :$13. AGE SAL;

*To read until it encounters a space, use a colon modifier in the informat;

datalines;

101 Ram 34 10000

102 Ramakrishanan 22 15000

103 Ranjith 23 12000

;

proc print data=EMP_DATA6;

run;

data EMP_DATA7;

length name $13;

*length statement is equivalent to using informat collan modifier;

*The length statement should always be before the input statement;

input EID NAME $ AGE SAL;

datalines;

101 Ram 34 10000

102 Ramakrishanan 22 15000

103 Ranjith 23 12000

;

proc print data=EMP_DATA7;

run;

*COLUMN INPUT METHOD;

* To use the column input method, only if:

1. Each of the variable values is always found in the same place in the dataline.

2. As long as the values are character or standard numeric;

data EMP_DATA9;

input EID 1-4 NAME $ 6-12 AGE 14-15 RACE $ 17-21;

datalines;

1234 suresh 32 white

1256 kumar 32 black

1456 eeshwar 52 black

;

proc print data=EMP_DATA9;

run;

* Column input has the following advantages over list input:

1. Space are not required between values

2. Missing values can be left blank

3. Character data can have embedded spaces

4. Can skip unwanted variables;

* FORMATTED INPUT METHOD;

* 1. RELATIVE INPUT METHOD:

Use informats by placing the informat after the variable name in the INPUT statement;

data EMP_DATA12;

input EID 4. +1 NAME $7. AGE 3. +1 RACE $5. ; * "+1" specifies to skip one character;

datalines;

1234 suresh 32 white

1256 kumar 32 black

1456 eeshwar 52 black

;

proc print data=EMP_DATA12;

run;

*2. ABSOLUTE INPUT METHOD;

* The "@n" is the column pointers, where n is the number of columns SAS should move to.

For example @7 says to SAS,go to Column 7 and read the data;

*The Column pointer has other uses too and can be used anytime you want SAS to skip backwards or forwards within a dataline.

For example to skip over unneeded data, or to read a variable twice using different informats;

data EMP_DATA14;

input @1 EID 4. @7 NAME $7. @16 AGE 3. @21 RACE $5. @27 SALARY dollar7.;

format SALARY dollar7.;

datalines;

1234 suresh 32 white $15,600

1256 kumar 32 black $18,560

1456 eeshwar 52 black $20,890

;

proc print data=EMP_DATA14;

run;

*LINE POINTER;

data ADDRESS4;

input #1 EID NAME $ AGE #2 PIN STREET $15.;

datalines;

101 suman 21

560016 church street

102 pushpita 22

560027 residency road

103 sunil 23

560018 MG road

;

proc print data=ADDRESS4;

run;

* The LINE POINTER # (a pound sign) in the INPUT statement tells SAS which line to read when you have multiple lines of data per

observation;

*NEW LINE OPERATOR (/) or RELATIVE LINE POINTER;

*An alternative way to read the multiple data is to use a relative line pointer, a forward slash(/) to tell SAS to skip to the

next line of input;

data INCOME3;

input NAME $ / AGE / SALARY ;

datalines;

Ram 32

10000

Krish 45

12000

Ajay 55

15000

;

proc print data=INCOME3;

run;

* Delimited files are raw data files that have a special character sperating data values. Many programs can save data as delimited

files, often with commas or tab characters for delimiters. SAS gives two options for the infile statement that make it easy to

read delimited data files: the DLM=option and the DSD option;

* DLM=option;

* To read the data using list input, the DATA step expects your file to have spaces between your data values.

The DELIMITER=, or DLM=, option in the INFILE statement allows you to read data files with other delimiters.

The comma and tab characters are common delimiters found in data files, but can read data files with any delimiter character

by just enclosing the delimiter character in quotation marks after the DLM=option;

data EMPLOYEE_DETAIL;

infile datalines dlm=',';

input EID NAME $ AGE;

datalines;

100,Ram,73

101,Krish,23

102,suresh,62

;

proc print data= EMPLOYEE_DETAIL;

run;

* DSD Option;

* The DSD (Delimited-sensitive Data)option for the INFILE statement does 3 things.

1. It ignores delimiters in data values enclosed in quotation marks.

2. It does not read quotation marks as part of the data value.

3. It reads two delimiters as a missing value.

The DSD option assumes that the delimiter is a comma.

If the delimiter is not a comma then use the DLM=option with the DSD option to specify the delimiter;

data EMPLOYEE_DETAIL5;

infile datalines dlm='%' dsd;

input EID NAME $ AGE;

datalines;

100%Ram%73

101%Krish%23

102%suresh%62

;

proc print data= EMPLOYEE_DETAIL5;

run;

* If there are comma in any of the data values, then those values are enclosed in quotation marks;

data EMPLOYEE_DETAIL6;

infile datalines dlm=',' dsd;

input EID NAME $ AGE;

datalines;

100,"Mr.,Ram",73

101,Krish,23

102,suresh,62

;

proc print data= EMPLOYEE_DETAIL6;

run;

* MISSOVER;

* By default, SAS will go to the next data line to read more data if SAS has reached the end of the dataline and there are still

more variables in the INPUT statement that have not been assigned values.

The MISSOVER option tells SAS that if it runs out of data, don't go to the next data line. Instead, assign missing values

to any remaining variables;

data EMPLOYEE_DETAIL11;

infile datalines missover;

input EID NAME $ AGE SAL;

datalines;

100 Ram 23 2000

101 Krish

102 Sunil 32 25000

102

102 Suresh 42 45000

102 Kumar 28 35000

;

proc print data= EMPLOYEE_DETAIL11;

run;

* FIRSTOBS;

* The firstobs=option tells SAS at what line to begin reading data.

This is useful if you have a datafile that contains descriptive text or header information at the beginning,

and is also useful to skip few data at the beginning;

* OBS;

* The OBS=option can be used anytime you want to read only a part of your datafile.

It tells SAS to stop reading when it gets to that line in the raw data file;

data EMPLOYEES5;

infile datalines firstobs=3 obs=6;*will start reading this file on the third dataline and stops reading after the sixth dataline;

input DEPT $ EID SAL;

datalines;

DEPT EID SAL

CE 101 15000

TE 105 20000

RE 110 10000

TE 115 12000

TE 215 32000

RE 120 11000

CE 150 19000

RE 140 14000

CE 160 20000

RE 130 12000

TE 315 22000

;

proc print data=EMPLOYEES5;

run;

/* IMPORT Procedure; */

* TO read the delimited data files using the IMPORT procedure;

* PROC IMPORT will scan your data file and automatically determine the variable types(numeric or character), will asign proper

lengths to the character variables, and can recognize some date formats.

* PROC IMPORT will treat two consecutive delimiters in your files as a missing value, will read values enclosed by quotation marks,

and assign missing values to variables when it runs out of data on a line.

* The IMPORT procedure actually writes a DATA step, after submitting the program in the log window to see the DATA step it produced.

* To read the delimited data files, then must use the DBMS=option in the PROC IMPORT statement.

* If the specific SAS dataset name already exists then use the REPLACE option to overwrite it.

Type of File Extension DBMS Identifier

Comma-delimited . csv CSV

Tab-delimited .txt TAB

Delimiters other than commas or tabs DLM

The IMPORT procedure, will by default get variable names from the first line in the data file. In case, if the variable names not

be taken then add the GETNAMES=NO statement after the proc import statement. PROC IMPORT will assign the variables the names

VAR1, VAR2, VAR3 and so on;

* If the datafile is type DLM, PROC IMPORT assumes that the delimiter is a space. if it is a different delimiter then specify it in

the DELIMITER=statement;

/*PROC IMPORT DATAFILE='filename' OUT='data-set'

DBMS= DLM REPLACE;

GETNAMES=no;

DELIMITER='delimiter-character';*/

proc import datafile='C:\SAS\EXTERNAL FILES\emp.txt' out = EMPLOYEE_DETAILS

dbms=DLM replace;

getnames=YES;

delimiter=' ';

proc print data=EMPLOYEE_DETAILS;

run;

* Reading a excel file with more than one sheet, then specify the name of which sheet to be read.

SHEET=name-of sheet;

proc import datafile='C:\SAS\EXTERNAL FILES\list.xls' out= FMLIST

dbms=excel replace;

getnames=yes;

sheet=flist;

proc print data=FMLIST;

run;

proc import datatable= "emp"

out=EMPLOYEE

dbms=access replace;

database="C:\SAS\EXTERNAL FILES\db1.mdb";

proc print data=EMPLOYEE;

run;

IF/THEN

options formdlim='~';

* CONDITIONAL EXECUTION USING IF/THEN STATEMENTS;

IF-THEN/ELSE logic takes this basic form:

IF condition THEN action;

ELSE IF condition THEN action;

ELSE action;

data ACTORS;

input NAME $ JOBCODE $ 7-15 SALARY;

datalines;

JOJO Actor I 10000

JIJO Actor II 20000

Sunil Actor III 15000

ram Actor 10000

;

proc print data=ACTORS;

run;

data JOBLEVELS;

set ACTORS;

if JOBCODE='Actor I' OR JOBCODE='Actor II' then JOBLEVEL='Beginner';

else if JOBCODE='Actor III' then JOBLEVEL='Advanced';

else JOBLEVEL='Unknown';

proc print data=JOBLEVELS;

run;

IF condition THEN DO;

action;

END;

data SALARY;

input EID SALARY;

datalines;

101 20000

102 25000

103 30000

104 35000

105 15000

106 10000

;

proc print data=SALARY;

run;

data BONUS;

set SALARY;

if SALARY>=30000 then

BONUS=SALARY*(10/100);

INCREMENT=SALARY*(8/100);

end;

else if SALARY<30000 & SALARY>=20000 then

BONUS=SALARY*(15/100);

INCREMENT=SALARY*(10/100);

end;

else

BONUS=SALARY*(20/100);

INCREMENT=SALARY*(100/100);

end;

proc print data=BONUS;

run;

* SUBSETTING USING IF CONDITION;

data DEPT;

infile 'C:\SAS\EXTERNAL FILES\emp2.txt';

input NO EMPNO ENAME $ JOB $ MGR SAL HIREDATE DEPTNO;

format HIREDATE date9.;

if DEPTNO=20;

proc print data=EMPLOYEE_DETAIL;

run;

PROCEDURES

* BY Statement;

* The BY statement is required for only one procedure, PROC SORT. In PROC SORT the BY statement tells SAS how to arrange the

observations;

* PROC SORT;

* PROC SORT, sorts the data based on the values of a variable;

/*PROC SORT is used to organize data for a report, before combining datasets, or before using a BY statement in another PROC or

data step.

PROC SORT;

BY variable-1....variable-n;

The variables named in the BY statement are called BY variables. With a BY variable, SAS sorts the data based on the values of

that variable.

The DATA=and OUT=options specify the input and output datasets.

If the DATA=option, is not specified, then SAS will use the most recently created SAS dataset. I

If the OUT-option is not specified, then SAS will replace the original dataset with the newly sorted version. */

data EXPREV;

input REGION $ STATE $ MONTH monyy5. EXPENSES REVENUE;

format month monyy5.;

datalines;

Southern GA JAN95 2000 8000

Southern GA FEB95 1200 6000

Southern FL FEB95 8500 11000

Northern NY FEB95 3000 4000

Northern NY MAR95 6000 5000

Southern FL MAR95 9800 13500

Southern NY MAR95 4000 5000

Northern MA MAR95 1500 1000

Eastern GA FEB95 4000 6000

eastern GA FEB95 3000 6000

;

proc print data= EXPREV;

run;

proc sort data=EXPREV out=EXPREV2;

by EXPENSES;

format month monyy5.;

proc print data= EXPREV2;

run;

* By default SAS sorts data in ascending order, from lowest to highestor from A to Z. To have data sorted from highest to lowest,

add the keyword DESCENDING to the BY statement before each variable that should be sorted from highest to lowest;

proc sort data=EXPREV out=EXPREV4;

by STATE descending EXPENSES;

format month monyy5.;

proc print data= EXPREV4;

run;

* NODUPKEY;

* The NODUPKEY option tells SAS to eliminate any duplicate observations that have the same value for the BY variables;

proc sort data=MARINE out=SEASORT2 NODUPKEY;

by family;

proc print data=SEASORT2;

run;

FORMATTING THE DATA AS REQUIRED USING FORMAT STATEMENT;

* FORMAT PROCEDURE;

* Sometimes variable values are stored according to a code. The predefined SAS formats cannot help in such cases. However the

FORMAT procedure can be used to create user-defined formats for displaying values of variables;

* The FORMAT procedure creates formats that will later be associated with variables in a FORMAT statement. The procedure starts

with the statement PROC FORMAT and continues with one or more VALUE statements (other optional statements are available);

/* PROC FORMAT;

VALUE name range-1 = 'formatted-text-1'

range-2 = 'formatted-text-2'

range-n = 'formatted-text-n';

The name in the VALUE statement is the name of the format for creating. If the format is for character data, the name must start

with a $. The name can't be the name of an existing format;

Character values must be enclosed in quotation marks. If there is more than one variable in the range, then separate the values

with a comma or use the hyphen for a continuous range. The keyword LOW and HIGH can be used in ranges to indicate the lowest

and the highest non-missing value for the variable.*/

DATA carsurvey;

INPUT Age Sex Income Color $;

DATALINES;

19 1 14000 Y

45 1 65000 G

72 2 35000 B

31 1 44000 Y

58 2 83000 W

;

RUN;

PROC FORMAT;

VALUE gender 1 = 'Male'

2 = 'Female';

VALUE agegroup 13 -< 20 = 'Teen'

20 -< 65 = 'Adult'

65 - HIGH = 'Senior';

VALUE $col 'W' = 'Moon White'

'B' = 'Sky Blue'

'Y' = 'Sunburst Yellow'

'G' = 'Rain Cloud Gray';

RUN;

PROC PRINT DATA = carsurvey;

FORMAT Sex gender. Age agegroup. Color $col. Income DOLLAR8.;

TITLE 'Survey Results Printed with User-Defined Formats';

RUN;

* MEANS PROCEDURE;

* The MEANS procedure provides simple statistics on numeric variables;

/*The MEANS procedure starts with the keywords PROC MEANS, followed by options listing the statistics to be printed:

PROC MEANS options;

If the options is not specified, MEANS will print the number of non-missing values, the mean, the standard deviation, and the

minimum and maximum for each variable. There are over 30 different statistics which can be requested with the MEANS procedure*/

data GRADE;

input NAME $ GENDER $ STATUS YEAR $ SECTION $ SCORE FINALGRADE;

datalines;

Abbott F 2 97 A 90 87

Branford M 1 98 A 92 97

Crandell M 2 98 B 81 71

Dennison M 1 97 A 85 72

Edgar F 1 98 B 89 80

Faust M 1 97 B 78 73

Greeley F 2 97 A 82 91

;

proc print data=GRADE;

run;

* VAR variable-list;

* The VAR statement specifies which numeric variables to use in the analysis;

* BY variable-list;

* The BY statement performs separate analyses for each level of the variables in the list. The data must first be sorted in the

same order as the variable list;

proc means data=GRADE maxdec=2;

by YEAR STATUS;

var SCORE FINALGRADE;

run;

* CLASS variable-list;

* The CLASS statement also performs separate analyses for each level of the variables in the list, but its output is more compact

than with the BY statement, and the data do not have to be sorted first;

proc means data=GRADE maxdec=2;

class STATUS;

var SCORE;

run;

* FREQ PROCEDURE;

* A freuqency table is a simple list of counts;

* The most obvious reason for using PROC FREQ is to create tables showing the distribution of categorical data values, but PROC

FREQ can also reveal irregularities in data;

* The FREQ PROCEDURE is a descriptive procedure as well as statistical procedure. It produces one-way and n-way frequency tables,

and it concisely describes the data by reporting the distribution of variable values. The FREQ PROCEDURE is used to create

cross-tabulation tables that summarize data for two or more categorical variables by showing the number of observations for

each combination of variable values;

data COLOR;

input REGION EYES $ HAIR $ COUNT @@;

datalines;

1 blue fair 23 1 blue red 7 1 blue medium 24

1 green dark 11 1 black fair 19 1 green red 7

1 green medium 18 1 green dark 14 1 brown fair 34

1 brown red 5 1 brown medium 41 1 brown dark 40

1 brown black 3 2 blue fair 46 2 blue red 21

2 blue medium 44 2 black dark 40 2 blue black 6

2 green fair 50 2 black red 31 2 black medium 37

2 green dark 23 2 green fair 56 2 brown red 42

2 blue medium 53 2 brown dark 54 2 brown black 13

;

proc print data=COLOR;

run;

* By default, PRO FREQ creates a one-way table with the FREQUENCY,PERCENT, CUMULATIVE FREQUENCY and CUMULATIVE PERCENT for each

value of the variables in a SAS data set;

proc freq data=COLOR;

run;

* To specify the variables to be processed by the FREQ PROCEDURE, includes a TABLES statement;

proc freq data=COLOR;

tables EYES HAIR;

run;

* Options can be provided after a slash in the TABLES statement;

proc freq data=COLOR;

tables EYES HAIR / out=FREQCNT; * Only for the HAIR variable the dataset is created;

proc print data=FREQCNT;

run;

* To Produce a cross tabulation, list the variables separated by an asterisk;

proc freq data=COLOR;

tables EYES * HAIR ;

run;

proc freq data=COLOR;

tables EYES * HAIR / nopercent nocum ;

run;

* SET STATEMENT;

* The SET statement in the data step allows you to read a SAS dataset so you can add new variables, create a subset, or otherwise modify the dataset;

data EMPLOYEE;

input EID NAME $;

datalines;

100 sunil

101 ram

102 anand

103 jijo

104 vasanth

104 komal

104 praveen

105 jojo

;

proc print data=EMPLOYEE;

run;

data EMPLOYEE2;

set EMPLOYEE;

where eid < 104;

proc print data=EMPLOYEE2;

run;

* ONE-TO-ONE MERGE;

* When you want to match observations from one data set with observations from another, use the MERGE statement in the DATA step.

Before merging the data sets, the data sets must be sorted by the commmon variable;

/* DATA new-data-set;

MERGE data-set-1 data-set-2;

BY variable list; */

data C;

input ID VAR_C $;

datalines;

1 Mary

3 Sue

5 Hilary

;

proc print data=C;

run;

data D;

input ID VAR_D $;

datalines;

1 Smith

3 Roos

4 Clinton

;

proc print data=D;

run;

proc sort data=C;

by ID;

run;

proc sort data=D;

by ID;

run;

data CD;

merge C D;

by ID;

proc print data=CD;

run;

* MANY TO ONE;

data P;

input ID VAR_P $;

datalines;

1 Mary

3 Sue

3 row

5 Hilary

;

proc print data=P;

run;

data Q;

input ID VAR_Q $;

datalines;

2 Smith

3 Roos

proc print data=Q;

run;

proc sort data=P;

by ID;

run;

proc sort data=Q;

by ID;

run;

data PQ;

merge P Q;

by ID;

proc print data=PQ;

run;

* ONE TO MANY;

data M;

input ID VAR_M $;

datalines;

1 Mary

3 Sue

5 Hilary

;

proc print data=M;

run;

data N;

input ID VAR_N $;

datalines;

2 Smith

3 Roos

3 row

proc print data=N;

run;

proc sort data=M;

by ID;

run;

proc sort data=N;

by ID;

run;

data MN;

merge M N;

by ID;

proc print data=MN;

run;

* IN=option;

* The IN=option is used to track which of the original data sets contributed to each observation in the new data set;

* Unlike most variables, IN=variables are temporary, existing only during the current DATA step. SAS gives the IN=variables a

value of 0 or 1. A value of 1 means that data set did contribute to the current observation, and a value of 0 means the

data set did not contribute;

data ONE;

input ID NAME1 $;

datalines;

1 Mary

2 jack

3 Sue

5 Hilary

;

proc print data=ONE;

run;

data TWO;

input ID NAME2 $;

datalines;

2 Smith

3 Roos

6 jimmy

;

proc print data=TWO;

run;

proc sort data=ONE;

by ID;

run;

proc sort data=TWO;

by ID;

run;

data COMBINED;

set ONE TWO;

by ID;

proc print data=COMBINED;

run;

data COMBINED2;

set ONE(in=O) TWO(in=T);

by ID;

if O;

proc print data=COMBINED2;

run;

data COMBINED3;

set ONE(in=O) TWO(in=T);

by ID;

if T;

proc print data=COMBINED3;

run;

* IN=option in MERGE;

data COMBINED4;

merge ONE TWO;

by ID;

proc print data=COMBINED4;

run;

* INNER JOIN;

data COMBINED5;

merge ONE(in=O) TWO(in=T);

by ID;

if O & T;

proc print data=COMBINED5;

run;

*RIGHT JOIN;

data COMBINED6;

merge ONE(in=O) TWO(in=T);

by ID;

if T;

proc print data=COMBINED6;

run;

*LEFT JOIN;

data COMBINED6;

merge ONE(in=O) TWO(in=T);

by ID;

if O;

proc print data=COMBINED6;

run;

Business Analytics With SAS

Friday, April 8, 2016

some questions

Wednesday, March 2, 2016

Business Analytics Question

Base SAS Training with initial related Codes