For a Restaurant Pricing Policy, create a statistical model and whether to keep the cancellation charges or not.
Wednesday, March 2, 2016
Business Analytics Question
For a Restaurant Pricing Policy, create a statistical model and whether to keep the cancellation charges or not.
Base SAS Training with initial related Codes
INTRODUCTION
options formdlim = '*'; * Specifies a character to delimit page breaks in SAS output;
data FIRST; * creating a dataset by
name FIRST;
input ID NAME HEIGHT WEIGHT;
* to read the variable names;
datalines;
01 VIJAY 5 26
02 AJAY 6 28
03 RAVI 4 25
;
*The DATALINES statement is used to indicate internal data;
*The DATALINES statement must be the last statement in the
DATA step;
*All lines in the SAS program following the DATALINES
statement are considered data until SAS encounters a semicolon;
*The CARDS statement and the DATALINES statement are
synonymous;
proc print data=FIRST;
* prints the values for all variables and all observation in
a SAS dataset;
run; * End of the program;
data SECOND;
input ID NAME $ HEIGHT
WEIGHT; * The $ after Name indicates that it is a
character variable;
datalines;
01 VIJAY 5 26
02 AJAY 6 28
03 RAVI 4 25
;
proc print data=SECOND;
run;
libname SMRITHA 'C:\SAS\SMRITHA';
* specify the libref and then the location of permanent
dataset to be stored;
data SMRITHA.THREE;
* creating a permanent dataset in the specified location with
two level naming;
input ID NAME $ HEIGHT
WEIGHT;
datalines;
01 VIJAY 5 26
02 AJAY 6 28
03 RAVI 4 25
;
proc print data=SMRITHA.THREE;
run;
proc contents data=SMRITHA.THREE; * prints the description of a SAS dataset;
RUN;
IMPORTING DATA
options formdlim='.';
* INFILE;
* Use the INFILE statement to tell SAS the filename and path,
if appropriate, of the external file containing the data.
* The INFILE statement follows the DATA statement and must
precede the INPUT statement. After the INFILE keyword, the file path
and name
are enclosed in quotation marks;
data EMP_DESIG;
infile 'C:\SAS\EXTERNAL FILES\SECOND.xls';
input ID NAME $
HEIGHT WEIGHT;
RUN;
/* LIST INPUT METHOD(FREE FORMATTED INPUT)*/
*The values in the raw data file are separated by atleast one
space;
data EMP_DATA;
input EID NAME $;
datalines;
100 ram
101 krish
102 ranjith
;
run;
data EMP_DATA4;
input EID NAME $ AGE SAL; *Since the default character length is 8, therefore it
truncates the character observation value;
datalines;
101 Ram 34 10000
102 Ramakrishanan 22 15000
103 Ranjith 23 12000
;
run;
Informats are useful anytime you have non-standard data.
(Standard numeric data contain only
numerals, decimal points, minus signs, and E for scientific
notation.) Numbers with embedded
commas or dollar signs are examples of non-standard data. Other
examples include data in
|
|
data EMP_DATA5;
input EID NAME $13. AGE SAL;
*By using an informat in the input statement such as $13. to
tell SAS that the variable field is 13 characters, then SAS will
read for 13 columns whether or not there is a space in those
columns. This will inculde unwanted characters which appear after
the "NAME" on the dataline;
datalines;
101 Ram 34 10000
102 Ramakrishanan 22 15000
103 Ranjith 23 12000
;
run;
*COLON MODIFIER;
data EMP_DATA6;
input EID NAME :$13. AGE SAL;
*To read until it encounters a space, use a colon modifier in
the informat;
datalines;
101 Ram 34 10000
102 Ramakrishanan 22 15000
103 Ranjith 23 12000
;
proc print data=EMP_DATA6;
run;
data EMP_DATA7;
length name $13;
*length statement is equivalent to using informat collan
modifier;
*The length statement should always be before the input
statement;
input EID NAME $ AGE SAL;
datalines;
101 Ram 34 10000
102 Ramakrishanan 22 15000
103 Ranjith 23 12000
;
proc print data=EMP_DATA7;
run;
*COLUMN INPUT METHOD;
* To use the column input method, only if:
1. Each of the
variable values is always found in the same place in the dataline.
2. As long as the
values are character or standard numeric;
data EMP_DATA9;
input EID 1-4 NAME $ 6-12 AGE 14-15 RACE $ 17-21;
datalines;
1234 suresh 32 white
1256 kumar 32 black
1456 eeshwar 52 black
;
proc print data=EMP_DATA9;
run;
* Column input has the following advantages over list input:
1. Space are not
required between values
2. Missing values can
be left blank
3. Character data can
have embedded spaces
4. Can
skip unwanted variables;
* FORMATTED INPUT METHOD;
* 1. RELATIVE INPUT METHOD:
Use informats by
placing the informat after the variable name in the INPUT statement;
data EMP_DATA12;
input EID 4. +1 NAME $7. AGE 3. +1 RACE $5. ; * "+1" specifies to skip one character;
datalines;
1234 suresh 32 white
1256 kumar 32 black
1456 eeshwar 52 black
;
proc print data=EMP_DATA12;
run;
*2. ABSOLUTE INPUT METHOD;
* The "@n"
is the column pointers, where n is the number of columns SAS should move to.
For example @7 says
to SAS,go to Column 7 and read the data;
*The Column pointer has other uses too and can be used anytime
you want SAS to skip backwards or forwards within a dataline.
For example to skip
over unneeded data, or to read a variable twice using different informats;
data EMP_DATA14;
input @1 EID 4. @7 NAME $7. @16 AGE 3. @21 RACE $5. @27 SALARY dollar7.;
format SALARY dollar7.;
datalines;
1234 suresh 32
white $15,600
1256 kumar 32
black $18,560
1456 eeshwar 52
black $20,890
;
proc print data=EMP_DATA14;
run;
*LINE POINTER;
data ADDRESS4;
input #1 EID NAME $ AGE #2 PIN STREET $15.;
datalines;
101 suman 21
560016 church street
102 pushpita 22
560027 residency road
103 sunil 23
560018 MG road
;
proc print data=ADDRESS4;
run;
* The LINE POINTER # (a pound sign) in the INPUT statement
tells SAS which line to read when you have multiple lines of data per
observation;
*NEW LINE OPERATOR (/) or RELATIVE LINE POINTER;
*An alternative way to read the multiple data is to use a
relative line pointer, a forward slash(/) to tell SAS to skip to the
next line
of input;
data INCOME3;
input NAME $ / AGE / SALARY
;
datalines;
Ram 32
25
10000
Krish 45
26
12000
Ajay 55
27
15000
;
proc print data=INCOME3;
run;
* Delimited files are raw data files that have a special
character sperating data values. Many programs can save data as delimited
files, often with
commas or tab characters for delimiters. SAS gives two options for the infile
statement that make it easy to
read delimited data
files: the DLM=option and the DSD option;
* DLM=option;
* To read the data using list input, the DATA step expects
your file to have spaces between your data values.
The DELIMITER=, or
DLM=, option in the INFILE statement allows you to read data files with other
delimiters.
The comma and tab
characters are common delimiters found in data files, but can read data files
with any delimiter character
by just
enclosing the delimiter character in quotation marks after the DLM=option;
data EMPLOYEE_DETAIL;
infile datalines dlm=',';
input EID NAME $ AGE;
datalines;
100,Ram,73
101,Krish,23
102,suresh,62
;
proc print data= EMPLOYEE_DETAIL;
run;
* DSD Option;
* The DSD (Delimited-sensitive Data)option for the INFILE
statement does 3 things.
1. It ignores
delimiters in data values enclosed in quotation marks.
2. It does not read
quotation marks as part of the data value.
3. It reads two
delimiters as a missing value.
The DSD option
assumes that the delimiter is a comma.
If the
delimiter is not a comma then use the DLM=option with the DSD option to specify
the delimiter;
data EMPLOYEE_DETAIL5;
infile datalines dlm='%' dsd;
input EID NAME $ AGE;
datalines;
100%Ram%73
101%Krish%23
102%suresh%62
;
proc print data= EMPLOYEE_DETAIL5;
run;
* If there are comma in any of the data values, then those
values are enclosed in quotation marks;
data EMPLOYEE_DETAIL6;
infile datalines dlm=',' dsd;
input EID NAME $ AGE;
datalines;
100,"Mr.,Ram",73
101,Krish,23
102,suresh,62
;
proc print data= EMPLOYEE_DETAIL6;
run;
* MISSOVER;
* By default, SAS will go to the next data line to read more
data if SAS has reached the end of the dataline and there are still
more variables in the
INPUT statement that have not been assigned values.
The MISSOVER option
tells SAS that if it runs out of data, don't go to the next data line. Instead,
assign missing values
to any remaining
variables;
data EMPLOYEE_DETAIL11;
infile datalines missover;
input EID NAME $ AGE SAL;
datalines;
100 Ram 23 2000
101 Krish
102 Sunil 32 25000
102
102 Suresh 42 45000
102 Kumar 28 35000
;
proc print data= EMPLOYEE_DETAIL11;
run;
* FIRSTOBS;
* The firstobs=option tells SAS at what line to begin reading
data.
This is useful if you
have a datafile that contains descriptive text or header information at the
beginning,
and is also useful to
skip few data at the beginning;
* OBS;
* The OBS=option can be used anytime you want to read only a
part of your datafile.
It tells SAS to stop
reading when it gets to that line in the raw data file;
data EMPLOYEES5;
infile datalines firstobs=3 obs=6;*will start reading this file on the third dataline and stops
reading after the sixth dataline;
input DEPT $ EID SAL;
datalines;
DEPT EID SAL
CE 101 15000
TE 105 20000
RE 110 10000
TE 115 12000
TE 215 32000
RE 120 11000
CE 150 19000
RE 140 14000
CE 160 20000
RE 130 12000
TE 315 22000
;
proc print data=EMPLOYEES5;
run;
/* IMPORT Procedure; */
* TO read the delimited data files using the IMPORT
procedure;
* PROC IMPORT will scan your data file and automatically
determine the variable types(numeric or character), will asign proper
lengths to the
character variables, and can recognize some date formats.
* PROC IMPORT will treat two consecutive delimiters in your
files as a missing value, will read values enclosed by quotation marks,
and assign missing
values to variables when it runs out of data on a line.
* The IMPORT procedure actually writes a DATA
step, after submitting the program in the log window to see the DATA step it
produced.
* To read the delimited data files, then must use the
DBMS=option in the PROC IMPORT statement.
* If the specific SAS dataset name already exists then use
the REPLACE option to overwrite it.
Type of File Extension
DBMS
Identifier
Comma-delimited . csv
CSV
Tab-delimited .txt
TAB
Delimiters other than commas or tabs DLM
The IMPORT procedure, will by default get variable names from
the first line in the data file. In case, if the variable names not
be taken then add the
GETNAMES=NO statement after the proc import statement. PROC IMPORT will assign
the variables the names
VAR1, VAR2, VAR3 and
so on;
* If the datafile is type DLM, PROC IMPORT assumes that the
delimiter is a space. if it is a different delimiter then specify it in
the
DELIMITER=statement;
/*PROC IMPORT DATAFILE='filename' OUT='data-set'
DBMS= DLM REPLACE;
GETNAMES=no;
DELIMITER='delimiter-character';*/
proc import datafile='C:\SAS\EXTERNAL FILES\emp.txt'
out = EMPLOYEE_DETAILS
dbms=DLM replace;
getnames=YES;
delimiter=' ';
proc print data=EMPLOYEE_DETAILS;
run;
* Reading a excel file with more than one sheet, then specify
the name of which sheet to be read.
SHEET=name-of sheet;
proc import datafile='C:\SAS\EXTERNAL FILES\list.xls'
out= FMLIST
dbms=excel replace;
getnames=yes;
sheet=flist;
proc print data=FMLIST;
run;
proc import datatable= "emp"
out=EMPLOYEE
dbms=access replace;
database="C:\SAS\EXTERNAL
FILES\db1.mdb";
proc print data=EMPLOYEE;
run;
IF/THEN
options formdlim='~';
* CONDITIONAL EXECUTION USING IF/THEN STATEMENTS;
IF-THEN/ELSE logic takes this basic form:
IF condition THEN action;
ELSE IF condition THEN action;
ELSE action;
data ACTORS;
input NAME $ JOBCODE $ 7-15 SALARY;
datalines;
JOJO Actor I 10000
JIJO Actor II 20000
Sunil Actor III 15000
ram Actor 10000
;
proc print data=ACTORS;
run;
data JOBLEVELS;
set ACTORS;
if JOBCODE='Actor
I' OR JOBCODE='Actor
II' then JOBLEVEL='Beginner';
else if JOBCODE='Actor
III' then JOBLEVEL='Advanced';
else JOBLEVEL='Unknown';
proc print data=JOBLEVELS;
run;
IF condition
THEN DO;
action;
END;
data SALARY;
input EID SALARY;
datalines;
101 20000
102 25000
103 30000
104 35000
105 15000
106 10000
;
proc print data=SALARY;
run;
data BONUS;
set SALARY;
if SALARY>=30000 then
do
BONUS=SALARY*(10/100);
INCREMENT=SALARY*(8/100);
end;
else if SALARY<30000 & SALARY>=20000 then
do
BONUS=SALARY*(15/100);
INCREMENT=SALARY*(10/100);
end;
else
do
BONUS=SALARY*(20/100);
INCREMENT=SALARY*(100/100);
end;
proc print data=BONUS;
run;
* SUBSETTING USING IF CONDITION;
data DEPT;
infile 'C:\SAS\EXTERNAL FILES\emp2.txt';
input NO EMPNO ENAME $ JOB $
MGR SAL HIREDATE DEPTNO;
format HIREDATE date9.;
if DEPTNO=20;
proc print data=EMPLOYEE_DETAIL;
run;
PROCEDURES
* BY Statement;
* The BY statement is required for only one procedure, PROC
SORT. In PROC SORT the BY statement tells SAS how to arrange the
observations;
* PROC SORT;
* PROC SORT, sorts the data based on the values of a
variable;
/*PROC SORT is used to organize data for a report, before
combining datasets, or before using a BY statement in another PROC or
data step.
PROC SORT;
BY
variable-1....variable-n;
The variables named
in the BY statement are called BY variables. With a BY variable, SAS sorts the
data based on the values of
that variable.
The DATA=and OUT=options specify the input and output
datasets.
If the DATA=option, is not specified, then SAS will use the
most recently created SAS dataset. I
If the OUT-option is not specified, then SAS will replace the
original dataset with the newly sorted version.
*/
data EXPREV;
input REGION $ STATE $ MONTH
monyy5. EXPENSES
REVENUE;
format month monyy5.;
datalines;
Southern GA JAN95 2000
8000
Southern GA FEB95 1200
6000
Southern FL FEB95 8500 11000
Northern NY FEB95 3000
4000
Northern NY MAR95 6000
5000
Southern FL MAR95 9800 13500
Southern NY MAR95 4000
5000
Northern MA MAR95 1500
1000
Eastern GA FEB95 4000
6000
eastern GA FEB95 3000
6000
;
proc print data= EXPREV;
run;
proc sort data=EXPREV out=EXPREV2;
by EXPENSES;
format month monyy5.;
proc print data= EXPREV2;
run;
* By default SAS sorts data in ascending order, from lowest
to highestor from A to Z. To have data sorted from highest to lowest,
add the keyword
DESCENDING to the BY statement before each variable that should be sorted from
highest to lowest;
proc sort data=EXPREV out=EXPREV4;
by STATE descending EXPENSES;
format month monyy5.;
proc print data= EXPREV4;
run;
* NODUPKEY;
* The NODUPKEY option tells SAS to eliminate any duplicate
observations that have the same value for the BY variables;
proc sort data=MARINE out=SEASORT2 NODUPKEY;
by family;
proc print data=SEASORT2;
run;
|
|
* FORMAT PROCEDURE;
* Sometimes variable values are stored according to a code. The predefined SAS formats cannot help in such
cases. However the
FORMAT procedure can
be used to create user-defined formats for displaying values of variables;
* The FORMAT procedure creates formats that will later be
associated with variables in a FORMAT statement. The procedure starts
with the statement
PROC FORMAT and continues with one or more VALUE statements (other optional
statements are available);
/* PROC FORMAT;
VALUE
name range-1 = 'formatted-text-1'
range-2 = 'formatted-text-2'
.
.
.
range-n = 'formatted-text-n';
The name in the VALUE statement is the name of the format for
creating. If the format is for character data, the name must start
with a $. The name can't be the name of an existing format;
Character values must be enclosed in quotation marks. If
there is more than one variable in the range, then separate the values
with a comma or use the hyphen for a continuous range. The
keyword LOW and HIGH can be used in ranges to indicate the lowest
and the highest non-missing value for the
variable.*/
DATA carsurvey;
INPUT Age Sex Income Color
$;
DATALINES;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
31 1 44000 Y
58 2 83000 W
;
RUN;
PROC FORMAT;
VALUE gender 1 = 'Male'
2 = 'Female';
VALUE agegroup 13 -< 20 = 'Teen'
20 -< 65 = 'Adult'
65 - HIGH = 'Senior';
VALUE $col 'W' = 'Moon White'
'B' = 'Sky Blue'
'Y' = 'Sunburst Yellow'
'G' = 'Rain Cloud Gray';
RUN;
PROC PRINT DATA = carsurvey;
FORMAT Sex gender. Age agegroup. Color $col. Income DOLLAR8.;
TITLE 'Survey Results Printed with User-Defined Formats';
RUN;
* MEANS PROCEDURE;
* The MEANS procedure provides simple statistics on numeric
variables;
/*The MEANS procedure starts with the keywords PROC MEANS,
followed by options listing the statistics to be printed:
PROC
MEANS options;
If the options is not specified, MEANS will print the number
of non-missing values, the mean, the standard deviation, and the
minimum and maximum for each variable. There are
over 30 different statistics which can be requested with the MEANS procedure*/
data GRADE;
input NAME $ GENDER $ STATUS
YEAR $ SECTION $ SCORE FINALGRADE;
datalines;
Abbott F 2 97 A 90
87
Branford M 1 98 A 92
97
Crandell M 2 98 B 81
71
Dennison M 1 97 A 85
72
Edgar F 1 98 B 89
80
Faust M 1 97 B 78 73
Greeley F 2 97 A 82
91
;
proc print data=GRADE;
run;
* VAR variable-list;
* The VAR statement specifies which numeric
variables to use in the analysis;
* BY variable-list;
* The BY statement performs separate analyses for each level
of the variables in the list. The data must first be sorted in the
same
order as the variable list;
proc means data=GRADE maxdec=2;
by YEAR STATUS;
var SCORE FINALGRADE;
run;
* CLASS variable-list;
* The CLASS statement also performs separate analyses for
each level of the variables in the list, but its output is more compact
than with
the BY statement, and the data do not have to be sorted first;
proc means data=GRADE maxdec=2;
class STATUS;
var SCORE;
run;
* FREQ PROCEDURE;
* A freuqency table is a simple list of counts;
* The most obvious reason for using PROC FREQ is to create
tables showing the distribution of categorical data values, but PROC
FREQ can also reveal
irregularities in data;
* The FREQ PROCEDURE is a descriptive procedure as well as
statistical procedure. It produces one-way and n-way frequency tables,
and it concisely
describes the data by reporting the distribution of variable values. The FREQ
PROCEDURE is used to create
cross-tabulation
tables that summarize data for two or more categorical variables by showing the
number of observations for
each combination of
variable values;
data COLOR;
input REGION EYES $ HAIR $
COUNT @@;
datalines;
1 blue fair 23 1
blue red 7
1 blue medium 24
1 green dark 11 1
black fair 19 1 green red 7
1 green medium
18 1 green dark 14 1
brown fair 34
1 brown red
5 1 brown medium 41 1
brown dark 40
1 brown black
3 2 blue fair
46 2 blue red
21
2 blue medium 44 2
black dark 40 2
blue black 6
2 green fair
50 2 black red 31
2 black medium 37
2 green dark
23 2 green fair 56 2
brown red 42
2 blue medium
53 2 brown dark 54 2
brown black 13
;
proc print data=COLOR;
run;
* By default, PRO FREQ creates a one-way table with the
FREQUENCY,PERCENT, CUMULATIVE FREQUENCY and CUMULATIVE PERCENT for each
value of the
variables in a SAS data set;
proc freq data=COLOR;
run;
* To specify the variables to be processed by the FREQ
PROCEDURE, includes a TABLES statement;
proc freq data=COLOR;
tables EYES HAIR;
run;
* Options can be provided after a slash in the
TABLES statement;
proc freq data=COLOR;
tables EYES HAIR / out=FREQCNT; * Only for the
HAIR variable the dataset is created;
proc print data=FREQCNT;
run;
* To Produce a cross tabulation, list the variables separated
by an asterisk;
proc freq data=COLOR;
tables EYES * HAIR ;
run;
proc freq data=COLOR;
tables EYES * HAIR / nopercent nocum ;
run;
* SET STATEMENT;
* The SET statement in the data step allows you to read a SAS
dataset so you can add new variables, create a subset, or otherwise modify the dataset;
data EMPLOYEE;
input EID NAME $;
datalines;
100 sunil
101 ram
102 anand
103 jijo
104 vasanth
104 komal
104 praveen
105 jojo
;
proc print data=EMPLOYEE;
run;
data EMPLOYEE2;
set EMPLOYEE;
where eid < 104;
proc print data=EMPLOYEE2;
run;
* ONE-TO-ONE MERGE;
* When you want to match observations from one data set with
observations from another, use the MERGE statement in the DATA step.
Before merging the
data sets, the data sets must be sorted by the commmon variable;
/* DATA
new-data-set;
MERGE
data-set-1 data-set-2;
BY
variable list; */
data C;
input ID VAR_C $;
datalines;
1 Mary
3 Sue
5 Hilary
;
proc print data=C;
run;
data D;
input ID VAR_D $;
datalines;
1 Smith
3 Roos
4 Clinton
;
proc print data=D;
run;
proc sort data=C;
by ID;
run;
proc sort data=D;
by ID;
run;
data CD;
merge C D;
by ID;
proc print data=CD;
run;
* MANY TO ONE;
data P;
input ID VAR_P $;
datalines;
1 Mary
3 Sue
3 row
5 Hilary
;
proc print data=P;
run;
data Q;
input ID VAR_Q $;
datalines;
2 Smith
3 Roos
proc print data=Q;
run;
proc sort data=P;
by ID;
run;
proc sort data=Q;
by ID;
run;
data PQ;
merge P Q;
by ID;
proc print data=PQ;
run;
* ONE TO MANY;
data M;
input ID VAR_M $;
datalines;
1 Mary
3 Sue
5 Hilary
;
proc print data=M;
run;
data N;
input ID VAR_N $;
datalines;
2 Smith
3 Roos
3 row
proc print data=N;
run;
proc sort data=M;
by ID;
run;
proc sort data=N;
by ID;
run;
data MN;
merge M N;
by ID;
proc print data=MN;
run;
* IN=option;
* The IN=option is used to track which of the original data
sets contributed to each observation in the new data set;
* Unlike most variables, IN=variables are temporary, existing
only during the current DATA step. SAS gives the IN=variables a
value of 0 or 1. A
value of 1 means that data set did contribute to the current observation, and a
value of 0 means the
data set did not
contribute;
data ONE;
input ID NAME1 $;
datalines;
1 Mary
2 jack
3 Sue
5 Hilary
;
proc print data=ONE;
run;
data TWO;
input ID NAME2 $;
datalines;
2 Smith
3 Roos
6 jimmy
;
proc print data=TWO;
run;
proc sort data=ONE;
by ID;
run;
proc sort data=TWO;
by ID;
run;
data COMBINED;
set ONE TWO;
by ID;
proc print data=COMBINED;
run;
data COMBINED2;
set ONE(in=O) TWO(in=T);
by ID;
if O;
proc print data=COMBINED2;
run;
data COMBINED3;
set ONE(in=O) TWO(in=T);
by ID;
if T;
proc print data=COMBINED3;
run;
* IN=option in MERGE;
data COMBINED4;
merge ONE TWO;
by ID;
proc print data=COMBINED4;
run;
* INNER JOIN;
data COMBINED5;
merge ONE(in=O) TWO(in=T);
by ID;
if O & T;
proc print data=COMBINED5;
run;
*RIGHT JOIN;
data COMBINED6;
merge ONE(in=O) TWO(in=T);
by ID;
if T;
proc print data=COMBINED6;
run;
*LEFT JOIN;
data COMBINED6;
merge ONE(in=O) TWO(in=T);
by ID;
if O;
proc print data=COMBINED6;
run;
Subscribe to:
Comments (Atom)



