Monday, October 06, 2008

deleting specific records in an indexed dataset

Several methods can be used for deleting specific records from an
indexed dataset.
Here I compare PROC SQL and a data step with MODIFY. The records to be
deleted are in the james_existing_ctt dataset.
The data step is much faster.


proc sql;
delete from cma.cma_src_james
where id_cttmmt in (select id_cttmmt from james_existing_ctt)
;
quit;


NOTE: 7577 rows were deleted from CMA.CMA_SRC_JAMES.

10764 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 12:10.22
cpu time 7:49.40

data cma.CMA_SRC_JAMES;
   set james_existing_ctt;   /* transaction dataset with the keys to delete */

   modify cma.CMA_SRC_JAMES key=ID_CTTMMT;   /* indexed lookup in the master dataset */
   select (_IORC_);
      when (%SYSRC(_SOK)) do;      /* key found: delete the master record */
         remove;
         _iorc_=0;
      end;
      when (%SYSRC(_DSENOM)) do;   /* key not found: ignore and clear the error flag */
         * output;
         _error_=0;
      end;
      otherwise do;
         put 'ERR' 'OR: Unknown IO ';
         stop;
      end;
   end;

run;

NOTE: Data set options for master data set should be specified in
MODIFY statement.
NOTE: There were 7577 observations read from the data set
WORK.JAMES_EXISTING_CTT.
NOTE: The data set CMA.CMA_SRC_JAMES has been updated. There were 0
observations rewritten, 0
observations added and 7577 observations deleted.
NOTE: DATA statement used (Total process time):
real time 29.61 seconds
cpu time 10.23 seconds
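
A prerequisite for the KEY= lookup: the master dataset must have an
index named ID_CTTMMT. If it is missing, it can be created once with
PROC DATASETS, just like in the indexing post further down this page
(a sketch, assuming no alter/write passwords are set on the dataset):

PROC DATASETS LIBRARY=cma nolist;
   MODIFY CMA_SRC_JAMES;
   INDEX CREATE ID_CTTMMT;   /* simple index used by the KEY= option above */
QUIT;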

Friday, September 26, 2008

sql table join with a union of several tables

A join of a table with several other uniform tables can be done in a
single SQL statement. In this example a table D (transactional data)
has to be joined with a table where the products per transaction are
stored. This used to be a single table, CTT_KPR, but recently a new
line of products was added as the separate table CTT_ZKKKPR. To
integrate data from the new table with minimal effort, you can use the
UNION operator and extend the existing join. The performance does not
change significantly. The code below gives an idea of how it can look.

......

AS D LEFT JOIN
( SELECT
ID_CTTMMT
,ID_MDW
,NUMTRA
,CODKPR
,CODMRK
,CODDVG
,DATPELEND
FROM KKC.CTT_KPR
UNION
SELECT
ID_CTTMMT
,ID_MDW
,NUMTRA
,CODKPR
,CODMRK
,CODDVG
,DATPELEND
FROM KKC.CTT_ZKKKPR
) as KPR
ON
D.ID_CTTMMT = KPR.ID_CTTMMT
AND D.NUMTRA = KPR.NUMTRA
AND D.ID_MDW = KPR.ID_MDW
.....
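
For reference, here is a minimal self-contained version of the same
pattern. The name CTT_MMT for the transactional table D and the
columns selected in the outer query are hypothetical:

PROC SQL;
CREATE TABLE CTT_JOINED AS
SELECT D.*
      ,KPR.CODKPR
      ,KPR.CODMRK
FROM KKC.CTT_MMT AS D   /* hypothetical name for table D */
LEFT JOIN
 ( SELECT ID_CTTMMT, ID_MDW, NUMTRA, CODKPR, CODMRK, CODDVG, DATPELEND
   FROM KKC.CTT_KPR
   UNION
   SELECT ID_CTTMMT, ID_MDW, NUMTRA, CODKPR, CODMRK, CODDVG, DATPELEND
   FROM KKC.CTT_ZKKKPR
 ) AS KPR
ON  D.ID_CTTMMT = KPR.ID_CTTMMT
AND D.NUMTRA    = KPR.NUMTRA
AND D.ID_MDW    = KPR.ID_MDW
;
QUIT;

One design note: if the two product tables can never contain identical
rows, UNION ALL is preferable, because plain UNION removes duplicates
and pays for that with an extra sort.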

Friday, September 19, 2008

Datastep with modify and the second history table.

It is amazing how simply SAS supports different update strategies for
your data warehouse. So I decided to write a blog post about it.
I use a modified update strategy of type 1 (Ralph Kimball). The records
are replaced, but the history is kept in a separate table. Maybe it is
an update of type 4 or 5, but I am not sure. Have to check.

The dataset with insurance policies OVK_PRO is refreshed weekly with
new policies but also with existing policies that have changed, for
example the end date of the policy. The MODIFY statement uses a KEY=
option and a SET statement: an individual index-based record lookup. My
problem was that I could not fully follow the changes. I update four
fields for the existing policies: DATING=DATING_IN DATEND=DATEND_IN
DATREG=DATREG_IN KLANT_SOORT=KLANT_SOORT_IN and set the update date.
Let's call these four fields the modify fields. So in the end I can see
which records changed, but not which modify fields exactly.
For several reasons the changes of the end date of the policy are
interesting to track. However, I do not want to keep the history in the
main table because it is already too big, so an update of type 2 is not
an option. Instead I shifted to a hybrid update strategy: the MODIFY
statement (type 1) as usual, but with a second dataset
wrsas.ovk_hist_stg for the history.

With a few modifications of the original datastep it worked!

So here it is:
When the matching record is found and at least one of the four fields
is different, the program comes to the point:

do;
output wrsas.ovk_hist_stg;
....

where a copy of the original record is sent to ovk_hist_stg. After that
the fields are updated as usual. I also had to change the REPLACE
statement to
replace wrsas.OVK_PRO;
Normally REPLACE works without a dataset name, but if there are two
output datasets you have to specify one; otherwise SAS will complain
that it cannot modify the second dataset, which is obviously open for
output and not for modification.
So this is the code:

OVK_PRO - main table
OVK_PRO_UPDATE - staging table with new or modified policies
ovk_hist_stg - the historical table to track changes
OVK_KEY - composite key defined on OVK_PRO (the fields in the key
must be present in OVK_PRO_UPDATE for the lookup)
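
If OVK_KEY does not exist yet, such a composite index can be defined
once with PROC DATASETS. A sketch - the assumption that the key
consists of the four identifier fields kept from the staging table is
mine:

PROC DATASETS LIBRARY=wrsas nolist;
   MODIFY OVK_PRO;
   INDEX CREATE OVK_KEY=(PRODUCT_ID NUMOVK NUMVNRKPR NUMVNRPRD);   /* assumed key fields */
QUIT;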

%LET STEP=UPDATE OVK_PRO ;
data wrsas.OVK_PRO wrsas.ovk_hist_stg;
   set OVK_PRO_UPDATE(
      keep=PRODUCT_ID NUMOVK NUMVNRKPR NUMVNRPRD DATING DATEND DATREG KLANT_SOORT
      rename=(DATING=DATING_IN DATEND=DATEND_IN DATREG=DATREG_IN
              KLANT_SOORT=KLANT_SOORT_IN)
   );
   do while (_iorc_=0);   /* keep looking up until the key no longer matches */
      modify wrsas.OVK_PRO key=OVK_KEY;
      select (_IORC_);
         when (%SYSRC(_SOK)) do;   /* matching record found */
            if (DATING ne DATING_IN
                or DATEND ne DATEND_IN
                or DATREG ne DATREG_IN
                or KLANT_SOORT ne KLANT_SOORT_IN) then
            do;
               output wrsas.ovk_hist_stg;   /* save the old version of the record first */
               DATING=DATING_IN;
               DATEND=DATEND_IN;
               DATREG=DATREG_IN;
               KLANT_SOORT=KLANT_SOORT_IN;
               UPDATED_WHEN=date();
               replace wrsas.OVK_PRO;   /* rewrite the master record in place */
            end;
            _iorc_=0;
         end;
         when (%SYSRC(_DSENOM)) do;   /* no matching record: clear the error flag */
            _error_=0;
         end;
         otherwise do;
            put 'ERR' 'OR: Unknown IO ';
            stop;
         end;
      end;
   end;
   _iorc_=0; /* reset for next datastep iteration */
run;
%CHECK_ERROR;
%CHECK_ERROR;


Now I am looking forward to next week, to run it on fresh data and
maybe discover some weird data quality issues.


With kind regards,
Maxim

http://www.4suc6.com
Share your goals!

Wednesday, August 06, 2008

100 times faster engine behind SAS indexes

SAS datasets are stored in a very efficient data format. So how can you
make them even faster?

One of the biggest potential performance gains for Base SAS programs
hides behind the definition of indexes on your datasets. Here is a
little example.
First we select records from a pretty large table (about 20 million
records) using a filter on CODSKL. This takes 1:50.00 min.

896 proc sql;
897 create table ctt_bto as
898 select *
899 from cma.cma_src_kcs
900 where CODSKL=238
901 ;
NOTE: Table WORK.CTT_BTO created, with 223 rows and 31 columns.

902 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 1:50.00
cpu time 57.39 seconds


Now let's define a simple index on the dataset and try again. Notice
that the definition of the index takes about the same time as the
previous step. This, however, has to be done only once (unless you
destroy the index with some brutal sort step).

903 PROC DATASETS LIBRARY=cma nolist;
904 MODIFY CMA_SRC_KCS(alter=XXXXXX write=XXXXXX);
905 INDEX CREATE CODSKL;
NOTE: Simple index CODSKL has been defined.
906 QUIT;

NOTE: MODIFY was successful for CMA.CMA_SRC_KCS.DATA.
NOTE: PROCEDURE DATASETS used (Total process time):
real time 2:08.03
cpu time 1:28.96

So now, let's try the first step again.


907
908
909 proc sql;
910 create table ctt_bto as
911 select *
912 from cma.cma_src_kcs
913 where CODSKL=238
914 ;
NOTE: Table WORK.CTT_BTO created, with 223 rows and 31 columns.

915 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.67 seconds
cpu time 0.01 seconds


Wow, 0.67 seconds. Not even one second. This is an impressive increase
compared to almost 2 minutes: 1:50 min = 110 sec, so 110/0.67 = 164
times faster. Impressive, isn't it?
Well, now you do not have time anymore to drink a cup of coffee while
your programs are running...
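
By the way, you can check whether the optimizer actually picks the
index for a given WHERE clause. With the log detail level raised, SAS
writes an INFO note naming the index it selected:

options msglevel=i;   /* SAS now reports index usage in the log */
proc sql;
create table ctt_bto as
select *
from cma.cma_src_kcs
where CODSKL=238
;
quit;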

4suc6

Monday, August 04, 2008

running production sas code: error handling

While running different production SAS programs I developed an error
handling practice that I find practical, easy to implement and
extensible. So I would like to share it.

Base SAS does not have built-in job workflow management; that comes
with much more expensive extensions. So it is amazing how expensive
things come for free with a little bit of coding.

The ease comes as explained below:

Before every significant transformation step in SAS code I assign the
name of this step to the same macro variable. Just something simple but
meaningful. After the step I call the error handling macro. The name
speaks for itself.

Example:

%LET STEP=JOIN WITH HANDELINGEN - TRANSACTIONS;
PROC SQL; CREATE table CM_SRC as
SELECT
a.DATCTTMMT
,a.ID_CTTMMT
,NUMTRA
....
whatever
;
quit;

%CHECK_ERROR;

Extensibility comes from the macro. The standard macro variable STEP
allows processing all steps in a similar manner. A simple version of
the CHECK_ERROR macro writes the name of the step and the system error
code to the LOG. If the step completes without an error the code is
0. In case of a warning SAS assigns the system error code a specific
value (I do not remember exactly which).
Let us take a look at the simple version of CHECK_ERROR.

%macro CHECK_ERROR(mstep=&STEP,merr=&SYSERR);
%put &MSTEP &MERR &SYSMSG;
%mend;

As you can see, the macro outputs the name of the step and two SAS
standard macro variables, SYSERR and SYSMSG. In the log it is going to
look like this:


JOIN WITH HANDELINGEN - TRANSACTIONS 0

NOTE: There were 125142 observations read from the data set WORK.CMA_SRC_KCS.
NOTE: The data set WORK.CMA_SRC_KCS has 125142 observations and 22 variables.
NOTE: Compressing data set WORK.CMA_SRC_KCS decreased size by 25.06 percent.
Compressed is 1770 pages; un-compressed would require 2362 pages.
NOTE: PROCEDURE SORT used (Total process time):
real time 3.30 seconds
cpu time 0.73 seconds


As you can see there was no error, because the code is 0. Otherwise
this would be something >0.

This version of the macro makes it easier to find steps in the log
and analyse them. You can also check the error codes afterwards. As I
said, this is a basic version.
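
One possible extension is to stop a batch job as soon as a step fails,
so later steps do not run on bad data. A sketch - the threshold of 4
(the value SYSERR typically gets when only a warning was issued) and
the batch-mode context are assumptions:

%macro CHECK_ERROR(mstep=&STEP,merr=&SYSERR);
%put &MSTEP &MERR &SYSMSG;
%if &MERR > 4 %then %do;
   %put step &MSTEP ended with SYSERR=&MERR - stopping the job;
   data _null_;
      abort abend;   /* ends the SAS session with an error condition code */
   run;
%end;
%mend;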

I can think of making a mini Access database and, instead of pumping
messages via the log, sending them as insert transactions into the
Access database, and then making a report from it. If anyone wants to
continue work on that and share ideas - be my guest :)

Maxim

Tuesday, July 29, 2008

Computing YTD with SAS

A YTD (year to date) computation method with Base SAS

In one of the weekly reports that I make, the client asked to add a
year-to-date (YTD) sum for all performance metrics. This is the sum
over all weeks starting from week 1. So in week 5 the YTD for a
metric N is N in week 1 + N in week 2 + ... + N in week 5.
This is something that SAS does not do automatically, and there are
several ways to add this information at the front end. Option 1 is to
add a YTD metric for every metric that has to be YTDated. Option 2 is
to add a single variable SCALE - a mini dimension with two values, W
and Y.

We will mark all existing records (observations) in the report with W,
which means weekly. Additionally we will add a single YTD record per
report row with the value Y (=YTD). Notice that this record should be
added in the report data, and I will show how this can be done. So
basically we extend the dataset with extra observations that carry the
YTD information. This extended dataset is then used for reporting in
SAS or Excel. Using the added dimension SCALE we can split all metrics
into weekly or YTD column groups.

Now I will explain the SAS code.

proc sort data=OVK_SUM;
by ISOJAAR CODMRK KLANT_SOORT RICHTING WR_CAT ISOWEEK;
run;

First we need to sort the report data OVK_SUM by the variables that are
used to group our metrics. In this case these are year, brand code,
sort of client, direction, product category and week (respectively).
Notice that the week has to be the last in the list. The order of the
other group variables is not important.

data OVK_SUM;
   set OVK_SUM;
   by ISOJAAR CODMRK KLANT_SOORT RICHTING WR_CAT ISOWEEK;
   if first.WR_CAT then do; N_YTD=0; PREM_YTD=0; end;   /* new group: reset the accumulators */
   N_YTD+N; PREM_YTD+PREMIE;   /* sum statements: the values are retained */
   SCALE="W";
   OUTPUT;                     /* the regular weekly observation */
   if LAST.ISOWEEK then do;    /* last record of the week: emit the YTD observation */
      PREMIE=PREM_YTD; N=N_YTD; SCALE="Y"; OUTPUT;
   end;
   drop PREM_YTD N_YTD;
run;

In this datastep we compute the YTD per metric. The BY variables have
to be in the same order as in the PROC SORT. At the first value of the
variable before the week (in this case WR_CAT) we set the temporary
YTD metrics N_YTD and PREM_YTD to zero. On the next rows we increment
them with sum statements, so SAS knows that they have to be retained
(N_YTD+N increments the value of N_YTD with the metric we accumulate,
and the incremented value of N_YTD is carried over to the next
observation).
SCALE="W" updates the value of our mini dimension, and we are ready to
output the observation.
The next line, if last.isoweek....., is the main trick of this
algorithm. If the observation is the last in the group, we make an
additional output of our YTD metrics. So we set our metrics equal to
the computed YTD metrics, but to distinguish them from each other we
set SCALE="Y". Because there is a single YTD record per group, the
YTD values will be properly summed up.
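
To make the pattern concrete, this is what the extended dataset looks
like for one group; the numbers and the WR_CAT value are made up for
illustration:

WR_CAT  ISOWEEK   N  PREMIE  SCALE
AUTO       1     10    100     W
AUTO       1     10    100     Y
AUTO       2      5     50     W
AUTO       2     15    150     Y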

Limitations of this method:
YTD is relevant only for the group of variables we selected. If the
group is extended, the YTD computation has to be adjusted.

Advantages:
1. The dataset can have additional dimensions or observations. The
records can be properly grouped in the report. They do not have to be
grouped in advance.
2. The extra mini dimension allows splitting weekly and accumulated YTD
metrics easily in the report heading. As I use Excel pivot tables a
lot, this is quite handy.
3. You can slightly change the method if you prefer to add YTD as a
metric instead of a mini dimension. In the LAST.ISOWEEK part just
assign the value to this YTD metric, without the extra output; see the
sketch below. Ready.
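
A sketch of that variant; the metric names N_YTD2 and PREMIE_YTD are
mine:

data OVK_SUM;
   set OVK_SUM;
   by ISOJAAR CODMRK KLANT_SOORT RICHTING WR_CAT ISOWEEK;
   if first.WR_CAT then do; N_YTD=0; PREM_YTD=0; end;
   N_YTD+N; PREM_YTD+PREMIE;   /* retained accumulators, as before */
   /* the YTD metrics stay missing except on the last record of each
      week, so summing them in a pivot table does not double count */
   if LAST.ISOWEEK then do; N_YTD2=N_YTD; PREMIE_YTD=PREM_YTD; end;
   drop PREM_YTD N_YTD;
run;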

So here it is, YTD. Mail me if any questions occur.

With kind regards
Maxim Ivashkov

Wednesday, July 23, 2008

Treatment for overlapping timestamps in dimensions with history

/********************************************************
In datasets with historical records (update type 2)
the validity of an entity has begin and end timestamps.
If the timestamps for the same entity overlap with each other, records
will be duplicated when you join the dimension with a fact table.
This program marks all overlapping timestamps.
Replace my names with yours.
Example of overlapping timestamps:
NUMKLT DATING    DATEND
234    01JAN2006 31DEC2008
234    12DEC2006 .
********************************************************/
data srclib.wgvbnd;
   set srclib.wgvbnd;   * dataset where the historical records are saved;
   * DATEND - end date stamp;
   * DATING - begin date stamp;
   * NUMKLT - the entity: client number;

   by numklt dating;
   retain prev_end;     /* latest end date seen so far for this client */
   if first.numklt then do;
      prev_end=coalesce(datend,'31DEC9999'd);   /* treat a missing end date as still valid */
      N=0;
   end;
   else do;
      if prev_end>dating then weird=1; else weird=.;
      /* carry the latest end date forward so overlaps with any earlier
         record of the same client are caught, not only with the first */
      prev_end=max(prev_end,coalesce(datend,'31DEC9999'd));
   end;
   N+1;
   drop prev_end;
run;
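
Afterwards the flagged records can be listed for inspection:

proc print data=srclib.wgvbnd;
   where weird=1;   /* only records whose validity period overlaps an earlier one */
run;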