SAS Performance Tuning – Optimizing Your SAS Code


SAS performance tuning is an often-ignored concept and not much of a focus for most programmers out there, unless we run into SAS performance issues. Generally, when we are done writing a SAS program, we sit back and watch it work, looking out for any errors: and if we find none, it's all cool and nice…that sums up the story for most of us. But then there are a few who just can't stop. They want to take it one step further: they want to make it lean, they want to make it fast, they want to make it better…they just want to tune it. This tutorial lists some of the basic SAS code optimization techniques that can make a difference when it matters.


Although SAS code optimization and performance should be kept in mind from the beginning, they are often addressed only when we come across performance issues, such as a SAS program taking too long to run. Some come back to it later, some never at all. Tuning your SAS code for performance comes with experience, like any other skill, but there are some basic rules that cannot be ignored.

Here are my ten golden techniques for optimizing SAS code and improving its performance. These are the best practices one should keep in mind for writing efficient SAS code.

Top 10 SAS performance tuning tips:

IF OR WHERE:
When we need to filter data based upon some conditions, we generally use 'if' or 'where' statements. The two statements work quite differently even though they do the same job and yield the same result. The difference between their working mechanisms becomes critical when you are tuning your SAS code for performance:

DATA X; 
SET A; 
If Amount > 10; 
RUN;

In case of the 'if' statement, the data is read into the PDV (Program Data Vector) before the condition is verified. Thus all records are read into the program data vector irrespective of their values and the criteria. On the contrary, the 'where' statement checks the criteria even before the data is read into the PDV. Hence, the unwanted records are never read into the buffer space at all. The 'where' statement is therefore the better option for subsetting data, especially for datasets with a large number of variables.

DATA X; 
SET A; 
Where Amount > 10; 
RUN;
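
The same filter can also be pushed into a WHERE= dataset option on the SET statement, which behaves like the WHERE statement and keeps rejected records out of the PDV; a minimal sketch:

DATA X; 
SET A (WHERE=(Amount > 10)); 
RUN;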

APPEND NOT SET:

Many times we need to create a dataset that is a combination of two datasets, and we often end up writing something like this:
DATA Z; 
SET X Y; 
RUN;
There is nothing wrong with the above step and it does give us the proper result, but what it lacks is efficiency. Since we use the SET statement, both X and Y will be read into the PDV before Z gets created.
The better option is to use APPEND instead. With APPEND, only X is read into the PDV and Y is left as it is. You might not see any win here when dealing with smaller datasets, but with big datasets you can cut the processing time by up to 50% if you make the bigger of the two datasets the base.

PROC APPEND BASE=Y DATA=X; 
RUN;
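
One caveat: PROC APPEND expects both datasets to have compatible variable definitions. If they differ (say, an extra variable or a different length), the step fails unless the FORCE option is added; a minimal sketch:

PROC APPEND BASE=Y DATA=X FORCE; 
RUN;

FORCE will drop or truncate whatever does not fit the base dataset's structure, so use it knowingly.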

DON’T READ EVERYTHING:

When we have multiple conditions that can be fulfilled and need to perform a different task for each one that is true, that is when the 'if' statement comes in handy. We can check multiple conditions and perform different tasks based on the requirement, e.g.:

DATA OUT; 
SET IN; 
If volume le 100 then CAMP_TYPE='X'; 
If volume ge 101 then CAMP_TYPE='Y';
RUN;

The important factor is the number of conditions checked on each iteration of the data step. In the example above, all the 'if' statements are checked on every iteration. So if we have 100 or 1000 conditions, every one of them will be checked for each iteration.
The code can be changed to do exactly the same thing, but with correct SAS performance tuning:

DATA OUT; 
SET IN; 
If volume le 100 then CAMP_TYPE='X'; 
Else If volume ge 101 then CAMP_TYPE='Y';
RUN;

The difference is that once a condition is fulfilled in a particular iteration, none of the conditions below that point are checked any more, and the next iteration starts. This reduces the number of condition checks per iteration. If we have 1000 conditions and condition number 10 holds true, SAS won't go through the remaining 990 conditions, which improves performance drastically.
Placing the conditions in descending order of their probability of being true improves the chances of an early hit and saves time, as in the sketch below.
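
A minimal sketch of that ordering, assuming a purely hypothetical distribution where most records are small-volume:

DATA OUT; 
SET IN; 
/* assumed frequencies: ~70% up to 100, ~25% between 101 and 500, ~5% above */
If volume le 100 then CAMP_TYPE='X';      /* most likely: checked first */
Else If volume le 500 then CAMP_TYPE='Y'; 
Else CAMP_TYPE='Z';                       /* least likely: checked last */
RUN;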

SORT WISELY: 

Sorting is not always as simple as it seems, especially if the data size is huge and many variables are involved. If one has to merge a dataset with two other datasets, we can write something like:

PROC SORT DATA = BRANCH; 
BY CODE; 
RUN; 
DATA X; 
MERGE BRANCH (in=a) SITE_CODE(in=b); 
BY CODE; 
RUN; 
PROC SORT DATA = BRANCH; 
BY CODE STATE; 
RUN; 
DATA Y; 
MERGE BRANCH (in=a) STATE_CODE(in=b); 
BY CODE STATE; 
RUN;

It works and gives the result, but it's not the best way of doing it. The better way is to minimize the sorting by grouping several levels of sorting into one, which will look like:

PROC SORT DATA = BRANCH; 
BY CODE STATE; 
RUN; 
DATA X; 
MERGE BRANCH (in=a) SITE_CODE(in=b); 
BY CODE; 
RUN; 
DATA Y; 
MERGE BRANCH (in=a) STATE_CODE(in=b); 
BY CODE STATE;  
RUN;

In the example above, instead of sorting by 'code' first and then by 'code' and 'state', the initial sort procedure sorts directly by 'code state', so the sort for the second analysis can be avoided, improving the performance of your SAS code significantly. The merge BY CODE still works on data sorted BY CODE STATE, because such data is, in particular, in CODE order.
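
As a side note, on recent SAS releases (9.3 and later) PROC SORT also accepts the PRESORTED option, which first verifies whether the data is already in order and skips the actual sort if it is; a minimal sketch:

PROC SORT DATA=BRANCH PRESORTED; 
BY CODE STATE; 
RUN;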

LET'S RETAIN IT:

Retain is a very powerful statement that can be used to assign values to variables. Often we need a consistent value for a particular variable across the dataset, e.g. we might be creating a dataset for the Mumbai location only, and can do that as:

DATA X_Out; 
SET X_In; 
LOCATION = 'Mumbai'; 
RUN;

The problem with the code snippet above is that it performs the assignment on every iteration of the data step; that means the LOCATION variable is assigned the value 'Mumbai' over and over again, and if the observation count is huge, this has an impact on performance.
The same can be achieved with RETAIN, where the assignment effectively happens only once, at the start of the data step, thus following a correct SAS best practice.

DATA X_Out; 
SET X_In; 
Retain LOCATION 'Mumbai'; 
RUN;
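
To see what RETAIN buys us, here is an equivalent sketch that makes the one-time assignment explicit: the value is assigned only on the first iteration (_N_ = 1) and RETAIN carries it forward across the remaining iterations:

DATA X_Out; 
SET X_In; 
RETAIN LOCATION; 
IF _N_ = 1 THEN LOCATION = 'Mumbai'; 
RUN;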

READ MORE NOT LESS:

Often we need to import data from two or more files and then put them together before continuing with the rest of the process. One way of doing it is:

filename A '/home/ubaid/c_dev';
filename B '/home/ubaid/c_prod';

Data X;
infile A lrecl=160 truncover;
input @01 Camp_Name $char80. @81 Camp_Type $char80. ;
Run;

Data Y;
infile B lrecl=160 truncover;
input @01 Camp_Name $char80. @81 Camp_Type $char80. ;
Run;

Data Z;
Set X Y;
Run;

What we are doing in the example above is reading two files, one at a time. However, we can read both files at once, so no separate concatenation step is needed, which makes it more efficient and saves us both lines and time (applicable to files with the same formatting/layout).

filename AB ('/home/ubaid/c_dev' '/home/ubaid/c_prod');

Data Z;
infile AB lrecl=160 truncover;
input @01 Camp_Name $char80. @81 Camp_Type $char80. ;
Run;
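
If you still need to know which physical file each record came from, the INFILE statement's FILENAME= option exposes the name of the file currently being read; a minimal sketch (SRC and SRCFILE are hypothetical names):

Data Z;
length SRC $ 200;
infile AB lrecl=160 truncover filename=SRCFILE;
input @01 Camp_Name $char80. @81 Camp_Type $char80. ;
SRC = SRCFILE; /* copy the automatic variable so it is kept in the output */
Run;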

READ WRITE SMARTLY: 

Suppose we have multiple datasets that need to be filtered on conditions involving different columns and flagged, so that when we finally put them all together we know which source each record came from. We might write something like this:

DATA A;
SET X (WHERE=(CAMP_TYPE='Market'));
SOURCE = 1;
RUN;

DATA B;
SET Y (WHERE=(ANALYST='Ubaid'));
SOURCE = 2;
RUN;

DATA C;
SET Z (WHERE=(PROCESSED='Y'));
SOURCE = 3;
RUN;

DATA FINAL;
SET A B C;
RUN;

Now if we want to do it smartly, we need to reduce the number of passes over the data: instead of creating four datasets we create only one, and instead of reading six datasets we read only three:

DATA FINAL;
SET X (WHERE=(CAMP_TYPE='Market') IN=A)
    Y (WHERE=(ANALYST='Ubaid') IN=B)
    Z (WHERE=(PROCESSED='Y')  IN=C);
SELECT;
WHEN (A) THEN SOURCE = 1;
WHEN (B) THEN SOURCE = 2;
WHEN (C) THEN SOURCE = 3;
OTHERWISE;
END;
RUN;

SORT OUT THE SORT:

Often while sorting large datasets we end up with an error stating that we have run out of work space.
This can happen in joins as well, because a join has an inherent sort within it. So if you are trying to join two large datasets that reside in the work library and don't have enough space, the join will fail because of the sort failure.

PROC SORT DATA=CUSTOMER OUT=CUSTOMER_SORTED;
BY CUST_KEY;
RUN;

One way to overcome this is to sort permanent datasets instead of temporary datasets.
SAS requires about 4 times the size of the dataset in the work library to sort a temporary dataset; it requires only about 2 times the size of the dataset to sort a permanent one. There will be a trade-off in terms of performance, but at least the job won't fail.

PROC SORT DATA=MART.CUSTOMER OUT=MART.CUSTOMER_SORTED;
BY CUST_KEY;
RUN;
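
Another way to shrink the sort's footprint is to sort only the variables you actually need, using the KEEP= dataset option so the sort works with smaller records; a minimal sketch (CUST_NAME is a hypothetical column):

PROC SORT DATA=MART.CUSTOMER (KEEP=CUST_KEY CUST_NAME) OUT=MART.CUSTOMER_SORTED;
BY CUST_KEY;
RUN;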

In case we are not creating a new dataset from the sort step and are having space issues, we can use the OVERWRITE option in PROC SORT to prevent failures due to space:

PROC SORT DATA=CUSTOMER;
BY CUST_KEY;
RUN;

Normally the original dataset is kept until the sorted data has been written to the output dataset. The OVERWRITE option, however, ensures that the original dataset is deleted before the output is written, so that you have enough breathing/working space.

PROC SORT DATA=CUSTOMER OVERWRITE;
BY CUST_KEY;
RUN;

And if nothing else works, use TAGSORT. It is a savior when there is not enough disk space, as it first sorts only the BY variables and not the entire table. The tags contain the sorted BY variables along with _N_ (the observation number) from the original dataset. The tags are then used to link the records back to the sorted keys, giving us the full table in order.

PROC SORT DATA=CUSTOMER OUT=CUST_SRT TAGSORT;
BY CUST_KEY;
RUN;

This will have an impact on performance but will complete without failing.

IT'S NOT ALWAYS WORK

If you have been given the task of fine-tuning a SAS program that stretches across thousands of lines, and on inspection you see there is a fundamental problem of data movement, especially a pull from SAS SPDS, a quick fix is to not move the data out of that environment during processing.
So instead of processing things in work, they should be processed in that library itself.

DATA XYZ;
SET HPDS.XYZ;
If Type='A';
RUN;

PROC SORT DATA=XYZ OUT=XYZ_OUT;
BY UD;
RUN;

So optimally it should have looked like:

DATA HPDS.XYZ;
SET HPDS.XYZ;
If Type='A';
RUN;

PROC SORT DATA=HPDS.XYZ OUT=HPDS.XYZ_OUT;
BY UD;
RUN;

Now to fix it, all I need to do is perform the processing in the HPDS library instead of work. But since the program uses one-level dataset names, a simple find-and-replace (Ctrl+H, find 'WORK.', replace 'HPDS.') won't work, because 'WORK.' never actually appears in the code.

Instead of going through 1000 lines and changing the dataset names from one level to two, we can change the default library for one-level dataset names. One-level datasets will then be created there instead of in work:

LIBNAME HPDS '/home/ubaid';
OPTIONS USER=HPDS;
  
DATA XYZ;
SET HPDS.XYZ;
If Type='A';
RUN;

PROC SORT DATA=XYZ OUT=XYZ_OUT;
BY UD;
RUN;

In the above case, both datasets, XYZ and XYZ_OUT, will be created in the library HPDS.
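
Once that stretch of processing is done, the default destination for one-level names can be pointed back at work; a minimal sketch:

OPTIONS USER=WORK;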

INDEX IS NOT ALWAYS PRETTY

Indexes are held in quite high regard when it comes to performance tuning; they play a significant role, especially in joins, where they come in quite handy. Creating indexes has been a time-tested best practice. But more important than creating an index is knowing when to create one.
In some situations we end up creating indexes that are of no use, and it's a mere waste of time to create and maintain them; such indexes cannot help the join during processing. The following SAS performance tuning example demonstrates exactly why:

PROC SQL;
    CREATE TABLE FINAL AS SELECT 
    A.CCK, A.CAMP_TYPE, B.CHANNEL
    FROM CONTACT_HISTORY A LEFT JOIN CHANNEL B
    ON A.CCK = B.CCK
    ORDER BY A.CAMP_TYPE;
QUIT;

In this situation the SQL performs a left join between the CONTACT_HISTORY and CHANNEL tables. Every row from CONTACT_HISTORY has to be picked because of the left join, and since it is every row, an index created on CONTACT_HISTORY will not be used by SAS and has no role to play in the join. An index on the CHANNEL table, however, will play a significant role, because only the matching rows need to be looked up there.
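
So for this join, the index worth creating is the one on CHANNEL's join key; a minimal sketch:

PROC SQL;
    CREATE INDEX CCK ON CHANNEL (CCK); /* simple index on the join key */
QUIT;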

Check out how you can create a cool SAS program to build the Wikipedia of SAS Programs: https://www.thedarwaish.com/index.php/2019/12/15/wikipedia-of-sas-programs/

Where to go from here? If you want to take your SAS performance tuning journey to the next level, do check out some more advanced but interesting tips at: https://www.phuse.eu/blog/sas%C2%AE-performance-tuning-techniques

Contributed by: Ubaid Darwaish