Automated Classification of SAS Codes

ubaid darwaish

Ubaid Darwaish

Sometimes people stare at my computer screen and ask me about the active window which keeps me absorbed, they ask,” What is that ?’ and I curtly reply , “SAS”. Its when they ask me, “What does it do ?”  I stop typing, take a deep breath and look at their face, fumbling for words.

Its not because I don’t have anything to say, Its because I have too much to choose from, I just try to figure out: from where to start. The truth is that SAS is such a versatile and powerful language that it can pretty much do everything for me. It is not limited to analytics I will argue, analytics is just one face of it.

Apparently  three main streams that SAS is being used for in the industry are  Analytics/ETL and MI. Now that SAS has got different set of tools for each one of those streams like DI Studio for ETL, WRS/VA for MI, SAS E-miner for analytics: but they all run on the same foundation engine. They all have one thing in common, SAS code.

SAS programmers generally end up writing programs for ETL/MI and Analytics in one place: mostly Base SAS or SAS EG and not taking heed of using the right tool for right job. Reasons can be many: lack of skills in handling the new tool, comfortable enough with SAS programming, love for antiques.  Now its all good as long as you know, which program does what; but as the number of programs increase, it becomes extremely difficult to manage and to keep track.

There might arise a situation wherein we want to dedicate the servers for each one of those streams(ETL/MI/Analytics) if only we knew, what program does what and  how to classify them as ETL, MI or Analytical programs.Now either you need to hunt down all the developers that wrote those codes and ask them what they do or find someone who is generous enough to waste his time to actually read all of those programs and classify them. But if you are smart enough, you’ll use the macro below.

Before you dive into the code, let me summarize how it works:-                                                    It will scan through entire directory for all sas programs, read them one by one, and compile the list of procedures used in each of those programs, and based on the procedures used: it classifies them as either ETL/MI/ Analytics. Simple !!!

Ubaid Darwaish

/*Directory where SAS codes reside*/
%LET DIR=/ubaid/BAF/Config/Lev1/SASApp/SASEnvironment/SASCode/Jobs;
/*Directory where to export report*/
%LET EXPO_PATH=/home/ubaid;


/*Code analyzing macro starts*/

%MACRO ANALYZE();


/*Fileref*/

FILENAME IN_CODE PIPE "ls &DIR.";

/*List of all files*/

DATA LIST;
    LENGTH CODE_NAME $100.;
    INFILE IN_CODE TRUNCOVER;
    INPUT CODE_NAME $100.;
RUN;

/*List of sas files from all files*/

DATA CODE_LIST;
    SET LIST;
    IF FIND(CODE_NAME,".sas")>0;
    CALL SYMPUT('NO_OF_FILES',_n_);
RUN;

/*Do for n number of files*/

%DO i=1 %TO &NO_OF_FILES ;

/* Get filenames one at a time*/

DATA _NULL_;
    SET CODE_LIST;
    IF _N_=&I;
    CALL SYMPUT('CODE_NAME',CODE_NAME);
RUN;

/*Read files, one at a time*/

DATA U;
    LENGTH    F1   $ 336        ;
    FORMAT    F1   $CHAR336.    ;
    INFORMAT  F1   $CHAR336.    ;
    INFILE "&DIR./&CODE_NAME."  MISSOVER   DSD ;
    INPUT     F1 : $CHAR336.    ;
RUN;

/*Separate what is required from what is not--Look for "PROC" and not "PROCESS" 
and set Mining Flag*/

DATA V;
SET U;
        IF (FIND(UPCASE(F1),'PROC')>0 AND FIND(UPCASE(F1),'PROCESS')=0)  THEN MINE='Y';
         CODE_NAME=LOWCASE("&CODE_NAME");
RUN;

/*Find the Cordinates(x,y) to extract the procedure names--They will be between 
a "PROC " (proc with space)and next " "(space)*/

DATA W;
    SET V;
    IF MINE='Y';
    LOCX=FIND(UPCASE(F1),"PROC")+5;
    PR_OCX=SUBSTR(F1,LOCX);
    LOCY=FIND(PR_OCX," ");
    PR_OC=COMPRESS(UPCASE(TRANWRD(SUBSTR(PR_OCX,1,LOCY),';',' ')));
    DIR="&DIR";
RUN;

/*First entry in X*/

                %If &i=1 %THEN %DO;
                                    DATA X;
                                    RETAIN DIR CODE_NAME PR_OC;
                                    SET W;
                                    RUN;
                %END;

/*If X exists append*/

                %ELSE %DO;
                                    PROC APPEND BASE=X DATA=W;
                                    RUN;
                %END;


%END;


/*Getting rid of redundant data*/

    PROC SORT DATA=X 
    OUT=Y(KEEP=DIR CODE_NAME PR_OC) NODUPKEY;
    BY CODE_NAME PR_OC;
    RUN;


/*Filename extension*/

DATA _NULL_;
CALL SYMPUT('DT',PUT(DATETIME(),datetime16.));
RUN;


/*Exporting the Procedure Report*/

    PROC EXPORT DATA=Y OUTFILE="&EXPO_PATH/SAS_Procedure_Report_&DT..csv" DBMS=CSV
    REPLACE;
    RUN;

/*Flag Set based on Standard SAS PROCS*/

DATA Z;
SET X;
                IF PR_OC IN 
        (
        'CORR',
        'UNIVARIATE',
        'ANNOVA',
        'GLM',
        'REG',
        'TTEST',
        'SURVEYSELECT',
        'NPAR1WAY'
        )        then ANALYTICS=1;

                                IF PR_OC IN 
        (
        'TABULATE',
        'REPORT',
        'GPLOT',
        'PLOT',
        'PRINT',
        'TRANSPOSE',
        'SUMMARY',
        'MEANS',
        'FREQ',
        'MEANS',
        'FORMS',
        'CHART'
        )                         then MI=1;

                                            IF PR_OC IN 
        (
        'APPEND',
        'SQL',
        'DATASETS'
        )                                     then ETL=1;
RUN;


/*Count of Occurence*/ 

PROC SQL;
    CREATE TABLE ZA AS SELECT SUM(ETL) AS SUM_ET,SUM(MI) AS SUM_MI,SUM(ANALYTICS) AS SUM_AN,CODE_NAME 
    FROM Z GROUP BY CODE_NAME;
QUIT;

/*Probability of being True based on Occurrence*/

Data ZB(KEEP=CODE_NAME ANALYTICS ETL MI);
SET ZA;

    if (SUM_AN gt 0) THEN ANALYTICS='Y';
    ELSE ANALYTICS='N';

        if (SUM_MI gt SUM_ET) AND (SUM_AN le 0) THEN MI='Y';
        ELSE MI='N';

    if (SUM_ET gt SUM_MI) AND (SUM_AN le 0) THEN ETL='Y';
    ELSE ETL='N';

RUN;

/*Exporting the Classification report*/

PROC EXPORT DATA=ZB OUTFILE="&EXPO_PATH/SAS_Code_Report_Class_&DT..csv" DBMS=CSV
REPLACE;
RUN;

%MEND;

Now this is a very basic one, and obviously can be refined to be more exact and accurate. Wishing you all the luck for same.

Contributed by: Ubaid Darwaish