SPDS Tips

Grey Staples

Camelback Systems

Scottsdale, Arizona

 

The Scalable Performance Data Server (SPDS) software of SAS allows computer work to be moved from the individual’s Client computer over to a networked Data Server. This paper discusses various new features of SPDS which can have a dramatic impact on performance. As well, there are subtle changes needed to be made to existing SAS programs, oriented to reading data from magnetic tape or flat files, to take advantage of the SPDS capabilities.

 

BACKGROUND

This SPDS feature of SAS allows the processing load to be split between a Client computer, your workstation, and a Data Server computer holding the data. Queries are composed on the Client using DATA, PROC and SQL statements and then transmitted over the network to where the Server resides. If allowed, SPDS will try to do as much work as it can on the Data Server before sending results back to the remote Client.

SPDS affords the ability to break requests into "threads" and run each concurrently on separate processors which greatly improves performance. Check with your system administrator for how many processors you should use.

The complete reference manual for SPDS is supplied with the license for SPDS.

The generic example mentioned here is of a SPDS warehouse (LIBNAME) with a dataset named XXX, holding a few years worth of history data. Ongoing data is appended monthly with the computer date held in the LOAD_DTE variable. This allows a subset of all the data to be inspected rather than many year’s worth. SPDS Version 3 has an Aging feature allowing data to be similarly grouped (not discussed here).

 

STARTING SPDS

You would issue the command "SPDS programname" at the Unix prompt, rather than "SAS programname" This would invoke the SPDS configuration and enable SPDS commands to be used. You can run SAS programs against SAS datasets from SPDS, but you cannot reference SPDS dataset formats from SAS.

You can also issue just the command SPDS at the prompt and you will get the interactive form of SPDS.

LIBNAME

This statement is needed to connect to SPDS. With it you provide your own Library Reference name, identify to SAS that you are invoking SPDS, and connect to a server with your identifier and password. The statement is as follows:

LIBNAME libname SASSPDS "domain" SERVER=hostname.service USER=’UUUU’ PROMPT=YES;

Where:

Libname is your library reference name
SASSPDS is required since it invokes the SPDS engine software
domain is required since it is the domain name defined on the server
hostname.service is the name of your warehouse data
USER is required to give you access
PROMPT means that you will be prompted for your password. You could have supplied your password with PASSWORD=’PPPP’ as well.

 

SPDS MACROS

The SAS user tells SPDS what options to invoke with Macro Variables. This is similar to the "Options …;" statement of SAS. To get started there are just three you need to add to your program:

You must also add a SAS option to your program: OPTIONS COMPRESS=YES; This option affects performance and data storage.

There is a more detailed discussion of Macros in Appendix A.

 

USE WHERE, NOT IF

Avoid using an IF to subset data. The IF will cause a "full table scan" of all the years worth of history to be processed on the Client, transferring all the data through the network to your workstation for evaluation. That transfer could involve many g igabytes of data passing over the network. The WHERE causes SPDS to perform as much processing as possible on the Data Server, not on your computer. Also SPDS will get the benefit from indexes. (See Appendix C) This means that a major portion of the processing for a query would be done on the Data Server by just using the indexes before selected sections of the data are returned. Performance improvements are impressive. Queries that took hours now could take minutes to process.

 

 

SERVER-SIDE SORTING

SPDS can perform a sort on the Data Server rather than transferring the data to the Client for sorting. You would reword your sort to read:

PROC SORT SASSPDS DATA=libname.xxx OUT=yyy;
BY --;

The addition of the "SASSPDS" is all you need to shift the work to the faster Data Server.

 

 

IMPLICIT SORTS

Usually when you use a BY clause in a "PROC …;" statement you need to first issue a "PROC SORT … BY …;" to get the data into the right sorted order. With SPDS you can eliminate the extra SORT step since the software will detect the BY and perform an automatic SORT on the Data Server. You also greatly reduce the amount of data transferred back and forth between computers over the network.

 

 

KEEP

In SPDS version 3 the KEEP will cause the Data Server to transmit only the requested variables to your client workstation.

When in version 3 always use the KEEP data set option to retain just the variables to actually need for your query. This avoids transferring many more variables, i.e. all of them, to your workstation. You would use the following syntax: (note the second KEEP option on the SET clause)

DATA new.newdata1 (KEEP=var1 var2 ...);
SET libname.xxx (KEEP=var1 var2 ...);

 

 

PROC SQL

The Structured Query Language (SQL) is quite useful for treating a SAS/SPDS dataset as a relational table and performing a variety of actions. See Appendix B for a discussion of the Pass Through feature of SQL. See also Chapter 6 in the SPDS documentation for more information.

As an example we will just deal with creating a SAS temporary, work, dataset from an SPDS dataset.

In SAS or SPDS you could code:

DATA NEWDATA1;
SET libname.XXX;
WHERE ... conditional expressions ...;
RUN;

In SQL you would code:

PROC SQL;
CREATE TABLE NEWDATA1 AS
SELECT var1, var1, ...
FROM libname.XXX
WHERE ... conditional expressions ...;

Note that the RUN; is not needed.

 

The following discussion is another example of Proc SQL. See the manual "SAS Guide to the SQL Procedure" for a more detailed treatment. This code was taken from there.

PROC SQL;
SELECT EMPCITY, SUM(EMPYEARS) AS TOTYEARS
FROM LIBNAME.EMPLOYEE
WHERE EMPTITLE=’salesrep’
GROUP BY EMPCITY
ORDER BY TOTYEARS;

Here we are selecting a variable (EMPCITY), performing a summation operation on another variable (EMPYEARS), creating an alias (TOTYEARS) for the result, naming the SAS or SPDS dataset to be read, filtering on the values of the variable (EMPTITLE), grouping the results by EMPCITY, and sorting the results by TOTYEARS. The result is automatically sent to the LST dataset.

 

 

PARALLEL GROUP BY

SPDS takes advantage of the multiple processors on the Data Server to enhance the performance of queries. You would code the statement as:

PROC SQL;
SELECT variables ... FROM libname.XXX;
WHERE ... expression ...;
GROUP BY variables ...;

The GROUP BY causes SPDS to perform an automatic sort on the Data Server.

 

 

Appendix A

SPDS MACROS Various processing and performance options are used to enhance the use of SPDS. To see what options are in effect at any time you would issue the following:

LIBNAME= ... (see above)
Proc spdo lib=libname;
Spdsmac;
Run;

The results appear in the LOG dataset as follows in this abbreviated listing:

options date compress=yes;

libname LIBREF sasspds "DOMAIN" server=HOST.SERVNAME user='UUUUUUUU'
password=XXXXXXXX
NOTE: User uuuuuuuu(ACL Group GGGGGG) connected to SPDS(SUN) 2.10 server at
Nnn.nnn.nnn.nnn.
WARNING: Read-only access to LIBNAME domain restricted by *LIBNAM* ACL.
NOTE: Libref LIBREF was successfully assigned as follows:
Engine: SASSPDS
Physical Name: ---------------------

proc spdo lib=LIBREF;
spdsmac;
SPDS Macro Variables list/settings.

BY Processing options -

SPDSBSRT=YES Controls Sorting for BY processing on the Data Server when it encounters no index (base SAS requires a PROC SORT whereas SPDS does not).

WHERE clause options

SPDSNIDX=NO Controls whether indexes are used during WHERE clause planning or BY processing. This is a double negative since the NIDX means NO do not use indexing.
SPDSWBTR=NO Controls whether global index is used instead of segmented indexes in WHERE clause evaluation. (Advanced)
SPDSWDEB=NO Controls whether WHERE clause planning tree is printed. (Recommended)

 

Server performance options -

SPDSCOMP=YES Controls Network Data Compression.
SPDSNETP=4096 Network Packet Size. (Change to 100000)
SPDSSYRD=NO Controls whether reads are done synchronously. (Advanced)
run;

Note that SPDSTCNT, for threads used in WHERE clause processing, is not defaulted so it must be specified as %LET SPDSTCNT=8;.

These settings are the defaults, similar to what you see if you issue a "PROC OPTIONS;" in Base SAS. Some of the defaults need to be overridden for performance reasons. Since these are SAS Macro Variables you would use the following macro statement (Note that you can use lower case, but the YES|NO options must be in upper case):

%LET SPDSNETP=100000; This would change the Network Packet Size.

%LET SPDSTCNT=8; This is the count of the number of threads (processors) which can run simultaneously to process a WHERE clause. This is a major performance feature of SPDS.

%LET SPDSWDEB=YES; This prints a report, in the LOG, showing how SPDS has chosen to process your WHERE clause. The report is most helpful to support people in addressing performance issues. The last line of the report shows " whinit returns:" "ALL" indicating that the Data Server can evaluate all of the WHERE clause, and "EVAL1" to indicate that SPDS will use indexes on the Data Server and get the best performance, or "EVAL2" to indicate that a "brute force" full scan of the table will be performed on the Data Server.

 

 

APPENDIX B

SPDS PASS THROUGH

In addition to the "LIBNAME libname SASSPDS 'domain' …" way of connecting to the SPDS Data Server, discussed above, you have an additional feature of SPDS called "Pass Through." Here you establish a connection to an SQL Server using the CONNECT statement and reference it with the CONNECTION TO clause. This further allows SQL statements to be executed on the remote server.

The Pass Through syntax would be:

PROC SQL;
CONNECT TO SASSPDS (DBQ=’domain’ HOST=’host’ SERV=’service’
USER=’XXXXX’ PASSWORD=’YYYY’);
SELECT * FROM CONNECTION TO SASSPDS
(SELECT COUNT(DISTINCT var1) FROM XXX
WHERE LOAD_DTE = ‘01JAN01’D);
DISCONNECT FROM SASSPDS;
QUIT;

All the components of the LIBNAME statement are there, just now expressed as a parameter. With this you are establishing a connection to SPDS, identifying yourself and what you want access to. The clause "SELECT * FROM CONNECTION TO SASSPDS" is necessary to use the Pass Through connection just established. The SQL statement within the parenthesis is regular SQL with a recommended WHERE clause to use indexing on the SQL Server. In this example we have a variable LOAD_DTE which is the computer date when a batch of data was loaded. The DISCONNECT FROM SASSPDS;QUIT; are needed to terminate the Pass Through connection. Note that there is no RUN; statement used with SQL. If you do include one, you will just get a warning.

Please consult the SPDS manual, chapter 3, for a more complete discussion.

 

APPENDIX C

INDEXES

SAS and SPDS offer the ability for indexes to be maintained for any variable, or group of variables. An index is a small control dataset held on the Data Server which greatly improves the performance of queries and processing against the data.

To see what indexes have been defined for an SPDS dataset use a "PROC CONTENTS … (VERBOSE=YES);" to get a detailed listing of index characteristics.

Comments to Webmaste


on September 10, 2007