Blogs

Solution Architect, Advanced Analytics

Netezza and SAS: Integration Best Practices

August 29, 2011

Most Netezza customers use SAS, so it’s natural that customers ask about the best way to integrate SAS and Netezza.  This article outlines key best practices to consider when implementing a solution with SAS and Netezza.

Your best path to integration with Netezza depends on what SAS software you have deployed. Most Netezza customers use either a combination of Base SAS and SAS/Stat, SAS Enterprise Miner, or both.  For this article, we’ll limit the scope to those two options – there are special considerations for SAS Solutions, which we’ll address separately.

If you use SAS Enterprise Miner or SAS Model Manager, implement the SAS Scoring Accelerator for Netezza.  SAS Scoring Accelerator (a software product developed in partnership with Netezza but licensed and supported solely by SAS) enables users to define a scoring process within a SAS Enterprise Miner project and publish it directly to the database environment.  This eliminates the manual coding needed to create a scoring process that runs on Netezza.

To understand why the SAS Scoring Accelerator for Netezza is such a valuable tool, consider the alternatives.  For example, you can run your scoring jobs in SAS through an extract/score/reload operation; to do this, you extract all of the required data from the database, transport it to the SAS server, run your SAS scoring jobs, transport the scores back to the database environment and reload them in the database.  This is a time consuming and expensive operation that becomes increasingly cumbersome as the volume of data and number of production models expands. 

Moreover, the cycle time needed for an extract/score/reload operation makes it impossible to run scores as often as business requires.  This approach is workable if you have a few models to be scored monthly or quarterly; but business increasingly demands model scores with minimal or no latency, which means running scores on demand or on an inter-day cycle.  This is simply not possible using the extract/score/reload approach.

To cite a single example, a client in the financial services industry runs a process each month to extract thousands of variables for millions of customers, score each customer in SAS on hundreds of predictive models, and upload the model scores.  This process takes almost two weeks to run end-to-end, consumes considerable amounts of human and computing resources, and is a major source of pain for this organization.

As an alternative to an extract/score/reload operation, you can write a program from scratch in C, C++ or another supported language and run the scoring job on Netezza.   Deployed in Netezza’s massively parallel architecture, a custom scoring job like this will run in a fraction of the time it would take to run in a single-threaded out-of-database operation.

But custom-coding a scoring model from scratch takes time to design, build, test, validate and tune; customers report cycle times of three to six months to deploy a scoring model into production.  Manual coding also introduces a source of error into the process, so that scoring jobs must be exhaustively validated to ensure they produce the same results as the original predictive model.   

SAS Scoring Accelerator slashes this cycle time because it eliminates the need for manual coding.  Thus, it provides the benefit of a scalable and high performance scoring function, without the costs and delay.

Many Netezza customers use a combination of Base SAS and SAS Stat for model building, but do not use SAS Enterprise Miner and have no plans to do so.  For these customers, the integration path is different and more complex.  To best understand SAS/Netezza integration in this case, there are three important things to know:

  • Most SAS DATA step processing can be done more efficiently in Netezza
  • Unless explicitly stated otherwise, SAS PROCs must run on the SAS host
  • In its current release, SAS/Stat cannot export models

SAS DATA step processing consists of programming logic to read data into SAS data sets and manipulate the data in various ways.  SAS users rely on DATA steps to cleanse, transform and prepare the data for modeling.

Frequently, we find that SAS users who are beginning to work with Netezza simply extract all data needed for modeling, transport the data to the SAS server, then use DATA step processing to build an analysis dataset.  But this kind of processing can be done much faster by replacing DATA step processing with pass-through SQL statements that execute in Netezza. 

Users report that data processing tasks that take hours when run on the SAS server take seconds when run on Netezza.  Leveraging pass-through SQL is entirely consistent with SAS’ recommended programming practices.   It can be done from the SAS Windowing environment and with minimal changes to existing programs.

While it is possible to use pass-through SQL with the SAS/Access Interface to ODBC, Netezza recommends that users deploy the SAS/Access Interface to Netezza.  This interface supports pass-through SQL as well as the ability to pass joins, functions and formats to Netezza, and supports Netezza bulk load and unload.  SAS/Access Interface to Netezza was developed by SAS in partnership with Netezza, and is licensed and supported by SAS.

Customers who experience dramatic performance improvements when they shift data preparation steps into Netezza naturally wonder whether they can do the same with SAS PROCs.  For SAS releases prior to SAS 9.3, all SAS PROCs must run on the SAS host, which means that for best performance the SAS datasets to be used as input for the SAS PROCs should also reside on the host. While it is more efficient to perform data prep on Netezza, users should transfer the cleansed and reduced analysis data set to the SAS server prior to executing SAS PROCs.

SAS users can avoid data movement altogether by replacing the SAS PROC with a Netezza Analytics in-database function.  Detailed discussion of this option is outside the scope of this article, but SAS users should note that many SAS statistical functions are replicated by Netezza functions.

As of Release 9.3, SAS supports seven PROCs in Netezza: FREQ, MEANS, RANK, REPORT, SORT, SUMMARY and TABULATE.   These PROCs are commonly used for basic management reports (such as daily response reports, sales reports, staffing reports, survey crosstabs, and so forth).  Since SAS datasets can be held in Netezza, SAS user groups can use these in-database PROCs to set up a high performance reporting environment that runs entirely in-database on Netezza.

As noted above in the section on SAS Scoring Accelerator, when a predictive model is completed it must be converted into a scoring model to produce business value.  For customers using a combination of Base SAS and SAS/Stat, there are two choices available: score the data in SAS or build a custom scoring model that can run on Netezza.  The pros and cons of these two options are the same as described above.  However, since SAS/Stat does not currently export models to any language other than SAS, there is no software solution to enable seamless integration with Netezza.

Certain vendors currently claim to be able to import SAS/Stat models through PMML.  The reader should note that this capability currently depends on a third-party freeware utility to convert the SAS model to PMML.  It is limited to linear and logistic regression models and is not supported by SAS. 

Leveraging Netezza with SAS may require some change in programming habits.  For organizations with a large number of SAS users, this implies a need for training and change management.  To facilitate this, Netezza partner Destiny Corporation (http://www.destinycorp.com/) offers a two-day training course that provides SAS users with the essential skills needed to fully leverage the Netezza massively parallel environment.

To summarize, for best results when integrating SAS and Netezza, you should take the following steps:

  • If your organization uses SAS Enterprise Miner, license and implement the SAS Scoring Accelerator for Netezza.
  • If your organization uses a combination of SAS programming and SAS/Stat, do the following:
    • License and implement the SAS/Access Interface to Netezza
    • Upgrade to SAS 9.3 as soon as possible
    • Provide training to users that will enable them to fully leverage Netezza:
      • Replace DATA step processing with pass-through SQL
      • Consider replacing SAS PROCs with equivalent Netezza functions
      • Implement basic SAS 9.3 query and reporting capabilities in Netezza

Finally, if your organization is committed to the SAS platform, currently runs a significant number of production scoring models (or expects to do so) and does not use SAS Enterprise Miner, you should consider doing so.  We understand that your organization may be reluctant to invest in additional software licensing, and experienced SAS users may not always see the value of SAS Enterprise Miner.  But we frequently hear from customers who are stuck in the trap of manually coding scoring jobs, or attempting to score models through an export/score/reload operation.  These customers incur significant and measurable costs in human and computing resources, plus opportunity costs in the form of high score latency and delayed deployments. 

As noted previously, it’s possible to make these costs disappear with SAS Scoring Accelerator for Netezza.  The first step on this path is to upgrade the modeling process from SAS/Stat to SAS Enterprise Miner.