The Model Migration Bottleneck: Is PMML the Answer?
Enterprises seeking to increase the volume and scale of deployed analytics frequently run into what we call the model migration bottleneck -- the point where analytical models developed offline in a statistical package must be deployed into a scalable IT-managed production environment. Customers and software vendors have tried numerous approaches to managing this constraint, but for many firms the solution is to physically recode and test the model -- a process that can take weeks or even months. When analytics are a key competitive tool and speed is of the essence, that is not an acceptable approach.
Clients sometimes ask if PMML is a solution. PMML, or Predictive Model Markup Language, is a standard published by the Data Mining Group, an independent consortium of leading analytical software vendors, service providers, analytics consumers and thought leaders. An XML-based standard, PMML has evolved progressively since it was first released in 1997, and is currently in Version 4.0.
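For readers who have not seen one, a PMML file is plain XML. The sketch below reads a hand-written, hypothetical linear regression document with Python's standard library. The field names, coefficients and the `score` helper are invented for illustration, and the code handles only this one simple case; it is not a full PMML consumer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML 4.0 document, for illustration only.
PMML_DOC = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header description="toy linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# PMML elements live in a versioned XML namespace.
NS = {"p": "http://www.dmg.org/PMML-4_0"}


def score(pmml_text, record):
    """Score one record against the first RegressionTable in the document."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    result = float(table.get("intercept"))
    # Add coefficient * input value for every numeric predictor.
    for pred in table.findall("p:NumericPredictor", NS):
        result += float(pred.get("coefficient")) * record[pred.get("name")]
    return result


prediction = score(PMML_DOC, {"age": 40, "income": 50000})
print(prediction)
```

The appeal of the standard is visible even in a toy like this: the scoring side needs no knowledge of the tool that produced the model, only of the XML vocabulary.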
In theory, PMML offers a vendor-neutral standard for analytic models enabling firms to integrate diverse analytics development and deployment platforms. In practice, PMML falls far short of that vision, for several reasons:
-- Overall vendor support for the standard is spotty. MicroStrategy strongly supports the standard, but at present MicroStrategy is the only major vendor that supports PMML 4.0, a standard published almost two years ago in July 2009.
-- Individual vendors tend to provide piecemeal support. SAS, for example, supports PMML export from SAS Enterprise Miner but not from the more widely used SAS/Stat; SAS does not support PMML import on any of its products.
-- PMML accurately captures an individual model, but not the modeling process. Modelers rarely work on data "as-is". Data transforms must be captured and reverse-engineered in the scoring environment, and this still requires manual recoding. Since any manual recoding requires testing, tuning, migration and version control, many firms have concluded that they might as well recode the entire scoring job and not mess with PMML exports.
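The transform problem can be sketched in a few lines. Assume, hypothetically, that the modeler fit a regression on log-transformed income; the coefficients and function names below are invented for illustration. If only the coefficients reach production, the scoring job has to re-implement the same transform by hand, and a missed step yields a score that runs without error but is wrong.

```python
import math

# Hypothetical coefficients, fitted offline on log(income) rather than raw income.
INTERCEPT = 2.0
COEF_LOG_INCOME = 1.5


def score_with_transform(income):
    """Scoring job that re-applies the modeler's log transform."""
    return INTERCEPT + COEF_LOG_INCOME * math.log(income)


def score_without_transform(income):
    """Scoring job recoded from the coefficients alone; the transform was lost."""
    return INTERCEPT + COEF_LOG_INCOME * income


income = 50000.0
correct = score_with_transform(income)
wrong = score_without_transform(income)
print(f"with transform: {correct:.2f}, without: {wrong:.2f}")
```

Both functions execute cleanly, which is why every recoded transform needs its own testing against the modeler's original output.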
In short, while PMML support often appears on RFP functional checklists, actual usage is limited and likely to remain so absent improved vendor support and extension of the PMML framework to support the end-to-end data mining process.
In my opinion, the problem that PMML solves -- publishing analytical models back to the data warehouse -- only exists because modelers bring the data to the analytics and not the other way around. If you build the models in the database instead of exporting them through PMML, there's no model migration bottleneck.



Comments
Predicting with PMML
Hi Thomas,
I believe your posting touches on a lot of discussion points, all valid. I just wanted to add my two cents. BTW, I am the VP of Analytics for Zementis, the maker of ADAPA, a scoring engine based on PMML. I also co-authored the book "PMML in Action" available on Amazon.
In terms of PMML support, it is not a question of which companies support which version of the standard. PMML 4.0 simply adds elements to PMML 3.2 (the previous version of the language). These include the ability to represent "Time Series", for example. So, just because PMML 3.2 does not support "Time Series", it does not mean it is old and unusable. PMML support is what counts. And, for that matter, all the top analytic companies are behind PMML. Check the list of supporters here.
In terms of pre-processing, there is no doubt that a predictive solution is more than the predictive model itself. For this reason, PMML has the power to represent a vast array of pre-processing operations. I was actually invited to write an article for the IBM developerWorks website on that. The article is entitled "Representing predictive solutions in PMML: Move from raw data to predictions".
I agree with you that SAS should probably support PMML on its SAS Base product. That's probably something SAS users should ask for. On the other hand, IBM SPSS offers PMML 4.0 support in Modeler and Statistics. Most definitely, the way to go!
Finally, the idea of representing models in PMML does not preclude you from executing your predictive solutions in-database. Zementis has just released a PMML plugin for in-database scoring. You can read more about it on our website, zementis.com.
I would also like to invite you and all readers to join our ongoing PMML discussion on LinkedIn (the PMML group has more than a thousand members).
Kind regards,
Alex Guazzelli
Model Migration & PMML
Hi Thomas,
This is an interesting post, and there's no argument that many databases these days allow analytics to be brought to the data. Netezza has done an excellent job enabling a variety of in-database analytics, and that's often the best solution for scoring large volumes of records. But it appears your answer to the Migration Bottleneck is not to migrate models but to develop them only in the database.
This assumes that the database supports the statistical packages that statisticians prefer, and that those statisticians are comfortable working in the database environment. A recent survey showed that data miners typically use 4 to 5 different data mining tools and, while there's always the potential that a particular database will support a particular tool, it's fair to say that most data miners will likely be using standalone tools. Migration will still be an issue for many data miners.
I've been involved with PMML for years and I think some of the information in your post is a bit dated:
PMML can always be improved; in fact, the group is working to release PMML 4.1 later this summer. But as Alex points out, since many database vendors support PMML as well, I'd argue that it should remain a viable solution to the Migration Bottleneck.
Best regards,
Rick Pecher